Hey you!
Prior to transcribing, please have another look at our style guide: https://wiki.c3subtitles.de/en:styleguide. If you have any questions, you can ask us directly or reach us at https://webirc.hackint.org/#irc://hackint.org/#subtitles or https://rocket.events.ccc.de/channel/subtitles.
Please don't forget to mark your progress in the progress bar on the talk's page.
Thank you very much for your commitment!
======================================================================
So hi, welcome to my talk on mining for bugs with graph database queries. It's been a while since I last spoke at CCC. I actually planned on submitting every year, but it turned out, last time I submitted, that it had actually been five years already. Most of the people in the room probably did not see that last talk, but to make it short, it was mostly about hunting bugs the hard way: staring at code for a very, very long time, trying to find bugs, and then exploiting those bugs. Since then, my interest has been in making this process just a little less painful, and what I'm presenting today is the result of that.

I want to give you the big picture first of what I actually want to achieve in the long run. What I'm trying to do here is combine my two main interests. On the one side, there are the kinds of things you see a lot at conferences like this one: bug hunting and exploiting bugs, things nicely represented by the books you see there at the top — The Shellcoder's Handbook, A Bug Hunter's Diary, The Art of Software Security Assessment — and in the back you see Phrack shining over all of this. Then it becomes a little more academic with Principles of Program Analysis. What the kinds of methods proposed in Principles of Program Analysis have in common is that they are very, very exact. My other interest is in pattern recognition and machine learning, and those kinds of methods are actually very inexact — they approach problems more from an engineering kind of perspective. The idea now is: can we take the tools we have down there, the pattern recognition tools, and build tools to recognize bugs — tools that look at code similarly to how a person auditing code would look at it — and then do an inexact analysis, but one that scales very well? Because the problem with these exact methods is that scaling them to a really large code base, like the Linux kernel, is really, really hard. In doing this, I want tools which are realistic — not those static analysis tools that give you thousands of hits and can't really be tuned — but instead something to assist auditors in their daily work.

Zooming in a little, what you're going to see today is that I'm looking at two very different topics and showing that they actually have something in common, that they fit together nicely. One of them is good old computer science — compiler construction — and the other is the shiny and new graph databases, or as some people like to say, big data. We're going to see that they fit together nicely; that's what I want to achieve in this talk.

To start off, let's take a look at an example bug. This is a bug found by Stefan Esser and presented in his SyScan 2013 talk. In that talk he presented a lot of different bugs, and this one is certainly not the most spectacular of them — it's more of a side finding, I would say — but it's interesting for us, as you'll see in a moment. One thing that's already interesting is that you see bugs like this all over the place; it's not a very unique bug, it's something you will see again and again. As you see, there's this 32-bit-wide variable called namelen, and it's produced by taking this pointer to data and converting it from network byte order. That kind of tells you this data is probably attacker-controlled, because it comes from the network — you would not be converting from network byte order if it didn't. What happens next is an allocation: a buffer called exit_signal is allocated, and they use this variable namelen, which has just been initialized by the attacker, and add one inside the allocation. Clearly, the problem is that if namelen is the maximum value of an unsigned integer, this will overflow and something close to zero bytes will be allocated. Finally, there's the copy operation: we copy namelen bytes — so about four gigabytes — into this buffer, which is close to zero bytes long. A typical heap-based buffer overflow.

Now, what's interesting is how he found this, and that's why I took this example. In academic research there are a lot of different methods proposed for how you could find this kind of stuff, and they all sound very advanced: you hear about white-box fuzzers, symbolic execution, machine-learning-powered anomaly detectors, maybe theorem proving or model checking. But in practice, he used a regular expression with grep. I found that pretty amusing: with all the advanced methods he could have used, he chose to do this. I think this tells us a lot. First of all, it tells us that, used right, even primitive tools like grep are very, very powerful — you just need to make sure the person auditing the code can actually bring their knowledge into the process. The second thing that becomes clear is that these kinds of queries — things like this grep regular expression — actually encode knowledge. Let me go back one slide. If you look at this regular expression, you see what it tries to match: it looks for an allocation, and inside it, some sort of arithmetic operation — plus, minus, something like that. In a way, you could say this is a model for different kinds of bugs: he's saying these kinds of things happen all over the place, so let's just search for them. And finally, it shows that false positive rates are not so much of a problem for bug hunting. If you have fifty hits and seven of them are bugs, that's fine — you read a bit of extra code, but who cares?

So that gave me the idea: maybe we can build something like a search engine for source code that can be used to find bugs, and that's what this talk is about. Overall, it's supposed to look like this: you take source code and stuff it into some robust parser for the language, then you put the result into a database, and the analyst — the auditor — sits at one side and is able to query the database, see what comes back, and then adapt the query to find what is actually interesting. In the back, you can also use this database for different kinds of pattern recognition tools, but that's something we're not going to discuss today.

Now, the question is: if we want to build a good search engine, what does our query language need to be able to model? Overall, I think it's the following: what does a statement look like, how can we get from one statement to another, and how do statements affect each other? We can, of course, look at how people who design compilers have tackled these problems, since when you write an optimizer or something like that, you need all this kind of information. In compiler construction, they essentially have different graph representations of code, and each of them highlights some aspect. For example, there's the abstract syntax tree that you see at the top, the AST, which gives you a hierarchical decomposition of the code into its language elements. Then there's the control flow graph — you probably know it, it's essentially what you see in tools like IDA Pro — where the statements have been collapsed into nodes, and you see how you can get from one statement to the other; on condition nodes, you see whether the condition has to be true or false to reach another statement. And then this thing here, which is not so well known, is called the PDG, the program dependence graph. This has edges to model data flow: an edge says that a variable produced at one statement reaches another statement without being changed on the way — so there is actually data flow from here to there.

As I said, each of these highlights some aspect of the code that we want to be able to model in the search, but none of them can do it all. Another problem is that a typical query sounds a bit like this: find a call to foo in a statement from which data flow exists to a statement containing a call to bar. What you're actually doing there is transitioning from one representation to the other. First you're in the syntax world — "find a call to foo" means there needs to be a node in the syntax tree called foo — but then you want to look at data flow: can I get from this statement to another statement? Here the abstract syntax tree fails completely, and instead you want something like the PDG. So you want to transition from the AST to the PDG and then back again to the AST. What we want is a representation that can do all of these things. The primary observation here is that in all of these representations — the CFG, the PDG and the AST — there is one designated node for each statement. If you look at this line in the source, there's one node for it in the CFG, another in the PDG, and also one in the AST — here it's just a declaration node; of course there's something beneath that, but that doesn't matter. So why not try to join these data structures at the statement nodes? And that's what we did. This is what we call the code property graph.
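To make the joining idea concrete, here is a tiny sketch in Python — the node types, edge labels and helper functions are my own illustrations, not any particular tool's schema. Each statement is one node carrying properties, with AST edges down into its syntax elements and control-flow/data-flow edges over to other statements, so one query can mix all three views:

```python
# Toy "code property graph" for the two statements
#   buf = malloc(len + 1);
#   memcpy(buf, src, len);
# Node and edge names are illustrative only.

nodes = {
    1: {"type": "Statement", "code": "buf = malloc(len + 1)"},
    2: {"type": "CallExpression", "code": "malloc(len + 1)"},
    3: {"type": "AdditiveExpression", "code": "len + 1"},
    4: {"type": "Statement", "code": "memcpy(buf, src, len)"},
}

# Labeled, directed edges: AST edges decompose a statement,
# FLOWS_TO is control flow, REACHES is data flow (the PDG part).
edges = [
    (1, 2, "AST"), (2, 3, "AST"),
    (1, 4, "FLOWS_TO"),
    (1, 4, "REACHES"),   # 'buf' defined in 1 reaches its use in 4
]

def out(node_id, label):
    """All successors of node_id via edges carrying the given label."""
    return [dst for src, dst, lab in edges if src == node_id and lab == label]

def risky_allocs():
    """Syntax first (additive expression under a malloc call),
    then switch representation and follow data flow."""
    hits = []
    for nid, props in nodes.items():
        if props["type"] == "Statement" and "malloc" in props["code"]:
            has_arith = any(nodes[c]["type"] == "AdditiveExpression"
                            for call in out(nid, "AST")
                            for c in out(call, "AST"))
            if has_arith:
                hits.extend(out(nid, "REACHES"))  # data-flow hop
    return [nodes[h]["code"] for h in hits]

print(risky_allocs())  # -> ['memcpy(buf, src, len)']
```

The point of the sketch is the last hop: because the AST, control-flow and data-flow edges hang off the same statement node, the query can start in the syntax world and cross into the data-flow world without ever leaving the graph.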
What you see here is that parts of the AST are still there — those little trees — but you can also get from the statement subtree of one statement to another via data flow links or control flow links. And now the idea is: maybe we can describe vulnerabilities as walks in this graph.

Once we had that data structure, the question was how to store it. It actually took me quite some time to realize that if you have a graph, trying to map it to a classical relational database management system is not going to be much fun, because you need to map a graph to tables. It took me even longer because there are actually a lot of great reverse engineering tools which do exactly this, so to me it seemed like the right way to do it. Then I started to look at different kinds of databases: I looked at document databases, and mapping graphs to documents also kind of fails. And then I realized: mapping graphs to graphs succeeds immediately. Why are relational database management systems not the right choice here? As I said, the underlying data structure is the table, and the most obvious problem is that if you want relationships between nodes, as you have in a graph — just crossing from one node to another along an edge — then what you typically do, in fact pretty much all you can do, is create a so-called join table. In this join table you associate items from a table on the left with items from a table on the right, and the lookup time will scale with the total number of relationships. But what you really want is to stand at one node and have immediate access to exactly the relationships outgoing from there — and that's what a graph database gives you.

Now, I know that when you hear "big data" and all this cloud noise, you're probably thinking this can't possibly make sense, because the only people who use these kinds of words are the guys in suits. But this actually makes sense from a technical point of view, and that's what I hope to show in the next couple of slides. The native storage format for graph databases is the so-called property graph. That's different from tables: a property graph is just a graph, really, but with properties attached to the nodes — you can think of it as having Python dictionaries attached to each node, or Perl hashes; just key-value pairs, really — and the same goes for the edges. Also, and this is important for us, the edges are labeled: it's not just that there's an edge from A to B, but there is an edge labeled "knows" or "created" or something like that from A to B. That's all a property graph really is, and it's the native storage format of graph databases. The nice thing is that once you have your data in this format, you can make use of languages designed specifically to query property graphs. There are two languages which are currently very popular. One is the Cypher query language by the Neo4j guys; this is not so well suited for what we want to do, most of all because Cypher doesn't give you the equivalent of stored procedures, and we really want to have that, as we'll see. The other one is Gremlin, and Gremlin is really awesome, as I hope to show. A typical Gremlin query looks a bit like this: you choose a set of start nodes in your graph, and then you describe where to walk the graph from there; if the walk is possible, the nodes reached that way are returned. As an example: return all objects created by people over the age of 30. Here's the Gremlin query for that. You begin by saying: take all nodes where age is greater than 30. So here
in the example, that's josh, who's 32, and peter, who's 35. Now take all outgoing edges labeled "created", and you reach this node and this one. The nice thing is that once you've created a walk like this, you can give it a name: instead of always writing the filter for age greater than 30 followed by the "created" step, you can just use that name and reuse it again and again. This is what a definition like that looks like — it's very similar to a stored procedure; you never have to write it again. And this is what we're going to make use of. For social networks this is extremely useful, of course, and the best example is Facebook Graph Search: they actually made a UI where people can click together different kinds of graph database queries, so you could look for people who like the English Defence League, for example, or mothers of Catholics from Italy who like Durex — stuff like that. But the amazing thing is that you can store other useful things in graph databases, and our code property graph is, by definition, really a property graph. So you can store it in a graph database, and now we can use this trick — taking walks in the graph and giving them names — to define a domain-specific query language for vulnerability discovery. We did all this and made it open source so you can play with it. The tool is called Joern, a robust code analysis platform for C and C++. Don't be fooled: it's mostly for C, but if you throw something like Firefox at it — which is, let's say, nice C++ — it will also work for you. If you throw something like the STL at it, it will fail miserably. By definition you get an extensible query language, because you have Gremlin, and we provide a lot of different traversals — these steps, these pipes, that you can immediately use. It's scriptable via Python, because so many people like Python, and I also wrote a couple of shell utilities that you can use for day-to-day auditing work. And it's known to work on other people's machines, which I'm very, very proud of. You can download it here.

OK, so let's take a quick look at what using this tool looks like. You have an importer — you start by importing code, that's here — and it creates a database directory, which holds the graph database. Then all you need to do is start the database server, and it will make a REST API accessible — this kind of stuff comes from the web guys. To make sure you never have to see that there's web stuff underneath, I wrote a library called python-joern that you can just include in your scripts; it will communicate with the REST API for you, and you can run your scripts happily. This is a complete working script, except for the query right here: you insert your query, connect to the database, run the query, print the results — that's all. Alternatively, you can use the shell utilities.

Now let's test this on some real projects. The first project we're looking at is the VLC media player, and here's a short disclaimer about this: if you audit people's code, that doesn't mean you hate them, and it doesn't mean you disrespect them. Mostly it means that their code base is interesting in one way or another. The fact that VLC is popular made me want to audit it. I think the VLC developers are doing a really good job, and all of the bugs that are going to be disclosed now have actually been fixed in the git version, and will probably be fixed in the next release as well. OK, so let's make this practical: you run the importer on the VLC code and it starts importing, you start the database server, and you can then point your browser to port 7474.
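As an aside, the named-walk trick above is easy to emulate outside any database. Here is a small Python sketch over the toy social graph from the example — the helper functions are my own stand-ins, not Gremlin's actual API:

```python
# Toy property graph: people with an "age" property, and labeled
# "created" edges to the things they made. The data mirrors the
# talk's example; the helpers are illustrative, not Gremlin API.

nodes = {
    "josh":  {"age": 32}, "peter": {"age": 35},
    "marko": {"age": 29},
    "lop":   {"lang": "java"}, "ripple": {"lang": "java"},
}
edges = [
    ("josh", "ripple", "created"), ("josh", "lop", "created"),
    ("peter", "lop", "created"), ("marko", "lop", "created"),
]

def filter_nodes(pred):
    """Start step: all node ids whose properties satisfy pred."""
    return [n for n, props in nodes.items() if pred(props)]

def out(start_ids, label):
    """Walk step: follow outgoing edges carrying the given label."""
    return [dst for src, dst, lab in edges
            if src in start_ids and lab == label]

# The raw walk: nodes with age > 30, then out("created").
hits = out(filter_nodes(lambda p: p.get("age", 0) > 30), "created")
print(sorted(set(hits)))  # -> ['lop', 'ripple']

# Giving the walk a name makes it reusable -- the "stored
# procedure" trick the query language is built on.
def created_by_over_30():
    return out(filter_nodes(lambda p: p.get("age", 0) > 30), "created")
```

Once a walk has a name, composing bigger queries means chaining names rather than repeating traversal code, which is exactly what makes a domain-specific query language for bugs practical.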
There you'll get some basic statistics about what's inside the database. You can see that for VLC it created about two million nodes, five million properties, four million relationships and so on, and that it uses about 705 megabytes — so it's storable. All of these experiments were done on a laptop; you don't need a server farm to operate this kind of stuff. If you look at the database directory that has been created, you'll see there's actually an Apache Lucene index in there. Apache Lucene is something you would normally use to index documents, and here it's used as part of the graph database to give you fast lookups of nodes by their properties. You can say things like "give me all calls to malloc" and you'll have them immediately. As you can see, it takes up about as much space as the graph database itself.

Now let's start using this. Here's a very simple query. As I said, there are shell utilities, and joern-lookup is one you can use to quickly pipe a query into the database and get the results. Here it says: give me all files whose filepath contains the word "demux" — so, the demuxers. Of course, you could also have done this with find. But now it gets more interesting: you can also insert Gremlin queries here. I've created a custom step known as queryNodeIndex, and you can pass a Lucene query into it to get start nodes. Instead of just returning those nodes, we can start traversing the graph from them, so we can say things like: from those file nodes, go out and filter all the functions. This will give you all the functions, and we print their names. Suddenly you're going from files to functions, and you get all the functions in files named "demux". That's what that looks like. Let me see how much time I have left. OK.

Of course you're going to ask: OK, nice, but how do I know what's actually in the database, and how do I get from one node to another? The nice thing is, if you work at a university, you eventually have some master students, and one of them wrote a tool to visualize the graph database contents. This is, of course, printed in a way you probably can't read from the back; the point is just that in those nodes you see all the properties, you see the different edges between them, and the edges carry labels and attributes as well. You can use these tools to plot graphs — for example, here we say: give us the data flow edges and the control flow edges — and it will print nice things that you can put onto slides, or study to learn how to get from one place to another.

Now, let's finally look at bugs. I first wanted to look for things which are in some way similar to the bug I showed at the beginning. I usually start by formulating in words what I'm looking for, and in this case that was: get calls to malloc where the first argument contains an additive expression, and where a call to memcpy is reached via data flow in which the third argument — the amount to copy — also contains an additive expression, and the two additive expressions are not equal. These are the kinds of things you can easily find with the tools we wrote. As a query, it looks like this. This is a custom step, getCallsTo — and you'll find a lot of different ones; there are, for example, steps to get definitions and to get functions by name, stuff like that. This gives you the start nodes, the calls to malloc. From there, walk to the first argument — argument zero, because we're computer scientists and we start counting at zero — and, as a side effect, store the code you see there in a variable. Now, from there, I look for an additive expression; as I said, if there's no additive expression, an empty set is returned here, so those calls are filtered out. Then, from all those mallocs with an additive expression in the first argument, go up to the statement that encloses this, and from there follow data flow links to calls to memcpy where the third argument is not equal to the code we just stored up there, and where there's an additive expression inside. That's it. I can imagine that if you see something like this for the first time it looks really complicated, but you get very fluent at this kind of stuff. Then we just pipe it into joern-lookup, sort it unique, extract the locations, and we have a file containing the locations that match. Reading that file gives us four hits, and I immediately said: oh wow, OK, the MP4 one looks nice. So let's look at the MP4 stuff — you can pipe it into another utility, which will give you the code. This is what you find: a function called MP4_ReadBox_name. Well, I would have asked you how to trigger the bug, but I wrote it in the title of the slide. It looks very similar to the bug Stefan found: inside the malloc there's an addition — you add one and subtract eight, so really you're subtracting seven — and down here, in the copy, you subtract eight. Now, if you insert a seven, then seven plus one minus eight gives you zero, so you allocate something of size close to zero; and here, seven minus eight is minus one, which as an unsigned integer is the maximum value. And there you go — you have your buffer overflow, in the MP4 demuxer. Let's see. Yeah, OK.

An interesting question to ask here is: how do you come up with things to search for in a code base to actually find bugs? There were some wise words on this in The Art of Software Security Assessment, and I'd like to quote them: "In fact, the authors typically start reviewing a new code base by finding the equivalent of the util directories and reading the framework and glue code line by line." I found this really helpful: if you start auditing a new project, take a look at the language being spoken in that code base — and that language is the utility functions. In the case of VLC, I took a look at vlc_stream.h; this is the API they use to analyze data streams of all kinds. If we find fundamental difficulties in using this API, then we're going to find bugs. What I found is that there's a function called stream_Size, and as you might have expected, it returns a 64-bit integer. That's because a data stream — it might be an MP4, for example — may well be bigger than four gigabytes; any good HD movie is going to be bigger than four gigabytes. The problem is that on 32-bit platforms you can't store 64 bits in a register, so the size of the stream can be larger than the maximum the platform can handle, and that means you have to be very careful about all the locations that depend on stream_Size. It's also realistic for an attack: if somebody downloads a movie from a file-sharing application and the caption is right, people won't think "oh, this is four gigabytes, I'm not going to open it" — it's four gigabytes because it's a movie. So, formulating this in words again: give me statements containing calls to stream_Size and the symbol int64_t, where data flow exists to statements containing the symbol malloc. That gives you allocations depending on the stream size. The sentence is not so long, and neither is the query.
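Both failure modes discussed so far — the wrap-around in a malloc of "length plus one", and a 64-bit stream size being squeezed into a 32-bit size_t — are easy to reproduce. Here is a small Python sketch that emulates the fixed-width arithmetic a 32-bit platform performs (the mask stands in for the hardware behavior; the values are illustrative):

```python
# Emulate 32-bit unsigned arithmetic: C wraps modulo 2**32.
U32_MAX = 0xFFFFFFFF

def u32(x):
    """Truncate to 32 bits, like storing into a uint32_t / 32-bit size_t."""
    return x & U32_MAX

# Case 1: malloc(namelen + 1) with an attacker-chosen namelen.
namelen = U32_MAX               # maximum 32-bit value, from the network
print(u32(namelen + 1))         # -> 0: zero-byte buffer, ~4 GiB copied in

# Case 2: the MP4_ReadBox_name pattern: malloc(n + 1 - 8) vs. copy of n - 8.
n = 7
print(u32(n + 1 - 8))           # -> 0: allocation of (almost) nothing
print(u32(n - 8))               # -> 4294967295: copy size wraps to UINT32_MAX

# Case 3: a 64-bit stream size truncated on a 32-bit platform.
stream_size = 4 * 2**30         # a 4 GiB "movie"
print(u32(stream_size + 1))     # -> 1: malloc succeeds, but is far too small
```

The same three lines of arithmetic are what the graph queries in this talk are hunting for: an allocation size and a copy size that are computed differently from the same attacker-controlled value.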
Again: you get calls to stream_Size, go up to the enclosing statement — now you're leaving the AST part — then filter those statements which contain int64_t, and follow the data flow to statements that contain the word malloc. That's it. You don't get many hits — those are all of them — and most are not so interesting. But this one is nice, because it's the updater: an "i_read plus one" depending on stream_Size. So let's look at this code. This is the VLC automatic updater, the thing that checks in the background whether new updates are available. You can see it connects to some URL — we'll see which one — then it calls stream_Size and stores the result in i_read, which is a 64-bit integer. Then it calls malloc(i_read + 1). On a 32-bit platform, this will first succeed: the calculation is most probably done in 64 bits, but no matter what happens, the result is then truncated to 32 bits, because size_t on a 32-bit platform is four bytes rather than eight. So if i_read is maxed out, you again get a buffer with a size close to zero. Then they call stream_Read — an API function that is similar to memcpy in a way, but much better, as you'll see in a moment — and read i_read characters into this buffer. So again, you have an overflow. Now you're going to say: yeah, but the update is going to be verified, right? Here's the nice thing: they do signature checking after downloading over plain HTTP — but this buffer overflow happens before the signature is checked. So: execution before verification. Nice. This could be something we're interested in. Now let's look into the binary. OK, this is a static URL — it should be update.videolan.org and so on — and this here, if you follow the link, you'll see, is the call to stream_Size. It's been inlined, and here you see the truncation in a very obvious form. The return value is actually stored in two registers, because you can't really do it differently on a 32-bit platform, but then, as they add one, they really just add one to one of the registers, because they're going to use a 32-bit value anyway — so they might as well truncate here. Then this flows into the malloc, and there you go: you have your integer overflow leading to a heap buffer overflow.

Now, I set up my little web server, pointed an /etc/hosts entry for update.videolan.org at that web server, and created a file that is four gigabytes long and contains all A's. I attached a debugger and checked for updates, and here's what happened. This is very convenient, because typically, starting from a crash, you need to do all sorts of work to actually make EIP equal your value — and here it's 41 41 41 41, meaning it's all A's: you control where the program jumps next. The nice thing is, once you have this, arguing in favor of patching gets very much easier.

OK, some notes on exploitation. stream_Read is your friend — as I just said, it's similar to memcpy, but much better for us. First of all, the size depends on the Content-Length field, so you don't actually have to transfer four gigabytes of data; you just announce that you are eventually going to transmit four gigabytes and then send some data. The attacker fully controls the amount of data actually copied, and you also control how the data is fragmented into different blocks. And now the nicest thing of all, and this is why we control EIP: in between those copy operations, stream_Read dereferences a function pointer. So if you overwrite this function pointer, you control EIP. ASLR is enabled, but there is a bit of position-dependent code that we can use to
build a small ROP chain for the exploit. The downside is that this is not so stable — at least my proof-of-concept exploit isn't. So I'm going to show you the PoC now; I only spent a couple of days on it, so it's not very stable, but it will work at some point. Let's see. OK, this is the platform. Now I start VLC and use the automatic updater — and it didn't work. I tried again, and it didn't work either. PoC exploits have stage fright, as you know. Can you give me five tries? Almost — that's very close, very close. Just a second. Yes, there we go. The reason this didn't work straight away is that I didn't do all the work needed to get the heap into a nicely controllable state; these were really just a couple of lucky shots, and eventually one hit the right address. But it shows that you can execute arbitrary code, and that has to be enough to get it patched. Good. Next.

Our second test subject was the Linux kernel. This is also well suited for static analysis, because there are a lot of drivers in there, and to fuzz those drivers you would actually need to have the hardware — so you want to find things statically as well. It's also a logical target because of its large user base. So it's interesting. Here I want to show you — and I know this is a lot of text on the slide — that we can define very complex traversals, hide them behind a single word, which in this case is "unsanitized", and then reuse them again and again. With "unsanitized", what I want to find are cases where there is a flow from a source to a sink, but certain checks are definitely not in place. Formulated in complete form, the traversal returns controlled sources if and only if the following two conditions are met. First, there exists a path from the source statement to the sink statement in the control flow graph such that no code on the path matches any of the sanitizer descriptions — these descriptions are something you pass in. Second, a variable defined by the source and used by the sink reaches the sink via that control flow path, i.e. the variable is not redefined by any node on the path. That's pretty complex, but you can implement it once, put it into a step called "unsanitized", and reuse it again and again. And that's what we did here.

So here you see a typical query; we're looking for buffer overflows in write handlers. We start by characterizing the sinks we want to reach, the sensitive operations: get functions whose name matches "write", and from there go on to calls to copy_from_user or memcpy, and to their third arguments. Now filter those for "count", because "count" is typically a variable you can control in a write handler. And then comes the unsanitized step — that's the important part. In the unsanitized step, you characterize the different kinds of checks you want to filter out: if this stuff occurs, you don't care about that sink. One of them is: is there some sort of comparison in here — that is, is there a check? Another: does the code contain "alloc"? That would mean the buffer is probably allocated to have the correct size. Of course, we don't know this, but let's filter all of these and look at the cases where it's very clear that this is a bug. And "min" is not used — min is often used instead of an explicit check — so we filter those as well. Finally, make sure the sources match "count". We have a similar query that we're not going to go through in detail, but again, the unsanitized part is exactly the same; all we changed is a slightly different characterization of the sinks.
sources, and we ran these two queries and, um, well, we had seven audits and 11 hits using this. So that was very cool. Now, here's one of them just to show you what these kinds of bugs look like and what what you can use this unsanitized step for. So this is an actual handler of Q F that. If that card you see here call to copy from user, which is being used to initialize the variable Rick Lence request. And that means that this variable here is controlled by userspace. Now there's an allocation here on the way, but it's absolutely not concerned with Rachlin. Instead, Rocklyn is next used in this copy operation down here as a third argumentum copy. And this buffer, well, the size of this buffer has nothing to do with Rocklyn. And it's a it's a static buffer. So you have a classical stack based buffer overflow. And the nice thing this is why there were only 11 hits is that we were able to actually filter those cases where there are different kinds of checks to ensure that this kind of stuff doesn't happen. Right. So stack based buffer flow, we did not write and exploit what looks looks pretty exploitable now for the practical evaluation of all the stuff. I mean, I was just able to show you a couple of samples here. We did a larger evaluation and Niko worked on this for a great deal. So he used this at well in a real company for Qualcomm and found about a hundred issues in an internal audit. Now, I don't know exactly what an issue is here, so can can comment on the quality of those findings. But then we also did an evaluation. So he did most of the work with this on the Linux kernel mainline and we formulated different kinds of queries. So two queries, buffer overflow, as you just saw them, although they were slightly rewritten. But yeah, we just saw them then we had zero byte allocation, memory mapping box and some memory disclosure box. This one was rather complicated and need to publish it at some point. Yeah. And we found a lot of blocks kind of proud of that. So
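By the way, the "unsanitized" step just described can be modelled in a few lines. This is only a minimal Python sketch of the idea, not the actual Gremlin implementation; the graph encoding, node names and predicates here are illustrative assumptions:

```python
from collections import deque

def unsanitized(cfg, source, sink, sanitizers, redefines):
    """Breadth-first search for one control-flow path from `source` to
    `sink` on which no sanitizer description matches and the tracked
    variable is never redefined.  True means an unchecked path exists."""
    def blocked(node):
        # A node kills the path if it matches any sanitizer description
        # or redefines the variable; source and sink are exempt.
        if node in (source, sink):
            return False
        return any(match(node) for match in sanitizers) or redefines(node)

    seen, todo = {source}, deque([source])
    while todo:
        node = todo.popleft()
        if node == sink:
            return True  # reached the sink without hitting any check
        for succ in cfg.get(node, []):
            if succ not in seen and not blocked(succ):
                seen.add(succ)
                todo.append(succ)
    return False
```

For the write-handler query, the source would be the copy_from_user call defining count, the sink the third argument of memcpy, and the sanitizers the comparison, "alloc" and "min" patterns mentioned above; a source/sink pair is reported only if every check is missing on some path between them.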
And they were also acknowledged by the developers, so that's always nice: to know that these are not just bugs for security folks, but that the developers would also say these are real issues. (OK, ten minutes.) Yeah, that's good. Uh, OK. So, conclusion. Yeah, I showed you a system to mine code bases for vulnerabilities. Um, well, here I've built a bridge between program analysis and graph databases. In the long run, you know, the larger effort behind this is: can we use typical pattern recognition and data mining techniques to discover vulnerabilities? And this is one part of that. And finally, as I said already, we found real, exploitable bugs. So, yeah, it seems to work in some cases. Thanks.

Okay, thank you very much. As usual, we have some time for Q&A left. We again have four microphones in the room; all the people in the room who have questions, please line up behind the microphones. And for the people on the stream, we also have a Signal Angel in the room who will take your questions from Twitter, IRC and all your other online means. And we're going to go with the first question from the room, from microphone number two, please.

Yeah. So, do you have a database with queries ready? Don't you think it would be a good idea to integrate this into a continuous integration system, yes, with a database of such queries ready? And is there any way to annotate code? So if this report is a problem, can I annotate it in some way so that it will be skipped the next time it runs through?

Um, we currently don't have a database of, like, a lot of queries. We have some examples, right, that you can look at, but we definitely want to have this; it's more a question of the time that you're able to invest into this. But I also think that, I mean, essentially what you're saying, if I understand correctly, is: can we not build a database of things that we know have broken in the past, right, so that we can immediately scan new code for this kind of stuff? Um, yeah, this would definitely be very awesome, but needs some, yeah, work to be done beyond just writing queries.

On the annotation part: I mean, can I mark some code as already checked and safe, so it doesn't come up again?

So, yeah, I have something interesting to say about this, but I'd rather take it offline, because it concerns work that we haven't published yet. About the annotation, yeah, maybe we can discuss this after the talk. Thank you.

Next question from microphone number three, please.

Yeah. Given that you support C and C++, I assume you're obtaining the syntax tree from the Clang compiler, for example?

No. Yeah, this is another story, but I was asked by several people to skip it because it's a bit boring. The thing is, I said something about a robust parser, right? The problem really is that the job of a compiler is to also check whether the code that you're giving it is actually correct. In C, that means you need the complete code base and a working build environment with all the libraries and stuff like that. I wanted something that's more like a search engine. So I actually built a parser for C that does robust parsing: it assumes that the code you give it is valid C++ in some weird dialect spoken some place on Earth, right? And then it will just parse the stuff that it understands, and the stuff that it doesn't understand it throws away. That's part of the project. Um, yeah, the problem that you have with Clang, I mean, you could definitely write an interface for this, but the problem is that typically, as soon as it sees some code with a definition it doesn't know, the rest of the source file is not parsed, and we don't want that.

Okay. And a question from number three, please.

Hello. Are there any errors that one would intuitively think one should be able to find using this method, but which are not possible or, you know,
too hard, too expensive? Do you have any interesting examples of that?

Well, in a way, take a look at this one. This is actually pretty intuitive, and I would have thought that it would have been found a long time ago. I think what's problematic here is maybe that you can only trigger the bug if you insert a seven. So if you take a fuzzer to find these things, and I think most people fuzz this stuff, then you're probably not going to hit the seven, right? But in a way, I think this looks so intuitive, and we still see that using these simple methods we can find bugs. So, yeah, I don't know if that answers your question. Probably not.

Yeah, I think you answered it the other way around. OK, so that's where the system is good, and I was asking for things that the system is surprisingly bad at.

Um, well, I mean, there are very obvious limitations, right? You know, we're not really evaluating the expressions at all. As I said in the beginning, it's more that you try to model what people do who look at code manually: look for additions inside allocations, or stuff like that. But you don't really evaluate the expressions to see. Yeah, that's still very abstract; I can't name anything concrete, but I mean, that's the biggest problem, especially if you compare it to exact methods such as symbolic execution, where you actually calculate what the real values could be, which we don't do here.

Thank you.

Okay. Next question from microphone number two, please.

Oh, that's OK, I can just put my hand up, I think. I'm wondering, we have now seen that you evaluate code being C or C++; well, you said C, then C++. Is it that you kind of use a grammar to parse, since you said you wrote your own parser? Because I'm mainly working with Java, and I'm interested in searching for bad coding and possible bugs in Java.

Yes. So we wrote a grammar for the ANTLR parser generator, which is a really awesome tool. If you compare it to stuff like flex and bison: I mean, with flex and bison you spend lots and lots of time making flex and bison understand what you want to parse; with ANTLR it's really simple. And we've also had somebody take a look at adapting this to Java; there are people actually using it for Java right now. But clearly, um, well, you're feeding Java to a C parser. So most data flow inside a function, for example, works very well. But if you want to take a look at class hierarchies and stuff like that, it would probably be a good idea to just adapt the parser to Java. Yeah.

OK, I'm jumping the queue, I'm afraid. Can you say something about the performance? Do you need to tweak the queries to make them faster, or is it OK?

Yeah. So it depends a lot on the number of start nodes that you choose, right? If you have a lot of start nodes, and this is also in the documentation, you need to chunk this, because it won't run all of these queries in parallel on its own; at some point it could be that you want to merge the results again, so you need to split this up yourself. Of the queries that we had here, those we presented, the slowest took about five minutes to execute on my laptop. So I guess that's acceptable if you're doing an audit. The part that takes most of the time was this "unsanitized" step, where you actually start traversing the control flow graph again and again and again to see if there's any path, any reasonable path, that you're interested in. For simpler stuff, like this bug back here, for example, or also the updater bug, it's pretty instantaneous. Well, maybe not instantaneous; it takes 10 to 15 seconds or so. Yeah.

Oh, OK, I see we don't have any more questions in the room. Is there maybe a question from the Internet? OK, no. So please, thank you again for your very interesting talk.
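The robust parsing idea from the Q&A, parse what you understand and throw away what you don't, can be illustrated with a toy sketch. This is a hypothetical Python model, not the project's actual ANTLR-based parser; the regex only recognizes a tiny "island" of C, function definition headers, and skips everything else:

```python
import re

# Toy "island parser": recognize C-like function definition headers and
# skip everything it does not understand, instead of aborting on the
# first construct it cannot handle, as a strict compiler front end would.
FUNC_RE = re.compile(r"\b(\w+)\s+(\w+)\s*\([^)]*\)\s*\{")

def extract_functions(code):
    """Return the names of function-like definitions found in `code`,
    silently ignoring unparseable regions."""
    return [m.group(2) for m in FUNC_RE.finditer(code)]
```

With this sketch, extract_functions("int foo(int a) { return a; }") yields ["foo"] even if the surrounding file contains macros or dialect-specific code that a strict parser would reject.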