Hey you!
Before you start transcribing the talk, please have another look at our style guide: https://wiki.c3subtitles.de/de:styleguide (English: https://wiki.c3subtitles.de/en:styleguide). If you have any questions, you can ask us directly, or reach us at https://webirc.hackint.org/#irc://hackint.org/#subtitles or https://rocket.events.ccc.de/channel/subtitles .
Please don't forget to record your progress in the progress bar on the talk's page.
Thank you very much for your commitment!




======================================================================







Introducing the next talk, just getting started with looking at the stories behind the numbers: Mr. Stefan Wehrmeyer will give a talk about computing numbers, with an application to the problems of our society. Please give him a warm applause.

Yeah, thank you. Obviously, the title is a reference to a Turing paper. Those of you who got it: thumbs up. This talk will basically combine computer science and journalism. I'm currently a data journalist. I joined a newsroom about one and a half years ago; it's called CORRECTIV, a nonprofit newsroom based in Berlin. We do long-term investigations, we are member-based and currently foundation-funded, and we do investigative journalism.

Investigative journalism, what is that? One good example from popular culture would be this one. This is the movie "Spotlight", which just came out in the US and will come out soon in Germany. It's the story of the uncovering of child abuse by Catholic priests in the Boston metropolitan area. And this is the team that basically uncovered it; or rather, it's not: these are the actors that play the investigative journalists. The whole film is actually quite a good representation of how investigative journalists work. It depicts a story from around 2001, 2002, and it's slightly overdramatized, of course, because it's a Hollywood film, but it depicts investigative journalism quite accurately. It also depicts the gender balance quite accurately, as you can see here, but that is getting better. In Germany, for example, the leading data teams at Spiegel, and at SRF in Switzerland, are led by women. And the organization that represents investigative journalists in Germany also has mostly women on its board. Still, many investigative journalists, like too many, are men, but women are getting into the field as well.

And the Spotlight team: what they did is they got a tip, and then they 
looked at the data. They collected data about priests who were moving between the different parishes, the different districts in the metropolitan area of Boston. Every time there was an abuse scandal, the priest got sick leave or something similar and was moved to a different district, to cover up the scandal and make it appear to be just a single case. The truth was that many more cases were present, and they basically uncovered that it was a systemic problem. This is actually one of the core pieces of investigative journalism: you don't show that a single thing is wrong, like "this man did something bad", but that the whole system is set up in a way that many people do many bad things. They used books that listed which priest was in which parish in which year, one book for every year. They went through them, typed the data into a computer, and ended up with a nice spreadsheet that showed where the priests were moving.

So investigative journalists and computers are a perfect match. Of course, computers are used in many other areas of journalism: every major newspaper also has a website, and there's now robot journalism coming up, where sports events are covered by computer programs, not by humans anymore. But what I'm specifically talking about is investigative journalism with computers, a field that started out as "precision journalism" or "database journalism", then "computer-assisted reporting"; "data-driven journalism" is the current term, and there's also "computational journalism". All these terms basically mean that you use a computer to do an investigative story. Philip Meyer, one of the first investigative journalists to use a computer, said a journalist has to be a database manager. We can't quite compare a database manager with a journalist, but it's getting closer. A journalist has to have their facts, and there are too many facts to keep just in your mind, so you have to put them in the computer. Now I will present a couple of fields in computer science that investigative journalists use to make their stories happen.

One of the big ones is, of course, natural language processing. You know the Snowden leaks, or you might remember the Offshore Leaks and the couple of leaks that followed. When you have a big leak of data, or you've got a big pile of documents via a Freedom of Information request, these are thousands, maybe hundreds of thousands, maybe even more documents, either on paper or as PDF. What do you do with them? You can't possibly read them all; current newsrooms don't have enough staff or enough time to spend on these investigations, so they have to use computers to make the job a bit easier. Natural language processing is perfect for that. You put all the documents you have into the computer, you possibly have to OCR them, and then a couple of things might work in your favor. There's entity extraction, of course, which finds out which entities these documents contain, so it's not only "Mr. Obama" but also "President Obama" and "Barack Obama". You can extract these entities and know which documents talk about which of them; in an email dump, for example, you can extract who is talking with whom, or company names. This is easily done with entity extraction techniques, as is deduplication, and topic modeling: you learn that these documents talk about these topics, so you don't have to read all of them. If you want to focus your story on a specific topic, you go down that path and only look at the documents that were automatically sorted into that category. Part-of-speech tagging is often also quite useful: when you look at documents like parliamentary debates, you want to find out who is talking about what, and in what kind of way, and you can find that out with part-of-speech tagging. And of course basic search is always useful, and there are many advanced ways to search that journalists have to use to make sense of these big document stacks. Search has been part of computer science since the '70s and '80s, and nowadays Solr or Elasticsearch or other search engines do it quite easily. But these are made for computer programmers, right? We as developers set them up, configure them, and build our own back end and front end on top so that other people can actually use the search behind it. Journalists want a couple more features there, and we have a couple of applications that help: namely DocumentCloud, a service where you can upload lots of documents, they are automatically made searchable and entities are extracted, and you can also publish them for your readers to look at; Overview, which does topic modeling so you can dive into your documents more easily; Project Blacklight, which is basically a Solr front end that I can give to my journalist colleagues so they can use Solr search in an easy way; and of course Google Refine, which is usually used for tabular data but also has a very good reconciliation back end and clustering, where you can do deduplication. So if you have a list of company names and they are very dirty, you can reconcile them, or basically deduplicate them, and make all the company names match again. There is also proprietary software, namely the new IBM Watson Analytics, but these tools are very expensive, most journalists have never seen them, and it's very difficult to get your hands on them. That is quite sad, because journalists have to rely on these open source tools, and only a few of them are actually made for investigations. On the computer science side, I mostly talked about English language models; it's very difficult to find good German models that are already integrated into some kind of software you can use in the German-speaking world, and I hope that changes soon.

Then there's machine learning, another big field, which is mostly used for classification tasks, and statistical analysis to find out what belongs to which category. And of course there are neural nets and deep learning coming up now; you can see there has been some deep dreaming in this picture. But I haven't seen any journalistic piece that has used that yet. I worked a bit on something that uses neural nets to solve captchas on some databases, to scrape them better, but this is still in the making. One story that actually used natural language processing and machine learning identified police reports of the Los Angeles Police Department which misclassified over 25,000 crimes. When a police officer comes to a crime scene, he writes down a report; it later gets put into a database, and a clerk classifies the crime that happened as minor, serious, or another category, based on the description. The Los Angeles Times wrote a machine learning classifier that looked at the description of the crime and the proper classification, trained it on a training data set, and then ran it over the whole data set to check whether all the other crimes were properly classified, and apparently over twenty-five thousand of them were misclassified. Of course, you can't go through all these records and classify them by hand; a machine can do that much more easily. It has also been confirmed by the police department that misclassification was going on. The result is that the crime statistics look much less serious, with more minor crimes and fewer serious crimes: you can basically cover that up through misclassification, and the Los Angeles Times could uncover it through machine learning.

Then there's the big field of social network analysis, a favorite topic of Mr. Lindback in the second row. Social network analysis is basically the bread and butter of every journalist's work: we collect information about certain entities and we try to find the connections. You can put that into a graph like this, a network graph. The problem is that the result is mostly not journalism; it's just a research database. You collected some facts and you can display them like that. But it's also a very subjective data collection, because you only cover the connections you think are important, and you possibly don't see any others. It's more like a knowledge management tool where you can collect everything you know, to better collaborate with your fellow journalists; but as a result, it might not be journalism. So you can't say: OK, I've got this big graph, now I compute some eigenvalue centrality measure, and then I've found the bad guy. It doesn't work like that; you can't compute the bad guy out of such a graph. What you need to do is proper journalism on top: you have a knowledge graph, you can look at it, and then you can interview people and find out more through proper, let's say old-school, investigative work. What you see in the background is the Lobbyradar, which has now been shut down; it used to be run by ZDF. But this is more like a piece of art than something that gives you actual insight. It's difficult to make social networks appear, and make them understandable, let's say.

Then there's the brand-new field of algorithmic accountability. We also heard, for example, the talk about the VW Dieselgate scandal; that is a topic of algorithmic accountability. More and more algorithms are put into every device we know, making decisions that affect all of our lives. Now we have some hackers that do some reverse engineering, and that is great, and they present it at Congress. But of course, this is basically journalistic work, and we need to bring these techniques into the newsrooms: the newsrooms, the investigative journalists, need to understand how this stuff works and how to reverse engineer it. Nick Diakopoulos, a researcher in Washington, I think, did a lot of work on that. One example was the stock trading plans of executives, which are pre-planned: you can analyze how such a plan works and whether insider trading is behind it. Or, for example, how does the iPhone autocorrect work? You can observe the input, you can observe the output, and ask what is happening inside. Another example would be: how are prices displayed on retail sites for different geographical areas? That is not an easy task, but it's becoming more and more important, especially as there's not much transparency around how these things work.

So journalism gets closer to science. Investigations in journalism use the investigative method: like in science, you have a hypothesis. You make up something like "these kids are underprivileged because of corruption going on in the school system", and then you have to prove that hypothesis. So it's very similar to science. And science itself is also moving towards working in a more reproducible and transparent manner. The story I told you earlier about the LAPD: this is the code that was used for it. You have a machine learning classifier, a support vector machine, 
and you can basically run the code yourself to train the classifier and then classify some of those reports. They only published a tiny training data set and only parts of the data, but they basically made their methodology transparent. And this is where science is going as well: many research papers nowadays are not reproducible, but they should be. This is a Jupyter notebook, where you can mix prose and code, execute it, look at the result, and anyone else can reproduce your work. This one is Python, but R is also a favorite language of investigative journalists in the data area.

Then, one big thing I discovered is that software engineering in the newsroom is not that easy. First of all, of course, there's IT support, and there's the problem of the CMS. The content management system is always a problem: as a software engineer, you basically always fight the CMS, and in big organizations like The New York Times they basically create their own hacks just so that they can put in their beautiful graphics and work around the rest of the CMS; there are big hacks going on there. But this is not what I want to talk about. Software engineering in the newsroom is also building tools for your fellow journalists, and that doesn't have its roots in the newsroom, which is why it's a bit difficult at the moment. Right now, a journalist writes an article, it's published, and then you can forget about it; you never touch it again. So there's no technical debt in articles, unlike in code. Sometimes code in newsrooms is also written for a single story: you write code for that story, you publish it, and then you forget about it. But of course, as engineers, we learn that this is not how to do things. We don't want to write the same code again for the next story; we want reusability. We want to fix a bug only once; we don't want to fix it a million times across all of our articles. That means we need to clean up a bit and develop some kind of method for writing software in the newsroom; currently it's quite a hack, as I perceive it.

Then, computer science papers. I love to read them, they have very interesting ideas, but mostly they don't come with code, and when they do come with code, it's not running code; it's difficult to actually make it run. And when I have actually compiled that C library to make some machine learning a bit faster, it's still not usable software: I can't give what I compiled to my colleagues to actually use. So I hope that when you publish something in computer science, you give me something that I can actually bring into my newsroom to make people's lives easier.

Collaboration, which is basically innate to open source software, is also a bit more difficult in newsrooms. There's always competition going on. Investigative journalists especially are used to being perceived as lone wolves: if you are onto a story and someone else has heard of it, you'd better publish soon, because the other guy might scoop you on it, and then your story is burned, you can't publish it anymore, and all the work you did for it was in vain. In open source software, on the other hand, it's great if many people collaborate on a piece of software; the more contributors, the better. So we need to bring this idea of collaboration into the newsroom, and this is still a problem; it's not quite there yet. There are some collaborations now, between The New York Times and The Washington Post, for example, or between ProPublica and another, bigger publication in the U.S., I think. And at CORRECTIV we also collaborate with many other news organizations: they publish our stories together with us. And we hope 
that this idea of collaboration, which is basically a software idea as I perceive it, also comes into the publishing of news stories.

Another big problem is that we have some software, and we might as well use it even when other software would fit better: I can only use what I have. The hammer-and-nail problem definitely exists in the newsroom. Have you ever seen a map in some news article with lots of points on it? That's because the journalist who did the story had this mapping tool where they could put in a bunch of data and it put it on a map, and even though it might not make any sense for the story, they just used the tool that they have. Or, for example, a timeline: there's an easy tool to make a timeline, and then you have a timeline, even though it might not be the best way to present your story; it's just the tool that is there, and developing another tool might not fit the deadline or your resources.

So I'd basically say we need more applications for our society, and many advances in computer science are quite slow to benefit the public at large. If there is a big jump in, let's say, machine learning, Google knows that first, because they do the research and they develop the applications. And other big companies, like Palantir, the NSA, or Internet companies, basically use the latest research to do better user tracking or better targeting. They benefit more quickly from these developments, because they do their own research and because they have more resources, and much cutting-edge research comes out of these corporations. Google, for example, recently released TensorFlow from Google Brain. It's a machine learning library; there are other machine learning libraries, but this is one that is very usable, and it has the advantage of being better supported, better documented, and easier to use. But it might not exactly fit the journalistic use case, and so journalism needs more resources to develop its own tools. The tools I mentioned, like DocumentCloud and Overview, are quite good: they are targeted at journalists, developed by journalists, and they fit the use case quite well. But it took six-figure amounts, as I recall, to develop them over the years, and it was very difficult to get the use case right. Google Refine, for example, is an invaluable tool for many journalists to work with tabular data and clean it, and it's really used a lot. But it was developed by Google and then open sourced, and that basically means it hasn't seen a release in two years. And it is kind of bad that we don't have the resources to work on the tools that we use in the journalistic trade every day.

So my call to you is: support journalism as a service to the public, and help journalists develop the tools. What we have here is basically a public good, journalism; we try to be in the service of the public. For example, join a newsroom if you can; it's really fun work. I joined the newsroom simply because I think it's basically the best political activism I can do, with the most impact, and it's not only focused on technical topics. We hear a lot about data retention and other data topics, but when you work in a newsroom you get a very broad range of topics from all over society, and you can still help with data literacy. And another hint: if you want to get in touch with journalists, there's a thing called Hacks/Hackers, which is a meetup in every big city; in Germany, I think, it's only in Berlin and Hamburg, and I think they also have it in Australia. But if you're from any place else, like New York or London, they all have Hacks/Hackers meetups, where the journalists, the hacks, and the hackers come together to meet and talk about technology and journalism. So if you want to get an idea of what's going on in that world, join a Hacks/Hackers meetup, and, I don't know, improve journalism by contributing your ideas. Thank you.

So, I think we have time for questions. We have a question from the Internet, please.

Yeah, the Internet is asking many things; actually the most important question is: is your data mining software available as free software? And please mention some of the names of the tools you have used.

My data mining software? I always write data mining software, basically, and that is a problem: for every story you write a script that does it. That has advantages, because you can customize it; it has disadvantages, because you have to write the software, and it's not quick and easy. For example, we as an organization publish all our work on GitHub, at github.com/correctiv, and you can have a look at the software that is there. Mostly it's just front end stuff, but we will also publish more back end data analysis pieces in the future. Many news organizations have repositories on GitHub that explain how they do their stuff, and you can find their software there. And the other question was about tools: I mentioned TensorFlow as a machine learning library, but there are not that many tools for journalists; mostly there are libraries. I'm using pandas for Python, but there are also very many R packages that you can use for data analysis. The problem is that nerds are a minority in the newsroom, and that means that if you want your journalists to use these tools, use these techniques, you have to write the right tools to make them usable for, let's say, the normal people. OK, thank you. Not that nerds are not normal people, but yeah.

We have another question, please.

Thank you very much. You've been talking a lot about language processing tools, machine learning tools, and all of those are, of course, known to produce errors, to misclassify.

That's right, yeah.

And even if they classify correctly, it's not always easy to see what the classification actually means. You alluded to that shortly when talking about graphs, saying you don't just look for the central person in the graph and declare them the bad guy. So how do you deal with these risks of misclassification, and with the illusion that the data could provide you some knowledge or insight that is not actually in the data, that is only apparently there?

Cross-checking, like the normal cross-checking you do with data: check your data before you put it in there. Quartz recently published a long list of how to interview your data, to make sure it's up to a certain standard, or that you are at least aware of its failures. Many times the input data is already flawed in many ways. And then your methodology: of course, double-check it, talk to experts who know more about the field than you do. And by publishing your methodology you make yourself vulnerable, but also transparent, so if something bad is going on, your readers or any other interested party can basically rerun what you did and tell you what you did wrong. So none of these machine learning things lifts the journalistic burden: as a journalist, you still have to validate your findings through a second means, or at least do a check on a bigger sample. The result is not coming out of the computer; the result is coming out of the human mind, as the result of your research. It's not simply the raw output of a tool.

Thank you so much. I think we are all done now with the minutes we have for questions. Thank you so much. Thank you, everyone.
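[Editor's note: the LAPD audit described in the talk — train a classifier on crime descriptions with known categories, then flag records whose recorded category disagrees with the prediction — can be sketched roughly as below. This is a minimal illustration assuming scikit-learn; all descriptions, categories, and records are invented, and the Los Angeles Times' actual pipeline was more involved.]

```python
# Sketch of a classification audit: fit a text classifier on labeled
# crime descriptions, then flag records whose recorded label disagrees
# with the model's prediction. All data is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set: (description, category assigned by a clerk)
train = [
    ("suspect struck victim with fists causing injury", "serious"),
    ("victim stabbed during altercation", "serious"),
    ("armed robbery at liquor store", "serious"),
    ("shoplifting of items under 50 dollars", "minor"),
    ("graffiti sprayed on wall", "minor"),
    ("petty theft of bicycle from yard", "minor"),
]
texts, labels = zip(*train)

# TF-IDF features feeding a linear support vector machine, as in the talk.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

# Audit step: compare each record's recorded category to the prediction;
# disagreements are leads for a human to re-check, not verdicts.
records = [
    ("victim beaten and hospitalized", "minor"),
    ("theft of garden gnome", "minor"),
]
flagged = [desc for desc, recorded in records
           if clf.predict([desc])[0] != recorded]
print(flagged)
```

As in the story, the flagged records would then be verified by hand on a sample, not taken at face value.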
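[Editor's note: on the "you can't compute the bad guy" point about network graphs — the eigenvector centrality the speaker alludes to is easy to compute, and the sketch below does it by power iteration with only the standard library. The names and edges are invented; as the talk stresses, a high score only tells a journalist whom to look at first, it proves nothing.]

```python
# Eigenvector centrality by power iteration on an invented contact graph.
# Repeatedly multiplying the score vector by the adjacency matrix and
# renormalizing converges to the dominant eigenvector: a node scores high
# when its neighbors score high.
from math import sqrt

edges = [
    ("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
    ("Bob", "Carol"), ("Dave", "Eve"),
]
nodes = sorted({n for edge in edges for n in edge})
neighbors = {n: set() for n in nodes}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

score = {n: 1.0 for n in nodes}
for _ in range(100):
    new = {n: sum(score[m] for m in neighbors[n]) for n in nodes}
    norm = sqrt(sum(v * v for v in new.values()))
    score = {n: v / norm for n, v in new.items()}

# Most central first: a starting point for interviews, not a verdict.
ranked = sorted(nodes, key=score.get, reverse=True)
print(ranked)
```

The journalistic work starts after this ranking: interviewing, documenting, and checking whether the apparent centrality is an artifact of which connections happened to be collected.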