Hallo Du!
Bitte vergiss nicht deinen Fortschritt im Fortschrittsbalken auf der Seite des Talks einzutragen.
Vielen Dank für dein Engagement!
Hey you!
Please don't forget to mark your progress in the progress bar at the talk's website.
Thank you very much for your commitment!
======================================================================
So we welcome you want us back this talk attribution revolution, give a big hand of applause and say yes, thank you very much. Thank you all for braving this early morning and coming to attend this talk as well. I'm going to talk about the attribution revolution and why I think that we have a possibility here of turning copywrite upside down or inside out. If you want just a quick show of hands, just to give me an idea how many of you have heard me talk before? Not so many. Excellent. Then you're going to be surprised by this one. I'm going to show you an image and I would like to have a quick show of hands as well to see how many of you in this room can recognize where that image is from. If you can identify the author of it or perhaps identify the series, which is from Show Hands, where is this from? OK, fairly good. I had a talk last week in London at the Open Document Foundation's meeting. And where are you talking to people who are used to writing word processors? You can imagine that this joke went off quite well with them. OK, most of you recognize this. This is a CD is drawn by Randall Monroe. He Randall has a peculiar style of drawing, so it's quite easy to recognize it whenever you see it. He also has a sense of humor that attracts many of us. Now, let me show you another one, though, and I'll ask you the same question. Quick show of hands afterwards to see if you recognize where this image is from. OK, one oh, OK. To you have seen this before. OK, two people recognize this one. So the rest of you might be surprised when you learn that this image is, in fact, also by Randall Monroe. It is also part of the universe comic number seven in the series. Now, why I'm showing you this is because knowing that this image is part of a CD, it probably changes the way that we relate to it. It changes the way we feel about this. Before we knew that this was drawn by Randall Monroe, this was just an anonymous sketch that I could have taken from my own sketchbook or found
anywhere on the Internet. But once we learn that this is by Randall Monroe is part of the CD universe, that suddenly we have the context that makes this image valuable to us. It gives us some meaning. And I can almost guarantee you that if we tried to sell this one, if Randall would sell the original of this, you would obviously get a lot more money as well if that knowledge was conveyed to the potential buyer. So knowing where things come from. Knowing who created something, who's the author, when it was created, where it was created, now all of those things are quite relevant and we see them all around us. We see them in Wikipedia. Citation needed. We see them in science. Right. Obviously, everything that we've done and for as long as we can remember in terms of scientific advances builds upon what people have done before. And we were used to crediting those people by attributing them when we write our papers or journals in politics. Well, know you can claim that in politics is maybe not too common to attribute to some known source that people can actually check. But as a politician, you do this all the time anyway. You attribute your statement to somewhere you attribute to farce from somewhere in culture, in art. We have this. We attribute we build upon something from before in food. Well, OK, I admit this is a bit of a stretch, but if you take up food cotton, one of the first things I do is I usually flip it around. I look at the list of ingredients. Now that's, you know, that's attribution and it tells you what does this actually contain? Where does it come from? What made into making this product? And this is the provenance of a work. This is the history of something. This is where something has been before, where it was created, who created it, when it was created, for what purpose was created and then what has happened with it. Until we see today, if you walk into a gallery and you look at paintings on the walls. You're most likely going to be interested in
the information about those paintings as well. You're not just going to look at paintings, you're going to look at the provenance of those paintings. You're going to look at who actually painted them when they were painted, perhaps why they were painted. The title of it can give you some information, can use some knowledge. It gives the paintings some meaning now. Provenance. Is also connected to the aspect of reputation, reputation, obviously be, you know, something that we all have around us today as well. If you look at LinkedIn, Facebook, Twitter, everything is about building our reputation. Go on GitHub. You're talking about reputation. Your every contribution that you make contributes to your reputation, your standing in society. And that's facilitated by the attribution is facilitated by people knowing what you have actually created is knowing what you have contributed to cyto. Now, let me ask you one more thing so quick, so has this world see who is the avid reader in this room? If I say Woofy show hands, how many knows what a goofy is? OK, I'm going to send you to the library straight after this over here is a reputation based currency that was first envisioned in down and out in the Magic Kingdom by Cory Doctorow. Now, in this story, down and out in the Magic Kingdom, Cory Doctorow hypothesizes about a potential future in which the currency that we had today is replaced by reputation. What do you do? And what you create contributes to your woofy, which you can do in exchange for other things in turn. Now, when Cory doctor of wrote this, this all obviously seemed quite a lot like science fiction and it's written a science fiction is Cory Doctorow after all. I would argue, however, that it's actually not science fiction. We actually have a reputation based currency today as well, maybe not exactly in the larger scale that Cory Doctorow envisioned, but we do have it nonetheless. And this morning I was reminded of one example of it, and I took the liberty of
slotting that in here. How many noble advocates? Not so many as well. So I'm going to introduce that to you as well. OK, this idea, advocates, not orga advocates, is one of the earliest, very earliest attempts at creating essentially a social network. It pioneered the concept of blogging on the Internet, sharing your experience with other people, and it developed a system of trust. It took the web of trust and implemented it in its own system so you could certify other people according to their experience in this case, within the free software community. Now, it was founded in 1999. So that's quite a while ago. Right. And it's been quite frequently cited since because it really was one of the first to try to do anything similar to this. Now, as you can see, I was on Adubato already in 1999. Now, that gives you a hint about how old I am. But it also gives you an understanding of how long I've been working with this now. Is that important? Well, to some extent, you know, it's not really. Um, but if I look at other people that are on advocator did this morning because I was curious. You got Bruce parents joined in early 2000. Richard Stallman joined in mid 2000. Bradley Kuhn joined in early 2001. And then you got me joining in 1999. And of course, that's then part of the story. I was before everyone else. Right. Do I feel proud about that? Well, you know, I'm human, so of course, I feel proud about that. I was an advocate for all these are big shots. Doesn't mean anything. Not really. But it's part of the reputation mechanism. And I'll introduce to you another product as well, which came to my attention until fairly recently. So, again, see how many of you know about this project as P2P value, how many recognize P2P value. OK, one person, Chris, is not here. Is he here? OK. Peer to peer value is an European Union funded project, so it means that is huge research portion within this project. But what I find interesting when I look at this project and I look at what they
're promising to deliver or at least what their objectives are, I highlighted two things here that's coming up. They want to deploy a federated platform in which real world communities will interact, participate and collaboratively create content to the Commons based peer production school. And they want to develop a set of value metrics and reward mechanisms that incentivize the participation of citizens and so-called commerce based peer production. Now, to me, that's a reputation based economy. That's the budding stages of taking what we have on LinkedIn, on Twitter, everywhere else where we're talking about reputation and trying to put it in a larger context, trying to create some platform that can actually facilitate this, not only in the sort of reputation because I'm publishing something, but in terms of reputation, because I'm creating something. When I started thinking about attribution, I started thinking about the attribution revolution. I started talking to people and I started talking to photographers primarily because I saw them obviously carrying quite a bit about being attributed for the photographs that they take. And if you look around at newspapers, you see pretty much all the photographs are attributed to Getty Images, AFP or some other agency. And you might even have the name of the photographer there. Now, something that I realized when talking to people was that everyone seems to agree that attribution is important and was talking to some friends of mine who are photographers. They keep telling me that, you know, I know the direction in which the world is turning. I see the way the people are taking my photographs. They're sharing them online. They're publishing them on Twitter, on Facebook. And I'm OK with that because I know that I can't actually change the course of history and I can't change the way the people behave. But if we can make sure that whenever my photographs get published, I at least get attributed. That would solve a lot of con
cerns that people have, unfortunately, we're rather bad when it comes to actually giving credit where credit is due to actually attribute photographs when we do use them. Now, Creative Commons licensing as one example, they stipulate that whenever you reuse a work, you must attribute it in a manner reasonable for the medium of publishing it. Still, we see a large part of the commons which is not attributed when it gets shared. So two years ago, I started working on a project called Commons Machinery, which is an organization that aims to make attribution information, metadata about creative works, visible and actionable, now visible. Means that we should actually be able to see the meta data that are connected to the works that we're sharing. Unfortunately, that's not always true. And of course, a bunch of issues along the way in the early 2000s when at least the Swedish government and I'm sure other governments as well, started publishing like court proceedings and similar documents online, usually did it as word documents. And when it tried to hide someone's name, they would just take the marker in in Microsoft Word and strike something out of black. And obviously, people figured out that you can just open Disneyesque control said a few times and undo that. And then you got a name right. If you put it as PDF, it might look OK. But if you look underneath everything, you know, you still have the text and you still have a block of black above it and you just need to separate the two. And then you have the name, which means that today people are so afraid of publishing anything detailed that most of the time they print something physically market and then scan it again, which is obviously ridiculous because you lose a lot of information in doing that. But they do that for one particular reason because they don't know. It's not obvious to them what information is conveyed when they're publishing something, when they're sending their files around. They cannot trust that
what they see on the screen is the only thing that is available there, even if they take painstaking effort to remove all the names completely, you know, clear away the history of the document, it's very easy to just leave out the fact that you can just go to file properties and you maybe have some names in the title or the description of that document. So with machinery, we wanted to make the metadata visible so that people actually are aware of the meta data that gets passed around. They are aware of the information that their documents and files contain. Obviously, there's a privacy issue in that as well to make that visible. But then the other part is to make it actionable and by actionable, it means that we need to have a way to actually develop our software so that it can act upon that metadata to give us helpful advice and to give us helpful information about the works that we're using to allow, as an example, a word processor. When you're inserting an image to automatically tell you that this image is from this particular author. Would you like me to put an automatic attribution to that author in there? That would be helpful, but we can only do that if we have actionable metadata. So fortunately, we are funded by the Shuttleworth Foundation for a period of two years, which is now just coming to close. Because they were interested, as I was, to see what would happen if we practically start putting our ideas in practice, what would happen if we start implementing systems that supported retention of metadata for digital works? Where would that lead us and where would the problems occur along the way? And we've learned a lot since he started working on this. So for the remainder of this presentation, I'm going to take one small step back and I'm going to talk a little bit about a retrospective to talk about where we came from and what we did in the process. Up to now, I'm going to mention where we are now, what we need to do next. And then I'm going to come bac
k at the end to talk about what does that actually mean for copyright? Because you remember that as part of the title to any copyright upside down. And I hope that I will live up to that promise. This is an image from one of the first white papers we produced. Now, this shows you the different standards that are available to convey information about works. So these are all metadata standards, different levels. It can be difficult to read from the back. But we've got EXIF standard, which is a meta data representation, but it also is information about the work itself, like the author and the license that fits in there. IPC is a similar standard like EXIF, but it's created by the International Press and Telecommunications Council specifically for images we got. S&P was coming out of Adobe as well, doubling core odiously specifies the licenses problem, which is provenance standard and all these other standards. Is actually a bunch more. You'd be surprised what you find when you start looking at this. It seems that everyone has been thinking about this at any point in the past, has decided that whatever standards are available are not suitable for them. So we figured out quite quickly that there are simply too many standards. There's no way that we can make this work if we have, you know, even five percent of the works using excessive standard to describe themselves, five percent using AP, DC, five percent using five percent using something else, there's going to be a nightmare to actually try to implement that. And each of them don't really see enough use either. Even EXIF, which is probably the most use standard for images to convey information, it doesn't really have enough use. It has no tool support to a very large extent. Load something into a photo editor, change it round and save it. And that information is very often just lost because the tools don't actually support retention of metadata or passing it along. There's an embedded metadata manifesto that came out
or the international press and telecommunication councils. They did a study of social media platforms and it was fairly easy. I mean, you just took an image with EXIF and IPTV metadata embedded within that image. They uploaded it to social media platform and then downloaded it again and they saw what happened to the metadata. And lo and behold, in almost 80 percent of the cases, the metadata was just lost. Thika, 500 pics, Twitter, Facebook, probably one of the worst of them. They just ignored metadata. They took it away. Google was one of the better ones. They actually took some effort to retain at least EXIF and BTC information, but some of the other information was still lost as well. So. Retaining metadata by hoping that whatever you embed within the file will get retained. It's not going to happen. It's a panacea at best. So we're sort of thinking about what other ways are people using creative works. And we came up with a case of copy paste, which is the very simple procedure of someone finding an image online that they like clicking copy on it and then going to, you know, like a presentation editor and clicking paste. And we're sort of thinking about how can we make the metadata of that image information that we need in order to attribute accurately be carried over in that operation. And now I'm going to slide into slightly technicalities of how we did that, but I hope you'll follow along anyway. It's not too technical indeed. So the first thing we did was we simply split the clipboard in two, essentially on the clipboard by default. If you copy an image, you might place an image JPEG resource available for the recipient application. And it just looks and say, oh, here's an image, grab that one. What we did when someone clicked copy was that we put not only the JPEG image on the clipboard, but we also put an RDF fragment containing the meta data as a machine readable meter reader. And then it would up to would be up to to receive the application. Once I click
paste to say, you know, I can get either the image of self, I can get a meta data or I'll get both. So that was a first attempt. Later, we changed that completely again, we realize that a bunch of issues for this, I'll get back to them. And in our more recent. Prototypes that were built, we instead of putting the image on the clipboard, we actually put an HTML fragment on the clipboard, which has RDF metadata embedded within it. So you can see, for instance, the title here and the license together with the image and the source of it. Now, we implemented this hour variations of this and quite a few tools. And Gady Teekay, Jim Inkscape, Lee Ralph is a little higher ed media goblin and I'm quite proud of and I'm quite happy that we were able to bring this sort of copy paste scenario to a close to the point where we could find an image online. We could find it on Flickr with click copy of it, would get it into LibreOffice, would kick paste and will insert the image together with the attribution. And then if you copy again from there and get it into a Web based locator and could paste the attribution carries with it as well. Now the problem we had when implementing all of this is that in most cases, whenever you're talking about copy paste, that's an operation. It involves changing the core of the applications. It's not possible to do this in the general case with Yuson additions or an extension to a program. You need to actually change the core. Or alternatively, you need to implement your own copy and paste functions, but obviously very quickly gets messy. You also have the UI visibility issue. End of most applications don't really show the metadata. They don't care about it, so they hide it away. And as well, we realize that there are significant clipboard differences. What worked for us on an X based Linux system did not work on Windows and did not work on Mac OS X, so we were kind of stuck in that path. So that's why we went back and we did the e-mail approach as w
ell, because that works on all platforms. So we did a copy paste, we even got to the point where we got LibreOffice Press Center to accept images that repace it into it, and you could paste as many as you want and you can move them around. You could remove them or add new images. And then at the end of the presentation, you would ask you to insert new slide and then insert credits. And it will give you a list of all the images that you used in your presentation. All of that code is up on our GitHub, so please feel free to check that out if you want. Now, doing this, however, is a very massive effort because it involves changing every single application that we use. And that's quite a few and it becomes very application specific whenever you want to do something for LibreOffice, it was different than doing it from a lower ed, even if you can abstract parts of it away and make use of some common libraries, it was still quite a substantial effort to actually get this working at all. So we're sort of thinking what would be the Unix way of doing this? So if the problem is to retain and manage metadata, why don't we solve that particular problem? Let us not solve the issue of making this work in an application, but let's solve the simple problem of retaining and managing metadata. So we started working on what became known as a logio. Elijo is a distributed catalog of creative works that's a glorified term, what we're really talking about, honestly, is a meta data database. It's a database that is specifically crafted to hold information about creative works. And it can look like this. You'll get the. The identifier at work, which in this case points to our catalog and the identifier of it, and you will get a Jassam structure in this case back, which gives you a locator, for instance, saying that this image is while in this case Alexanderplatz in Berlin. It has a block cache, which I'll show in a little bit, and it has a particular license. In this case, it's not a licens
e. In itself, it's just a public domain. Mark now Eulogio uses W three C media annotations as its way of recording information about works, which, you know, it's a fair enough metadata standard that most other metadata standards can be mapped into, like EXIF and Ibtissam and provides an API. So for any image that is part of this catalog, you can easily look it up using the jewel of that image. Or if it's an image indeed, then the block hash of that image. And I'll explain Bloxwich in a while. The way it works is that you have a work record which explains the image itself, gives you the author, gives it license, and then you have multiple media records, because we realized quite quickly as well that if someone takes an image posted to their own website, it will most likely get a different URL. So it will be the same work but will have a different media. It might have a different resolution, might have been changed in some way. So you can have multiple mediums connected to each creative work. And we've seen in the database with 22 million images from Wikimedia Commons. So essentially for any image that as part of the Wikimedia Commons, you can look that up in our database and we'll get you back to metadata. We also developed to browser plugins, one for Chrome, one for Firefox that can interact with this API and you're asking yourself, so what does it really do? Well, this is one of the things it does. If you're out browsing the Web, you got a logic plug in installing Chrome or Firefox. If you see an image or define interesting, you can open the allowed your sidebar and you identify the emerging query. And if that image is part in this case of a logo, meaning by extension, that is part of the comments at a moment, then it will get you the information about that image. It will show you the title of it, who authored it and give you the appropriate license for it. If that is not an. It will even Greenmarket licenses that are free cultural licenses because we love them. No
w, it also offers you the opportunity to copy this image as an e-mail fragment, and you can take that image and paste it into LibreOffice as an example, and it will copy over not only the image, but also the attribution and at work straight off without anything except a browser plugin, which is nice. Now, what's the catch of this? Well, there's obviously a catch to this, which is that. Identifying an image that has been resized as an example. How we can do that depends heavily on the algorithm that we used to do that matching, and for Lajo, we wanted to have an algorithm that was very lightweight, that didn't take a lot of resources that. Could be calculated quickly within the browser, and that would generate some kind of value for an image that would not change even if you resize the image and ideally it should generate as few as possible false positives or false negatives. So the way that our algorithm works now run you through the algorithm as quickly. So did you see it before I talk about where it does not work? So this Alexanderplatz, this Alexandrov person, seventeen hundreds in Berlin and you'll see that I've taken this image and we've split it into 16 by 16 cells. So it's a matrix here, 16 by 16 happens to be 256. So that's a number of bits that are hashes actually generate what we do with this image after we segmented it into this way, into this matrix, is that for each cell we calculate the sum of all the pixels within the cell, and we do that for all of the cells. So we'll get something looking like this, a bunch of numbers across the board. We calculate the median of all those numbers and then we go through each cell in turn and we see is the value within that cell above or below the median. And then we assign either a zero or a one to that one. So then we get to a hash looking like that, and then we just wrap that up and pack it as a hexadecimal number. And that's our hash. OK. So it's very simple. It's very efficient. Takes almost no time to compute. A
nd you end up with hashes looking like this. So the first one is a hash that I made of Alexanderplatz in a six hundred forty eight times three hundred twenty six pixel resolution. And the second one is the same image but rescale to two hundred by one hundred and two pixels. So about one third the size. And you'll see that they do indeed look similar. They're not identical because obviously some things might change when you rescaling the JPEG. They're not identical, but they don't differ that much either. They differ. If you expand this into the bit feel that we have, they differ in six positions. So six bits are the difference between this larger size and a smaller size. So when we apply the block hash algorithm and we come from experience to say that if something is six bits or lower or ten bits or lower, then we can be fairly confident that we're talking about the same image, even if it has been resized. Unfortunately, however, reality comes and bites you in the ass. So this is my son, this in Greece a few days ago, and it represents something that people love doing. And they take pictures in this case, our kids, they take pictures of Skyline's. And all those pictures have a common denominator that they have a very bright upper half, usually white or blue sky, and then they have a very dark contrasting lower part. And what happens if you have a very bright part on top and a very dark part of the bottom? Well, what happens when you do the numbers? You end up with an upper half being with very low numbers and the lower half being with very high numbers. And if you take the median, it will be somewhere in the middle of this. But when you then check if something is higher or lower than the median, you end up with the hash that essentially a bunch of zeros followed by a bunch of ones because the contrast is so great between upper half and the lower half that all the differences within those regions are simply lost. Their overpowered by this. So this was the original bl
ock hash algorithm, the way it worked when it was implemented in straight off from the research literature. We changed this algorithm and we changed it in a very easy way. We simply split it up, split this field up into four distinct horizontal blocks. And we do the median calculation not for the entire image, but for each block itself, which means that even if the first block is only blue sky, even a blue sky will have slight variations in it. And if we calculate the median on that and then due to the calculation, then we'll get a lot more contrast or got a lot more details of it. So that's the way the algorithm works right now and gives us hashes like that. So it gives us much more detail for essentially the same images. Now we're still. Getting collisions, that's unavoidable, we're getting collisions in about one percent of cases, we got about 100000 images from the Internet. We ran our our algorithm on them and we compared them to each other in a crosswise manner. So there's one percent collisions. Collision here means that two images or even more images generate the same hash, identical hash. Now, however, in 84 percent of those collisions, we are talking about two to three images generating the same. So we figured that this is fairly OK. This is also one hundred thousand random images, which means that ClipArt maps and various other things, which maybe differ in very small details. They're also counted as a collision here. We also get a number of false positives. However, these are. Images that are recognized as being similar without actually being similar, because the algorithm generates close matches for them, if we set the maximum distance and allow up to 10 bits variation between two images to classify them as the same. Then we get about one point eight percent false positives. We can get that down quite substantially by us lowering the distance that we allow. If you say three or five, we're down to less than zero point two percent. So somewhere there, we
feel that we're doing quite well. Now, what about derivative works? What about clip art? Well, in one word: forget it. A derivative work means, as an example, that you take an image and you add a border to it, or you take an image and you crop it in some way. Now, if you can imagine the algorithm, you'll quickly recognize that if you crop an image, it will generate a very different hash. So we set the bar, and we set the limit for ourselves, by saying that we will do our best to match verbatim copies of a work. You can resize it as much as you want, you can change the format from PNG to JPEG and back again if you want, and we'll do our best to match that. But if you make a derivative work, if you add a border or you change an image around in some way, then all bets are off; we're not going to guarantee you a match on that. The same goes for clip art, or any other diagrams or graphs where you have large areas of white or black or some other color and just a few lines. The algorithm generally does a rather bad job on those as well, because, again, you have these high-contrast areas. But in exchange we get something that's blindingly fast, with very small hashes and few false positives. And this is all up on blockhash.io, if you feel like implementing it yourself and having a look at what it does in practice.

Now, unfortunately for our case, 22 million images is not nearly as much as we need to actually make this useful. 22 million images may sound like a lot, but it's a very small fraction of what we actually need. Creative Commons, about a month ago, released their "State of the Commons" report for the year, saying that there are about 800 million works out there which are openly licensed. That's the size of the Commons at the moment. Now, not all of those are images, but a fair portion of them are. I'm estimating that there are probably about half a billion images out there that are openly licensed, and that should be part of Elog.io, but it isn't there today.

Now, scaling up to half a billion images is doable in terms of databases; it doesn't add as much as we would fear, so we can easily do that. However, we're talking about searching by a perceptual hash. We're talking about searching for a hash value, an image, where we allow up to 10 bits of difference. Now, if we said that we're not going to allow any difference, that would be a very easy search: any kind of database can search for unique values, that's not a problem. But if you're searching for something that is similar to something else, that becomes a very different problem. So we found, again, some research to help us along our way. This algorithm, surprisingly perhaps, does not come from Google. It's called HmSearch, and it partitions the hashes in a way that avoids doing a search through all 22 million. For any hash you throw at this algorithm, it will give you back maybe a few thousand possible matches, and then you need to sift through those to figure out which are real matches and which are not. Again, this is also available on GitHub, as hmsearch.

Now, where are we going from here? Well, the first thing we want to do, beyond scaling to half a billion works, which we should do, is to flip the read/write bit. Because at the moment, while Elog.io and its API have provisions that would make it possible for someone to edit the information that's all within Elog.io, we haven't actually enabled that yet. So far we have just taken information from Wikimedia Commons and put that into the database as a sort of read-only repository, and we rely on people updating information on Wikimedia Commons, and then we get that information into Elog.io. But flipping that bit and making it read/write, that's what's going to change things. We also need to extend Elog.io to support non-images, any other kind of creative works. That scales quite massively, beyond even half a billion. And we want to implement support for the API directly in applications. So, again, going back to the application side and figuring out: now that we have solved the retention and editing of metadata separately, how can we then make the link to the application?

Now, how the heck does this relate to copyright, as I promised at the beginning? Well, it's easy to think of Elog.io as a copyright registry, and I promise you that it is not. A copyright registry is something that I personally detest. A copyright registry is an attempt by someone to provide an authoritative database, authoritative information, about who owns certain creative works. Elog.io is not an authoritative source of information, and Elog.io is not a copyright registry. Elog.io is built as a community-curated repository, in this case curated by the Wikipedia community to start with, with an implicit agreement of respect. This is something that we learned from Wikimedia as well: there's a reason that people keep contributing information to Wikipedia. There's a reason why people take painstaking efforts to actually keep the metadata on Wikimedia Commons up to date and reliable. And that's because there's an implicit agreement that we actually want to respect the author. We want to respect the author enough to give accurate credit where credit is due. We don't want to lose that.

Now, Elog.io in this way takes a slight side step away from an initiative like Creative Commons. Obviously, Creative Commons is the licenses themselves, and Creative Commons was an attempt to work within the existing copyright regime, to show that, given the situation at the time in 2001, where we have copyright and you want to share, we can work within this system to give you the legal tools that enable you to do that. I believe that we are coming to the end of copyright as we know it right now. A quite recent phrase within the software community has been POSS, post open source software, where the guiding light is essentially the phrase "fuck the license, just put it on GitHub". And I think we're seeing the same in
the creative world as well. Copyright is losing its importance day by day. And we're coming to a point in time where, you know, within five or ten years, I'm very sure that the European Parliament and other parliaments around the world will take steps to make additional exceptions to copyright as it is today, to allow even more private use, as an example, without hindering people in their day-to-day activities. So copyright is changing, and Elog.io is one of the tools that we need along the way, because Elog.io is post-copyright licensing: it doesn't really care about the license itself. It obviously implements support for the W3C Media Annotations standard, and it gives you the tools that you need if you want to record information about a license. But the license is not terribly important. The important part here is who actually created the work, what the provenance of a particular work is, what the details about that work are. Meaning that, from our side, we take very great care to respect the author, but not the institution of copyright. Because, just as the photographers that I was talking about before were saying: as long as we make sure to attribute accurately, we're good. So we can take control of the provenance of creative works by using tools like Elog.io. It doesn't need to be Elog.io; it could be other tools like it. And we can show the world that we care about authors. We just don't care about copyright. But we care enough about authors to take, as a collective effort, control over that information: to control the provenance, to keep a record of where creative works come from and what happens to them, and to make sure that we attribute the authors. It's my firm belief that if we respect authors, if we attribute the authors, if we record their contributions, and if we're honest about all this, it will make it easier to contribute to the Commons. It'll be much easier for someone to say: here is my image, I'm going to upload it here, do what you want with it, just make sure that I get credit for
it. If we respect, attribute, and record information about images, that will help raise the value and meaning of digital works. Just as I showed you in the beginning, just as knowing that the sketch I showed was by Randall Munroe changed the value and the meaning of that work for you, it will change the value and meaning of other works as well. And if we do this as a community, then copyright holders will eventually be deprived of their currently exclusive right to dictate. Because that's what they're doing with copyright registries: they're telling us that they are the owners of the culture that we have around us. With tools like Elog.io, we're coming together as a community and saying: we know who authored this, and we will take care to recognize that. You don't need to tell us; we'll keep track of that ourselves. Thank you very much, and thank you for listening.

Thank you, Jonas. So we have 15 more minutes for questions. Do we have any? So, microphone four, and then we have one online as well.

Hello, thank you very much for your talk. I'm very interested in this functionality, but I don't see how this functionality only in desktop applications is useful. I cannot just use a LibreOffice plug-in; I need something on WordPress.com, I need it in Facebook. Have you talked to these platforms?

Yeah, OK, you're right. And that's what I hinted at in the beginning as well: in order to make this truly useful, we really need support for this in the applications that people use day to day. And that was one of the reasons why we decided to change our approach to passing information on the clipboard, by passing it along as an HTML fragment. We've actually shown that this works in LibreOffice, it works in WordPress, it works in Microsoft Office as well. It works on a whole range of tools by default, because most of the tools today can handle HTML. Now, that's not the whole story, though, because in order
to actually make use of the metadata that gets passed along, and to do something intelligible with it, then again you do need application support. So we've started to have those discussions. We've started talking with the library community, as one example, and they are catching up. But unfortunately the awareness of metadata standards, the awareness of what could be possible, is a little bit lacking today. So it will take quite a long time until we actually make something sensible out of it.

OK, so the next question from the online world.

Thank you. There's one question: are there any plans to make images trackable after they have been cropped? And have you looked into how YouTube does it? Because they seem to be very good at it.

Yes, and Google as well. So we looked at a number of different ways of doing the calculation so that we could potentially detect images which have been cropped or changed in other ways as well. Unfortunately, from our perspective, most of the algorithms that are available are either kept secret or they're patented, which means that implementing them in free and open source software is a no-go zone. It will get better; there is research underway to make this possible, and we're continuously looking at changing the algorithm, updating it according to what we learn. But we're quite far from having something that would detect a derivative work as well.

OK, thank you. Microphone number two.

Hello, thank you for your work on this and on the workflow side of things. I have a question: you talked about a distributed database and a community-curated direction, but right now the focus seems to be on very specific projects and very specific things. So what could be scalable things to work on, to get more sources involved, or to create those kinds of things? And would you be open to other contributors, like, I don't know, libraries, archives, European projects, whatever you can think of? So what would be the long game in kind of distributing and community-curating this?

So, there are two communities, or two groups of repositories, that we're talking to, to get their information within Elog.io as well. One is Europeana, which obviously captures a lot of the galleries, libraries, archives and museums around Europe; that would be one. Then there's Safe Creative, which is, in fact, you know, in part a copyright registry, to get their information within Elog.io as well. But still, at that point we're only talking about specific collections; we're talking about read-only information. So the logical next step is indeed to flip the read/write bit, to make it editable. But that needs thinking through, because we're honestly not quite sure what that would look like. Because how do you deal with potential conflicts when people keep editing the same information? So we'll need to go again and see what Wikipedia is doing here, see what policies they have in place and how that works, and see if we can replicate that on our side.

And in terms of scaling beyond this, in terms of distributing this: we made very sure from the beginning that the identifier that we have for individual works within the Elog.io catalog is a URL, which means that anyone can essentially set up a catalog and have their own URL scheme for that catalog, as long as they don't change the API. If you have the URL, it doesn't matter in which catalog you actually look up the information; you'll get it anyway.

Microphone number four.

Yeah, there's one comment on the URLs: we probably want something which can survive for 100 or 200 years, and whether that's going to be solved by using URLs as we have them today might create some problems. I was also interested in whether all this can be applied not just to pictures, but to books, music, whatever. And then there was one technical comment: the hash functions should be one-pass, because otherwise you will go to the RAM twice. So that was the technical part of the comment.

OK, I'll see if I can
remember those three things. OK, so on the hashing bit: yes, we worked quite a lot on the specific algorithm during this. And now that we feel we've settled on the way it works best in our environment, we've documented this as well in an RFC draft, which was submitted to register a namespace for it. So it has a very specific definition, which makes sure that if you want to call something a blockhash, then you need to follow this particular specification.

OK, I'm sorry, could you go back to the first question, the one about the first point? Yes, the URLs. Thank you. Um, yes. I skipped over a very important piece when I said that everything would be solved by URLs. Because, as I said in the talk as well, from the beginning we know that any kind of metadata gets stripped very easily from a work. And even if we say that all we need is an identifier which is a URL, that's going to be stripped as well. So that's not the final solution to anything. We need to work on different approaches to identifying works. The only thing I was saying there is that at least with a URL, we can make sure that this could potentially be distributed across different catalogs, and not just be one single monolith.

OK, and the second point was... oh, I think that was probably the survivability, if I remember correctly. Ah, yes, whether this works for other kinds of works. Yes, thank you: applying this to non-images, right. OK. One of the reasons why there are so many metadata standards is that there are so many different kinds of works. What is relevant for images in terms of metadata is not relevant for classical music, as an example. What is relevant for classical music, in terms of authorship and who plays what instruments and what instruments are involved, is not relevant for pop music. So that's one of the reasons why all these standards have come up already. And we believe that using Media Annotations allows us to cover different sorts of works, but it really
needs to be thought through a bit more before we start working with it actively, to figure out what information is actually important to convey about different kinds of works, and what the metadata standards would look like for those kinds of works. So that's a larger piece of work.

So, there's one question from the Internet?

Yes, thank you. Why don't you work with data from other sources like Flickr, which also provides good metadata and contains a lot of free works and info about the author?

Yeah, we are. In fact, we don't have it in place yet, but we have spoken to Flickr; we have a communication going to figure out how we can get that information and how we can integrate it in our system, depending on how things go. I'm quite confident, though, that we'll be able to integrate that and to make it available through the same API. Unfortunately, or fortunately for us, Flickr is a huge resource: more than 300 million images, which means that even if we took, you know, a year to do that, we're still talking about incorporating about a million works per day. At the moment we can add about six million works per day to our database, which is a fairly large number, and we can probably scale a little bit beyond that. But still, we're talking about a number of months of work when we actually start working on that.

Is there one more question, an additional one from the Internet? If you have questions, I'll be available up here for a little bit longer after this talk. We also have some information about Commons Machinery and the work that we've done, which will be available down here, and I will have it up here as well. So feel free to grab it. OK, thank you, Jonas. Thank you very much.
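The verbatim matching described in the talk (robust to resizing and re-encoding, broken by cropping) can be sketched roughly like this. This is a simplified illustration of the block-mean idea behind Blockhash, not the actual blockhash.io algorithm: the image is cut into a grid of blocks, and each block contributes one bit, set when the block's mean brightness is above the median of all block means. Resizing barely changes block means, so the hash survives; cropping shifts every block, so it does not.

```python
def block_means(pixels, bits=4):
    """Mean brightness of each cell in a bits x bits grid.

    pixels is a 2D list of grayscale values (0-255); for simplicity the
    sketch assumes the dimensions are divisible by bits.
    """
    h, w = len(pixels), len(pixels[0])
    bh, bw = h // bits, w // bits
    means = []
    for by in range(bits):
        for bx in range(bits):
            block = [pixels[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            means.append(sum(block) / len(block))
    return means

def blockhash(pixels, bits=4):
    """One bit per block: set when the block mean exceeds the median."""
    means = block_means(pixels, bits)
    median = sorted(means)[len(means) // 2]
    return "".join("1" if m > median else "0" for m in means)

def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

# A 16x16 image with a left-to-right brightness gradient...
img1 = [[16 * x for x in range(16)] for _ in range(16)]
# ...and the "same" image scaled up to 32x32 (nearest-neighbour resize).
img2 = [[16 * (x // 2) for x in range(32)] for _ in range(32)]

h1 = blockhash(img1)
h2 = blockhash(img2)
print(h1)               # '0001000100010001'
print(hamming(h1, h2))  # 0: the hash survives the resize
```

A crop, by contrast, moves content between blocks, so most bits change, which is exactly the "all bets are off" limitation the talk describes.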
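The similarity search described in the talk, finding stored hashes within a few bits of a query without scanning the whole database, rests on a pigeonhole trick. The sketch below shows the generic partitioning idea, a simplification of what HmSearch does rather than the real algorithm: if two hashes differ in at most d bits and each is cut into d+1 pieces, the d differing bits cannot touch every piece, so at least one piece matches exactly. Exact piece matches are cheap dictionary lookups, and only the small candidate set is then verified bit by bit.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

class HammingIndex:
    """Find stored hashes within max_dist bits of a query hash."""

    def __init__(self, max_dist=2):
        self.max_dist = max_dist
        self.parts = max_dist + 1          # d+1 pieces guarantee one exact match
        self.buckets = defaultdict(set)    # (piece index, piece value) -> hashes

    def _pieces(self, h):
        """Cut h into self.parts nearly equal substrings."""
        base, extra = divmod(len(h), self.parts)
        start = 0
        for i in range(self.parts):
            end = start + base + (1 if i < extra else 0)
            yield i, h[start:end]
            start = end

    def add(self, h):
        for key in self._pieces(h):
            self.buckets[key].add(h)

    def query(self, q):
        # Gather candidates sharing at least one exact piece, then verify.
        candidates = set()
        for key in self._pieces(q):
            candidates |= self.buckets[key]
        return [c for c in candidates if hamming(c, q) <= self.max_dist]

idx = HammingIndex(max_dist=2)
idx.add("0001000100010001")
near = "1001000100010000"   # 2 bits flipped: still found
far = "1111111111111111"    # 12 bits away: filtered out
print(idx.query(near))      # ['0001000100010001']
print(idx.query(far))       # []
```

The "few thousand possible matches" mentioned in the talk correspond to the candidate set here: the bucket lookups are fast, and only those candidates pay the cost of a full bit-by-bit comparison.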