Commons:Requests for comment/Technical needs survey/File verification

From Wikimedia Commons, the free media repository

File verification

Description of the Problem

  • Problem description:

Source websites from which content is uploaded to Commons may cease to exist over time. Once that happens, files that originate from them could easily (especially when certain conditions are met) be mistaken for copyright violations. Also, even when the source website still exists and has the uploaded file available, mistakes can lead to a file being wrongly deleted (just have a look here). Another problem is vandalism: if the file page was vandalized, the file's source could be missing or have been changed (yes, file history should be reviewed before deletion, but work overload could lead to it not being reviewed with due care).

  • Proposal type: bugfix / feature request / process request

feature request

  • Proposed solution:

Implement a mechanism to verify uploaded files. When a file uploaded to Commons is patrolled (by a user who has the privileges for it), it could also be publicly marked as verified (this could also be done for existing files over time). This proposal is similar to what is already being done for images from sites such as Flickr, but extended to all files from external sources. A verified mark would be more than a simple verification or attribution template (for example, verification couldn't be removed by a vandal, only by an administrator if needed). Of course, we can never be 100% sure, but once a file is verified, an exhaustive investigation would be required before considering it a copyright violation, so the risk of mistaken removal is greatly reduced. Also, users could use verified files with greater confidence.

  • Phabricator ticket:
  • Further remarks:

If this is not feasible, an intermediate solution could be to prevent unprivileged users from removing attribution templates (but this would only help for files to which an attribution template applies).

Discussion

Does this amount to placing a request for license review on every upload that comes from a third-party site? That seems excessive. Consider especially material old enough to be out of copyright on that basis, or a PD-ineligible logo. Similarly, a U.S. government doc with internal markings that show it to be that; I'm sure there are many other cases. You'd be taking "patroller" (presumably actually image-reviewer) time to verify something that has nothing to do with the source site. - Jmabel ! talk 19:33, 17 December 2023 (UTC)
If the patroller/image reviewer has indeed verified that the image (or other media) has been published under a free license, I think it would be a very good thing if he/she could mark the file as verified, and this could be visible to anyone. This would even save work for the future: the file is not a copyright violation, so if somebody tags it as such, the deletion request can be quickly dismissed unless some groundbreaking new evidence has been found (this would happen very rarely, if things are done well). Many files are in fact verified (any reviewed media from third parties that is not found to be a Copyvio has been verified, but we can't know which files have been reviewed). As an uploader of many files from Spain's National Geographic Institute, I can say that most of these files include the text "© Instituto Geográfico Nacional. All rights reserved. Total or partial reproduction banned", because they were published before IGN released them under the CC-BY 4.0 license. I'm sure those maps (or at least, most of them) were reviewed and everything was found to be OK. But if, in the future, the URL from which they were downloaded ceases to exist, someone could tag the file for deletion as Copyvio. The administrator who reviews the deletion request would then see that there's an "All rights reserved" text on the image, that it's only a few years old, and that no evidence of it being CC-BY licensed can be found on the source website, because it doesn't exist anymore. I think that allowing a file to be marked as "Verified" would solve this. On the other hand, as I also said, not allowing unprivileged users to remove attribution templates from files would be another way to prevent that kind of thing from happening. MGeog2022 (talk) 19:53, 17 December 2023 (UTC)
That could be a problem. For example, the assumption that files on the Nobel Prize site were taken in Sweden, and thus came under Swedish law was later proven false for a number of files. It's one of those things where carefully justifying the copyright status does need to happen at time of upload, but a simple tag that "I checked" is unlikely to be enough. Adam Cuerden (talk) 06:10, 3 February 2024 (UTC)
@Adam Cuerden, in the case I mention, the source site makes very clear that those files are licensed under CC-BY. But since they are maps that originally weren't freely licensed, they include a copyright tag with "All rights reserved" text. Sincerely, I don't know about any IGN map ever deleted or even tagged for deletion in Commons, because if you check the source site, it's clear they are free. Also because, if you check it, there are thousands of those maps in Commons. But the potential danger is still there. Perhaps the case you're talking about is much more complex, and it isn't so easy to be 100% sure about licensing, but in the case I mention, it is, and it would be helpful to have an option to mark them as verified (of course, never by any random user, but by a user with some privileges). MGeog2022 (talk) 15:02, 3 February 2024 (UTC)[reply]
Understood. Just, y'know, it'd need more details than a simple "I checked this". Commons:Deletion_requests/File:LWA_Picture_Final.jpg, for example, was checked by admins, and still came to the wrong answer. Adam Cuerden (talk) 19:15, 3 February 2024 (UTC)
Basically, the idea's good, but it's underdeveloped. Adam Cuerden (talk) 19:15, 3 February 2024 (UTC)
@Adam Cuerden, yes, I understand, there are really complex cases where a verification by an admin isn't definitive at all. Although I think those files, after a comprehensive research, are the ones to benefit more from a hypothetical future file verification functionality. For cases such as Spanish IGN's maps, simply a kind of "official safe source list" could fulfill this purpose. Cases in which there are many doubts would probably be the ones that have the most to gain from file verification, but the necessary investigation must be properly carried out before. MGeog2022 (talk) 20:27, 3 February 2024 (UTC)[reply]
Agreed. I think it's very much worth doing, and the Wayback Machine may help somewhat with it, but there are issues. We lost a huge archive leak because no-one bothered to get an employee of an NHS research department that was shutting down to do OTRS until it suddenly became very important about two years later and everything got deleted. But I think it needs a little more discussion as to how to get it right, and not just a "someone said it was at the time" without sufficient details to be meaningful. Because I don't want another MRC National Institute for Medical Research situation, where everyone at the time thinks it's sufficiently documented. Adam Cuerden (talk) 07:39, 4 February 2024 (UTC)[reply]
You have given a problem description, but is this an actual problem? Sure, there are lots of things that can happen and do happen in a small number of cases, but is it worth complicating everything else for such cases? The Flickr case is being done because it is so easy to change licenses on material (in bulk). It is not because Flickr can disappear. Additionally, we have our upload dates/times and page history to deal with any age questions. And as far as I know, we have never had legal problems because of any of this. I think this is a LOT of overhead we are adding, for very little return. --TheDJ (talkcontribs) 12:14, 18 December 2023 (UTC)
What about having a list of safe sources, where only administrators can add sites, after verifying them? Or, as I said, disallowing unprivileged users from removing attribution templates from files uploaded by other users? Certainly I don't know of this ever happening, but I think it's sad to risk losing valuable material due to potential confusion. I think it's especially risky when the media includes a copyright tag from a relatively recent date, with "All rights reserved" text, such as the case I mentioned. I think my proposal is no complication for patrollers: if I understood this page correctly, most uploaded files are patrolled in search of possible copyright violations. If the file is found to really be under a free license, it would only take a click or two for the patroller to mark it as verified (much less work than requesting its deletion, if it was a Copyvio). Older files could be verified on demand. On the other hand, perhaps I'm being a little paranoid here, and everything that can be found about the source (even in the Wayback Machine, if the site no longer exists), file history, etc., is always carefully checked before file deletion. But even if this is the case, my proposal would greatly reduce research work for administrators, if we have files verified in advance. MGeog2022 (talk) 13:35, 18 December 2023 (UTC)
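A minimal sketch of how the "list of safe sources" idea could be checked in software. Everything here is an assumption made for illustration: the name `SAFE_SOURCE_HOSTS`, the function `is_safe_source`, and the choice to match by host name are hypothetical, and the single entry is the IGN download host used as an example elsewhere in this discussion. No such list is known to exist on Commons.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; under the proposal only administrators would edit it.
SAFE_SOURCE_HOSTS = {"centrodedescargas.cnig.es"}


def is_safe_source(source_url, safe_hosts=SAFE_SOURCE_HOSTS):
    """Return True if the URL's host is on the allowlist (exact host or a subdomain)."""
    host = (urlsplit(source_url).hostname or "").lower()
    return any(host == safe or host.endswith("." + safe) for safe in safe_hosts)
```

Matching on the parsed host name rather than on a substring of the URL avoids accepting look-alike addresses that merely contain a trusted name somewhere in the path or in a longer domain.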
The patrol user right is a specific user right that enables a user to mark edits, file uploads and page creations as patrolled
What I propose is only a publicly visible way to "mark as patrolled", but only at the file level (it could even be fully automatic, when a patroller marks a file that is not an own work, as patrolled). Once a file was marked this way, it should never be considered a Copyvio, unless very clear evidence is found that it was wrongly verified. MGeog2022 (talk) 14:55, 18 December 2023 (UTC)[reply]
It might be more useful to codify what the "certain conditions" are, in guidelines or policy. I tag a lot of files as copyright violations, and a very common scenario is that an image is clearly an old stock image, since it's being used on dozens of websites and those usages predate whatever date the uploader claimed (i.e. the uploader says it was their work from 12/18/2023, but it's showing up on the web as early as 2013), but the stock site no longer exists, so it may well have been under a free license. I would suspect that sites disappearing causes more false negatives than false positives (another common scenario is that an uploader has several files, a few of which appear online before the date they claimed and a few of which do not, and while the other images are probably copyvios there's no proof). But I don't have any proof of that, and by nature it's probably impossible to get. At any rate I think the most likely result of this would be creating another backlog of hundreds of thousands of files. Gnomingstuff (talk) 23:11, 18 December 2023 (UTC)
@Gnomingstuff, the "certain conditions" I was referring to were, as I mentioned later, for example, CC-BY licensed maps that include a "© Instituto Geográfico Nacional. All rights reserved. Total or partial reproduction banned" text, because their initial publication date predated their release under a free license. I hope that before deleting a file, its history is carefully checked, that it's checked whether there are other files from the same source in Commons, etc. But anyway, the risk still exists, and administrators may have to do a lot of research work in case of such a deletion nomination.
Also, talking about stock image sites, as you mentioned, it has happened that people have uploaded public domain photos from Commons to stock photo sites, without mentioning they were public domain (this is absolutely legal: public domain imposes no obligation), and they were later deleted from Commons as Copyvio, because they were present on a stock photo site (I read about this some time ago, sorry but I can't find the source now). If they had been verified as public domain at the moment they were included in Commons, this wouldn't have happened (their presence on a stock photo site should raise the alarm, but then who was right should be carefully investigated, not automatically assuming that the uploader to the stock photo site was right, without even consulting the site's owners). MGeog2022 (talk) 13:08, 19 December 2023 (UTC)
"creating another backlog of hundreds of thousands of files": that is the last of my intentions. Having many unverified files is no problem, since all files are unverified now. The idea is to have as many verified files as possible (it would be easier for new files: they could be patrolled and verified at the same time), prioritizing those that may potentially have more problems, and those that are deemed most important, or whose uploader (or another user) requests them to be verified. MGeog2022 (talk) 13:13, 19 December 2023 (UTC)
To clarify my example, I'm not saying that IGN will cease to exist tomorrow: I know this won't happen. But please think about the following scenarios:
• It's deemed that https://centrodedescargas.cnig.es/ URL is too long: it's changed to www.cnig.es, so the original URL exists no more.
• IGN decides that there's no need for a separate institution (CNIG) for cartography distribution. CNIG integrates into IGN, so https://centrodedescargas.cnig.es/ ceases to exist as well.
• (I hope this never happens) The government considers the production of new maps too expensive, so it charges a tax on commercial use of new cartography, and new maps aren't CC-BY licensed. Only an obscure notice on the website says: "Maps published before 20XX are CC-BY licensed", while "All rights reserved" text is clearly visible.
• EU countries join their national mapping agencies into a unified European one. IGN ceases to exist as such.
In any of these cases, if an administrator who is not familiar with IGN sees a map that includes "© 2011 IGN. All rights reserved", they could possibly delete the file (it perhaps could happen even now, though I hope due care is always taken). IGN maps are in Commons thanks to talks between Wikimedia Spain and IGN (see here; in Spanish), and thousands of maps have been uploaded since by many users. I think we should avoid risking the loss of any of them.
Apart from file verification for third-party works, I think that a notice such as "This image wasn't found in Google Images as of 22 December 2023" would be a good thing for user-created works, to avoid such things as photos being "stolen" by uploaders to stock photo sites from happening (with a user's own work, we never can be 100% sure, but this would indicate that a more detailed and calm investigation is needed before deletion). MGeog2022 (talk) 13:57, 22 December 2023 (UTC)[reply]
Another case where we should finally start using software. Files should also be scanned with a bot that does a TinEye and/or Google reverse image search to check if a file is likely a copyvio, especially for new uploaders. The bots should populate some categories which people can then quickly review via some tool. Here you can see an example (new study): Category:Wikipedia citations improvement AI-based system SIDE. In this case the tool should check the source link to see a) whether it is still online (if not, add an archived version link) and b) whether it supports the file's claimed license and source (if not, flag it as needing semi-manual review). --Prototyperspective (talk) 22:47, 28 December 2023 (UTC)
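A rough sketch of step a) from the comment above: check whether a source link still answers, and if not, look for an archived copy. It assumes the public Wayback Machine availability endpoint (`https://archive.org/wayback/available?url=...`) and its JSON shape (`archived_snapshots` / `closest`); the function names and the three result labels are hypothetical. Step b), checking that the page supports the claimed license, is not attempted here.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"


def is_online(url, timeout=10):
    """True if the URL answers without an HTTP error (some servers reject HEAD)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def closest_snapshot(api_response):
    """Extract the snapshot URL from an availability response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None


def fetch_snapshot(url, timeout=10):
    """Ask the Wayback Machine for the closest archived copy of url."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"{WAYBACK_API}?{query}", timeout=timeout) as response:
        return closest_snapshot(json.load(response))


def review_source(url, online=is_online, snapshot=fetch_snapshot):
    """Classify a source link as 'online', 'archived' (with link) or 'needs-review'."""
    if online(url):
        return ("online", url)
    archived = snapshot(url)
    if archived:
        return ("archived", archived)
    return ("needs-review", None)
```

The `online` and `snapshot` parameters are injectable so that the classification logic can be exercised without network access; a real bot would also need rate limiting and a way to write the result back to the file page.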
@Prototyperspective: Google and Tineye would probably balk at this unless we got permission first.   — 🇺🇦Jeff G. please ping or talk to me🇺🇦 01:06, 29 December 2023 (UTC)[reply]
Good point. Maybe that permission could be acquired, though, and chances seem high that they would be okay with it. I don't think semi-automatic scans are disallowed at these sites, so one could also have a semi-automated approach, like a button "Scan this whole category+subcategories via reverse image search", with each file getting a new datetime field for the last TinEye/Google Images search so it's not scanned twice (for now). That scan could run for a long time, but it would finish eventually. Prototyperspective (talk) 11:28, 29 December 2023 (UTC)
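The "datetime field for the last search" idea could work roughly as follows. This is only a sketch: the mapping from file name to last-scan time, the function name `files_due_for_scan`, and the one-year re-scan interval are all assumptions, not an existing Commons feature.

```python
from datetime import datetime, timedelta, timezone


def files_due_for_scan(last_scanned, now=None, max_age=timedelta(days=365)):
    """Return names of files never scanned, or last scanned longer ago than max_age.

    last_scanned maps a file name to a timezone-aware datetime, or None if the
    file has never been through a reverse image search.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, scanned_at in last_scanned.items()
        if scanned_at is None or now - scanned_at > max_age
    )
```

Storing the timestamp per file means a category-wide scan can be interrupted and resumed, and files already checked recently are skipped on the next run.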
I might have misunderstood the issue, but isn't the Wayback Machine the obvious choice? It takes a snapshot of a website and can be used as a citation --PantheraLeo1359531 😺 (talk) 09:40, 5 January 2024 (UTC)
A bot would need to make sure there's archived versions of all links in file source / descriptions. Or are you going through 100 million files by hand to manually create snapshots where there are none? It's rare (that's why I wouldn't consider it top priority now) to see files with dysfunctional unarchived links but it happens and could happen more often if some larger sites go down within the next few decades. Prototyperspective (talk) 10:17, 5 January 2024 (UTC)[reply]
Not all sites are necessarily archived on Wayback Machine. In any case, my proposal goes far beyond source sites disappearing (perhaps I put too much focus on that part). I'm also talking about possible human errors when a file seems to be a copyright violation but there is an obscure proof that it isn't, or wasted work when a file from a safe source is nominated for deletion (I say this not knowing if a safe list exists, I'm not aware of it and I couldn't find it). From another point of view, I do know that I might have to "defend" (that is, providing more evidence about source and license) my uploaded files over the next days or weeks, if someone doubts they are rightly licensed.... but (at least for me) it makes no sense having to "defend" your uploads years after you uploaded them (and uploaded files can be nominated for deletion at any moment). MGeog2022 (talk) 13:13, 5 January 2024 (UTC)[reply]
  • Why I consider this proposal (mine) important: files with high value can be lost from Wikimedia Commons (and potentially, in some cases, be eventually lost forever for humanity) due to human errors, especially when the license is not obvious and requires review work (it can be even worse if the original uploader isn't there to support his/her viewpoint). The same can happen when an own work is stolen from Commons by a third party. There is also the potential time wasted investigating licensing issues of files that could have been verified in advance. MGeog2022 (talk) 14:41, 20 January 2024 (UTC)
  • We could consider developing a bot, similar to the one used in Wikipedia, to automatically archive links in archive.org. This would help ensure that the content remains accessible even if the original link becomes inactive. However, verifying the files poses a significant challenge due to the vast number of images involved. It would be an enormous task to find and assign selected contributors to verify each file, making it an ongoing and time-consuming process. Riad Salih (talk) 23:57, 1 February 2024 (UTC)
    @Riad Salih, I think the bot to archive links in archive.org is a great idea. But, as I said several times on this page (for example, my response to Jeff G's vote below), my proposal is to have the possibility to manually verify some files, not to verify all, most, or even a significant part of the files currently in Commons. Perhaps (if this proposal is implemented), new files shouldn't be systematically verified, as I suggested. But I think it would be good to have at least an option to manually mark a file as verified, when it is deemed to be at special risk of being mistaken for a copyright violation (for example, as I mentioned in the discussion above, files that include a note such as "© 2011 All rights reserved" on the image itself, but that were released under a free license by their creator at a later date). On the other hand, this proposal also includes automatically verifying that newly uploaded files marked as "Own work" aren't found on the web at the moment they are uploaded, to prevent photos (or other media works) made by a Commons user from being stolen by third parties, which has happened in the past, and has often resulted in a genuine own work being lost from Commons due to this theft (see Prototyperspective's mixed Support/Oppose vote). MGeog2022 (talk) 20:04, 2 February 2024 (UTC)
    I understand that for any file that is displayed on a web page next to a CC or public domain notice, archive.org solution would be absolutely OK. But for files that come from a download option in a website, that are never displayed on the website itself, archiving the website wouldn't necessarily be an absolute proof that the specific file that was uploaded to Commons, is freely licensed. Perhaps I'm talking here about a very specific problem that should be approached in a very different, less generic, way. MGeog2022 (talk) 20:13, 2 February 2024 (UTC)[reply]
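As an illustration of the link-archiving bot discussed in this thread, the sketch below collects external links from a file description's wikitext and builds one Wayback Machine "Save Page Now" address per link (assumed here to be `https://web.archive.org/save/` followed by the target URL). The regular expression, the function names, and the sample wikitext in the usage note are hypothetical; a real bot would also need to send the requests, respect rate limits, and record the resulting snapshot. As noted above, archiving the page does not by itself prove the license of a file that is only offered as a download.

```python
import re

# Stop at whitespace and at characters that delimit links in wikitext.
URL_PATTERN = re.compile(r"https?://[^\s\[\]|<>{}\"]+")
SAVE_PREFIX = "https://web.archive.org/save/"


def external_links(wikitext):
    """Return unique external URLs in order of appearance, skipping archive.org links."""
    links = []
    for url in URL_PATTERN.findall(wikitext):
        url = url.rstrip(".,;)")  # drop trailing sentence punctuation
        if "archive.org" not in url and url not in links:
            links.append(url)
    return links


def save_requests(wikitext):
    """Build one Save Page Now address for each external link in the wikitext."""
    return [SAVE_PREFIX + url for url in external_links(wikitext)]
```

For example, `save_requests("source=[https://centrodedescargas.cnig.es/ CNIG]")` yields a single address for the CNIG download site. Stripping a trailing `)` is a simplification that would truncate URLs that legitimately end with a parenthesis.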

Votes

  •  Support I recognize the complexity of doing this for all cases but I would like to see more formal and stable processes for cases which are easier, as we have done with Flickr verification. Even gaining technical documentation describing the nature of the challenge could be a useful exploratory step. Bluerasberry (talk) 16:44, 16 January 2024 (UTC)
  •  Support, as proposer. MGeog2022 (talk) 20:07, 16 January 2024 (UTC)
  •  Support per Bluerasberry, but this looks like a lot of work.   — 🇺🇦Jeff G. please ping or talk to me🇺🇦 14:09, 22 January 2024 (UTC)[reply]
    @Jeff G., please note that what I'm proposing is to be able to mark a file as verified, not to mark all files in Commons as verified (for example, every document older than 100 years or obviously published by the US government doesn't need it at all). I believe this part would be very simple (the feature, not verifying the files). In addition, there is the automatic system to verify that "Own work" files aren't found elsewhere; yes, this probably won't be so easy to implement. MGeog2022 (talk) 14:58, 22 January 2024 (UTC)
  •  Oppose per Riad Salih's comment above. However,  Support for a bot that adds archived links next to hyperlinks (maybe like so: [aw]), especially in the source field, and a bot that parses source webpages (live or archived) to identify likely copyvios for contributors to review, as well as tagging files via Wikidata as 'license-verified-according-to-source-in-description'. --Prototyperspective (talk) 13:45, 2 February 2024 (UTC)
  •  Oppose As I explained above. Riad Salih (talk) 16:49, 2 February 2024 (UTC)
  •  Unsure It feels like this needs a few rounds of revision before it's an actionable proposal. Good idea, but not developed enough. Adam Cuerden (talk) 16:52, 5 February 2024 (UTC)