Commons:Bots/Requests/EatchaBot 3

EatchaBot (talk · contribs) 3

Operator: EatchaBot (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Categorize files waiting for license review (LR) to make license reviewing easier.

  1. Category:Files with no external URL source
  2. Category:Files with archived source and marked for license review

The template will be automatically removed once the file has been reviewed.
If humans still cannot cope with the backlog after categorization, I will extend the task to reviewing files (with another BRFA and a proposal at COM:VPP). Images will be matched using perceptual hashing, and audio using acoustic fingerprinting. If it is safe to download videos from a site without getting blocked (see the YouTube problem), videos will be matched using digital video fingerprinting.
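As a rough illustration of the perceptual hashing idea mentioned above, here is a minimal sketch of a difference hash (dHash) compared by Hamming distance. It is not the bot's actual code. It assumes each image has already been downscaled to a small grayscale grid (the decoding and resizing step is left out), and the names `dhash`, `hamming`, `looks_same` and the `max_distance` threshold are hypothetical.

```python
from typing import List

# Assumption: an image is given as rows of 0-255 grayscale values,
# already shrunk to a small fixed size (for example 9x8 pixels).
Gray = List[List[int]]


def dhash(pixels: Gray) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # Bit is 1 when brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def looks_same(a: Gray, b: Gray, max_distance: int = 2) -> bool:
    """Treat two images as a match if their hashes differ in few bits.

    max_distance is an illustrative threshold, not a tuned value.
    """
    return hamming(dhash(a), dhash(b)) <= max_distance


if __name__ == "__main__":
    original = [[10, 20, 30], [30, 20, 10]]
    brighter = [[v + 5 for v in row] for row in original]  # same picture, lighter
    mirrored = [list(reversed(row)) for row in original]   # different content
    print(looks_same(original, brighter))  # True
    print(looks_same(original, mirrored))  # False
```

Because the hash only records whether brightness rises or falls between neighbouring pixels, a uniformly brightened copy produces the same hash, while a mirrored image does not.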

Automatic or manually assisted: Automatic, unsupervised (precision will be high enough to ensure that the bot doesn't vandalize anything; it will not edit dubious files)

Edit type (e.g. Continuous, daily, one time run): Continuous, daily

Maximum edit rate (e.g. edits per minute): < 10

Bot flag requested: (Y/N): Bot is already flagged

Programming language(s): Python

Eatcha (talk) 08:49, 8 February 2020 (UTC)

Discussion

Primary targets: Category:License review needed, Category:License review needed (video) and Category:License review needed (audio). Files marked as own work will be removed, and I can mention the uploader in the edit summary or post a short message on their talk page. All external links in the source parameter of the Information template will be archived, which will help if a source link dies. I will replace dead links with a direct link to the oldest archive. -- Eatcha (talk) 09:20, 8 February 2020 (UTC)

Some older discussions: here and here -- Eatcha (talk) 09:48, 8 February 2020 (UTC)

  • I can't wait!
    I support sorting files waiting for review into different categories. It makes reviewing much easier. For example, reviewing files from Flickr and YouTube is easy because the license is clearly marked and always in the same place, so you don't have to understand the language. Other websites can be messy: you have to look for the license, and if it is in a local language it can be very hard for users who do not understand that language.
I also think it would be nice if the bot checked the license for still images from a video and noted what the license was. The bot can't decide whether the still image is from the video, but it should be able to verify the license. If the license is non-free, we should mark the images "no permission". If there is no video at the source, we should mark the images "no source". For all images with a link to a video with a free license, we then know that the license is good (even if it is later changed to an unfree license) and we only need to verify that the still did come from the video. --MGA73 (talk) 13:12, 8 February 2020 (UTC)
I intend to support the following sources, as in the beginning:
  1. Category:Files with no external URL source
  2. Category:Files with archived source and marked for license review

The template will be automatically removed once the file has been reviewed. -- Eatcha (talk) 11:16, 9 February 2020 (UTC)

Sounds good. It seems the review category is a mix of two kinds of review: one where the reviewer can compare the license on Commons with the license at the source (the easy review), and another where the reviewer needs to judge whether a file is PD for some reason (the harder review). --MGA73 (talk) 11:32, 9 February 2020 (UTC)
  • How should I categorize the files? I can add categorization templates or only categories, but the problem is that the license reviewing script will not remove the template or the category. Is there a more conservative approach to this sorting problem? I can create galleries, which would then need to be advertised to license reviewers. Or I can remove the categorization template or category after the file has been reviewed by a reviewer. -- Eatcha (talk) 03:03, 10 February 2020 (UTC)
  • I was planning to add categorization templates to the file page; after the file has been reviewed, the bot would remove the template. But that would generate a lot of edits, and a better approach would be galleries for different sources and licenses. -- Eatcha (talk) 03:07, 10 February 2020 (UTC)
Can we do it like Flickr files? I think we need at least one category for unreviewed files, one for files needing human review and one for reviewed files. If possible, the files needing human review should be sorted into one group where the source is licensed unfree, one where the link does not work, and one where the license and link are good but the bot can't match the file (for example a still from a video).
If that is a problem, we can start with a gallery of files where the source is licensed unfree and one where the link does not work. Those files should probably all be deleted. Perhaps the bot could add a Wayback link to the files where the license is (now) unfree, in case that helps the reviewer see whether the license was once free. --MGA73 (talk) 09:25, 10 February 2020 (UTC)
MGA, for YouTube screenshots it is going to be like the examples at User:EatchaBot/LR/YouTube/screenshots. The bot can generate daily galleries. Or, if everyone wants categorization like Flickr, I need to add some text to the file page, most probably a template like {{FlickrReview}}. But the license reviewing user scripts and the edit filter would have to be updated accordingly. -- Eatcha (talk) 15:12, 10 February 2020 (UTC)
@Eatcha: I think the gallery works fine. I passed one and nominated the other for deletion. If there are more than a few files in a gallery perhaps it can sort them under Free, Unfree, Not found? --MGA73 (talk) 15:21, 10 February 2020 (UTC)

I think it would be nice if Eatcha had the bot save to the Wayback Machine every page mentioned in the source field of file pages carrying {{Licensereview}} etc., because in many cases we need a human to review, and history tells us that it can take a long time before a human reviews the page. If there is a link in the permission field, perhaps that should be saved too. --MGA73 (talk) 06:56, 17 February 2020 (UTC)
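The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the bot's code: it pulls external URLs out of the `source` parameter of a file page's wikitext and builds the matching Wayback Machine "Save Page Now" address (`https://web.archive.org/save/` followed by the URL). The function names and the simple regular expressions are hypothetical, and no network request is made here.

```python
import re
from typing import List

# Assumption: a URL ends at whitespace or at common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s\[\]|}<>"]+')

# Assumption: the source parameter runs until the next "|param" line,
# the closing "}}" of the template, or the end of the text.
SOURCE_RE = re.compile(r"\|\s*source\s*=(.*?)(?=\n\s*\||\n\}\}|\Z)", re.S | re.I)


def awaits_review(wikitext: str) -> bool:
    """True if the page carries a license review template (case-insensitive)."""
    return "{{licensereview" in wikitext.lower()


def source_urls(wikitext: str) -> List[str]:
    """External URLs found in the source parameter of the Information template."""
    match = SOURCE_RE.search(wikitext)
    return URL_RE.findall(match.group(1)) if match else []


def save_request_url(url: str) -> str:
    """Address that asks the Wayback Machine to capture the given URL."""
    return "https://web.archive.org/save/" + url


if __name__ == "__main__":
    page = (
        "{{Information\n"
        "|description=Test\n"
        "|source=https://example.org/video/1 and [https://example.org/page title]\n"
        "|author=Someone\n"
        "}}\n"
        "{{LicenseReview}}\n"
    )
    if awaits_review(page):
        for url in source_urls(page):
            print(save_request_url(url))
```

A real run would also need rate limiting and some filtering of which links are submitted, which is the concern raised in the replies below.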

MGA, see Commons:Bots/Work_requests#Review_all_files_in_Category:License_review_needed. I will implement it after the next 3 months unless someone does it before me or InternetArchiveBot starts operating here. It's a huge task, and writing bug-free code will take weeks without doing any other work (I checked InternetArchiveBot's source; it's not very easy to do). I also need to ensure that my bot does not archive anything illegal, such as child porn, drug stores, hitmen or organ-selling sites; archiving these illegal sites means trouble for the bot developer. -- Eatcha (talk) 10:44, 17 February 2020 (UTC)
Eatcha, yes, the internet is full of shit. But if all your bot does is ask the Internet Archive to save a copy of a website, don't you think they have some sort of filter or cleanup? Of course I agree it would be nice not to ask them to save too much shit, to minimize the risk that they block the bot. --MGA73 (talk) 14:15, 17 February 2020 (UTC)
In my opinion they won't block the bot, as they are paying for InternetArchiveBot (which implies they want more external links from here), and it runs from Toolforge (both bots share the same IP address). The bigger issue here is the time required to develop a multi-site review and archive bot for video, audio and images, and of course avoiding archiving illegal stuff. -- Eatcha (talk) 14:38, 17 February 2020 (UTC)

How should I implement it

You did not ask me ;-) but if the bot can somehow separate files into those with broken links and those with working links, it would be nice.
I also think it is nice to treat files from YouTube (and Flickr) specially, because they are very easy to review: the license on those two sites is very clear and easy to find. If you find many files from other specific sources, we should perhaps separate those too. --MGA73 (talk) 16:34, 22 February 2020 (UTC)
MGA, I asked them because they will decide whether this task should be approved or not; it's policy here. This request was created only because you asked me on my talk page. What do you mean by a dead link? 404 plus no archive, or just 404 links? And what category should be added? Suppose there's a video from the VOA News website with a 404 link and an archive: what categories would you prefer? What if the link is dead with no archive? And if the link is alive and archived? It would be best if you wrote out the exact changes you prefer after comparing with the example (Kathy Pham's portrait) I linked above. And what about whether the file is a video, image, audio or something else? I just calculated that 2^4 combinations of categories are possible for a single website that supplies at least video if what you're asking is implemented. Without your suggestion it's 2^3 now, so it doubles. -- Eatcha (talk) 17:14, 22 February 2020 (UTC)
It's cool, Eatcha! I know why you asked them :-)
You asked what to do with files with no external URLs. I think they should go in their own Category:Files with no external URL source.
By dead links I mean files where the link is not working today and where there is no working link in the archive. They should go in Category:Files with no working URL source.
If there is a working link today but no link in the archive, will the bot not force the archive to save a copy? If not, we could create Category:Files with working source only and have all reviewers hurry up and look at those :-)
Categories like Category:White House images review needed are good, especially if there are many images from the same source. --MGA73 (talk) 17:31, 22 February 2020 (UTC)
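The sorting rule described above could be written down roughly like this. It is a sketch of one reading of the discussion, not an agreed specification: the function name and its three boolean inputs are hypothetical, and sending archived files to Category:Files with archived source and marked for license review is an assumption taken from the task list at the top of the request.

```python
def pick_category(has_external_url: bool, link_works: bool, has_archive: bool) -> str:
    """Choose one tracking category from the link status of a file's source."""
    if not has_external_url:
        return "Category:Files with no external URL source"
    if has_archive:
        # Assumption: an archived copy is enough, whether or not the live link works.
        return "Category:Files with archived source and marked for license review"
    if link_works:
        # Live link but no archived copy yet: reviewers should hurry.
        return "Category:Files with working source only"
    # Link is dead today and no archived copy exists.
    return "Category:Files with no working URL source"


if __name__ == "__main__":
    print(pick_category(has_external_url=True, link_works=False, has_archive=False))
```

Keeping the file type (video, image, audio) out of the decision is what avoids the growth in category combinations worried about earlier in the thread.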
It turns out I was assuming a complex categorization system like Category:Videos from YouTube with no working URL source or Category:Images from YouTube with working URL source. Nice to hear that there are only two possibilities for the link's status and that I was wrong. -- Eatcha (talk) 03:49, 23 February 2020 (UTC)
Eatcha, I think we should start simple. If it turns out that 90% of the files end up in the same category, you could perhaps add more categories later or use User:EatchaBot/LR/YouTube/screenshots etc. --MGA73 (talk) 09:42, 23 February 2020 (UTC)

Example edits


Please report when the request is ready to be approved. --Krd 18:23, 4 March 2020 (UTC)

Should be approved now. -- Eatcha (talk · contribs) 14:49, 7 March 2020 (UTC)
tl;dr: Approved. --Krd 07:39, 20 March 2020 (UTC)