Commons:Bots/Requests/TheSandBot 2

TheSandBot (talk · contribs) 2

Operator: TheSandDoctor (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Related discussion: Commons:Bots/Work_requests/Archive_15#Corrupt_jpg

Bot's tasks for which permission is being sought: For the following, every time a new image is examined, its name, date/time of scan, and (if applicable) date/time to follow up (see #3) are recorded in an internal database (a minimal sketch of the tracking record follows the task list below). This is for tracking and to prevent scanning the same file twice (excluding follow-up, of course).

  1. Per the discussion above, the bot will cycle through all images in the File namespace, checking whether they are corrupt. If an image is corrupt, this is noted in the internal database, the image page is tagged with {{TSB image identified corrupt|<date>|day=<day_to_nom>|month=<month_to_nom>|year=<year_to_nom>}}, and the latest uploader is notified using Template:TSB corruption notification. Template:TSB image identified corrupt adds the image to a category of other identified images and leaves a visible notice on the image page. This part of the task is hopefully a more-or-less one-off. Given that the bot never scans the same file twice (excluding #3 below), this could theoretically be a fairly quick check after its initial run, since it would only pick up new files (e.g. it could be run periodically as a backup to #2 below).
  2. Recent changes monitoring: monitor recently uploaded images to ensure that they are valid. If not, tag them (see #1) per the bot request and follow up (see #3). This part of the task and #3 below would "survive" the longest and be continually running.
  3. Follow-up: whenever an image is identified as corrupt, it is tagged (see #1 above) and subsequently requires follow-up after a set time interval, to either mark it as corrected or nominate it for deletion. I have yet to write the code for this part of the task, but I have a good idea of how I will and plan to start within the next day or so.
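A minimal sketch of the kind of tracking record described above (hypothetical schema and column names; the bot's actual database may differ):

# Sketch only: one row per scanned file, with an optional follow-up date for
# files tagged as corrupt. Column names are illustrative, not the bot's own.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect("tsb_tracking.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS scanned_files (
        file_name   TEXT PRIMARY KEY,
        scanned_at  TEXT NOT NULL,
        corrupt     INTEGER NOT NULL DEFAULT 0,
        follow_up   TEXT
    )
""")

def record_scan(file_name, corrupt=False, follow_up_days=30):
    now = datetime.utcnow()
    follow_up = (now + timedelta(days=follow_up_days)).isoformat() if corrupt else None
    conn.execute(
        "INSERT OR REPLACE INTO scanned_files VALUES (?, ?, ?, ?)",
        (file_name, now.isoformat(), int(corrupt), follow_up),
    )
    conn.commit()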

I am happy to answer any questions that you may have about this.

Automatic or manually assisted: automatic

Edit type (e.g. Continuous, daily, one time run): continuous for recent changes monitoring; ideally a one-time (or rarely repeated) check of the full Commons File namespace; then follow-ups for any images identified

Maximum edit rate (e.g. edits per minute): The rate at which the bot saves edits can be throttled to any value desired.

Bot flag requested: (Y/N): N/A (already have it)

Programming language(s): Python

Source: GitHub

TheSandDoctor (talk) 08:05, 7 November 2019 (UTC)[reply]

Discussion

As I understand your code, you are going through every file here, downloading them with mwclient, and corruption is determined by Pillow failing to read & decode the file. I see a few issues with this:
  • Non-image files. Pillow cannot understand every file format here, and you probably need a whitelist of formats that get sent to Pillow.
  • File download failures. As far as I can tell, mwclient does not verify the downloaded file's hashsum against the recorded hashsum from the API. If the TCP connection dies prematurely, would that cause a false positive?
  • 404 images. Can you add special handling for them?
  • Gigapixel images, like this 1.49 GB one. image.tobytes() could fail due to insufficient memory.
Also, if the bot is intended to run on Toolforge, please make sure to put the temporary files in a tempfile.TemporaryDirectory so it doesn't consume NFS bandwidth; it's also faster than NFS-bound directories anyway (a rough sketch of this kind of check follows). --Zhuyifei1999 (talk) 08:50, 7 November 2019 (UTC)[reply]
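To illustrate the kind of check being discussed, a rough sketch only (not the bot's actual code; the format whitelist and the download callback are stand-ins):

# Download into a throwaway directory, restrict to formats Pillow can decode,
# and treat a decode failure as possible corruption.
import os
import tempfile
from PIL import Image

PILLOW_FORMATS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff", ".webp"}  # illustrative whitelist

def looks_corrupt(download, file_name):
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in PILLOW_FORMATS:
        return False  # skip formats Pillow cannot judge
    with tempfile.TemporaryDirectory() as tmpdir:   # avoids NFS-bound directories on Toolforge
        path = os.path.join(tmpdir, os.path.basename(file_name))
        download(path)                              # caller-supplied download function
        try:
            with Image.open(path) as img:
                img.load()                          # force a full decode
        except OSError:
            return True
    return False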
  • (Edit conflict) The code will only go through image files? (docs, code)
  • That is a good point, but the API doesn't appear to give you a hashsum either?
  • phab:T124101 - presumably I would be able to add special handling, yes. I am just not sure how specifically I'd handle it, given that I never (to my knowledge) access upload.wikimedia.org, and it appears that some of the images mentioned in the tickets now have all of their revisions. The way I'd imagine the notification working would be to simply notify the latest uploader. Judging by these two examples from the phab ticket, the API still reports the correct last uploader?
  • Gigapixels - that's a good point. Do you have any suggestions, Zhuyifei1999?
As for NFS bandwidth: good point. I am currently undecided as to whether a Cloud VPS would be worth pursuing for this task. If I do use Toolforge, I will definitely look into tempfile.TemporaryDirectory. I wasn't aware of it until now, so thank you for bringing it up. --TheSandDoctor (talk) 16:58, 7 November 2019 (UTC)[reply]
  • Allimages definitely returns all files. API names are very rarely changed; when this part of the MediaWiki API was designed, the File namespace was probably still called Image, and there was barely anything other than images in the File namespace.
  • You can query for a sha1 hashsum from prop=imageinfo with iiprop=sha1 (a quick sketch follows this reply).
  • You are accessing upload.wm.o when you download the file contents. My concern is that if the latest revision on upload.wm.o is a 404, then you could run into issues.
  • I thought of two methods:
    • Files larger than a certain threshold are skipped; they probably take a long time to download and process anyway, and it may be less likely that they are corrupt compared to smaller files.
    • Hook into Image.tobytes (you might need to do an AST transformation or create your own tobytes, since it doesn't seem easily hookable with monkey patching), so that the data is continuously cleared, or so the data.append call is skipped.
--Zhuyifei1999 (talk) 17:28, 7 November 2019 (UTC)[reply]
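A quick sketch of the sha1 suggestion above, using plain requests against the API (illustrative only; the size threshold is arbitrary and not part of the actual bot):

# Fetch the recorded sha1 (and size, for a size cutoff) from prop=imageinfo,
# and compute the sha1 of a downloaded copy for comparison.
import hashlib
import requests

API = "https://commons.wikimedia.org/w/api.php"
MAX_BYTES = 500 * 1024 * 1024  # illustrative threshold for skipping huge files

def remote_sha1_and_size(title):
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "sha1|size",
        "titles": title,
        "format": "json",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    info = page["imageinfo"][0]
    return info["sha1"], info["size"]

def local_sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

Comparing local_sha1(path) against the recorded value would catch a truncated download before Pillow ever sees the file.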
I have implemented the sha1 hashsum check for local vs. remote and will verify that the implementation is correct/functional later (in the meantime, I've pushed my current implementation to the GitHub repo). I've just run out of time at the moment. I will look at the rest later. I am wondering how to avoid the 404 situation. Probably just watch out for/catch the exception(?). The issue there primarily seems to be around user notification?
That's a good idea - I will see about writing my own tobytes or something similar to avoid the append. Making my own modified version of Pillow would probably be the quickest, but I'll see if there is another way...
Anyhow, I've gotta run now. Will look more in a bit. --TheSandDoctor (talk) 19:03, 7 November 2019 (UTC)[reply]
@Zhuyifei1999: Alright: the hash verification appears to work just fine. In regard to gigapixel images running out of memory, I have forked Pillow and removed the append line. I then verified that the code still works just fine and have linked my custom fork as a git submodule (so when you look at the GitHub repo, you can click through to the custom Pillow repo). In regard to images failing to download, I have (arbitrarily; the number can easily be changed) made it so that if it fails 10 times to get a sha1 match (as in re-downloading, checking, and re-downloading again if the check fails), it writes the file name to a local text file for further investigation (sketched below). Is there a list somewhere of all the (image) file formats that Commons supports? I have compiled 55 file extensions that I have been able to find that PIL supports, but am wondering if I can narrow it down any.
I have yet to address the 404 error for missing image revision details (e.g. the performing user) and figure that this will be specific to user notification. I will work on user notification ASAP. --TheSandDoctor (talk) 05:50, 8 November 2019 (UTC)[reply]
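A sketch of the retry-and-log behaviour just described (download_file and remote_sha1 are hypothetical stand-ins for the bot's own helpers; the log file name is illustrative):

# Re-download up to MAX_ATTEMPTS times until the local sha1 matches the
# recorded one; otherwise record the file name for manual follow-up.
import hashlib

MAX_ATTEMPTS = 10

def fetch_verified(title, path, download_file, remote_sha1):
    expected = remote_sha1(title)
    for _ in range(MAX_ATTEMPTS):
        download_file(title, path)
        with open(path, "rb") as f:
            if hashlib.sha1(f.read()).hexdigest() == expected:
                return True
    with open("failed_downloads.txt", "a") as log:   # flagged for further investigation
        log.write(title + "\n")
    return False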
Awesome. My favorite list of Commons-supported file formats is at Special:MediaStatistics. There are also COM:FT and Special:Upload. --Zhuyifei1999 (talk) 06:01, 8 November 2019 (UTC)[reply]
@Zhuyifei1999: I've really gotten into modifying my own Pillow. I have just switched the exception raised for failed images (a generic IO exception) to a custom subclass that I can actually catch. With this approach, I might not need to worry about supported formats: it just tries everything (the download will fetch anything), and when PIL can't make heads or tails of the format, I have it set to remove the file from the OS and skip to the next one (the exception handling throws it up to the main loop, cleaning up along the way). I might combine the two approaches just to be sure, but it might not be needed. --TheSandDoctor (talk) 06:18, 8 November 2019 (UTC)[reply]
Hmm... Okay. If it works then it's cool. --Zhuyifei1999 (talk) 06:21, 8 November 2019 (UTC)[reply]
(Edit conflict) I should mention, you don't really have to fork the whole of Pillow; I would monkey patch the function in question, so that I could receive Pillow updates more easily. (Actually, if I had time to write very future-proof code, I would do an AST transformation, so that as long as the AST transformer can find the line, the patch is applied. For example, this AST transformer is one that I wrote that replaces an inner function in this method.) I think writing an AST transformer is probably going to be a waste of time in this case, but I would still recommend monkey patching rather than forking the whole repo. --Zhuyifei1999 (talk) 06:19, 8 November 2019 (UTC)[reply]
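For what it's worth, a hedged sketch of what such a monkey patch might look like (this assumes current Pillow internals such as Image._getencoder and the encoder protocol, which are not a stable public API and may change between releases):

# Replace Image.tobytes with a version that forces a full decode/encode pass
# but discards each encoded chunk instead of accumulating it, so gigapixel
# files do not exhaust memory. We only care whether decoding succeeds.
from PIL import Image

def _tobytes_discarding(self, encoder_name="raw", *args):
    self.load()
    if encoder_name == "raw" and not args:
        args = self.mode
    encoder = Image._getencoder(self.mode, encoder_name, args)
    encoder.setimage(self.im)
    bufsize = max(65536, self.size[0] * 4)
    while True:
        _length, status, _chunk = encoder.encode(bufsize)  # chunk deliberately dropped
        if status:
            break
    if status < 0:
        raise RuntimeError("encoder error %d in tobytes" % status)
    return b""  # nothing retained

Image.Image.tobytes = _tobytes_discarding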
@Zhuyifei1999: I have opened up an "issue"/ticket in the repo about switching to a monkey patch. I will have to do some more reading and figure out how that would work in this case, but it seems doable. I've learned something new today; until now, I'd never heard of a monkey patch or an AST transformation... In the meantime, the current approach works (per your comment above), so I think it is okay for now? --TheSandDoctor (talk) 07:03, 8 November 2019 (UTC)[reply]
ok --Zhuyifei1999 (talk) 07:42, 8 November 2019 (UTC)[reply]
User notifications have been added to the recent changes/upload log monitoring (wrapped in a broad exception handler which logs the file - this is until I figure out what specifically will be thrown, at which point it will be refined). All that is left at present (? @Zhuyifei1999 and EugeneZelenko: ) is the follow-up script that runs after X days. --TheSandDoctor (talk) 17:20, 10 November 2019 (UTC)[reply]
Monitoring new uploads is a reasonable compromise solution until MediaWiki itself is improved. --EugeneZelenko (talk) 15:00, 11 November 2019 (UTC)[reply]
@EugeneZelenko and Zhuyifei1999: I plan to do more vetting on this within the next 24 hours, but at present I think we are basically ready for a trial. I'll let you know more when I've done some more vetting with a fresh set of eyes in the morning. Please let me know what you think; I just thought that you deserved an update on this. --TheSandDoctor (talk) 05:48, 13 November 2019 (UTC)[reply]
@TheSandDoctor: You're free to do the trial at any time after the BRFA submission. We generally don't require approvals for short trials (it's step 3 ;) ), unless there are objections to the task. --Zhuyifei1999 (talk) 06:23, 13 November 2019 (UTC)[reply]
I'm looking at your code and see that getMostRecentUploads works by querying the 50 most recent uploads in the logs. What is the run frequency of the bot? I'm afraid the bot might not be able to catch up; 50 uploads can happen within minutes, and some people also perform batch uploads. --Zhuyifei1999 (talk) 06:39, 13 November 2019 (UTC)[reply]
@Zhuyifei1999: How many would you suggest requesting at once? Another (or supplemental) option would be to occasionally run corrupt.py, as it would theoretically be a lot quicker (after its first run) since it would skip (not even download) any files already found in the database, and it could catch any stragglers missed. --TheSandDoctor (talk) 07:16, 13 November 2019 (UTC)[reply]

I would generally avoid going for an arbitrary threshold, just because of its arbitrariness and the en:Zero one infinity rule. If I were to do this, I would probably go for one of the following two approaches to RC:

  • Use wikitech:EventStreams for real-time RC streaming, as a continuously running bot process. This was used by User:Embedded Data Bot, which dispatches events to multiple workers, and by User:SignBot (the code was abstracted away to media-dubiety by the time SignBot's code was restructured). The downside of this is that, well, if you restart the bot, you lose events. In theory you could rewind EventStreams to start streaming from the past, but I've never gotten that to work yet. It also once streamed events from... a long time before that run; I couldn't figure out why.
  • Keep a timestamp / identifier of the last log event that was fetched, and fetch the log events from the API from there (a sketch of this catch-up approach follows this post). As far as I know (I haven't read all of its code), User:CommonsDelinker uses this method to read deletion log entries, and when it fails and gets revived it tries to catch up. The downside is that, well, you have state to maintain.

By the way, since you are avoiding processing the same file twice, how would it handle re-uploads? Say, if I overwrite a good and checked file with a corrupt file, what would happen?

As for the initial run, well, you have two hundred terabytes to process, which I imagine... will take quite a while. Even if all the images are processed, iterating through all the file names from the API could still take quite a while; it's on the order of tens of millions. I'm not objecting to the use of the full scan; I'm just saying that you might need to be prepared for the scale of this operation; it's mental. --Zhuyifei1999 (talk) 08:05, 13 November 2019 (UTC)[reply]
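To make the second option concrete, here is a rough sketch of the timestamp catch-up approach using pywikibot (the state file and handler callback are illustrative; this is not the bot's actual code):

# Remember when the last run started and, on the next run, replay all upload
# log entries made since then, oldest first.
import pywikibot

STATE_FILE = "last_run_ts.txt"  # illustrative state file

def process_new_uploads(handle_file):
    site = pywikibot.Site("commons", "commons")
    now = site.server_time()
    try:
        with open(STATE_FILE) as f:
            start = pywikibot.Timestamp.fromISOformat(f.read().strip())
    except FileNotFoundError:
        start = now
    # reverse=True lists oldest-first, so catch-up happens in upload order
    for entry in site.logevents(logtype="upload", start=start, end=now, reverse=True):
        handle_file(entry.page())  # hand off to the corruption check
    with open(STATE_FILE, "w") as f:
        f.write(now.isoformat())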

You could stop the bot after finding 5-10 problematic files. --EugeneZelenko (talk) 15:13, 13 November 2019 (UTC)[reply]

Template:TSB corruption notification and Template:TSB corruption CSD notification have been created to allow for some on-wiki customization of the notification messages to be left (I welcome any input or editing from anyone watching). The ideal would be for these two templates to be substituted by the bot (and they have been implemented as such in the code). In response to the point raised by Zhuyifei1999, and after some quick IRC conversations about the fact that the image hashes accessible via the API change after a file is re-uploaded, the bot now also logs the sha1 hash of the image on the server and compares against it prior to downloading (code). That was a great point; thank you for bringing it up.
I have also been working on transitioning the whole project from mwclient to pywikibot in order to more easily access EventStreams per EmbeddedData's example using redis (a fairly ingenious approach IMO - hats off to those who wrote it; a rough sketch of the EventStreams approach follows this post). I have so far made pwb clones (with some improvements) of the mwclient methods and will have to do some code review shortly. I welcome any input on anything as well, by the way. I am learning a lot from this task in particular, which is great! I love to learn new things.
@Zhuyifei1999: "It also once streams events from... like long time before that run" could you please elaborate/reword? I'm not sure I am understanding this bit right.
@EugeneZelenko: Are you suggesting stopping it after 5-10 corruptions found for a trial? Just wanting to be sure I am "getting" this correctly.
Partly brought on by this and partly due to my own curiosity/wanting to do this for a while, I have set up a Phabricator install on my website, which I am considering using to track any issues/feature requests for this project (and potentially any others). I welcome anything over there as well.
This is coming at a difficult time of year for me scheduling-wise (though it clears up after December 13th), so my activity regarding this may slack a bit compared to the recent flurry of commits and work. I'll work at this when I can. --TheSandDoctor (talk) 20:11, 20 November 2019 (UTC)[reply]
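A rough sketch of consuming new uploads through pywikibot's EventStreams wrapper (the filter keys shown are assumptions about the recentchange event fields; the redis worker dispatch from EmbeddedData's example is not shown here):

# Stream recentchange events and keep only upload log events on Commons.
import pywikibot
from pywikibot.comms.eventstreams import EventStreams

site = pywikibot.Site("commons", "commons")
stream = EventStreams(streams="recentchange", site=site)
stream.register_filter(wiki="commonswiki", type="log", log_type="upload")

for event in stream:
    file_title = event["title"]  # e.g. "File:Example.jpg"
    # hand file_title off to the corruption check / notification pipeline here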
I was referring to m:User_talk:Zhuyifei1999#Bot_edit_on_dewiki --Zhuyifei1999 (talk) 20:20, 20 November 2019 (UTC)[reply]
Yes, I understood correctly. --EugeneZelenko (talk) 15:03, 21 November 2019 (UTC)[reply]
@EugeneZelenko: Do you mean that I understood correctly? Sorry, I am just a bit confused, as I didn't ask whether you understood anything correctly. --TheSandDoctor (talk) 15:06, 22 November 2019 (UTC)[reply]
You asked "Are you suggesting stopping it after 5-10 corruptions found for a trial? Just wanting to be sure I am 'getting' this correctly." - I answered that question. --EugeneZelenko (talk) 15:23, 22 November 2019 (UTC)[reply]
@EugeneZelenko: Sorry. Was just a typo/grammar getting in the way there. I see what you mean. Thanks for clarifying. --TheSandDoctor (talk) 16:26, 22 November 2019 (UTC)[reply]


  • Could anybody please elaborate on what the problem with incomplete uploads actually is? I assume none of them is in use, because a corrupt image would not have been inserted into any article nor used externally. Additionally, jpgs with missing parts at the bottom could be cropped and should not necessarily be deleted, but it's a lot of work to process all that for images with no actual use; it's early enough to do that when the image is really needed. For me the only remaining thing is to notify the user and encourage them to reupload, but I don't expect much here, as most responsive users will likely have noticed the problem themselves. In my opinion all this effort is a bit much compared to the expected outcome. Am I mistaken? --Krd 08:14, 1 December 2019 (UTC)[reply]
    I don't see a major specific downside to doing this. Yes, there is a lot of traffic flowing around, but we generally don't care about traffic within the Wikimedia network, do we? On the plus side, if someone's upload fails, they can be notified of the broken-ness earlier. Some use upload tools where it may not be immediately clear that the upload is broken, and some might delete the originals once the file is uploaded. With this, it would be more likely that the broken file gets corrected before the file is lost or deleted, or before the uploader retires or goes on a break. Besides, it's the reusers who decide whether to use the image, and they are probably not the uploader and might not have the image in hand to fix it, right? --Zhuyifei1999 (talk) 04:25, 2 December 2019 (UTC)[reply]

I'm going to call this approved, not least because most of it is read-only traffic anyway and the bot is already flagged. --Krd 12:43, 12 December 2019 (UTC)[reply]