Commons:Bots/Requests/MifterBot 4

From Wikimedia Commons, the free media repository

MifterBot (talk · contribs)

Operator: Mifter (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: Check newly uploaded images to ensure they have a license tag. Reactivation of a former task Commons:Bots/Requests/MifterBot and a revival of my first reflag request (which went stale) here. This bot is identical to en:User:MifterBot

Automatic or manually assisted: Automatic - running on WMF Labs

Edit type (e.g. Continuous, daily, one time run): Every half-hour or every hour, depending on image upload volume

Maximum edit rate (e.g. edits per minute): No more than 4 edits per minute (2 tags and 2 notifications or some combination of log entries - log entries should decrease over time as any semi-common templates that I have not previously handled are detected and integrated)

Bot flag requested: (Y/N): Yes.

Programming language(s): Python - A custom fork of pywikipedia

Discussion

Per the instructions on the bot request page, I am currently running a supervised test of the bot. I tested the new code with a -simulate flag prior to this test and there do not appear to be any major issues; however, I will be watching to ensure nothing goes wrong. Mifter (talk) 01:34, 8 February 2017 (UTC)
One image was correctly tagged in the test (the first edit); however, the bot mistakenly tagged two images as having "depreciated" license tags. This is a remnant of old code relating to deprecated licenses, which I am removing prior to continuing the test. Mifter (talk) 02:11, 8 February 2017 (UTC)
I have removed the code that triggers "depreciated" tagging. Now resuming the test. Mifter (talk) 02:13, 8 February 2017 (UTC)
The bot appears to be working as intended so far. It has been erring on the side of caution (if it is not sure whether an image has a license tag, it logs it and moves on) as it is supposed to. The two log entries generated so far are from other image deletion templates that I have since integrated into the code as acceptable templates. The only issue is that we do not have many images uploaded without license tags (hardly a bad issue to have), so it takes time to find and tag those that are. Mifter (talk) 02:52, 8 February 2017 (UTC)
It would be reasonable to combine multiple notifications for the same user. See User talk:Fæ as an example.
Please also fix the edit summaries.
EugeneZelenko (talk) 15:04, 8 February 2017 (UTC)
I agree. As I mentioned in the prior reflag request, the way the bot is currently coded it doesn't have the capability to dynamically detect how many times it has notified a user in a given timeframe. During the prior request I tried to add such a function in a few different ways but was unsuccessful. However, I believe I have found a solution that accomplishes a similar outcome. Through some reconfiguration of the custom notification templates I should be able to combine warnings under the same header. For example, each notification currently starts with a level 2 header stating something like "Image uploaded without a license". By reconfiguring the notification template to use, below that, a level 3 subheader (or simply large bolded text) saying something like "File:Sample.jpg", I should be able to combine the notifications. If the bot sees that level 2 header as the last one on a user's talk page, it will combine the new warning under the same level 2 header (e.g. "Image uploaded without a license", below it "File:Sample.jpg", and below that "File:Example.jpg"). If someone has left an intervening message on that user talk page then it will treat it as a new notification, but this should allow for some grouping of multiple notifications. It's not perfect, but as I also mentioned in the prior request, single users uploading a large volume of untagged images is relatively rare. Later today or tomorrow I will make the changes to facilitate this grouping and, assuming there are no objections, resume testing to ensure it combines the notifications correctly. On your other question, what about the edit summaries would you like changed? Best, Mifter Public (talk) 17:11, 8 February 2017 (UTC)
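Illustrative note: the grouping rule described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the bot's actual code; the function name `add_notice`, the constant `SECTION_TITLE`, and the exact header wording are hypothetical. It assumes an intervening message always starts its own level 2 section, so a message left without a header would not be detected.

```python
import re

# Hypothetical header text; the real template wording may differ.
SECTION_TITLE = "Image uploaded without a license"

# Matches level 2 headers ("== Title ==") only, not level 3 ("=== Title ===").
LEVEL2_HEADER = re.compile(r"^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$", re.MULTILINE)


def add_notice(talk_text: str, filename: str, body: str) -> str:
    """Return the talk page text with a notice for `filename` appended.

    If the last level 2 header on the page is already the bot's own header,
    the notice is added beneath it as a level 3 subsection. Otherwise a new
    level 2 section is started.
    """
    entry = f"=== {filename} ===\n{body}\n"
    headers = LEVEL2_HEADER.findall(talk_text)
    if headers and headers[-1] == SECTION_TITLE:
        # Group under the existing section.
        return talk_text.rstrip("\n") + "\n\n" + entry
    prefix = talk_text.rstrip("\n")
    if prefix:
        prefix += "\n\n"
    return f"{prefix}== {SECTION_TITLE} ==\n{entry}"
```

Calling `add_notice` twice in a row on the same page yields one level 2 section with two level 3 subsections; a new level 2 section from another editor in between causes a fresh section to be started.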
I've made the changes to the templates so as to combine multiple notifications and am resuming supervised testing to verify that the templates and bot behave as expected. Best, Mifter (talk) 21:56, 8 February 2017 (UTC)
I don't think it will be a good idea to add a template for each file. It would be better to add one template with a list of files, as User:CategorizationBot did in the past or as MediaWiki:VisualFileChange.js does.
The edit summaries complain about the source when it is actually the license that is missing.
EugeneZelenko (talk) 15:00, 9 February 2017 (UTC)
I will switch the edit summaries momentarily. As for adding a list of files as opposed to separate notifications for separate images without licenses, I do not believe it would be practical in this case. If the bot finds a second or third image without a source a few hours or days later (likely after the initial notice has been viewed), I am concerned that a user might not notice that another image has been added to a previously read notification, or that they will not understand the continuing importance of correctly tagging and licensing their images. While CategorizationBot was able to integrate multiple tags, this bot does not possess the same capability (my previous attempts to integrate such a capability all regrettably did not succeed). Additionally, CategorizationBot was arguably tackling a less time-sensitive issue, in that adding categories can be done at a user's leisure while images without license tags are deleted within a week. From what I can see, VisualFileChange combines multiple notifications only for files that were tagged/identified at the same time and in the same action; if you perform two separate actions it generates separate notifications (my unscientific test), just as would happen if a user tagged multiple images at different times. Over the past few days of testing the bot has performed exactly as designed, and I believe how the bot handled two untagged images here addresses both the concerns you raised and the current technical limitations of the bot, considering that it is very uncommon for users to upload large swaths of untagged images (in contrast to uncategorized images, where many users uploaded many uncategorized images over time) and the norm is only one or two images. Best, Mifter Public (talk) 17:11, 9 February 2017 (UTC)


Because this task was originally intended for Commons:Bots/Requests/YiFeiBot_(13) and eventually changed to adding tracking categories instead of tagging (which are done semi-manually within the tracking categories), would you elaborate on the internal logic of the bot? More specifically:

  • newly uploaded images -- from which source of information does the bot get the pages from (db query? rc?) and what's the threshold in which the bot considers the image new?
  • ensure they have a license tag -- what logic does the bot use in order to determine if the file description page contains a license tag?
  • (extension) A custom fork of pywikipedia -- if I understand your "fork" correctly, and you seem to be the only maintainer that has shell access to the bot's tool labs account, how confident are you in maintaining the fork and keeping the bot from being abandoned, considering the fairly frequent breaking changes to the API?
  • And most importantly: what has changed since the last bot request?

--Zhuyifei1999 (talk) 18:05, 9 February 2017 (UTC)

I'd be happy to. To your first point, I wasn't previously familiar with YiFeiBot, though from reading the bot request I don't believe that the task I'm requesting is exactly the same, even if it does appear pretty similar. I began running this bot (in a much, much older form) in 2008, doing substantially the same task as I'm asking for reapproval of now. Specifically, from looking through your bot request it looks like your bot was designed to analyze all images on Commons to verify that they are correctly tagged. This bot only checks recently uploaded files (though it could analyze all files on Commons if set to do so with some minor reconfiguration). To accomplish that it queries the API for the most recently uploaded files in the upload log. For testing purposes I currently have the bot set to check the last 600 new files (which is roughly the last half hour of uploads) as the threshold of newness; that number can of course be adjusted as needed based on upload volume, run frequency, etc. To determine if an image has a license tag it analyzes the text to see if it has a template that is included within Category:License tags. If it does not find a license template it then analyzes the other templates on the page against a whitelist of templates like Template:Information and the like, which are widely used but are not actual license tags. If it finds a template that is neither a recognized license tag nor on the whitelist, it reports the file to the log for verification. This is designed to minimize false positives, as it is conceivable that a legitimate license template is present but not correctly categorized, etc. Only if the bot detects no license tag and no other non-whitelisted templates does it tag the image as not having a license tag, notify the uploader, and move on.
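Illustrative note: the three-way decision described above (license found, unknown template logged for review, or file tagged as unlicensed) might be sketched like this. It is a simplified sketch, not the logic in checkimages.py; the names `Verdict`, `normalize`, and `classify` are hypothetical, the license and whitelist sets are passed in by hand rather than read from Category:License tags, and the regex only approximates template parsing.

```python
import re
from enum import Enum


class Verdict(Enum):
    HAS_LICENSE = "has_license"    # a known license template is present
    NEEDS_REVIEW = "needs_review"  # unknown template found: log, do not tag
    NO_LICENSE = "no_license"      # only whitelisted templates, or none at all


# Captures the template name after "{{", up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+?)\s*(?=\||\}\})")


def normalize(name: str) -> str:
    """Normalize a template name: underscores, namespace prefix, first letter."""
    name = name.strip().replace("_", " ")
    if name.lower().startswith("template:"):
        name = name[len("template:"):].strip()
    return name[:1].upper() + name[1:]


def classify(wikitext: str, license_templates: set, whitelist: set) -> Verdict:
    """Classify a file description page by the templates it transcludes."""
    names = {normalize(m) for m in TEMPLATE_NAME.findall(wikitext)}
    if names & license_templates:
        return Verdict.HAS_LICENSE
    if names - whitelist:
        # Something unrecognized is on the page; err on the side of caution.
        return Verdict.NEEDS_REVIEW
    return Verdict.NO_LICENSE
```

Only the `NO_LICENSE` verdict would lead to tagging and notifying; `NEEDS_REVIEW` corresponds to the log entries mentioned earlier in the discussion.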
Your second question is highly relevant, as in the past this is what has stopped the bot from running, back in 2008, 2011, 2013, and 2014 (this also includes the English Wikipedia, where the bot also runs). The pattern was not lost on me when I set about resurrecting the bot (again). Being perfectly honest, with my prior schedule I didn't have time to keep up with the API changes that caused the bot to break, as by the time I would get it back up something else would have changed. This time, being aware of this, I have held much closer to the main pywikibot source, which allows me to integrate changes from their end much more quickly. I have done some testing and I should be able to handle changes to the API with considerably less effort than in the past, as I can plug-and-play a decent chunk of the vanilla pywikibot source without having to reinvent the wheel myself each time the API changes.
To your final point, not that much has changed. The goal is still the same, though the script is better able to detect licenses and requires a little less in terms of resources to run. The biggest change is my decision to hold as close as possible to the vanilla pywikibot source to ensure the bot is able to stay operational.

Let me know if you have any other questions. Best, Mifter Public (talk) 19:39, 9 February 2017 (UTC)
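Illustrative note: the upload-log lookup mentioned in the reply above could be expressed against the MediaWiki Action API roughly as below. This is a sketch under assumptions, not how the bot or pywikibot actually issues the request; the helper names `upload_log_url` and `titles_from_response` are hypothetical. It only builds the request URL and reads a response dictionary, so it makes no network calls. Since `lelimit` is capped per request (500 for accounts without higher limits), checking 600 files would take more than one request using the `lecontinue` value returned by the API.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def upload_log_url(limit=600, batch=500, lecontinue=None):
    """Build a list=logevents query URL for the most recent upload log entries."""
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "upload",           # only upload log entries
        "leprop": "title|user|timestamp",
        "lelimit": min(limit, batch),  # respect the per-request cap
        "format": "json",
    }
    if lecontinue:
        params["lecontinue"] = lecontinue  # continuation token from a prior response
    return f"{API_URL}?{urlencode(params)}"


def titles_from_response(data):
    """Extract file titles from a decoded JSON response."""
    events = data.get("query", {}).get("logevents", [])
    return [event["title"] for event in events if "title" in event]
```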

Sounds sane. I looked at your source code and found checkimages.py. If you'd like, I can file a patch upstream to add a setting for the maximum number of notifications in a single bot run. I personally hate spammy emails :) --Zhuyifei1999 (talk) 20:00, 9 February 2017 (UTC)
Thanks, that would be great so long as it's not too much trouble for you. And honestly I think everyone hates spammy emails :). Best, Mifter Public (talk) 20:27, 9 February 2017 (UTC)
Filed phab:T157728, will do the coding tomorrow. --Zhuyifei1999 (talk) 20:45, 9 February 2017 (UTC)
I submitted a patch and patch 2. Are you able to test them out? They haven't been merged to master yet (i.e. not approved), though. --Zhuyifei1999 (talk) 11:01, 10 February 2017 (UTC)
I'll grab the code now and run some simulated batches and then a few actual ones. Thanks again for writing this :). Mifter (talk) 20:08, 10 February 2017 (UTC)

The simulated runs did not have any issues so I am now running a series of supervised live tests. I've set the limit at three notifications for testing but that can be adjusted as needed. Best, Mifter (talk) 18:54, 11 February 2017 (UTC)

@Mifter: Sorry, I forgot to note here: the newest patch is currently patch 3, which fixes some issues found during code review of patch 2; you might want to use it instead. Patch 3 is still under review. Patch 4 may be uploaded once I get clarification on the current negative comment regarding variable naming. --Zhuyifei1999 (talk) 19:00, 11 February 2017 (UTC)
No problem, thanks for the heads up. I'll look at patch 3 and wait to see what comes up with a potential patch 4. Thanks again, Mifter (talk) 19:02, 11 February 2017 (UTC)
Patch set 5 has been merged into master (i.e. approved and it's now vanilla :) --Zhuyifei1999 (talk) 14:34, 12 February 2017 (UTC)
Fantastic, I'll grab a copy and run some additional test runs to make sure everything works as intended (likely later today or tomorrow depending on my schedule). Mifter (talk) 23:37, 12 February 2017 (UTC)
I've been busier than I expected; I will probably have to run the tests tomorrow due to timing. Mifter (talk) 01:09, 14 February 2017 (UTC)

Simulated testing has gone well; I will be running some additional supervised live tests shortly. Mifter (talk) 00:47, 16 February 2017 (UTC)

I completed a few live supervised test runs last night and everything appears to have run as designed. The bot tagged a few images without licenses and did not add more than the configured maximum number of notifications to any user's talk page during a single run. Mifter Public (talk) 20:50, 21 February 2017 (UTC)
Edit summaries are still not fixed. User:CategorizationBot-style notices would be better than subsections. --EugeneZelenko (talk) 15:19, 22 February 2017 (UTC)
Sorry about that; when I was integrating Zhuyifei1999's patch I forgot to go and modify the edit summary. It is all changed and corrected now (the edit summary now reads "Bot: Requesting license information."). As for CategorizationBot-style notifications, as I listed in more detail above, it isn't technically feasible with the current design of the bot (last time I tried to implement it I was unsuccessful), and I have concerns that even if it were feasible, integrating notifications in that way over a period of time makes them harder to locate for a new user, among the other concerns I raised above. I believe the patch Zhuyifei1999 wrote, together with using subheaders on the talk page, strikes the best balance between the technical limitations, the need for notices to be front and center, and not flooding a user's talk page in a single bot run. It does not appear that users frequently upload mass amounts of new images without license tags (users claiming incorrect licenses is another issue, however), and even if they did, the new limiter would prevent the bot from inundating a user's talk page in a single run. Mifter (talk) 01:22, 23 February 2017 (UTC)
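Illustrative note: the per-run limiter discussed in this thread can be pictured with a small counter like the one below. This is a hedged sketch of the general idea only, not the patch merged into pywikibot; the class name `NotificationLimiter` and its `allow` method are hypothetical, and the default of 3 simply mirrors the test setting mentioned above.

```python
from collections import Counter


class NotificationLimiter:
    """Caps how many talk page notices each user receives in one bot run."""

    def __init__(self, max_per_user=3):
        self.max_per_user = max_per_user
        self._sent = Counter()  # notices already sent per user in this run

    def allow(self, user):
        """Return True and record the notice if `user` is still under the cap."""
        if self._sent[user] >= self.max_per_user:
            return False
        self._sent[user] += 1
        return True
```

A new `NotificationLimiter` would be created at the start of each run, so the counts reset between runs; files can still be tagged when `allow` returns False, only the talk page notice is skipped.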

Could one please sum up what is missing here and what could be the next step? Thank you. --Krd 08:51, 3 March 2017 (UTC)

@Krd: - My understanding of the current situation is that Eugene would like the bot's notification system to mirror the notification style of User:CategorizationBot. As I mentioned above, I don't believe that is needed considering the relatively small number of users who upload large amounts of files without license tags, the differences between an image having no categories and no license tag, and that, now that we have implemented Zhuyifei's patch, the bot will not provide more than 3 (number is changeable) notifications during a single run. Further, the bot does consolidate back-to-back notices under the same level two header (using level three headers for the notifications). Additionally, the way the bot is currently written it is not technically feasible to implement such a system; last time I attempted it I was unsuccessful (Zhuyifei's patch is an effort to achieve a similar outcome without rewriting the entire code). Aside from that, the bot has performed exactly as intended during its supervised, semi-supervised, and brief unsupervised testing. As a data point, this bot is also currently running in an almost identical configuration on the English Wikipedia. In terms of next steps I would advocate for re-flagging and re-approval, as the bot is running as designed.
However, in fairness to Eugene, it looks like he wants a full integration of the notifications as mentioned above. In response to his concern, we have done as much as we can within the current technical framework to find a compromise, address his concern, and prevent the bot from inundating a user with notifications, through a consolidation of notices under the same heading and a hard limit of x notifications per run. If we are going to require CategorizationBot-style notifications then in all honesty this bot will likely not be re-approved, as that is not technically feasible and would require almost an entire rewrite of the code. If that is something we are going to require I would honestly be disappointed, as I believe this bot can provide quite a bit of value to the project by helping to reduce the amount of manual review that must be conducted for new files. Best, Mifter (talk) 23:47, 4 March 2017 (UTC)
I could live with having this task approved. @EugeneZelenko: Please advise if you agree. Thank you. --Krd 14:20, 11 March 2017 (UTC)
I prefer to have compact notifications in case of multiple problems. We have bots which work that way, so it would be a good idea to borrow code from there. --EugeneZelenko (talk) 15:00, 11 March 2017 (UTC)
AFAIK, checkimages.py cannot do this without an almost complete rewrite. As the max notifications per run has been enabled, I don't think compact notifications are a blocker for this task anymore. --Zhuyifei1999 (talk) 05:11, 12 March 2017 (UTC)

I'm going to call this approved. Although there is a minor issue left, I agree this shouldn't be a blocker for the whole task. If the issue can be resolved later, please do accordingly. --Krd 07:56, 18 March 2017 (UTC)