Commons:Bots/Requests/OptimusPrimeBot
Operator: Don-vip
Bot's tasks for which permission is being sought:
#Spacemedia is a tool I'm writing that will continuously harvest the media libraries of various space agencies in order to find free media not yet uploaded to Wikimedia Commons. The tool runs on Toolforge (Kubernetes/jdk11 cluster) and will use the OptimusPrimeBot account to automatically upload newly discovered media. This project is meant to supersede Fæ's European Space Agency batch upload project. See Commons:Spacemedia for more details. Currently, about 200k files are considered, mainly from NASA. The tool tries its best to pre-categorize the imported media, fill in the description and add the relevant templates. Space agencies' dedicated image libraries usually provide the metadata needed to get precise information. Media published on Flickr, however, lack this metadata. In that case, the categories are retrieved from a manual mapping I'm building from the Flickr album name to Commons categories. For some agencies, there is a little logic that adds categories based on the media title and/or description (when I checked, it didn't raise too many false positives). In all cases, the tool adds "review" maintenance categories so that the imported media can be checked manually and missing categories added, on the model of Category:ESA images (review needed). Media that will be imported (with the categories and templates being used) can be seen at https://tools.wmflabs.org/spacemedia/
Phase 1: The tool is not yet completed but I need the permission for my bot to test upload code, check that the media are correctly categorized, everything is OK, etc. During this phase I will manually import a few dozen files max from the smallest source I have: Category:Air Force Space Command (65 files only). This phase should last a few weeks.
Phase 2: Once I am confident enough in my code and have all the required feedback, I will gradually enable automatic upload for the other image sources one by one, from the smallest to the largest:
- Category:Air Force Space Command (from Flickr): 65 files
- Category:Space and Missile Systems Center (from Flickr): 121 files
- Category:NASA (from SIRS): 191 files
- Category:European Space Agency (library): 643+ files
- Category:Korea Aerospace Research Institute (library): 960+ files
- Category:International Astronomical Union (library): 1357+ files
- Category:European Space Agency (from Flickr): 1640+ files
- Category:Deutsches Zentrum für Luft- und Raumfahrt (from Flickr): 2154+ files
- Category:Hubble Space Telescope (NASA library): 4389+ files
- Category:Hubble Space Telescope (ESA library): 4715+ files
- Category:European Southern Observatory: 12634+ files
- Category:NASA (from Flickr): 15534+ files
- Category:NASA (library): 175793+ files
Automatic or manually assisted: Manually assisted first, until I'm confident enough to enable automatic upload
Edit type (e.g. Continuous, daily, one time run): daily first, then continuous
Maximum edit rate (e.g. edits per minute): 12 uploads per minute
Bot flag requested: (Y/N): Y
Programming language(s): Java
vip (talk) 22:27, 19 April 2020 (UTC)
Discussion
- Please make test run. --EugeneZelenko (talk) 14:13, 20 April 2020 (UTC)
- I tried, but the MediaWiki API returned an error when I attempted an upload with my bot account, stating that the account doesn't belong to the "Bots" group. Isn't this group set after I make the request here? vip (talk) 15:04, 20 April 2020 (UTC)
- The exact error I receive when I try to upload a single file (by URL) is:
code=permissiondenied, info=The action you have requested is limited to users in one of the groups: [[Commons:Bots|Bots]], [[Commons:Administrators|Administrators]], [[Commons:Patrol|Patrollers]], [[Commons:Autopatrollers|Autopatrollers]], [[Commons:License review|Image reviewers]], [[Commons:GWToolset users|GWToolset users]].
I read the API documentation and it doesn't mention any group restriction other than having the "upload" right, which my bot has. vip (talk) 17:05, 20 April 2020 (UTC)
- @Don-vip: I think that's because your bot was not in any of the groups that allow upload. Try now. Multichill (talk) 17:11, 20 April 2020 (UTC)
- @Multichill: Thanks! Unfortunately I still have the same error, even after restarting the application and fetching a new token. vip (talk) 17:39, 20 April 2020 (UTC)
- @Multichill: With your link I understand what's going on: I am uploading the file by URL and I see I need the special "upload_by_url" right. One thing I don't understand: is the "bot flag" and the "Bots group" the same thing? vip (talk) 17:55, 20 April 2020 (UTC)
- @Don-vip: Yes, you need to be in another group for that. Added auto patrollers. Let's see if that fixes it. Mind you that upload by url uses a whitelist (https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L15604), so make sure all of the domains are on it. If a user is in the bot group, the user can set the bot flag on an edit. If you go to Special:RecentChanges and also make bot edits visible, you'll see the "b" next to it. I might have missed it, but can you please share a link to your source code? Multichill (talk) 18:45, 20 April 2020 (UTC)
- @Multichill: Thanks! It works, my bot has just imported its first file: File:120914-F-HA556-220_(7987403749).jpeg. There are now some application issues I need to fix to complete the test run. You can take a look at my source code here: https://phabricator.wikimedia.org/source/tool-spacemedia/ Probably the most interesting class for you is this one: https://phabricator.wikimedia.org/source/tool-spacemedia/browse/master/src/main/java/org/wikimedia/commons/donvip/spacemedia/service/CommonsService.java Thanks for the whitelist, I'll review the domains I'll use and will create any Phabricator tasks I may need. I will let you know when the test run is complete. vip (talk) 18:59, 20 April 2020 (UTC)
- @Multichill: Only two domains were missing in the whitelist, I have created T250903 for IAU and T250904 for KARI. vip (talk) 11:18, 22 April 2020 (UTC)
- @EugeneZelenko: @Multichill: I have completed the test run by uploading 15 files. The first 10 I have uploaded individually to improve my code. The last 5 I have uploaded simultaneously to make sure the tool enforces a 5-second delay between successive uploads, as per wiki recommendations. Can you please take a look? vip (talk) 20:52, 21 April 2020 (UTC)
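The 5-second spacing described in this comment (which matches the requested maximum of 12 uploads per minute) could be enforced along the following lines. This is only a minimal sketch, not the tool's actual code: the class `UploadThrottle` and its method names are hypothetical, and it assumes all uploads go through a single shared instance in one JVM.

```java
import java.time.Duration;

/** Hypothetical helper (not from Spacemedia): keeps successive uploads at least minDelay apart. */
final class UploadThrottle {
    private final long minDelayMillis;
    private boolean hasUploaded = false;
    private long lastUploadMillis;

    UploadThrottle(Duration minDelay) {
        this.minDelayMillis = minDelay.toMillis();
    }

    /** Returns how many milliseconds must still elapse before the next upload is allowed. */
    synchronized long millisToWait(long nowMillis) {
        if (!hasUploaded) {
            return 0L; // nothing uploaded yet, no need to wait
        }
        long elapsed = nowMillis - lastUploadMillis;
        return Math.max(0L, minDelayMillis - elapsed);
    }

    /** Records the time of the latest upload. */
    synchronized void recordUpload(long nowMillis) {
        hasUploaded = true;
        lastUploadMillis = nowMillis;
    }

    /**
     * Blocks until an upload is allowed, then marks it as started.
     * The lock is held while sleeping, so concurrent callers queue up
     * and are released one at a time, each minDelay after the previous one.
     */
    synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        recordUpload(System.currentTimeMillis());
    }
}
```

Each uploading thread would call `acquire()` on the shared instance, for example one created with `new UploadThrottle(Duration.ofSeconds(5))`, right before sending its upload request, so that simultaneous requests are serialized rather than rejected.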
- Is it possible to create meaningful file names? --EugeneZelenko (talk) 14:48, 21 April 2020 (UTC)
- The file name is the same as the source. In this particular case, Air Force Space Command (AFSC) has lousy names. But AFSC is only 65 files. The 200k files from NASA/ESA will have much better names. I guess the file names would have been strictly identical if uploaded with other Flickr import tools. vip (talk) 20:49, 21 April 2020 (UTC)
- Ah, the file names are VIRIN identifiers. I can detect those for US military sources and replace them with the album title when the media is in an album (which is not always the case). vip (talk) 20:59, 21 April 2020 (UTC)
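The VIRIN detection proposed here could look roughly like the sketch below. It makes several assumptions that the discussion does not confirm. The pattern is inferred from the single example "120914-F-HA556-220" (six-digit date, one service letter, five-character ID, three or four digit sequence). The class `VirinFileNames` and its methods are hypothetical names. Appending the VIRIN to the album title is my own choice, made so that several files from the same album do not end up with the same name.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not from Spacemedia): detects VIRIN-like titles and builds a more readable name. */
final class VirinFileNames {
    // Assumed VIRIN shape, inferred from "120914-F-HA556-220":
    // YYMMDD - service letter - 5-character ID - 3 or 4 digit sequence number.
    private static final Pattern VIRIN =
            Pattern.compile("\\d{6}-[A-Z]-[A-Z0-9]{5}-\\d{3,4}");

    /** Returns true when the whole title is a VIRIN-like identifier. */
    static boolean isVirin(String title) {
        return title != null && VIRIN.matcher(title.trim()).matches();
    }

    /**
     * Replaces a VIRIN-only title with "album title (VIRIN)" when an album title
     * is available; otherwise returns the original title unchanged.
     */
    static String meaningfulTitle(String title, String albumTitle) {
        if (isVirin(title) && albumTitle != null && !albumTitle.isBlank()) {
            return albumTitle.trim() + " (" + title.trim() + ")";
        }
        return title;
    }
}
```

Media that is not in any album keeps its source name, which matches the "not always the case" caveat in the comment.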
- How will the bot differentiate files that are not available under a compatible license, but are mixed in with those that are? I'm considering ESA particularly...files such as [1], [2], [3], [4], [5] to name just a handful of recent additions to that site, and files on NASA sites that belong to other space agencies without free licenses. Also, I assume you'll skip any image featuring Public Domain Mark? How does the bot deal with non-obvious non-identical-hash duplicates, given that the space community on Commons already imports a lot of files from these sources, sometimes with cleanup or modification? — Huntster (t @ c) 12:56, 22 April 2020 (UTC)
- @Huntster: None of the ESA files you linked will be considered eligible by the tool. You won't find them at https://tools.wmflabs.org/spacemedia/esa/missing . As ESA does not yet provide an API to search by license, the tool performs the retrieval in two steps. First, it performs a search with "by-sa" (you can check it here). Then the license is actually checked in each result to avoid false positives, by looking for "BY-SA" or "COPERNICUS SENTINEL" in the description (you can check it here). Concerning NASA, I don't exclude any file right now. Do you have an example of a file that should not be imported? Concerning the "Public Domain Mark", do you mean on Flickr? Until I saw your message my tool would have imported them with a wrong template; I have just changed it to import them using Template:Flickr-public domain mark (you can check it here). Concerning the non-identical-hash duplicates, the tool does nothing in particular. T167947 needs to be resolved if we want to detect those cases; see User:Fæ/Imagehash for more details. vip (talk) 16:28, 22 April 2020 (UTC)
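The second step of the ESA check described in this comment, confirming the license from the description text, might be sketched as follows. The two marker strings come from the comment itself. Everything else is an assumption: the class name `EsaLicenseFilter`, the method name, and the decision to match case-insensitively.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not from Spacemedia): second-step license check on an ESA description. */
final class EsaLicenseFilter {
    // Markers quoted in the discussion; matching is case-insensitive by assumption.
    private static final List<String> FREE_MARKERS = List.of("BY-SA", "COPERNICUS SENTINEL");

    /** Returns true when the description mentions one of the free-license markers. */
    static boolean looksFree(String description) {
        if (description == null) {
            return false; // no description means the license cannot be confirmed
        }
        String upper = description.toUpperCase(Locale.ROOT);
        return FREE_MARKERS.stream().anyMatch(upper::contains);
    }
}
```

A search hit would only be kept as an upload candidate when `looksFree` returns true, so results matched by the "by-sa" search for unrelated reasons are dropped.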
- @Don-vip: Regarding NASA, they routinely host images from other agencies and commercial entities. I have no particular examples at the moment, but I would avoid uploading anything that doesn't mention "Credit: NASA/...". As for the Public Domain Mark, {{Flickr-public domain mark}} is not a valid license. Public Domain Marked images are not permitted on Commons. I can only wish you luck dealing with NASA images on Flickr...they are a morass of weird and often un-legal licensing (CC when they are legally PD, etc). — Huntster (t @ c) 17:38, 22 April 2020 (UTC)
- @Huntster: Thank you for the clarification. I have checked my database: I currently list 34413 files from Flickr, and only 731 of them use the Public Domain Mark, so I will simply skip them. I will also see how to detect non-NASA images. I naively believed they only hosted their own images... vip (talk) 18:07, 22 April 2020 (UTC)
- Sounds good. I'm more than happy to help clarify any off-nominal situations you encounter. Interpreting space-related license weirdness can be a chore. — Huntster (t @ c) 18:59, 22 April 2020 (UTC)
- @Huntster: I now ignore all Public Domain Mark pictures, which removes all pictures from the Space and Missile Systems Center. Do we agree that such pictures, when they are clearly the work of the U.S. Air Force (containing a credit such as "(U.S. Air Force photo/Sarah Corrice)"), can be imported? If yes, what should I do regarding the Flickr template on Commons? Is it enough to add Template:PD-USGov-Military-Air Force in that case to get the upload accepted by FlickrReview bot? Or should we simply forget Flickr and take the files from DVIDS? That sounds simpler/safer/more accurate. vip (talk) 21:45, 22 April 2020 (UTC)
- Both options are acceptable. So long as the credit line states "Military branch/Photographer" or even just the branch without photographer, it should be okay. You could import from Flickr with {{PD-USGov-Military-Air Force}} and ignore PDM, or simply import from DVIDS. I agree the latter is probably the simpler solution, and will always result in the cleanest file, since Flickr recompresses the image. — Huntster (t @ c) 12:31, 23 April 2020 (UTC)
- Mmm, not easy. I checked the Thule Air Base visit made in December 2017 by the Air Force Space Command Commander, with pictures taken by two photographers. For the 1st photographer we have 38 files on Flickr, but only 24 files on DVIDS. For the 2nd photographer we have 21 on Flickr but only 12 on DVIDS. Also, the VIRIN identifiers have been "renumbered" on Flickr, so they are wrong there. And you're right, DVIDS files are better (larger resolution). I will investigate the User:Fæ/Imagehash project to see if I can do a similar thing to choose the best file and avoid duplicate imports. vip (talk) 13:55, 25 April 2020 (UTC)
- @Huntster: It took me a while to find the correct settings and compute hashes for tens of thousands of files, but here I am. I have pretty good results: you can take a look at the 33 AFSPC files ignored from Flickr because they were found in DVIDS, and the 21 Flickr files considered for import because they were not found in DVIDS. For larger data sources, this is impressive. Out of the 20k NASA files harvested on Flickr, 5.6k are ignored because they were found in the main NASA image library. For ESA, 750 files out of 2500 are ignored for the same reason. vip (talk) 23:28, 15 May 2020 (UTC)
- Don-vip, very impressive! Out of curiosity, are the hashes of files on Commons that have been uploaded over checkable to avoid situations where a file has been cropped to remove a border, or something similar? (not saying it must, just curious as to the technical possibilities.) — Huntster (t @ c) 04:22, 16 May 2020 (UTC)
- Huntster: I'm not sure. I'm using the JImageHash library. I guess it should work for small crops of a few pixels, but that's just a wild guess. vip (talk) 16:10, 16 May 2020 (UTC)
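For readers unfamiliar with perceptual hashing, the comparison step behind this kind of duplicate detection can be illustrated with plain Java. This sketch does not use the JImageHash API and is not the tool's code. It assumes each image has already been reduced to a 64-bit perceptual hash by some library, and the class name, method names and threshold parameter are all hypothetical.

```java
/** Hypothetical helper (not JImageHash, not Spacemedia): compares two 64-bit perceptual hashes. */
final class PerceptualHashMatch {
    private static final int HASH_BITS = 64;

    /** Number of bit positions in which the two hashes differ. */
    static int hammingDistance(long hashA, long hashB) {
        return Long.bitCount(hashA ^ hashB);
    }

    /**
     * Treats two images as probable duplicates when the fraction of differing
     * bits does not exceed maxNormalizedDistance (a value between 0.0 and 1.0).
     */
    static boolean probablySameImage(long hashA, long hashB, double maxNormalizedDistance) {
        double normalized = hammingDistance(hashA, hashB) / (double) HASH_BITS;
        return normalized <= maxNormalizedDistance;
    }
}
```

Because a recompressed or slightly cropped copy usually changes only a few bits of such a hash, a small threshold can match near-duplicates that an exact file hash would miss. The right threshold depends on the hash algorithm and has to be tuned on real data, as the "find the correct settings" remark above suggests.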
Are there any unresolved issues? --Krd 10:08, 27 May 2020 (UTC)
- Approved. --Krd 06:50, 7 June 2020 (UTC)