Commons:Bots/Requests/OptimusPrimeBot

From Wikimedia Commons, the free media repository

OptimusPrimeBot (talk · contribs)

Operator: Don-vip (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought:

Spacemedia is a tool I'm writing that will continuously harvest the media libraries of various space agencies in order to find free media not yet uploaded to Wikimedia Commons. The tool runs on Toolforge (Kubernetes/jdk11 cluster) and will use the OptimusPrimeBot account to automatically upload newly discovered media. This project is meant to supersede Fæ's European Space Agency batch upload project. See Commons:Spacemedia for more details. Currently, about 200k files are being considered, mainly from NASA. The tool tries its best to pre-categorize the imported media, fill in the description and add the relevant templates. Space agencies' dedicated image libraries usually provide the metadata needed to get precise information. Those published on Flickr, however, lack this metadata. In that case, the categories are retrieved from a manual mapping I'm building from Flickr album names to Commons categories. For some agencies, a small amount of logic adds categories based on the media title and/or description (where I have checked that it doesn't raise too many false positives). In all cases, the tool adds "review" maintenance categories so that the imported media can be manually checked and missing categories added, on the model of Category:ESA images (review needed). Media that will be imported (with the categories and templates being used) can be seen at https://tools.wmflabs.org/spacemedia/
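
As a rough illustration of the categorization described above, here is a minimal Java sketch. It is not the tool's actual code: the class name, the example album and category, and the reviewCategory helper are all hypothetical. Only the idea of a hand-maintained album-to-category mapping, plus a "review needed" category added to every import, comes from the description.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CategoryMappingSketch {

    // Hypothetical hand-maintained mapping from Flickr album names
    // to Commons categories (placeholder entries, not real data).
    static final Map<String, List<String>> ALBUM_TO_CATEGORIES = Map.of(
            "Example album", List.of("Example Commons category"));

    // Hypothetical naming scheme, modelled on "ESA images (review needed)".
    static String reviewCategory(String agency) {
        return agency + " images (review needed)";
    }

    // Returns the mapped categories for the album (if any), followed by
    // the review category, which is added in all cases.
    static Set<String> categoriesFor(String agency, String albumName) {
        Set<String> categories = new LinkedHashSet<>(
                ALBUM_TO_CATEGORIES.getOrDefault(albumName, List.of()));
        categories.add(reviewCategory(agency));
        return categories;
    }
}
```

With this shape, an album missing from the mapping still ends up in the review category, so a human can add the missing categories later.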

Phase 1: The tool is not yet complete, but I need permission for my bot so that I can test the upload code, check that the media are correctly categorized, and so on. During this phase I will manually import a few dozen files at most from the smallest source I have: Category:Air Force Space Command (65 files only). This phase should last a few weeks.

Phase 2: Once I am confident enough in my code and have all the required feedback, I will gradually enable automatic upload for the other image sources one by one, from the smallest to the largest:

Automatic or manually assisted: Manually assisted first, until I'm confident enough to enable automatic upload

Edit type (e.g. Continuous, daily, one time run): daily first, then continuous

Maximum edit rate (e.g. edits per minute): 12 uploads per minute

Bot flag requested: (Y/N): Y

Programming language(s): Java

vip (talk) 22:27, 19 April 2020 (UTC)

Discussion

  • How will the bot differentiate files that are not available under a compatible license, but are mixed in with those that are? I'm thinking of ESA in particular: files such as [1], [2], [3], [4], [5], to name just a handful of recent additions to that site, and files on NASA sites that belong to other space agencies without free licenses. Also, I assume you'll skip any image featuring the Public Domain Mark? How does the bot deal with non-obvious, non-identical-hash duplicates, given that the space community on Commons already imports a lot of files from these sources, sometimes with cleanup or modification? Huntster (t @ c) 12:56, 22 April 2020 (UTC)
    • @Huntster: None of the ESA files you linked will be considered eligible by the tool. You won't find them at https://tools.wmflabs.org/spacemedia/esa/missing . As ESA does not yet provide an API to search by license, the tool performs the retrieval in two steps. First it performs a search for "by-sa" (you can check it here). Then the license is actually checked in each result to avoid false positives, by checking for the presence of "BY-SA" or "COPERNICUS SENTINEL" in the description (you can check it here). Concerning NASA, I don't exclude any file right now. Do you have an example of a file that should not be imported? Concerning the "Public Domain Mark", do you mean on Flickr? Until I saw your message my tool would have imported them with the wrong template. I have just changed it to import them using Template:Flickr-public domain mark (you can check it here). Concerning the non-identical-hash duplicates, the tool does nothing in particular. T167947 needs to be resolved if we want to detect those cases; see User:Fæ/Imagehash for more details. vip (talk) 16:28, 22 April 2020 (UTC)
      • @Don-vip: Regarding NASA, they routinely host images from other agencies and commercial entities. I have no particular examples at the moment, but I would avoid uploading anything that doesn't mention "Credit: NASA/...". As for the Public Domain Mark, {{Flickr-public domain mark}} is not a valid license. Public Domain Marked images are not permitted on Commons. I can only wish you luck dealing with NASA images on Flickr...they are a morass of weird and often un-legal licensing (CC when they are legally PD, etc). Huntster (t @ c) 17:38, 22 April 2020 (UTC)
        • @Huntster: Thank you for the clarification. I have checked my database: I currently list 34413 files from Flickr, only 731 of which use the Public Domain Mark. I will simply skip them. I will also see how to detect non-NASA images. I naively believed they only hosted their own images... vip (talk) 18:07, 22 April 2020 (UTC)
          • Sounds good. I'm more than happy to help clarify any off-nominal situations you encounter. Interpreting space-related license weirdness can be a chore. Huntster (t @ c) 18:59, 22 April 2020 (UTC)
            • @Huntster: I now ignore all Public Domain Mark pictures, which removes all pictures from the Space and Missile Systems Center. Do we agree that such pictures, when they are clearly the work of the U.S. Air Force (containing a credit such as "(U.S. Air Force photo/Sarah Corrice)"), can be imported? If yes, what should I do regarding the Flickr template on Commons? Is it enough to add Template:PD-USGov-Military-Air Force in that case to get the upload accepted by the FlickrReview bot? Or should we simply forget Flickr and take the files from DVIDS? That sounds simpler, safer and more accurate. vip (talk) 21:45, 22 April 2020 (UTC)
              • Both options are acceptable. So long as the credit line states "Military branch/Photographer" or even just the branch without photographer, it should be okay. You could import from Flickr with {{PD-USGov-Military-Air Force}} and ignore PDM, or simply import from DVIDS. I agree the latter is probably the simpler solution, and will always result in the cleanest file, since Flickr recompresses the image. Huntster (t @ c) 12:31, 23 April 2020 (UTC)
                • Mmm, not easy. I checked the Thule Air Base visit made in December 2017 by the Air Force Space Command Commander, with pictures taken by two photographers. For the first photographer we have 38 files on Flickr, but only 24 files on DVIDS. For the second photographer we have 21 on Flickr but only 12 on DVIDS. Also, the VIRIN identifiers have been "renumbered" on Flickr, so they are wrong there. And you're right, DVIDS files are better (higher resolution). I will investigate the User:Fæ/Imagehash project to see if I can do something similar to choose the best file and avoid duplicate imports. vip (talk) 13:55, 25 April 2020 (UTC)
                  • @Huntster: It took me a while to find the correct settings and compute hashes for tens of thousands of files, but here I am. I have pretty good results: you can take a look at the 33 AFSPC files ignored from Flickr because they were found in DVIDS, and the 21 Flickr files considered for import because they were not found in DVIDS. For larger data sources, this is impressive. Out of the 20k NASA files harvested on Flickr, 5.6k are ignored because they were found in the main NASA image library. For ESA, 750 files out of 2500 are ignored for the same reason. vip (talk) 23:28, 15 May 2020 (UTC)
                    • Don-vip, very impressive! Out of curiosity, can the hashes of files on Commons that have since been overwritten be checked, to avoid situations where a file has been cropped to remove a border, or something similar? (Not saying it must, just curious as to the technical possibilities.) Huntster (t @ c) 04:22, 16 May 2020 (UTC)
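
One way the hash comparison discussed in this thread could work is sketched below in Java. This is an assumption for illustration only: the thread does not say which hashing algorithm or threshold Spacemedia uses. The sketch uses a difference hash (dHash), a common perceptual hash, and the class name, method names and maxDistance parameter are all hypothetical.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

class PerceptualHashSketch {

    // Difference hash: shrink the image to 9x8 grayscale, then emit one bit
    // per pair of horizontal neighbours (1 when the left pixel is brighter).
    static long dHash(BufferedImage image) {
        BufferedImage small = new BufferedImage(9, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(image, 0, 0, 9, 8, null);
        g.dispose();

        long hash = 0L;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                int left = small.getRaster().getSample(x, y, 0);
                int right = small.getRaster().getSample(x + 1, y, 0);
                hash = (hash << 1) | (left > right ? 1L : 0L);
            }
        }
        return hash;
    }

    // Number of bits that differ between two 64-bit hashes.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Two files are treated as the same picture when their hashes differ
    // by at most maxDistance bits (the threshold is a tuning assumption).
    static boolean isProbableDuplicate(long a, long b, int maxDistance) {
        return hammingDistance(a, b) <= maxDistance;
    }
}
```

Under this scheme, a Flickr file whose hash falls within the chosen distance of a DVIDS file's hash would be skipped in favour of the DVIDS copy. A crop that removes a border changes the downscaled pixels and may push the distance past the threshold, so such cases may still need manual review.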

Are there any unresolved issues? --Krd 10:08, 27 May 2020 (UTC)

Approved. --Krd 06:50, 7 June 2020 (UTC)