Commons:Bots/Requests/DPLA bot

DPLA bot (talk · contribs)

Operator: Dominic (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: To upload media files from the Digital Public Library of America ([1])

Automatic or manually assisted: Automatic, unsupervised

Edit type (e.g. Continuous, daily, one time run): Runs in batches

Maximum edit rate (e.g. edits per minute): Limited only by file sizes

Bot flag requested: (Y/N): Y

Programming language(s): Python

Discussion

I am planning to undertake file uploads for DPLA and its contributing institutions, which encompass (potentially) millions of compatible files. This bot will be designed to be substantially similar to User:US National Archives bot, through which I have already made hundreds of thousands of uploads to Wikimedia Commons. It is built in Python, using Pywikibot, and takes its data from the DPLA API as well as upstream contributor APIs. It relies on RightsStatements.org or CC license URIs to ensure that only files marked by their originating institutions with compatible licenses are uploaded. Dominic (talk) 14:20, 18 December 2019 (UTC)
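
For illustration only, a rights filter of the kind described above might look like the following minimal Pywikibot sketch. The allowlist of URIs and the record field names here are assumptions made for the example, not the bot's actual configuration.

    import pywikibot

    # Illustrative subset of rights URIs considered compatible with Commons.
    COMPATIBLE_RIGHTS = {
        "http://creativecommons.org/publicdomain/mark/1.0/",
        "http://creativecommons.org/publicdomain/zero/1.0/",
        "http://rightsstatements.org/vocab/NoC-US/1.0/",
    }

    def is_uploadable(record):
        # "rights" is an assumed field name for the item's rights URI.
        return record.get("rights") in COMPATIBLE_RIGHTS

    def upload(record, media_url, description_wikitext):
        site = pywikibot.Site("commons", "commons")
        page = pywikibot.FilePage(site, "File:" + record["title"] + ".jpg")
        # Upload directly from the contributing institution's URL; the wikitext
        # becomes the file description page (e.g. {{Information}} plus {{DPLA}}).
        site.upload(page, source_url=media_url,
                    comment="DPLA bot upload (sketch)",
                    text=description_wikitext)

In the real workflow the operator's own field mapping, file naming, and duplicate checks would replace the simplified pieces above.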

Is it possible to combine individual pages into PDF/DjVu? Is it possible to add categories for mass series? Please use language tags for the title and description fields. --EugeneZelenko (talk) 15:33, 19 December 2019 (UTC)
@EugeneZelenko: Responding to each of your three points below.
  • I'm not sure about the PDF/DjVu question. Right now, I am just thinking about uploading the files as provided by institutions, in the same form they make them available themselves. Some are already PDFs (though none in my test run) and some are not, but this is essentially their decision, and Commons users can always combine them on their end as needed. Maybe this could become part of the workflow in the future if it's highly desired by the community, but right now I hope it's okay that my code is concerned only with the file transfer, and not with altering or creating multimedia files.
  • Thanks, I have added categories for both DPLA and the partners to the {{DPLA}} template I am using. There may be a possibility for more categories based on subjects, locations, creators, etc., but right now I haven't matched those entities on Wikidata, which would be the most efficient way to do that. That is a goal for the future.
  • These fields are typically in English, as they all come from US-based institutions. However, this is not indicated explicitly in the source data, and there is certainly a chance of some non-English materials, so I am hesitant to hardcode that. I am open to suggestions for how to deal with this issue. Dominic (talk) 15:23, 20 December 2019 (UTC)
@EugeneZelenko: Any other thoughts? This has been sitting for a month without comment. Dominic (talk) 15:20, 28 January 2020 (UTC)
I still think that the language of the plain-text fields should be specified explicitly. --EugeneZelenko (talk) 15:57, 28 January 2020 (UTC)
@EugeneZelenko: Since the original dataset does not specify language, are you suggesting I should mark them all as English and accept some small error rate? I assumed it was preferable to leave it unspecified rather than introduce errors, so I want to be sure that is what you are suggesting. Dominic (talk) 16:53, 28 January 2020 (UTC)
The data looks very regular and repetitive, so you could review it before each batch upload to ensure the correct language tag. --EugeneZelenko (talk) 21:20, 28 January 2020 (UTC)
This would probably not scale well, as the dataset potentially contains millions of records. I am also hesitant to make a personal determination about a data point that the library catalogers who created the original description did not make for themselves. Dominic (talk) 21:43, 28 January 2020 (UTC)
Maybe there are Python libraries that try to detect the text language? --EugeneZelenko (talk) 15:21, 29 January 2020 (UTC)
Just to back up for a minute, is there a Commons policy requiring language codes for titles? I am not sure why this is being raised at the bot request, because we would like to simply upload the data as described by the source repositories, who are the authorities on their own collections, rather than spend our time building a language-detection algorithm to augment metadata that we are merely aggregating and are not claiming to be the experts on. Dominic (talk) 17:56, 29 January 2020 (UTC)
Please keep in mind that Commons is a multi-language project. Since the bot will upload many files, it is very reasonable to create the best possible descriptions, and that includes language tags. Such tags will help with the creation of Structured Data descriptions. --EugeneZelenko (talk) 15:44, 30 January 2020 (UTC)
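
As a rough illustration of the detection approach suggested above, one conservative option would be to add a language template only when an automatic detector is confident, and to leave the text untagged otherwise so that no incorrect language claim is introduced. A minimal sketch, assuming the langdetect library (not something the bot is confirmed to use):

    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make detection deterministic across runs

    def tag_language(text, min_confidence=0.95):
        # Wrap a title or description in a Commons language template such as
        # {{en|...}} only when detection is confident; otherwise return it as-is.
        try:
            best = detect_langs(text)[0]  # highest-probability guess
        except LangDetectException:
            return text
        if best.prob >= min_confidence:
            return "{{%s|%s}}" % (best.lang, text)
        return text

With a threshold like this, clearly English records would receive {{en|...}} tags while short or ambiguous strings would be left untagged for later review, which speaks to both the request for language tags and the concern about introducing errors.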
Obvious support - Dominic, the operator of this bot, is a veteran of uploading content to Commons, Wikidata and other projects for many years now in his previous role as Wikimedian in Residence for the U.S. National Archives. He has done so professionally, responsibly and without controversy throughout that period, and, even more, he has given talks and helped train generations of Wikimedians in Residence to perform similar functions. Therefore, please speedily approve this; it should be a mere formality given that he pioneered the entire space of WiR and responsible engagement. What he describes for DPLA bot is completely consistent with his previous work and technical expertise. -- Fuzheado (talk) 18:26, 29 January 2020 (UTC)

I'm closing this as approved. (The bot appears to be already running at full scale anyway.) --Krd 15:11, 3 February 2020 (UTC)[reply]