User:Faebot/WMUK report

From Wikimedia Commons, the free media repository

Wikimedia UK report for Commons uploads and automation using the WMUK Macmini


A live report of uploads can be found using this catscan2 report. Some reference tables of ongoing uploads per month are available at User:Faebot/reports/catscan.

June 2014

  • Upgrade all jpeg images in Photochrom prints to the same resolution as the tiff (around 3x larger) by downloading each tiff locally and re-encoding it as jpeg. This will apply to around 6,000 images and will be slow, as it will rely on home broadband for both the tiff download and the jpeg upload.
  • Update Fellows collection with an additional few thousand images and create Fellows animations after some Village Pump discussion on maximum working GIF sizes (23MP total).
  • Finalize Photochrom prints collection, including sub-categorization by country. Announcement on Commons Village Pump. Parallel discussion on the German Wikipedia. An update to the GWT changed its functionality, and some significant post-upload housekeeping was run systematically.
  • Ongoing discussions with the Wellcome Digital Library and Cancer Research UK. It is likely that Faebot will be able to support these this summer.
  • The next tranche of Avionics uploads will rely on paying for forum membership. As my own payment for Wikimedia UK's membership was rejected by the Chief Executive, this is on hold, as proposals should be from a member of the charity.

May 2014

  • Photochrom prints collection: 20R - commenced, xml generation on WMUK kit, though the GWToolset runs the file upload.
  • Images from Fellows (auctioneers): 0R - announcement pending by Andy.
  • Re-visit of the Library of Congress British Cartoon Prints collection in order to generate a set of files that were previously skipped, primarily as they were over 100MB. Catscan report
  • NYPL maps upload exceeds 10,000 files. Notice posted at the Village pump to attract volunteers to start using them. -- (talk) 10:37, 9 May 2014 (UTC)

April 2014

Central Tibet manuscript illustration, c.1500. LACMA collection.
Update to my original upload of 20,000 photographs from the museum in 2013. Based on a request from PKM, a user well known for her GLAM work.
Experimenting with the Python "flickrapi" module made it possible to upload an additional 3,083 photographs from Flickr that were All Rights Reserved, but these images are covered by a specific email release. Though this is running as a direct upload, similar code may be useful on other projects to generate a dataset for use with the GWToolset.
Christmas Menu, Royal Army Medical Corps 1918. From the National Library of Ireland
A few hundred photographs had previously been loaded from this Flickrstream of high quality early photographs from Ireland. This was an update to add the latest 220 archive images, now that the NLI appears to have completed this Flickr Commons project.
The release of 20,000 high quality map images by the New York Public Library attracted interest from volunteers on email lists and the Village Pump, along with requests for a bot to help with consistent uploads. This is more complex than it first appears, as direct links to the high resolution tiffs cannot be automatically "scraped" from the public website. I signed up for access to the NYPL API, and use this to deduce the url to download from. This is limited to 10,000 transactions. See Commons:Batch uploading/NYPL Maps. Many of the images are greater than 100MB; these are currently being skipped. Two solutions are being considered: creating a Python chunk uploader module (so up to 1GB might be uploaded), or generating an XML dump so that the GLAMwiki toolset performs the upload (the latter would be much faster). Note that the average file size is large at ~50MB. Due to bandwidth constraints, and the fact that skipped files still result in 100MB being downloaded (as the API does not provide file sizes), the upload rate is of the order of just 5 per hour. The uploads have been popular, with Commons volunteers helping to crop and rotate the images.
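As an illustration of the chunk uploader idea, the sketch below only shows how a large file might be split into (offset, data) pieces. It is not the module that was considered; the chunk size and the `send_chunk` callback are assumptions, and the callback stands in for whatever request would actually pass each piece to the upload API.

```python
import io
from typing import BinaryIO, Callable, Iterator, Tuple

# Assumed chunk size (5 MiB); the report does not state one.
CHUNK_SIZE = 5 * 1024 * 1024


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, data) pairs that cover the whole stream in order."""
    offset = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield offset, data
        offset += len(data)


def upload_in_chunks(
    stream: BinaryIO,
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = CHUNK_SIZE,
) -> int:
    """Pass each chunk to send_chunk(offset, data) and return the bytes sent.

    send_chunk is a hypothetical callback; a real one would wrap the
    upload request and handle retries on a slow connection.
    """
    total = 0
    for offset, data in iter_chunks(stream, chunk_size):
        send_chunk(offset, data)
        total = offset + len(data)
    return total


if __name__ == "__main__":
    # Stand-in for a large tiff: ten bytes split into chunks of four.
    demo = io.BytesIO(b"abcdefghij")
    upload_in_chunks(demo, lambda off, data: print(off, len(data)), chunk_size=4)
```

Reading the stream piece by piece keeps memory use flat regardless of file size, which matters when individual tiffs are over 100MB.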
The GWToolset is now being used to complete this upload (up to 50 times faster), including the files previously skipped due to being over 100MB in file size. A category for images over 50,000,000 pixel resolution in this upload has been created as Commons does not create thumbnails for images over this extremely large resolution. There are likely to be a few hundred at this pixel size from the 20,000.
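A rough sketch of how a metadata XML file for a toolset-driven upload could be generated, with very large images flagged for a separate category. The element names, the category title, and the example URLs are all placeholders of my own rather than the format actually used; only the 50,000,000 pixel threshold comes from the text above.

```python
import xml.etree.ElementTree as ET

# Threshold quoted in the report: no thumbnails above this pixel count.
THUMBNAIL_LIMIT_PIXELS = 50_000_000

# Hypothetical category title, for illustration only.
LARGE_IMAGE_CATEGORY = "Example category for images over 50 megapixels"


def build_records(items):
    """Build a <records> tree with one <record> per item.

    Each item is a dict with title, source_url, width and height keys
    (assumed field names; the real mapping is chosen in the toolset).
    """
    root = ET.Element("records")
    for item in items:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "title").text = item["title"]
        ET.SubElement(record, "source_url").text = item["source_url"]
        # Flag images too large for thumbnail generation.
        if item["width"] * item["height"] > THUMBNAIL_LIMIT_PIXELS:
            ET.SubElement(record, "category").text = LARGE_IMAGE_CATEGORY
    return root


def to_xml(root) -> str:
    """Serialise the tree to a unicode string."""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example_items = [
        {"title": "Small map", "source_url": "https://example.org/a.tif",
         "width": 4000, "height": 3000},
        {"title": "Large map", "source_url": "https://example.org/b.tif",
         "width": 10000, "height": 6000},
    ]
    print(to_xml(build_records(example_items)))
```

Keeping the size check in the dataset generation step means oversized files can be categorised at upload time rather than found afterwards.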
This upload was paused when a few hundred images over 50 megapixels started to degrade WMF server performance in rendering tiff thumbnails. Operations are looking at ways of managing the operational load that thumbnail creation causes. The upload was cautiously restarted using a minimal number of processing threads.
After a request on Commons:Bots/Work_requests I have created a "live" SVG format checker to mark vector files with embedded raster graphics with {{BadSVG}}. This identifies around 10 to 40 files per day by sampling the Commons upload log every hour and checking the source code of all SVG files over 5KB in size. As this seems non-controversial, it is running on the WMUK supported kit, though at some future point it may be re-written and migrated to WMFlabs (note that an SVGbot used to run but lapsed back in 2010). The templated files are not marked as Wikimedia UK supported, as doing so might be seen as controversial given that Wikimedia UK kit did not upload the files. Note that this is currently running inconsistently due to an unsolved memory leak problem.
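The kind of check described above might look something like the following. This is a minimal sketch and not the code Faebot runs: it assumes "embedded raster" means an `<image>` element whose `href` is a non-SVG `data:image/...` URI, and it reads "5KB" as 5 × 1024 bytes. The function names are my own.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Assumed reading of the "over 5KB" rule in the report.
MIN_SIZE_BYTES = 5 * 1024


def has_embedded_raster(svg_bytes: bytes) -> bool:
    """Return True if any <image> element embeds non-SVG image data."""
    try:
        root = ET.fromstring(svg_bytes)
    except ET.ParseError:
        # Unparseable files are left for a human to look at.
        return False
    for elem in root.iter():
        # Tags look like '{namespace}image'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "image":
            continue
        href = elem.get(f"{{{XLINK_NS}}}href") or elem.get("href") or ""
        href = href.strip().lower()
        if href.startswith("data:image/") and not href.startswith("data:image/svg"):
            return True
    return False


def should_tag(svg_bytes: bytes) -> bool:
    """Apply the size threshold first, then the raster check."""
    return len(svg_bytes) > MIN_SIZE_BYTES and has_embedded_raster(svg_bytes)


if __name__ == "__main__":
    sample = (
        b'<svg xmlns="http://www.w3.org/2000/svg" '
        b'xmlns:xlink="http://www.w3.org/1999/xlink">'
        b'<image xlink:href="data:image/png;base64,' + b"A" * 6000 + b'"/></svg>'
    )
    print(should_tag(sample))
```

Checking the size before parsing keeps the hourly pass cheap, since small SVG files are skipped without being parsed.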
  • A request from Robin for English Wikipedia articles to be automatically created for SSSIs in Wales has resulted in Wikipedia:Bots/Requests_for_approval#Faebot. This may not get approved; how we stage these potential 900 articles in a non-controversial way is under discussion.
  • Due to other priorities on time, the Geograph project has been delayed. This will take some experimentation and research to decide how to approach the next level of place identification and whether to mass update from the Geograph project.

March 2014

  • Batch uploads of Collections in the Library of Congress started. Some of the tiff files are very high resolution scans of width greater than 5,000 pixels and filesize larger than 100MB. These include some collections of direct relevance to UK history. Most relevant to the UK are:
    • British Cartoon Prints Collection: 1017
      1780–1830 historic political cartoons
    • World War I posters in the Library of Congress: 1871
      As only a minority are British, these are not currently marked as sponsored by the chapter. Some posters, such as the more extreme propaganda posters from Germany, may have controversial content, however feedback from WikiProject MILHIST has been entirely supportive.

February 2014

  • A backlog of more than 10,000 photographs, part of the Aircraft forum images, was uploaded. This included around 5,000 from a Russian aircraft forum. These and later photographs will be identified as supported by Wikimedia UK. (Source code)
  • Images_from_MoD_uploaded_by_Fæ: 6033
    • Currently the Ministry of Defence releases around 35 to 70 high quality official photographs each month. Faebot should check the MoD site for new files daily, providing an estimated 600 high quality educational photographs each year. Fæ is in contact with the MoD Images library who will update him on changes in their API (which provides the metadata).

Initial


In January 2014 the WMUK charity supplied a Macmini to act as a server for selected Faebot projects. This includes: