User:Fæ/Project list/Biodiversity Heritage Library


DRAFT

For the general Biodiversity Heritage Library project page, refer to Commons:Biodiversity Heritage Library.

Scope

This project page documents User:Fæ's work on automated methods for identifying images in the Biodiversity Heritage Library (BHL) collections that are likely to be in scope for Wikimedia Commons, and for handling the issues that arise when uploading 100,000 or more image files.

Project scope and approach

This batch upload project builds on the experience of running the Internet Archive book plates batch upload project, which resulted in 500,000 uploaded book plates. Uploads are limited to "free" images. Because the BHL archives include restricted images, such as copyrighted (presumed all rights reserved) and non-commercial ones, filters must be applied so that only images verifiable as public domain, CC-BY-SA or under other licenses compliant with COM:L are uploaded. It is worth highlighting that BHL includes images with restricted licenses that, on further assessment, can be shown to be public domain by age.
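As an illustration only, the kind of license filter described above might look like the following minimal sketch. It is not the script used for this project. The function name `classify`, the pattern lists, and the `pd_cutoff_year` parameter are all hypothetical, and the license text is assumed to arrive as a free-text string or URL. Anything flagged as possibly public domain by age is meant for further assessment, not automatic upload.

```python
import re

# Hypothetical patterns for licenses treated as "free" in this sketch.
# They are illustrative and are not taken from BHL or from COM:L.
FREE_PATTERNS = [
    r"public domain",
    r"\bcc0\b",
    r"creativecommons\.org/publicdomain/",
    r"creativecommons\.org/licenses/by(-sa)?/",
    r"\bcc[- ]by(-sa)?\b",
]

# Hypothetical patterns that veto an otherwise "free" match
# (non-commercial, no-derivatives, all rights reserved).
RESTRICTED_PATTERNS = [
    r"\bnc\b",
    r"non-?commercial",
    r"\bnd\b",
    r"all rights reserved",
]


def classify(license_text, year=None, pd_cutoff_year=None):
    """Return 'free', 'review-pd-by-age' or 'skip' for one record.

    license_text   -- license string or URL as found in the metadata (assumed)
    year           -- publication year, if known
    pd_cutoff_year -- caller-supplied year before which a work is worth
                      reviewing as public domain by age (an assumption;
                      the real rule depends on jurisdiction)
    """
    text = (license_text or "").lower()
    restricted = any(re.search(p, text) for p in RESTRICTED_PATTERNS)
    free = any(re.search(p, text) for p in FREE_PATTERNS)

    if free and not restricted:
        return "free"
    # Restricted or unknown license: old enough works are flagged for
    # manual assessment rather than uploaded automatically.
    if year is not None and pd_cutoff_year is not None and year < pd_cutoff_year:
        return "review-pd-by-age"
    return "skip"
```

For example, `classify("http://creativecommons.org/licenses/by-nc-sa/3.0/", year=1850, pd_cutoff_year=1900)` flags the record for review instead of skipping it, which mirrors the point that restricted BHL licenses can sometimes be shown to be public domain by age.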

The design approach follows the lines of small-scale Agile software development, in that "big bang" releases are avoided and there is a repeated cycle of <write script> - <run upload> - <check results> - <improve script>. A key disadvantage of this approach is that time must be spent writing "housekeeping" scripts to improve past uploads, and these scripts must respect the work of other volunteers who may be categorizing or correcting image text pages.

The BHL archives contain objects other than books, such as articles. On the presumption that books are the most likely to be out of copyright, this project focuses on books.

The objective is to find pages with illustrations and book plates in out-of-copyright books that are likely to be of interest to Wikimedia projects (such as Wikipedia) and of public educational interest. Presenting these images on Commons in a browsable format makes them more useful for illustrative reuse than presenting the entire work as a single file, or presenting every page of the book as an image, most of which would be raw text. All uploads link back to the source book, which can be read in its entirety at the BHL site. Projects interested in the full texts (such as Wikisource) can take them from BHL as DjVu files, which are more suitable for transcription.

Secrets of BHL API2

Finding the best quality image

Determining copyright

Uploading a book

Uploading a Collection

Uploading by subject keyword

Secrets of Commons API

Breaking bad

How to sniff

A year is a long time with Wikimedia

Housekeeping

Hitting a 5 sigma target

Blank pages

Duplicates of Flickr uploads

Duplicates of Internet Archive uploads