User:Fæ/Project list/ZooSphere

From Wikimedia Commons, the free media repository
Acanthopleura ravus

Files uploaded under this project can be counted using this search.

For more information on the original project to share biological specimen photographs, see http://www.zoosphere.net/project.

There are a lot of challenges for this upload:

  1. The sequences are uploaded from zip files of images, but many of the download requests for these result in 404 errors. Some sequences have no downloadable zip; it is unclear why, considering all images are published as CC0. The sequenced images normally appear under a "/cropped" folder in the zip file, but there are exceptions where the layout varies.
  2. Web scraping is especially difficult due to the use of dynamic HTML. Working around this means using Selenium to drive an automated Chrome browser instance.
  3. There is no alternative API or database to draw metadata from.
  4. Commons is not well suited to display large image sequences of this type. The gallery tag is used to interrelate files, but this gives very large image pages.
  5. The X x Y image counts given in the metadata may be wrong, e.g. 79/Canthon/auricollis, which will break that sequence upload.
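
The varying zip layout in the first challenge can be handled along these lines. This is only a minimal sketch, not the code used for the upload: it reads the archive with Python's zipfile module rather than an external unzip call, and the function name, the accepted file suffixes and the fallback rule are all assumptions made for illustration.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: sequence images are JPEG files; adjust if the archives differ.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def list_sequence_images(zip_path):
    """Return the image member names of a sequence zip, in sorted order.

    Images inside a folder named 'cropped' are preferred. If the archive
    has no such folder (the layout varies), every image found is returned.
    """
    with zipfile.ZipFile(zip_path) as archive:
        images = [
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
        ]
    # Keep only members that have 'cropped' as one of their parent folders.
    cropped = [n for n in images if "cropped" in PurePosixPath(n).parts[:-1]]
    return sorted(cropped or images)
```

A download that returns a 404 never produces a zip file at all, so that case would need to be caught before calling a helper like this.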

Technical

Gymnopleurus sericeifrons

Files are named as:

<genus> <species> (Zoosphere <sequence number> <image number>).jpg

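As an illustration of the naming pattern above, a filename could be assembled as follows. The function name and the unpadded numbers are assumptions; the uploaded files may format the sequence and image numbers differently.

```python
def commons_filename(genus, species, sequence_number, image_number):
    """Build a filename following the pattern
    '<genus> <species> (Zoosphere <sequence number> <image number>).jpg'.

    Assumption: numbers are written without zero padding.
    """
    return f"{genus} {species} (Zoosphere {sequence_number} {image_number}).jpg"
```

For example, `commons_filename("Canthon", "auricollis", 79, 1)` gives `Canthon auricollis (Zoosphere 79 1).jpg`.
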
No date is set in the metadata, so a date is derived from the file timestamp of the first photograph in the sequence, as an equivalent to a publication date; this may be unreliable. The species metadata may include an "author" field, which is the attributed historical taxonomist; this is shown in addition to an "author" who is a named photographer or responsible institution.
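
A rough sketch of working out a date from a file timestamp with os.path.getmtime is shown here. Choosing the "first" photograph by sorted filename, reading the time as UTC and the ISO date format are assumptions made for the example, and the function name is hypothetical.

```python
import os
from datetime import datetime, timezone


def sequence_date(image_paths):
    """Return a YYYY-MM-DD date taken from the first image's modification time.

    Assumption: the first photograph is the first path in sorted order.
    """
    if not image_paths:
        raise ValueError("no images in this sequence")
    first = sorted(image_paths)[0]
    mtime = os.path.getmtime(first)  # seconds since the epoch
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
```

Whether the modification time survives depends on how the files were unzipped, which is one reason the resulting date can be unreliable.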

The release license is checked to match CC0. No other license is accepted.

Photographs are numbered in their original sequence. Each sequence runs horizontally along a row of photographs, then moves vertically to the next row. The large gallery includes a black dot where the current image appears in the sequence.
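
The horizontal-then-vertical numbering amounts to row-major order, so an image number can be turned into a grid position as sketched here. It assumes image numbers start at 1 and that the number of photographs per row is known; the function name is hypothetical.

```python
def grid_position(image_number, columns):
    """Map a 1-based image number to a 1-based (row, column) grid position.

    Assumption: images run left to right along a row of `columns`
    photographs, then continue on the next row.
    """
    if image_number < 1 or columns < 1:
        raise ValueError("image_number and columns must both be at least 1")
    row, column = divmod(image_number - 1, columns)
    return row + 1, column + 1
```

With 10 photographs per row, image 10 is at row 1, column 10, and image 11 starts row 2 at column 1.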

Each unique <genus> <species> has a category created for it under the main 'bucket' ZooSphere category. There may be more than one sequence of images for each <genus> <species>.

The metadata X x Y count of photographs may not match the actual total. After unzipping, the real files discovered are used to generate the image page galleries.
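
Building a gallery from the files actually found, rather than from the X x Y count, might look like the sketch below. The caption scheme, with a black dot for the current image and a running number for the rest, is an assumption about how the galleries are laid out, and the function name is hypothetical.

```python
CURRENT_MARKER = "\u25cf"  # black dot used to flag the current image (assumed)


def build_gallery(filenames, current=None):
    """Return gallery wikitext listing every discovered file in order.

    `filenames` are Commons file names without the 'File:' prefix.
    The entry equal to `current` is captioned with a black dot.
    """
    lines = ["<gallery>"]
    for number, name in enumerate(filenames, start=1):
        caption = CURRENT_MARKER if name == current else str(number)
        lines.append(f"File:{name}|{caption}")
    lines.append("</gallery>")
    return "\n".join(lines)
```

Because the list comes from the unzipped files, a sequence with a wrong metadata count still gets a gallery that matches what was uploaded.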

Programmatically, this upload relies on Pywikibot running under Python, using BeautifulSoup to do the web scraping and Selenium to control and interrogate a test instance of Chrome (driver available from Google). Without Selenium, BeautifulSoup would only see the starting HTML rather than the fully rendered DHTML. There is also an external call (via Python's subprocess module) to control the unzipping of the images, and Python's os.path.getmtime is used to work out a file date. I'm not sharing this code; it's hacky and platform specific, but there are enough breadcrumbs for someone playing with Python to follow. Two things that did not work: trying to get Selenium to run Firefox, and using mechanize as the browser driver. Satisfyingly, the unzipping happens on a very old 2GB USB stick, cutting down a bit on wear and tear on my main disk or RAM.

Progress

  • 08 Jan 2018. Upload started 3pm. All files can be found under Category:Images from ZooSphere.
  • 10 Jan 2018. Amended to cater for alternative categorizations in the zip file. Added the photographer's name.
  • 11 Jan 2018. Run completed.