User:Fæ/Wayback


This is a housekeeping task for image source links, semi-automated by a supporting pywikibot script. It relies on the free service at https://archive.org/web, in much the same way that the Wayback Machine is used to archive citation sources on Wikipedia. Adding archive links helps anyone wanting to verify the original image release details once the source goes offline, moves, or changes its copyright release. The links are (currently) limited to images uploaded by Fæ.

The housekeeping task is not fully automated: it must be customized and tested for each batch upload project, because it relies on the information box layout and on any specific issues with source layout and licensing terms. For these reasons, in future large batch upload projects it would be better to generate archive links and add them to the image page text just before upload to Commons. There is no agreed standard way of displaying or including a link to the Wayback Machine on image pages.

For each file in a given bucket category, web.archive.org is checked to see whether the source link is archived. If it is not, the script attempts to add the page to the archive, though this only works if the original source is still online. By default the most recent archive link is added using the {{Wayback}} template. Visiting the link shows whether other snapshots are available, as web.archive.org displays these in a navigation bar.
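
The check-then-archive decision described above can be sketched roughly as follows. This is a minimal illustration rather than the actual Faebot code: the function names are hypothetical, and the use of the https://web.archive.org/save/ prefix to request a new snapshot is an assumption about how the page gets added. No network call is made; the already-decoded JSON response is passed in.

```python
from urllib.parse import urlencode

AVAILABLE_API = "https://archive.org/wayback/available"
# Assumption: new snapshots are requested through the "Save Page Now" prefix.
SAVE_PREFIX = "https://web.archive.org/save/"


def availability_query(source_url):
    """Build the availability query for one source link (hypothetical helper)."""
    # safe="/:" keeps the URL readable, matching the style of a manual query.
    return AVAILABLE_API + "?" + urlencode({"url": source_url}, safe="/:")


def plan_for(source_url, response):
    """Decide what to do with a file, given the decoded availability response.

    Returns ("link", snapshot_url) when a snapshot exists, otherwise
    ("save", save_url) so the caller can try to archive the live page.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return ("link", closest["url"])
    return ("save", SAVE_PREFIX + source_url)
```

A "save" result is only useful while the original source is still online, which is why the outcome needs checking again after the request.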

As a separate experiment, very large collections are "farmed out" to a headless Raspberry Pi once they appear stable. This should reduce home power consumption, and the runs are less likely to be interrupted by a restart on this always-on processor. The host name "raspberrypi" is automatically included in the edit comment by Faebot, to distinguish these tasks from others.

Collections targeted

  • Category:Photographs from trainpix.org, the most recent snapshot is given, which will show the licence chosen.
  • Category:Images from Unsplash, the earliest web.archive snapshot is chosen. Because of the way Unsplash uses JavaScript, the archived page shows as blank, though inspecting the page source will show all the archived metadata. The source is best viewed with Firefox, which will pretty-format the metadata. The licence is implicit, based on the website terms.
    • For example, to verify File:80s Chic (Unsplash).jpg, https://web.archive.org/web/*/https://unsplash.com/photos/gvo3uGk8K_4 shows that the earliest archived version is dated 23 June 2017. Only photographs published before 5 June 2017 can be considered verifiably CC0, so if the Unsplash site went offline, the archive link would help demonstrate the publication date through the archived source HTML. The source metadata includes "asyncPropsPhotos", which among other details includes the timestamp "created_at": "2017-05-26T21:02:44-04:00", neatly confirming publication before 5 June 2017.
  • Category:Portable Antiquities Scheme, at 400,000 files this is a very large collection.
  • Category:Images from DoD uploaded by Fæ, large collection of 300,000 photographs, though limited to those with dvidshub source links and using the {{milim}} template.
  • Rather than working through categories, files are edited using ia_search to target domain names:
    • pixabay.com 3,787 Wayback Machine (WBM) links added
    • fortepan.hu
    • collections.lacma.org/node 24,399 WBM links added
    • openbenches.org/bench
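
The Unsplash date check described in the list above can be illustrated with a short sketch. The function name is hypothetical, and comparing the calendar date in the timestamp's own UTC offset is a simplifying assumption; a photograph created within a few hours of the cutoff would need a closer look.

```python
from datetime import date, datetime

# Cutoff from the text: only photographs published before 5 June 2017
# can be considered verifiably CC0.
CC0_CUTOFF = date(2017, 6, 5)


def published_before_cutoff(created_at):
    """Return True if an archived "created_at" value falls before the cutoff.

    The value is expected in the ISO 8601 form found in the archived
    metadata, for example "2017-05-26T21:02:44-04:00".
    """
    created = datetime.fromisoformat(created_at)
    return created.date() < CC0_CUTOFF
```

For the example file, `published_before_cutoff("2017-05-26T21:02:44-04:00")` evaluates to `True`, consistent with the manual reading of the archived source.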

Example cases

Technical

WBM JSON

The archive is queried using a call like https://archive.org/wayback/available?url=www.trainpix.org/photo/92963. This only ever returns a single snapshot, by default the most recent. The returned data is in JSON format:

{
  "url": "www.trainpix.org/photo/92963", 
  "archived_snapshots": {
    "closest": {
      "status": "200", 
      "available": true, 
      "url": "http://web.archive.org/web/20190104191240/https://www.trainpix.org/photo/92963", 
      "timestamp": "20190104191240"
    }
  }
}
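
As an illustration of reading this response, the sketch below extracts the snapshot link and converts the 14-digit timestamp into a datetime. The function name and the embedded sample are for illustration only and are not taken from the Faebot script.

```python
import json
from datetime import datetime

# Sample response, copied from the example above.
SAMPLE = """
{
  "url": "www.trainpix.org/photo/92963",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20190104191240/https://www.trainpix.org/photo/92963",
      "timestamp": "20190104191240"
    }
  }
}
"""


def closest_snapshot(raw_json):
    """Return (snapshot_url, snapshot_datetime), or None if nothing is archived."""
    data = json.loads(raw_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # WBM timestamps have the form YYYYMMDDhhmmss.
    taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return closest["url"], taken
```

Here `closest_snapshot(SAMPLE)` gives the archive link together with a datetime of 4 January 2019, 19:12:40, which is convenient if the snapshot date is to be shown alongside the link.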

Alternate states

The key reason that adding Wayback Machine (WBM) links cannot be fully automated across different batch upload projects is that link placement and format vary depending on which source links are included and how, and that some failure modes need a response specific to the batch upload.

To discover failures, the header returned by the source link is examined for failure codes, and simple error traps then respond depending on whether the response is 404, 401 or anything else.* These states are discovered ad hoc during initial runs, as they are unlikely to be known in advance. This does eat up technical volunteer time, so it is only likely to be implemented where the batch upload is large enough to warrant it. As a benchmark, Unsplash is over 30,000 photographs and PAS/finds.org.uk is over 400,000 images.
* An easy way of simulating this is to use https://www.webconfs.com/http-header-check.php, which will show the returned header from a given URL.
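
A header check of this kind might look like the sketch below, which uses only the Python standard library. The function names and the returned labels are hypothetical; the real script's responses are tailored to each batch upload. Network failures (urllib.error.URLError) are deliberately left uncaught here.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code from a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx and 5xx responses; the code is on the exception.
        return err.code


def triage(status_code):
    """Map a status code to a hypothetical follow-up label."""
    if status_code == 404:
        return "source-missing"
    if status_code == 401:
        return "withdrawn-from-public-view"
    if 200 <= status_code < 300:
        return "ok"
    return "other-failure"
```

Keeping the triage step separate from the request makes it easy to add project-specific states as they turn up during initial runs.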

Example failure responses:

  • finds.org.uk, the catalogue link may return a 401 error. This shows that the original photograph has been taken out of public view. There are several possible reasons for this, so as a simple, non-judgemental response the images are added to a maintenance category.
  • unsplash.com, the gallery source link unfortunately does not display at the WBM because of the way JavaScript behaves on the original site. To make the archive links easier to understand, a second WBM link is provided, using the more complex link to the raw photo source, which will display the full-size photograph.