User:Fæ/LOC

From Wikimedia Commons, the free media repository
Jump to navigation Jump to search
The Wizard of Oz, 1903 (Library of Congress)

Introduction[edit]

This page explains some of the technology and decisions made by for batch uploading from the Library of Congress (LoC) image collections. Other upload projects are listed at User:Fæ/Project_list.

Search tips[edit]

The collections are in individual categories, listed further down this page, but some categories may have thousands of files to surf through. Wikimedia Commons has lots of options available for smart searching, here are some practical example searches with links where following the format may help you to narrow down images of interest:

  • Search Matson Collection for West Bank
    incategory:"Matson Collection in the Library of Congress" insource:"West Bank"
    search
  • Search all LOC uploads for TIFFs with Bethlehem in title but not in HABS as these are taken in the US
    insource:"www.loc.gov/pictures" intitle:Bethlehem -HABS filemime:"image/tiff"
    search
  • Search all LOC uploads for TIFFs with "boy" on the page with resolution over 36 megapixels and not in the HABS uploads
    insource:"www.loc.gov/pictures" insource:"boy" -intitle:HABS filemime:"image/tiff" fileres:>6000
    search ('fileres' uses the square root of height times width)
  • Search all LOC uploads for TIFFs of posters with dogs or cats with an image width greater than 3,000 pixels, and not in HALS or HABS, using a regular expression search to avoid matches to "category" or URLs with "cat_" in them
    insource:"www.loc.gov/pictures" poster insource:/ (dog|cat)[s\ \.,]/ -intitle:HABS -intitle:HALS filemime:"image/tiff" filewidth:>3000
    search

Technology and basic options[edit]

Uploads in 2014 were run using the Glamwiki toolset, refer to Commons:Batch uploading/Library of Congress.

Uploads more recently are run using a Pywikibot script running in Python. Since 2014 Pywikibot has been released in a new version (pywikibot-core), Python itself has been released in new versions and the Library of Congress site has changed in various ways. All of these changes mean that it is not possible to release an upload tool or script that can be reliably reused a year or more after development, any upload process needs maintenance to work well, especially as Wikimedia Commons has gradual changes in templates, working practices and may have significant format changes due to further integration with Wikidata.

LoC image pages may include several links to images in jpeg and TIFF these look like: JPEG (54kb) | JPEG (155kb) | TIFF (179.8mb). The upload script finds all images and selects the largest jpeg and largest TIFF based on the filesize given on the record, not by testing the file itself.

The default behaviour is to upload a jpeg and a TIFF version. A discussion on the Village Pump established this was the best way forward to cater for the poor thumbnail rendering of TIFFs, consequently the jpeg version is better for illustration of Wikipedia articles or galleries on Commons. Though the poor rendering of TIFFs has been discussed for several years, a solution has yet to be deployed by WMF development.

Based on previous correspondence with LoC staff, it is likely that automated access to LoC servers is limited to 15 accesses/enquiries per minute. Running faster than this may trigger the IP to be blocked or throttled, in 2014 a faster run was blocked for an hour after exceeding these advised limits. It is not known whether throttle limits apply to certain address types and not others.

Code detail[edit]

Example operator's terminal view showing different messages during batch upload to aid monitoring.

Pywikibot's standard site.upload is used to upload files. This uses a default chunked uploading. Though in theory ignore_warnings can be set as an array to ignore certain types of failure, this has never worked (for me) and remains a code bug. As a work around to enable both jpegs and TIFFs to use the same root file name, the upload module is first run with ignore_warnings = False, then if a returned warning is "exists-normalized", the module automatically runs again with ignore_warnings = True and assumes there are no other errors.

A duplicate check for digitally identical copies is part of the code, this relies on the ability to create a SHA1 value based on a URL to the image file hosted by the LoC to compare with the standard Commons API SHA1 search. No attempt is made to check for existing matches to LCCN identity or similar, on the basis that there may well be existing copies on Wikimedia Commons which are inferior resolution. In 2017 the LoC appears to have blocked bot access, resulting in Error 403 (Forbidden) to a python script request to read image files to generate a SHA1 hash. A work around is to simulate a browser header on access, it's cheating a little, but uploading to Commons is part of why the LoC release the images in the first place, however the outcome is identical to running a full local download, so to limit home broadband hogging, SHA1 duplicate checks are only run for files less than 20 MB in size. Start and end times for SHA1 generation are shown in-terminal for small enough TIFFs, to help keep an eye on whether size limits are being set in a practical way. SHA1s are generated and used to check for duplicate jpegs, just not flagged as worth monitoring.

API flagged errors such as the file having been previously deleted, the same file existing but having been overwritten (such as by being cropped) or the TIFF failing the WMF server fileinfo check, are respected and should cause the upload to be skipped.

Code is run from the command line in a local terminal. A parameter in the call is a URL to the grid search for a LoC image project, for example http://www.loc.gov/pictures/search/?co=wwipos&st=grid shows the WW1 poster collection. Options include the ability to skip to an image in the batch upload sequence and to upload a single image by LCCN. Based on the setting for "co=" in the URL, a bucket category for the collection is added to every image; i.e. the code has to be customized for each planned upload batch.

BeautifulSoup is used to help process the LoC webpage. Prior to 2017, metadata was also pulled from a lccn.loc.gov mods page, but these have become unavailable for unknown reasons.

Galleries[edit]

From 2018 onwards, cross-linking of jpeg and TIFF versions are done using the <gallery> tag. Where previous versions or crops may exist using the same digital identities, these are searched for using the normal Commons search engine and displayed in the gallery. For example this colour poster includes an earlier black and white scan in its gallery.

If the earlier uploads use identical file names, then they will be overwritten if the largest LOC file is at least 10% larger in file size than the one currently on Commons.

Metadata[edit]

Author is hard to extract from the LoC record or MODS record, with this either being unidentified or mixed in with a description rather than a separate field. Often the nearest to author is given as publisher or engraver. For these reasons a default of the collection name is placed in the Author field.

Fortunately swapping to JSON records makes contributor_names available. This virtually solves the Author problem.

IP blocks for lccn.loc.gov[edit]

On the 1 February onwards LCCN queries became "forbidden", apparently based on IP blocking. Presumably the number of queries for LCCN based uploading had triggered the automated server security, or the old systems are being phased out. Uploads of NCLC images continued to run, as these rely on MARC metadata rather than LCCN. Rather than swapping IPs to work around this, a code rewrite seems timely.

There are alternatives, such as using the LC labs data services, however since first running batch uploads the main site has made available a JSON service which seems the most relevant solution with the presumption that this is not subject to low throttle limits.

As of 3 February, rather than using the lccn.loc.gov sites, the more recent facility of pulling JSON versions of the records is being used. The URLs look like https://www.loc.gov/item/2001699170/?fo=json. These are not identical to the old records, with some new metadata options and others missing or possibly renamed such as 'Abstract'.

Known errors[edit]

  • Automated uploading defaults to picking the largest jpeg and largest TIFF versions by filesize. On rare occasions this gives a poor choice as different scans may include different crops with frames, or massive borders. For example http://www.loc.gov/pictures/item/00651623/ includes both an album page and a cropped image with just the photograph, and picking the largest TIFF gives the whole album page rather than the better quality crop, see LCCN00651623. Manual overwrite, or uploading as alternates when found, may be the only solution for these exceptional cases.
  • Metadata structure for "resources" is inconsistent. In the van collection this was especially apparent. Rather than following a { files: [] } format, there is a flat dictionary with URLs against some of the keys. The browser version of the JSON data does not match the version produced with Python calls. In these cases "resources" exists but contains no data, there is no known explanation for why this happens. A work around is to fall back to scraping the website to check for download links, ignoring that part of the JSON record. Example poster, it is possible that all these cases are caused by files with copyright restrictions.
  • TIFFs without matching jpegs get uploaded, these only happen when the jpeg version has already been uploaded under a different filename. Where jpegs exist with no matching TIFF, they are skipped. These are extremely rare. Example The Last supper.
  • Some apparently valid TIFFs fail tiffinfo verification when uploaded to the WMF server. This has been awaiting WMF dev investigation since January 2016. Refer to phab:T124662.
  • Outages - for periods of minutes up to days, www.loc.gov/item/ may be offline. Sometimes the collections are unavailable with read failure errors (example), sometimes they time out to a "technical difficulties" web page or "planned outage" statement. Long term upload scripts have to detect and wait for the server to return, or ask the operator to use their human brain to decide what to do.
  • Bad pnp - likely to be an extremely rare error, this is where the LOC record gives a pnp based link which is not a pnp. Only spotted in the wild once, with a pnp of "item". The code now double checks the image link to ensure the item pnp has digits in it.

Copyright[edit]

Year ranges are checked to calculate a maximum date value, which is presumed a terminus ante quem for copyright.

The default copyright check is basic, the licences used for uploads created after 22 January 2018 are:

max date > 1923: Skip file unless a rule for the collection applies
unknown date: {{PD-old-70-1923}}
max date >= 1890: {{PD-old-70-1923}}
max date <  1890: {{PD-old-100-1923}}

Collections with special templates or with different copyright provisions, such as US Government works which are public domain regardless of date of publication, may have custom copyright settings. These are added at the same point as when a collection category is identified based on collection short name.

The wrapper of {{PD-Art}} or similar is not applied as the objects may be both art or photographs.

Housekeeping[edit]

Upscale jpegs[edit]

Where TIFF and jpegs have identical base filenames, and the jpeg is significantly smaller in resolution to the TIFF, the jpeg is upscaled to the TIFF size. The jpeg is created using the Wikimedia Commons transcoded version of the TIFF at full size, which has to be uploaded from a local version as the API does not allow url uploads from its own server.

Adding and updating other version galleries[edit]

The batch uploads described on this project page, use LCCN as the primary identifier, with the only search for other versions at time of upload using the LCCN. A housekeeping task is checking for the older digital ID or pnp code, where in some cases versions have been uploaded to Wikimedia Commons before a LCCN was allocated by the library.

Where reasonable to do so, this gallery housekeeping task adds galleries to matching files which were not uploaded in the batch upload project. For more information refer to User talk:Fæ#locpnp.

Example result from housekeeping, showing the gallery added to this TIFF image page, composed of an imagehashed identical jpeg version, and an autodetected crop from the same image.

More recent example from this TIFF, relying on an OpenCV "Flann" comparison of features to confirm cropped versions which is able to work when the crop has been rotated.

In addition to the above batch uploaded categories, the generic category is being checked Images from the Library of Congress.

Refinements

A second run uses an image hash check to verify that versions are visually close, normally identical apart from resolution, light levels, or minor border crop. This confirms that any images added to the gallery are within "20 points" of the original according to the perceptual hash and difference hashes previously tested for DoD collections at User:Fæ/Imagehash. This will almost completely eliminate weak matches, but may miss crops that were previously shown as other versions. The hashes are only generated for returns for the search for LOC image identity, it is not a generic image search procedure.

A third stage checks the remaining image ID list after image hash matches for potential cropped versions. This uses a "Flann" or SIFT analysis from the OpenCV library. This will match cropped or rotated versions, including small crops such as individual portraits taken from group photographs. With any automated image transform, there is potential for a small proportion of false matches, however in conjunction with the fact that the LOC image identifier must be in the sample checked, this may be reduced to near zero probability. This image analysis is deliberately the final stage of filtering to minimize the numbers, as it is "expensive" in computer process time.

Beneficial side-effects of the rerun/refresh are:

  1. removal of links to missing images, example
  2. removing accidental duplicates within the gallery due to initial character case and html encoding, example example
  3. resolving renames and removing duplicates resulting from renames, probably created due to coincidental server lag example
  4. refresh dimensions for linked overwritten images
  5. upscale jpeg versions from matching TIFF, where jpeg name is identical to tif name, proportions of image match within 1%, and jpeg is at least 70% smaller in width and height
  6. add any cropped versions example

Behind the scenes, the image matching process finds distinctive features for both images and aligns near matches. The two screenshots below showing the matched features, were created using Python's matplotlib to display the OpenCV formatted images. These matches to File:Ty-Cobb-1913-NPC.jpeg were automatically processed with the LOC gallery housekeeping task. Interestingly, there are two versions of the scan dating from 2007, though uploads are ten years apart. The master archive negative appears to be rapidly degrading, with significant cracks appearing in Ty Cobb's face. It is unclear if the uploader in 2008 manually removed these cracks before uploading the full size jpeg version.

Exclusions

Where two <gallery> tags are found on the image page, the matching task is skipped on the presumption that the customized page has special features.

The image ID is searched for within a {{LOC-image}} template, or if that is missing then the first "loc.pnp" formatted link with digits in the number is presumed to be a match. Where the LOC-image template is used in an unexpected way, this will cause the search to give up on that page. It is possible that "loc.pnp" references to other images are in the description, this may cause false matches, however it's a highly unlikely scenario.

Rename dummy entries[edit]

Some scans have a temporary statement about being a "Digital display record". An example is LCCN2014637246. It is unknown how rare these are, hopefully very rare. The start of this search lists those needing renaming and descriptions.

A housekeeping task finds pages with the dummy title and if the page has not already been moved opens the marcxml page under the given lccn.loc.gov source. Under tag 245 the MARC record will have the original title which appears to have become lost, or not retrospectively revised, in the original record. Several hundred files have been renamed this way. Some exceptions where the title is too long will remain needing a manual selection of new name.

Collections[edit]

Batch uploads before 2018 included:

  • Frank and Frances Carpenter Collection
  • World's Transportation Commission
  • African American Photographs Assembled for 1900 Paris Exposition
  • American 18th and 19th century cartoons
  • Sergey Prodkudin-Gorsky

Popular Graphic Arts (pga)[edit]

28,168 R Popular Graphic Arts

Based on a Village Pump request in Jan 2018, the 'pga' collection was added to the upload script.

Due to uploads moving to chunked uploads, the previous issues with uploading 100MB+ TIFFs no longer exist, and the relevant log and test options were stubbed out.

Jpeg versions as hosted by the LoC are uploaded directly, these are normally of a much lower resolution than the available TIFF. This might be suitable for housekeeping to 'upgrade' the jpeg size, if there is demand.

From this batch, the upload comment has been improved to follow the format:

Library of Congress <date> LCCN <LCCN> <image format> #<batch upload sequence number>

This tweet announced the start of the upload and was retweeted by the @wikicommons twitter stream. ✓ Done 14 February 2018.

Civil War Glass Negatives (cwp)[edit]

9,977 R Civil War Glass Negatives

LoC: http://www.loc.gov/pictures/search/?co=cwp&st=grid

A few hundred existing copies of the glass negatives in this collection exist on Commons, though it's unclear in many cases if the original source was the LoC. The estimate is between 600 to 900 images may already be uploaded, probably in jpeg formats in various resolutions. The LoC collection shows a potential of 8,710 photographs available.

Upload run started 22 January 2018. Copyright release set at {{PD-old-100-1923}}. ✓ Done

National Child Labor Committee Collection (nclc)[edit]

6,385 R National Child Labor Committee Collection

This is an example of an upload where LCCNs do not exist for the collection. Metadata mining had to swap to using the MARC record.

Filenames take the format:

<title> LOC <reproduction number>.(jpg|tif)

Upload run started 23 January 2018. Photographs go up to 1924, so those after 1922 are {{PD-USGov}}.

31 January 2018. Due to a flaw in how images were selected, it was not always the largest being uploaded. New array sorting and size checks are re-uploading where needed. Examples: LOC cph.3a27326 and LCCN00650089.

✓ Done

Carl Van Vechten (van)[edit]

Category:Photographs by Carl Van Vechten

The LOC collection has 1,395 photographs, while the collection on Commons before this upload was ~500. The total uploaded in this batch can be discovered using this search. The upload is likely to create many duplicates, however many of these will be due to earlier scans or the use of non-LOC sources which may have changed the EXIF data. Photographs use the custom license {{PD-Van Vechten}}.

1 February 2018, Upload run started. ✓ Done at least 1,353 photographs added.

Yanker Poster Collection (yan)[edit]

Category:Yanker Poster Collection

The listed collection has 1,322 images listed, though many may be considered in copyright and all those from 1978 onwards are skipped. No posters appear to have been uploaded to Commons previous to this batch upload, possibly due to the complexity of copyright.

Copyright has been justified by the Library of Congress as public domain for most cases due to publication before or during 1977 without a copyright notice. These are uploaded with a license of {{PD-US-no notice}} and on the normal PD-1923 license for any before 1923.

1 February 2018, upload run started. ✓ Done ~160 distinct posters uploaded, a small proportion due to dates and license filters.

Performing Arts Posters (var)[edit]

Category:Performing Arts Posters

Copyright is left as the default filter, so any published after 1923 will be skipped. ✓ Done ~2,043 posters uploaded.

Artist Posters (pos)[edit]

1,487 R Library of Congress Artist Posters collection

This is a large collection which includes posters in other collections. However only 2,700 are available online and many of these will be dated after 1923.

From this upload an additional field of "partof" is added to the information box. If posters are in sub-collections this will be shown, and the images will be relatively easy to re-categorize on Commons using a source search. For example LCCN2001695176 is given a link to LOC's "posters: world war i posters" collection, which can be used to move it to World War I posters in the Library of Congress.

Default copyright filter applies, so only up to 1923 publications will be uploaded. ✓ Done

PH Filing Series Photographs (ph)[edit]

Category:PH Filing Series Photographs

Though over 3,000 images are public, filtering the collection to be before 1923 reduces the numbers significantly. Of all the LOC collections, this group may have debatable quality. Age and photographer names should make all the photographs of some educational value and the LOC state they are selected for "special aesthetic, technical, or historic importance".

Uploads started 7 February 2018. ✓ Done ~480 unique photographs added.

World War I posters (wwipos)[edit]

Category: World War I posters in the Library of Congress

This is a refresh run, the last run being in 2014.

Extra rules were required as there may be doubt on PD status in some countries depending on author death dates. Though the LOC is correct in hosting these as PD in the USA, many war time materials were seized and under US (or UK) law became government property, including any intellectual property these represent. However, in the country of publication (Germany or allies) there never was a revocation of the rights of the artist and Wikimedia Commons respects the law of the country of first publication as well as US law.

A general filter to test whether partof includes wwi posters and then does a check for location. If location is empty, the image is skipped, this turns out to be the case for many items in the collection. Where location is not given, but subject includes american, location is forced to be "united states". Allowed locations include:

  • united states
    • new york
    • new jersey
    • minneapolis
    • lancaster county
    • pennsylvania
  • south america
  • australia
  • canada
  • great britain
    • england
  • italy
  • india
  • france
    • ecorse
    • brittany
    • calvados
    • aisne
    • alsace
  • africa
    • algeria
  • serbia

Uploads started 13 February 2018. ✓ Done

Gladstone Collection of African American Photographs (gld)[edit]

320 R Gladstone Collection of African American Photographs

This is a relatively small collection of ~350 photographs with some cartoons relating to slavery. Most are civil war period, all are of African Americans. ✓ Done

National Photo Company Collection (npco)[edit]

69,777 R National Photo Company Collection

This is a large collection of photographs dating between 1909 to 1932 and over 69,000 are returned by the search view. Photographs have no known copyright restriction, so can be uploaded as {{PD-old-70-1923}} for pre-1923 photographs and {{PD-US-no notice}} for on or after 1923. Where no date is discovered, the 'no notice' option is taken as a cautionary default.

Some of this collection have no links to images in the JSON records, a special fallback had to be written to scrape the displayed webpage records and use MARC index. It is unclear why the inconsistency exists, possibly because they were mistakenly treated as not visible outside of LOC. Example Roosevelt at Union Station LOC npcc.00046.jpg ✓ Done

Matson Collection (matpc)[edit]

25,953 R Matson Collection in the Library of Congress

Primarily photographs of Palestine up until 1946. A special template already exists {{PD-Matson}} which is added to all images. Uploaded as {{PD-old-70-1923}} for pre-1923 photographs and {{PD-US-no notice}} for on or after 1923. ✓ Done

Liljenquist Family Collection of Civil War Photographs (lilj)[edit]

1,923 R Liljenquist Family Collection of Civil War Photographs

Many photographs from this collection have been previous uploaded to Commons under various categories. The collection is being added to, so updates are needed in the long term.

All photographs are dated to the American Civil War, so are all public domain. ✓ Done

Harris & Ewing Collection (hec)[edit]

66,052 R Harris & Ewing Collection

2,443 photographs had been uploaded from this collection in advance of batch uploading (though only ~160 in the main category). All uploads use {{PD-Harris-Ewing}} and the PD-old-70-1923 or PD-US-no notice releases depending on date.

Upload run start 16 February 2018. ✓ Done

Bain Collection (ggbain)[edit]

76,176 R George Grantham Bain Collection

The collection includes many portraits of notible early 20th Century figures, including politicians, socialites and sports winners. Around 10,000 images from the collection were uploaded before this run. 41,000 photographs are available at LOC. {{PD-Bain}} is added to all photographs. ✓ Done

Miscellaneous Items in High Demand (cph)[edit]

77,964 R Miscellaneous Items in High Demand, PPOC, Library of Congress

The "Miscellaneous Items" represents a variety of the Prints & Photographs Division's photographic, print, drawing, and architectural holdings. These items were singled out for description because copy photographs or digital copies were requested for a publication, exhibition, or other special project that increased demand for the pictures.

Uploads are limited to pre-1923 dated works. ✓ Done

Lawrence & Houseworth (lawhou)[edit]

1,522 R Lawrence & Houseworth

The images date from 1862 to 1867. Just over 900 images are in the collection. Before this upload, four images had been uploaded to Wikimedia Commons. ✓ Done

Frances Benjamin Johnston (fbj)[edit]

5,471 R Photographs by Frances Benjamin Johnston

Frances Benjamin Johnston (1864-1952) was one of the first American women to achieve prominence as a photographer. Early colour photographs of many buildings and gardens.

Over 3,200 photographs are available in the collection. {{PD-Johnston}} is added to uploaded images. ✓ Done

Gottscho-Schleisner Collection (gsc)[edit]

13,193 R Gottscho-Schleisner Collection

This collection illustrates 20th Century architecture. Photographs date up to the 1940s, though some may be of a small size. {{PD-Gottscho}} is applied to the collection. 79 photographs were available on Commons before the batch upload and over 28,000 are available. ✓ Done

Japanese fine prints, pre-1915 (jpd)[edit]

2,560 R Japanese fine prints, pre-1915

Over 2,600 prints are in the collection. ✓ Done

Genthe (agc)[edit]

4,204 R Genthe Collection

Over 4,000 prints are in the collection and available digitally. ✓ Done

Fine prints in the Library of Congress (finepr)[edit]

1,156 R Fine prints in the Library of Congress

A relatively small proportion are available externally, just over 1,000. ✓ Done

John Margolies Roadside America Photograph Archive (mrg)[edit]

10 R John Margolies Roadside America Photograph Archive

This collection is being uploaded using an older search method, and is limited to TIFFs.

Drawings, Documentary and Master (drwg & drwgma)[edit]

172 R Drawings (Master) collection in the Library of Congress
2,592 R Drawings (Documentary) collection in the Library of Congress

At 175, there are only a modest number of "Master" drawings, but there are 2,800 drawings listed as being "Documentary" in content.

As all of the scans are artworks, the {{PD-Art}} template wrapper has been added to the usual PD license. ✓ Done

LOTs[edit]

LOT 4339[edit]

61 R Japanese and Chinese portraits 1863-1877, Library of Congress LOT 4339

A series of 19th century portraits by two different photographers. Special request by Animalparty. ✓ Done