Commons:Batch uploading/U.S. Army Corps of Engineers Digital Visual Library


Source to upload from


U.S. Army Corps of Engineers Digital Visual Library - https://usace.contentdm.oclc.org/digital/search
Provides almost 50,000 media files by the Corps of Engineers in the subjects of

License


Public domain - {{PD-USGov-Military-Army-USACE}}

Description

  • Do the media URLs follow a pattern? - site is based on the "CONTENTdm® Digital Collection Management System"
    • Description page: https://usace.contentdm.oclc.org/digital/collection/<collection ID>/id/<ContentDM record ID>/rec/<running number for related records>
    • Full-size download link: https://usace.contentdm.oclc.org/digital/download/collection/<collection ID>/id/<ContentDM record ID>/size/full
    • Description pages contain plenty of metadata such as title, caption, sub-collection, dates published/digitized, subject keywords, general keywords, physical location/description, "Data entered by"/uploader, language, usage rights (always PD)/use credits (not mandatory), record number/ContentDM record number, which should be included
  • Does the site have an API? - I don't know
  • What else could ease uploading?
    • Unrelated general thing I noted is that the majority of files have a very recent digitisation date (after 2020)
  • Did you contact the site owner? Not necessary
  • Is there a template that could be used on the file description pages, or should one be created?


TheImaCow (talk) 09:31, 3 August 2024 (UTC)[reply]
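The two URL patterns listed above can be sketched as small helpers. This is only an illustration: the function names are made up for this example, and the optional /rec/<running number> suffix is left out on the assumption that the description page also resolves without it (other links in this discussion omit it).

```python
BASE = "https://usace.contentdm.oclc.org"


def description_page_url(collection_id: str, record_id: int) -> str:
    """Description page for one record (without the trailing /rec/<n> part)."""
    return f"{BASE}/digital/collection/{collection_id}/id/{record_id}"


def full_size_download_url(collection_id: str, record_id: int) -> str:
    """Full-size download link, following the pattern given above."""
    return f"{BASE}/digital/download/collection/{collection_id}/id/{record_id}/size/full"


if __name__ == "__main__":
    print(description_page_url("p15141coll5", 505))
    print(full_size_download_url("p15141coll5", 505))
```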

IIIF (JSON) API:
  • Collection Manifest: https://usace.contentdm.oclc.org/iiif/2/<collection ID>/manifest.json
  • Item Manifest: https://usace.contentdm.oclc.org/iiif/2/<collection ID>:<ContentDM record ID>/manifest.json
    • For better annotated description metadata indexed by unique identifier keys: https://usace.contentdm.oclc.org/digital/api/collections/<collection ID>/items/<ContentDM record ID>/false
  • Item info: https://usace.contentdm.oclc.org/iiif/2/<collection ID>:<ContentDM record ID>/info.json
  • Download: https://usace.contentdm.oclc.org/digital/iiif/2/<collection ID>:<ContentDM record ID>/full/max/0/default.<format> with format offerings specified in the Item Info API above. JPEG is presumed to be available for all files.
-- DaxServer (talk) 18:14, 3 August 2024 (UTC)[reply]
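As a rough sketch, the endpoints above could be assembled as follows. The function names are hypothetical, and offered_formats() assumes the IIIF Image API 2.x layout of info.json, where "profile" is a list whose dictionary entries may carry a "formats" list; whether this server fills that list in is not confirmed here.

```python
BASE = "https://usace.contentdm.oclc.org"


def collection_manifest_url(collection_id: str) -> str:
    return f"{BASE}/iiif/2/{collection_id}/manifest.json"


def item_manifest_url(collection_id: str, record_id: int) -> str:
    return f"{BASE}/iiif/2/{collection_id}:{record_id}/manifest.json"


def item_metadata_url(collection_id: str, record_id: int) -> str:
    """Better annotated metadata, keyed by unique identifier."""
    return f"{BASE}/digital/api/collections/{collection_id}/items/{record_id}/false"


def item_info_url(collection_id: str, record_id: int) -> str:
    return f"{BASE}/iiif/2/{collection_id}:{record_id}/info.json"


def iiif_download_url(collection_id: str, record_id: int, fmt: str = "jpg") -> str:
    """Full-size download in the requested format (JPEG by default)."""
    return f"{BASE}/digital/iiif/2/{collection_id}:{record_id}/full/max/0/default.{fmt}"


def offered_formats(info: dict) -> list:
    """Collect the formats listed in an already parsed info.json document."""
    formats = []
    for entry in info.get("profile", []):
        if isinstance(entry, dict):
            formats.extend(entry.get("formats", []))
    return formats
```

The info.json document would be fetched separately (for example with an HTTP client) and passed in as a parsed dictionary, which keeps the helpers free of network access.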
@TheImaCow Could you create the template based on the LOC-image as you suggested? Do you have any suggestions on the categorization schema? I suppose the files go into a specific collection category. Apart from this, are there other categories?
I can start working on this and file for a bot task approval when ready. -- DaxServer (talk) 18:17, 3 August 2024 (UTC)[reply]
This looks very good, thanks for your work already!
I created {{USACE image}} based on similar templates, where parameter 1 is the <collection ID> and parameter 2 the <ContentDM record ID>.
I think categorisation can be done like the USGov FEMA import from 15 years ago, with categories mirroring the collection ID (see the subcats of Category:Images from FEMA called "Images from FEMA, category XXX"). For example, we could use something like "Images from USACE, Fish & Wildlife reports collection". I will create such categories in a moment.
+ a category like the recently created Category:Images from NPGallery to check on all files that can be removed when the actual content categories have been added manually. TheImaCow (talk) 19:23, 3 August 2024 (UTC)[reply]
@DaxServer I created Category:Images from USACE+subcats for the individual collections + Category:Images from USACE to check.
I noted that above you mentioned "JPEG is presumed to be available for all files."
This appears to be correct. However, a large portion of the media are PDF files/scanned documents; requesting a JPG for these would upload only the title page, which is obviously nonsense. I am not sure how to get the PDF in the way described above, as I can't put .pdf into the .<format> part, nor is PDF listed among the available file types in info.json.
However, adding /api/ and appending /download to the URL downloads the proper PDF (example: https://usace.contentdm.oclc.org/digital/api/collection/p16021coll7/id/22887/download), but there might be a better way.
(Files where the manifest states "file type" "pdf" should not be uploaded as JPG.) TheImaCow (talk) 20:08, 3 August 2024 (UTC)[reply]
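The rule in the comment above (PDF records get the PDF, everything else gets the IIIF JPEG) might look like this in code. It is a sketch only: the function names are invented, and file_type is assumed to be the "file type" value already read from the item's manifest or metadata.

```python
BASE = "https://usace.contentdm.oclc.org"


def pdf_download_url(collection_id: str, record_id: int) -> str:
    """PDF download using the /digital/api/... /download pattern described above."""
    return f"{BASE}/digital/api/collection/{collection_id}/id/{record_id}/download"


def jpeg_download_url(collection_id: str, record_id: int) -> str:
    """Full-size JPEG through the IIIF image endpoint."""
    return f"{BASE}/digital/iiif/2/{collection_id}:{record_id}/full/max/0/default.jpg"


def pick_download_url(collection_id: str, record_id: int, file_type: str) -> str:
    """Return the PDF link for PDF records, otherwise the IIIF JPEG link."""
    if file_type.strip().lower() == "pdf":
        return pdf_download_url(collection_id, record_id)
    return jpeg_download_url(collection_id, record_id)
```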
Of course, yes. I was only referring to images and probably should have said "presumed to be available for all images" instead 😅 -- DaxServer (talk) 10:02, 4 August 2024 (UTC)[reply]
BTW, do you want to create a {{USACE file}} or {{USACE work}} or something of the sort for these PDFs? Additionally, {{USACE image}} could just invoke the former with an "image" argument so the templating is not duplicated. -- DaxServer (talk) 10:04, 4 August 2024 (UTC)[reply]
Here's what I'll use:
FWIW their API needs some overhaul into fixing things and streamlining -- DaxServer (talk) 13:41, 4 August 2024 (UTC)[reply]
I added an "image=yes/no" parameter to the template where it either states "This document..." or "This image...". (And "This work..." if the parameter is missing) Can be added based on filetype, but I don't think that it really matters. TheImaCow (talk) 14:39, 4 August 2024 (UTC)[reply]
I've downloaded the entire metadata into my OpenRefine. I think this would enable us to strategize the work.
Would you be able to devise a schema for the {{Information}} so that we can add as much information as we can? I'm also studying the structure and can share my thoughts once I process them. -- DaxServer (talk) 19:08, 4 August 2024 (UTC)[reply]
Is this file under PD? The rights ask to consult the West Point Museum - https://usace.contentdm.oclc.org/digital/collection/p15141coll5/id/505 -- DaxServer (talk) 10:43, 6 August 2024 (UTC)[reply]
Yes, as it is from 1847, 177 years old. Its creator died in 1889, so {{PD-art-100-expired}} applies. Are there more items which are not labeled as PD? TheImaCow (talk) 11:01, 6 August 2024 (UTC)[reply]
There are a few, ~30 or so, that are deemed restrictive and do not provide access to content. Ex - https://usace.contentdm.oclc.org/digital/collection/p16021coll6/id/1100 -- DaxServer (talk) 13:05, 6 August 2024 (UTC)[reply]

I propose to use {{Photograph}} for images/{{Book}} for PDFs instead of {{Information}}

USACE to commons/examples
Label to Commons Template

| ContentDM label | {{Photograph}} value | {{Book}} value |
| --- | --- | --- |
| Title | Title= | Title= |
| Alternative title | X | Subtitle= |
| Description | description= | description= |
| Sub-collection | collection= | collection= |
| Organizational author/Organizational creator | author= | author= |
| Digital Publisher | publisher= | X |
| Publisher | X | publisher= |
| Local place, State/Province, Country | depicted=<Local place>, <State/Province>, <Country> | city= |
| Subject | other_fields_2={{Information field\|name=Subject\|value=''XXXX<br>XXXX''}} | other_fields={{Information field\|name=Subject\|value=''XXXX<br>XXXX''}} |
| Keywords | can be omitted | can be omitted |
| Notes | notes= | X |
| Physical location | department= | X |
| Physical description | medium= | X |
| Document location/Format extend/File Type/File size | irrelevant for us | irrelevant for us |
| Resolution | other_fields_3={{Information field\|name=Resolution\|value=''XXXX''}} | other_fields_3={{Information field\|name=Resolution\|value=''XXXX''}} |
| Data entered by/Rights/Contributed by/Disposition/CONTENTdm number/CONTENTdm file name | redundant | redundant |
| Record number | X redundant, not an actual record number | X |
| Use credits | Example | Example |
| Report type | X | genre= |
| Publisher | X | publisher= |
| Date created(img)/published(book) | date= | date= |
| Date digitized | other_fields={{Information field\|name=Date digitized\|value={{date\|XXXX}}}} | other_fields={{Information field\|name=Date digitized\|value=''{{date\|XXXX}}''}} |
| Location | not existent (?) | not useful |
| Language | X | language= |
| Personal creator | photographer= | X |
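To make the mapping concrete, here is a minimal sketch that renders the simple one-to-one rows for {{Photograph}}. The dictionary keys are assumed to match the ContentDM labels in the metadata export (the real export may use different keys), render_photograph is a made-up name, and the rows that need {{Information field}} or a credit line are left out. Values containing "|" or "=" would need escaping, which is not handled here.

```python
# ContentDM label -> {{Photograph}} parameter, simple one-to-one rows only.
PHOTOGRAPH_FIELDS = {
    "Title": "Title",
    "Description": "description",
    "Sub-collection": "collection",
    "Organizational author": "author",
    "Digital Publisher": "publisher",
    "Notes": "notes",
    "Physical location": "department",
    "Physical description": "medium",
    "Date created": "date",
    "Personal creator": "photographer",
}


def render_photograph(record: dict) -> str:
    """Build {{Photograph}} wikitext from the fields present in one record."""
    lines = ["{{Photograph"]
    for label, param in PHOTOGRAPH_FIELDS.items():
        value = record.get(label, "").strip()
        if value:  # skip empty or missing fields
            lines.append(f" |{param} = {value}")
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_photograph({
        "Title": "117 Cameron destroyed buildings",
        "Personal creator": "Cameron, Harry F.",
    }))
```

A {{Book}} variant would follow the same shape with its own label-to-parameter dictionary.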

Image

== Summary ==
117 Cameron destroyed buildings
Photographer: Cameron, Harry F.
Title: 117 Cameron destroyed buildings
Publisher: United States. Army. Corps of Engineers. Office of History
Description: Montebourg, Normandy [destroyed buildings].
Depicted place: Montebourg (France), Normandy, France
Subject:
  World War, 1939-1945
  Buildings
  War damage
  Villages
Date: 1944
Medium: 2" x 2" color 35 mm slides in cardboard mounts
Current location: Cameron Collection, Rows 1-2
Accession number: This image was released by the U.S. Army Corps of Engineers, the military engineering branch of the United States Army. This tag does not indicate the copyright status of the attached work. A normal copyright tag is still required. See Commons:Licensing.
Notes: Col. Harry F. Cameron, Jr., commanded the 164th Engineer Battalion in Europe during World War II. Title created by staff.
Resolution: 600 dpi
Source: U.S. Army Corps of Engineers Digital Library
Date digitized: 2023

Book

== Summary ==
Carolina Power and Light Company, Mayo Electric Generating Plant: Final environmental impact statement
Title: Carolina Power and Light Company, Mayo Electric Generating Plant: Final environmental impact statement
Publisher: US Army, Corps of Engineers, Wilmington District
Genre: Environmental impact statement
Language: eng
Subject:
  Coal-fired power plants
  Environmental impact statements
  Electric power-plants
Publication date: September 1978
Accession number: This document was released by the U.S. Army Corps of Engineers, the military engineering branch of the United States Army. This tag does not indicate the copyright status of the attached work. A normal copyright tag is still required. See Commons:Licensing.
Place of publication: North Carolina
Resolution: Bitonal 1 bit/600 dpi, Greyscale 8 bit/300 dpi, Color 24 bit/300 dpi
Source: U.S. Army Corps of Engineers Digital Library
Date digitized: 2021

Dates need to be converted from YYYY-MM-DD to YYYY|MM|DD for the date template, multiple subjects need to be converted from XXXX;YYYY to XXXX<br>YYYY
Hope I didn't forget anything important. TheImaCow (talk) 21:38, 4 August 2024 (UTC)[reply]
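The two conversions mentioned above could be sketched like this. The function names are hypothetical; values that do not look like YYYY, YYYY-MM or YYYY-MM-DD (for example "September 1978") are passed through unchanged, which is an assumption about how such cases should be handled.

```python
def to_date_template(value: str) -> str:
    """Turn 'YYYY-MM-DD' (or 'YYYY-MM', 'YYYY') into '{{date|YYYY|MM|DD}}'."""
    parts = value.strip().split("-")
    if len(parts) > 3 or not all(p.isdigit() for p in parts):
        return value  # not a plain numeric date, leave untouched
    if len(parts[0]) != 4:
        return value
    if len(parts) > 1 and not 1 <= int(parts[1]) <= 12:
        return value
    if len(parts) > 2 and not 1 <= int(parts[2]) <= 31:
        return value
    return "{{date|" + "|".join(parts) + "}}"


def subjects_to_br(value: str) -> str:
    """Turn 'XXXX;YYYY' into 'XXXX<br>YYYY', dropping empty entries."""
    items = [s.strip() for s in value.split(";") if s.strip()]
    return "<br>".join(items)


if __name__ == "__main__":
    print(to_date_template("2021-03-15"))
    print(subjects_to_br("World War, 1939-1945; Buildings; War damage"))
```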

Here are the contentTypes:
  • application/octet-stream 82
  • application/pdf 29550
  • application/url 487
  • audio/mpeg 1
  • image/jp2 17124
  • image/jpeg 5
  • image/tiff 141
  • restricted 31
  • video/mp4 45
Some notes and questions:
  • The octet-stream contents are mainly ppt/x, doc/x, exe, pdf, zip. I think I'll skip these; if necessary, they can be uploaded manually.
  • Here are the URL types: https://usace.contentdm.oclc.org/digital/collection/p16021coll11/id/3967
  • For TIFF files, I'll upload them as they're lossless and the community seems to prefer them
  • Do you know which info template to use for video instead of {{Photograph}} ?
  • Do we just note the photographer and/or publisher as text, or do we create Creator: and/or Institution: namespace entries and use those to enrich the info?
  • Do we put the {{USACE image}} template in accession number field?
-- DaxServer (talk) 13:32, 6 August 2024 (UTC)[reply]
Since jp2 format is not supported on Commons, I'll use IIIF download to fetch jpeg version -- DaxServer (talk) 14:05, 6 August 2024 (UTC)[reply]
Actually, all of the images are provided in TIFF, I'll upload them instead. -- DaxServer (talk) 15:07, 6 August 2024 (UTC)[reply]
I would leave out everything that is not an image or PDF: audio, URL, exe, "restricted", and videos too. Creating creator templates for authors would be very good; is there a way to show the authors with the most files? (Probably everything over 100 or so is something we can create a template for.)
USACE image template: I picked the accession number field because I think it fits best. The objects don't have an identifier other than the two IDs assigned by the record management system, which are then present as part of the template.
TIF issue: I compared some 15 objects from various collections, JPG vs. TIF, using both GIMP and Windows Photo Viewer, and the image quality is always identical, pixel by pixel. See https://i.imgur.com/iowJb9X.png for a quick comparison, left JPG, right TIF. (The files used are p15141coll5/3453 and p15141coll5/10765.)
The TIFs here are in no way better quality, therefore we don't lose anything by using the JPGs. (There is apparently consensus not to create JPG duplicates, but nothing against uploading only JPGs.) TheImaCow (talk) 22:35, 6 August 2024 (UTC)[reply]
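Put together, the handling per content type discussed so far could be expressed as a small lookup. This is a sketch under the assumption that the suggestion above is adopted (PDFs as PDF, all image types as JPEG via IIIF, everything else skipped); the names plan_for and summarize are made up, and the counts are the ones from the list above.

```python
from collections import Counter

# Counts per content type as listed above.
CONTENT_TYPE_COUNTS = {
    "application/octet-stream": 82,
    "application/pdf": 29550,
    "application/url": 487,
    "audio/mpeg": 1,
    "image/jp2": 17124,
    "image/jpeg": 5,
    "image/tiff": 141,
    "restricted": 31,
    "video/mp4": 45,
}


def plan_for(content_type: str) -> str:
    """Decide what to do with a record based on its content type."""
    ct = content_type.strip().lower()
    if ct == "application/pdf":
        return "pdf"    # upload the original PDF
    if ct.startswith("image/"):
        return "jpeg"   # fetch a JPEG via the IIIF download URL
    return "skip"       # audio, video, URLs, octet-stream, restricted


def summarize(counts: dict) -> dict:
    """Total number of records per planned action."""
    totals = Counter()
    for content_type, n in counts.items():
        totals[plan_for(content_type)] += n
    return dict(totals)


if __name__ == "__main__":
    print(summarize(CONTENT_TYPE_COUNTS))
```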
Here are the most used "Personal creator"s:
  • Cameron, Harry F. - 315
  • Rowland, Chester A. - 140
  • Jordan, Jonas - 83
  • Wood - 38
  • Ryan, Robert H. - 31
  • Knuppel, Lee - 26
  • O'Sullivan, Timothy H. - 26
  • Boswell, Ray - 17
  • Majors - 14
  • Wu, Andy - 14
  • Garver, Cpt. - 11
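A ranking like the one above can be produced from the exported metadata with a counter. This is only a sketch: it assumes each record is a dictionary with a "Personal creator" key (the actual export may be shaped differently), and creators_over is a hypothetical helper name.

```python
from collections import Counter


def creators_over(records: list, threshold: int = 100) -> list:
    """Return (creator, file count) pairs with at least `threshold` files, largest first."""
    counts = Counter(
        r["Personal creator"] for r in records if r.get("Personal creator")
    )
    return [(name, n) for name, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [{"Personal creator": "A"}] * 3 + [{"Personal creator": "B"}] * 2 + [{}]
    print(creators_over(sample, threshold=2))
```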
Re TIFs, I'll upload the JPEGs instead per your observation. -- DaxServer (talk) 13:33, 7 August 2024 (UTC)[reply]
For the depicted places, Local place and State can have multiple values. Ex: https://usace.contentdm.oclc.org/digital/collection/p16021coll6/id/1918 How should we approach that? -- DaxServer (talk) 13:47, 7 August 2024 (UTC)[reply]
The only creator I could find anything about is Timothy H. O'Sullivan, who was a known/notable photographer; the template already exists at {{Creator:Timothy H. O'Sullivan}}. Unfortunately I couldn't find anything reliable about any of the other people. Sometimes people with the same name were apparently involved in e.g. the Vietnam War, but their photos are from WW2 and there is no mention of them being in WW2, and similar situations.
Different states: I don't think that this is very common, and since it is on a PDF, I think we can ignore the location then. I don't think it's worth creating a special case for that.
What I noticed on the same file is the very long title. Commons file names can be up to 240 bytes long; this title would have ~3600 bytes. However, this file has a much shorter "alternate title" field, "River & harbor annual reports, 1883-1892". We should use the "alternate title" as the main title/file name here, and the "title" as "|description=" (as there is no description otherwise either). If there are overly long file names/titles without an alternate title, we should cap the title with "..." or so. TheImaCow (talk) 08:15, 8 August 2024 (UTC)[reply]
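The title rule proposed above could be sketched as follows. The helper names are invented, and the 240-byte budget is applied to the title alone, so room for the file extension or an ID suffix would still have to be reserved by the caller.

```python
MAX_FILENAME_BYTES = 240


def truncate_utf8(text: str, max_bytes: int, suffix: str = "...") -> str:
    """Cap text at max_bytes of UTF-8, ending in suffix, without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    budget = max_bytes - len(suffix.encode("utf-8"))
    # errors="ignore" drops a multi-byte character that was cut in half.
    cut = encoded[:budget].decode("utf-8", errors="ignore")
    return cut.rstrip() + suffix


def choose_title_and_description(title: str, alternate_title: str = "", description: str = ""):
    """Prefer the alternate title when the main title is too long for a file name."""
    too_long = len(title.encode("utf-8")) > MAX_FILENAME_BYTES
    if too_long and alternate_title:
        # The long title moves to the description if none exists.
        return alternate_title, description or title
    return truncate_utf8(title, MAX_FILENAME_BYTES), description
```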
What needs to be done with the Use credits? I didn't get what you meant by Example -- DaxServer (talk) 11:53, 8 August 2024 (UTC)[reply]
Oops, "example" was the placeholder when generating the table preset using the source editor, and I forgot to replace it.
The "credit_line=" parameter in the photograph/book templates should be used for "Use credit" values. TheImaCow (talk) 23:16, 8 August 2024 (UTC)[reply]
The location seems pretty complicated and not just a simple combination of <Local place> <State> <Country>. It requires significant effort to clean up and streamline. Do you have any thoughts on how to avoid that? -- DaxServer (talk) 07:42, 13 August 2024 (UTC)[reply]
I would suggest using three separate fields for Place/State/Country then. However, the Artwork template supports only up to four custom fields (using the {{Information field}}/other_fields/other_fields_1/2/3 parameters). Three of these custom parameters are already used (Date digitized, resolution, subjects), so either we add support for more custom parameters to {{Artwork}} (no clue how to), or we limit ourselves to using only the "Local place" value (using the other_fields_1= parameter, which has not been used so far).
It appears that the Country/State values can be easily inferred from the local place value anyway. ~TheImaCow (talk) 10:03, 13 August 2024 (UTC)[reply]
@TheImaCow Unfortunately I won't be able to work on the Place/State/Country field. That would be up to someone else to take up. If that is okay with you, let me know and I can conclude the templating and do a test run. -- DaxServer (talk) 07:42, 30 August 2024 (UTC)[reply]
Yes, then we'll just continue without it. ~TheImaCow (talk) 11:01, 30 August 2024 (UTC)[reply]
I did a test run. Can you check - Special:ListFiles/CuratorBot -- DaxServer (talk) 19:19, 1 September 2024 (UTC)[reply]
Looks good. There is this file, which is watermarked and noted "Contact A&M for usage rights.". Searching for files where there is something noted about "usage rights" only returns this file and this image, which should be excluded.
Otherwise, metadata, categories, etc. looks really good. ~TheImaCow (talk) 20:20, 1 September 2024 (UTC)[reply]
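Excluding such records before upload could be done with a simple text check. This is a sketch that assumes each record is a dictionary of metadata values and that the phrase "usage rights" is a reliable marker, as the search mentioned above suggests; the function names are hypothetical.

```python
def mentions_usage_rights(record: dict) -> bool:
    """True if any metadata value of the record mentions 'usage rights'."""
    return any("usage rights" in str(value).lower() for value in record.values())


def uploadable(records: list) -> list:
    """Keep only the records without a usage-rights note."""
    return [r for r in records if not mentions_usage_rights(r)]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"Title": "Dam overview", "Notes": "Title created by staff."},
        {"Title": "Campus map", "Notes": "Contact A&M for usage rights."},
    ]
    print([r["Title"] for r in uploadable(sample)])
```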
Is there something specific to add in SDC? Other bots seem to add quite some basic statements, so we don't have to do that ourselves. -- DaxServer (talk) 20:44, 1 September 2024 (UTC)[reply]
Are there any inferred categories that can be added during the upload? (asked at the bot request) -- DaxServer (talk) 07:36, 2 September 2024 (UTC)[reply]
Re SDC, I don't really know much about that. We could probably add something like "copyright public domain", "filetype image/jpg", or similar, but since there are already other bots doing that everywhere, I personally don't think it's worth the effort.
Re categories, I replied there.
Re language template (at the bot request), the values we should place inside {{EN|1=XXXX}} are Description, Subjects, Notes, Title, Collection, Author, Publisher, Genre. ~TheImaCow (talk) 19:47, 2 September 2024 (UTC)[reply]
Subject: English: Strategy, Sustainability, Planning

Subject: English: Strategy; Sustainability; Planning

Subject:
  English: Strategy
  Sustainability
  Planning

Which format would you prefer for the subject? -- DaxServer (talk) 07:48, 3 September 2024 (UTC)[reply]

I'd say the second one (with ";") ~TheImaCow (talk) 00:13, 5 September 2024 (UTC)[reply]
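With the ";" format chosen, the subject value could be normalised and wrapped in the language template like this. The helper names are made up, and the raw subjects are assumed to arrive as one semicolon-separated string from the metadata.

```python
def en(value: str) -> str:
    """Wrap a value in the English language template."""
    return "{{EN|1=" + value + "}}"


def format_subjects(raw: str) -> str:
    """Normalise 'A;B; C' to 'A; B; C' and wrap it in {{EN|1=...}}."""
    items = [s.strip() for s in raw.split(";") if s.strip()]
    return en("; ".join(items))


if __name__ == "__main__":
    print(format_subjects("Strategy;Sustainability; Planning"))
```

The same en() wrapper would apply to the other fields named above (Description, Notes, Title, Collection, Author, Publisher, Genre).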

The current state is very good. However, it looks like uploads of the "Photographs" category have stopped at around the letter "R" ([1]); a few thousand images are still missing there. ~TheImaCow (talk) 16:50, 12 October 2024 (UTC)[reply]

Opinions

| Assigned to | Progress | Bot name | Category |
| --- | --- | --- | --- |
| DaxServer (talk · contribs) | In progress | CuratorBot (talk · contribs) Commons:Bots/Requests/CuratorBot (3) Task #3 | Category:Images from USACE |