Commons:Batch uploading/U.S. Army Corps of Engineers Digital Visual Library
Source to upload from
U.S. Army Corps of Engineers Digital Visual Library - https://usace.contentdm.oclc.org/digital/search
Provides almost 50,000 media files by the Corps of Engineers in the following subjects:
- About USACE (405 files)
- Booklets, Manuals & Guides (2,109 files)
- Fish & Wildlife reports (883 files)
- Histories (309 files)
- IWR reports (1,231 files)
- Laws & Congressional reports (1,302 files)
- Magazines & Newsletters (148 files)
- Maps & Drawings (1,373 files)
- Media (438 videos, already uploaded to YouTube)
- Photographs (17,022 files)
- Project management reports (12,075 files)
- Public Notices and Jurisdictional Determination Forms (5,310 files)
- Regulatory Information (438 files)
- Technical reports (5,378 files)
License
Public domain - {{PD-USGov-Military-Army-USACE}}
Description
- Do the media URLs follow a pattern? - site is based on the "CONTENTdm® Digital Collection Management System"
- Description page:
https://usace.contentdm.oclc.org/digital/collection/<collection ID>/id/<ContentDM record ID>/rec/<running number for related records>
- Full size download link
https://usace.contentdm.oclc.org/digital/download/collection/<collection ID>/id/<ContentDM record ID>/size/full
- Description pages contain plenty of metadata such as title, caption, sub-collection, dates published/digitized, subject keywords, general keywords, physical location/description, "Data entered by"/uploader, language, usage rights (always PD)/use credits (not mandatory), record number/ContentDM record number, which should be included
- Does the site have an API? - I don't know
- What else could ease uploading?
- Unrelated general thing I noted is that the majority of files have a very recent digitisation date (after 2020)
- Did you contact the site owner? Not necessary
- Is there a template that could be used on the file description pages, or should one be created?
- Something like {{LOC-image}} could be created
TheImaCow (talk) 09:31, 3 August 2024 (UTC)
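The two URL patterns above can be sketched as small helpers. This is only an illustration: the function names are made up here, and the optional `/rec/<running number>` suffix of the description page is left out, on the assumption that the page also resolves without it.

```python
BASE = "https://usace.contentdm.oclc.org"  # site root taken from the patterns above


def description_url(collection_id, record_id):
    """Description page for one record (hypothetical helper, /rec/ suffix omitted)."""
    return f"{BASE}/digital/collection/{collection_id}/id/{record_id}"


def full_size_url(collection_id, record_id):
    """Full size download link for one record (hypothetical helper)."""
    return f"{BASE}/digital/download/collection/{collection_id}/id/{record_id}/size/full"


print(description_url("p15141coll5", 505))
print(full_size_url("p15141coll5", 505))
```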
- IIIF (JSON) API:
- Collection Manifest:
https://usace.contentdm.oclc.org/iiif/2/<collection ID>/manifest.json
- Item Manifest:
https://usace.contentdm.oclc.org/iiif/2/<collection ID>:<ContentDM record ID>/manifest.json
- For better annotated description metadata indexed by unique identifier keys:
https://usace.contentdm.oclc.org/digital/api/collections/<collection ID>/items/<ContentDM record ID>/false
- Item info:
https://usace.contentdm.oclc.org/iiif/2/<collection ID>:<ContentDM record ID>/info.json
- Download:
https://usace.contentdm.oclc.org/digital/iiif/2/<collection ID>:<ContentDM record ID>/full/max/0/default.<format>
with <format> offerings specified in the Item info API above. JPEG is presumed to be available for all files.
- -- DaxServer (talk) 18:14, 3 August 2024 (UTC)
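As a rough sketch, the IIIF endpoints listed above could be assembled as follows. The helper names are invented for this illustration. `offered_formats` assumes that `info.json` follows the usual IIIF Image API 2 layout, where `profile` is a list whose later entries may carry a `formats` list; this has not been checked against the site's actual responses.

```python
BASE = "https://usace.contentdm.oclc.org"


def iiif_urls(collection_id, record_id, fmt="jpg"):
    """Build the IIIF URLs from the patterns listed above (hypothetical helper)."""
    item = f"{collection_id}:{record_id}"
    return {
        "collection_manifest": f"{BASE}/iiif/2/{collection_id}/manifest.json",
        "item_manifest": f"{BASE}/iiif/2/{item}/manifest.json",
        "info": f"{BASE}/iiif/2/{item}/info.json",
        "download": f"{BASE}/digital/iiif/2/{item}/full/max/0/default.{fmt}",
    }


def offered_formats(info):
    """Collect format names from an already parsed info.json dict.

    Assumption: IIIF Image API 2 style 'profile' (a URI string followed by
    optional objects with a 'formats' list).
    """
    profile = info.get("profile", [])
    if isinstance(profile, str):
        profile = [profile]
    formats = []
    for entry in profile:
        if isinstance(entry, dict):
            formats.extend(entry.get("formats", []))
    return formats
```

A caller would fetch `iiif_urls(...)["info"]`, parse the JSON, and only request a `<format>` that `offered_formats` returns.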
- @TheImaCow Could you create the template based on the LOC-image as you suggested? Do you have any suggestions on the categorization schema? I suppose the files go into a specific collection category. Apart from this, are there other categories?
- I can start working on this and file for a bot task approval when ready. -- DaxServer (talk) 18:17, 3 August 2024 (UTC)
- This looks very good, thanks for your work already!
- I created {{USACE image}} based on similar templates, where parameter 1 is the <collection ID> and parameter 2 the <ContentDM record ID>.
- I think categorisation can be done like the USGov FEMA import from 15 years ago, with categories mirroring the collection ID (see the subcats of Category:Images from FEMA called "Images from FEMA, category XXX"). For example, we could use something like "Images from USACE, Fish & Wildlife reports collection". I will create such categories in a moment.
- + a category like the recently created Category:Images from NPGallery to check on all files that can be removed when the actual content categories have been added manually. TheImaCow (talk) 19:23, 3 August 2024 (UTC)
- @DaxServer I created Category:Images from USACE+subcats for the individual collections + Category:Images from USACE to check.
- I noted that above you mentioned "JPEG is presumed to be available for all files."
- This appears to be correct. However, a large portion of the media are PDF files/scanned documents, where requesting a JPG would upload only the title page, which is obviously nonsense. I am not sure how to get the PDF in the way described above, as I can't put .pdf into the .<format>, nor is PDF listed under the available file types in info.json
- However adding /api/ and appending /download to the URL downloads the proper PDF
https://usace.contentdm.oclc.org/digital/api/collection/p16021coll7/id/22887/download
- for example, but there might be a better way. (Files where the manifest states "file type" "pdf" should not be uploaded as JPG.) TheImaCow (talk) 20:08, 3 August 2024 (UTC)
- Of course, yes. I was only referring to images and probably should have said "presumed to be available for all images" instead 😅 -- DaxServer (talk) 10:02, 4 August 2024 (UTC)
- BTW, do you want to create a {{USACE file}} or {{USACE work}} or something of the sort for these PDFs? Additionally, {{USACE image}} could just invoke the former with an "image" argument so the templating is not duplicated. -- DaxServer (talk) 10:04, 4 August 2024 (UTC)
- Here's what I'll use:
- Collection API - https://usace.contentdm.oclc.org/digital/api/search/collection/p16021coll7/page/2/maxRecords/50
- Item API - https://usace.contentdm.oclc.org/digital/api/collections/p16021coll7/items/22884/false
- Download API - https://usace.contentdm.oclc.org/digital/api/collection/p16021coll7/id/22884/download
- FWIW their API needs an overhaul to fix things and streamline it -- DaxServer (talk) 13:41, 4 August 2024 (UTC)
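The three endpoints above can be parameterised as in the sketch below. The function names and the `page_count` helper are assumptions added for illustration; how the search response reports its total number of records is not stated in this thread, so the total is passed in by the caller.

```python
API = "https://usace.contentdm.oclc.org/digital/api"


def search_page_url(collection_id, page, max_records=50):
    """Collection API: one page of search results."""
    return f"{API}/search/collection/{collection_id}/page/{page}/maxRecords/{max_records}"


def item_url(collection_id, record_id):
    """Item API: metadata for a single record."""
    return f"{API}/collections/{collection_id}/items/{record_id}/false"


def download_url(collection_id, record_id):
    """Download API: the original file (for example the PDF)."""
    return f"{API}/collection/{collection_id}/id/{record_id}/download"


def page_count(total_records, max_records=50):
    """Number of pages needed to cover total_records (ceiling division)."""
    return -(-total_records // max_records)


# Example: the URLs of every page for a collection with 120 records.
pages = [search_page_url("p16021coll7", p) for p in range(1, page_count(120) + 1)]
```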
- I added an "image=yes/no" parameter to the template where it either states "This document..." or "This image...". (And "This work..." if the parameter is missing) Can be added based on filetype, but I don't think that it really matters. TheImaCow (talk) 14:39, 4 August 2024 (UTC)
- I've downloaded the entire metadata into my OpenRefine. I think this would enable us to strategize the work.
- Would you be able to devise a schema for the {{Information}} so that we can add as much information as we can? I'm also studying the structure and can share my thoughts once I process them. -- DaxServer (talk) 19:08, 4 August 2024 (UTC)
- Is this file under PD? The rights ask to consult the West Point Museum - https://usace.contentdm.oclc.org/digital/collection/p15141coll5/id/505 -- DaxServer (talk) 10:43, 6 August 2024 (UTC)
- Yes, as it is from 1847, 177 years old. Its creator died in 1889, so {{PD-art-100-expired}} applies. Are there more items which are not labeled as PD? TheImaCow (talk) 11:01, 6 August 2024 (UTC)
- There are a few, ~30 or so, that are deemed restrictive and do not provide access to content. Ex - https://usace.contentdm.oclc.org/digital/collection/p16021coll6/id/1100 -- DaxServer (talk) 13:05, 6 August 2024 (UTC)
I propose to use {{Photograph}} for images/{{Book}} for PDFs instead of {{Information}}
USACE to commons/examples
[Table: example file description pages for "Image" and "Book", each beginning with "== Summary =="]
Dates need to be converted from YYYY-MM-DD to YYYY|MM|DD for the date template, multiple subjects need to be converted from XXXX;YYYY to XXXX<br>YYYY
Hope I didn't forget anything important.
TheImaCow (talk) 21:38, 4 August 2024 (UTC)
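A minimal sketch of the two conversions mentioned above, assuming the source values arrive as plain strings. The function names are made up for this example, and values that do not match the YYYY-MM-DD shape (such as a year range) are passed through unchanged rather than guessed at.

```python
import re


def date_template_args(value):
    """Turn 'YYYY-MM-DD' into 'YYYY|MM|DD'; leave any other value untouched."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return value.replace("-", "|")
    return value


def subjects_to_lines(subjects):
    """Turn 'XXXX;YYYY' into 'XXXX<br>YYYY', trimming stray whitespace."""
    parts = [part.strip() for part in subjects.split(";")]
    return "<br>".join(part for part in parts if part)


print(date_template_args("2021-03-15"))
print(subjects_to_lines("Strategy; Sustainability;Planning"))
```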
- Here are the contentTypes:
- application/octet-stream 82
- application/pdf 29550
- application/url 487
- audio/mpeg 1
- image/jp2 17124
- image/jpeg 5
- image/tiff 141
- restricted 31
- video/mp4 45
- Some notes and questions:
- The octet-stream contents are mainly ppt/x, doc/x, exe, pdf, zip. I think I'll skip these, if necessary can be manually uploaded.
- Here are the URL types: https://usace.contentdm.oclc.org/digital/collection/p16021coll11/id/3967
- For TIFF files, I'll upload them as they're lossless and the community seems to prefer them
- Do you know which info template to use for video instead of {{Photograph}} ?
- Do we just note the photographer and/or publisher as text or do we create Creator: and/or Institution: namespace entries and use them to enrich the info?
- Do we put the {{USACE image}} template in accession number field?
- -- DaxServer (talk) 13:32, 6 August 2024 (UTC)
- Since jp2 format is not supported on Commons, I'll use IIIF download to fetch jpeg version -- DaxServer (talk) 14:05, 6 August 2024 (UTC)
- Actually, all of the images are provided in TIFF, I'll upload them instead. -- DaxServer (talk) 15:07, 6 August 2024 (UTC)
- I would leave out everything that is not an image or PDF: audio, URL, exe, "restricted", and videos too. Creating creator templates for authors would be very good, is there a way to show the authors with the most files? (prob. everything over 100 or so is something we can create a template for)
- USACE image template: I picked the accession number field, because I think it fits the best. The objects don't have an identifier other than the two IDs assigned by the record management system - which are then present as part of the template.
- TIF issue: I compared some 15 objects from various collections JPG vs TIF using both GIMP/Windows Photo Viewer, and the image quality is always 1:1 the same, pixel by pixel the same. See https://i.imgur.com/iowJb9X.png for a quick comparison, left JPG right TIF. (files used are p15141coll5/3453 and p15141coll5/10765)
- The TIFs here are in no way better quality, therefore we don't lose anything when using the JPGs. (there is apparently consensus to not create JPG duplicates, but nothing against uploading only JPGs) TheImaCow (talk) 22:35, 6 August 2024 (UTC)
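One possible reading of the suggestions above, written as a filter over the contentType values listed earlier: PDFs are taken as they are, every image type is fetched as JPEG, and everything else is skipped. The function name and the returned labels are assumptions for illustration only; the counts are copied from the contentTypes list.

```python
# Counts copied from the contentTypes list earlier in the thread.
CONTENT_TYPES = {
    "application/octet-stream": 82,
    "application/pdf": 29550,
    "application/url": 487,
    "audio/mpeg": 1,
    "image/jp2": 17124,
    "image/jpeg": 5,
    "image/tiff": 141,
    "restricted": 31,
    "video/mp4": 45,
}


def upload_plan(content_type):
    """Return 'pdf', 'jpeg', or None (skip) for a contentType value."""
    if content_type == "application/pdf":
        return "pdf"
    if content_type.startswith("image/"):
        return "jpeg"  # jp2 and TIFF sources would be fetched as JPEG
    return None


to_upload = sum(n for ct, n in CONTENT_TYPES.items() if upload_plan(ct))
skipped = sum(n for ct, n in CONTENT_TYPES.items() if upload_plan(ct) is None)
print(to_upload, skipped)
```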
- Here are the most used "Personal creator"s:
- Cameron, Harry F. - 315
- Rowland, Chester A. - 140
- Jordan, Jonas - 83
- Wood - 38
- Ryan, Robert H. - 31
- Knuppel, Lee - 26
- O'Sullivan, Timothy H. - 26
- Boswell, Ray - 17
- Majors - 14
- Wu, Andy - 14
- Garver, Cpt. - 11
- Re TIFs, I'll upload the JPEGs instead per your observation. -- DaxServer (talk) 13:33, 7 August 2024 (UTC)
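For reference, a ranking like the one above can be produced with a simple counter. This is a sketch that assumes the "Personal creator" values have already been extracted into a list of strings; the function name and the default threshold of 100 (taken from the "everything over 100 or so" suggestion) are illustrative.

```python
from collections import Counter


def frequent_creators(creators, minimum=100):
    """Return (name, count) pairs with at least `minimum` files, most frequent first."""
    counts = Counter(c.strip() for c in creators if c and c.strip())
    return [(name, n) for name, n in counts.most_common() if n >= minimum]


# Tiny made-up sample; real input would be one entry per file.
sample = ["Cameron, Harry F."] * 3 + ["Jordan, Jonas"] * 1 + [""]
print(frequent_creators(sample, minimum=2))
```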
- For the depicted places, Local place and State can have multiple values. Ex: https://usace.contentdm.oclc.org/digital/collection/p16021coll6/id/1918 How should we approach that? -- DaxServer (talk) 13:47, 7 August 2024 (UTC)
- The only creator I could find anything about is Timothy H. O'Sullivan, who was a known/notable photographer, template already at {{Creator:Timothy H. O'Sullivan}}. Unfortunately I couldn't find anything reliable about any of the other people. Sometimes people with the same name were apparently involved in e.g. the Vietnam War, but their photos are from WW2 and there is no mention of them being in WW2, and similar situations.
- Different states: I don't think that this is very common, and since it is on a PDF, I think we can ignore the location then - don't think it's worth creating a special case for that.
- What I noticed at the same file is the very long title. Commons file names are up to 240 bytes long, this title would have ~3600 bytes. However this file has a much shorter "alternate title" field "River & harbor annual reports, 1883-1892". We should use the "alternate title" as the main title/file name here, and the "title" as "|description=" (as there is no description otherwise either). If there are too long file names/titles without an alternate title, we should cap the title with "..." or so. TheImaCow (talk) 08:15, 8 August 2024 (UTC)
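The title rule described above could look roughly like this. It is a sketch, not the bot's actual code: the function names are invented, the limit is counted in UTF-8 bytes, and the caller would still need to reserve room for the file extension and any record ID added to the name.

```python
MAX_BYTES = 240  # file name limit mentioned above


def cap_title(title, limit=MAX_BYTES, ellipsis="..."):
    """Shorten a title to at most `limit` UTF-8 bytes, ending in '...' if cut."""
    raw = title.encode("utf-8")
    if len(raw) <= limit:
        return title
    budget = limit - len(ellipsis.encode("utf-8"))
    # errors="ignore" drops a multi-byte character that was cut in half
    return raw[:budget].decode("utf-8", errors="ignore").rstrip() + ellipsis


def choose_title(title, alternate_title=None):
    """Return (file title, description).

    If the title is too long and an alternate title exists, the alternate
    title becomes the file title and the full title becomes the description.
    """
    if len(title.encode("utf-8")) > MAX_BYTES and alternate_title:
        return cap_title(alternate_title), title
    return cap_title(title), None
```

With the example file, `choose_title(long_title, "River & harbor annual reports, 1883-1892")` would return the alternate title as the file title and keep the long title for `|description=`.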
- What needs to be done with the Use credits? I didn't get what you meant by Example -- DaxServer (talk) 11:53, 8 August 2024 (UTC)
- Oops, "example" was the placeholder when generating the table preset using the source editor, and I forgot to replace it.
- The "credit_line=" parameter in the photograph/book templates should be used for "Use credit" values. TheImaCow (talk) 23:16, 8 August 2024 (UTC)
- The location seems pretty complicated and not just a simple combination of <Local place> <State> <Country>. It requires significant effort to clean up and streamline. Do you have any thoughts on how to avoid that? -- DaxServer (talk) 07:42, 13 August 2024 (UTC)
- I would suggest using three separate fields for Place/State/Country then. However, the Artwork template supports only up to four custom fields (using the {{Information field}}/other_fields/other_fields_1/2/3 parameters). Three of these custom parameters are already used (date digitized, resolution, subjects), so either we add support for more custom parameters to {{Artwork}} (no clue how to), or we limit ourselves to using only the "Local place" value (using the other_fields_1= parameter, which has not been used so far).
- It appears that the Country/State values can be easily inferred from the local place value anyway. ~TheImaCow (talk) 10:03, 13 August 2024 (UTC)
- @TheImaCow Unfortunately I won't be able to work on the Place/State/Country field. That would be up to someone else to take up. If that is okay with you, let me know and I can conclude the templating and do a test run. -- DaxServer (talk) 07:42, 30 August 2024 (UTC)
- Yes, then we'll just continue without it. ~TheImaCow (talk) 11:01, 30 August 2024 (UTC)
- I did a test run. Can you check - Special:ListFiles/CuratorBot -- DaxServer (talk) 19:19, 1 September 2024 (UTC)
- Looks good. There is this file, which is watermarked and noted "Contact A&M for usage rights.". Searching for files where there is something noted about "usage rights" only returns this file and this image, which should be excluded.
- Otherwise, metadata, categories, etc. looks really good. ~TheImaCow (talk) 20:20, 1 September 2024 (UTC)
- Is there something specific to add in SDC? Other bots seem to add quite some basic statements, so we don't have to do that ourselves. -- DaxServer (talk) 20:44, 1 September 2024 (UTC)
- Are there any inferred categories that can be added during the upload? (asked at the bot request) -- DaxServer (talk) 07:36, 2 September 2024 (UTC)
- Re SDC, I don't really know much about that. We could probably add something like "copyright public domain", "filetype image/jpg", or similar, but since there are already other bots doing that everywhere, I personally don't think it's worth the effort.
- Re categories, I replied there.
- Re language template (at the bot request), the values we should place inside
{{EN|1=XXXX}}
are Description, Subjects, Notes, Title, Collection, Author, Publisher, Genre. ~TheImaCow (talk) 19:47, 2 September 2024 (UTC)
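As an illustration of the list above, a helper could wrap only those fields in the language template. The field names are copied from the comment; the function name and the dict-based field handling are assumptions made for this sketch.

```python
# Fields that should be wrapped in the English language template, per the list above.
EN_FIELDS = {
    "Description", "Subjects", "Notes", "Title",
    "Collection", "Author", "Publisher", "Genre",
}


def wrap_en(field, value):
    """Wrap value in {{EN|1=...}} if the field is in the list and non-empty."""
    if field in EN_FIELDS and value:
        return "{{EN|1=" + value + "}}"
    return value


print(wrap_en("Title", "River & harbor annual reports, 1883-1892"))
```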
Subject | English: Strategy, Sustainability, Planning |
Subject | English: Strategy; Sustainability; Planning |
Subject | English: Strategy Sustainability Planning |
Which format would you prefer for the subject? -- DaxServer (talk) 07:48, 3 September 2024 (UTC)
- I'd say the second one (with ";") ~TheImaCow (talk) 00:13, 5 September 2024 (UTC)
The current state is very good; however, it looks like uploads of the "Photographs" category have stopped at around the letter "R" ([1]), and a few thousand images are still missing there. ~TheImaCow (talk) 16:50, 12 October 2024 (UTC)
Opinions
Assigned to | Progress | Bot name | Category |
---|---|---|---|
DaxServer (talk · contribs) | In progress | CuratorBot (talk · contribs) Commons:Bots/Requests/CuratorBot (3) Task #3 | Category:Images from USACE |