Commons talk:Digital Public Library of America/Modeling


File captions

In addition to the statements, I'd also like to start a discussion about how to format the file captions, which could be added programmatically across all DPLA uploads. My initial thought is that it should be a string formed with some combination of the title, creator, institution and ID (and date?) fields, with some basic punctuation. For example, "{title}" by {creator}, from {institution} (DPLA ID {id}). Is there a good standard to follow here? Dominic (talk) 14:45, 16 June 2021 (UTC)

  • FRomeo (WMF) (talk) 12:49, 15 July 2021 (UTC): I would prefer that file captions on Commons were descriptive of the content of the image. Some of the items on https://dp.la/ are exactly like this. For example, this item's description is "Two sections of a cream-colored silk valance with scalloped edge and silk fringe border. Embroidered with delicate naturalistic flowers and tendrils tied into bows. In pale greens, blues, golds, and pinks." Are you able to use descriptions when they're available? Looking at the other metadata on DPLA, the most descriptive fields that aren't description are title, format, and subjects. Given that subjects will probably be used for depicts, would some combination of title and format work? E.g. For this item, the title is "First successful dirigible, 1883" and the format is "Trade cards. Cigarette cards." Taken together, that's a reasonable description of the image. But I understand that the quality and extent of metadata varies by partner.
    • @FRomeo (WMF): We have many (most?) DPLA items that lack the description field, and even subjects are not universal, but I take your point, so maybe we can have one approach for those that do have these types of more descriptive fields, and then do something like I said above just as a fallback when needed. One question, though. Something Carly said that stuck with me is that MediaSearch favors specific fields (namely file caption, depicts, and unstructured text), but adding more granular SDC statements might end up hurting discoverability if certain fields (e.g. title) aren't factored into the weighting. My takeaway then was that if we want a file to be found by someone who is searching on basic terms found in the title or creator (but not depiction), then we should put those terms in the caption. Do we lose any discoverability if we use the narrative description like the one in your example? Dominic (talk) 13:45, 15 July 2021 (UTC)
      • @Dominic: It is true that depicts and file caption are weighted more heavily in search results. However, adding more granular depicts statements shouldn't hurt discoverability in any way, as long as broader depicts statements are also included. Unstructured text is still searched, and that unstructured text includes title and creator. So that information still turns up in search results - just ranked lower than matching information in depicts or file captions. There's no reason that using narrative description would hurt discoverability. Hope that helps - please let me know if you have a follow-up question! CBogen (WMF) (talk) 16:57, 13 September 2021 (UTC)
        • @CBogen (WMF): Thanks! So just to clarify your clarification, what we are actually hoping to do is (1) put all fielded metadata into structured data statements for easier maintenance (not just depicts statements), (2) modify existing templates so they display structured data and end users see the same metadata, and (3) remove the actual plain text from the wiki page that is duplicative of structured data, so we only have to maintain data in one place. Are other structured data fields outside of depicts and caption searched? As long as this data is searched when it's imported by a template using parser functions or Lua, I guess that works. (Though, is it searching all labels in all languages if it's a Wikidata item?) It still seems like you might have issues where, for example, if you put the actual title in a P1476 statement but not the caption, this file would rank below a less relevant file where the words of the title are in the caption without being the title of the work, even though a search term found in a title field should be highly relevant. Dominic (talk) 18:32, 15 September 2021 (UTC)
          • @Dominic: You have it right and this sounds like a good plan to me. The content that you put into templates via SDC will then be indexed as full text and become searchable via MediaSearch, even though SDC fields other than depicts, caption, and digital representation of are not directly indexed right now. If you'd like to see what's indexed on any given file page, you can append ?action=cirrusDump to the file page URL - the "auxiliary_text" field contains the parsed wikitext contents. CBogen (WMF) (talk) 13:20, 29 September 2021 (UTC)
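
A minimal sketch of the caption logic discussed in this thread, offered only as an illustration: prefer the narrative description when a record has one, and otherwise fall back to the "{title}" by {creator}, from {institution} (DPLA ID {id}) pattern proposed above. The record keys (`description`, `title`, `creator`, `institution`, `id`) and the function names are assumptions made for this example, not DPLA's actual field names.

```python
from typing import Optional


def _clean(record: dict, key: str) -> str:
    """Return the field as a stripped string, or '' if missing or empty."""
    return str(record.get(key) or "").strip()


def build_caption(record: dict) -> Optional[str]:
    """Build a file caption from a (hypothetical) flat metadata record."""
    # Preferred: a narrative description of the image content.
    description = _clean(record, "description")
    if description:
        return description

    # Fallback: the fielded pattern, skipping any part that is missing.
    title = _clean(record, "title")
    if not title:
        return None  # nothing useful to say; leave the caption empty

    caption = f'"{title}"'
    creator = _clean(record, "creator")
    if creator:
        caption += f" by {creator}"
    institution = _clean(record, "institution")
    if institution:
        caption += f", from {institution}"
    dpla_id = _clean(record, "id")
    if dpla_id:
        caption += f" (DPLA ID {dpla_id})"
    return caption
```

Skipping missing parts, rather than emitting empty placeholders, is a design choice here because many records lack a creator or description.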

Modeling feedback

described at URL (P973)

URL (P2699)

IIIF manifest URL (P6108)

  • As with "described at", the IIIF manifest is for a whole item. We have this data, but is there a preferred way to add this at the asset-level?
    • It seems just fine. Maybe just add a qualifier with the name of the institution?

DPLA ID (P760)

  • As above, the DPLA ID is an identifier for the whole item, not the specific image that has been uploaded. How can we add it in SDC at the asset-level without confusing the two?

title (P1476)

  • The vast majority of items in DPLA are in English, but language is not specifically spelled out as a field in the data. Is it better to use language code "en" and accept a small error rate, or to apply "und" to all?
    • It should be just in English, so it might be better to use the "en" code.
  • Should this specify in some way that it is the title of the depicted work, and not the digital asset?
    • I think it should only be the title of the item, as it is official. In the Commons' file name, we have the information that it is a page from the document already.
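
For illustration only, here is a sketch of what a title (P1476) statement with the "en" language code might look like in Wikibase's JSON serialization. The helper name `title_statement` is hypothetical, and the default of "en" simply reflects the suggestion above rather than a settled decision.

```python
def title_statement(title: str, language: str = "en") -> dict:
    """Build a title (P1476) statement as monolingual text.

    The language defaults to "en", accepting a small error rate for
    non-English titles, per the suggestion in this thread.
    """
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P1476",
            "datavalue": {
                "value": {"text": title, "language": language},
                "type": "monolingualtext",
            },
        },
        "type": "statement",
        "rank": "normal",
    }
```

Passing `language="und"` instead would give the alternative raised in the question.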

copyright license (P275)

  • Question: How is CC0 handled? Copyrighted or not, and is CC0 a "license"?
    • CC0 is considered a license and, I believe, it should be applied if it was chosen as a license by the institution. I think the same should apply to any other open license, such as No Copyright - United States (Q47530911), for example.
Hi, we have already had quite some discussion about CC0. The community felt that it is best modelled as in Commons:Structured_data/Modeling/Copyright#Cc-zero_license. In short, use a special copyright flag for this, as it is not technically PD but also not considered copyrighted. This model was adopted by both Multichill and myself. Please consider doing the same so we remain consistent. Otherwise, let's continue the discussion on Commons_talk:Structured_data/Modeling --Schlurcher (talk) 07:35, 25 August 2021 (UTC)

Commons media contributed by (P9126)

  • If adding all organizations in the chain, do we use a qualifier to distinguish the source institution from the aggregators (and do we describe DPLA and its hubs differently)?
    • As all institutions are collectively responsible for the item being on Commons, it might be good to list them all in the chain, with qualifiers indicating their roles in it.

page(s) (P304)

  • I am not sure this is the right property, as it is mostly used in Wikidata references, but it seems like an important concept because many of our uploads are only single pages of larger works.
  • Often, the position of a file in the sequence of uploads and the original page number of the scanned page are not the same. E.g., the first upload in a sequence for a book could be the cover, while the actual page "1" might be the 5th file, after the title page, acknowledgements, copyright page, etc.
  • Can we also represent in SDC the number of pages in the work, in addition to the page number of the current file?
    • I'm not totally sure this is completely right either, but I think a solution would be to add page(s) (P304) for the page number in the item and file page (P7668) for the number of the digital file.
      • Here it is important to remember that we are describing the digital uploaded file with SDC. So page(s) (P304) is probably not useful at all, unless as a qualifier on depicts (P180). Likewise, file page (P7668) would probably only be fitting on a file if it was extracted from another file (like a single jpg extracted from a pdf) and then as a qualifier of the source. Ainali (talk) 08:33, 18 July 2021 (UTC)
        • @Ainali: Thanks for this comment! So, it almost seems as if there is not really a good property for exactly this element. It's not file page (P7668), which is about the page within a large file, but rather about the sequence of a particular page (or digital asset) within a larger collection of files that constitute an item. Do you have any recommendation here, or do we need to propose a property (hoping to avoid that!)? Dominic (talk) 19:05, 20 July 2021 (UTC)

author name string (P2093)

  • DPLA does not have a controlled vocabulary around creator entities, so we can only use text strings here. It will be a difficult task to decide how to match to Wikidata items, but we can use the author name string property to start with.
    • This is a solution to the problem, yes. However, I don't think this is the best scenario. The ideal situation would be to work with the text strings to turn them into proper Wikidata items, which would call for a metadata reconciliation and "wikidatification" process.
  • "creator" is used very broadly and variably across DPLA's institutions, in ways that may not match expectations of this property's scope. One particular issue with the National Archives is that "creator" is typically the agency that preserved the record, but not the person that created it (which is sometimes the employee of the agency, but also sometimes a citizen who submitted documents to the government, gave testimony in court/Congress, correspondence/clippings saved by an agency, etc.)
    • In exceptional cases like these, it might be worth handling the affected records separately from the rest, with an upload that moves, for example, the National Archives from creator to the right field/property.


I was not able to answer all of the questions, as we do not have solutions to all of those problems yet. And, of course, better answers and solutions to the questions might appear. Therefore, I'm really looking forward to reading the community feedback.

Question: Is there an intention to include information in the Captions field? If yes, do you have any idea on how to model and add that information? GFontenelle (WMF) (talk) 04:08, 24 June 2021 (UTC)

Sourcing

My current plan is for all of these statements, where applicable, to use determination method (P459) determined by GLAM institution and stated at its website (Q61848113), (e.g. [1]) as a qualifier. When references are (hopefully) added for SDC, we would also add a reference to the catalog URL. But for now, I think this works well enough, since we will already have the described at URL statement somewhere in the data as well.

This will distinguish the statements added by DPLA from those added by the community, which means it would also be what we use for synchronization. We can design the bot to change only statements with this qualifier, and not the others. Since changes made by DPLA would only be to make a statement match the source, it seems like fair game to update any statement with that qualifier. Dominic (talk) 16:18, 13 July 2021 (UTC)
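
As a rough sketch of how a bot might apply this rule, the snippet below splits a file's statements into those carrying the determination method (P459) qualifier with value Q61848113 and everything else. It assumes statements in the standard Wikibase JSON serialization; the function names are made up for this example and are not taken from the DPLA bot.

```python
DETERMINATION_METHOD = "P459"
STATED_BY_GLAM = "Q61848113"  # determined by GLAM institution and stated at its website


def is_dpla_sourced(statement: dict) -> bool:
    """True if the statement carries the P459 -> Q61848113 qualifier."""
    qualifiers = statement.get("qualifiers", {})
    for snak in qualifiers.get(DETERMINATION_METHOD, []):
        if snak.get("snaktype") != "value":
            continue  # somevalue/novalue snaks carry no datavalue
        if snak["datavalue"]["value"].get("id") == STATED_BY_GLAM:
            return True
    return False


def split_statements(statements: list) -> tuple:
    """Return (managed, community): only `managed` is safe for the bot to sync."""
    managed = [s for s in statements if is_dpla_sourced(s)]
    community = [s for s in statements if not is_dpla_sourced(s)]
    return managed, community
```

A synchronization pass would then compare only the `managed` list against the source record and leave the `community` list untouched.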

Next phase of statements

I am wrapping up our first bot run across the entire DPLA set of uploads. As of now, we have added about 4.8 million statements across 1.6 million media files (out of about 2.4 million). The first run added a set of very simple properties: DPLA ID (P760), copyright status (P6216), and RightsStatements.org statement according to source website (P6426), in edits such as this one. For the next phase, I'd like to propose the modeling for the following statements. These encompass what I think of as "medium difficulty" modeling questions, since I think they are all fairly safe to begin based on discussion so far: more complicated than copyright status, but less complicated than ones that require more entity matching (creator and subject, particularly). Please see below:

Statement and comment:

  • source of file: file available on the internet
    • described at URL: https://ark.digitalcommonwealth.org/ark:/50959/z603tb264
    • IIIF manifest URL: https://ark.digitalcommonwealth.org/ark:/50959/z603tb264/manifest
    • URL: https://ark.digitalcommonwealth.org/ark:/50959/z603tb28p/large_image
    • operator: Digital Commonwealth
    • determination method: determined by GLAM institution and stated at its website
    • Comment: This format using P7482 -> Q74228490, with qualifiers for URL types seems to be favored, per Commons:Structured_data/Modeling/Source. See the talk page there for additional DPLA-specific discussion about this format.
  • title: "A Rill from the Town Pump" essay by Sarah (Sallie) M. Field, Abbot Academy, class of 1904
    • determination method: determined by GLAM institution and stated at its website
    • Comment: Will apply "en" lang code, per above discussion.
  • Comment: This type of statement already very commonly established on Commons.
  • Comment: This is specifically for the source institution.
  • Commons media contributed by: Digital Public Library of America
    • object has role: aggregator
  • Commons media contributed by: Ohio Digital Network
    • object has role: aggregator
  • Commons media contributed by: Toledo-Lucas County Public Library
    • object has role: repository
    • Comment: This is how we model this type of statement for the situation where DPLA uploads material that is provided to DPLA from what we call a "service hub", a regional aggregator that harvests from local institutions in an area. In this case, three institutions are listed with their roles. repository is a new Wikidata item created to describe this role (the other "repository" item is in reference to a storage site, but not an organization).
  • Comment: This is how we model this type of statement for a different type of situation, where DPLA uploads material that is provided to DPLA from what we call a "content hub", a large institution that DPLA harvests directly, such as the National Archives or Smithsonian. In this case, there are three values listed with their roles, but the last one is not an independent organization, but the unit which maintains the collection. custodial unit is a new Wikidata item created to describe this role (the general "department" item was not specific to this meaning). Note in these types of content hub situations, the content hub (e.g. National Archives) would also be used for P195, and not the custodial unit.
  • author name string: Department of State. Agency for International Development. 1961-10/1/1979
    • determination method: determined by GLAM institution and stated at its website
    • Comment: This is how we will add all creators in our first pass. In later updates, we would replace a P2093 statement with a specific creator property linking to the Wikidata item for the creator, if it is identified.

Please feel free to discuss if anyone has feedback on any of these proposals. In addition, I expect to add file captions at the same time, if we can get clarity in the discussion at #File captions. I'll post this discussion to a few different places and give people a few days to see if anyone has comments. Dominic (talk) 16:12, 27 August 2021 (UTC)
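
To make the "Commons media contributed by" (P9126) proposal concrete, here is a hedged sketch that turns a chain of organizations into one statement per organization, each with an object has role (P3831) qualifier. The JSON shape follows the general Wikibase serialization; the helper names are hypothetical, and the item IDs in the usage note are deliberately left as placeholders rather than real QIDs.

```python
def item_snak(prop: str, qid: str) -> dict:
    """A value snak pointing at a Wikidata item."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "value": {"entity-type": "item", "id": qid},
            "type": "wikibase-entityid",
        },
    }


def contributed_by(chain: list) -> list:
    """Build one P9126 statement per (organization QID, role QID) pair."""
    statements = []
    for org_qid, role_qid in chain:
        statements.append({
            "mainsnak": item_snak("P9126", org_qid),
            "type": "statement",
            "rank": "normal",
            # P3831 = object has role (aggregator, repository, custodial unit, ...)
            "qualifiers": {"P3831": [item_snak("P3831", role_qid)]},
        })
    return statements
```

For the service hub case above, the chain would have three entries, for example `[(DPLA, AGGREGATOR), (OHIO_DIGITAL_NETWORK, AGGREGATOR), (TOLEDO_LIBRARY, REPOSITORY)]`, where each name stands for the corresponding item ID.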

Thanks for sharing. My only comment is on author name string (P2093), which we currently do not use as a direct statement. I understand you later want to update to a creator property. How about mapping it directly to creator with a qualifier? This is also what is done for Commons contributors without a Wikidata item. It would also make it easier to query these non-Wikidata creators. See Commons:Structured data/Modeling/Author and for example File:Maxent_(35)_Église_09.JPG. The mapping could look like this:

  • creator: somevalue
    • author name string: Department of State. Agency for International Development. 1961-10/1/1979
    • determination method: determined by GLAM institution and stated at its website

Best regards, Schlurcher (talk) 18:05, 30 August 2021 (UTC)
@Schlurcher: Thanks, will implement it this way. I'm hoping to begin making these edits soon. Dominic (talk) 18:10, 15 September 2021 (UTC)
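
The mapping agreed on here could be sketched roughly as follows: a creator (P170) statement whose main value is "somevalue", with the name carried in an author name string (P2093) qualifier and the usual determination method (P459) qualifier. This is only an illustration of the JSON shape under the standard Wikibase serialization; `unmatched_creator` is a hypothetical helper, not the bot's code.

```python
def unmatched_creator(name: str) -> dict:
    """creator (P170) = somevalue, qualified with the name string from the source."""
    return {
        # "somevalue" means: there is a creator, but no Wikidata item is linked yet.
        "mainsnak": {"snaktype": "somevalue", "property": "P170"},
        "type": "statement",
        "rank": "normal",
        "qualifiers": {
            "P2093": [{
                "snaktype": "value",
                "property": "P2093",
                "datavalue": {"value": name, "type": "string"},
            }],
            "P459": [{
                "snaktype": "value",
                "property": "P459",
                "datavalue": {
                    "value": {"entity-type": "item", "id": "Q61848113"},
                    "type": "wikibase-entityid",
                },
            }],
        },
    }
```

When a creator is later matched, the main snak would be switched to a value pointing at the Wikidata item, and the name string qualifier can stay in place as the stated form.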

Structured data claims added to redirect

User:DPLA bot has added structured data claims to File:STS099-734-043 - STS-099 - Earth observation views of Pheonix,Arizona taken from OV-105 during STS-99 - DPLA - 9cb6ac83e74a3aeec7dada30c065bf3a.jpg instead of File:STS099-734-043 - STS-099 - Earth observation views of Phoenix, Arizona taken from OV-105 during STS-99 - DPLA - 9cb6ac83e74a3aeec7dada30c065bf3a.jpg. --Mirokado (talk) 07:52, 15 April 2023 (UTC)
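
One possible guard against this, sketched under assumptions: before editing, the bot could ask the MediaWiki API for the title with `action=query` and the `redirects` parameter, then follow the `redirects` entries in the response to find the real target. The function below only interprets an already-fetched response (it ignores title normalization for brevity); `resolve_redirect` is a hypothetical name, and nothing here describes how DPLA bot actually works.

```python
def resolve_redirect(api_response: dict, title: str) -> str:
    """Follow the 'redirects' list of an action=query response to the final title."""
    redirects = api_response.get("query", {}).get("redirects", [])
    mapping = {r["from"]: r["to"] for r in redirects}
    seen = set()
    # Follow the chain, guarding against redirect loops.
    while title in mapping and title not in seen:
        seen.add(title)
        title = mapping[title]
    return title
```

If the returned title differs from the one requested, the statements belong on the returned page rather than on the redirect.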