Commons talk:Structured data/Get involved/Feedback requests/GLAM metadata and ontologies mapping



Further comments and feedback after May 7, 2018

I'm sorry I joined this discussion late; I missed the opportunity to comment on specific contributions above. So I've opened a few sections below, and would be very glad if you added your own comments.

Cataloging Cultural Objects

An important omission in the list of standards mentioned at GLAM_metadata_and_ontologies_mapping is Cataloging Cultural Objects. CCO is the content standard that lays out best practices for describing artworks, and gives excellent examples. Compared to it, MARC, LIDO and CIDOC CRM are technical means, whereas CCO describes the desired end.

As stated elsewhere, GLAM holdings are very diverse, and CCO doesn't cover everything (eg Numismatics has its own best practices). But it's crucially important for museums and galleries. So I believe we should start the mapping exercises from CCO:

  • Going through the CCO Examples and representing them as Wikidata is one step
  • Checking that some Wikidata artworks (eg from SoAP) are adequately described according to CCO is another

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)

GLAM Ontologies

I've been asked by Canada's CHIN to recommend which ontologies they should use for their upcoming LOD-based national aggregation. I was at a bit of a loss, because I believe that right now there is no clear winner. I think these are viable choices:

  • Schema.org (with extensions similar to bib.schema.org), which has a very tolerant and pragmatic approach and is used by a huge number of websites world-wide
  • CIDOC CRM, which has a good foundation, and is used by various projects, particularly in Europe
  • http://linked.art, which is a CRM profile used by the American Art Collaborative
  • EDM, even though it's missing some important capabilities (eg the ability to describe specific contribution to an artwork)
  • and Wikidata, which has a fluent and collaborative way of creating properties and describing their application to specific domains

So yes, I believe that working on GLAM metadata mappings to Wikidata & Commons is a worthy goal!

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)

Image vs Artwork

@SandraF (WMF): asked at Discussion_for_metadata_that_is_specific_or_important_for_GLAMs "Is this list of (broader types of) GLAM-specific metadata complete? If not, which elements do we miss?" I think the main GLAM-related consideration is to distinguish:

  • data about images (what is stored in an institutional DAMS) from
  • data about artworks (what is stored in a Collection management system)

Eg "accession number" is a property of the artwork, so attaching it directly to the image is not quite correct. An artwork often has many images, and these images have different DAMS IDs.
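The distinction can be sketched as a small data model. This is only an illustration: the class names, field names and identifier values below are hypothetical and are not taken from any particular DAMS or collection management system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artwork:
    """Record as it might live in a collection management system."""
    accession_number: str  # property of the artwork itself
    title: str


@dataclass(frozen=True)
class ImageFile:
    """Record as it might live in a DAMS; it points at the artwork it shows."""
    dams_id: str       # each image has its own DAMS identifier
    artwork: Artwork   # the accession number is reached through this link


# Hypothetical sample data: one artwork, two images of it.
painting = Artwork(accession_number="ACC-0001", title="Example painting")
images = [
    ImageFile(dams_id="DAMS-1001", artwork=painting),
    ImageFile(dams_id="DAMS-1002", artwork=painting),
]

# All images resolve to the same accession number via the artwork,
# while their DAMS IDs stay distinct.
accession_numbers = {img.artwork.accession_number for img in images}
dams_ids = {img.dams_id for img in images}
```

Keeping the accession number on the artwork and linking each image to it avoids copying an artwork-level property onto every file.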

It's also important to distinguish what object an image represents, vs what images are shown on an artwork. This picture from CIDOC CRM may clarify the distinction: [1]. Here's an example:

  • The painting Mona Lisa (crm:E24_Physical_Man-Made_Thing) crm:P65_shows_visual_item the well-known image of that woman (crm:E38_Image).
  • The image is a conceptual object that, in addition to the painting, may be rendered on a photo, a T-shirt, a computer file, etc. It may also be modified in various ways, and still be recognizable as the same image.
  • The image crm:P138_represents Lisa del Giocondo (a person).
  • As a shortcut, we can say that the painting crm:P62_depicts Lisa del Giocondo

If we take a photo of the painting: the photo (another crm:E24_Physical_Man-Made_Thing) crm:P65_shows_visual_item the same image, but it also crm:P62_depicts the painting. You can also say as a transitive shortcut that it crm:P62_depicts Lisa del Giocondo.
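The Mona Lisa example can be written down as plain triples to make the shortcut explicit. This is a minimal sketch, not an RDF toolkit: the `ex:` identifiers and the `derive_depicts` helper are made up for illustration, and only the `crm:` names come from the discussion above.

```python
# Triples are (subject, predicate, object) tuples.
# The "ex:" identifiers are hypothetical names for the things discussed.
triples = {
    ("ex:painting", "rdf:type", "crm:E24_Physical_Man-Made_Thing"),
    ("ex:photo", "rdf:type", "crm:E24_Physical_Man-Made_Thing"),
    ("ex:image", "rdf:type", "crm:E38_Image"),
    ("ex:painting", "crm:P65_shows_visual_item", "ex:image"),
    ("ex:photo", "crm:P65_shows_visual_item", "ex:image"),
    ("ex:image", "crm:P138_represents", "ex:lisa_del_giocondo"),
    ("ex:photo", "crm:P62_depicts", "ex:painting"),
}


def derive_depicts(triples):
    """Derive the shortcut: X shows image I, I represents Y  =>  X depicts Y."""
    derived = set()
    for thing, predicate, image in triples:
        if predicate != "crm:P65_shows_visual_item":
            continue
        for subject, predicate2, represented in triples:
            if subject == image and predicate2 == "crm:P138_represents":
                derived.add((thing, "crm:P62_depicts", represented))
    return derived


shortcuts = derive_depicts(triples)
```

Under these assumptions both the painting and the photo end up depicting Lisa del Giocondo, while the photo's explicit "depicts the painting" statement remains a separate, asserted triple.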

It seems to me that elements relating an image to an artwork are missing from the original proposal, and they are the main thing to discuss on this page, "GLAM metadata and ontologies mapping".

--Vladimir Alexiev (talk)

Archives, Hierarchical Links

@Retrent: wrote at Commons_talk:Structured_data/Get_involved/Feedback_requests/Ontology#Discussion_for_metadata_that_is_specific_or_important_for_GLAMs "It would make sense that the (archival) collection as a whole would have a Wikidata entry".

This is not a given. Archives hold huge amounts of material, much of it never digitized (or even cataloged down to the item level). I don't think it's the job of Commons or Wikidata to replicate big archives like NARA or aggregations like APEX (Archives Portal Europe). However, the ability to express hierarchical and lateral relations between materials is very important, eg:

  • the EAD hierarchy (12 or even more levels, such as class, collection, file, fonds, item, recordgrp, series, subfonds, subgrp, subseries)
  • images of artworks in a series
  • prints/images of different stages of an engraving
  • photos of the same object taken at different times

Some of the higher-level objects would not have associated images, so they'll be represented only at WD or WB@Commons.
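A hierarchy of this kind could be sketched as records that each point to a parent. The record IDs, the `has_image` flag and the `ancestors` helper below are hypothetical; only the level names (fonds, series, file, item) come from the EAD list above.

```python
# Each record names its EAD level and its parent record (None for the top).
# "has_image" marks whether a media file exists for that record.
records = {
    "F1": {"level": "fonds", "parent": None, "has_image": False},
    "S1": {"level": "series", "parent": "F1", "has_image": False},
    "FILE1": {"level": "file", "parent": "S1", "has_image": False},
    "ITEM1": {"level": "item", "parent": "FILE1", "has_image": True},
}


def ancestors(record_id, records):
    """Return the chain of parent IDs from the nearest parent up to the top."""
    chain = []
    parent = records[record_id]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = records[parent]["parent"]
    return chain


path = ancestors("ITEM1", records)

# Records without an image would exist only as data items, not as media files.
data_only = sorted(rid for rid, rec in records.items() if not rec["has_image"])
```

In this sketch only the item has an associated image, so the three higher levels are data-only records.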

--Vladimir Alexiev (talk)

Europeana

  • "Are there any statistics regarding the level of interlinking"

Eg Count of Europeana entities by type shows 55M artworks ("CHO"), 132M images or other representations, 2.7M Places, 72k persons (Agents)

There are only 76.8k WD objects with Europeana ID. From the other side, I don't believe Europeana contributors track WD IDs. I believe the intersection is at least 0.5-1M, so these 76k coreferenced objects are a low number.
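To put these figures in proportion, here is a quick calculation using only the numbers quoted above. The 0.5-1M intersection is the rough estimate given in this comment, not a measured value.

```python
linked = 76_800                      # WD objects with a Europeana ID
estimates = (500_000, 1_000_000)     # estimated size of the true overlap
artworks = 55_000_000                # Europeana artworks ("CHO")

# Share of the estimated overlap that is actually coreferenced today.
coverage = {estimate: linked / estimate for estimate in estimates}

# Share of all Europeana artworks that have a link from Wikidata.
share_of_all = linked / artworks

for estimate, share in coverage.items():
    print(f"estimated overlap {estimate:>9,}: {share:.1%} linked")
print(f"share of all Europeana artworks: {share_of_all:.2%}")
```

On these assumptions, somewhere between roughly 8% and 15% of the estimated overlap is linked so far.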

As for "contextual entities" (places, agents), they are collected in the Europeana Entity Base, and the counts above include local (per-object) entities. For example, one of the Agent URLs (per-object) is http://hispana.mcu.es/lod/oai:galiciana-bibliotecadegalicia.xunta.es:10000156710#ent4 and it's declared sameAs a Galiciana authority http://www.galiciana.bibliotecadegalicia.xunta.es/aut/BDGA20140003808 (also VIAF, LOC, FAST). It's not easy to say how many unique persons & places are represented by the number above.

They're also not normalized, eg https://www.europeana.eu/portal/en/record/11629/_HERBAR_BGBM_GERMANY_B_10_0356587.html mentions http://www.geonames.org/maps/google_52.3166656494_13.1833333969.html, which should be resolved to http://sws.geonames.org/2832318/ (the Siethener Elsbruch forest).
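One possible normalization step is sketched below: pull the coordinates out of a map-style URL and look them up in a table of known places. The `KNOWN_PLACES` table and both function names are hypothetical, and the table holds only the single pair mentioned above. A real resolver would need a proper gazetteer lookup.

```python
import re

# Matches map URLs of the form .../maps/google_<lat>_<lon>.html
MAP_URL = re.compile(
    r"^https?://www\.geonames\.org/maps/google_"
    r"(-?\d+(?:\.\d+)?)_(-?\d+(?:\.\d+)?)\.html$"
)

# Hypothetical lookup table keyed by coordinates rounded to 4 decimals.
KNOWN_PLACES = {
    "52.3167,13.1833": "http://sws.geonames.org/2832318/",
}


def coordinates_from_map_url(url):
    """Return (lat, lon) as floats, or None if the URL is not a map URL."""
    match = MAP_URL.match(url)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def normalize_place_url(url):
    """Return the entity URI for a map URL, or None if it cannot be resolved."""
    coords = coordinates_from_map_url(url)
    if coords is None:
        return None
    key = f"{coords[0]:.4f},{coords[1]:.4f}"
    return KNOWN_PLACES.get(key)


resolved = normalize_place_url(
    "http://www.geonames.org/maps/google_52.3166656494_13.1833333969.html"
)
```

URLs that are not in the map form, or whose coordinates are not in the table, come back as `None` rather than being guessed.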

And there are places without any notability or importance, eg iid:1666153/SP.1 (12, R, Torenstraat, Overasselt, Gelderland, Nederland), which is used by only one object https://www.europeana.eu/portal/en/record/2020704/DR_20181109.html

The issues of notability and lack of normalization concern the provider data. What is in the "Entity Collection" mentioned above is better. As a matter of fact, in the example above (https://www.europeana.eu/portal/en/record/11629/_HERBAR_BGBM_GERMANY_B_10_0356587.html mentions http://www.geonames.org/maps/google_52.3166656494_13.1833333969.html) we do not even recognize a contextual entity, not even a provider entity. It's just a URL. And yes, we can't do quality control at that level of detail - providers are very inventive in what they send us. Isaacantoine (talk) 14:40, 5 November 2018 (UTC)

  • "EDM is internally not very consistent"

I'd put it this way: different institutions use it in very different ways, and Europeana hasn't done enough to normalize it

--Vladimir Alexiev (talk) 20:42, 8 May 2018 (UTC)

Well, it's a never-ending effort. That said, I'd welcome some examples, to see if it's heterogeneity by design (i.e. use of properties that can accommodate different things) or basic inconsistencies (i.e. wrong use of properties). Isaacantoine (talk) 14:36, 5 November 2018 (UTC)

Summary and next steps

Hi everyone! My warmest thanks to everyone who contributed to this discussion. With some delay, here's an attempt at (very briefly) summarizing what I (Sandra) read in your comments above, adding some of my own thoughts and reflections to the mix. Feel free to comment!

A happy coffee drinker, to thank everyone who contributed to this discussion. Coffee drinker (1888) by Ivana Kobilca, from the National Gallery of Slovenia.

Mapping GLAM metadata schemes and ontologies to structured data on Commons - is this a worthy undertaking?

  1. Several people seem interested in working on this and think that a common and centralized effort makes sense, although the scope for this work needs to be better defined (see below).
  2. It is probably good to start by taking a step back first: what impact do we want to achieve? More GLAM contributions to Commons? Better contributions? Less frustrating upload and contribution processes? ...
  3. Interestingly: during a recent Wikidata workshop day at the Europeana offices, I (Sandra) got very clear feedback from some GLAM participants that they don't think we should put an enormous amount of effort into mapping GLAM metadata schemes in great detail; it would (according to them) be much better if we worked towards crystal-clear, well-documented and findable instructions, and towards standardized ways in which GLAMs should model their own data towards Commons. Although it comes from a small group, I find this interesting input, which I'd like to verify more broadly.

Better focus and prioritization needed

  1. The original proposal was way too broadly and vaguely defined and seems to be very unclear to people not familiar with GLAM metadata and ontologies.
    1. It is important to distinguish between ontologies and vocabularies, as these are very different things.
    2. Looking at 'vocabularies', it's probably also good to distinguish between
      1. Thematic / topical data (subjects, concepts - example 'oak tree')
      2. Person names (photographers, artists, depicted people)
      3. Organization names (both organizations that contribute files to Commons, such as GLAMs, and other organizations that may be involved in the production of our media files, such as publishers, photo studios, etc.)
    3. We also must clearly distinguish between metadata used for the description of artworks (as GLAMs do in their collection management systems, and which in our case will probably mostly be used to describe artworks on Wikidata) and of media files (as GLAMs do in digital asset management systems, and which in our case will be used to describe Commons files).
  2. We need to prioritize our efforts: it is probably most worthwhile to work first on those metadata schemes and vocabularies that are very widely adopted in the GLAM sector.
  3. I read some consensus that working on this will not produce a magic bullet, and converting GLAM metadata to Commons will always be painful. (While this is true, I think it's a worthy undertaking to work towards a process that makes it as painless as possible.)

On Wikidata, we have already started working on some GLAM ontologies and vocabularies.

Some insights from that perspective:

  1. We are mapping many vocabularies on Wikidata, including thesauri. We might want to include more information on Wikidata about the hierarchical relations in those, and we might want to work on mapping the SKOS format to Wikidata.
  2. The Commons category system is also a hierarchical structure with a wealth of data that we don't want to get lost.
    1. I (Sandra) recommend that everyone read the findings about categories in the context of GLAM uploads, part of the GLAM research earlier this year, where participants report having difficulties finding the right categories; from my own experience since 2012 with GLAM uploads - both performing them and re-using files from them - I also notice that these tend to be under-categorized, often with a sub-optimal selection of categories.
  3. Inside GLAM vocabularies and inside our own projects, there are still major knowledge gaps!

The longer term

We need to think about the longer term: maintainability and constant updates to mappings.

Technical integration

We need to think carefully about how such mappings (if we work on them) are integrated into technical infrastructure. It's probably not a good idea to statically 'bake' them into APIs - perhaps code libraries make more sense, and we might want to encourage specific tool development in this direction? This also needs further investigation.

Follow up in June 2018 and beyond:

  1. I (Sandra) have the feeling we need more input from GLAMs themselves, and I'm now thinking how to do this: whether this can be done in an informal survey or another type of consultation, and which questions need to be asked. Please let me know if you have ideas or suggestions here.
  2. Categories. The core team working on Structured Data on Commons needs all currently allocated time and funds to give its full focus to the basic functionalities of structured data itself; extra work in technical support for transitioning categories is out of scope within the current timeline and budgets. I myself also can't give extra attention to category conversion from a practical perspective. Conversion of category data to structured data is, like data modelling and conversion itself, a task for the communities.
  3. It would be helpful to make it easy for more people to contribute to the process of mapping GLAM metadata to Wikidata and Commons. More help is certainly welcome, and needed. I (Sandra) can try to support this by creating a set of GLAM info pages, part of the Structured Commons information site, including a better structured set of 'landing pages' on GLAM vocabularies and ontologies. These will point towards existing work on Wikidata's WikiProjects, be extensible by anyone, offer a first attempt at prioritization, and point to documentation (if it exists). Help is welcome here!
  4. Several members of the Structured Commons team will be present at the Wikimania hackathon, which is a good opportunity to talk to volunteer developers about ideas for technical integration of GLAM metadata mapping. It is probably quite relevant to tools that (will) support GLAM uploads to Commons and Wikidata, for instance Pattypan and OpenRefine 3.0.

Many greetings! SandraF (WMF) (talk) 15:17, 6 June 2018 (UTC)