Commons:Structured data/WMSE white paper on Structured Data on Commons/Reflections and conclusions

Reflections and conclusions

In a nutshell

  • When adding SDC to files that were mass uploaded previously, such as files from a GLAM collection, you might have to decide between using the original metadata and scraping the file pages on Commons to reflect users' edits.
  • The Wikimedia Commons Query Service is an indispensable tool to both explore the data and provide feedback about your SDC work.
  • There are several editing tools that support SDC, but none that provide an end-to-end workflow for uploading and editing files. OpenRefine is expected to bridge this gap.

Which data to work with?

A practical question arose when working with GLAM collections that WMSE had previously uploaded to Wikimedia Commons: which data should be used when creating SDC statements – the data harvested from the file information templates, or the original metadata we had received when preparing the uploads? In the case of both Musikverket and UNESCO, we still had access to the original spreadsheets provided by the GLAMs, as may be the case for other Wikimedia chapters wishing to enrich their previous GLAM uploads with SDC. Both approaches – working with the original metadata and scraping it off the Commons pages – have their advantages and disadvantages, which are outlined below.

The original metadata is usually better structured than the raw content of Commons file descriptions, making it easier to work with, especially if provided in a common format such as CSV or JSON. This greatly reduces the time it takes to prepare an SDC upload, as one does not have to parse the content of the file description templates to break out the necessary fields. The metadata file can be loaded into OpenRefine to reconcile the relevant fields, such as creator, against Wikidata and output the Q IDs we need, and the results can be exported directly to QuickStatements or to a custom tool that can upload SDC statements.
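
As an illustration, the sketch below shows the final step of such a workflow: turning a reconciled CSV export into QuickStatements V1 rows that add creator (P170) statements to the files' MediaInfo entities. The file and column names are hypothetical, and the MediaInfo IDs are assumed to have been looked up already.

```python
import csv

# Minimal sketch: convert an OpenRefine CSV export into QuickStatements V1
# commands for SDC. Column names are hypothetical; the export is assumed to
# already contain each file's MediaInfo ID and the reconciled creator Q ID.
INPUT = "musikverket_reconciled.csv"   # hypothetical export from OpenRefine
OUTPUT = "sdc_statements.tsv"

with open(INPUT, newline="", encoding="utf-8") as src, \
        open(OUTPUT, "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        mid = row.get("mediainfo_id", "")     # e.g. "M12345"
        creator = row.get("creator_qid", "")  # e.g. "Q123456"
        if mid and creator:
            # One tab-separated statement per line: subject, property, value
            dst.write(f"{mid}\tP170\t{creator}\n")
```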

On the other hand, this simplicity comes at a price. By using the original data, we ignore the changes and improvements that the Wikimedia Commons community may have made to the file descriptions. This is especially relevant if a long time has passed since the files were uploaded. When working with both the UNESCO and the Musikverket collections, we found that users had made significant progress in replacing some of the text strings, such as creator or performer, with Wikidata IDs (using the {{Q}} template), which would have been a shame to ignore when creating SDC statements.

There are different ways of scraping the content of file pages: we have tried both writing a custom script to harvest a large number of pages and using OpenRefine to download the contents of a smaller set of pages and parse them within the application. The first method was fast and suitable for large file sets (Wiki Loves Monuments in Sweden has produced over 20,000 photographs), whereas the OpenRefine method, though slower and thus more suitable for smaller file sets, gave us more nuanced reconciliation capabilities and allowed us to notice and work with files with unexpected structures or data values.
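
To illustrate the first approach, here is a minimal harvesting sketch using the standard MediaWiki API. The regular expression for the {{Q}} template is a deliberate simplification; real file descriptions vary far more than this, which is exactly why the slower OpenRefine route can be worthwhile.

```python
import re
import requests

API = "https://commons.wikimedia.org/w/api.php"

def harvest(file_titles):
    """Fetch the current wikitext of Commons file pages (up to 50 per request)
    and pull out any Wikidata IDs added via the {{Q}} template.
    Returns {title: (mediainfo_id, [qids])}. Sketch only."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(file_titles),
        "format": "json",
        "formatversion": "2",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    results = {}
    for page in pages:
        if "revisions" not in page:   # skip missing pages
            continue
        wikitext = page["revisions"][0]["slots"]["main"]["content"]
        qids = re.findall(r"\{\{\s*[Qq]\s*\|\s*(Q?\d+)", wikitext)
        # A file's SDC entity ID is "M" followed by its page ID
        results[page["title"]] = (f"M{page['pageid']}", qids)
    return results
```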

In the end, which workflow to choose depends on the nature of the file set – its size, how much variation there is in the structure of the file descriptions, and how much community engagement there has been to improve the files. If one is uploading files and adding SDC to them simultaneously, it obviously makes sense to work with the original metadata. Once OpenRefine is extended with SDC support, this will hopefully become much easier, and we will be able to work with one tool throughout the whole file upload and SDC enrichment process.

The SPARQL endpoint: an important tool for editors and users

One particular challenge that we encountered during our work was the limited capabilities of the Wikimedia Commons Query Service (WCQS). The query service allows users to run SPARQL queries against the SDC endpoint, and since it works the same way as the Wikidata Query Service – which has been operational for several years – it has a relatively low entry threshold for users who are already familiar with the latter. By taking advantage of Wikidata federation, Wikimedia Commons users can collate information about files that have structured data in a much more flexible way than the category system allows. For example, one can retrieve all the images depicting church buildings in Sweden, regardless of how well they are categorized. Another typical use case, very familiar to Wikidata editors, is to retrieve lists of files where certain statements are missing. When uploading structured data automatically, there is always some degree of uncertainty – depending on how well the source data is structured and how confidently it can be matched to Wikidata items, not all data can be converted to SDC statements. By using the query service to identify files that are missing certain information, such as creator or depicted objects, users can get a list of files to review and improve manually, either individually or as part of an edit-a-thon.
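
To make the church-buildings example concrete, here is a minimal sketch of such a federated query, wrapped in a small Python script. The endpoint URL, authentication cookie and result handling are assumptions based on the beta setup; the query itself can just as well be pasted into the WCQS web interface.

```python
import requests

# Federated WCQS query: files whose depicts (P180) statement points to an
# item that is a church building (Q16970, or a subclass) located in Sweden.
QUERY = """
SELECT ?file WHERE {
  ?file wdt:P180 ?item .                         # depicts
  SERVICE <https://query.wikidata.org/sparql> {  # federate with Wikidata
    ?item wdt:P31/wdt:P279* wd:Q16970 ;          # church building or subclass
          wdt:P17 wd:Q34 .                       # country: Sweden
  }
}
LIMIT 100
"""

# Assumed beta endpoint; the service requires logging in with a Wikimedia
# account, so the session cookie below is only a placeholder.
WCQS_ENDPOINT = "https://wcqs-beta.wmflabs.org/sparql"

response = requests.get(
    WCQS_ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    cookies={"wcqsOauth": "REPLACE_ME"},  # placeholder auth cookie (assumption)
)
for binding in response.json()["results"]["bindings"]:
    print(binding["file"]["value"])
```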

At this time, the SPARQL endpoint for Structured Data on Commons is in beta, hosted on Wikimedia Cloud Services, which heavily limits what can be achieved with it. The biggest issue, which we have heard mentioned every time we talk to users, is that it does not provide live results. The data is reloaded weekly from Wikimedia Commons dumps. Furthermore, in our experience, it takes additional time after the weekly dump has been published for the updated results to show up.

This makes the query service quite impractical to use, especially with non-Wikimedians, such as when demonstrating it to our GLAM partners. This is especially painful in the case of the maintenance queries mentioned above. If you generate a list of files needing attention, for example because they appear to be missing a depicts statement, it is possible that the information was in fact added days ago and simply is not reflected in the week-old data. And if you add the statement yourself, there is no immediate feedback: your addition does not show up when you run the query again. In the case of Wikidata, updates are near-instantaneous, taking at most a couple of minutes. Because of that, the WCQS is not (yet!) a suitable tool to use during edit-a-thons or other projects encouraging people to edit SDC, as the inability to visualize the editors' progress takes away much of the "fun" of editing and makes it harder to convey the usefulness of SDC.

Another aspect of the WCQS that we had to tackle during our work was its sparse documentation; what did exist was primarily aimed at Wikimedians and in practice assumed that users were already familiar with Wikidata's query service. That's why we included improving the WCQS documentation in our project, making the text more approachable and moving the documentation page to the Commons:Structured data page. Before that, the page was hard to find, even for seasoned Wikimedia Commons users!

Many tools, no obvious workflow

Ever since SDC was implemented, the community has been developing tools to help with adding and editing statements. There is, for example, the ACDC gadget, created by Lucas Werkmeister, with which a collection of statements can be added to a set of files, akin to how Cat-a-lot works. Magnus Manske's SDC Tool works in a similar way. QuickStatements, a tool for mass editing Wikidata, can also be used to edit SDC statements, but it involves the additional step of converting the file page titles to Media IDs, which can be done using Minefield.
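
The title-to-Media-ID conversion is simpler than it sounds: a file's MediaInfo ID is just the letter M followed by its page ID, which the MediaWiki API returns in bulk. The snippet below is a rough sketch of such a lookup; the function name is ours, not Minefield's.

```python
import requests

def to_mediainfo_ids(titles):
    """Map Commons file page titles to MediaInfo (M) IDs, up to 50 per call."""
    response = requests.get("https://commons.wikimedia.org/w/api.php", params={
        "action": "query",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    })
    return {
        page["title"]: f"M{page['pageid']}"
        for page in response.json()["query"]["pages"]
        if "pageid" in page  # skip titles that do not exist
    }

# to_mediainfo_ids(["File:Example.jpg"]) -> {"File:Example.jpg": "M..."}
```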

Pattypan is a popular tool for mass uploading files to Wikimedia Commons. Due to its relatively simple workflow, it has been adopted by both members of the Wikimedia Community and GLAM staff. Pattypan does not, however, support SDC, despite many community members having expressed a need for such functionality.

OpenRefine is another tool that is very popular among Wikimedians: it is a data wrangling and cleaning tool with support for both Wikidata reconciliation and editing. It is both powerful and flexible, enabling its users to work with large datasets and align them with Wikidata. Users of OpenRefine have also expressed a keen interest in expanding it with Wikimedia Commons support, and this has not gone unheard: in spring 2021 the OpenRefine developers applied for, and were subsequently awarded, a project grant for this purpose. The plan is that by mid-2022, the software will:

  1. allow Wikimedians to edit and upload large and diverse batches of files with structured data on/to Wikimedia Commons, and
  2. be accompanied by good training materials and documentation to enable new Wikimedians to master these functionalities.

This is great news for both the existing Wikimedia Community and our current and future GLAM partners. The latter in particular will benefit from having access to a single tool that unifies the processes of cleaning up and preparing the metadata of their files, uploading it to Wikidata and uploading the files with the accompanying SDC to Wikimedia Commons. As of now, this workflow requires several tools and can involve cleaning up the data and uploading it to Wikidata with OpenRefine, uploading the files to Commons with Pattypan and finally uploading the SDC statements with a tool like QuickStatements – or even coding your own script to do this part of the work. This development of OpenRefine has the potential to revolutionize the way we work with SDC and bring it much closer to newcomers while making it more understandable.
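
As a sketch of what the "own script" step can look like, the snippet below adds a single depicts (P180) statement through the Wikibase API with Pywikibot. The M-id and Q-number are placeholders, and a configured Pywikibot login with edit rights on Commons is assumed; this is one possible approach rather than the exact script we used.

```python
import json
import pywikibot
from pywikibot.data import api

# Sketch: add a depicts (P180) statement to one file's MediaInfo entity.
site = pywikibot.Site("commons", "commons")
site.login()

mediainfo_id = "M12345"   # placeholder MediaInfo ID
depicted_item = 16970     # placeholder: numeric part of the depicted Q ID

request = api.Request(site=site, parameters={
    "action": "wbcreateclaim",
    "entity": mediainfo_id,
    "property": "P180",
    "snaktype": "value",
    "value": json.dumps({"entity-type": "item", "numeric-id": depicted_item}),
    "token": site.tokens["csrf"],
})
request.submit()
```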