Commons:OpenRefine/Adding structured data with OpenRefine


Step by step instructions on how to (batch) add structured data to (existing) Wikimedia Commons files with OpenRefine.

Software installation

Download and install OpenRefine (version 3.6 or later!)

⚠️ For editing Wikimedia Commons, you need OpenRefine 3.6 or newer. Wikimedia Commons is not supported in OpenRefine 3.5 or earlier versions.

Download and install OpenRefine on your computer.

You can download OpenRefine for Windows, macOS and Linux from https://openrefine.org/download.html.

There are detailed download instructions and installation instructions in OpenRefine's user manual.

Download and install the Wikimedia Commons extension for OpenRefine

We also highly recommend installing OpenRefine's Wikimedia Commons extension, which is very helpful for batch editing on Wikimedia Commons. The extension offers:

  • A start screen to load file names directly from Wikimedia Commons categories.
  • Thumbnails of Wikimedia Commons files (not all file formats supported yet).
  • Several dedicated GREL expressions to retrieve data from wikitext for further processing.

The extension can be downloaded from GitHub, where you can also follow installation instructions.

The explanation on this page assumes that you have installed this extension.

Alternative: run OpenRefine online, in the cloud (via Wikimedia PAWS)

If you are unable to install OpenRefine on your computer, or if it runs very slowly, then you can also use it in the cloud (on wmcloud.org through PAWS). Everyone with a Wikimedia account can access OpenRefine here. Visit https://hub-paws.wmcloud.org/, log in, and click on the OpenRefine (blue diamond) logo.

The Wikimedia Commons extension (mentioned above) is installed in OpenRefine on PAWS.

Please note: with OpenRefine on PAWS it is NOT possible to upload files to Wikimedia Commons from your local computer. But it is possible to edit existing files.

Start an OpenRefine project based on one or more Commons categories

These instructions assume that you are using OpenRefine's Wikimedia Commons extension. See its documentation for installation instructions. If you are not using this extension and want to start an OpenRefine project via another method, check the advanced tips and tricks page.

  • Select the Wikimedia Commons option in OpenRefine's startup screen.
  • Now you can type the name of one or more Wikimedia Commons categories. You can also specify the depth to which OpenRefine will traverse the Commons category tree.
  • Click Next.
  • The project preview will load. You will see a list of file names that are loaded from the category or categories you specified.
  • At the bottom of the preview window, you can indicate whether you also want to load a column with the Commons categories of each file, and/or a column with the M-ids of the files. Commons categories can be very informative and are useful for extracting data that can later be added as structured data. If you decide not to retrieve the files' categories now, you can still do so later.
  • Give your project a meaningful name and click Create project. The project will now load, showing thumbnails of the files. The file names are blue and clickable, which means they are already reconciled with Wikimedia Commons.
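For readers who want to see roughly what loading file names from a category involves, here is a minimal Python sketch. It is not the extension's own code: it only builds parameters for the MediaWiki API's `list=categorymembers` query module and turns a response into rows. The function names and the sample response are made up for illustration, and the sketch handles a single category without traversing subcategories. It relies on the convention that a file's M-id is the letter M followed by its page ID.

```python
from typing import Dict, List


def category_members_params(category: str, limit: int = 50) -> Dict[str, str]:
    """Build query parameters for the MediaWiki API's list=categorymembers module."""
    title = category if category.startswith("Category:") else f"Category:{category}"
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": title,
        "cmtype": "file",  # only files, no subcategories or pages
        "cmlimit": str(limit),
        "format": "json",
    }


def to_mid(page_id: int) -> str:
    """An M-id is the letter M followed by the file's page ID."""
    return f"M{page_id}"


def rows_from_response(response: dict) -> List[dict]:
    """Turn a categorymembers response into one row per file (hypothetical row layout)."""
    members = response.get("query", {}).get("categorymembers", [])
    return [{"file": m["title"], "mid": to_mid(m["pageid"])} for m in members]


# Illustrative response fragment; the page ID and title are invented.
sample = {"query": {"categorymembers": [
    {"pageid": 12345, "ns": 6, "title": "File:Example.jpg"},
]}}
print(rows_from_response(sample))
```

Sending the parameters to the Commons API endpoint and following continuation tokens for large categories is left out on purpose.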

Extract Wikitext and structured data

This step is optional, but may be very useful. Existing files on Wikimedia Commons are always described with wikitext, which usually contains information about the file's creator, license, and one or more Wikimedia Commons categories. It will often make sense to parse this Wikitext in OpenRefine, retrieving valuable bits of data from it which can be converted to structured data in a next step. Good examples of such data may include:

  • The file's description, which you can convert to a file caption
  • The file's creator
  • The file's source
  • Things depicted in the file, and other valuable information, may be mentioned in the file's categories

In order to create one or more new columns with Wikitext (and structured data statements) from your column of reconciled file names, select Edit column > Add columns from reconciled values... in the file column's menu. A dialog window opens in which you can select one or more of the following options:

  • Wikitext: will create a column with the (full) Wikitext of each file
  • Various structured data statements; the dialog window suggests several common ones, but you can use the search functionality to search for any property that you are interested in
  • You can retrieve file captions by typing the capital letter C, followed by the two-letter language code (e.g. Cen for English file captions, Cja for Japanese file captions).

📘 See Add columns from reconciled values in OpenRefine's user manual for general information about this feature.
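The caption naming rule described above (capital C followed by the language code) can be sketched in a few lines of Python. This is only an illustration of the naming convention; the helper name is hypothetical and is not part of OpenRefine.

```python
from typing import List


def caption_column_ids(language_codes: List[str]) -> List[str]:
    """Build caption identifiers such as 'Cen' from language codes such as 'en'."""
    ids = []
    for code in language_codes:
        cleaned = code.strip().lower()
        if not cleaned:
            raise ValueError("empty language code")
        ids.append("C" + cleaned)
    return ids


print(caption_column_ids(["en", "ja"]))
```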

Parse wikitext

The column with wikitext may contain valuable bits of data that you want to extract (or parse) into new OpenRefine columns.

📘 The instructions below show how to parse wikitext using functionalities of OpenRefine's Wikimedia Commons extension. Also see this extension's documentation.

There are many more ways in which you can parse this data. The Advanced tips and tricks page contains other recipes - make sure to add your own if you have created new useful methods!

The Wikimedia Commons Extension unlocks two dedicated GREL commands, which help to extract specific information from the wikitext of Wikimedia Commons files.

GREL, General Refine Expression Language, is a dedicated scripting language used in OpenRefine for many flexible data operations. OpenRefine's manual contains a general reference on using GREL. You will also discover many examples and recipes by searching on the web.

From the column with wikitext you retrieved in the previous step, you can now extract values and categories as described below. Start by selecting Edit column > Add column based on this column... in the column menu. In the next dialog window, you can use various specific GREL commands.

Extract values from template parameters: extractFromTemplate

Use the following syntax:

extractFromTemplate(value, "BHL", "source")[0]

where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.
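To make the behaviour of this expression more concrete, the following Python sketch approximates it outside OpenRefine. It is an assumption-laden illustration, not the extension's implementation: all function names are hypothetical, template names are compared case-insensitively, and only named parameters (`name = value`) are handled.

```python
from typing import List


def find_template_bodies(wikitext: str, name: str) -> List[str]:
    """Return the inner text of every {{name ...}} call, including nested ones."""
    bodies = []
    search_from = 0
    while True:
        start = wikitext.find("{{", search_from)
        if start == -1:
            break
        depth, pos, end = 0, start, -1
        while pos < len(wikitext):
            if wikitext.startswith("{{", pos):
                depth += 1
                pos += 2
            elif wikitext.startswith("}}", pos):
                depth -= 1
                pos += 2
                if depth == 0:
                    end = pos
                    break
            else:
                pos += 1
        if end != -1:  # skip templates whose braces never close
            inner = wikitext[start + 2 : end - 2]
            if inner.split("|", 1)[0].strip().lower() == name.lower():
                bodies.append(inner)
        search_from = start + 2
    return bodies


def split_top_level(inner: str) -> List[str]:
    """Split on '|' characters that are not inside nested {{...}} or [[...]]."""
    parts, current, depth, k = [], [], 0, 0
    while k < len(inner):
        two = inner[k : k + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            k += 2
        elif two in ("}}", "]]"):
            depth -= 1
            current.append(two)
            k += 2
        elif inner[k] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            k += 1
        else:
            current.append(inner[k])
            k += 1
    parts.append("".join(current))
    return parts


def extract_from_template(wikitext: str, template: str, parameter: str) -> List[str]:
    """Collect the values of one named parameter from every matching template."""
    values = []
    for inner in find_template_bodies(wikitext, template):
        for part in split_top_level(inner)[1:]:
            key, sep, value = part.partition("=")
            if sep and key.strip() == parameter:
                values.append(value.strip())
    return values


wikitext = "{{BHL\n| source = https://www.flickr.com/photos/biodivlibrary/10329116385\n}}"
print(extract_from_template(wikitext, "BHL", "source")[0])
```

As with the GREL expression, the function returns a list, which is why the first element is selected with `[0]`.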

Extract Wikimedia Commons categories: value.extractCategories

Use the following syntax:

value.extractCategories().join('#')

This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.
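The join-then-split pattern can be illustrated with a small Python sketch. It is a rough approximation and not the extension's code: the regular expression and function name are assumptions, and the sketch drops the `Category:` prefix and any sort key, which may differ from what the GREL function returns. The # character works well as a separator because page titles on MediaWiki cannot contain it.

```python
import re
from typing import List

# Matches [[Category:Name]] and [[Category:Name|sort key]], capturing only the name.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext: str) -> List[str]:
    """Return the names of all categories linked in the wikitext."""
    return [match.group(1) for match in CATEGORY_LINK.finditer(wikitext)]


wikitext = (
    "{{Information}}\n"
    "[[Category:Botanical illustrations]]\n"
    "[[Category:Biodiversity Heritage Library|sort]]"
)
joined = "#".join(extract_categories(wikitext))  # one cell value
print(joined)
print(joined.split("#"))  # split again into separate values
```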

Reconcile other columns with Wikidata

Structured data on Commons describes files on Commons by using (multilingual) items and properties from Wikidata.

Suppose you have one or more columns in your project with (new) data that you want to add to the files as structured data. The data in the cells of these columns may correspond to Wikidata items. You need to reconcile these columns so that OpenRefine can link each value to the matching Wikidata item. Typical candidates are values extracted in the previous steps, such as a file's creator or the things depicted in it.

You will reconcile these columns against the Wikidata reconciliation service, in English or another language that may be relevant (English usually works fine). The English Wikidata reconciliation service is installed by default in OpenRefine.

Reconciled columns have a header that is underlined with a dark green stripe; values in the column are blue hyperlinks which point to Wikidata items.

You can find more instructions on how to reconcile data in OpenRefine's user manual and on Wikidata.
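Behind the scenes, reconciliation services receive batches of queries as JSON. The Python sketch below shows what such a batch might look like for a column of names, following the general shape used by the Reconciliation Service API (a query text, an optional type and a result limit). The helper name and the sample value are illustrative, and the exact payload OpenRefine sends may differ between versions.

```python
import json
from typing import Dict, List, Optional


def build_reconciliation_queries(
    values: List[str], type_id: Optional[str] = None, limit: int = 3
) -> Dict[str, dict]:
    """Build one query per cell value, keyed q0, q1, ... (hypothetical helper)."""
    queries = {}
    for index, value in enumerate(values):
        query = {"query": value, "limit": limit}
        if type_id:
            query["type"] = type_id  # e.g. Q5 (human) to restrict matches to people
        queries[f"q{index}"] = query
    return queries


batch = build_reconciliation_queries(["Maria Sibylla Merian"], type_id="Q5")
print(json.dumps(batch, indent=2))
```

Restricting the type, as done here with Q5, is one way to reduce false matches when a column only contains people.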

Create your editing schema

Finally, you will build a schema in OpenRefine, to model the Wikimedia Commons edits that OpenRefine will perform for each row in your project.

Click on the Schema tab in the blue bar above your dataset, or go to the Wikidata/Wikibase extension menu and select Edit Wikibase schema. You will get an empty schema window at first. Verify that the info text on top mentions Wikimedia Commons; if it mentions Wikidata, then you need to switch your Wikibase instance to Wikimedia Commons via the Select Wikibase instance... menu item in the Wikibase extension menu.

Click on the blue + add media link. Several fields will appear.

You can now type, and/or drag and drop all the info you want to include in the files' metadata.

  • In the main field (which says type entity or drag reconciled column here), drag your reconciled column of file names (see the instructions above). Note that this column must have a green line under its header (as a result of the reconciliation).
  • Captions: if you have created columns with file captions, then you can drag them here. Make sure to add the corresponding language.
  • Statements: click + add statement to add structured data statements, one by one. You can type values that are the same for all your files, or drag (reconciled) columns.

See Schema alignment in OpenRefine's user manual for general information about schemas.
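It can help to picture what a schema produces for a single row. The Python sketch below assembles a caption and one item-valued statement in the general shape of the Wikibase JSON data model. It is only an illustration: the `entity`/`data` wrapper and the function names are assumptions of this sketch, the M-id and Q-id are placeholders, and OpenRefine's internal representation is not shown here.

```python
from typing import Dict, List, Tuple


def caption(language: str, text: str) -> dict:
    """A file caption in one language (stored as a label on the media entity)."""
    return {"language": language, "value": text}


def item_statement(property_id: str, item_id: str) -> dict:
    """A statement whose value is a Wikidata item, e.g. creator (P170)."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": item_id},
            },
        },
    }


def build_edit(mid: str, captions: Dict[str, str],
               statements: List[Tuple[str, str]]) -> dict:
    """Combine captions and statements for one file (hypothetical wrapper layout)."""
    return {
        "entity": mid,
        "data": {
            "labels": {lang: caption(lang, text) for lang, text in captions.items()},
            "claims": [item_statement(pid, qid) for pid, qid in statements],
        },
    }


# Placeholder identifiers, for illustration only.
edit = build_edit("M12345", {"en": "A botanical illustration"}, [("P170", "Q123")])
print(edit["data"]["claims"][0]["mainsnak"]["property"])
```

Each row in the project yields one such edit: dragged columns supply the per-row values, while values typed directly into the schema are the same for every file.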

Make sure to follow Wikimedia Commons data modeling conventions

Don't invent your own method to describe files, but make sure to follow Wikimedia Commons best practices. In case of doubt, ask the Wikimedia Commons community for feedback on the general Structured Data talk page.

Data models for structured data about media files on Commons are explained and discussed at Commons:Structured_data/Modeling.

Basic structured data statements for all Wikimedia Commons files are:

Structured data to add | Brief instructions | In-depth instructions and info about the data model
File caption(s) (multilingual) | A (short) textual description of the file, in at least one language. Plain text; no wiki markup or hyperlinks. | Data modeling guidelines: File captions
Date | Usually the date when the file was created, using an inception (P571) statement. | Data modeling guidelines: Date
Source of the file | Information about where the file was taken from. Is it the uploader's own work, or was it uploaded from an external website? Typically uses a source of file (P7482) statement. | Data modeling guidelines: Source of the file
Creator | Who created the file? Typically described with a creator (P170) statement. | Data modeling guidelines: Creator of the file
Copyright status and license | Is the file still under copyright, or is it in the public domain? If it is still under copyright, which license(s) apply? Uses copyright status (P6216) and copyright license (P275). | Data modeling guidelines: Copyright and licenses
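Before uploading, it can be handy to check a planned schema against the basic statements listed above. The following Python sketch does this for a set of property IDs. It is a hypothetical helper, not an OpenRefine feature, and its output is only a reminder: a public domain file, for example, may legitimately have no copyright license (P275) statement.

```python
from typing import List, Set

# Property IDs and labels as listed in the table above.
BASIC_PROPERTIES = {
    "P571": "inception",
    "P7482": "source of file",
    "P170": "creator",
    "P6216": "copyright status",
    "P275": "copyright license",
}


def missing_basics(planned_properties: Set[str], has_caption: bool) -> List[str]:
    """List the basic elements that the planned schema does not cover yet."""
    missing = [
        f"{label} ({pid})"
        for pid, label in BASIC_PROPERTIES.items()
        if pid not in planned_properties
    ]
    if not has_caption:
        missing.insert(0, "file caption")
    return missing


print(missing_basics({"P170", "P571"}, has_caption=True))
```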

Preview and upload your edits to Wikimedia Commons

You can preview your edits by clicking the Preview tab on top of your schema. The Issues tab will inform you about errors that may be present in your data or schema, so that you can fix them.

Tip: Start with editing just one or a few files first, to test your edits and to check whether they look good. Fixing mistakes after a large batch is possible, but testing at an early stage is better.

When you are ready to upload your edits, select Upload edits to Wikibase... in the Wikidata/Wikibase extension menu, and log in with your Wikimedia Commons credentials. OpenRefine will encourage you to use a bot password, but you can ignore this warning if you like. Provide a descriptive edit summary. There is no need to change the maxlag value. Click Upload edits and your batch edit will start.

You will see your recently edited files in your own edit history on Wikimedia Commons.

See documentation about uploading in OpenRefine's user manual for general information about this feature.

Correcting mistakes with the EditGroups tool

When checking your user contributions, you will see your recent Wikimedia Commons edits done with OpenRefine. Each OpenRefine edit displays a (details) hyperlink after the edit summary, which links to the edit batch in the EditGroups tool.

In EditGroups, entire batches can be easily undone, in case some mistakes have been made.

All Wikimedia Commons batches with OpenRefine are listed at https://editgroups-commons.toolforge.org/?tool=OR.