Commons:OpenRefine/Advanced tips and tricks



Advanced tasks - general

Wikimedia Commons functionalities not present? Adding the Wikimedia Commons manifest to OpenRefine

If you don't see Wikimedia Commons as an option for reconciliation or in the schema, you still need to add the Wikimedia Commons manifest to OpenRefine.

This manifest is a kind of 'settings' file that provides OpenRefine with all the information it needs to be able to edit Wikimedia Commons. Do this as follows:

  • In the Wikidata extension menu at the top right of your OpenRefine project, choose Select Wikibase instance..., then click Add Wikibase. You will be prompted either to paste a manifest URL (recommended) or to paste the JSON directly. Wikimedia Commons' manifest URL is: https://raw.githubusercontent.com/OpenRefine/wikibase-manifests/master/wikimedia-commons-manifest.json
  • After adding this URL, you should now see Wikimedia Commons in your list of Wikibase instances. Click Wikimedia Commons to activate it. You can now close this dialog window by clicking the Close button.
  • Adding the Wikimedia Commons manifest in OpenRefine will also automatically add the Wikimedia Commons reconciliation service.

Adding the Wikimedia Commons reconciliation service to OpenRefine

If you don't see Wikimedia Commons as an option for reconciliation, you still need to add the Wikimedia Commons reconciliation service to OpenRefine. In a column's menu, select Reconcile → Start reconciling... In the resulting reconciliation dialog window, click the Add standard service... button and paste https://commonsreconcile.toolforge.org/en/api there. If you prefer working with properties and labels in a different language, replace the en string in that URL with the two-letter language code of your choice.

More info and documentation about the Commons reconciliation service is available at https://commonsreconcile.toolforge.org/.
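
The language substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name commons_reconcile_url is hypothetical, and the check merely verifies that the code looks like two ASCII letters, not that the service supports that language.

```python
def commons_reconcile_url(language: str = "en") -> str:
    """Build the Commons reconciliation endpoint for a two-letter language code."""
    code = language.strip().lower()
    # Only a shape check: two ASCII letters. Whether the service actually
    # supports the language is not verified here.
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected a two-letter language code, got {language!r}")
    return f"https://commonsreconcile.toolforge.org/{code}/api"


# URL to paste into the "Add standard service..." dialog for Dutch labels
print(commons_reconcile_url("nl"))
```

The printed URL is what you would paste into the Add standard service... dialog.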

Manually reconciling file names with Wikimedia Commons

If you start OpenRefine projects via OpenRefine's Wikimedia Commons extension, then file names will already be reconciled. They will be blue and clickable, and the file name column will be highlighted with a dark green line.

If you start an OpenRefine project in another way, using a list of Wikimedia Commons files, you will still need to actively use the Wikimedia Commons Reconciliation Service as a starting point to begin batch editing these files. This step makes sure that OpenRefine recognizes these files, links them to their M-ids on Wikimedia Commons, and ensures that OpenRefine can edit them later.

You start the reconciliation process by selecting Reconcile → Start reconciling... in the file column's menu. Then select the Wikimedia Commons reconciliation service and click the Start reconciling... button. (See above on how to add the service if you don't see the Wikimedia Commons option yet.)

Favorite schemas in OpenRefine

Since OpenRefine version 3.7 it is possible to use, save, share and re-use favorite schemas in OpenRefine.

A video demo of favorite schemas accompanies this section on the original page.

Working with somevalue / novalue (or unknown value / no value) for Wikibase in OpenRefine

Watch this video demo to discover a way to work with somevalue/novalue Wikibase statements in OpenRefine (this functionality was only partially developed as of the end of 2023).

Advanced tasks - editing files

Obtain file names with the PetScan tool

If you want to get a list of file names from Wikimedia Commons by some means other than the "categories" approach in OpenRefine's Wikimedia Commons extension, you can retrieve a selection of file names with the PetScan tool.

PetScan gives you many different options to retrieve lists of file names based on various criteria, e.g. usage of specific templates, or using search.

A step-by-step guide on retrieving a list of Commons file names using the PetScan tool:

First, launch the PetScan tool. It opens in the first tab ('Categories').
  • Make sure you select Wikimedia Commons here, by clicking on Commons.
  • Categories: Type or paste one or more Commons category names that contain the files you want to retrieve. Omit the Category: prefix.
    • You can indicate how deep into the category tree you want to retrieve files. In the example linked below, we retrieve files that are directly in the category Uploaded with iNaturalist2Commons AND that are in the category Lepidoptera of Australia (or one of its subcategories, up to three levels deep).
  • Combination: if you select the radio button 'Intersection', you will only retrieve files that are in all your chosen categories, which is usually a smaller number of files. If you select the radio button 'Union', you will retrieve files that are in any of the categories you entered, which is usually a larger number. 'Intersection' is usually the logical option.

Go to the next tab in the tool ('Page properties').
  • Namespaces: Deselect the first (unnamed) checkbox and select the File checkbox. This indicates that you only want to retrieve file names (not categories, gallery page titles, etc.).

If you like, you can already click the 'Do it!' button at this point to verify that you are retrieving the right file names.

In some cases, it is convenient to retrieve the file names as plain text or in another format. You can adjust that in the last PetScan tab ('Output').
  • Format: for instance, select the 'Plain text' radio button.
  • Scroll down and click the 'Do it!' button again. You will now see the list of file names as plain text.

The following link opens the example described above, with HTML output: https://petscan.wmflabs.org/?psid=22129478

PetScan's full manual is available on meta.wikimedia.org.
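
Before pasting PetScan's plain-text output into OpenRefine, it can help to tidy the list. The Python sketch below is one possible approach, not part of PetScan or OpenRefine. The function name normalize_file_names is hypothetical. It assumes titles may arrive with or without a File: prefix and with underscores instead of spaces, so check your own output before relying on it.

```python
def normalize_file_names(plain_text: str, keep_prefix: bool = True) -> list[str]:
    """Turn pasted plain-text titles into a clean, de-duplicated list.

    Assumption: each line holds one title, optionally prefixed with "File:"
    and possibly using underscores instead of spaces.
    """
    seen = set()
    names = []
    for line in plain_text.splitlines():
        title = line.strip().replace("_", " ")
        if title.lower().startswith("file:"):
            title = title[len("file:"):].strip()
        if not title:
            continue  # skip blank lines and bare prefixes
        if keep_prefix:
            title = "File:" + title
        if title not in seen:  # keep the first occurrence, preserve order
            seen.add(title)
            names.append(title)
    return names


pasted = "File:Moth_one.jpg\nMoth two.jpg\n\nFile:Moth one.jpg\n"
print("\n".join(normalize_file_names(pasted)))
```

The result is one title per line, ready to paste into a new OpenRefine project.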

Other ways to obtain lists of file names to work with

You can also obtain this list in other ways, e.g. from the Wikimedia Commons or Wikidata query service, or by another method of your choosing.

Other ways to start OpenRefine projects with lists of file names

You may have just a list of file names, or a larger spreadsheet or dataset with extra data about the files. Both are good starting points in OpenRefine.

Depending on the data format you have, you can enter this data into OpenRefine and start a project with it. You can use OpenRefine's Clipboard option to paste a list of file names (or a small dataset) from your computer's clipboard. Alternatively, if your list of files is in a .csv file or spreadsheet, you can open that file in OpenRefine in the usual way.

You can read more about starting projects (and the settings for various data formats) in OpenRefine's user manual.
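
If your file names live in a script or another tool rather than in a spreadsheet, a small Python sketch like the following could produce a one-column CSV to open in OpenRefine. The function name file_names_to_csv and the column header File are illustrative choices, not something OpenRefine requires.

```python
import csv
import io


def file_names_to_csv(names, column="File"):
    """Return CSV text with one header row and one file name per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])
    for name in names:
        # csv.writer quotes names containing commas or quotes,
        # so they remain a single cell when imported.
        writer.writerow([name])
    return buffer.getvalue()


print(file_names_to_csv(["File:Moth, dorsal view.jpg", "File:Moth two.jpg"]))
```

Save the returned text as a .csv file, or paste it via the Clipboard option.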

Advanced tasks - uploading files

Retrieve EXIF data from files

Sometimes, you have very little metadata about a set of files, but there may be valuable information (e.g. the name of the author, the creation date, a description, geographic coordinates...) in the EXIF data of each file.

OpenRefine itself does not offer a way to retrieve this EXIF data, but other tools make this straightforward.

You can use ExifTool to create a CSV file with all the EXIF data from a list of files, which you then load into OpenRefine. This YouTube video explains the process nicely. The command used is a variant of: exiftool -csv *.jpg > exifdata.csv
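
The resulting CSV often has many columns. As an optional step before loading it into OpenRefine, a Python sketch like the one below could keep only the columns you care about. The function name select_exif_columns is hypothetical. Apart from SourceFile, the column names are assumptions about which tags your files contain, and the sample row is made up for illustration.

```python
import csv
import io

# Assumed column names; check the header row of your own exifdata.csv.
WANTED_COLUMNS = ("SourceFile", "Artist", "DateTimeOriginal",
                  "GPSLatitude", "GPSLongitude")


def select_exif_columns(csv_text, wanted=WANTED_COLUMNS):
    """Keep only the wanted columns that are actually present in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    present = [column for column in wanted if column in header]
    return [{column: row[column] for column in present} for row in reader]


# Made-up sample standing in for the contents of exifdata.csv
sample = (
    "SourceFile,FileName,Artist,DateTimeOriginal\n"
    "./moth.jpg,moth.jpg,Jane Doe,2021:05:01 10:00:00\n"
)
for row in select_exif_columns(sample):
    print(row)
```

For a real file, read exifdata.csv into a string and pass it to the function. Columns that are missing from the file are skipped rather than raising an error.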

GREL recipes for Wikimedia Commons

GREL to extract information from Wikitext

(Wikimedia Commons extension only) Extract values from template parameters: extractFromTemplate

This syntax only works when you have installed the Wikimedia Commons extension in OpenRefine.

Use the following syntax: extractFromTemplate(value, "BHL", "source")[0]

where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of that parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.
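
To illustrate what kind of result to expect, here is a rough Python stand-in. It is not how the extension implements extractFromTemplate. It is a simplified regex sketch that only handles templates without nested templates or piped links in their values, and the function name and sample Wikitext are made up for this example.

```python
import re


def extract_from_template(wikitext, template, parameter):
    """Return all values of `parameter` found in simple `template` calls."""
    values = []
    # Matches {{template|...}} whose body contains no nested braces.
    pattern = re.compile(
        r"\{\{\s*" + re.escape(template) + r"\s*\|([^{}]*)\}\}",
        re.IGNORECASE,
    )
    for body in pattern.findall(wikitext):
        for part in body.split("|"):
            name, separator, value = part.partition("=")
            if separator and name.strip() == parameter:
                values.append(value.strip())
    return values


wikitext = (
    "{{BHL\n"
    "| source = https://www.flickr.com/photos/biodivlibrary/10329116385\n"
    "| title = Example\n"
    "}}"
)
print(extract_from_template(wikitext, "BHL", "source")[0])
```

As with the GREL expression, taking index [0] picks the first match. An empty result list would raise an error at that point, so check for matches first in real use.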

(Wikimedia Commons extension only) Extract Wikimedia Commons categories: value.extractCategories

This syntax only works when you have installed the Wikimedia Commons extension in OpenRefine.

Use the following syntax: value.extractCategories().join('#')

This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.