Commons:Monuments database/Harvesting

From Wikimedia Commons, the free media repository

This page describes the harvesting of structured lists into the monuments database. Harvesting needs to be set up once per source; after that, it runs every night.

How does it work?

A bot harvests templates on Wikipedia. The bot uses the Pywikibot framework (rewrite). It loops over all the sources and, for each source, over the pages that contain a row template. For each row template, the bot grabs the different fields and inserts them into the database. Each source has a separate table whose fields match the fields in the row template. After all the sources have been harvested, these source tables are merged into one big table using one very big, automatically generated query.
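The harvesting loop can be sketched as follows. This is a minimal illustration, not the bot's actual code: it uses a regular expression and an in-memory SQLite database as stand-ins for Pywikibot page fetching and the real database, it only handles row templates without nested templates, and the field names `objectnaam` and `bouwjaar` as well as the helper names are made up for the example.

```python
import re
import sqlite3

# Illustrative source description. The keys mirror the configuration items
# described on this page; "objectnaam" and "bouwjaar" are made-up field names.
SOURCE = {
    "table": "monuments_nl_(nl)",
    "rowTemplate": "Tabelrij rijksmonument",
    "primkey": "objrijksnr",
    "fields": [
        {"source": "objrijksnr", "dest": "objrijksnr"},
        {"source": "objectnaam", "dest": "objectnaam"},
        {"source": "bouwjaar", "dest": ""},  # empty dest: the value is dropped
    ],
}


def extract_rows(wikitext, row_template):
    """Yield a dict of named parameters for each use of row_template.

    Simplification: assumes the template call contains no nested templates.
    """
    pattern = re.compile(
        r"\{\{\s*" + re.escape(row_template) + r"\s*\|(.*?)\}\}", re.DOTALL
    )
    for match in pattern.finditer(wikitext):
        params = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
        yield params


def quote(identifier):
    """Quote an SQL identifier (the table name contains parentheses)."""
    return '"' + identifier.replace('"', '""') + '"'


def harvest(conn, source, pages):
    """Store every row template found in pages in the source's own table."""
    kept = [f for f in source["fields"] if f["dest"]]
    table = quote(source["table"])
    columns = ", ".join(quote(f["dest"]) for f in kept)
    column_defs = ", ".join(quote(f["dest"]) + " TEXT" for f in kept)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (%s, PRIMARY KEY (%s))"
        % (table, column_defs, quote(source["primkey"]))
    )
    placeholders = ", ".join("?" for _ in kept)
    for wikitext in pages:
        for params in extract_rows(wikitext, source["rowTemplate"]):
            values = [params.get(f["source"], "") for f in kept]
            # The primary key makes a repeated harvest update existing rows.
            conn.execute(
                "INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
                % (table, columns, placeholders),
                values,
            )
    conn.commit()
```

Calling `harvest(sqlite3.connect(":memory:"), SOURCE, [wikitext])` with the wikitext of one list page fills the per-source table. The final merge into the shared table is not shown here.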

Configure a new source

Gather the information

File a Phabricator task with the necessary data for the configuration.

When using Structured lists

  1. You first need structured lists. This guide assumes you have those already. We need the header and row template.
  2. Add to the Phabricator task:
    • project : This is most likely Wikipedia ('wikipedia')
    • lang : The Wikipedia language code ('nl')
    • headerTemplate : The header template for the lists ('Tabelkop rijksmonumenten')
    • rowTemplate : The row template for the lists ('Tabelrij rijksmonument')
    • commonsTemplate : The template here at Commons to track images ('Rijksmonument'). You can leave this empty
    • commonsTrackerCategory : The category added by the previous template ('Rijksmonumenten with known IDs'). You can leave this empty
    • commonsCategoryBase : The base of the category tree at Commons ('Rijksmonumenten'). You can leave this empty.
    • autoGeocode : Whether to do auto geocoding (False/True). Always start with False.
    • unusedImagesPage : Page on Wikipedia where to report unused images ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Ongebruikte foto\'s'). You can leave this empty.
    • imagesWithoutIdPage : Page with a list of images without an identifier template at Commons ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Foto\'s zonder id'). You can leave this empty.
    • missingCommonscatPage : Page with a list of monuments where a category about the monument exists on Commons, but no link is in the list yet. ('Wikipedia:Wikiproject/Erfgoed/Nederlandse Erfgoed Inventarisatie/Missende commonscat links'). You can leave this empty.
    • namespaces : Namespaces to work on at Wikipedia ([0]).
    • table : Name of the table to store everything in (u'monuments_nl_(nl)'). The convention is monuments_<countrycode>_(<lang>). Please follow this convention.
    • truncate : Whether to empty out the table on every update (False). You need this if you don't have strong identifiers.
    • primkey : The primary key in the table ('objrijksnr'). This should be the identifier.
    • fields : All the fields the bot can find
      • source : The name of the field in the row template ('objrijksnr')
      • dest : The destination field in the SQL table (u'objrijksnr'). Please keep this ASCII to prevent problems. If you leave it empty, this information will simply be dropped.
      • conv : Whether to do any conversions. Deprecated; please leave empty.
    • sql_data : The mapping from your source fields to the shared fields (like id, name, address, municipality, lat, lon, image, source, changed & monument_article)
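Put together, the items above could be collected as in the following sketch, which uses the Dutch example values from the list. The exact layout of the real JSON files is not described on this page, so treat the structure as an assumption; in particular, the shape of `sql_data` shown here is hypothetical. The helper `check_config` is also made up for illustration and only checks two of the rules stated above.

```python
import json

# Example values taken from the list above. Optional items are left empty.
config = {
    "project": "wikipedia",
    "lang": "nl",
    "headerTemplate": "Tabelkop rijksmonumenten",
    "rowTemplate": "Tabelrij rijksmonument",
    "commonsTemplate": "Rijksmonument",
    "commonsTrackerCategory": "Rijksmonumenten with known IDs",
    "commonsCategoryBase": "Rijksmonumenten",
    "autoGeocode": False,  # always start with False
    "unusedImagesPage": "",
    "imagesWithoutIdPage": "",
    "missingCommonscatPage": "",
    "namespaces": [0],
    "table": "monuments_nl_(nl)",  # monuments_<countrycode>_(<lang>)
    "truncate": False,
    "primkey": "objrijksnr",
    "fields": [
        {"source": "objrijksnr", "dest": "objrijksnr", "conv": ""},
    ],
    # Hypothetical shape: shared field name -> destination field of this source.
    "sql_data": {"id": "objrijksnr"},
}


def check_config(cfg):
    """Return a list of problems found in cfg (empty if none were found)."""
    problems = []
    dests = [f["dest"] for f in cfg["fields"] if f["dest"]]
    if cfg["primkey"] not in dests:
        problems.append("primkey is not among the destination fields")
    for dest in dests:
        if not dest.isascii():
            problems.append("non-ASCII destination field: " + dest)
    return problems


if __name__ == "__main__":
    print(check_config(config))
    print(json.dumps(config, indent=2))
```

Including such a filled-in overview in the Phabricator task makes it easier for whoever writes the final configuration.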

For Wikidata data


Write and commit the configuration

Once the configuration is ready, it needs to be converted to our JSON configuration format and added to the bot source code. This has to be done through Gerrit. The monuments database maintainers will do it as time allows, but if you are tech-savvy, you can help by doing it yourself with the following instructions:

  1. You'll have to set up Git/Gerrit. The location of the repository is ssh://USERNAME@gerrit.wikimedia.org:29418/labs/tools/heritage.git
  2. Create a branch with a suitable name (bug/<bugid> or something like source/<country>_<lang>)
  3. Look into the monuments_config directory: each JSON file is a dataset
  4. Copy the configuration of an existing country that uses the same format as yours (structured lists or Wikidata) and start filling out the fields. Modifying an existing configuration is easier than starting from scratch.
  5. Test the harvesting using Docker; see the README file for the steps.
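Before running the Docker test, it can help to confirm that every file in the monuments_config directory is still valid JSON. The following is only a quick local check, not part of the bot; the function name is made up for this example, and it does not replace the test steps in the README.

```python
import json
from pathlib import Path


def find_invalid_json(config_dir):
    """Return {file name: error message} for JSON files that fail to parse."""
    errors = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            errors[path.name] = str(exc)
    return errors


if __name__ == "__main__":
    # Assumes the script is run from the root of the repository checkout.
    for name, message in find_invalid_json("monuments_config").items():
        print(name + ": " + message)
```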

Now that everything is set up, submit your patch.

Deploying a new source

Your patch will be reviewed and merged by a maintainer, who will also deploy the new source code to Toolforge (using the deploy-to-toollabs.sh shell script). Once deployed, the source will automatically be used and harvested every night.

