Commons:Structured data/Archive/2014/Short introduction to Wikidata


What is Wikidata?

Wikidata is a structured central store of information that can be accessed through appropriately written templates on Commons and other wikis.

So, for example,

Both of these examples are still in early stages of development, but they use current live data, and show the possibilities for drawing data from the central store, rather than keying it in separately on each wiki where it is used. Furthermore, the design of Wikidata is inherently multilingual, so it is easy to switch language, and the information drawn from Wikidata will be automatically localised to any language in which it is available, or the user's preferred fallback.

Wikidata also contains the information used (by default) to construct each page's sidebar sitelinks to the corresponding pages in other wikis.

How is information stored on Wikidata?

Items

The basic organising element on Wikidata is the item. An item on Wikidata is roughly analogous to an article on a Wikipedia (though d:Wikidata:Notability may be somewhat more permissive).

Each item on Wikidata is identified by a unique Q-number, so for example Q42 identifies the Wikidata item for Douglas Adams. The contents of an item can be viewed using the Reasonator tool, or alternatively through the (more edit-orientated) Wikidata page, e.g. d:Q42.
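As a minimal sketch of how a tool might recognise these identifiers, the following Python helper splits an identifier into its kind and number. It also accepts P-numbers, which identify properties. The names EntityId and parse_entity_id are illustrative only and are not part of any Wikidata library.

```python
import re
from typing import NamedTuple


class EntityId(NamedTuple):
    kind: str      # "item" for Q-numbers, "property" for P-numbers
    number: int


# A Q-number or P-number is a single letter followed by a positive integer.
_ID_PATTERN = re.compile(r"^([QP])([1-9][0-9]*)$")


def parse_entity_id(text: str) -> EntityId:
    """Split an identifier such as 'Q42' or 'P195' into its kind and number."""
    match = _ID_PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a Q-number or P-number: {text!r}")
    kind = "item" if match.group(1) == "Q" else "property"
    return EntityId(kind, int(match.group(2)))


print(parse_entity_id("Q42"))
```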

Properties

Most information about an item is stored in the form of properties. Each property is identified by a P-number, and takes a value, which might be a time or a string or one of the limited number of other datatypes that Wikidata supports. Most often the type of value required will be another Wikidata item. So for example on d:Q2480921 (Reasonator), the property d:Property:P195 is used to identify the collection that the painting belongs to, taking the value d:Q1948674 -- the Groeningemuseum in Bruges. Together, a property-value pair is the most basic kind of statement that can be made about an item. Statements can be added either by editing an item's Wikidata page directly, or by a purpose-written bot, or using bulk tools such as QuickStatements.

A list of the properties that have so far been approved can be found at d:Wikidata:List of properties, which is broken down by subject areas. Alternatively, various WikiProjects on Wikidata maintain schemes of properties appropriate for certain kinds of items: so Wikidata's WikiProject Visual Arts publishes a scheme of properties for artworks, while its WikiProject Books publishes properties for a specified edition of a book and properties for an underlying book title. More details about a property are given on the property's talk page, for example d:Property talk:P195. This may include sample-cases of its use, the datatype it requires, and details of any constraints on the values that can be attached to it.

Diagram: language for describing statements. Each statement contains a property and a value, optionally with additional qualifiers. If the statement also includes one or more references, the property-value-qualifiers combination is called a claim. An item may contain several statements relating to the same property.

An item may contain more than one property-value pair for the same property. For example an artwork may depict (P180) many things; or it may have been in different collections (P195) at different times. These can be distinguished by adding one or more qualifiers to the property-value pair. Each qualifier is a property and a value that relates to the statement -- for example: start date (P580) and end date (P582) could be used to record when an artwork was in a particular collection; applies to part (P518), if a statement applies to a specific tangible part of an item; or determination method (P459), if a particular method was used to establish the statement.

An item may also carry different property-value pairs for the same property if different sources disagree -- for example, over the attribution of an artwork to a particular creator (P170). Statements should therefore also contain references. A reference is specified using a further property-value-qualifiers triple. Ideally the property "stated in" (P248) should be used, with the Q-number of the Wikidata item for the particular source; but it is also possible to use reference URL (P854) with a bare URL. Normally a separate item with its own Q-number should be maintained for the particular article or (for a book) the particular edition of the book that is the target of a reference, using the schemes at d:Help:Sources. References can be made more precise by adding qualifiers such as page (P304) or quote (P387). Qualifiers such as volume (P478) or chapter (P792) may be useful in references to a book edition. (For a journal article, this information will usually already have been specified in the article's item.) Where a statement includes references, the property-value-qualifiers triple being referenced is called a claim. The accompanying diagram shows the whole structure.

Conflicts between statements can be resolved by ranking: by default statements are given a ranking of "normal"; incorrect or out-of-date statements can be marked "deprecated"; while a statement considered best can be marked "preferred".
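The structure described above (property, value, qualifiers, references and rank) can be sketched as a small Python data class. This is an illustration only, not Wikidata's actual data model or API. The selection rule in best_statements (use "preferred" statements if there are any, otherwise "normal" ones, never "deprecated" ones) is an assumption about how a consumer might apply ranks. Identifiers such as Q_EARLIER_COLLECTION and Q_CATALOGUE, and the dates, are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Statement:
    prop: str                       # P-number of the main property
    value: str                      # an item identifier, for simplicity
    qualifiers: Dict[str, str] = field(default_factory=dict)
    references: List[Dict[str, str]] = field(default_factory=list)
    rank: str = "normal"            # "preferred", "normal" or "deprecated"


def best_statements(statements: List[Statement], prop: str) -> List[Statement]:
    """Return the preferred statements for prop if any exist, else the normal ones.

    Deprecated statements are never returned (assumed selection rule).
    """
    matching = [s for s in statements if s.prop == prop]
    preferred = [s for s in matching if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in matching if s.rank == "normal"]


# Placeholder data: two collections (P195) over time, one deprecated creator (P170).
statements = [
    Statement("P195", "Q_EARLIER_COLLECTION",
              qualifiers={"P580": "1850", "P582": "1900"}),
    Statement("P195", "Q1948674",
              qualifiers={"P580": "1900"},
              references=[{"P248": "Q_CATALOGUE", "P304": "12"}],
              rank="preferred"),
    Statement("P170", "Q_DOUBTFUL_ATTRIBUTION", rank="deprecated"),
]

print([s.value for s in best_statements(statements, "P195")])
```

With this data, only the preferred collection statement is selected for P195, and no statement is selected for P170 because its only statement is deprecated.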

One particularly important property, that ought to be set for all items, is instance of (P31), defining the kind of thing that an item is. An appropriate value of this property is frequently a requirement in constraint checks for other properties.

Other information

Not all information about an item is stored in the form of properties. Information is also stored as labels, descriptions, aliases and sitelinks.

  • The label gives the most common name that an item would be known by. Wikidata is inherently multi-lingual, so appropriate labels can be stored in multiple different languages, with the user shown their currently selected user language, or best available fallback.
  • The description is a short phrase to describe the item, which serves to disambiguate items with the same or similar labels. Again, Wikidata can store descriptions in a multitude of available languages.
  • Aliases are alternative names for an item, which are matched by searches in the same way as labels. An item can have multiple aliases in each language (but only one label).
  • Sitelinks identify pages on Wikimedia wikis in 1-to-1 correspondence with an item. This generates essentially the list that is presented on the sidebar of a Wikipedia article. Sitelinks to redirects are possible, but only if the redirect is temporarily turned into a normal page, then reverted back to being a redirect once the sitelink has been constructed.
    Note that it is only possible for an item to be sitelinked to a single page on each wiki. One consequence of this is discussed below, in the section Commons-Wikidata sitelinks.
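The label fallback described in the list above can be sketched as follows. This is a simplified illustration: the function name pick_label and the default fallback chain are assumptions, and the real fallback rules depend on the user's language settings.

```python
from typing import Dict, Optional, Sequence


def pick_label(labels: Dict[str, str],
               user_language: str,
               fallbacks: Sequence[str] = ("en",)) -> Optional[str]:
    """Return the label in the user's language, or in the first available fallback."""
    for language in (user_language, *fallbacks):
        if language in labels:
            return labels[language]
    return None  # no label available in any requested language


# One label per language; this small set is illustrative, not the full item.
labels = {"en": "Douglas Adams", "ru": "Дуглас Адамс"}

print(pick_label(labels, "ru"))  # label exists in the user's language
print(pick_label(labels, "fr"))  # no French label here, so the fallback is used
```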

Getting information out of Wikidata

In the simplest case, it is possible to extract the value of a single property of the item connected with a page using the parser directive {{#property:P195}}, as described at m:Wikidata/Notes/Inclusion syntax. Earlier drafts planned for more extensive parser-level functionality, but these have been scaled back.
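To make the lookup chain behind the parser directive concrete, here is a rough Python imitation using a toy in-memory store. It is not the actual implementation: the page title "Some painting", the comma-separated output and the empty result for unconnected pages are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy in-memory stand-ins for the central store (contents are illustrative).
SITELINKS: Dict[str, str] = {"Some painting": "Q2480921"}   # page title -> item
CLAIMS: Dict[str, Dict[str, List[str]]] = {"Q2480921": {"P195": ["Q1948674"]}}
LABELS: Dict[str, str] = {"Q1948674": "Groeningemuseum"}


def render_property(page_title: str, prop: str) -> str:
    """Imitate {{#property:...}}: show the values of prop for the page's own item."""
    item = SITELINKS.get(page_title)
    if item is None:
        return ""                      # page is not connected to any item
    values = CLAIMS.get(item, {}).get(prop, [])
    # Item-valued properties are shown by label; fall back to the bare Q-number.
    return ", ".join(LABELS.get(value, value) for value in values)


print(render_property("Some painting", "P195"))
```

The key point is that the lookup starts from the page's own sitelinked item; the directive takes no item argument.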

More complicated data extraction can be accomplished in Lua, as described at mw:Extension:Wikibase_Client/Lua.

A number of intermediary Lua modules and templates built on them have been written, which continue to evolve, to extract data and format it in particularly useful ways, most notably d:Template:Data, based on fr:Modèle:Wikidata. As of October 2014 the syntax for these continues to develop, and is not yet standardised across different wikis.

To obtain lists of items matching certain criteria, database-like queries are available using the Wikidata Query (WDQ) API. This can in particular be accessed using the Autolist or Autolist 2 front-ends, which also allow the results of a search to be combined with the contents of a category (or tree of categories) on a Wikimedia wiki.
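The idea of combining a query result with the contents of a category can be sketched with plain set operations. This does not use the WDQ API or its query syntax; the data is a toy in-memory stand-in, and every identifier other than Q2480921 and Q1948674 is a placeholder.

```python
from typing import Dict, List, Set

# Illustrative data only: item -> {property -> list of item values}.
CLAIMS: Dict[str, Dict[str, List[str]]] = {
    "Q2480921": {"P195": ["Q1948674"]},
    "Q_OTHER_PAINTING": {"P195": ["Q1948674"]},
    "Q_UNRELATED": {"P195": ["Q_OTHER_MUSEUM"]},
}
# Items assumed to be linked from the pages of some category on a wiki.
CATEGORY_ITEMS: Set[str] = {"Q2480921", "Q_UNRELATED"}


def items_with_claim(prop: str, value: str) -> Set[str]:
    """Return every item that carries the given property-value pair."""
    return {item for item, props in CLAIMS.items()
            if value in props.get(prop, [])}


in_collection = items_with_claim("P195", "Q1948674")
print(sorted(in_collection & CATEGORY_ITEMS))  # in the collection and in the category
print(sorted(in_collection - CATEGORY_ITEMS))  # in the collection but not in the category
```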

Wikidata deployment to Wikimedia wikis

The full Wikidata functionality is only being deployed to public-facing wikis like Commons in phases, once technical requirements have been met, and after local RfCs (although the full database is available to WDQ and other tools).

Phases

  • Phase 1 (live on Commons already). The most basic use of Wikidata is central management of sitelinks. This went live on Commons in September 2013. The sitelinks shown in the sidebar of a page are now by default drawn from the page's corresponding item on Wikidata (though it is still possible to over-ride these values by specifying language links explicitly on the page).
  • Phase 2 (RfC approved; now live). Activation of Phase 2 allows a template on a page to draw information from the Wikidata item specifically related to that page. This would allow something like the automatically-multilingual gallery header template presented at d:Template:SimpleCommonsGalleryHeader/test (or some more developed version of it) to be deployed on gallery headers.
    However, under Phase 2, such a template cannot access information stored on any other Wikidata item. Thus information about an artist from Wikidata could be displayed on a gallery specifically related to that artist, but not on any other page. Some consequences of this are discussed further below.
    Phase 2 is now activated for Commons, though it has been little used so far.
  • Phase 3 (expected to be available in early 2015). In Phase 3, templates can access and display information drawn from any item on Wikidata, not just the item in 1:1 relation with the page. Phase 3 is currently available on Wikidata itself, which is why e.g. the gallery header test and Creator/wrapper/test linked to above were constructed on Wikidata.
    Enabling "arbitrary access" allows data from a particular item on Wikidata to appear on, or affect the output of, a much larger range of final pages -- pages which need to be reliably purged from the cache and regenerated if the underlying value on Wikidata is changed. Code is therefore being developed to track such usage (bugzilla:47288), and this will need to be completed before it is possible to enable Phase 3 on live wikis such as Commons (bugzilla:47930).
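The need for usage tracking under arbitrary access can be illustrated with a small sketch. This is not the code being developed for the bugs cited above; the class UsageTracker, its methods, and the page titles and item identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class UsageTracker:
    """Remember which pages used which items, so that a change can trigger purges."""

    def __init__(self) -> None:
        self._pages_by_item: Dict[str, Set[str]] = defaultdict(set)

    def record_usage(self, page: str, item: str) -> None:
        """Note that rendering `page` read data from `item`."""
        self._pages_by_item[item].add(page)

    def pages_to_purge(self, changed_item: str) -> Set[str]:
        """Return the pages whose cached output may be stale after a change."""
        return set(self._pages_by_item.get(changed_item, set()))


tracker = UsageTracker()
# Under arbitrary access, many pages may draw on the same (placeholder) artist item.
tracker.record_usage("Gallery about the artist", "Q_ARTIST")
tracker.record_usage("Category:Paintings by the artist", "Q_ARTIST")
tracker.record_usage("File page of one painting", "Q_ARTIST")

print(sorted(tracker.pages_to_purge("Q_ARTIST")))
```

Under Phase 2 each item is used by at most one page per wiki, so the page to purge follows directly from the sitelink. With arbitrary access the mapping becomes one-to-many and has to be recorded explicitly.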

Commons-Wikidata sitelinks

Diagram: interwiki links as currently constituted between different sister projects on Wikidata. Each item can only have one link to each project.
Diagram: the workaround, category/article duality.

As noted above, each item (Q-number) on Wikidata can only be site-linked to a single target on each Wikimedia wiki. In particular, this means that each item can only be site-linked either to an article or to a category. This makes some sense for Wikipedias, where there are some distinctive category-like properties that only apply to categories; and where the category structure (and so the principal category, if any, for each item) may vary from language to language. It has therefore been preferred to keep the relationships between categories distinct from the relationships between articles.

However, on Commons, the most common sitelinks between Commons and Wikipedias have been from Commons Categories to Wikipedia Articles. But, in addition, Commons Categories may also correspond to Wikipedia Categories, and both may have associated galleries. From a Commons perspective, the most natural solution would be for a Wikidata Q-number to be linked with both the Wikipedia article and the Wikipedia category most closely linked to that thing, as well as the relevant Commons category and Commons gallery (if any).

However, this is not possible as Wikidata is currently constituted. Instead, at present, each item on Wikidata can only be linked to a single target on each sister project.

As a work-around, Wikidata therefore has a parallel structure of category-like items and article-like items, connected by the connector properties topic's main category (P910) and category's main topic (P301).

This structure has some serious drawbacks:

  • It means that when a reader clicks to an article, they will not automatically see a Commons category link in the sidebar, only a gallery link (if there is one); and if a reader clicks to a Commons category, in the category sidebar they will not see links to articles in multiple languages, only categories (in languages where they exist). (Note: sidebar links to Commons are currently available only as a beta preference; they are due to be rolled out to full production use soon).
  • It also means that, since at the moment wiki-pages can only access information from Wikidata items to which they are directly linked, it is not currently possible for a Commons category to draw information from the corresponding article-like Wikidata item.
    Wikidata deployment Phase 3, which would enable such "arbitrary access", is not expected to be available before early 2015.
  • It means that a category on Commons has to keep track of a separate article-like item on Wikidata to describe the object that the category represents, as well as the category's directly-linked category-like item describing just the category (if that item exists).

These consequences are all deeply unfortunate. However, so long as each Wikidata item can only link to a single target on each sister project, if the structure is to remain stable, predictable and traversable, it is essential that the category-to-category and article-to-article rule is observed.
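The traversal that a Commons category would need in order to reach its article-like item can be sketched as follows, assuming the connector property category's main topic (P301) is set on the category-like item. All identifiers and the page title are placeholders, and the in-memory dictionaries stand in for data that would really live on Wikidata.

```python
from typing import Dict, Optional

# Placeholder data: one category-like item and one article-like item.
CLAIMS: Dict[str, Dict[str, str]] = {
    "Q_CATEGORY_ITEM": {"P301": "Q_ARTICLE_ITEM"},    # category's main topic
    "Q_ARTICLE_ITEM": {"P910": "Q_CATEGORY_ITEM",     # topic's main category
                       "P373": "Example category"},   # Commons category name
}
# Commons sitelinks: category page -> category-like item only.
COMMONS_SITELINKS: Dict[str, str] = {"Category:Example category": "Q_CATEGORY_ITEM"}


def article_item_for_category(commons_page: str) -> Optional[str]:
    """Follow the sitelink, then P301, from a Commons category to the article-like item."""
    category_item = COMMONS_SITELINKS.get(commons_page)
    if category_item is None:
        return None                    # category has no sitelinked item
    return CLAIMS.get(category_item, {}).get("P301")


print(article_item_for_category("Category:Example category"))
```

The second hop reads a different item from the one the page is sitelinked to, which is exactly what is unavailable to templates until arbitrary access is enabled.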

Statistics

The results of some database searches are given at d:Wikidata:WikiProject Structured Data for Commons/Phase 1 progress/Statistics. As of September 2014:

  • there are 3,400,000 categories and 112,000 galleries on Commons in total
  • 700,000 article-like items have a related Commons category identified by Property:P373
  • 240,000 Commons categories have sitelinks to corresponding category-like items; 84,000 Commons non-categories have sitelinks to corresponding non-category-like items
  • 95,000 Commons categories are currently sitelinked to non-category-like items. Of those about 64,000 currently have P373 set.
  • About 55,000 of the 700,000 article-like items with Property:P373 set currently also have Property:P910 (topic's main category) set. A further 35,000 items can be identified with a Commons category indicated by P373 that matches the value of P373 for a category-like object.
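To put these counts in proportion, the following short calculation derives a few percentages from the figures quoted above. Only the listed numbers are used; the helper name percent is arbitrary.

```python
def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


categories_total = 3_400_000          # Commons categories
items_with_p373 = 700_000             # article-like items with P373 set
cats_to_category_items = 240_000      # categories sitelinked to category-like items
cats_to_other_items = 95_000          # categories sitelinked to non-category-like items

print(percent(cats_to_category_items, categories_total))  # 7.1
print(percent(cats_to_other_items, categories_total))     # 2.8
print(percent(64_000, cats_to_other_items))               # 67.4 (of those, with P373 set)
print(percent(55_000, items_with_p373))                   # 7.9 (also have P910 set)
```

On these figures, roughly one Commons category in ten has any sitelink to Wikidata at all.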

What next?

Some currently possible tasks that could help improve the integration of Wikidata and Commons include:

  • Better item plumbing -- More Commons categories and Commons galleries should be sitelinked to relevant items on Wikidata, with appropriate targets for Property:P373 identified and set. Sitelinks from article-like items to categories should be removed. Appropriate Property:P910/Property:P301 pairs should be created where there are both article-like and category-like items.
    For the time being, links from categories to articles will need to be handled by locally-coded templates (which should also link to the article-like item) rather than by sitelinks. These templates should be developed and rolled out.
  • Template development -- More development should be done on the possible templates for gallery headers and category headers.

Structured Data initiative

The largest change that the Structured Data initiative proposes to make will be the creation of a second Wikidata-like database specifically for Commons, sometimes given the name CommonsData.

Under these proposals, items on the new CommonsData would be used for information that relates specifically to a single file, and each file would have a corresponding item on CommonsData; while items on Wikidata would be used for information shared between many files (e.g. details about creators). Properties on CommonsData would take as their values either another CommonsData item, or a Wikidata item, or else one of the same datatypes used by Wikidata.

The overall structure used for items on CommonsData would be similar to that used for items on Wikidata already. Discussions about the detailed data modelling, properties required, APIs for exposure to gadgets and tools, and other aspects are all currently ongoing: for details see elsewhere in this set of pages.

Resources and further information