Commons:Requests for comment/Future of file metadata

From Wikimedia Commons, the free media repository

Background

There have been several discussions about where and how to store information about files, prompted primarily by the new Structured data feature, which adds Wikibase (Wikidata) functionality to Commons.

Current situation

Please help add information to assess the current situation.

Where information is stored

| Data | Wikitext | imageinfo metadata (API and bottom of the page) | Wikibase (Structured data) |
| --- | --- | --- | --- |
| Author | YES | NO | yes (copied from Wikitext by bots) |
| Source | YES | NO | yes (copied from Wikitext by bots) |
| Description | YES | NO | CAPTION (shorter and without links) |
| Permission/OTRS/... | YES | NO | partial (copied from Wikitext by bots) |
| File versions (with new filepage) | YES | NO | planned |
| Date (of recording) | YES | YES (for files with EXIF) | partial (only day, no time) |
| Coordinates | YES | YES (if not in EXIF as separate parameter through a module, then needs Wikitext template) | YES (copied from Wikitext by bots) |
| EXIF and FILE metadata | NO | YES | NO (some statements on some files) |
| What is on the file? | Categories | NO (possibly Tags in EXIF) | Statements |
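
To make the three storage locations concrete, here is a minimal sketch of how each one might be read through the Commons API. It only builds request URLs and sends nothing. The helper names and the example title and page ID are hypothetical; the parameters shown (prop=revisions, prop=imageinfo with iiprop=metadata, and wbgetentities with an "M" + page ID entity) are, as far as we know, the standard ones, but please check them against the API documentation before relying on this.

```python
from urllib.parse import urlencode

# Commons API endpoint.
API = "https://commons.wikimedia.org/w/api.php"


def wikitext_url(title):
    """URL for the raw wikitext of a file page (Author, Source, Description, ...)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def imageinfo_url(title):
    """URL for the imageinfo metadata (EXIF and other file header data)."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata|commonmetadata",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def structured_data_url(page_id):
    """URL for the Wikibase (Structured data) entity of a file.

    Assumes the MediaInfo entity ID is "M" followed by the page ID.
    """
    params = {"action": "wbgetentities", "ids": f"M{page_id}", "format": "json"}
    return API + "?" + urlencode(params)


# Hypothetical title and page ID, for illustration only.
print(wikitext_url("File:Example.jpg"))
print(imageinfo_url("File:Example.jpg"))
print(structured_data_url(12345))
```

Three separate requests are needed for one file, which is one way of seeing why the question of where each piece of information should live matters.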

Advantages and Disadvantages

Searching

Information in Wikitext
  • Standard search finds all content.
  • With advanced knowledge, more complex searching is possible.
Information in Metadata
  • Unknown
Information in Wikibase
  • Standard search uses captions
  • New Special:MediaSearch can search for statements
  • SPARQL Query service is planned but not yet released
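
As a rough illustration of the difference between these kinds of search, the sketch below builds three search requests: a plain full-text search, a wikitext-only search using the insource: keyword, and a statement search using the haswbstatement: keyword. The helper name is hypothetical, and the example values are assumptions chosen for illustration (P180 is the "depicts" property; Q90 is Paris). Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Commons API endpoint.
API = "https://commons.wikimedia.org/w/api.php"


def search_url(query, limit=10):
    """Build a search request restricted to the File: namespace (namespace 6)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": 6,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


# Standard search: matches the page text as rendered, including transcluded content.
full_text = search_url('"apple tree"')

# Advanced wikitext search: matches only what is literally written in the page source.
wikitext_only = search_url('insource:"{{Artwork"')

# Statement search: files with a "depicts" (P180) statement pointing at Paris (Q90).
by_statement = search_url("haswbstatement:P180=Q90")

for url in (full_text, wikitext_only, by_statement):
    print(url)
```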

Editing

Information in Wikitext
  • Well known by old editors
  • Possibly complex for new editors
  • Hard to edit on touch devices
Information in Metadata
  • Not meant to be edited anyway
Information in Wikibase
  • Editing itself is very easy and touch-friendly
  • Property and statement hierarchy is very complex

Proposal

Not written yet.

Discussion

This is not ready to be an RfC. Terms like "metadata" are almost meaningless here. Metadata is used by different types of contributor to mean:

  1. file header data (like EXIF data) that may include geocoordinates, camera type, photographer and includes information that is not visible to anyone on-wiki, such as OCR data embedded in PDFs
  2. associated technical file data that may be unique to Commons, like size, play length, multi-page numbers
  3. image data from a curator's perspective, like recorded date, taken location, photographer, which may not agree with file header data
  4. data from a wiki perspective like image notes, usage, footnotes about publications off-wiki

Despite the statement that metadata is not to be edited, all these types of data do get edited, updated or overwritten. -- (talk) 12:38, 3 November 2020 (UTC)

This page is currently not meant to be a final RfC; it is for creating an RfC. This page is exactly for what you did: assessing and discussing the situation. Renaming this page could be part of this. For me, metadata is all information that is not the pixels in the file itself. Is there really a way to edit the data you get when querying for prop=imageinfo without uploading a new file? --GPSLeo (talk) 13:04, 3 November 2020 (UTC)
Then please remove the 'live' RfC template. -- (talk) 13:10, 3 November 2020 (UTC)
The technical metadata question is complex. Data can be suppressed on the database separately from the wiki; this includes file histories. For most users, overwriting EXIF data would be a question of uploading a new version and, if necessary, suppressing the revision; this is how most privacy problems are addressed. Again, this is just one type of "metadata", in this case a subset of file header information that the current version of the Commons API chooses to interpret, and even that interpretation is a moving feast with unclear specifications, at least not clear to the vast majority of users. -- (talk) 13:14, 3 November 2020 (UTC)
  • «Editing itself is very easy» Really? For whom? With what kind of expectations? This is such BS, guys. If I wanted to dream up a caricature of a cartoon villain promoting WD/SCD and attempting to disrupt Commons I’d not go for something like this page: Too much on the nose. -- Tuválkin 00:16, 18 November 2020 (UTC)
The "very easy" is meant relative to Wikitext with all the templates used. Take captions as one example: you only need to hit one button, and then you can add the caption and save it. If you want to add a description to the file, you have to write {{langcode|1=The description}} in the correct position in the information template. Or if you want to add the information that there is an apple tree in the image, you just write "apple tree" in the language you want and then click the correct suggestion based on the short description. But how do you find the correct category for the apple tree in the image? --GPSLeo (talk) 17:42, 18 November 2020 (UTC)
Newbies have no idea why they can't press enter to write several lines in the caption, nor is it made clear why the past 20 years of wikitext conventions don't work in these fields, so there's nothing easy or obvious about it. As for matching categories, there is absolutely no reason why this same embedded search behaviour cannot be built in to the normal category fields, they already auto search and auto complete. -- (talk) 18:00, 18 November 2020 (UTC)
Newbies come from mobile devices and know messengers where new lines are not possible either. And newbies have never heard of Wikitext, so the argument that it is an old convention does not make sense when talking about new users. Yes, it would be possible to make the category system multilingual, but the developers decided not to work on this and to use Wikibase instead. And here we are assessing the current situation, not possible future features; that would be the next step. --GPSLeo (talk) 19:30, 18 November 2020 (UTC)

File information and Wikidata

Firstly, thank you to User:GPSLeo for creating this space where we can try to workshop through these questions. My apologies for being later here than I had hoped to be -- various real life things that had to be dealt with.

As per the subhead above, I am particularly interested in the question of file information for objects that have Wikidata items, even before we get on to possibilities that may get opened up by SDC. This is an issue that is already with us. Template {{Artwork}} is currently used on 300,000 files linked to objects with Wikidata items, from which it already can and does fill some or even all of the fields on the description page. Some fields on some pages have already had wikitext removed and been "dematerialised". Other file pages are increasingly being "born dematerialised", without ever having had wikitext in those fields. Template {{Map}} could probably also be adapted to work in much the same way.

I am currently getting to the final stages of preparing a combined Commons and Wikidata upload of a set of maps and prints, so I have a particular interest in trying to find out what we do think should be best practice, to try to keep as much as possible of the advantages of different options, and reduce as much as possible the disadvantages. For reference, d:User:Jheald/BL18C/fields contains a work-in-progress data mapping of metadata for this upload to wikidata properties and to fields in the templates; also File:Saint Christopher by Albrecht Dürer- V&A - E.679-1940.jpg, an example of a file drawing its metadata largely from Wikidata.

So, what are some pros and cons of dematerialisation, as I see them:

  • Some advantages of drawing from Wikidata
      • Internationalisation and linking: Commons is supposed to be an international project. Go to the page for the Durer print above and switch language to Russian, and almost everything gets turned into Cyrillic. With plain wikitext that doesn't happen. Yes, for some things we get round it by translation templates. But even then, Wikidata tends to have a wider range of languages for any particular concept; and it lets people edit and add labels in their own language, unlike wikitext which is very English-first.
        Secondly, if the value is drawn as a Q-number from wikidata, the template automatically picks up whether there is eg a Creator template, or a Wikipedia article, or a Commons category for the concept. With wikitext this generally will not happen.
      • Synchronisation: If a new wikipedia article or creator template gets made, or internationalisation gets added for a concept, what gets shown on the template page will automatically get updated to show it. With plain wikitext this does not happen.
        Also, if information gets improved on Wikidata, what gets shown by the template will automatically get updated, so the two views remain in sync. Templates like {{Artwork}} include a pencil icon on displayed fields to encourage editors to edit the relevant statement on Wikidata. In contrast if local wikitext gets edited to add new information, that added knowledge does not get synchronised to Wikidata.
  • Some disadvantages of drawing from Wikidata
      • Vandalism and tracking: The flip-side of encouraging people to make edits on Wikidata is that that may also lead to more questionable edits there. (Indeed, one argument made for the Short descriptions project on en-Wikipedia was to reduce bad edits on Wikidata.)
        Edits made on Wikidata do not show up on Commons watchlists, or in the histories of Commons pages, so they can get missed. (Though they do show up in the histories of the Wikidata pages, and in queries for changes made to particular sets of pages).
      • Transparency (?): It can be unclear just where data is coming from, or how new data can be added, particularly for fields not shown. (eg recent discussion at COM:VP, of a file where literally only {{Artwork}} appeared in wikitext, as the data for the file, such as it was, was all drawn from Wikidata; and even the Wikidata ID itself was drawn from SDC). This might be helped by always leaving at least some fields of the template open in wikitext.
      • Searching (?): A point made by User:Fæ in the discussion at COM:VPP previously was that information should be kept in wikitext as this is valuable for search. It's perhaps worth noting that basic searches still work fine, as search searches the text of pages after template content has been transcluded. So eg [1] still finds this file, even though the words "Beach scene with fish-sellers at Katwijk" appear nowhere in the local wikitext. But what Fae may be referring to is searches that start with the insource: keyword, specifically to search wikitext -- as used for example in files like this one to search for other content from books by the same author, part of the bulk upload by Fae from the Internet Archive.

A comment: When first planning the upload I'm working on, it was the 'questionable edits' point above that gave me most concern. So my first thought was to use templates like eg {{WrapWD|Q90}} (which displays as "Paris" with a link to wikidata:Q90) to represent field-values locally in wikitext, and dissuade people from editing the wikidata, while trying to keep as much of the benefit of internationalisation and linking as I could. (By the way, off-topic question: Where do we prefer such links to point? Traditionally, wikipedia has been the default preferred choice, if available. But it seems to me there's a lot to be said for trying to link to Commons categories as first preference, where possible. But anyway that's a whole different discussion).

However, I'm now tending more to doing what has now been done for so many paintings, i.e. for the fields in question just have statements on Wikidata that get picked up by the template, without local wikitext for those fields. And for me the balance of this has been tipped by synchronisation. As part of the upload process, I won't be able to match all strings to wikidata items, at least not to start with. So some values will initially go into wikidata (and be reflected by the template) as unmatched strings, using Wikidata properties like author name string (P2093). Doing this lets me get the images up as quickly as possible, with at least something (a string) as the field's value.

But later, as a follow-up, I will be coming back to those strings, and trying much more exhaustively to match them. If I have synchronisation, then I only need to make that edit once on Wikidata, and what's shown in the template will update automatically, and pick up all of the internationalisations, links, creator templates, etc automatically. Also it means that if anyone else had found the unlinked name on Commons, and wanted to link it, I wouldn't have to worry that they might have done that in wikitext, that might now get overwritten. If instead they had followed the 'pencil' link, and edited the entry to a Q-number on wikidata, they would have already done my job for me, and in the process taken one more name off my to-do list. Even when that hasn't happened, it's far easier for me if I know that I (or anybody else) only has to deal with the 'matching' issue on Wikidata (for which there are well-developed workflows), knowing that what gets shown on Commons will update automatically accordingly.
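
The string-first, match-later workflow described above can be modelled in a few lines. This is only a toy sketch of the display logic, not how {{Artwork}} is actually implemented: the function name, the data layout and the item ID "Q_EXAMPLE" are all hypothetical. It assumes a matched creator is stored as an item under creator (P170) and an unmatched one as a plain string under author name string (P2093).

```python
def display_creator(statements, labels, lang):
    """Return the text a template might show in the creator field.

    statements: dict mapping property IDs to values for one artwork (toy model).
    labels:     dict mapping item IDs to {language code: label} (toy model).
    lang:       language code of the reader.
    """
    qid = statements.get("P170")  # creator, present once the name has been matched
    if qid is not None:
        item_labels = labels.get(qid, {})
        # Prefer the reader's language, then English, then the bare item ID.
        return item_labels.get(lang) or item_labels.get("en") or qid
    # Not matched yet: fall back to the plain author name string (P2093).
    return statements.get("P2093", "")


# "Q_EXAMPLE" is a made-up placeholder, not a real Wikidata item ID.
labels = {"Q_EXAMPLE": {"en": "Albrecht Dürer", "ru": "Альбрехт Дюрер"}}

before = {"P2093": "A. Durer"}                      # uploaded with a string only
after = {"P170": "Q_EXAMPLE", "P2093": "A. Durer"}  # later matched on Wikidata

print(display_creator(before, labels, "ru"))
print(display_creator(after, labels, "ru"))
```

The point of the sketch is that the one edit that adds the item is made in a single place, and every language view changes with it; with local wikitext the same improvement would have to be made on each file page.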

So that's why personally, at the moment, I'm leaning towards the 'dematerialisation' model for my upload. But I want to get this right, and go with whatever approach as far as possible maximises all the advantages, and minimises all the disadvantages.

And so that's why I really want to know what are ways to minimise disadvantages; or, alternatively, what are other approaches that could still manage to secure as much as possible of the positives? Jheald (talk) 21:25, 25 November 2020 (UTC)

  • Is this question about moving every separate image to Wikidata? A prime example of the headaches it creates is the thousands of images inside Category:19th-century photographs of Brazil. I think the first question should be which images (if any) should be, and already are, at Wikidata, and whether the "metadata" from Wikidata should be copied here or just incorporated automatically. I think it doesn't make sense for an image that is uploaded to Commons to have all its corresponding data at Wikidata, as it makes every Commons template and discussion just circular. -- Ricky81682 (talk) 18:41, 2 December 2020 (UTC)