Commons talk:Structured data/Overview

October 2016 Consultation

Please share your thoughts about the project plan shared at Commons:Structured data/Overview. In particular, we are seeking feedback on the questions described in this section of the Overview. Please create subsections on this page. We look forward to your discussion. Astinson (WMF) (talk) 01:02, 26 October 2016 (UTC)

This is NOT a new project proposal. Comments here should relate to the October 2016 update from the WMF on possibly expediting the project. — Preceding unsigned comment added by Seddon (WMF) (talk • contribs) 11:43, 26 October 2016 (UTC)

Questions

The "Why do we need your comment? How can you help?" section had some questions. I copied them below:

Do you see this expedited roadmap as a worthy undertaking

yes --Jarekt (talk) 02:10, 26 October 2016 (UTC)
Yes. - PKM (talk) 22:55, 26 October 2016 (UTC)
Yes --John Cummings (talk) 12:19, 27 October 2016 (UTC)
No. Before enabling new stuff please fix the existing one. --Steinsplitter (talk) 11:23, 27 October 2016 (UTC)

@Steinsplitter: Could you specify what you would like to have fixed? ChristianKl (talk) 08:53, 29 October 2016 (UTC)

Just look at the open bugs at phabricator. --Steinsplitter (talk) 18:30, 31 October 2016 (UTC)

Yes. --Denny (talk) 20:02, 27 October 2016 (UTC)
Yes. Multichill (talk) 21:35, 27 October 2016 (UTC)
Yes. ChristianKl (talk) 08:53, 29 October 2016 (UTC)
Yes. --Micru (talk) 23:14, 29 October 2016 (UTC)
Yes!! Susanna Ånäs (Susannaanas) (talk) 07:16, 31 October 2016 (UTC)
The sooner the better. YES. Spinster (talk) 08:52, 31 October 2016 (UTC)
Long overdue, imho. --El Grafo (talk) 13:21, 31 October 2016 (UTC)
Yes. I am pleased to see a move to bring Commons into the 21st century. A file-repository never worked well with Wikipedia software, particularly the mess that is categories. Go for it! -- Colin (talk) 13:38, 1 November 2016 (UTC)
Yes, I endorse the high level concept. I am not aware of anyone saying how much this would cost, who would take responsibility for management, what minimal promises can be made for any level of investment, or how this compares with other options. If there were a choice of options to support which each had a price tag then I am not sure which among the high-level projects I would prefer. I hope that no one interprets this poll as a community preference in favor of other options. A lot of other things might be done with Wikidata preferentially and at lower cost, and I am not sure who is making decisions or how. I can only guess that this project would be among the most expensive, complicated, and risky directions for development. The risk is going over budget and time for subpar delivery, and I would like to see some solid unqualified successes out of the WMF. Structured data in Commons is something that I dearly want, among other things that I dearly want. Blue Rasberry (talk) 16:43, 1 November 2016 (UTC)[reply]
Yes, but Steinsplitter made a good point above. There are important issues which have needed fixing for a long time. Off the top of my head: 1. interwikis with galleries and categories to the rest of Wikimedia, 2. using WD for authors (Creator pages, etc.). Yann (talk) 18:15, 1 November 2016 (UTC)
Most certainly yes. It's really good to hear that efforts are now being made to look seriously at some of the fundamentals of the Commons software for the first time in a decade. MichaelMaggs (talk) 12:47, 8 November 2016 (UTC)
Yes. Make it happen! --Beat Estermann (talk) 13:27, 10 November 2016 (UTC)
No. Frankly, the effort seems confused and backwards. I look at http://structured-commons.wmflabs.org/wiki/File:LighthouseinDublin.jpg and see an instance of a photograph. I'd expect the photograph to have a Q-number and I'd expect the descriptors to hang off that Q-number. I'd expect it to be an instance of a photograph. I'd expect it to have a subject that is the Poolbeg Lighthouse. Instead there's a new type MediaInfo and stuff hangs off of it. For licensing, I expect there to be a generalized author for copyright purposes, but instead there is a photographer. So the photo doesn't have a photographer but the MediaInfo does; the property attaches to the wrong object; the photographer did not take a photo of the MediaInfo. There's no statement that the media is a photograph. There's MediaInfo that can be ripped from the JPEG metadata, but it is less accurate than the data in the JPEG (eg, date taken). The MediaInfo label is Poolbeg Lighthouse, but the object is not the lighthouse but rather a photograph of the lighthouse. It depicts a "lighthouse", but it should be depicting Poolbeg Lighthouse (Q7228600) (an instance of a lighthouse Q39715 which is a tower which is an architectural structure ... so dig deep to find a 3D object and know that panorama is an issue). The MediaInfo approach is a contorted view of a simpler problem. The basic world is simpler. Look at A Connecticut Yankee in King Arthur's Court Q848612. Copyright status falls out of that Q-number. It's an instance of P31 a book that was published P577 in 1889. The author P50 is Mark Twain who died P570 in 1910. I'll use country of origin P495 to infer publication was in US. That gives me PD-old-auto-1923|deathyear=1910. All done outside of the MediaInfo type. See also Moonrise, Hernandez, New Mexico (Q17107995) which has more complicated issues; individual prints of the photograph are significantly different. Far from expediting, I think somebody needs to pull the MediaInfo plug. 
Glrx (talk) 01:51, 21 November 2016 (UTC)
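
The inference chain described in the comment above (instance of P31, publication date P577, author P50 with date of death P570, country of origin P495, leading to PD-old-auto-1923|deathyear=1910) can be sketched roughly as follows. This is only an illustration under loud assumptions: the `Work` class and `infer_license_tag` function are hypothetical, the statements are hard-coded rather than fetched from Wikidata, and the single rule shown is a simplification of the reasoning in the comment, not the actual logic of any Commons template or a statement of copyright law.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    """Hypothetical, simplified stand-in for the statements on a Wikidata item."""
    instance_of: str                  # value of P31, e.g. "book"
    publication_year: int             # year taken from P577
    author_death_year: Optional[int]  # P570 of the author given by P50
    country_of_origin: str            # value of P495, e.g. "US"


def infer_license_tag(work: Work) -> Optional[str]:
    """Return a Commons-style tag string, or None if this toy rule does not apply.

    Assumption for illustration only: a work first published in the US before
    1923 whose author's death year is known gets PD-old-auto-1923.
    """
    if (
        work.country_of_origin == "US"
        and work.publication_year < 1923
        and work.author_death_year is not None
    ):
        return f"PD-old-auto-1923|deathyear={work.author_death_year}"
    return None


# The example from the comment: A Connecticut Yankee in King Arthur's Court.
yankee = Work(
    instance_of="book",
    publication_year=1889,
    author_death_year=1910,
    country_of_origin="US",
)
print(infer_license_tag(yankee))  # PD-old-auto-1923|deathyear=1910
```

Anything that falls outside the one rule returns None here; a real implementation would need many more cases and human review.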

What roadblocks, risks or challenges do you anticipate with accelerating such a project

I do not anticipate that you are ever going to get rid of the text version of some page descriptions. We have had the same problem several times before:

Early images did not require an infobox, like {{Information}}, so for years people were adding images without them. We put a lot of effort into creating and maintaining a core set of infoboxes and unifying hundreds of other rarely used infoboxes and description templates into them. We also put a lot of effort into adding infoboxes to files lacking them; however, ~1% of files (in Category:Media missing infobox template) are still missing them. I expect they will still be missing years from now. The reason is that the remaining files mostly require manual processing, and that is a very boring task that nobody is lining up to do. Also, many of those files do not meet current standards of documentation and end up with mostly empty fields, and some are randomly deleted for lack of source or author.
We had better luck some years ago with enforcing the rule that all files require a template with a copyright tag. It was a massive job to add license templates to all the files that never had them or lost them.
Another perpetually unfinished task is the transfer of files from Wikipedias to Commons. The main issue is the impossibility of automatically converting from one format of wikitext-based description to another. So it has to be done manually while dealing with missing data and frequent deletion of old images for lack of current-day metadata. I expect that we will get 80-90% done but the remaining files will be with us for a long time. --Jarekt (talk) 02:10, 26 October 2016 (UTC)

Another challenge we run into is stubborn users who do not like their files being moved from Wikipedia to Commons, or do not like the {{Information}} template and will engage in a war with anybody who adds standard infoboxes, or do not like any of the standard license templates and write their own text of a license. The same users might create successful roadblocks to using Wikibase-type descriptions. --Jarekt (talk) 02:10, 26 October 2016 (UTC)
- perhaps the stubborn users object to their items which were stable on wikipedia for years, but are then transferred and deleted on commons. we might also talk about the stubborn admins on commons who do not play nice with anyone including wikidata. Slowking4 § Richard Arthur Norton's revenge 19:30, 31 October 2016 (UTC)
  - @Slowking4: I agree. Sometimes people transfer files from Wikipedia, relying on tools which are very bad at keeping all the image metadata. Then images are deleted from Commons due to insufficient metadata without notifying the photographer, only the user or bot that did the transfer. It is maddening, but unfortunately not uncommon. I fully support users who happen to be "stubborn" about future moves by others. However, the safest route is to move them yourselves. I also agree that being an admin on Commons, or any other project, does not inoculate against being stubborn or difficult. I can show many examples. The best remedy is to nominate people who are not stubborn or difficult for the job. --Jarekt (talk) 16:50, 1 November 2016 (UTC)

The final challenge I see is the treatment of files that do not use the {{Information}} template but one of the other infoboxes or templates derived from them. --Jarekt (talk) 02:10, 26 October 2016 (UTC)

Another challenge and opportunity would be capturing the messy details of multiple licenses which apply to different jurisdictions and may be related to multiple co-authors. For example, a photograph to the right should require information about the sculptor and the photographer and list copyright tags for both. Many images require information about copyrights in the country of origin and in the US, and may also include information about copyrights in other countries. All that complexity is not well captured using current templates (see Commons:Multi-license copyright tags for info on copyright templates), but could be captured in a well-designed Wikibase structure. I think we could automatically migrate many of the current files to such a system, but probably not all. I do, however, have high hopes of capturing such complexity with the uploads we contribute years from now. Finally, I hope that being able to capture more details will not add confusion during upload and will not lead to mass purges of old non-complying files that met community standards at the time of the upload, but might not meet future standards. --Jarekt (talk) 13:07, 26 October 2016 (UTC)
@Jarekt: This is all really great feedback! Having realistic expectations about the speed of adopting structure is really important. We are trying to calibrate the proposal not to overpromise the conversion, but are cautiously optimistic. Of course, working to build a reasonable community process for prioritizing and supporting that transition will be really important. Astinson (WMF) (talk) 16:01, 1 November 2016 (UTC)
Apart from all the challenges regarding migrating existing files, what about uploading new ones into the new system? Many experienced users prefer alternative upload methods (see Commons:Upload tools) over the default UploadWizard. Please make sure you don't underestimate the disaster breaking those tools would cause! Forcing people to use the UploadWizard will result in angry mobs with torches and pitchforks. I think it's crucial to reach out to the developers of those upload tools as early as possible to give them ample time to adapt. --El Grafo (talk) 14:09, 31 October 2016 (UTC)
@El Grafo: We agree with you completely: one of the Year 2 priorities is going to be working with communities on uploading and other important tools, especially those which are already designed to work with structured data in some way (for example, if you haven't engaged with Commons:Pattypan yet, I would recommend giving it a try; it makes the infoboxes much easier). Wikidata was hugely successful because of the community of volunteer developers; the plan is to apply that learning to this project. Astinson (WMF) (talk) 16:01, 1 November 2016 (UTC)
@Astinson (WMF): Thanks, that's great to hear! --El Grafo (talk) 14:30, 8 November 2016 (UTC)

Volunteer tools were such a big success for Wikidata because we could use Wikipedia as a data source. The biggest data sources were Wikipedia categories and Wikipedia templates. I guess the most important data source for Commons will be the category system in Commons. Other possible data sources will be the names of the files and the descriptions of the images. A possible new tool could be image recognition software for bots like pywikibot to detect if the picture is about people, animals, houses, etc. There is Commons:Bots/Work requests, but it is not very active. --Molarus (talk) 06:59, 4 November 2016 (UTC)

not a chance of upload wizard only. the challenge is how do we guide new uploaders to appropriate tools, since they are open source, (changing support) and harder to find. by default upload wizard with small off-ramp, you get a lot of information template to change to artwork. and questions at village pump. need a dashboard for uploaders to select right one; need a tool life-cycle, with on-boarding at WMF for the good ones. Slowking4 § Richard Arthur Norton's revenge 19:40, 31 October 2016 (UTC) — Preceding unsigned comment added by Astinson (WMF) (talk • contribs) 16:19, 01 November 2016 (UTC)[reply]
Handling the social/community aspects will be critical. Technical changes - especially after such a long period of technical stagnation on Commons - will inevitably annoy some people, and we know that long intemperate discussions on wiki significantly put off volunteers who might given a more positive environment be very happy to help out. That's one reason why as Jarekt mentions above it can be difficult to crowdsource volunteers for new cleanup tasks. In parallel with discussions of technical changes we need to make sure our Commons policies, guidelines and legal rules are up to the task. Commons currently has relatively few formal policies/guidelines, and one consequence is that for many issues the community has not yet worked out what its 'official' consensus view should be. Where rules need to be added or changed early community discussion is going to be really important, with a strong emphasis on encouraging cross-wiki and cross-community collaboration. MichaelMaggs (talk) 13:35, 8 November 2016 (UTC)[reply]

+1, I'd go so far as to say that this might very well be the most difficult part of the whole project. --El Grafo (talk) 14:30, 8 November 2016 (UTC)

That's why we should start with small steps: automatic handling of institutions and creators seems to be a good idea. Once that works, we can shift to bigger issues. Regards, Yann (talk) 17:41, 8 November 2016 (UTC)

Poor design implies disaster. Glrx (talk) 01:55, 21 November 2016 (UTC)

Having read the dialogue between user:Steinsplitter and user:Lydia Pintscher (WMDE) above (see [1] and [2]), I cannot say that my level of trust in this project is very high. In my opinion, the "existent stuff" on file description pages is very important because it shows exactly how the file was introduced into Commons. It is vital that this information is not only stored forever, in its original form (usually a natural language), but also easily accessible. The complete history must be preserved for good and must remain accessible to any reader (and uploader/writer). Otherwise, a lot of confusion is certain to follow. Improving the search for Commons objects is a worthy task, but this must not be an excuse to "replace" the existing descriptions. I admit that I have a great fear of monopolization here. If databases are interlinked and "networked", the danger is that independent data sources are deleted and each database only shares information from other databases while the origin of the data is obscured. --Mautpreller (talk) 14:15, 11 January 2017 (UTC)
- @Mautpreller: the engineers are actually working on a set of backend features that should help in this context: the main one will be Multi-Content Revisions, which will allow multiple types of content change history to be merged into one log, so that folks can find a greater amount of information on historical changes for a page. This is only part of the problem solving though: we also have to make sure that the history pages for the Commons media files have sufficiently clear and usable interfaces for working with the contribution history. Astinson (WMF) (talk) 17:26, 2 February 2017 (UTC)

"Making (semi-)automatic editing easier" - I am afraid this means more automatic editing by bots and mass-editing, especially concerning deletions. From the point of view of someone mainly contributing to Wikipedia rather than Commons, altogether the current bots and scripts on Commons do more harm than good. The deletions of files that I stumbled upon because they were used in articles were almost all wrong. For example I know of
- two instances, where a sketch/diagram was deleted and one where it was almost deleted because someone tagged it as being "wrong", just because the automatic categorisation was wrong, suggesting that it showed something different. A complete description was only found in the Wikipedia articles where the sketches/diagrams were depicted and referred to in the text for many years. Without prior notice, without someone actually reading the articles or providing an alternative picture, they get silently erased by CommonsDelinker. If that was not enough, the CommonsDelinker in one case also stated the wrong reason "missing license" for the deletion. People started complaining on the discussion page half a year later because they didn't understand the text that said things like "see figure 1" but it took many years for someone to figure out the reason and repair it.
- another example is an editor trying to improve articles by uploading improved versions of images. Because his new creations are based on the old ones, he states that they are not entirely his own work. The old pictures in the articles are replaced with the new ones and the new ones automatically labeled as missing license or permission and then deleted. No one bothers to look at the discussion page of the uploader if he explains what he has done or looks at the articles where the images are used to find out their origin. Often the article can be reverted to use the old images in case there is a valid reason for deleting the new ones. A similar case happened to me. I just used different parameters in a Matlab code that someone put on Commons as public domain to create an animation. I thought it would be sufficient to link to the original file as a source, but unfortunately the bot couldn't read that.
- last but not least, the cases already mentioned above, where a file was transferred to Commons from Wikipedia and then deleted due to missing information that was lost during the process or some automatic cleanup afterwards.

Given the current handling of deletions and the way in which statements are added to Wikidata at present, just because some algorithm suggests something, these problems are likely to get worse. For example I corrected quite a few items with wrong coordinates added by some scripts, where one error often results in a whole bunch of wrong statements. In case this information is used to hunt for pictures to nominate for deletion (e.g. because "freedom of panorama" does not apply for what the picture is supposed to show and where it appears to be taken), the project will cause a lot of damage and anger among the Wikipedia editors.--Debenben (talk) 16:25, 22 January 2017 (UTC)[reply]

Does the current project accurately represent the role of the communities, especially the Wikimedia Commons and Wikidata communities, in engaging with such a software project?

That is a hard question. On Commons many of the most active users find their niches and work there; for example, many hardworking admins work on keeping up with the daily flood of copyrighted images that need to be deleted. Often current self-appointed tasks do not allow an easy switch to new tasks, so it is much harder to crowdsource a new set of tasks. For example, when we finally identified all the files without an infobox by adding them to Category:Media missing infobox template, there was no army of volunteers to fix them. Similarly, when the Wiki Loves Earth competition dumped a few thousand images with bad coordinates over the course of a month, it was hard to find volunteers to fix them. On the other hand, Wikidata grew a large community of volunteers to run the site doing tasks nobody thought of years ago. I do not expect them to drop their current work and come to Commons and work on our structured data, but I do hope we can grow our own community of volunteers to work on our new tasks. Those could be users who do not like the ways we currently do image description or categories on Commons. --Jarekt (talk) 13:22, 26 October 2016 (UTC)
No. I have attempted to identify who the "we" is by reading the overview, but only managed to work out that it was not WMDE, the WMF or Wikimedia Commons volunteers (though it could be that "we" is the WMF and WMDE, just talking about themselves like third parties). I can deduce two names of WMF employees based on the posts being made on-wiki, but it would probably be wrong to presume that's the whole team. In terms of engaging, that's an interesting point, as the example image groups being targeted in "How Commons Content Could Change" include a lot of uploads from my projects, but nobody has approached me with practical analysis on how my uploads might need to be adapted, for example where I have applied project-specific ingestion templates to hundreds of thousands of images, such as {{nypl}}. My assumption would be that engagement will remain passive, consisting of posted invites asking volunteers to comment now that plans and proposals have been published. Experience shows that volunteers who invest their unpaid time creating suggestions for changes and improvements will be resisted or politely sublimed by the late stages of proposals like this. In summary, I might be interested in helping the transition to structured data, if it was explained in a way I understood rather than waylaid with jargon that does not seem pinned on measurable definitions, and the plan (or timeline) showed there was anything that I could look at from a pragmatic Commons perspective before 2019. From what I've read, I think this is all about Wikidata until then. --Fæ (talk) 11:17, 27 October 2016 (UTC)
@Fae: The "we" in the document is the WMF and WMDE group scoping this work, which includes staff across both organizations' engineering units; Seddon and I are providing community engagement support, because the project could include external funding (in the scope of WMF Major Gifts) and is closely related to movement work in the GLAM-Wiki space (in the scope of the WMF Programs team).

Additionally: thanks for the feedback on needing a more pragmatic timeline. At this time, we are offering a high-level view, because whoever is leading this project (someone in the role that User:Lydia Pintscher (WMDE) has for Wikidata) will be working directly with the community on both prioritizing features to implement and the order/way in which community activity might change (making sure that available infrastructure changes alongside community consensus). The current focus is on investing in the infrastructure; how the infrastructure will be used by the community will depend a lot on those further conversations and is a major component of the timeline -- we don't want to prescriptively propose a means of implementation before the demo is ready for demonstration and the Commons community has had time to evaluate it and provide feedback. Astinson (WMF) (talk) 13:58, 31 October 2016 (UTC)
No. I have never seen a community consensus pursuant to COM:RFC. --Steinsplitter (talk) 11:24, 27 October 2016 (UTC)
No. I don't see clear goals or clamor for particular features. Putting the data in a database rather than in text is a solution, but what is the problem that is being solved? It cannot be just to move data around. The project does not state goals but rather "benefits". Those benefits are motherhood-and-apple-pie statements rather than concrete goals. "One way to think of structured data: It’s a kind of DNA that explains information in a much more integral way." What does that mean or solve? DNA is instructions for assembling proteins; it tells us nothing about what those proteins do. Bad metaphor. It's a sales pitch in the aether. Lots of author information on Commons is dead wrong; moving it to a database isn't going to fix that. Today I corrected the author for File:Rhamnus frangula - Köhler–s Medizinal-Pflanzen-120.jpg. All of the credits in the Köhler's Medizinal-Pflanzen are wrong: Koehler wrote the book, but others did its artwork. If we knew that Koehler is not an illustrator, then we could deduce that he is not the author of a drawing. We can use a database for checking. But I do not see that as a stated goal. What are the goals? How will those goals be achieved? Glrx (talk) 02:14, 21 November 2016 (UTC)

How would you like to support this project?

The GLAMpipe metadata transformation and upload tool would be an ideal tool to support working with this framework. I will commit to developing it.
I am interested in working to streamline licensing procedures, and the way media, licenses and metadata are represented in the MediaViewer, file page or shared content.
I am also interested in contributing to "templates", metadata subsets that are required for specific types of media. I have worked with the Map template, discussing broadly with media providers and reusers, and I can specifically contribute to that.
I would like to see participatory methods of developing these, meaning I as well as many others would like to be engaged in discussion and design. However, as pointed out elsewhere, this is a long-anticipated development, and it should not be set on a track that could get it jammed. Therefore, I opt for inclusive, forward-thinking and productive setups. The priority is very high, and should be recognized.
When in place, I would like to contribute to developing ways to enrich data: adding location, connecting to additional data, making annotations, recognizing features - in microtasking applications like the Wikidata Game and in regular MediaWiki tools.

--Susanna Ånäs (Susannaanas) (talk) 07:38, 31 October 2016 (UTC) (edited 14:34, 31 October 2016 (UTC))

Categories, tags, and navigation within Commons

Structured data is a huge opportunity for Commons that I've been waiting on for years. I would really like this opportunity to be used to rethink what purpose is served by Commons categories, and whether they are still needed with structured data (I think they aren't). In my mind, categories have these functions:

They give information. For instance, I know that everything in Category:Paintings by Vincent van Gogh is a painting made by Vincent van Gogh. This function will always be best served by structured data, wikidata:Property:P31 : wikidata:Q3305213 and wikidata:Property:P170 : wikidata:Q5582. Structured data is better because it is multilingual and more precise at the same time.
They connect (subparts of) Commons to other Wikimedia projects (and share this role with pages/galleries). This connection is both for readers (if you are interested in this article, maybe look at our collections of images about this topic) and editors (improving a Wikidata item about someone? We might have a picture of their tomb). Structured data would help this by having more meaningful results (compare Category:Paintings by Vincent van Gogh, which has only subcategories and a handful of low-quality files, to the appropriate SPARQL request). Instead of entry points based on manual curation, which can be explicit (pages such as Vincent van Gogh) or implicit (by adding "Category:Vincent van Gogh" in a file), we could have dynamic entry points, defined as SPARQL requests. (They could be updated by bots every day, or dynamically generated for each reader, or any solution that is cost effective.)
They create paths of navigation within Commons. This is their most overlooked job, and it feels like we, the Wikimedia movement, forget that people might just want to look at pictures of a place without reading an encyclopedic article about it or looking for travel advice. Categories, with their rigid inclusion semantics, don't help. Sure, if I want to see only portrait paintings by Vincent van Gogh, there is Category:Portrait paintings by Vincent van Gogh. But what if I want to see paintings of flowers? Or sunsets? Or any combination of criteria that is not yet here (and we already have a LOT of multicriteria categories such as Category:Portrait paintings by Vincent van Gogh, Saint-Rémy 1889 and yet we are very far from covering everything)? With structured data, we can allow the reader to choose their criteria (paintings only in a given geographic area, or about a given topic), but also open doors of serendipity (see paintings of flowers from other artists).

Structured data is going to turn Commons into the wonder it deserves to be: let's make sure we give it the full power to amaze us! Léna (talk) 11:00, 26 October 2016 (UTC)
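
As a rough illustration of the kind of "dynamic entry point" described above, the two statements mentioned (P31 : Q3305213 and P170 : Q5582) can be turned into a SPARQL request. This is a minimal sketch, not the query linked in the comment: the `build_query` helper is hypothetical, the optional image lookup via P18 is an added assumption, and the `wd:`/`wdt:` prefixes are assumed to be predefined, as they are on the Wikidata Query Service.

```python
def build_query(instance_of: str, creator: str, limit: int = 50) -> str:
    """Build a SPARQL query string for items that are an instance of
    `instance_of` (P31) and were created by `creator` (P170).

    Hypothetical helper for illustration; it only builds text and does not
    contact any endpoint.
    """
    return (
        "SELECT ?item ?image WHERE {\n"
        f"  ?item wdt:P31 wd:{instance_of} ;\n"
        f"        wdt:P170 wd:{creator} .\n"
        "  OPTIONAL { ?item wdt:P18 ?image . }\n"  # P18 = image, if the item has one
        "}\n"
        f"LIMIT {limit}"
    )


# Paintings (Q3305213) whose creator is Vincent van Gogh (Q5582).
print(build_query("Q3305213", "Q5582"))
```

Changing the criteria (for example a different creator, or an extra triple for a subject) only means changing the arguments or adding a line to the pattern, which is the flexibility the comment contrasts with fixed category names.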

I agree that revamping the category system might be one of the great benefits of structured data. See also my slide with examples of other issues with categories on Commons. --Jarekt (talk) 12:42, 26 October 2016 (UTC)

In theory it's a good idea, but in practice it would depend on how well it was implemented, and what kind of user interface could be created. Categories do at least work; they are fast to browse and fast to update. That SPARQL request locked up my browser for a few minutes, and if I wanted to modify the query I'd have to spend time understanding the query language and Wikidata properties. I suspect the results would also be limited to one image per artwork; there'd be no way to display all matching files in Commons. --ghouston (talk) 00:21, 27 October 2016 (UTC)

The Wikidata notability requirements would also need to be examined. At present, it seems that a Commons category alone isn't sufficient to allow creation of a Wikidata item, and if Commons categories go away, even that wouldn't be available. What happens when you want to group images by a concept that isn't described on any other Wikimedia site, and perhaps doesn't even have reliable external references? --ghouston (talk) 00:26, 27 October 2016 (UTC)[reply]

Do you have an example of such a concept? Léna (talk) 06:27, 27 October 2016 (UTC)

@Léna I have tons of such examples. In fact it is part of my workflow as a volunteer interested in 17th-century art. If I can't find the artist (or museum, or genre, or subject) then I create a category for it. Days, months or years later I might go and write an article about the person, thing, concept or whatever, and then I get around to updating the various Wikidata items involved. Sometimes I don't get farther than just Wikidata items and never bother with a Wikipedia article (such as grouping artworks in categories by collector - the collector may have an article and some of the artworks may have articles or items, but I never bother to create items or articles for the collection). Jane023 (talk) 08:53, 27 October 2016 (UTC)[reply]

I have the same kind of workflow but I usually create "correct" items and "bad" categories. For instance wikidata:Q27553312 has clear, structured, multilingual information while Category:Mission Gabriel Maget is really poor (I only created it to link it to the item). I find it more easy to express information through statements than by finding the right parents categories of the one I just created. Léna (talk) 09:06, 27 October 2016 (UTC)[reply]

Exactly - and the point is there is nothing wrong with such workflows. It is perfectly OK for someone to create detailed commons categories and not bother with Wikidata. The point is that on Wikidata we have loose definition of notability along the lines of "if it is linked to a notable item directly, then it's OK" and on Commons it's not so clear. Importing Commons categories of artists'artworks for existing items for artists is OK, but importing commons categories of artists when there is no associated item for the artist is probably not OK. Jane023 (talk) 11:27, 27 October 2016 (UTC)[reply]

I think that if you have a Commons category of a person that you can find in VIAF, LOC or another library catalog (that is, you can find enough info to fill the {{Authority control}} template) then it is notable enough for Wikidata. Article or no article. I think their bar for notability is much lower than on other projects. --Jarekt (talk) 12:41, 27 October 2016 (UTC)[reply]

Examples I can think of: people who don't have Wikipedia articles, perhaps sports people or academics, where it seems worth keeping a photo of them in Commons in case they are needed some day. Random devices such as obscure models of mobile phones where there's no Wikipedia article. I'm not sure if that item mentioned above, wikidata:Q27553312, meets Wikidata:Notability, due to the clause "an item with only a sitelink to a category page in Wikimedia Commons is not allowed on main article items". --ghouston (talk) 23:57, 27 October 2016 (UTC)[reply]

I've been reading these paragraphs and I understand that it would be a good method to substitute categories and improve cataloging of files. That implies keeping many of the categorizations by creating Qs.
Let's take Category:Carrer Pasqual Arbós 5, Xirivella. It's a building, not a monument, and it lacks any relevance beyond its very existence; we photographed it because not many buildings of that sort have survived to the present day in Xirivella. I cannot reference it in any other form than saying "go there and look". A Q for it will be needed or the information would be lost (or very difficult to find).
I can think of odder things: Category:Water supply manhole covers in Sueca. We have found that manhole covers are a source of information and we usually photograph them. Using several properties (it's a manhole cover, it's in Sueca, it's related to water supply) can help, but SPARQL (quoting Asaf Bartov) is very difficult to use, so a better, more user-friendly query interface is required. B25es (talk) 18:32, 27 October 2016 (UTC)[reply]

I assume that our categories would not go away, but would remain as a parallel way of keeping track of things. In the old days we organized files using galleries, which were competing with categories. Categories won, but we still have thousands of out-of-date galleries nobody maintains. I think we can do the same with the new system. As for SPARQL, I assume that tools will be written to see all the images that meet some criteria without using SPARQL queries. For example (following my image in the sections above), if you pick the tags paintings, male subject, from France, and portraits, you will get something similar to the content of Category:Portrait paintings of men of France. --Jarekt (talk) 18:59, 27 October 2016 (UTC)[reply]

In that case would it be up to the user to think of appropriate tags to restrict their search by? They'd also need some way to find out what relevant tags are available. A tag like "clock" could include a vast range of devices including single-function clocks and all kinds of multi-function devices that happen to include a clock, including practically every computing device. --ghouston (talk) 00:10, 28 October 2016 (UTC)[reply]

Exactly what I'm thinking: we would need a system of both suggestions and free navigation (not through SPARQL requests, but through something far more reader-friendly). For instance, once you are in Category:Paintings by Vincent van Gogh, you would have a way to restrict the search, with suggestions such as "in a given museum" (van Gogh museum, Orsay, other), "at a given period" (on a "ruler" from 1878 to 1890), "about some topics" (portraits, landscapes, still life, etc.) or "with given properties" (copies of Millet's works), or to extend the search (drawings by Vincent van Gogh, or paintings by other painters). Thus, the navigation would be defined "top down": the "code" of Category:Paintings by Vincent van Gogh would switch from the bottom-up

[[Category:Vincent van Gogh| Paintings]]
[[Category:Paintings by painter|Gogh, Vincent van]]
[[Category:19th-century paintings from the Netherlands|Gogh, Vincent Van]]
[[Category:Paintings from the Netherlands by painter|Gogh, Vincent Van]]
[[Category:Post-Impressionist paintings|Van Gogh]]

to something like

Down
1. museums : list - van Gogh museum, Orsay, others
2. period : ruler - 1878-1890
3. topics : list - portraits, landscapes, still life, others
4. filter : inspired by Millet
Up
1. Works by Vincent van Gogh
Linked
1. Paintings by other artists : list - Anthon van Rappard, Émile Bernard, Paul Gauguin

So the software would need a language that expresses these kinds of links, and this language would have to be easy to use from the visual editor. The role of the Commons editor would thus be to express which parts of Commons should be linked with one another. Léna (talk) 09:59, 28 October 2016 (UTC)[reply]
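To make the Down / Up / Linked outline above a bit more concrete, here is a minimal Python sketch of how such a navigation definition could be stored and used to narrow a set of files. Everything in it is an assumption made for illustration: the `Facet` and `Navigation` classes, the widget names and the `apply_facets` helper are hypothetical and do not correspond to any existing MediaWiki or Wikibase feature.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Facet:
    """One way to narrow or widen a set of files (hypothetical structure)."""
    name: str                      # e.g. "museums"
    widget: str                    # "list", "ruler" or "filter"
    values: List[str] = field(default_factory=list)


@dataclass
class Navigation:
    """Top-down navigation for one entry point, mirroring the outline above."""
    down: List[Facet]
    up: List[str]
    linked: List[Facet]


van_gogh_paintings = Navigation(
    down=[
        Facet("museums", "list", ["van Gogh museum", "Orsay", "others"]),
        Facet("period", "ruler", ["1878", "1890"]),
        Facet("topics", "list", ["portraits", "landscapes", "still life", "others"]),
        Facet("inspired by Millet", "filter"),
    ],
    up=["Works by Vincent van Gogh"],
    linked=[
        Facet("Paintings by other artists", "list",
              ["Anthon van Rappard", "Émile Bernard", "Paul Gauguin"]),
    ],
)


def apply_facets(files: List[Dict[str, str]], selection: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only the files whose metadata matches every selected facet value."""
    return [f for f in files if all(f.get(k) == v for k, v in selection.items())]
```

The point of the sketch is only that the editor describes which facets exist, while the list of matching files is computed from the metadata rather than maintained by hand.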

Yes something like that, although I don't understand how "subcategories" would work in the new scheme. Somehow these would need to be derived from the Wikidata relationships. Then if a file in Commons was tagged with "Samsung SGH-D600", for example, would it also be found in a search for mobile phones, or for battery powered devices, or would those tags need to be added to the file explicitly? Locations are also difficult, when selecting London you'd want to include everything with geographic coordinates within its borders, as well as anything tagged with a geographical subregion such as Westminster. --ghouston (talk) 23:14, 29 October 2016 (UTC)[reply]
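One possible answer to the "Samsung SGH-D600" question is that the search follows the broader-concept links itself, so that nobody has to add the broader tags to each file. The Python sketch below illustrates that idea on toy data. The `BROADER` table is a hypothetical stand-in for relationships that would come from Wikidata, and `expand` and `search` are made-up helper names, not part of any existing tool. Matching by geographic coordinates is not covered.

```python
# Hypothetical "broader concept" links, standing in for Wikidata relationships
# such as "subclass of" or "located in". Toy data for illustration only.
BROADER = {
    "Samsung SGH-D600": ["mobile phone"],
    "mobile phone": ["battery-powered device"],
    "Westminster": ["London"],
}


def expand(tag, broader=BROADER):
    """Return the tag together with every broader concept reachable from it."""
    seen = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the data
        seen.add(current)
        stack.extend(broader.get(current, []))
    return seen


def search(files, wanted):
    """Return the names of files tagged with `wanted` or with anything narrower."""
    return [
        name
        for name, tags in files.items()
        if any(wanted in expand(tag) for tag in tags)
    ]
```

With this approach a file tagged only with "Samsung SGH-D600" is found by a search for "mobile phone", at the cost of walking the hierarchy at query time (or caching the expanded tags).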

Maybe a Commons Category to SPARQL-query translator is needed. I mean, there are hundreds of thousands of categories and it is not possible to write hundreds of thousands of SPARQL queries by hand. We would need a way to store those queries and maybe a way to cache the results. Querying 34 million items is expensive. Maybe we would also need software that proposes Commons Wikidata statements. The software would know which SPARQL queries already exist and could therefore say that your media file is similar to those media files and should therefore have similar statements. For example, the software could create a list of museums that have paintings by Vincent van Gogh and the user could choose from that list, instead of searching for the right Q-number of the museum. Second point: that Vincent van Gogh query shows about 1000 pictures. No one wants to look through 1000 pictures. We would need an assistant that asks you questions to reduce the number of pictures. That is another reason why categories are created. Third point: without categories, where should Wikipedia articles link to? Maybe such software will decide whether this project fails or succeeds, and maybe the developers should start with that software, not with building Commons Wikidata. --Molarus (talk) 00:04, 28 October 2016 (UTC)[reply]

I don't know about Commons Category to SPARQL-query translation, but I'm working on a translation from categories to statements: Commons2Data. You can put limits on queries (for instance, display only the first 50 results). And your last point is one of my points: we need entry points from Wikimedia to Commons (and, btw, neither categories nor galleries are very good entry points). Léna (talk) 09:37, 28 October 2016 (UTC)[reply]

1) Maybe a Commons Category to SPARQL-query translator could use a en:Genetic algorithm. The right SPARQL query is found if the query returns more or less the same pictures as are in the category. 2) I don't think just showing the 1000 pictures step by step is the right answer; I would rather see your filter proposal as a better solution. But those filters have to be created by software from the statements of those 1000 van Gogh pictures. --Molarus (talk) 18:20, 28 October 2016 (UTC)[reply]
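The criterion "returns more or less the same pictures as are in the category" could serve as the fitness function of such a search. As a minimal sketch, the Python below scores candidate queries by the Jaccard index between their results and the category members. The choice of the Jaccard index and the names `jaccard` and `best_candidate` are assumptions for illustration; the comment above does not specify a measure, and the sketch does not implement the genetic algorithm itself (mutation, crossover, selection over generations).

```python
def jaccard(query_results, category_members):
    """Overlap between two sets of files: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(query_results), set(category_members)
    if not a and not b:
        return 1.0  # two empty sets are treated as a perfect match
    return len(a & b) / len(a | b)


def best_candidate(candidates, category_members):
    """Pick the best-scoring query.

    `candidates` maps a candidate query (any hashable label, e.g. the query
    text) to the collection of files that query returned.
    """
    return max(candidates, key=lambda q: jaccard(candidates[q], category_members))
```

A real translator would also have to decide how much overlap is "good enough", since existing categories contain both miscategorized and missing files.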

Example[edit]

An illustration of what I would like to achieve: Category:1884 paintings by Vincent van Gogh, Nuenen. It's currently done manually (well, I wrote a script). What I would like to achieve is something like this:

Description :
- instance of (P31) : painting (Q3305213)
- creator (P170) : Vincent van Gogh (Q5582)
- inception (P571) : 1884 (or an ISO representation of dates)
- location of creation (P1071) : Nuenen (Q153516)
Split :
- genre (P136)
Label :
- English: 1884 paintings by Vincent van Gogh, Nuenen
- Français : Peintures de Vincent van Gogh réalisées en 1884 à Nuenen

So, let us break this down. The description is the machine-readable translation of the name of the category. This description is precise, unambiguous, and multilingual by nature, as long as properties and items are translated. It is a bit, well, machine-like to read, so we add the possibility of overriding it with natural-language labels. So with that we "just" solve the problem of multilingualism in Commons. But the most interesting part for me is the "Split" part, i.e. what concerns navigation. Just by saying to split according to genre (P136), the subcategories can be generated automatically: portraits, landscapes, still life, and genre. Along with that, other ways to navigate should be possible: changing one of the properties, for instance. This allows for more fluid navigation than what is currently possible in Commons. Léna (talk) 19:08, 5 February 2017 (UTC)[reply]
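As a rough illustration of how the Description and Split parts above could be turned into a query, here is a minimal Python sketch that builds a SPARQL string and groups results by the split property. The property and item IDs are the ones listed above; the functions `to_sparql` and `split_rows` are hypothetical names, and the query is written against Wikidata items using the usual `wd:` / `wdt:` prefixes, since how statements on Commons files would be queried is not settled in this discussion.

```python
from collections import defaultdict

# Description of "1884 paintings by Vincent van Gogh, Nuenen", as listed above.
DESCRIPTION = {
    "P31": "Q3305213",   # instance of: painting
    "P170": "Q5582",     # creator: Vincent van Gogh
    "P1071": "Q153516",  # location of creation: Nuenen
}


def to_sparql(description, year=None, split=None):
    """Build a SPARQL query from property/item pairs (illustrative sketch only)."""
    lines = [f"  ?item wdt:{prop} wd:{item} ." for prop, item in description.items()]
    if year is not None:
        # inception (P571) is a date, so compare on its year
        lines.append("  ?item wdt:P571 ?inception .")
        lines.append(f"  FILTER(YEAR(?inception) = {year})")
    select = "?item"
    if split is not None:
        # OPTIONAL keeps items that have no value for the split property
        lines.append(f"  OPTIONAL {{ ?item wdt:{split} ?split . }}")
        select += " ?split"
    return "SELECT " + select + " WHERE {\n" + "\n".join(lines) + "\n}"


def split_rows(rows):
    """Group (item, split value) result rows into generated "subcategories"."""
    groups = defaultdict(list)
    for item, value in rows:
        groups[value].append(item)
    return dict(groups)
```

Calling `to_sparql(DESCRIPTION, year=1884, split="P136")` gives one query for the whole category, and `split_rows` turns its result rows into the per-genre groups that would replace hand-made subcategories.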

Experimental "category contains" template[edit]

@Léna, Jarekt, Ghouston, and Jane023: @Astinson (WMF), B25es, and Molarus:

I don't know whether the following is useful or not. I suggested the idea a couple of weeks ago at Commons Village Pump, and it disappeared without comment.

Anyway, extending a bit what Léna was already talking about above, namely trying to turn Commons categories into statements (ideally as near automatically as possible), it occurs to me that it would be useful to be able to store and share on-wiki progress that people make on that. It's difficult to store that on Wikidata, because most Commons categories (particularly intersection categories) don't have Wikidata items. But we can store the information in a template on the category. As a bonus, if we translate the description into a fragment of SPARQL, then we can offer the user a standard query to see what matches it on Wikidata.

So I made a prototype, Template:Category contains, generating a strap-line that can be put at the top of the category page; or just before the items. As a default demonstration, with no arguments, it gives this: