Commons talk:Structured data/Get involved/Feedback requests/Good coverage

from the talk page history

1/ What do you mean by "the user uploading it tagged it with only 'German Shepherd'"? During use of the UploadWizard, will the user have to choose a tag or a category? Or did you mean "the user uploading it with only 'German Shepherd' as category"? Christian Ferrer (talk) 14:28, 22 September 2018 (UTC)
2/ This may be a language misunderstanding on my part, but do you mean that an image "only tagged with 'German Shepherd'" will not appear in "dog" search results, because it is not tagged as "dog"? And that we have to add the "dog" tag manually? Christian Ferrer (talk) 14:37, 22 September 2018 (UTC)
I can appreciate why you're considering this, but (as presented) I think it's a bad idea.
A key principle on both Wikidata and Commons has been to try to make statements as narrow and precise as possible, and to rely on hierarchy rather than permitting redundancy (eg: COM:OVERCAT, here on Commons).
The problem, as many have discovered, is that searching a hierarchy is expensive, far more expensive than a flat tag search. People writing bespoke queries may be prepared to wait 60 seconds for a full hierarchical exploration (and the SPARQL service is able to support this relatively small population of searchers). But 60 seconds is not acceptable for the main search interface, nor would the query engine be likely to scale to support full hierarchical searching for the entire population of searchers.
Also there's the issue that the Wikidata ontology at the moment is simply not in good enough shape -- just not consistent and predictable or reliable enough -- to even specify what those hierarchical searches should be.
So going back towards something that can be implemented as a flat search starts to look like the only solution.
But IMO adding multiple redundant "depicts" tags for the same object in a wider image is to be avoided if at all possible. Keegan, you say that there has been a review of this "within the Commons community". I'm aware of a couple of times the question has been raised, eg here and here, admittedly without much take-up, but with a sense I think that this was not the direction the participants would prefer. It adds redundant clutter to the item. It makes it difficult to know whether there are two objects involved, or just a single one. It reduces the impetus to refine the description and try to describe the things really sharply (in my view the COM:OVERCAT principle strongly contributes to the activity of category refinement for images). It makes it less clear where qualifiers (like "shown with features" or "located within image") should be placed. And it goes directly against the principle used on Wikidata, on a system that's supposed to seamlessly combine with it.
As an alternative, I would suggest treating these additional tags added for search purposes as 'shadow tags', attached closely to specific (conventional) primary tags for items. So if something in the image is tagged "German shepherd", make "dog" an alternate shadow tag attached specifically to that "German shepherd" tag, rather than a free-floating tag in its own right.
That way we can keep things organised, preserve the impetus to try to refine the identification of things, and be clear about how many identified things there are -- that there is only one animal in question, not two. Jheald (talk) 20:32, 22 September 2018 (UTC)
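To make the "shadow tag" proposal above concrete, here is a minimal sketch in Python. It is an illustration only: the class names, field names and file titles are hypothetical and do not correspond to any actual Wikibase or CommonsData schema. Each depicts statement carries one primary tag plus search-only shadow tags, so a flat search can still match "dog" while the file records only one identified animal.

```python
from dataclasses import dataclass, field


@dataclass
class DepictsStatement:
    """One identified thing in the image (hypothetical structure)."""
    primary: str                                           # most specific tag
    shadow_tags: list[str] = field(default_factory=list)   # search-only broader terms


@dataclass
class MediaFile:
    title: str
    depicts: list[DepictsStatement] = field(default_factory=list)

    def search_terms(self) -> set[str]:
        """Flat set of terms that a simple, non-hierarchical search could index."""
        terms = set()
        for statement in self.depicts:
            terms.add(statement.primary.lower())
            terms.update(tag.lower() for tag in statement.shadow_tags)
        return terms

    def object_count(self) -> int:
        """Number of identified things; shadow tags never add to this count."""
        return len(self.depicts)


def flat_search(files, term):
    """Return titles of files whose flat term set contains the search term."""
    term = term.lower()
    return [f.title for f in files if term in f.search_terms()]


# Made-up example files, not real Commons uploads.
files = [
    MediaFile("File:Rex.jpg", [DepictsStatement("German shepherd", ["dog", "pet"])]),
    MediaFile("File:Tabby.jpg", [DepictsStatement("cat")]),
]

print(flat_search(files, "dog"))
print(files[0].object_count())
```

Because "dog" hangs off the "German shepherd" statement rather than standing as a second depicts value, a consumer of the data can tell that the image shows one animal, and qualifiers have an obvious home on the primary tag.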
A further issue is what will happen when a Commons image "depicts" something with its own Wikidata item. How is it proposed to handle this case? An item on Wikidata will not have redundant depicts values: it will not have an additional "depicts:dog" statement, if it is for a painting of a German shepherd. Jheald (talk)
The "shadow tags" would be a kind of cache and like any cache would easily become out of date if the underlying data is changed on Wikidata. But the alternatives don't seem very pleasant. Queries that take 30 seconds to complete? Tagging every photo of a human with "human", "homo sapiens", "person", "homo", "homininae", "hominidae", "primate", "ape", "animal", "omnivore", "two-legged animal", "organism", "thing",... I know I've missed a lot. --ghouston (talk) 11:12, 23 September 2018 (UTC)
@Ghouston: The team appear to have developed really cold feet about using Wikidata to populate the additional search tags -- see phab:T199119, and in particular the first substantive comment on that ticket, by Cparle on 10 July, for their quick initial read of some of the issues this would face. So I don't think there would be any intention to keep the additional tags sync'd with Wikidata. Instead I think the suggestion is to perhaps try to suggest a few extra tags at upload time, and then from then on let them follow their own separate destiny. (Otherwise your analogy with a cache would be spot on.)
Hence the 'shadow tags' existing in their own right. But I do think there might be mileage in e.g. storing them as a qualifier ("additional search tag" ?) to a depicts statement, rather than as independent depicts statements in their own right. Jheald (talk) 17:13, 23 September 2018 (UTC)
Jheald has accurately described some of the technical issues that prevent us from implementing the preferred approach. The idea of something like an "additional search term" qualifier has some promise, and is an approach we're still considering as a possibility, but we need to game out the consequences involved. There are other logistical issues like how we would display it consistently in the UI, and how we integrate that approach with other platforms/systems (like GLAM databases), and how this would work with search. If that approach turns out to not be feasible, the solution that covers all requirements without extreme workarounds is to simply have a number of depicts tags on the M item. Although some tags might be somewhat redundant to humans (but still useful for search purposes), we can probably mitigate the impact on the UI. We will have the "Make Primary" button/link that will allow users to essentially say "these things are the most important", and those tags would be shown first and be the preferred vehicles for qualifiers. Again, using the German Shepherd example, although the image may be tagged with "dog", "pet", etc., German Shepherd can be the primary tag and house the important qualifiers like "applies to part", "shown with features", etc. while the depicts tag "dog" doesn't need to be primary and can just hang out in the background minding its own business (we're also considering a "cutoff" where, after a certain number of depicts tags, the user will have to expand to see more). We also have other reasons for wanting to separate what we're calling "elemental" depicts tags, including making it easier to import data from sources that already have tags set up that way (like Flickr Commons, GLAM sites, etc). Depicts on Commons will perhaps be the most complex part of the project, and easy answers will be in short supply, but we think the end result will be a dramatic improvement in search and discoverability. 
RIsler (WMF) (talk) 22:35, 24 September 2018 (UTC)
@RIsler (WMF): Thanks for dropping by. It's good to know that something like an "additional search term" qualifier is still in consideration.
Regarding the use of "Make Primary", I am now a bit confused. I had understood, from the Depicts consultation, that 'Primary' was to be used on "depicts" to indicate the overall topic of the image -- eg something like nativity scene (Q31732) or sacra conversazione (Q370665), rather than being used to prefer Virgin Mary (Q345) over woman (Q467) for one of the elements within the scene. I do think that for the latter a better approach would be to try to tie the two together more concretely, eg by making the one a qualifier value for the other. It would be a much better structure for people writing queries to be able to work out what is going on. The idea of introducing additional ranks beyond the three used on Wikidata is also interesting (but is this possible, technically, without major surgery to the code of Wikibase?), eg to hive off secondary tags to a lower rank, so many applications could ignore them. But going down the road, I suspect that tying the secondary tag to the regular tag is probably information that will turn out to be useful. If an additional rank were going to be introduced for anything on CommonsData, I would put one for "inferred by machine; not confirmed" at the head of the queue -- I suspect it is a status we are likely to be seeing a lot -- to rank below a regular statement, but still be eligible to be included as a wdt: statement in the RDF, if there was no regular statement outranking it.
As regards data import, I suspect we're kidding ourselves if we think this is ever going to be easy. I'm working on an image upload project with a major GLAM at the moment, with simultaneous creation of Wikidata items for the underlying objects, and the reconciliation of names for people and places to Wikidata is brutal -- easily the most rate-limiting aspect of the whole process. This is probably as near as one can get at the moment, before CommonsData goes live, to what an upload involving Structured Data will entail. As an example, the current batch of images I've been working on contains 200 creators or contributors, with names that are supposedly normalised to the Library of Congress preferred form, if the LoC has heard of them. An initial match to the LoC and then Wikidata found 90 matches, 10 of which turned out to be wrong. By trying matching via VIAF, and then going through remaining candidates one by one, I've now raised the 'matched' count to 110 of the original 200, but it's taken a day and a half to do. And this batch is just 2% of the overall collection. Perhaps the universe of potential "depicts" tags is a more limited vocabulary, but the matching of a tag vocabulary to Wikidata, and then even more so the verification of that matching, is not a small job. I suspect that against all that, using machine methods to identify when one tag is probably just a less specific intimation of another tag, and should therefore be made subordinate to it, will likely add no more than a drop in the sea.
A further point is that Commons will still be expecting all uploads to be fully categorised, and for those categorisations to obey COM:OVERCAT, ie only categorise with the most specific indications. Structured Data should help a lot with that -- one of the reasons I'm so much trying to go the Wikidata route with my current project is to then be able to read off the appropriate Commons categories -- but to avoid OVERCAT the uploaders will thus need to work out in any case which tags are redundant to which other ones, so the effort of determining this to store them in qualifiers is not really an additional overhead. Jheald (talk) 18:52, 25 September 2018 (UTC)
For "make primary", we're exploring whether it can serve more than one purpose. Yes, its main use would be to identify the main subject of the media. But perhaps this feature (or something similar) could also say, either implicitly or explicitly, that the tag in question should be the one to host relevant qualifiers. Again, this is all still work in progress and we have a lot of different use cases to account for, so we certainly won't have anything solid on this until next month. RIsler (WMF) (talk) 18:03, 26 September 2018 (UTC)
I hope we misunderstood the comment made by Keegan (WMF); otherwise it is likely better to develop the FastCCI tool, and to create a "tag" namespace in Commons that will work in parallel with the category tree but that will not be subject to our over-categorisation rules. Example: if you categorize your file with Category:Dog then Tag:Canis, Tag:Canis lupus, etc. are automatically added to the file by a bot or by software, and when you click on Tag:Canis you see all the images that have "Canis" as a tag. This would allow us to stop spending a significant part of the $3,015,000 USD of that project. Sorry for that last sarcasm. Christian Ferrer (talk) 12:03, 23 September 2018 (UTC)
  • @Christian Ferrer: 1. Refers to statement tagging, not category tagging. Categories remain an independent process 2. Correct, the file would have to be tagged with "dog".
I'll work on getting some more specific answers to other concerns and questions. Keegan (WMF) (talk) 19:05, 24 September 2018 (UTC)
ok thanks for the answer. Christian Ferrer (talk) 21:11, 24 September 2018 (UTC)
  • It seems to me that it is a disaster that the system will not automatically be able to make a search based on a hierarchy of tags. Would it be possible to offer both types of search, i.e. a simple tag search which would be fast and a hierarchical search which would be understood to be slow (perhaps limited in the amount of hierarchy which could be searched)? Strobilomyces (talk) 11:52, 25 September 2018 (UTC)
@Strobilomyces: I can't speak for the team, but as I understand it the sheer number of different ways different properties are used in different circumstances, plus the density of very odd glitches in the WD ontology, plus the difficulty of prioritising results to meet general users' expectations of seeing the results that they would actually want to see, have put the team right off offering any deep hierarchical search. (See the assessment by Cparle on the ticket I linked above for just a taster of some of the problems lurking under the surface). Any attempt in this direction would be a major research project, simply not on the agenda for the team trying to ship version 0.1
BUT --- all of CommonsData and all of Wikidata should be accessible from WDQS, so it should be possible to write queries in SPARQL that are as deep and complicated and bespoke and intricate as one could wish. And probably, soon enough, one will find that users who have a particular knowledge and interest in particular areas, understand the twisty details of the Wikidata hierarchy in those particular subject areas, and are prepared to put in the time to extend some of the data that is incomplete and fix some of the statements that are wrong -- those users are quite likely to start producing ready-written query designs for particular subjects and disciplines, that somebody might well graft a user-friendly front-end onto. But nobody should underestimate the amount of data that is going to need to be improved on Wikidata, if those queries are going to produce good and solid results -- just look at all the data that is currently missing from the infoboxes on categories, just for starters, never mind all the data that is still needed to make sure the hierarchies behind those items are solid and robust. Jheald (talk) 17:20, 25 September 2018 (UTC)
Thanks for the answer. Strobilomyces (talk) 11:44, 26 September 2018 (UTC)
  • I have some doubts about this. From my experience with the Wikidata ontology I have to admit that it might not be well suited for Commons because it is deeper than what Commons needs, and perhaps not as user-oriented as one would expect. The thing is, there is nothing stopping Commons users from creating their own ontology or hierarchy of depicts items. So why not have a separate collection of depicts items on Commons itself and structure them as wished? Then they can be connected to Wikidata items where appropriate, and use whatever ontology the user wants. --Micru (talk) 07:51, 29 September 2018 (UTC)
@Micru: CommonsData is not currently projected to support generic items, only media-items for particular media files. Generic items are expected to live on Wikidata (per current plans, at least). Jheald (talk) 11:29, 29 September 2018 (UTC)
The question which was not studied is what should be done in the Wikidata ontology to allow a correct search using it. Currently nobody tries to improve the Wikidata ontology because there was no reason to have a strict set of rules. But we can improve the ontology by fixing a set of simple rules, like "an item should not be an instance and a subclass at the same time" or "no reference cycle". Snipre (talk) 07:19, 2 October 2018 (UTC)
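As an illustration of how the two rules suggested above could be checked mechanically, here is a small Python sketch over a toy graph. The item labels and statements are invented for the example and are not claims about what Wikidata currently contains; a real check would read "instance of" and "subclass of" statements from a dump or from the query service.

```python
def items_both_instance_and_subclass(instance_of, subclass_of):
    """Items that have at least one 'instance of' and one 'subclass of' value."""
    return sorted(
        item for item, classes in instance_of.items()
        if classes and subclass_of.get(item)
    )


def find_subclass_cycle(subclass_of):
    """Return one cycle in the subclass graph as a list of items, or None."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    colour = {}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for parent in subclass_of.get(node, ()):
            state = colour.get(parent, WHITE)
            if state == GREY:
                # parent is already on the current path: cycle found
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(subclass_of):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Toy data with deliberate rule violations (hypothetical statements).
instance_of = {"German shepherd": ["dog breed"]}
subclass_of = {"German shepherd": ["dog"], "dog": ["pet"], "pet": ["dog"]}

print(items_both_instance_and_subclass(instance_of, subclass_of))
print(find_subclass_cycle(subclass_of))
```

The recursive walk is fine for a toy graph; on the full ontology an iterative traversal would be needed to avoid hitting Python's recursion limit.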
@Snipre: The comment by Smalyshev on wikidata-l is also worth reading [1] : The main problem is that there is no standard way (or even defined small number of ways) to get the hierarchy that is relevant for "depicts" from current Wikidata data. It may even be that for a specific type or class the hierarchy is well defined, but the sheer number of different ways it is done in different areas is overwhelming and ill-suited for automatic processing... One way of solving it is to create a special hierarchy for "depicts" purposes that would serve this particular use case. Another way is to amend existing hierarchies and meta-hierarchies so that there would be an algorithmic way of navigating them in a common case. This is something that would be nice to hear about from people that are experienced in ontology creation and maintenance... I think this is very much something that the community can do. Jheald (talk) 08:11, 2 October 2018 (UTC)
Perhaps it's a failure of my imagination, but I would have thought that when searching for black hole (Q589) then a standard way of getting a hierarchy would be that the search should return every file that has a depicts statement referring to a subclass or instance of a subclass of black hole (Q589). So I should get File:Black hole - Messier 87 (cropped).jpg if it depicts Pōwehi (Q3841190) because that is an instance of supermassive black hole (Q40392) which is a subclass of black hole (Q589). Isn't it always that simple? --99of9 (talk) 06:47, 16 April 2019 (UTC)
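The rule described in the comment above ("a subclass, or an instance of a subclass") can be sketched in Python on a toy graph, next to a flat search for contrast. The three black-hole items and their relationships are the ones given in the comment; the second file and the `Q_EXAMPLE_DOG` identifier are hypothetical placeholders. The sketch follows only "instance of" and "subclass of", so it deliberately ignores the issues raised earlier in the thread (areas of the ontology that use other properties, and the cost of doing this at search-engine scale).

```python
def superclasses(item, subclass_of):
    """The item itself plus everything reachable through 'subclass of'."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(subclass_of.get(current, ()))
    return seen


def matches(depicted, target, instance_of, subclass_of):
    """True if depicted is the target, a subclass of it, or an instance of either."""
    if target in superclasses(depicted, subclass_of):
        return True
    return any(
        target in superclasses(cls, subclass_of)
        for cls in instance_of.get(depicted, ())
    )


def hierarchical_search(target, files, instance_of, subclass_of):
    """Titles of files with at least one depicts value that matches the target."""
    return sorted(
        title for title, depicted in files.items()
        if any(matches(d, target, instance_of, subclass_of) for d in depicted)
    )


def flat_search(target, files):
    """Titles of files tagged with exactly the target item."""
    return sorted(title for title, depicted in files.items() if target in depicted)


# Relationships as stated in the comment above.
instance_of = {"Q3841190": ["Q40392"]}   # Pōwehi: instance of supermassive black hole
subclass_of = {"Q40392": ["Q589"]}       # supermassive black hole: subclass of black hole

files = {
    "File:Black hole - Messier 87 (cropped).jpg": ["Q3841190"],
    "File:Example dog.jpg": ["Q_EXAMPLE_DOG"],   # hypothetical file and item ID
}

print(flat_search("Q589", files))
print(hierarchical_search("Q589", files, instance_of, subclass_of))
```

On this toy data the flat search for Q589 finds nothing, while the hierarchical search finds the Messier 87 file, which is the gap the redundant or "shadow" tags are meant to cover.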

@Keegan (WMF): If I understand correctly: The current Wikidata ontology is unsuitable for searching (e.g. related discussion), which is a huge problem. I do not think it is a good idea to cover up this mess with hundreds of different tags. Instead, the image classification and searching algorithm should be a motivation and a help for people to fix the ontology. --Debenben (talk) 15:59, 3 October 2018 (UTC)

  • @Keegan (WMF): I fully agree with the above. If "German Shepherd" is currently not linked (in the results of a potential search) with the taxon chain of Canis lupus familiaris, it is because the ontology is not well done. Structured data for Commons may be a good idea only to the extent that the "data" is indeed well structured. In Wikidata German Shepherd should be a "breed" (with "breed" as a property) of Canis lupus familiaris, however it is not. It is currently a breed of dog, which literally is true but ontologically totally wrong: "dog" is not a species but a taxon common name. I wonder how many items are affected by this kind of confusion. Likewise, woman (Q467) is a "female adult human" only in the description, but not in the statements, where you can indeed find "female" and "adult" but not "human"; therefore women will never be highlighted if you search "female mammals". But that's not why I pinged you: has it been envisaged to have the possibility to add qualifiers to the depicts "tags", as it is shown for the Search prototype? That would be good. Sorry if it is already written somewhere and I missed it. Christian Ferrer (talk) 05:24, 7 October 2018 (UTC)

I just came to this, and am in agreement that we would be better following COM:OVERCAT than adding every single tag that has our item as an instance of a subcategory. Otherwise these items will get messy very quickly. I understand that this will give a limited search functionality for now, but SPARQL queries will still allow the full scope of this data. Is this still the right place for this discussion, or should it now move to Commons_talk:Depicts? --99of9 (talk) 06:25, 16 April 2019 (UTC)