Commons:Village pump/Proposals/Archive/2022/07

From Wikimedia Commons, the free media repository

Duplicate detection: Commons Upload Wizard and Commons app

For the past few weeks I have been enjoying the Commons:Mobile app. I plan to publish parts of my media archive later (in a few years) with the Commons Upload Wizard, and I am afraid that I will then accidentally publish images more than once. The Commons Upload Wizard reliably detects its "own" duplicates. However, it cannot detect when the same image has already been uploaded with the Commons mobile app. (Possibly this is due to differing metadata?) I have already pointed this out here as well. Molgreen (talk) 15:24, 17 July 2022 (UTC)

Yes, the photos differ by a few bytes and therefore cannot be detected as duplicates. So the problem that uploads from the app are not recognized when uploaded again in the Wizard is unfortunately not so easy to solve. That a photo from the app can be uploaded twice, even though it is the bit-identical file, is probably indeed a bug. --GPSLeo (talk) 15:46, 17 July 2022 (UTC)
Can anyone tell me whether I can work around this by selecting all EXIF tags in the settings of the Commons:Mobile app? --Molgreen (talk) 16:14, 17 July 2022 (UTC)
I gave it a try with all EXIF tags selected in the settings of the "Commons app": even then, the "Commons Upload Wizard" does not detect that it is a duplicate:
  • Image with all EXIF tags uploaded via the "Commons app"
  • Duplicate with all EXIF tags uploaded via the "Commons Upload Wizard" (the Upload Wizard does not detect the "Commons app" duplicate)
  • Why not just use the mobile version of the MediaWiki Upload Wizard? The mobile app is notoriously bad at the basic functions it provides and is inferior to any browser-based web interface. Just use the mobile upload wizard and the issues unique to the mobile app would disappear (for you, not for those unfortunate enough to use the mobile app). --Donald Trung 『徵國單』 (No Fake News 💬) (WikiProject Numismatics 💴) (Articles 📚) 16:18, 17 July 2022 (UTC)
    Thanks for the hint. I will try it soon. --Molgreen (talk) 16:49, 17 July 2022 (UTC)
    The "Nearby" feature mentioned here
    Hello Donald Trung 『徵國單』, thank you very much for the tip about UploadWizard, the mobile version of the MediaWiki Upload Wizard. I have tried it. The upload works very well, as expected. At the same time, the Commons app also has good features such as "Nearby" / "Photos needed". — Preceding unsigned comment added by Molgreen (talk • contribs)
    I am using this mobile app and I am perfectly fine with it. Mateusz Konieczny (talk) 09:46, 28 July 2022 (UTC)
    Thank you, yes, that's exactly what I'm practicing successfully by now.
    It remains very unfortunate for me that the two upload wizards do not recognize each other's duplicates. --Molgreen (talk) 15:14, 21 July 2022 (UTC)
    @Molgreen Do you mean UploadWizard (the web version) successfully detected those duplicates with different metadata? If so, I wonder why. As far as I know, both UploadWizard and the mobile app rely on hash values to detect duplicates. The two Schwanenteich images have different hash values, so both tools should fail to detect them as duplicates. (It's "sha1" in the result.) whym (talk) 22:28, 21 July 2022 (UTC)
    @whym thank you for your feedback. In my experience:
    • The UploadWizard reliably recognizes its own duplicates.
    • The Commons:Mobile app does not recognize any duplicates (see Commons:Mobile_app/Feedback#Feedback_from_Molgreen_for_version_4.0.1~66f8f97d0_3 here).
    • Commons:Mobile app and UploadWizard do not recognize each other's duplicates. This may be technically difficult. But I would really like it, because unfortunately I've accidentally uploaded the same file twice via Commons:Mobile app and UploadWizard. --Molgreen (talk) 05:51, 22 July 2022 (UTC)
      There seem to be two issues. 1) I wonder if something other than the two upload tools might have changed the file (or the metadata of the file). For example, you might have used Google Photo to transfer the file, and Google Photo might have modified/normalized something in EXIF. If so, another duplicate detection method (one that ignores EXIF) might be the solution. That would be a feature request to the MediaWiki developers. 2) File:Schwanenteich_im_Annatal.jpg and File:Schwanenteich_im_Annatal_4.jpg are completely identical, so if the mobile app showed no warning about duplicates, that is a software bug. I think this needs to be fixed by the mobile app developers. whym (talk) 03:28, 23 July 2022 (UTC)
     Support --Molgreen (talk) 09:52, 23 July 2022 (UTC)
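The two cases whym describes can be illustrated with a short Python sketch. It assumes that a duplicate check compares SHA-1 values of the complete file, as described above, so two JPEGs that differ only in their EXIF block get different hashes. The second function is a hypothetical illustration of the suggested "method that ignores EXIF": it drops the APP1 segments (where EXIF is normally stored) before hashing. All function names are invented for this example, and this is not how UploadWizard or the mobile app is implemented.

```python
import hashlib
import struct


def sha1_hex(data: bytes) -> str:
    """SHA-1 of the raw file bytes; any changed byte changes the result."""
    return hashlib.sha1(data).hexdigest()


def strip_exif_segments(data: bytes) -> bytes:
    """Return JPEG bytes with APP1 (EXIF/XMP) segments removed.

    Simplified sketch: walks the marker segments up to the start-of-scan
    marker and copies everything except APP1. Non-JPEG input is returned
    unchanged.
    """
    if data[:2] != b"\xff\xd8":          # no JPEG start-of-image marker
        return data
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:            # unexpected byte: stop parsing
            break
        marker = data[pos + 1]
        if marker == 0xDA:               # start of scan: image data follows
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length           # length field counts itself
        if marker != 0xE1:               # keep everything except APP1
            out += data[pos:end]
        pos = end
    out += data[pos:]                    # copy the remaining bytes verbatim
    return bytes(out)


def exif_insensitive_sha1(data: bytes) -> str:
    """Hypothetical duplicate key that ignores EXIF differences."""
    return sha1_hex(strip_exif_segments(data))
```

A real implementation would need a proper JPEG parser and would have to cover other file formats; the sketch only shows why a byte-level hash misses files whose metadata was rewritten on the way.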

    Another test

    To be on the safe side, I did another test. It seems to be very complicated:

    • in the following order:
      • I downloaded Schwanenteich im Annatal.jpg to the download area of my smartphone
      • now something is different: the duplicate is detected, but I can still upload
      • Upload wizard prevents uploading

    Another photo

    • 20220708-Test.jpg
    • App detects duplicate, but I can upload anyway
    • Upload wizard does not recognize the app's duplicate (same source each time: Google Photo, Camera)

    Mass move of files containing space before the file extension

    I came across File:The Heart of a Hero (1916) .webm this morning, and I do remember seeing other webm files in the past that also had spaces before ".webm". I was wondering if it might be a good idea to use a bot to do a mass move of files with this specific issue, since it is an obvious, straightforward, and unambiguous error IMO. PseudoSkull (talk) 12:50, 21 July 2022 (UTC)

     Support with redirects.   — Jeff G. please ping or talk to me 23:18, 21 July 2022 (UTC)
    Redirects break links from outside Wikimedia, which is why file moves are discouraged unless really necessary (or maybe recently uploaded). This check definitely should be part of any uploading interface to try and prevent the situation in the first place, but there can be some consequences to fixing existing ones. Carl Lindberg (talk) 13:10, 28 July 2022 (UTC)
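For illustration, finding the affected titles and the proposed target names could look roughly like the following Python sketch. The function name and the regular expression are assumptions made for this example; an actual bot would still have to perform the moves through its own tooling and settle the redirect question raised above.

```python
import re
from typing import Optional

# One or more whitespace characters directly before the final ".ext".
_SPACE_BEFORE_EXT = re.compile(r"\s+(\.[A-Za-z0-9]+)$")


def fix_space_before_extension(title: str) -> Optional[str]:
    """Return the corrected title, or None if the title needs no move."""
    fixed = _SPACE_BEFORE_EXT.sub(r"\1", title)
    return fixed if fixed != title else None


if __name__ == "__main__":
    for name in ["File:The Heart of a Hero (1916) .webm", "File:Example.webm"]:
        print(repr(name), "->", repr(fix_space_before_extension(name)))
```

The same check could run at upload time, which is the prevention Carl Lindberg suggests, so that such names never get created in the first place.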

    Policy status for Commons:Harassment

    The page has been marked as proposed for three years, with no complaints. Let's make it official, like en:WP:Harassment.   — Jeff G. please ping or talk to me 10:54, 6 July 2022 (UTC)

    What are examples of harassment on the Commons that have not been addressed, or not adequately addressed, by existing policy? In the absence thereof, how is this not w:WP:CREEP? Why is a policy that was copied and pasted from en.wiki being presented here for approval with only superficial localisation? "WP" shortcut prefixes, for example, were simply swapped to "COM", resulting in redlinks like "COM:HUSH", or missed altogether, like the still-remaining "WP:HNE". The page is replete with links, see alsos, and references to en.wiki. This is the Commons. Эlcobbola talk 14:19, 6 July 2022 (UTC)
    What is the benefit of making it official? What is the benefit of its existence? Mateusz Konieczny (talk) 09:45, 28 July 2022 (UTC)
    @Mateusz Konieczny: The benefit is that, as a policy, it could be more easily and safely used as a reporting and blocking rationale.   — Jeff G. please ping or talk to me 13:14, 28 July 2022 (UTC)
    Are you sure that it is needed, that people are currently not getting blocked for harassment, and that this would change by making this page official? Mateusz Konieczny (talk) 13:24, 28 July 2022 (UTC)
    @Mateusz Konieczny: Having that as policy would have helped in my ANU complaint about Debjyoti Gorai, among others.   — Jeff G. please ping or talk to me 13:32, 28 July 2022 (UTC)
    Can you link the relevant discussion? (BTW, if you want to make something a policy because it would help in your complaint about another user, it would be nice to mention that from the start.) Mateusz Konieczny (talk) 13:38, 28 July 2022 (UTC)
    @Mateusz Konieczny: I was writing of COM:ANU#Debjyoti Gorai, but there is also COM:HD#Need to talk to an Indian Commons admin.   — Jeff G. please ping or talk to me 13:54, 28 July 2022 (UTC)
    The user got no support in the linked discussions, and no one questioned your suggestions. I don't see how the suggested policy would help. –LPfi (talk) 08:50, 29 July 2022 (UTC)
    @LPfi: And yet, the user remains unblocked after over 39 hours. Note: I made this proposal 23 days ago.   — Jeff G. please ping or talk to me 10:06, 29 July 2022 (UTC)
    The one-month block seems kind of weak, all things considered. I'm not saying having this as policy at the time would have helped make the block duration longer, but I do think there's a direct correlation between there being behavioral guidelines and how seriously administrators take inappropriate behavior, as well as how quickly they will act on it. It's also a lot easier for someone to make excuses and argue in favor of their position if the breach in conduct is only wrong because of implicit norms that aren't stated anywhere. Maybe Debjyoti Gorai would have been less inclined to beat the dead horse if we could have pointed to specific sentences in Commons:Harassment that they were violating instead of just doing a bunch of vague finger wagging about them being a drama queen or whatever. --Adamant1 (talk) 23:42, 1 August 2022 (UTC)
    @Adamant1: Here's a vague finger-wave you can take to the bank (not directed at you): Alienating Admins is bad for your account's health.   — Jeff G. please ping or talk to me 00:00, 2 August 2022 (UTC)
    True. Debjyoti Gorai definitely didn't do themselves any favors in that regard. --Adamant1 (talk) 00:21, 2 August 2022 (UTC)

    Block mass upload tools that don't prevent upload of duplicates

    Automated uploads of duplicates have gotten entirely out of hand. Any automated tool that uploads to Commons should not be permitted to do so until it prevents upload of duplicates.   — Jeff G. please ping or talk to me 11:43, 30 July 2022 (UTC)

    I meant exact duplicates of existing files (duplicate), of deleted files (duplicate-archive), and of old versions of deleted files (duplicate-version), as per bjh21's post below.   — Jeff G. please ping or talk to me 10:23, 31 July 2022 (UTC)
    I definitely support this. Additionally to the prevention of duplicates with the same hash value the tools should ask people to do at least a short search whether the files are already on Commons. --GPSLeo (talk) 12:09, 30 July 2022 (UTC)
    I'd say that this is probably not as good an idea as it seems: a large number of images of museums, archeological sites, works of ancient architecture, etc. come from Flickr, and Flickr2Commons is the best tool to import them with. Generally speaking it is able to find duplicates, but it doesn't detect all of them. The issue here isn't that people are allowed to use Flickr2Commons; rather, it's the fact that it's not properly maintained and updated. Wouldn't it simply be easier to note that the tool in its current form has a bug and then have someone fix that bug than to prevent tens of thousands of educational uploads simply because a few of them might be duplicates, which are usually tagged as duplicates and then deleted in a semi-automated way and redirected by a few admins specialised in this? This is just putting more burdens on the uploader who simply found a good high quality educational image with a free license and then attempts to upload this image. This is not behaviour that we should be discouraging by adding extra steps. Simply ping user "@Magnus Manske: " or whoever is maintaining it now and let them fix these issues.
    Also, the wording of the proposal can lead to all upload tools being blocked if they have this issue, which simply shifts the tool from being fixed to not being used at all. Flickr is probably the most important photography website on the internet (also note that places like Twitter, Instagram, and Facebook all remove metadata and shrink the files, so even if we were able to import from them they would still be inferior); we should be finding ways to make importing free images from Flickr easier, not more difficult. We used to have users like "" that did massive imports from Flickr using Flickr2Commons, and a large number of their images are used in Wikipedia articles. I'd say run the Flickr2Commons tool on any previously imported album and it detects basically all duplicates. It's quite rare for Flickr2Commons to actually not detect a duplicate if you're using the tool correctly, so it could be that Flickr duplicates aren't being found for other reasons that simply don't occur with images that were previously imported using this tool. This requires a bug fix, not a ban. --Donald Trung 『徵國單』 (No Fake News 💬) (WikiProject Numismatics 💴) (Articles 📚) 12:25, 30 July 2022 (UTC)
    See Category:Duplicate as an example. Problems with duplicates are not exceptions. Sometimes duplicates that were uploaded to Flickr were not detected within the same batch. Either tools must be properly maintained and not create unnecessary problems, or they should not be allowed to be used. --EugeneZelenko (talk) 13:53, 30 July 2022 (UTC)
    I think this proposal needs a clearer definition of "automated" and "duplicate". The upload API can issue warnings for exact duplicates of existing files (duplicate), of deleted files (duplicate-archive), and of old versions of deleted files (duplicate-version). Meanwhile, CSD F8 allows for speedy deletions of exact and scaled-down duplicates of existing files, and CSD G4 for speedy deletions of duplicates of certain deleted files. It might be reasonable to require tools not to silently overrule certain upload warnings. Requiring detection of scaled-down duplicates would be unfeasible since Commons doesn't have any useful facility to search for such things. Requiring automated tools to ask a human to search for duplicates, as suggested by GPSLeo, would be difficult for bots. Should Geograph Update Bot wake me up every Sunday morning with a list of files that I need to search for before it can upload them? That might be feasible, since the bot only uploaded 32 files this week. For GeographBot, which uploads about six files per minute, it would be completely impractical. --bjh21 (talk) 10:09, 31 July 2022 (UTC)
    Yes, such a search cannot be done by bots, but this problem should be discussed in the task approval. In cases where many duplicates are expected, some extra mitigations are needed. Every bot should definitely be required to respect the duplicate warnings. --GPSLeo (talk) 18:51, 31 July 2022 (UTC)
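The requirement "not to silently overrule certain upload warnings" could be sketched in Python as follows. The three warning names are the ones bjh21 lists above; the assumption that a tool receives the warnings as a dictionary keyed by warning name, and the function names themselves, are hypothetical and only serve the illustration.

```python
from typing import Any, Dict, List

# Warning names as listed in the discussion above.
BLOCKING_WARNINGS = ("duplicate", "duplicate-archive", "duplicate-version")


def blocking_duplicate_warnings(warnings: Dict[str, Any]) -> List[str]:
    """Return the duplicate-related warning names present in `warnings`.

    `warnings` is assumed to map warning names to whatever details the
    upload step reported (for example, the names of the matching files).
    """
    return [name for name in BLOCKING_WARNINGS if name in warnings]


def may_proceed_unattended(warnings: Dict[str, Any]) -> bool:
    """A bot following the proposed rule uploads only if nothing blocks."""
    return not blocking_duplicate_warnings(warnings)
```

With such a check a bot would skip or queue the affected file for human review rather than forcing the upload; warnings other than the three listed are deliberately left to the tool's own policy.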
    • Comment Personally, I'm kind of on the fence about this. On the one hand I'm not a super fan of duplicate files, but then on the other I've gotten into a few disagreements about whether Commons actually allows duplicate images/files or not. And from those discussions it seems like no one really cares. So while I'd support this in practice, I think it would have to be implemented in conjunction with a wider "no duplicate images/files" policy in general. Otherwise, it just seems weird to single out bots for doing something that isn't even against the rules. --Adamant1 (talk) 22:45, 1 August 2022 (UTC)
    • Obviously, bots increase the magnitude of the problem. Also, mass uploaders do not always care enough about categorization and descriptions, so such duplicates waste time beyond the administrators' actions. --EugeneZelenko (talk) 14:09, 2 August 2022 (UTC)
    • Well, regarding "Also mass uploaders did not always care enough about categorization and descriptions, so such duplicates lead to time waste other than administrators actions." I'd say that the content of the media is more important than their categorisation (which the MediaWiki Upload Wizard itself notes as optional), there are users here who spend their entire time here categorising images and better categorising, are they "wasting their time"? I'd say that it's better to have an uncategorised educational image here than to have nothing at all. I often found high quality images in top categories like "Coins" or "Coins of Randomcountry" that should have been in more specific categories like "Coins of King Monarchpants XI of the Longgone Empire", but having the image here and then categorising it is better than not having the image and not being able to illustrate the subject with a free image at all. Don't get me wrong, discoverability is important, but categorisation is secondary to the content itself, especially since some users are pushing for a full abolition of MediaWiki categories once Wikidata-based Structured Data on Wikimedia Commons (SDC) items are becoming the norm. --Donald Trung 『徵國單』 (No Fake News 💬) (WikiProject Numismatics 💴) (Articles 📚) 09:59, 11 August 2022 (UTC)
    •  Comment, well, I thought that I had already written this but apparently I forgot and only (partially) addressed this in the main Village Pump. The issue with this proposal is that in its current wording it can technically ban the MediaWiki Upload Wizard. While browsing the files at "Category:Duplicates" I found that most of the files in a random sample came from the MediaWiki Upload Wizard and not Flickr2Commons. Running my own experiment (example images at the Village Pump post that inspired this proposal) I found that Flickr2Commons actually does quite a good job of preventing duplicates from being uploaded. So what types of duplicates don't get filtered out?
    Well, that's the issue: these are files that are always impossible to filter out because they are technically not duplicates, files that have different EXIF data because some websites edit these. Say "User:1" imports files from Freefileswebhost.website and all these images are good educational content and get used on various Wikipedia articles, but as it turns out these images have edited EXIF data and were originally taken from Flickr. If "User:2" now imports all these same images with the "correct" (original) EXIF data (as Flickr doesn't edit EXIF data, while other websites like Meta's Facebook, Meta's Instagram, Twitter, etc. all do), is Flickr2Commons then at fault for not recognising that these files were already imported here from Freefileswebhost.website? Obviously not, as no tool could have recognised this.
    Had "User:2" used the MediaWiki Upload Wizard's Flickr import tool, the exact same issue would have occurred: the MediaWiki Upload Wizard wouldn't have recognised them as duplicates and they would still have been uploaded.
    Sometimes uploaders only categorise media in specific categories. I came across an old 19th century French photographer that had made lots of images of Egypt, someone not familiar with how Egyptian categories worked could have only added them to stuff like "Pyramids in Egypt" or "Egypt in the 19th century" while the person who uploads the slightly different "duplicates" might have looked at the category for the specific pyramid by name. Are we asking users to literally go through hundreds of categories every time before they upload an image?
    More often than not human error, or rather human ignorance is at play here. A vague blanket ban will only prevent educational images from being uploaded and uploading these images one-by-one is tedious and takes even way more time than simply using mass-categorisation and mass-deletion tools that admins have access to. The only way this would work is if we'd say that the free time of an admin is more valuable than the free time of a content contributor and that doesn't seem like a wise judgement to make if the number one (#1) mission of the Wikimedia Commons is providing free educational content. --Donald Trung 『徵國單』 (No Fake News 💬) (WikiProject Numismatics 💴) (Articles 📚) 09:59, 11 August 2022 (UTC)
    Category:Duplicates contained ~ 9.000 technical duplicates about a week ago and still contains more than 3.00. So this is a completely preventable problem. Preventing it would also save valuable time of content contributors. --EugeneZelenko (talk) 14:16, 11 August 2022 (UTC)
    The most obvious things that can be done in the short term are:
    • Add rate limits for Flickr2Commons and other mass upload tools, such as 100 files per 2-hour period, with exceptions when necessary for GLAM uploads etc. Mass uploads of thousands of files are simply far too large for the uploader to usefully check for duplicates or do other curation. This can be done very easily with an edit filter.
    • Establish basic rules for mass uploads: all files must have useful filenames, descriptions, and categorization. Users are much more likely to notice duplicates and other issues if they are finding correct categories than if they are adding hundreds of files to a single broad category. This can be done with community consensus; no technical action would be needed.
    These two actions, especially done together, would cut down on many of the issues of mass uploads: duplicates, out of scope files, poor-quality images, useless filenames and descriptions, files dumped en masse into broad categories, etc. While they would reduce the total number of files being uploaded, they would increase the number that are good quality and properly curated. This is comparable to policies implemented on other projects - for example, the Articles for Creation and New Page Patrol on enwiki, where new articles are required to meet basic quality standards like having sources. Pi.1415926535 (talk) 22:25, 13 August 2022 (UTC)
    Those sound like good ideas to me. Mateusz Konieczny (talk) 18:25, 15 August 2022 (UTC)
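To make the proposed limit concrete, here is a minimal Python sketch of a sliding-window check for "100 files per 2-hour period". It is not an edit filter and not part of any existing tool; the class name, the parameters, and the choice of a sliding rather than a fixed window are assumptions for illustration. Exceptions such as GLAM uploads would need an extra allow-list that is not shown.

```python
from collections import deque


class UploadRateLimiter:
    """Allow at most `max_uploads` uploads within any `window_seconds` span."""

    def __init__(self, max_uploads: int = 100, window_seconds: float = 2 * 60 * 60):
        self.max_uploads = max_uploads
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of accepted uploads, oldest first

    def allow(self, now: float) -> bool:
        """Record and permit an upload at time `now` (seconds), or refuse it."""
        # Forget uploads that have fallen out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_uploads:
            return False
        self._times.append(now)
        return True
```

One limiter instance per uploader would be needed; an upload attempted after the limit is reached is refused until the oldest recorded upload is at least two hours old.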

    Subcategories of Category:Cultural history

    I am looking over the top-level categories by century/decade and there seems to be a large inconsistency on what falls under "Culture". I see cultural organizations, cultural events, entertainment, religion, education, art, cosplay, fashion. Is there a prior discussion on how to organize things at the top level downward? Ricky81682 (talk) 20:23, 31 July 2022 (UTC)

    Just a general comment, but I've seen categories become completely worthless once you get up to a certain level of semantic abstraction like this one, because people just use the categories as random file dumps. Category:History is kind of the same way. There are so many different kinds of images in that category that it's essentially meaningless. Category:Music is another good example: 99% of the images in that category are COPYVIO SPAM images of non-notable musicians. Personally, I'd love to see such categories be gotten rid of. Same goes for a category like this one. Everything is "cultural" and everything is also "history." So what's the point in the category? There isn't really one. Conversely, is an image of a random person standing next to a tree "music"? Obviously not. --Adamant1 (talk) 22:31, 1 August 2022 (UTC)
    They serve two roles: 1) If they aren't too overpopulated with irrelevant content, somebody can find files they are able to put into useful categories (or delete as out-of-scope copyvios). The person leaning towards a tree in Category:Music is probably a musician and somebody might recognise them. 2) They can be used to find relevant subcategories. Everything is culture, including a boat, but a category hierarchy starting from there might be about heritage ships, concerts aboard a ship or somesuch. Sometimes it is difficult to guess at the relevant category, and then one strategy is to start high (low?) enough in the tree. –LPfi (talk) 09:31, 9 August 2022 (UTC)
    • @Adamant1 and LPfi: I agree with Adamant1. The whole structure from main topic down starts with weird vagueness and file dumps. I am just trying to get some consistency and move the same categories into the whole structure rather than have one or two decades being one way and earlier ones being other ways. It ends up with early categories being split into complex trees for one thing but otherwise you have dozens of things in the main tree for recent years. The worst problem is that there is no culture by year in English nor I think in Wikidata so it is a complex level of work for either one. Either that or dump the whole thing but there is no real place to discuss either option. -- Ricky81682 (talk) 19:44, 13 August 2022 (UTC)
    • "The person leaning towards a tree in Category:Music is probably a musician and somebody might recognize them." That clearly doesn't happen though, since there are like 11,000 images of musicians in the category that no one has recognized or re-categorized. In the meantime it's just being used as a COM:SPAM dump.
    "Everything is culture." Exactly. Which is why the category is essentially worthless. At that point we might as well have Category:Existence as the top level parent category for everything and use it as a file dump because everything exists and "someone can find the files and put them into useful categories" or whatever if we do it that way. In the meantime there's the option of just not categorizing something if you want people to put it into a useful category. Obviously categories have to at least be tangentially related to what's being depicted in the image. Otherwise, you risk recreating the "no category" option in the space just so people can feel like they did something useful by putting the image in Category:Existence or whatever until someone else can figure out where it should actually go. Nothing stops somebody from finding the files and putting them into useful categories if the files just aren't categorized either. If anything, using overly broad categories like Category:Music as file dumps just de-incentivizes people from doing exactly that.
    Another perfect example of that is Category:Logos, which until recently had like 125,000 images in it. The only reason it doesn't have that many images now is because someone transferred most of them into Category:Unidentified logos, another worthless "no category" file dump. No one is going to look at those 125,000 images and think "hey, that looks like something I can spend a weekend tackling." Get real. People are just going to continue shuffling the images around into more meaningless non-category file dumps instead. --Adamant1 (talk) 23:57, 13 August 2022 (UTC)
    My question is a bottom-up one though. If you look at Category:2021 in culture, that includes Category:2021 in art, Category:2011 in cosplay (which seems like it belongs under fashion), and Category:Fashion in 2011, but neither of those is included under culture at the decade or century level. Education is found at Category:Culture in the 21st century but not at the decade or annual level. Then you have Category:Entertainment by year, which has a number of subcategories but no main category and is a mixed bag of locations. Again, there is zero place to have a discussion, and it seems like it's just mixing and matching templates, which only results in larger and more expansive parent categories with no logic or reasoning behind them, since deletion is close to impossible to achieve. How do you even "fix" something like Category:17th-century culture, sciences and education in painting without a larger discussion of what subcategories for paintings there should be? -- Ricky81682 (talk) 01:24, 14 August 2022 (UTC)
    For Category:Cultural history there is no consistency, and I would not expect any consistency. Mateusz Konieczny (talk) 18:31, 15 August 2022 (UTC)
    @Mateusz Konieczny Would you suggest deleting the entire structure? Again, there is no entertainment by year which is probably a good thing. Ricky81682 (talk) 21:02, 24 August 2022 (UTC)