Commons talk:500px licensing data


To be clear

Tables were made based on what images on Commons link to. I don't check in any way if the image that was uploaded here actually matches the image on 500px. - Alexis Jazz ping plz 04:26, 17 July 2018 (UTC)[reply]

Errors

Report errors in the tables here. (do not report errors in the old table)

Missing files

File:Coma - lansare album Orizont.jpg is missing in the new table. De728631 (talk) 22:02, 16 July 2018 (UTC)[reply]

It is in the experimental table. -Jazz

Where are my polar bears like File:Polar bear Copenhagen zoo swimming.jpg? De728631 (talk) 23:55, 16 July 2018 (UTC)[reply]

Some files that were already reviewed were filtered out; those reviewed by 1989 happened to be among them. I will include them later. -Jazz

Wrong IDs and url

File:Magic Man at UCSD.jpg has a wrong url. The original source link is dead. De728631 (talk) 22:16, 16 July 2018 (UTC)[reply]

File:West 23rd Street, Manhattan.jpg in experimental table. Original source link for Magic Man is indeed dead, so the file does not appear in the experimental table. (currently by design) -Jazz

File:Punta del Este lighthouse at night.jpg has a wrong ID. De728631 (talk) 23:46, 16 July 2018 (UTC)[reply]

Seems fixed again in experimental -Jazz

Duplicate list entries

File:Granville Island Public Market inside.jpg occurs twice. De728631 (talk) 22:31, 16 July 2018 (UTC)[reply]

As does File:Cathedral St. Stephen of Metz.jpg. De728631 (talk) 23:10, 16 July 2018 (UTC)[reply]
And File:Фестиваль в Фукуоке, июль 2012.jpg, File:Hates paparazzi by Alvaro A Novo.jpg, File:Lecco di notte.jpg, File:Tree-slaughter-20130221-thonon-fr74200.jpg, File:Крым, Новый Свет (2).jpg, File:Kapellbrücke with flowers.jpg, File:GuoJian6.jpg (ID 87502537 is wrong), File:Airbus A380 Singapore Airlines 9V-SKL at Changi.jpg (ID 73446669 is correct), File:Xenofon București by Laura Burcus.jpg (ID 76244825 is correct). De728631 (talk) 23:46, 16 July 2018 (UTC)[reply]
I think the dupe issues have been resolved in experimental. The 500px links with license were always correct, but too often had the wrong Commons file linked. -Jazz


Files reviewed by uploader

@De728631: Do File:Coma - lansare album Orizont.jpg and File:Xenofon București by Laura Burcus.jpg need a new review? The license is correct, but they were reviewed by the uploader. - Alexis Jazz ping plz 19:25, 20 July 2018 (UTC)[reply]

Strainu is a trusted image reviewer so that should be ok. For the sake of transparency I'll leave it to other reviewers to check my own uploads, but I don't see how these two files need an additional review. De728631 (talk) 16:01, 21 July 2018 (UTC)
@De728631: it doesn't bother me personally, but these words from @Guanaco: didn't leave much to the imagination:
"it appears you intend to use this to review your own uploads. Any license reviewer should know that this is not allowed" - Alexis Jazz ping plz 11:22, 22 July 2018 (UTC)
That's actually right, so thanks for digging a bit deeper. Commons:License review: "As of 21 February 2012, image-reviewers may not review their own uploads unless the account is an approved bot." So these two uploads were technically still unreviewed. I have now checked them and confirmed the licenses. De728631 (talk) 13:06, 22 July 2018 (UTC)[reply]

how to import

@Okki, De728631, Alexis Jazz, and Donald Trung: do you have any kind of clue how to import it?

I'm having trouble figuring out how to import them here. I couldn't find the files, as they weren't uploaded here: [1] (or weren't tagged correctly). The .py script that they created to download all the files together with their metadata is not working properly for me. I have already tried to contact the developer to see whether the problem is me or the app (no response yet). But, but...

There are already 2 TB of files, and I don't know how many of those are cc-by/cc-by-sa. The cc-by-sa list has ~600 items, each with I don't know how many links... Uploading 500 GB or more all by myself could be a problem.

Do you already have a way to do this? -- Rodrigo Tetsuo Argenton m 23:40, 23 July 2018 (UTC)

Sorry, I don't know anything about Python and/or upload bots, so I'll have to leave this to the others. As has been noted at the Village Pump though, you should take care to use appropriate file names. And we don't just want to grab any and all free images, but whenever possible they should be realistically useful for Commons. If we had a list of all CC files, we should first exclude those already present on Commons and then do a quick review for those that should really be imported. Getting more out-of-scope stuff on top of the regular Flickr bots is not really what we want. De728631 (talk) 00:02, 24 July 2018 (UTC)[reply]
The 500px id should always be included in the filename. I may be able to add information later if that is present. - Alexis Jazz ping plz 21:09, 24 July 2018 (UTC)[reply]

An important update

The Brazilian volunteer Chicocvenancio is developing a tool to import the files and metadata (provided by Archive Team) in a responsible way, giving us the possibility to select files before importing them.

You can see the embryo here; imports will be authorized via OAuth. So no verification of license and scope is needed. Cheers.

-- Rodrigo Tetsuo Argenton m 19:38, 26 July 2018 (UTC)[reply]

Muito obrigado! This looks like a good idea. De728631 (talk) 19:43, 26 July 2018 (UTC)[reply]
Great, that would save me quite some work. @Chicocvenancio: can you make sure to include the 500px ID in the filename of the uploaded file, like Flickr2commons does? This would make it a lot easier to make adjustments later on. You may also want to take a look at the "keywords" column of my table, some even include {{Object location}}. - Alexis Jazz ping plz 19:51, 26 July 2018 (UTC)[reply]
Alexis Jazz, we have all the data available. I only loaded the id, author name, license and full resolution url into the database to avoid performance issues in toolforge (20-item pages of that data can be seen at toolforge:import-500px/page/1). The idea is to have the full data available to construct the metadata in the upload.
Is the id in the title necessary? If there is a consensus I'll keep it, but to me it seems burdensome to users of the images to have a non-meaningful id in the name. I agree it has to be on the wikipage in some form, though. Chico Venancio (talk) 21:19, 26 July 2018 (UTC)[reply]
It has become somewhat of a common practice to include such IDs in file names from mass uploads, e.g. File:Elizabethan Splendor, Montacute House (338479812).jpg from Flickr. De728631 (talk) 21:27, 26 July 2018 (UTC)[reply]
@Chicocvenancio: when the ID is in the filename I can create a temporary template that spits out additional information (like object location), add it using VFC and obtain the ID to pass as a parameter using {{PAGENAME}}. I can do no such thing when the ID is only in the wikitext. - Alexis Jazz ping plz 21:29, 26 July 2018 (UTC)[reply]
Ok, I still think we should include all relevant information in the initial upload, but I'll leave the id baked into the filename. Chico Venancio (talk) 04:43, 27 July 2018 (UTC)
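
A minimal sketch of why a trailing ID in the file name is convenient for later maintenance: it can be recovered from the page name alone. The pattern assumes names shaped like the Flickr example above, `Title (ID).ext`. The function name `extract_source_id` is hypothetical and not part of any tool discussed here.

```python
import re

# Assumed naming pattern: "<title> (<numeric id>).<extension>",
# optionally prefixed with "File:". This is an illustration only.
_ID_PATTERN = re.compile(r"\((\d+)\)\.[A-Za-z0-9]+$")


def extract_source_id(filename):
    """Return the numeric ID at the end of a file name, or None if absent."""
    match = _ID_PATTERN.search(filename.strip())
    return int(match.group(1)) if match else None
```

Note that a name ending in a short parenthesised number, such as "(2).jpg", would also match, so a real cleanup job would probably need an extra check (for example a minimum number of digits, or a lookup against the 500px data).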

toolforge:import-500px can now be used to view all 231k images. The archiveteam plans to release a more polished viewing site for all the images (including NC and ND), so I'm not too worried about making this very usable for a wide audience. My goals are:

  • Allow Wikimedians to see the images so we can discuss if/how we should import. (hopefully done now).
  • Allow Wikimedians to import images into Commons.

On this second goal, I'd like some feedback. My current plan is:

  • keep this (lack of) layout on the list page and have a detail page for each image.
  • From that detail page allow uploading the image directly to Commons after logging in through OAuth.
  • After an image is uploaded, drop it from the main list, creating a separate view for those.

Any special requests that could improve the workflow here? Chico Venancio (talk) 04:43, 27 July 2018 (UTC)[reply]

Great work, Chicocvenancio! What is the time schedule for the upload? What support --besides regular curation-- do you need? This massive upload should be reported in some meaningful way, IMHO, as an interesting workflow appears to have been put in place here. --Joalpe (talk) 11:48, 27 July 2018 (UTC)[reply]
@Chicocvenancio: Great work! I would like to be able to view all the photos (or at least a few hundred) that were taken by a 500px user, enable checkboxes for each photo (with "invert selection" option) and click "upload" to upload everything I've selected. Is that possible?
I'd also like to be able to either categorize images or just mark them as "do not import", dropping them from the main list without uploading. Suggestion to categorize: mark images as "License problems" (PD-mark stuff without PD rationale and suspected license laundering), "Female modelling" (generally just gets nominated for deletion here unless it's special) and "Artistic without educational value". I would like to do this for all (or many) photos of a user at once as well. - Alexis Jazz ping plz 12:31, 27 July 2018 (UTC)[reply]
@Alexis Jazz:
  • Having a per 500px user view is very possible, I'll try to create it soon
  • The default pagination is 20 photos per page, but it will be customizable soon
  • Having checkboxes and mass upload is possible, but requires us to define the wikitext based on the metadata beforehand (example2)
  • Marking as do not import or categorizing them is also possible but a bit more work, we need to think how this will be shared across users as well.
Chico Venancio (talk) 12:57, 27 July 2018 (UTC)[reply]

This is amazing! I'd like to note though that you can discard all images with Public Domain Mark (PDM). We do not accept this as a licence because it is just a statement the image is PD for some obscure reason. I think it would be too much work to dig into each of these PDM files to find out why they are PD. De728631 (talk) 12:45, 27 July 2018 (UTC)[reply]

Perhaps categorize those as well for whoever wants to wade through them. A single user may have scanned and uploaded many archived photos that could be imported easily. - Alexis Jazz ping plz 12:50, 27 July 2018 (UTC)[reply]
Ok, then it would be handy if the results could be filtered by licence. De728631 (talk) 12:53, 27 July 2018 (UTC)[reply]
I can filter PDM out from the default into a separate view. Chico Venancio (talk) 12:57, 27 July 2018 (UTC)[reply]
That's even better. De728631 (talk) 12:58, 27 July 2018 (UTC)[reply]
✓ Done at toolforge:import-500px/pdm Chico Venancio (talk) 21:40, 27 July 2018 (UTC)[reply]

Chico,

It is already slow to load, and having more pictures will make it even worse, so 20 images is good.

My suggested workflow is to separate the images not by author but by subject. The 500px community has some big tags that people add to their images ([2]). For us, for example, the category "underwater" will have more valuable images than the category "nude". In theory there are also NSFW photos in this pool, so when browsing by author some volunteers may get surprises during the process. Okay, there is an NSFW flag in the metadata, but some users do not declare it, or even include the "nude" tag.

For selecting, it would be better to be able to select the whole page. To avoid conflicts between volunteers trying to upload the same page at the same time, it's better to only offer the option of selecting one page per upload. This will severely slow down the process, but will generate fewer errors. Imagine a situation where one volunteer selects 2k images over a whole day, and another volunteer selects 1k of those 2k over a whole day as well... This avoids conflicts, and also crazy volunteers who prefer to select a whole catalogue without checking all the photos, as we already have at Flickr2Commons. That is another reason not to allow selecting an author's whole body of work.

Tagging as "do not upload" is okay, but we should have access to this, as the "out of scope" concept is a grey area, so volunteers could have different views of the same image. We also need a "copyvio" tag: I saw at least 3 suspects in the ~100 pages that I went through...

The name of the file... we sometimes use the ID to avoid conflicts with already existing names. But we should create a template to highlight files with something like 45455615611 in the name, so that a volunteer can properly rename them afterwards. A hidden category "500px-import tool" should be added to all files imported by it. Even for reporting...

The main work will not be uploading these; categorizing, correctly describing, and naming the files will demand more work.

Thanks for the effort-- Rodrigo Tetsuo Argenton m 21:34, 27 July 2018 (UTC)[reply]

I talked a bit with Rodrigo Argenton in IRC about this, but to document here:
  • The number of items per page is customizable now, it can go as high as 300 (depending on your device that may already be too much). To change it, for now you have to "url hack" changing the perpage= url parameter. This will be a more user friendly selection box soon.
  • It seems to me most images have non-useful tags, we can have a more thorough analysis of them, but I have not loaded them into toolsdb and don't think there is much to gain from that. We can revisit this later if needed.
  • I only found 24 images with the NSFW:true property.
  • I have made the page be random if not set to a specific number (hopefully getting users to look at different photos)
  • I have not started working on (yet):
    • checkbox selecting
    • marking images as "not appropriate for Commons"
  • Single photo upload to commons should be ready today or tomorrow.
Chico Venancio (talk) 22:06, 28 July 2018 (UTC)[reply]
@Chicocvenancio: great news! In general the tags seem to broadly (and often specifically) describe the image, like "city", "lake", "night", "landscape". I think these are quite useful. While not compatible with Commons categories, they do help to find images when searching. - Alexis Jazz ping plz 23:01, 28 July 2018 (UTC)[reply]

The first few files are uploaded! File:Chuck Berry (8275961).jpeg, File:Parliament (49119844).jpeg and File:Bon App?tit (8667215).jpeg. There are a few kinks to work out before bringing the buttons to everyone, but it should be soon. In the meantime, it would be nice to have a sanity check on the template I used to create the wikitext. Is there something I should change? Any other part of the metadata that might be useful there? Chico Venancio (talk) 07:50, 29 July 2018 (UTC)

Looking at the examples you've uploaded I'm a little concerned that they might attract erroneous deletion nominations for lack of evidence of license. Ideally the pages should link to an archived page showing the license in addition to an archive link to the actual image. If the license has only been archived in a database, maybe consider including a link to your forthcoming "detail page" in the source or permission portion of the template. That would also be convenient if someone wants to review the other metadata associated with the image that is currently not imported into Commons. —RP88 (talk) 14:37, 29 July 2018 (UTC)[reply]
RP88 thanks for pointing that out. It is in the source HTML files at both https://500px.com/photo/8275961/chuck-berry-by-pablo-vaz (still?) and https://web.archive.org/web/20180729070431id_/https://500px.com/photo/8275961/chuck-berry-by-pablo-vaz but neither of them is visible in-browser. Since expecting Commons users to dig through source code doesn't seem like a good idea, I'll link to an on-tool detail page. Chico Venancio (talk) 16:02, 29 July 2018 (UTC)
@Chicocvenancio: Exciting to see this moving forward. When the description/keywords are empty, use the title instead. Also see https://commons.wikimedia.org/w/index.php?title=File:Chuck_Berry_(8275961).jpeg&diff=312838193&oldid=312754142. The files should actually get a license review I think, but perhaps another bot should do that afterwards. (similar to how Flickr2Commons doesn't review the files it imports) - Alexis Jazz ping plz 19:15, 29 July 2018 (UTC)[reply]
RP88 I've now included a detail page with the metadata. Do you see any problems with using this as a "proof of license"? Chico Venancio (talk) 19:19, 30 July 2018 (UTC)
Very nice Chico Venancio! Your detail page looks just fine to me. To be clear, my encouraging you to add a mention of the license to your existing plans for the image detail pages was to head off any issues due to license not being user visible in the archived pages, ultimately this comes down to the community trusting in your tool, which I am confident will be the case. Anyone sufficiently interested in a "proof of license" would, of necessity, presumably examine the source of the archived page to get the license name by looking at the code which converts the embedded license number to a license name. —RP88 (talk) 09:36, 31 July 2018 (UTC)[reply]

"The files should actually get a license review I think, but perhaps another bot should do that afterwards"

I utterly disagree. We are manually selecting the files, and a "bot" is checking the metadata and then including the license information; another bot will not help the process. This goes beyond what is necessary. Likewise, I find the claim of "lack of evidence of license" a little bit weird: for Panoramio we didn't request this "detail page", as you can see here in any photo. What you may want to do is to register the tool as a "trusted bot", and then add the template automatically.

But, Chico already did what you requested...

-- Rodrigo Tetsuo Argenton m 01:49, 31 July 2018 (UTC)[reply]

@Rodrigo.Argenton: one way or another it's probably a good idea to include a license review template. - Alexis Jazz ping plz 09:16, 31 July 2018 (UTC)[reply]

Debugging

The upload function is now working! The option to mark images as "not in scope" or with some other categories is being developed now (backend done). Chico Venancio (talk) 20:49, 31 July 2018 (UTC)

Great work so far. Is there a way to close the metadata popups? When you click on various photos, the metadata lists just keep piling up on top of each other. De728631 (talk) 21:33, 31 July 2018 (UTC)
De728631 I know it is not intuitive, but clicking on the photo again closes the popover. Not a priority, but on the task list to improve this. Chico Venancio (talk) 21:46, 31 July 2018 (UTC)[reply]
Alright, I just figured it out. De728631 (talk) 21:47, 31 July 2018 (UTC)[reply]

I can't upload this image. Could that result from the title beginning with a hyphen: "-ss-rotterdam-by-appie-verschoor"? De728631 (talk) 01:34, 1 August 2018 (UTC)[reply]

@Chicocvenancio: With this image also not getting uploaded, I'm pretty sure that the naming pattern is the reason. "...in concert2..." would not make a valid file name because of the leading dot. E.g. on a Linux system this would result in a hidden file. An automated removal of leading punctuation characters in the titles might fix this issue. Likewise, there are also question marks and possibly other disruptive characters in the titles that should be replaced. De728631 (talk) 06:11, 2 August 2018 (UTC)[reply]

Most of these issues have been dealt with today. Mainly there were 3 separate issues:

  • Files on the commons titleblacklist
    • These are now retried with "500px photo (ID).jpeg" as filename
  • Files with leading spaces
    • There was already a process to strip most non-URL-safe characters from the filename, but sometimes this would leave a space at the start of the filename. Commons issues a warning in those cases, so we are now stripping leading spaces.
  • Descriptions with links on the spamblacklist
    • I've stripped all links of their protocols, which should be enough to not trigger the spamblacklist. Uploaders should analyse whether the links should be kept when creating a more informative description.

Chico Venancio (talk) 02:20, 3 August 2018 (UTC)[reply]
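
The three fixes above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: the function names, the exact set of replaced characters, and the `is_blacklisted` callback are assumptions. Only the fallback name pattern "500px photo (ID).jpeg" comes from the comment above. The sketch also strips leading dots and hyphens, as suggested earlier in this thread, which the tool itself may or may not do.

```python
import re


def clean_title(title):
    """Make a 500px title safer to use as the start of a file name."""
    # Replace characters that are awkward in URLs or file names (assumed set).
    cleaned = re.sub(r"[?#<>\[\]{}|:/\\]", " ", title)
    # Collapse runs of whitespace left behind by the replacement.
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Drop leading spaces and leading punctuation such as "-" or ".".
    return cleaned.lstrip(" .-_").rstrip()


def build_filename(title, photo_id, is_blacklisted=lambda name: False):
    """Build "<title> (<id>).jpeg", falling back to a generic name."""
    cleaned = clean_title(title)
    name = f"{cleaned} ({photo_id}).jpeg" if cleaned else ""
    # is_blacklisted is a hypothetical stand-in for a title blacklist check.
    if not name or is_blacklisted(name):
        name = f"500px photo ({photo_id}).jpeg"
    return name


def strip_link_protocols(text):
    """Remove http:// and https:// so links are no longer clickable URLs."""
    return re.sub(r"\bhttps?://", "", text)
```

With these assumptions, a title such as "-ss-rotterdam" becomes "ss-rotterdam", and a title that is empty after cleaning falls back to the generic "500px photo (ID).jpeg" name.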

List of files not uploading

A warning

There are NSFW images, so beware. Not a problem to me, but... -- Rodrigo Tetsuo Argenton m 21:30, 1 August 2018 (UTC)

PDM

What will we do about those files? For some reason (maybe I was trying to import files from the author) I accidentally uploaded one file from the PDM list... What will we do about those files, simply not upload them? -- Rodrigo Tetsuo Argenton m 03:13, 3 August 2018 (UTC)

i would flag it for review. if it is art that looks old, it could be PD-art. but there are those who doubt the assessment of others, and routinely nominate for deletion. i.e. PDM =\ CC0. in larger sense, i would try to sort into task buckets, review category, review description, review license. cheers. Slowking4 § Sander.v.Ginkel's revenge 01:33, 4 August 2018 (UTC)[reply]

Comment needed on title= parameter

Please comment on phab:T201239. Should we use {{Untitled}} on import or leave that to review efforts? Chico Venancio (talk) 19:17, 4 August 2018 (UTC)[reply]

If there is no title at 500px, why don't we just discard the |title= parameter? Shouldn't be that hard to filter. De728631 (talk) 22:43, 4 August 2018 (UTC)[reply]
I agree with De728631, in cases where there was no title for the photo at 500px the title parameter should be left empty. —RP88 (talk) 16:26, 5 August 2018 (UTC)[reply]

How should we deal with photos with html tags in Exif? Chico Venancio (talk) 02:16, 5 August 2018 (UTC)[reply]

Copyright

Hi:

Why should a file such as File:Dsc 7173 Jpg (106623559).jpeg, with a copyright notice clearly on the image, be considered to meet the free license required by Commons? Wouldn't an OTRS authorization be necessary before uploading it? Why is this 500px site considered to be free licence without question?

Pierre cb (talk) 11:31, 5 August 2018 (UTC)[reply]

Pierre cb it is not at all "free licence without question". Photographers chose a license when sending their work to 500px, and before June 2018 they could choose from several Creative Commons licenses. Those who chose licenses compatible with Wikimedia Commons are listed in the tool I created. There are several possible situations that would still make a photo non-free, but in that image, File:Dsc 7173 Jpg (106623559).jpeg, I do not see any. The watermark with the copyright corresponds to the same author as the 500px user. Unless we have any indication this was not the case, it seems to be a valid license chosen by the copyright owner. Chico Venancio (talk) 14:16, 5 August 2018 (UTC)

Incorrect UTC time in descriptions

While discussing a certain image taken in Germany, we noticed that the Exif time didn't match the UTC time that can be found in the description. I've then checked a number of files in Category:Photos imported with import-500px and found that in most cases the "Date" field of the description contained a UTC time that corresponds to the file's Exif time plus 4 or 5 hours, irrespective of the part of the world where the images were taken (examples: Germany, Australia, India, Virginia, USA, Alaska). In other cases the alleged UTC time was the same as Exif time, but that wasn't correct either (ex1, ex2). This seems to be a bug of the import tool. You can't specify the UTC time based on Exif without knowing which time zone the Exif time corresponds to, but we usually don't have this information. Instead of stating UTC, the tool should use Template:According to Exif data and copy the Exif time to the "Date" field.
This applies not only to new uploads: there are more than 100,000 files in Category:Photos imported with import-500px that possibly need the date to be replaced.
(I don't know who's actually responsible for the tool, so I'll just ping some users who have been active here. @Alexis Jazz, De728631, and Rodrigo.Argenton: ) --Sitacuisses (talk) 20:22, 6 August 2019 (UTC)[reply]

Actually it was Chicocvenancio who developed the tool. De728631 (talk) 20:53, 6 August 2019 (UTC)[reply]
This might be a bug. The relevant code looks not at Exif, but at the metadata provided by 500px (not quite sure how that is generated, might be cloned from Exif). It looks at the field `taken_at`. In both cases it normalizes the string to a Python datetime object using pendulum, then it normalizes the timezone (making the adjustment) to UTC and outputs a datetime string as `YYYY-MM-DD HH:mm:ss (zz)`.
The difference from exif in File:S3a2441 Jpg (105808971).jpeg is due to `"taken_at": "2015-04-19T11:36:50-04:00"` in the photo's metadata being used and normalized by the code. I'm not entirely certain the code is wrong in what it is doing, but perhaps the data is not to be entirely trusted. Maybe 500px messed up the timezones in the data. Chico Venancio (talk) 21:16, 6 August 2019 (UTC)[reply]
I think our best source of data is the duplicate files in both 500px and Commons that were uploaded by the photographers themselves. Chico Venancio (talk) 21:25, 6 August 2019 (UTC)[reply]
Looking at a few duplicates I could find it indeed seems 500px is assigning (server?) timezone of UTC-4 or UTC-5 irrespective of actual timezone on the files or user/photo location. I missed this when coding the tool and trusted the data.
I will remove the timezone handling from this data immediately and leave it as timezone naive datetime (as is more common). I can also develop the scripts to fix dates on the already uploaded images (not terribly difficult, though renaming of files means some work). Once we define correctly what should be displayed on the date field I can ask for bot permission for this task. Thank you Sitacuisses for catching this. Chico Venancio (talk) 14:41, 7 August 2019 (UTC)[reply]
✓ Done for new files (eg File:Fishermen S Village (81243037).jpeg). Chico Venancio (talk) 14:57, 7 August 2019 (UTC)[reply]
Chicocvenancio, could you provide an example that shows the result to people with limited understanding of the code? What values are available for evaluation? There's Exif time provided by the image itself, which may not be available for all files (?) and which also may be incorrect due to wrong settings by the photographer. Then there's a time given in the 500px metadata which may be manipulated by the server (?). Is there a way to find out if the uploader has specified a time (not manipulated by the 500px server) to correct an incorrect or missing exif value? --Sitacuisses (talk) 12:21, 8 August 2019 (UTC)[reply]
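
To illustrate the behaviour discussed in this section, here is a minimal sketch using only Python's standard library. The tool itself reportedly uses pendulum, and its real code may differ. The sketch takes the `taken_at` value quoted above and produces two outputs: a string normalized to UTC, as the tool originally wrote it, and a timezone-naive string that simply keeps the wall-clock time. The function names are hypothetical.

```python
from datetime import datetime, timezone


def to_utc_string(taken_at):
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    aware = datetime.fromisoformat(taken_at)
    # Assumes taken_at carries an explicit offset such as "-04:00".
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S (UTC)")


def to_naive_string(taken_at):
    """Parse the same timestamp but discard the offset, keeping wall-clock time."""
    aware = datetime.fromisoformat(taken_at)
    return aware.replace(tzinfo=None).strftime("%Y-%m-%d %H:%M:%S")


taken_at = "2015-04-19T11:36:50-04:00"
print(to_utc_string(taken_at))    # 2015-04-19 15:36:50 (UTC)
print(to_naive_string(taken_at))  # 2015-04-19 11:36:50
```

If the -04:00 offset is only a server default rather than the photographer's real timezone, as the duplicates mentioned above suggest, the first output is four hours away from the Exif time, which matches the reported "plus 4 or 5 hours" discrepancy. The second output keeps the recorded time without claiming any timezone.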