Commons:Bots/Requests/FlickypediaBackfillrBot


FlickypediaBackfillrBot (talk · contribs)

Operator: Alexwlchan (talk · contributions), working for the Flickr Foundation

Bot's tasks for which permission is being sought:

  • Improving structured data for Flickr photos which have been uploaded to Wikimedia Commons, e.g. adding creator, license metadata.
  • Adding the new Flickr photo ID (P12120) property to all files, to make it easier for other tools to work with Flickr photos.

Automatic or manually assisted: unsupervised

Edit type (e.g. Continuous, daily, one time run): manually triggered

Maximum edit rate (e.g. edits per minute): tbc, probably 5–10 edits per second

Bot flag requested: (Y/N): Y

Programming language(s): Python

Discussion
Interesting proposal. I made one edit to Commons:Flickypedia/Data Modeling, otherwise this looks good. Curious how you will handle conflicting existing SDC claims? --Schlurcher (talk) 17:53, 1 November 2023 (UTC)
Good question! My general approach with these things is to be extremely conservative – imo the V1 bot should be purely additive, and any conflicts should be flagged for manual inspection.
Then a couple of things might happen:
  1. The existing SDC looks wrong, so I make a manual edit from my account to fix it. e.g. I’ve already been looking at the use of source of file (P7482) for Flickr photos in the SDC snapshots, and I found ~200 cases where the URL points to the Flickr user’s profile (/photos/{username}) rather than the photo itself (/photos/{username}/{photo_id}). Those got dropped on a queue and I’ve been gradually tidying them up by hand – opening the files in question and making a manual edit from my account to point to the more specific URL.
  2. The existing SDC looks right, so I work out why the bot is disagreeing. Is it a bug in my code? Have I interpreted the data mapping wrong? Is the data mapping at odds with the community approach to SDC? Is the bot missing some bit of info on the Flickr photo? Either way, the bot won't do anything on its own.
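To illustrate the profile-versus-photo URL distinction in item 1, here is a minimal sketch of how such URLs could be told apart. It is not taken from the bot's code: the function name, the regular expressions, and the "example" username are all assumptions made for illustration, and real Flickr URLs may come in other shapes that this does not cover.

```python
import re

# Hypothetical patterns, written for illustration only.
# A photo URL has a numeric photo ID after the username; anything after
# the ID (e.g. "/in/photostream/") is tolerated.
PHOTO_URL = re.compile(
    r"^https?://(?:www\.)?flickr\.com/photos/(?P<user>[^/]+)/(?P<photo_id>\d+)(?:/.*)?$"
)
# A profile URL stops at the username.
PROFILE_URL = re.compile(
    r"^https?://(?:www\.)?flickr\.com/photos/(?P<user>[^/]+)/?$"
)


def classify_flickr_url(url: str) -> str:
    """Return "photo", "profile", or "other" for a source-of-file URL."""
    if PHOTO_URL.match(url):
        return "photo"
    if PROFILE_URL.match(url):
        return "profile"
    return "other"


# Usage: only "profile" results would go on the manual clean-up queue.
print(classify_flickr_url("https://www.flickr.com/photos/example/"))
print(classify_flickr_url("https://www.flickr.com/photos/example/6939195884/"))
```

Anything classified as "other" is deliberately left alone, in keeping with the conservative approach described above.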
There might also be cases where the existing SDC is wrong in large numbers and we'd want to write an automated fix, but that's somewhat risky and I’d want to be extremely careful before doing that. Two possible examples spring to mind:
  1. License versions. Flickr photos use CC 2.0 licenses, so that's what the bot will write into the SDC. But what if it finds a Wikimedia Commons file which links to the 4.0 version of the CC license? That sounds like an easy candidate for a fix buuuut I think there are Flickr users who leave descriptions on their photos saying "I license this as CC 4.0". A human copying their photo across would notice that; the bot might not. So in this case the bot would likely leave it as-is to avoid deleting info.
  2. Date granularity. Flickr has different levels of granularity for "date taken". Most photos are DDMMYY, but there are some which are MMYY or YY or "Circa YY". If there are lots of cases where there's an imprecise date but the SDC claims it's a full DDMMYY, we might consider automating that. (It's pretty obvious when this has happened – Flickr always returns a full timestamp from its API, but it sets all the unknown values to 0/1. So a YYYY becomes taken="1950-01-01 00:00:00" takengranularity="6".) The bot could be written to fix these. But I don't know if that's a widespread issue in practice.
If/when the bot does start editing existing SDC claims, I'll make sure we document those with examples – and if there are cases that seem contentious, I'll bring them back for community discussion before actually implementing them. Alexwlchan (talk) 08:13, 2 November 2023 (UTC)
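The date-granularity point above can be sketched in code. This is an illustration only, not the bot's implementation. It assumes the Flickr API's documented takengranularity values (0 = full date, 4 = year and month, 6 = year, 8 = circa) and the Wikibase time precisions (11 = day, 10 = month, 9 = year). Treating "circa" as a year-precision value with a separate flag is an assumption, and the names TakenDate and parse_taken are hypothetical.

```python
from dataclasses import dataclass

# Flickr takengranularity -> Wikibase time precision.
# Mapping "8" (circa) to year precision is an assumption for this sketch.
FLICKR_TO_WIKIBASE_PRECISION = {"0": 11, "4": 10, "6": 9, "8": 9}


@dataclass(frozen=True)
class TakenDate:
    value: str       # date truncated to the known precision
    precision: int   # Wikibase precision: 9 = year, 10 = month, 11 = day
    circa: bool      # True when Flickr marks the date as approximate


def parse_taken(taken: str, granularity: str) -> TakenDate:
    """Drop the placeholder parts that Flickr fills in for unknown values."""
    precision = FLICKR_TO_WIKIBASE_PRECISION[granularity]
    date_part = taken.split(" ")[0]          # e.g. "1950-01-01"
    year, month, _day = date_part.split("-")
    if precision == 9:
        value = year
    elif precision == 10:
        value = f"{year}-{month}"
    else:
        value = date_part
    return TakenDate(value=value, precision=precision, circa=(granularity == "8"))


# Usage with the example from the discussion:
print(parse_taken("1950-01-01 00:00:00", "6"))
```

With a helper like this, an existing SDC date claiming day precision could be compared against the precision Flickr actually reports before deciding whether anything needs flagging.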
To return to this question of "how does the bot handle conflicting edits":
Right now the bot will flag any conflicts as "unknown", not make any edits, and put them in a manual queue for review. I’ll look at them and decide if we need to update the bot code, do a manual edit to the SDC, or leave it be.
Example: license has changed since upload to WMC
I just ran it against File:MINDANAO BLEEDING-HEART DOVE (6939195884).jpg.
This confuses the bot, because it wants to write a different SDC statement to what’s currently in Commons – so it flags it as “unknown”.
I went and had a look at it, and I can see that the license has changed since the initial upload – there’s a license history feature on Flickr, and it was changed from CC BY 2.0 in April 2014, a year after it was uploaded to Commons.
(And now I'm going to look at tweaking the bot code so it gets the license from when the photo was uploaded to Commons, and uses that rather than whatever the license is now. But license is a pretty well-populated field, so I may not need this in practice.) Alexwlchan (talk) 08:22, 13 December 2023 (UTC)
Brief addendum to this: I’m going to take license out of the bot for now.
1. Licenses are already pretty well-populated in SDC, so the potential gain here is less.
2. I’m encountering a lot of cases where Flickr users have changed their license after the fact, which makes the bot unhappy.
It is possible to see license history on Flickr as far back as 2008, or I could inspect the Wikitext, but I’m going to leave it for now. I can come back later and see how many Flickr photos are actually missing a license in practice. Alexwlchan (talk) 14:45, 13 December 2023 (UTC)
To add another example to this:
If the bot encounters conflicting information in the "date taken" field, it flags a warning but doesn’t do anything.
e.g. File:STS059-238-074 Strait of Gibraltar.jpg is a photo which was posted to both Flickr and a NASA website. On Flickr the taken date is "April 1994", but on NASA's website we get the more precise date "17 April 1994", which is what's used in the SDC.
Flickypedia would write a statement "April 1994" if it was copying the photo fresh from Flickr, but it doesn't overwrite the existing, more-precise statement when it does the backfill. Alexwlchan (talk) 11:02, 15 December 2023 (UTC)
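The "purely additive" behaviour described in this thread could look roughly like the sketch below. It is a simplification under stated assumptions, not the bot's code: statements are reduced to plain strings keyed by property ID, the function name plan_backfill is hypothetical, the photo ID and URL are placeholders, and P571 (inception) stands in for the "date taken" property purely as an illustration.

```python
def plan_backfill(existing: dict[str, list[str]], proposed: dict[str, str]) -> dict[str, str]:
    """Decide, per property, whether to add, skip, or flag for manual review.

    - "add":     the file has no statement for this property yet
    - "skip":    the proposed value is already present
    - "unknown": a different value exists, so a human should look at it
    """
    plan = {}
    for prop, new_value in proposed.items():
        current = existing.get(prop, [])
        if not current:
            plan[prop] = "add"
        elif new_value in current:
            plan[prop] = "skip"
        else:
            plan[prop] = "unknown"   # never overwrite existing data
    return plan


# Usage, loosely modelled on the "April 1994" example (placeholder values):
existing = {
    "P571": ["1994-04-17"],
    "P7482": ["https://www.flickr.com/photos/example/0000000000/"],
}
proposed = {
    "P12120": "0000000000",
    "P571": "1994-04",
    "P7482": "https://www.flickr.com/photos/example/0000000000/",
}
print(plan_backfill(existing, proposed))
# {'P12120': 'add', 'P571': 'unknown', 'P7482': 'skip'}
```

Only the "add" entries would be written; the "unknown" entries go to the manual review queue.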
👍 I’ll probably get to making some test edits early next week, and I’ll link them here for inspection when they’re done. Alexwlchan (talk) 07:46, 2 November 2023 (UTC)
I know it’s been a couple of weeks and nothing has happened on this.
I am planning to get back to this bot eventually, but right now I’m prioritising getting the “uploader” part of Flickypedia working. Once that’s done, I’ll come back to the Backfillr bot. Alexwlchan (talk) 09:47, 23 November 2023 (UTC)
Test edits are done! You can see some examples of the bot's changes here:
File:Neasden Temple - Shree Swaminarayan Hindu Mandir - Power Plant.jpg
File:Traditional vessel (Stone Town).jpg
File:TimesSquare-500px.jpg
File:Rfid implant after.jpg
File:Bryn Athyn Cathedral - Pennsylvania (4825981267).jpg Alexwlchan (talk) 08:08, 13 December 2023 (UTC)
Thanks for doing the test edits. Content looks good. I only have technical comments.
  • Please combine these four edits into one: [1]
If you use a JSON data specification this can be done by simply merging all the different claims.
  • Please tag the edits with "BotSDC", as lots of users use this tag to filter out SDC edits.
If you use a JSON POST request, this can be done by adding { "tags": "BotSDC" }
  • Please make sure you specify a maxlag for your edits to avoid database overload; leaving it out got me into trouble once.
If you use a JSON POST request, this can be done by adding { "maxlag": "2" }
  • In the edit summary, please link the phrase structured data to [[Commons:Structured data|structured data]] or this bot request so users can find out more if needed.
I would appreciate if you could perform another set of bot edits that incorporate this. --Schlurcher (talk) 08:46, 13 December 2023 (UTC)
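The four suggestions above can be combined into a single request. The sketch below only builds the POST parameters for the MediaWiki Action API's wbeditentity module and sends nothing. It is an illustration, not the bot's code: the function name, the summary wording, the media ID "M12345", the photo ID, and the token value are all placeholders or assumptions.

```python
import json


def build_edit_request(media_id: str, claims: list[dict], csrf_token: str) -> dict[str, str]:
    """Build one wbeditentity request carrying every new claim at once."""
    return {
        "action": "wbeditentity",
        "format": "json",
        "id": media_id,                            # e.g. "M12345" (placeholder)
        "data": json.dumps({"claims": claims}),    # all claims merged into one edit
        # Hypothetical summary wording, linking to the SDC help page:
        "summary": "Update the [[Commons:Structured data|structured data]] "
                   "based on metadata from Flickr",
        "tags": "BotSDC",                          # lets users filter out SDC bot edits
        "maxlag": "2",                             # back off when replication lag is high
        "bot": "1",
        "token": csrf_token,
    }


def string_claim(prop: str, value: str) -> dict:
    """Minimal Wikibase statement JSON for a string-valued property."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {"type": "string", "value": value},
        },
    }


# Usage with placeholder values; a real run would POST this to the Commons
# API endpoint and retry later if the API reports a "maxlag" error.
params = build_edit_request(
    "M12345",
    [string_claim("P12120", "0000000000")],
    csrf_token="PLACEHOLDER_TOKEN",
)
print(params["tags"], params["maxlag"])
```

Passing every claim in one "data" payload is what turns several small edits into a single revision.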
Thanks for the quick feedback! I’ve addressed all four of your suggestions.
1. Done, duh. For some reason I got it into my head that you can’t modify multiple properties at once, but I think that’s just a limitation of the visual editor? API seems fine with it, so that’s changed.
2. Done.
3. Done. I’m also planning to drop a note to somebody who works on the structured data team before I start running the bot at large scale, as a courtesy – backfilling Flickr data means 10s of millions of new statements, and I figure it’ll be easier if they have a direct line to the person adding database load.
4. Done. I’ve also added the property IDs, which I figured might be useful.
Some more test edits:
Alexwlchan (talk) 12:39, 13 December 2023 (UTC)
Thanks. No further comments from my end. My database issue was described here [2] and, as I learned, as long as we respect maxlag it should be fine. As I've added 100s of millions of statements myself, I would not be too concerned about this request. On the contrary, I think it is an excellent addition to improving SDC use. --Schlurcher (talk) 13:27, 13 December 2023 (UTC)

Please summarize: Have all issues been addressed? --Krd 04:28, 21 December 2023 (UTC)

@Alexwlchan:  ? --Krd 13:57, 30 December 2023 (UTC)
Hi @Krd – sorry for the delay, I took a couple of weeks' break from working on that. Getting back into it now, hope to wrap my head around what’s still needed soon! Alexwlchan (talk) 15:37, 22 January 2024 (UTC)
@Krd I believe all the issues have been addressed now. :D Alexwlchan (talk) 13:58, 26 February 2024 (UTC)

@Alexwlchan: I just looked at your test edits. An edit summary like "Update the P12120, P1433, P170, P7482 properties in the ..." is quite cryptic. Either change it to something human readable, or use magic links like "Update the Flickr photo ID (P12120), published in (P1433), creator (P170), source of file (P7482) properties in the ...", which you can do by using d:Special:EntityPage/P170. That does make the summary very (too) long, though. Multichill (talk) 20:29, 6 February 2024 (UTC)

Thanks, I’ll change it to "Update SDC based on metadata from Flickr"! Alexwlchan (talk) 10:35, 26 February 2024 (UTC)

Approved. --Krd 12:53, 6 March 2024 (UTC)