Commons talk:Batch uploading/Geograph

From Wikimedia Commons, the free media repository

database of locations


I have a database of locations (including lat/lons) and was wondering if the methodology you apply here with these photos is somewhat akin to what I need to do to batch create geotagged wiki pages in my wiki which has the google map extension enabled?

Kim --IP 10 August 2010

Really long filenames


Are you going to continue using the extended filenames with geograph.co.uk in them? I never did get why this was necessary the first time around. Jarry1250 (talk) 12:39, 17 August 2010 (UTC)[reply]

Talk page is the place for comments


Things seem to have got a little confused here: everyone is using the project page for comments. I will have a go at straightening it out over the weekend if no one objects. --ClemRutter (talk) 21:53, 17 March 2011 (UTC)

As there have been no objections, I have copied across the discussions, and will now rebuild the project page. (I haven't copied over the history, as I am in a quandary as to whether it would be the best thing to do.) --ClemRutter (talk) 21:54, 3 April 2011 (UTC)

Discussions


Geograph


In the Village pump Perry Rimmer brought up the suggestion of copying all these files to Commons. Geograph is a site containing about 1.5 million {{Cc-by-sa-2.0}} images of the British Isles. The Isles are divided into 1 km by 1 km squares and the goal of the project is to get at least one photo of every square. 250,397 grid squares, or 75.5% of all squares, currently have an image. Most of the images we use at the English Wikipedia to illustrate villages in the United Kingdom come from this site. The quality of the images is not that high, but nevertheless this is a very rich resource. Dumps of the databases are available and also torrents containing the files. I will contact the people behind this project to see if we can make some sort of cooperation project of it. Before I start actually uploading images I want to do several things:

Imported the database dump at the toolserver. It should be straightforward to extract all information from the database. Categories, on the other hand, are probably going to be a nice challenge. Found several possible tools

- Multichill (talk) 16:44, 19 November 2009 (UTC) (more to come)[reply]

Update December 2009


I downloaded the first 250,000 images (about 25 GB) and the database dump. With these two combined it's quite easy to generate descriptions and filenames. I modeled this after geograph_org2commons as everybody seems to be happy with that. Categorization, on the other hand, is hard. I take the following approach:

  • Get locations from http://ws.geonames.org/extendedFindNearby
    • I mapped some IDs (earth/Europe/countries/counties/etc.) to Commons categories in a database. I first take a look in this database
    • If the ID is not in the database, I'll see if I can find a category at Commons with a similar name
    • I now have a bunch of location categories at Commons
  • Get the topic from Geograph
    • I mapped Geograph categories (=imageclasses) to Commons categories. See if the Geograph category is in the database
    • If the Geograph category is not in the database look if you can find a category at Commons with a similar name
  • Combine location and topic: Try to find categories deeper in the tree. So for example not Category:Churches, but Category:Churches in England
  • Filter the categories
    • Follow redirects
    • Remove disambiguation categories
    • Filter out overcategorization
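
The filtering steps above can be sketched roughly as follows. This is only an illustration in Python, not the bot's actual source: the function name and the `redirects`, `disambiguation` and `parents` arguments are hypothetical stand-ins for lookups that would really be made against Commons.

```python
def filter_categories(candidates, redirects, disambiguation, parents):
    """Resolve redirects, drop disambiguation categories and remove
    any category that is an ancestor of another candidate.

    redirects:      dict, redirected category -> target (assumed input)
    disambiguation: set of disambiguation category names (assumed input)
    parents:        dict, category -> list of direct parent categories
    """
    # 1. Follow redirects, guarding against redirect loops.
    resolved = []
    for cat in candidates:
        seen = set()
        while cat in redirects and cat not in seen:
            seen.add(cat)
            cat = redirects[cat]
        if cat not in resolved:
            resolved.append(cat)

    # 2. Remove disambiguation categories.
    kept = [c for c in resolved if c not in disambiguation]

    # 3. Filter out overcategorization: collect all ancestors of each
    #    category by walking up the parent links.
    def ancestors(cat):
        found = set()
        stack = list(parents.get(cat, ()))
        while stack:
            parent = stack.pop()
            if parent not in found:
                found.add(parent)
                stack.extend(parents.get(parent, ()))
        return found

    redundant = set()
    for cat in kept:
        redundant |= (ancestors(cat) - {cat}) & set(kept)
    return [c for c in kept if c not in redundant]
```

Note that step 3 only works when the parent links exist: if "Churches in England" were not a subcategory of "Churches in Europe", both the European and the county category would survive the filter.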

This seems to work alright, but for now I'll add {{Check categories-Geograph}} to the images to be sure. Some issues I expect to encounter:

  • Disambiguation problems. Some images will end up in strange categories because a lot of names aren't properly disambiguated
  • Not properly filtering out overcategorization because the tree is broken. For example we have a lot of Topic in Europe categories, but Topic in the United Kingdom is not a subcategory
  • Some categories will get crowded because the tree for the United Kingdom and Ireland hasn't been built yet

The source is available (work in progress!). Oh, and btw, the usual tricks apply so filenames get cleaned up and no duplicates are uploaded. Multichill (talk) 18:06, 3 December 2009 (UTC)[reply]

Did a test upload of 365 files. Feedback is appreciated. Multichill (talk) 23:24, 3 December 2009 (UTC)[reply]

This is good news, I think it will bring many interesting images. I looked at the ones here. Just a few points for now:
  1. "in Europe": #398 is in both "Buildings in Europe" and "Buildings in Dorset" (and also "Buildings in England"), possibly because Category:Buildings in Dorset is not in "Buildings in England". #256 is in both "Churches in Europe" and "Churches in Hampshire", despite the latter being a subcategory of the former.
    Yes, a lot of trees are incomplete. These will show up. Already fixed a couple. Please fix the tree if you spot problems like this.
    Will do.
  2. Dot: When the image title doesn't include a "." at the end, one needs to be added when combined for the description (samples: #256, #106). If there is one in the title, it doesn't necessarily need to appear in the file name (#38).
    This is something so minor, I'll just keep it this way.
    A minor fix, but could you look at #106? "A view from East Cliff Beach across to Charmouth and Stonebarrow Hill This image is taken from the last concrete groyne" just looks odd.
    Oh right, now I understand you. I thought you were talking about the title, but you're talking about the description. Will fix. (note: strip(), if not last char . -> add char). Multichill (talk) 17:50, 4 December 2009 (UTC)
  3. Filename: I'd limit the "geograph.org.uk - 38" to something like "GG0000038". I don't see much benefit to use the domain name there. I was going to suggest to add the date to the file name, but for some of the files, this seems to be "unknown". 00000000 could be an option for these. Combined with the number this could be "GG20091204-2113138" for a file of today.
    I modeled the filename after Magnus' tool. I like it because it prevents name collisions and is easy to understand
    The full domain name seems excessive and I think the date should be in there. For the Navy pictures, this allows easy sorting.
  4. #256 is in category:Ibsley instead of Harbridge, but I figured out why.
    Yup, the location tool is not right all the time.
  5. #106 is in Category:Coasts and #36 in Category:Hills. Such general categories might fill up quite quickly.
    For 106 this happened because the location tool didn't return a suitable location (Coasts in the English Channel doesn't sound very suitable to me).
    For 36 an intersection between Category:Hills and Category:Isle of Man should be made. Creating this kind of category will prevent the main categories from filling up.
  6. Template: The layout of {{Geograph}} could need some work, but this isn't really related to your upload. I already made a request to remove the interwiki from the template (#iw)
    I would like to have a similar layout to {{Fotothek-License}} or {{KIT-license}}, but let's discuss that at Template_talk:Geograph.
  7. Stray text: Some images still have the "Importing image file" text (e.g. #278)
    Yeah, noticed that too. Something went wrong with the import. Removed it from most of the files.
  8. Headers: As the headers are optional, I think we should drop at least {{int:filedesc}} .
    I like to add them for the non-English speakers
    "Description" is translated too, so "Summary" isn't really needed.
  9. The images seem to be fairly old, maybe it's worth doing a test with more recent ones.
    Despite this list, I think the overall quality of the import is good. It is likely to give quite a lot of categorization to do. -- User:Docu at 06:16, 4 December 2009 (UTC)[reply]
    That's right. I started with the oldest images and am working my way to the newer images. I'll add some more manual categorization mappings. If trees are built and corrected for the most used categories (like Churches) before I compile my batches, it will save a lot of time. I'll make a list of important categories to work on. Multichill (talk) 10:45, 4 December 2009 (UTC)
    I replied above. -- User:Docu at 17:00, 4 December 2009 (UTC)
    At User:Multichill/Geograph/categories I put a list of categories. This is based on the 1.5M files in the database. This covers about 60% of the files. I'm working on raising this to at least 80% (will update the list accordingly). For all these categories the trees should be checked and built. The layers to check:
    Sometimes a topic in Europe category exists (for example Category:Churches in Europe). The country categories should be made a subcategory of this. Generally all the topic by location categories should have two or more parent categories. If not, there's probably something missing. Multichill (talk) 13:27, 4 December 2009 (UTC)[reply]
    Given the sheer number of images, it might be worth making county categories for topics that otherwise might not be categorized that way. This until we have subcategories for specific features or structures. Your bot, is it already set to make them? BTW could you add "heading:?" to the coordinates? -- User:Docu at 17:00, 4 December 2009 (UTC)
    Most of the categories in my list are already divided by country. It's more about the lower layers. I don't have a bot to create these categories automagically.
    Heading is added when it's known, see for example File:Aldershot - Home of the British Army - geograph.org.uk - 177.jpg. Multichill (talk) 17:50, 4 December 2009 (UTC)[reply]
    I made a matrix of categories here Commons:Batch uploading/Geograph/cat-matrix. So all red categories should be created? If they have the right in/of :-) --MGA73 (talk) 19:07, 4 December 2009 (UTC)[reply]
    Yes, for many probably also the subcategories for counties. Keep in mind that the final count could easily be 2-4 times the quantity listed. -- User:Docu at 05:46, 5 December 2009 (UTC)
    All done now. Maybe Multichill can get a bot to push existing images down in the new tree? --MGA73 (talk) 21:56, 7 December 2009 (UTC)
    Or maybe someone else can write it ;-) Maybe something in combination with {{Populate category}}. If the image is in both parent categories, move it to the underlying category. Have to think about that. What to do if it has more than 2 parent categories? Etc. Multichill (talk) 17:16, 8 December 2009 (UTC)
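
The fix noted under point 2 above (strip the title, and add a full stop if it does not already end in one before appending the description) might look like the Python sketch below. The function name `build_description` is hypothetical, and treating "!" and "?" as valid endings is an added assumption that goes beyond the original note.

```python
def build_description(title, comment):
    """Join a Geograph title and comment into one description string."""
    title = title.strip()
    comment = comment.strip()
    # Only add a full stop when the title lacks closing punctuation
    # (the "!" and "?" cases are an assumption, not from the note).
    if title and not title.endswith((".", "!", "?")):
        title += "."
    return (title + " " + comment).strip()
```

With the text of #106 this would give "... Stonebarrow Hill. This image is taken from the last concrete groyne" instead of running the two sentences together.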

(unindent) To make sure they get categorized, I made five empty categories for Llyns (Special:PrefixIndex/Category:Llyns). Would these work for your bot that way? If I redirect them to corresponding lake categories, would that work too? -- User:Docu at 10:55, 6 December 2009 (UTC)

If an en:Llyn is just another word for lake, should we not just have Multichill tell the bot that Llyns = Lakes? --MGA73 (talk) 18:09, 7 December 2009 (UTC)
My second question aims at that. Otherwise, I can merge them later. Obviously, I prefer to see them categorized as Llyns rather than not at all. -- User:Docu at 19:37, 7 December 2009 (UTC)[reply]
This will work, but it's probably easier to add a database entry so that my bot knows that Llyn means category:Lakes. I already did this for the top categories. Should cover about 80% of the images. Feel like helping to increase this hitrate? Multichill (talk) 17:16, 8 December 2009 (UTC)[reply]
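
A database entry of the kind described (Llyn means Category:Lakes), with a fallback to a similarly named Commons category, could be sketched in Python as below. The names `topic_category`, `mapping` and `existing_categories` are hypothetical, and the naive "add an s" plural guess is only an assumption about what "similar name" might mean.

```python
def topic_category(imageclass, mapping, existing_categories):
    """Map a Geograph imageclass to a Commons category name, or None.

    mapping:             dict of lower-cased imageclass -> category
    existing_categories: set of category names known to exist
    """
    key = imageclass.strip().lower()
    # 1. Explicit database entry, e.g. "llyn" -> "Lakes".
    if key in mapping:
        return mapping[key]
    # 2. Fallback: look for a category with a similar name.
    guess = imageclass.strip().capitalize()
    for candidate in (guess, guess + "s"):
        if candidate in existing_categories:
            return candidate
    return None
```

For example, `topic_category("Llyn", {"llyn": "Lakes"}, set())` returns "Lakes", while an unmapped "Sea loch" would fall back to "Sea lochs" only if that category exists.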
If someone wants to help they can work on User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox. --MGA73 (talk) 17:33, 9 December 2009 (UTC)[reply]
Yes but look at the reason. If filename is wrong then we can rename. --MGA73 (talk) 14:39, 12 December 2009 (UTC)[reply]
I don't disagree on part of the request, but the requestor also added "it contains inappropriate information about the source" and removed " - geograph.org.uk - 1040807". -- User:Docu at 14:53, 12 December 2009 (UTC)[reply]

Categorization

  1. Match location IDs to Commons categories. Almost done.
  2. Match geograph topic categories with Commons categories. Working on it, see User:MGA73/Sandbox Commons:Batch uploading/Geograph/Sandbox.
  3. Create topic by location categories. Working on it, see here for a list and here for a matrix
  4. Some geograph to Commons category matches turned out to be somewhat strange. Check and correct this list. List here

Multichill (talk) 23:52, 12 December 2009 (UTC)[reply]

For (4): in the list, there are a few matches I don't understand: why does "sea loch" match "lake" rather than "sea lochs"? "Loch" should match "lochs", not "Bodies of water".
Currently there is "Village_sign -> Category:Signs": Is there a way to create just "Category:Village signs in the United Kingdom", etc. to avoid that they go into too general categories?
To avoid problems of the "Churches in Europe" type above, maybe the matching should either work around missing continent links or we should try to run a bot to fix the categories before. I tried using CatScan2 to find such categories, but it seems to time out.
BTW I added a redirect for Bogs. It was missing despite there being "Category:Bogs by country" (we would still need to make UK/IE specific categories). Please check if this works. -- User:Docu at 03:55, 13 December 2009 (UTC)
Category:Sea lochs was missing: fixed that. -- User:Docu at 15:09, 13 December 2009 (UTC)[reply]
Matches are work in progress. A lot of them have been changed.
I don't want the categories to be too specific either so we have to find a balance.
Looks like I tackled all the Europe categories. If I missed some it's easy to fix (create link, use bot to filter the category). Multichill (talk) 16:32, 13 December 2009 (UTC)[reply]
I moved my sandbox to Commons:Batch uploading/Geograph/Sandbox. Better we work there so my own "testing" does not ruin something. --MGA73 (talk) 08:12, 14 December 2009 (UTC)[reply]
For Europe/UK, you got almost all of them: I fixed 11 missing ones: list. -- User:Docu at 15:23, 16 December 2009 (UTC)[reply]

If someone thinks we should have more categories please leave a note Commons:Batch_uploading/Geograph/cat-matrix#Sub-matrix_for_counties. --MGA73 (talk) 09:48, 17 December 2009 (UTC)[reply]

Please note that Category:Trees is a Main category and should not have files added directly to it. Please only use subcategories of it! As it is, I've just been saddled with 95 files which I'll have to recategorise now :-(( Thanks - MPF (talk) 01:59, 31 January 2010 (UTC)[reply]
See #Comments on ongoing upload. Multichill (talk) 09:23, 31 January 2010 (UTC)[reply]

Progress December 2009


This table is to keep track of the progress of the upload. All directories are located in /mnt/user-store/geograph/torrents at the toolserver.

Source dir | Destination dir | Prepared | Imported
geograph_vol001_image_0_to_49999/00 | geograph_vol001_image_0_to_49999_prepared/00 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/01 | geograph_vol001_image_0_to_49999_prepared/01 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/02 | geograph_vol001_image_0_to_49999_prepared/02 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/03 | geograph_vol001_image_0_to_49999_prepared/03 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol001_image_0_to_49999/04 | geograph_vol001_image_0_to_49999_prepared/04 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/05 | geograph_vol002_image_50000_to_99999_prepared/05 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/06 | geograph_vol002_image_50000_to_99999_prepared/06 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/07 | geograph_vol002_image_50000_to_99999_prepared/07 | ✓ Done Multichill (talk) 22:41, 20 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/08 | geograph_vol002_image_50000_to_99999_prepared/08 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol002_image_50000_to_99999/09 | geograph_vol002_image_50000_to_99999_prepared/09 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/10 | geograph_vol003_image_100000_to_149999_prepared/10 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/11 | geograph_vol003_image_100000_to_149999_prepared/11 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/12 | geograph_vol003_image_100000_to_149999_prepared/12 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/13 | geograph_vol003_image_100000_to_149999_prepared/13 | ✓ Done Multichill (talk) 17:20, 23 December 2009 (UTC) | ✓ Done
geograph_vol003_image_100000_to_149999/14 | geograph_vol003_image_100000_to_149999_prepared/14 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/15 | geograph_vol004_image_150000_to_199999_prepared/15 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/16 | geograph_vol004_image_150000_to_199999_prepared/16 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/17 | geograph_vol004_image_150000_to_199999_prepared/17 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/18 | geograph_vol004_image_150000_to_199999_prepared/18 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol004_image_150000_to_199999/19 | geograph_vol004_image_150000_to_199999_prepared/19 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/20 | geograph_vol005_image_200000_to_249999_prepared/20 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/21 | geograph_vol005_image_200000_to_249999_prepared/21 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/22 | geograph_vol005_image_200000_to_249999_prepared/22 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/23 | geograph_vol005_image_200000_to_249999_prepared/23 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done
geograph_vol005_image_200000_to_249999/24 | geograph_vol005_image_200000_to_249999_prepared/24 | ✓ Done Multichill (talk) 20:55, 15 January 2010 (UTC) | ✓ Done

Uploading


I've been told, that new space should be ready this week :-) --MGA73 (talk) 13:00, 25 January 2010 (UTC)[reply]

Yeah, new disk space is there. Batches are ready to be imported. Multichill (talk) 19:29, 27 January 2010 (UTC)[reply]
Assigned to | Progress | Bot name | Category
Multichill | Batches prepared, ready to import | GeographBot | Category:Images from the Geograph British Isles project

Do you have an update on which new batches have been uploaded? This is more exciting than Christmas. --ClemRutter (talk) 18:52, 22 February 2010 (UTC)

No, not really. Images 1 to 250,000 have been uploaded and now we're in the process of getting these images properly categorized. Multichill (talk) 18:44, 24 February 2010 (UTC)

Comments on ongoing upload


were added to Category:Lakes. Shouldn't they appear in some subcategory? -- User:Docu at 10:55, 30 January 2010 (UTC)[reply]

Yep, they should. Looks like some images didn't get a location based category (probably because the location tool wasn't working at the time of compiling). Will have my bot go over them to find better categories. Multichill (talk) 13:31, 30 January 2010 (UTC)[reply]
BotMultichillT is now busy finding better categories for all images which ended up in a topic category (like Category:Lakes) and not in a topic * location category (like Category:Lakes of England). Multichill (talk) 17:26, 31 January 2010 (UTC)[reply]
Recently uploaded File:Mitcham - geograph.org.uk - 107685.jpg is the same as File:Mitcham Railway Station.jpg which I uploaded over two years ago. Also - the remark has been made above - but the way the bot names and categorises images creates a lot of extra work. Ravenseft (talk) 21:00, 31 January 2010 (UTC)[reply]
The images are similar but they are not the same. One has the number 107687 and the other one 107685. --MGA73 (talk) 21:19, 31 January 2010 (UTC)
107685
107687
The images are not the same, see here on the right.
How does the naming of images cause extra work? Multichill (talk) 21:22, 31 January 2010 (UTC)[reply]
Aside from recategorising, the bot has the unfortunate habit of transforming a photographer's blurb into an image title. So one ends up with, for example, File:Old station building - geograph.org.uk - 14935.jpg which provides no clue as to what station we're talking about. Such images will all have to be renamed, as will those where the photographer has wrongly identified or spelt the station name. All extra work in an area where precision is key. Would it not be possible to prevent the bot from uploading images into the Disused Stations categories? Ravenseft (talk) 21:33, 31 January 2010 (UTC)[reply]
Another comment which could be made is that some geograph images do not merit to be uploaded into Commons - the 2nd Mitcham image above being a perfect example. Bulk uploading is simply going to make it harder to spot the wood from the trees. Ravenseft (talk) 21:36, 31 January 2010 (UTC)[reply]
Since the two images are not identical, I fail to see why the second one should not be uploaded to Commons. As for the comment on not putting images in "disused stations", I wonder where the images should be placed? If an image shows a disused station, that looks like a good place to put it. And finally, yes, some images might be named wrongly if they are mislabelled on Geograph. But it is hard for us to know. If mistakes are found, please just add {{rename|new name.jpg}} and it will be fixed. The alternative is that someone (you) uploads 1.5 million images manually. That sounds like an impossible project. --MGA73 (talk) 22:44, 31 January 2010 (UTC)
Disused stations images should be manually uploaded to ensure that they are correctly named and categorised. This is what I and others have been doing until now and it seems to work well. I've put in some requests for renaming so let's see if they are accepted. I also wonder if the bot has picked up other images and put them in non-disused categories - I just have to hope that some other user will find the right category. Ravenseft (talk) 11:32, 1 February 2010 (UTC)[reply]
Thanks for the quick response on the lake images. Overall it seems to work quite well. -- User:Docu at 21:47, 31 January 2010 (UTC)[reply]
This was Commons’ file #6,000,000

It took a lot of work to come this far and still a lot of work to do. But I think it was worth it. Now we have a lot of images of UK etc. (also look to the right) :-) --MGA73 (talk) 08:54, 1 February 2010 (UTC)[reply]

Geographbot


To centralize discussion, we might want to redirect its talk page (User_talk:GeographBot) here. Currently it leads to User talk:Multichill -- User:Docu at 10:51, 30 January 2010 (UTC)[reply]

✓ Done. Multichill (talk) 13:31, 30 January 2010 (UTC)[reply]

Sysadmin instructions for starting/resuming batches


To start or resume a batch you need to know its batch number, which is a number between 00 and 24. Batch numbers 14 and below are suffixed with _new. The following uses batch 09 as an example.

To start or resume a batch:

$ /home/catrope/bin/importBatch 09_new

Be sure to run this in a screen session on hume.

The importBatch script reports to #geographbot on IRC when it starts and stops. Sometimes, the import script will die with a database error and a batch will be reported as having completed even though that's not really the case. To check whether a batch has run to completion:

$ ls -r /home/catrope/upload/09_new | head -n 2
Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.txt
Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.jpg

Verify that Zig_Zags_Sron_a_Chuilinn_-_geograph.org.uk_-_98327.jpg exists on Commons.
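
The naming rule and the completion check above can be approximated with a small Python sketch. The helper names are hypothetical and not part of the importBatch tooling. It also sorts by plain string order, which may differ from what `ls -r` gives under some locales, and it only derives the title to look up; checking that the file actually exists on Commons still has to be done separately.

```python
import os


def batch_dir_name(batch_number):
    """Return the batch name: 00-14 carry a _new suffix, 15-24 do not."""
    if not 0 <= batch_number <= 24:
        raise ValueError("batch number must be between 0 and 24")
    suffix = "_new" if batch_number <= 14 else ""
    return "%02d%s" % (batch_number, suffix)


def last_file_title(upload_dir):
    """Return the Commons title of the last .jpg in the batch directory,
    or None if the directory contains no .jpg files."""
    jpgs = sorted(
        name for name in os.listdir(upload_dir)
        if name.lower().endswith(".jpg")
    )
    if not jpgs:
        return None
    # Underscores in the on-disk name correspond to spaces in the title.
    return "File:" + jpgs[-1].replace("_", " ")
```

For batch 09 this would be called as `last_file_title(os.path.join("/home/catrope/upload", batch_dir_name(9)))`.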

Fords are not the Ford Motor Company. Lukas 3z (talk) 15:51, 1 February 2010 (UTC)[reply]

Uploading over existing redirects?


File:Mill Lane Oversley Green - geograph.org.uk - 129468.jpg was a redirect before it got bulk-uploaded. Now a new image has been uploaded, but the page data still just has a redirect in it. I suspect this may have happened elsewhere too. --bjh21 (talk) 16:47, 1 February 2010 (UTC)[reply]


Please see this edit. Surely it would not be too difficult to teach the bot how to convert Geograph's internal links properly? — RHaworth (Talk | contribs) 19:38, 1 February 2010 (UTC)[reply]

Should it not link to the matching file on Commons (File:Longford Pump - geograph.org.uk - 136930.jpg) then? --MGA73 (talk) 22:06, 1 February 2010 (UTC)[reply]

To some extent, depends how long the upload process is going to take. To do it thoroughly, one would check first if the upload here had been done. Failing that, they can all be created as links to Geograph and a bot can come round and convert them to Commons links as the images get uploaded. — RHaworth (Talk | contribs) 01:12, 8 February 2010 (UTC)[reply]
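
The two-stage approach suggested here (link to Geograph first, switch to a Commons link once the image has been uploaded) could be sketched in Python as follows. It assumes, without confirmation from this discussion, that Geograph descriptions reference other photos as `[[136930]]` or `[[[136930]]]`; the function name and the `uploaded` mapping (photo ID to Commons file name) are hypothetical.

```python
import re

# Assumed Geograph syntax: a photo ID in double or triple square brackets.
_PHOTO_REF = re.compile(r"\[\[\[?(\d+)\]?\]\]")


def convert_geograph_links(text, uploaded):
    """Rewrite Geograph photo references as wiki links.

    uploaded: dict of photo ID (str) -> Commons file name without the
              "File:" prefix, for images already on Commons.
    """
    def replace(match):
        photo_id = match.group(1)
        title = uploaded.get(photo_id)
        if title:
            # Already uploaded: link to the local file page.
            return "[[:File:%s|%s]]" % (title, photo_id)
        # Not uploaded yet: external link back to Geograph.
        return "[http://www.geograph.org.uk/photo/%s %s]" % (photo_id, photo_id)

    return _PHOTO_REF.sub(replace, text)
```

A later bot pass could rerun the same function with an updated `uploaded` mapping to turn the remaining external links into Commons links, provided it works from the original Geograph text.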

The bot identified images of Dartington as being of Darlington, and mis-categorised these images as a result (example). Not sure why.--Nilfanion (talk) 11:27, 2 February 2010 (UTC)[reply]

See the "this tool" link in the big yellow template for the source of this problem. Multichill (talk) 08:52, 3 February 2010 (UTC)
Commons' file #6,000,000.
Category:Ullswater has 76 files (up from 7 files)

Thanks to this upload, the category now has a much larger selection of images. A month ago, there were only 7 or 8 subcategories.

Similarly Category:Ullswater, home to image #6,000,000, is now at 76 files from 7 files a week ago. Not all of the additional ones are new though, but the existing ones were categorized (manually) with the new ones. -- User:Docu at 07:08, 3 February 2010 (UTC)[reply]

That's great! It's nice to use sumitup on the new categories, see for example Category:Angle Tarn (Langstrath). See User:Multichill/monobook.js for an easy link (should probably be a gadget btw). I have a simple bot to create categories based on Wikipedia articles, this also includes information from sum-it-up (used it to create all the village categories). If other people find this useful I can publish it somewhere. Multichill (talk) 08:21, 3 February 2010 (UTC)[reply]
Great work indeed. Sum-it-up is very nice, but would be even better if, for long descriptions, it encapsulated them in {{Mld}} or if that's too complicated, in {{Collapsed}}. --Foroa (talk) 09:42, 3 February 2010 (UTC)[reply]
Already asked Magnus, but he doesn't seem to be very responsive. Multichill (talk) 10:05, 3 February 2010 (UTC)[reply]
Maybe the five-year warranty expired and WikiMedia didn't pay for its extension ;) If you send him a patch, he will probably apply it.
I agree that descriptions with sumitup would be nice to have, but I wouldn't mind a few tweaks to its layout either (e.g.). Ideally, an empty category page would fill itself directly with the text.
Additionally, for that to work here, I will have to write a series of articles first. Usually it's the other way round: there is a set of articles about lakes and they need images. Obviously, once they exist, maybe a bot can create the category descriptions. -- User:Docu at 16:53, 3 February 2010 (UTC)[reply]

Cats to clean up


I made Commons:Batch uploading/Geograph/cats to clean up. Big categories should (and will) first be checked by a bot; then it would be nice if users could help out. Multichill (talk) 22:10, 3 February 2010 (UTC)

This seems to be dumping a lot of images into Category:Buildings. Normally, I try to keep that category under control by finding better places to put virtually all images that are dropped there, but this is far more than I will take on. I hope someone plans to move these to some more geographically specific categories. - Jmabel ! talk 01:24, 5 February 2010 (UTC)[reply]
Hi Jmabel, did you see the part about categories first need to be checked by a bot? The bot reduced the number of items in Category:Buildings to a more acceptable level. Multichill (talk) 09:17, 5 February 2010 (UTC)[reply]
Can I just point out that the subject categories are not the only ones with severe problems. The high-level geographic categories have also been completely overwhelmed. (5 of the worst examples are: Category:England, Category:Scotland, Category:North Yorkshire, Category:Devon and Category:Aberdeenshire, which currently contain 2276, 1612, 1151, 1060 and 1041 files respectively.) Providing more precise location details is likely to be difficult for a bot, as if a more exact location were available in the Geonames database then Geographbot would have been able to use that on initial upload.
Jmabel has hit my concern here, many of these categories are not maintained by large numbers of users so the addition of over 1000 images in a few days will take weeks to clean up.
The obvious bot fix to this problem is to make a bunch of category intersects (or use the existing ones), so instead of Category:Beaches and Category:Cornwall it would be Category:Beaches of Cornwall (The in/of issue needs addressing but that's not directly relevant to this). The intersection of major subject and county level information should work for the majority of images for the time being, and will prevent both the subject and location categories from being flooded. Ideally this should be done at upload and not have another (set of) bots running around afterwards. Could modifications be made to GeographBot before it starts on the next batch?--Nilfanion (talk) 03:11, 5 February 2010 (UTC)[reply]
As you might have noticed we created a lot of intersection categories prior to this upload. If you create intersect categories these will be used in future uploads (no need for modifications). To intersect current categories I created {{Intersect categories}} a while ago. I added it to Category:Beaches of Cornwall. I'm not sure if it will catch a lot of images, let's see.
As a side note. At User:Multichill/Location based categorization I put an idea about location based categorization and how it could be used and improved. Multichill (talk) 09:17, 5 February 2010 (UTC)[reply]
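
The intersection rule discussed in this thread (a file that sits in both a topic category and a location category is moved into their intersection) can be illustrated with a short Python sketch. This is not how {{Intersect categories}} is implemented, which is not described here; the function and argument names are hypothetical.

```python
def intersect_move(file_categories, intersection, parent_a, parent_b):
    """Return the file's category list after applying one intersection.

    If the file is in both parent categories, both are replaced by the
    intersection category; otherwise the list is returned unchanged.
    """
    cats = list(file_categories)
    if parent_a in cats and parent_b in cats:
        cats = [c for c in cats if c not in (parent_a, parent_b)]
        if intersection not in cats:
            cats.append(intersection)
    return cats
```

For example, a file in Beaches and Cornwall would end up in Beaches of Cornwall only, while a file in just one of the two parents is left alone. Files with more than two relevant parents would need several passes, one per intersection category.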
I agree that it would be better if the right categories were added during upload. For that to happen we need: 1) the information to be on the image at Geograph so the bot knows where to put it; 2) all relevant categories to be created and categorized correctly; 3) the bot to be adjusted correctly.
We tried to make the most relevant categories (Commons:Batch uploading/Geograph/cat-matrix) but for some reason some images did not end up in the right categories. One of the problems is "bad names": for example, the top category is Category:Water wells but the subcategories are named Category:Wells in England. The bot does not like that for some reason.
Right now we are trying to fix the problems with help from bots. So if you find categories with way too many images in them, put a note here and we will look at it. If it is just a handful of images, please wait or do it manually. Any help is welcome. --MGA73 (talk) 09:26, 5 February 2010 (UTC)
I'd point out that I've already made ~500 manual edits relating to this (mostly from Category:Devon), though I'm doing fundamental work on improving that category and its tree as opposed to just the simpler upload fix I suggested. There are some problems I'd like to see addressed:
  1. The cat-matrix didn't work perfectly, the water wells being a prime example. Another problem seems to be examples that got placed in * in England categories, when * in county already existed and was in the matrix. File:Bull Field - geograph.org.uk - 164427.jpg should have been placed in Category:Fields in Somerset but was instead put in Category:Fields in England, when the bot also correctly identified the appropriate village in Somerset.
  2. The need for hundreds of county-level intersects. It's hard to identify just which ones are required, but I'd suggest that if any subject category, or one of the subject in England/Scotland/Wales/N Ireland (or UK) categories, has more than 100 Geograph images at present, then we ensure that it has a full county-level breakdown and that the bot will use it. Even those county-level subject categories are going to become unwieldy after the complete upload (I'd estimate Category:Fields in Somerset is liable to have well over 500 images), but that is a reasonably precise category. Could you provide a full breakdown of the categories used by the upload bot?
  3. When the location tool was not working for whatever reason, the bot went ahead and just added to subject-only categories. If this occurs again will the bot do the same thing again or will it notice this and not upload (and try the tool again)?
Oh and as for categories with way too many files: Many of the subcategories of Category:Counties of England, Category:Counties of Scotland, Category:Counties of Wales and Category:Counties of Northern Ireland have in excess of 100 files, and Category:County Tyrone is exceptionally bad with 2,600+ files. If any category has so many files in it that only subcategories up to F are displayed I'd say its usability has been seriously affected and needs major work to fix.--Nilfanion (talk) 12:45, 5 February 2010 (UTC)[reply]
Such regionally intersected categories can help improve categorization by breaking it down to a reasonable size, but we shouldn't forget to find additional categories besides these basic ones. Many topics with "by country" structures also have a detailed "by sub-topic" categorization that can get neglected. -- User:Docu at 13:00, 5 February 2010 (UTC)

Oh, and a suggestion: when uploads start again, please do a relatively limited test batch (~5-10,000?) to see if there are still further problems with categorisation or if it's sorted. A bot shouldn't be making any errors like the Somerset field example I just gave; the fact that it did is indicative of a problem. I like the idea of having many of these files (some are junk, but lots are useful), but if it's making the UK location categories unusable, that is serious harm.--Nilfanion (talk) 12:50, 5 February 2010 (UTC)

Agreeing with Nilfanion here. The Devon cat which I have worked on since I arrived here three years ago went from quite well organised to a bomb zone in no time at all. There is a stack of work to do and very few people dealing with this aspect. Care and thought before further dumps please --Herby talk thyme 13:19, 5 February 2010 (UTC)[reply]
Category:Gwynedd is an utter disaster. I've been working on the Category:Conwy County Borough and wondering why so few files are present. Now I know the answer - a large number of them are categorised in Gwynedd! The same is true for many files that should be in Category:Anglesey. The Gwynedd cat now has nearly 4,000 unsorted files, many of which are for places not even in that county; I can't even begin thinking about sorting that lot out. Something needs to be done. (PS GeographBot seems to live in the past: the old Gwynedd included these areas but the present county does not.) Anatiomaros (talk) 18:35, 6 February 2010 (UTC)
This seems to be a widespread problem. For example (though on a much smaller scale), Category:Horsell contains (or contained) images from Woking, Mayford and Kingfield. These are different places and have their own cats. How and why does this happen? In particular, how does an image titled "Mayford Roundabout" end up in Category:Horsell, several miles away? Arriva436talk/contribs 18:56, 6 February 2010 (UTC)
Gwynedd seems to be a separate problem that needs to be looked into. If one wants to look at images other than those from Geograph in the meantime, CatScan can help
http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php?wikifam=commons.wikimedia.org&basecat=Gwynedd&basedeep=1&mode=ts&templates=Geograph&untagged=on&go=Scan&format=html&userlang=en
Things like "Mayford Roundabout" probably happen when coordinates of the image are closer to Horsell than of any other place. -- User:Docu at 19:10, 6 February 2010 (UTC)[reply]
There are a couple of limitations with that suggestion: 1) most users wouldn't know how to generate that catscan, and 2) sub-categories like these are hard to reach. That's not to mention the additional strain on the toolserver that all the queries may generate. I've got two ideas here: 1) remove all unreviewed files from the county categories and place them in a special sub-cat; 2) add a header template to the badly affected categories explaining the situation and giving clear links to catscan and the second half of the category list.
With respect to the Mayford Roundabout, the bot ignores the filename and goes with what the geonames database returns. This causes an additional problem for commonly used place names (I've had to remove a bunch of images from Category:Luton as they were associated with Luton, Devon, not Luton, Beds.).--Nilfanion (talk) 19:30, 6 February 2010 (UTC)
OK, that makes sense. But how many other images that have been uploaded could be in the wrong category, in a similar manner to Mayford Roundabout? Also, when they are checked, is it likely for the checker to miss the fact it's in the wrong category?
Another issue (going slightly off topic) is how bad some of the geocoding is. At Commons we aim for an accuracy of 7 metres; in some cases the Geograph images are miles out. Going back to Mayford, there's a large number of Mayford images that were taken in different places, but all show up as being taken in the same place in the middle of a field. Arriva436talk/contribs 20:04, 6 February 2010 (UTC)
The inaccuracy of the geocoding is unfortunately inevitable up to a point. Geograph's goal is to photograph every grid square, and most images uploaded there are placed no more precisely than that. When we translate that to here, those images get placed at the centre of the grid square; in the case of Mayford, the centre of SU9956 is in some fields. There is no way to obtain more precise information unless you can place them more accurately yourself using whatever your favourite geocoding tool is.--Nilfanion (talk) 20:23, 6 February 2010 (UTC)
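To illustrate why every image known only to the 1 km square lands on the same point, here is a small sketch that turns an Ordnance Survey grid reference into the easting/northing of the square's centre. The function name is an assumption for illustration; the letter arithmetic is the standard National Grid layout (25 letters, no "I").

```python
def grid_square_centre(gridref):
    """Return (easting, northing) in metres of the centre of an OS grid square.

    Accepts references such as "SU9956" (1 km square) or "SU 39500 13500".
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(digits) % 2 != 0:
        raise ValueError("grid reference needs an even number of digits")

    # Letter indices in the 25-letter grid alphabet (A-Z without I).
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Position of the 100 km square, counted from the false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    half = len(digits) // 2
    size = 10 ** (5 - half)  # side of the referenced square in metres
    easting = e100km * 100_000 + int(digits[:half] or 0) * size
    northing = n100km * 100_000 + int(digits[half:] or 0) * size
    return easting + size // 2, northing + size // 2


print(grid_square_centre("SU9956"))  # (499500, 156500)
```

Any photo recorded only as SU9956 therefore maps to the single point 499500, 156500, whatever its true position within the square.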
To avoid that people get overwhelmed by geograph images in categories, I added a sortkey to Template:Check categories-Geograph. This should make these images appear after the already checked ones. -- User:Docu at 13:57, 16 February 2010 (UTC)[reply]

Linking


Seemingly as part of the Geograph batch upload, categories have been created based on the en.Wikipedia article. These appear to take the first part of the lead of the relevant article and convert the links to cross-wiki ones. However, as on the original en.Wiki page (which doesn't link to itself), the Commons categories don't include the most important link, i.e. the link to the article the category is for. An example is Category:Burpham, Surrey, which links to everything except en:Burpham, Surrey. The difference can be seen by looking at Category:Merrow, Surrey, where I have fixed the issue. I know it does have the link in the toolbar down the side, but this isn't very obvious, especially when the description is covered in other links.

Another funny thing I've noticed is that two images taken in Burpham, and with Burpham in the file name, were in Category:Merrow, Surrey. Is there an explanation for this (for example, the bot takes the general area Geograph gives, which in this case would be Merrow)? Arriva436talk/contribs 21:00, 4 February 2010 (UTC)

Top categories


At Commons:Batch uploading/Geograph/top categories I put a list of the most populated categories. When most of the bot work is done at Commons:Batch uploading/Geograph/cats to clean up, I'll shift my bot's attention to these categories. Multichill (talk) 13:19, 6 February 2010 (UTC)

For roads, farms and forests, a regional break-down would probably help. I guess it's unlikely that this ends up in Category:Loch Ness? If yes, I will try to work on that. -- User:Docu at 13:54, 6 February 2010 (UTC)[reply]
One potential concern looking at those lists are ones like Category:Roads in Devon, where the subject/location intersect is already at county level but there are still nearly 1,000 images in that category. I'm not as concerned by that as I am by county or subject categories with too many files, but I'm less sure of how to tackle it. I can see two ways to split that:
  1. By subject: The Roads in * are in particular awkward as I can't see that many natural subtypes. We can split out the classified roads, but that doesn't address the large number of images of unclassified roads. We could also split out signs, roundabouts, junctions etc.
  2. By location: Could always take it down to the next tier - which would be districts (so have Category:Roads in West Devon etc).
I'm cautious about having subject by location categories with the location being that precise. I'd prefer greater precision in the subject, so I guess we take the first option as far as it can go and see how bad the situation is then.--Nilfanion (talk) 17:47, 6 February 2010 (UTC)

Source


Two suggestions for the "Source" item: a) I hate to see links from the general to the specific and b) given that Geograph is all about grid refs, it seems a pity that grid ref info should be lost altogether. I suggest that instead of:

From geograph.org.uk

we should have:

from this image at geograph.org.uk for grid square SK0754

RHaworth (Talk | contribs) 02:28, 8 February 2010 (UTC)[reply]

Bad categories at upload


Obviously the bot is working within the limitations of the database and will make errors (be best if we could alter the db...). However, I've seen a number of problematic categories where the name relates to multiple settlements or completely different topics. In one specific instance, BotMultichillT readded an inappropriate location after I had already removed it once (an accurate location category has been provided).

List of ones I have found so far: Category:Luton, Category:Dartington, Category:Saint Agnes and Category:Bradford. As for how to handle this, perhaps have the bot check that the image is approximately in the right place for that location. If the location is more than 50 km from Luton, Bedfordshire, then it probably doesn't belong in Category:Luton. That's a complication, but it should only matter for a relatively small number of categories - could we have a specific check for them?--Nilfanion (talk) 00:52, 16 February 2010 (UTC) Bound to be more, c
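The distance check suggested above could be sketched roughly as follows. The table of ambiguous categories, its coordinates (rounded, approximate values for Luton, Bedfordshire) and the function names are assumptions made for illustration; only the 50 km threshold comes from the comment.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (haversine formula)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical table: ambiguous category -> (lat, lon of the intended place, max km).
AMBIGUOUS = {
    "Category:Luton": (51.88, -0.42, 50.0),  # Luton, Bedfordshire (approximate)
}


def plausible(category, lat, lon):
    """Return False when an image is too far away for an ambiguous category."""
    if category not in AMBIGUOUS:
        return True  # no special check for this category
    clat, clon, max_km = AMBIGUOUS[category]
    return distance_km(lat, lon, clat, clon) <= max_km


print(plausible("Category:Luton", 51.90, -0.40))  # near Luton, Bedfordshire
print(plausible("Category:Luton", 50.58, -3.55))  # roughly where Luton, Devon is
```

Only the handful of categories listed in the table pay the cost of the extra check; everything else passes straight through.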

File name


Is there any need to put ' - geograph.org.uk -' in the filename? We do not insist on it when people do pure hand uploads, so why do we need it on bot or tool uploads? — RHaworth (Talk | contribs) 01:28, 22 February 2010 (UTC)[reply]

Yes and no. If someone uploads an image without it, there is no reason to move the image to get it into the file name. The same goes the other way around. Normally we let the uploader decide the name. Personally I think it is a good way to identify where the image came from, and it looks like Multichill thinks the same. --MGA73 (talk) 09:31, 22 February 2010 (UTC)
In general, I think it's a good idea to identify such sources in the filename, even when doing manual uploads. Personally, I don't think the chosen way is ideal, but just "Geograph" wouldn't have been of much use. -- User:Docu at 17:25, 24 February 2010 (UTC)[reply]

Image corruption


These ten Geograph images produced corrupt results here. Since it does not appear to be the bot's fault, I have reported it at the Village pump. — RHaworth (Talk | contribs) 01:28, 22 February 2010 (UTC)[reply]

I tried to save the image and upload it manually. I got an error (Files of the MIME type "application/x-php" are not allowed to be uploaded.) so it looks like it is not the bot. The easy way to fix it is to download the files, fix the problem and click "Upload a new version of this file" at the bottom of the corrupted file pages. --MGA73 (talk) 09:37, 22 February 2010 (UTC)
These images should be fixed now. But would be nice if we could find all images like that. --MGA73 (talk) 10:02, 22 February 2010 (UTC)[reply]

Confirmed to be a known bug. No need to wait for the bug to be fixed - just edit the images as you have done. Yes, for the stuff done by the bot, we can find all corrupt images. They will be in this list of Geograph images where I have not found the corresponding image here. The list started with about 210 items. I have whittled it down and will continue to do so. But if anyone else wants to join in … The object is to find the image and add a valid {{Geograph}} tag to it. — RHaworth (Talk | contribs) 17:08, 24 February 2010 (UTC)[reply]

ONO it's OLU


The bot has done a very thorough job - with one curious exception: it has refused to copy any of these 800 images uploaded by Geograph user 14997, OLU. The few images that have been uploaded have all been done by people. Why?? User 9685, Wilson Adams seems also to have been ignored. — RHaworth (Talk | contribs) 17:08, 24 February 2010 (UTC)[reply]

Geotags


This area seems to be a bit shaky. By chance I happened over to Durham, to upload some of my own images. Yes I had to play a bit with categories as I expected- Durham the city, in County Durham the county. Yes all the geograph images were geotagged but

  • a few were accurate.
  • a few were 112 m from where the photographer stood.
  • a few tagged the object accurately
  • a few tagged the object but missed by 112 m
  • a few tagged the grid square- but the +- 500m precision made it difficult to see if the intended point was 112m out.

It looks as if the bot is cascading down from WGS84 to OSGB to grid square. This is fine, but if we could tag each location with which of these it was, a further bot could then do the Helmert conversion to wipe out the discrepancy. I have put a comment on Commons talk:Geocoding asking for ideas.--ClemRutter (talk) 18:25, 11 March 2010 (UTC)


Status

Assigned to | Progress | Bot name
Multichill | Did first 250.000, still 1 million files left to do | GeographBot

The upload is complete and only categorization is now going on. Or do we expect more images?--Diaa abdelmoneim (talk) 12:54, 26 March 2010 (UTC)[reply]

I did the first 250.000, still 1 million files left to do. Multichill (talk) 17:43, 26 March 2010 (UTC)[reply]
Any idea on timescales for that? For what it's worth, I'm only a fraction of the way through sorting the categories from the first upload in the one county that I'm concentrating on...--Nilfanion (talk) 21:40, 26 March 2010 (UTC)
Not sure. Depends on the torrents being available. Multichill (talk) 09:46, 2 April 2010 (UTC)[reply]

Automatic categorization update


I modified the bot to use the OpenStreetMap tool (the one included in {{Check categories-Geograph}}). Looks like I'm now able to find better categories for the overcrowded categories. Commons:Batch_uploading/Geograph/cats to clean up and Commons:Batch uploading/Geograph/top categories will be rescanned. Multichill (talk) 19:30, 5 April 2010 (UTC)[reply]

Good news. It did seem to me that the link looked better than the result. Can it help fine tune some of the localities too? -- User:Docu at 19:43, 5 April 2010 (UTC)[reply]
Yes, and please be aware that the old "preserved county" of Gwynedd which the first batch uploads used is about twice the size of the present-day county and included all of Anglesey, the western half of Conwy County Borough and a small part of Denbighshire. Sorting that lot out is still not completed, although I think I've managed to "reclaim" most of the Conwy files : please, please be very careful that Geographbot doesn't undo all my work (many hundreds of files!) and that the same thing doesn't happen again. Also I'd agree that sometimes the localities are not always perfect but that's a relatively minor problem if the files at least end up in the right county category. Thanks, Anatiomaros (talk) 21:57, 5 April 2010 (UTC)[reply]
I just did Category:Churches in England. This reduced the number of images from 1900 to 400. I'm using a different source (OSM) now. I hope it doesn't use old information, so that files end up in the right county. I just fired up Category:Roads in Scotland, so keep an eye on it. Multichill (talk) 08:25, 6 April 2010 (UTC)
Would it be possible to have the bot stop putting images in Category:Cleveland, Ohio? - EurekaLott (talk) 14:58, 7 April 2010 (UTC)[reply]
Either nuke Category:Cleveland or create Category:Cleveland, England. Multichill (talk) 16:57, 7 April 2010 (UTC)[reply]

Anglesey, Gwynedd and Palestine


Gwynedd seems to have produced not a few problems for Geographbot. Bethesda is not the same as Bethesda: several dozen files for the Gwynedd town ended up in Palestine. The Bangor area seems to confuse the bot. Most of the files for the cathedral city of Bangor ended up across the Menai Straits in Category:Menai Bridge (in Anglesey). Conversely, a lot of files for the Bangor area ended up in Anglesey, including a file named 'Bangor Cathedral' which was put in the category for Llandegfan - a tiny village in Anglesey - and in the main Gwynedd cat and that despite the fact that it was correctly placed in Category:Bangor Cathedral, which is the only category it needed. Seems a bit bizarre! Quibbles apart, I think we should move on to the next batch(es) - better to have the remaining files now and get on with it, perhaps? Any news on the torrent availability yet? Anatiomaros (talk) 22:35, 2 May 2010 (UTC)[reply]

Use this trick to clean up disambiguation problems like this one. I proposed a move of Category:Bethesda.
I haven't heard anything about new torrents yet. I'll contact them again in the next couple of weeks. Multichill (talk) 11:32, 3 May 2010 (UTC)[reply]
Thanks, that seems to work. By the way, I've been working - slowly! - on the files cluttering up Category:Wales, and one thing I've noticed is that most of them were placed in deeper cats as well, e.g. this one was placed in *Farms in Wales and also in the parent Wales cat, which is completely unnecessary and creates a great deal of work. I hope there is some way of fixing that for the next upload, as the bot should avoid placing files in higher/parent categories except as a last resort, i.e. if it's not able to find a deeper category. Anatiomaros (talk) 18:30, 4 May 2010 (UTC)
An even more bizarre example is this file, correctly placed in *Carmarthenshire but also in both the *Wales and *United Kingdom categories... Anatiomaros (talk) 18:48, 4 May 2010 (UTC)[reply]
See the logic of the bot. If the overcategorization filter is acting up, you'll get these kind of results. Multichill (talk) 21:59, 4 May 2010 (UTC)[reply]
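For readers wondering what an overcategorization filter does in principle, here is a minimal sketch: it drops any category that is an ancestor of another category already assigned to the file. This is not the bot's actual logic (which is linked above); the function name and the small parent map are hypothetical.

```python
def remove_parent_categories(categories, parents):
    """Drop every category that is an ancestor of another assigned category.

    `parents` maps a category name to the list of its direct parent categories.
    """
    def ancestors(cat):
        seen, stack = set(), list(parents.get(cat, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen

    redundant = set()
    for cat in categories:
        redundant |= ancestors(cat)
    return [cat for cat in categories if cat not in redundant]


# Hypothetical fragment of the category tree.
parents = {
    "Carmarthenshire": ["Wales"],
    "Farms in Wales": ["Wales"],
    "Wales": ["United Kingdom"],
}

print(remove_parent_categories(["Carmarthenshire", "Wales", "United Kingdom"], parents))
```

If the parent map is incomplete or the lookup fails, nothing is recognised as redundant and the file keeps all its categories, which would produce results like the Carmarthenshire example above.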
I added a static mapping from geonameId 2655804 to category:Bethesda,_Gwynedd. Multichill (talk) 10:27, 30 May 2010 (UTC)[reply]

Broken Geotags


And several others all tagged to the same incorrect location. There are several other bot mistagged geograph images in Southampton. I do not know how widespread this is, but it needs to be addressed. 188.222.170.156 17:50, 2 May 2010 (UTC)[reply]

The locations are probably just lacking precision. Geograph works by a number of squares on a grid. You can fix this by adjusting the coordinates. -- User:Docu at 18:04, 2 May 2010 (UTC)[reply]
Yes and no. Looking at the cigarette factory, the bot has captured the user coordinate, which is measured in OSGB36, then displayed it on the map, which uses WGS84. No Helmert conversion has been done, so we expect a 112 m inaccuracy. Why the others are there is open to question.--ClemRutter (talk) 09:23, 3 May 2010 (UTC)
It would be interesting to know what's in the input used for the upload. -- User:Docu at 10:25, 3 May 2010 (UTC)[reply]
See top: these dumps of the database. Multichill (talk) 11:33, 3 May 2010 (UTC)[reply]
There are several other incorrectly marked locations in Southampton, some like the above, are all marked to one incorrect location. Here's an example of one on its own though, File:Redbridge Flyover, Southampton - geograph.org.uk - 28728.jpg, bot copied from here. The source states a location of 50°55.2873N 1°28.0549W (translates to DMS 50:55:17.238N 1:28:3.294W), yet the Commons upload translates this to 50:55:11.09N 1:28:4.39W. There needs to be a manual sweep of geograph bot uploads, because right now, it's poisoning the tools that use this information such as map layers. Suitcivil (talk) 21:09, 4 May 2010 (UTC)[reply]
Just spotted that something very similar is mentioned above at #Geotags. Suitcivil (talk) 21:16, 4 May 2010 (UTC)[reply]
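The degrees and decimal minutes quoted for the Redbridge Flyover can be checked with a few lines of arithmetic. The helper below is an illustrative sketch (the function name is an assumption); the input values are the ones given in the comment above.

```python
def decimal_minutes_to_dms(degrees, minutes):
    """Convert degrees + decimal minutes to (degrees, minutes, seconds)."""
    whole_minutes = int(minutes)
    seconds = round((minutes - whole_minutes) * 60, 3)
    return degrees, whole_minutes, seconds


print(decimal_minutes_to_dms(50, 55.2873))  # (50, 55, 17.238)
print(decimal_minutes_to_dms(1, 28.0549))   # (1, 28, 3.294)

# Latitude gap between the source (17.238") and the Commons value (11.09").
gap_arcsec = 17.238 - 11.09
print(round(gap_arcsec, 3))
```

One arcsecond of latitude is roughly 31 m on the ground, so a gap of about 6 arcseconds corresponds to something in the region of 190 m.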
At least in the case of 28728 (Redbridge Flyover) I believe that is our (Geograph's) fault. Multichill's GeographBot just copies the lat/long from the dump files mentioned above. Inspired by this thread I went digging, and it seems that the Geograph code doesn't update that lat/long when an image is relocated within a gridsquare. So the lat/long for that image is the old location, from before it was updated via the Geograph site. We never noticed because it's just cached there for the search engine; the actual photo page calculates it live from the easting/northing. The best I can offer is to create a bot to check that the lat/long columns (as used in the dump) are correct, update them as necessary, and create a new dump. I will also have the script output a changelog, so it can be used by another bot to correct the coordinates within Wikimedia. I have no way of estimating how many images will be affected by this yet. (Oh, and of course I will fix the bug within Geograph!) Sorry all for the confusion! BarryHunter (talk) 23:30, 4 May 2010 (UTC)
I just checked the four images mentioned in the opening to this section, and yes all have been moved within the gridsquare, so will have inaccurate lat/long in the version uploaded by GeographBot BarryHunter (talk) 23:34, 4 May 2010 (UTC)[reply]
Look, we have a million and a quarter excellent images, and potentially a million and a quarter inaccurate geotags. The errors are multiple. File:Church Street, Shirley, Southampton - geograph.org.uk - 26621.jpg back-converts to SU 39500 13500, giving 100 m accuracy (or a hell of a coincidence), but going back to Geograph we see that this is the subject location. In this case it is also noted as the photographer's location, but in the previous one I checked the photographer was at a different location! Here, from the description, the photographer was at //location dec|50.9223|-1.4333// or SU 39929 13802, so again different.
Could I propose that we upload all the images as soon as possible but with a new template {{Geoglocation}}? This would include all the geo information from Geograph and all the parameters of {{Location}}. A bot could then be written that carefully verifies the input fields and, when satisfied, either copies the correct photographer information, accurately converted from OSGB36 alphanumeric into WGS84, into {{Location}}, or adds a textual comment that the object location was XXXX XXXX and the degree of precision. This would also allow these images to be put in a to-be-checked category. We can't do this at the moment, by bot or manually, without referring back to the Geograph page, as not all the relevant fields have been copied across.
I do propose a two-part solution, as so many images are erroneous that one no longer has confidence in any one of them being correct, and the UK map is being saturated with errors. Once we have got all the data on hand we can discuss the best algorithm to use in order to geotag them correctly. Initially I would be happy if we could generate our own photographer's lat/long by running the Helmert transformation on the photographer's location grid reference. I have the code here in js [1].
Just as a summary, here are the errors I have found
  • Sloppy tagging. The Geograph user just typed in the wrong reference.
  • Geograph works to 100m precision- we attempt at least 7m precision
  • Geograph tags the object not the photographer
  • Geograph gives the object location in OSGB based alphanumeric grid references and WGS84: we use WGS84
  • Geograph gives the photographer location in OSGB based alphanumeric grid references only: we use WGS84
  • Geograph fills an empty photographer location field by copying in the object location
The good news is that Geograph and I get the same result when we run a Helmert conversion (OSGB36 to WGS84).
The cumulative effect of all these errors is impossible to quantify, but runs from 112 m ± 100 m through to several kilometres.--ClemRutter (talk) 14:31, 5 May 2010 (UTC)
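For reference, a Helmert conversion of the kind mentioned above can be sketched in a few dozen lines. This is not the js code linked in the comment; it is an independent Python sketch that assumes the commonly published Ordnance Survey seven-parameter values for OSGB36 to WGS84 and the Airy 1830 and WGS84 ellipsoid constants, all of which should be checked against the OS documentation before any real use. Ellipsoidal height is taken as zero, and the function names are illustrative.

```python
import math

# Ellipsoids as (semi-major axis a, semi-minor axis b) in metres.
AIRY_1830 = (6377563.396, 6356256.909)
WGS84 = (6378137.0, 6356752.314245)

# Assumed Helmert parameters, OSGB36 -> WGS84:
# translations in metres, scale in ppm, rotations in arcseconds.
TX, TY, TZ = 446.448, -125.157, 542.060
S_PPM = -20.4894
RX, RY, RZ = 0.1502, 0.2470, 0.8421


def to_cartesian(lat, lon, ellipsoid, h=0.0):
    """Geodetic latitude/longitude (degrees) to geocentric x, y, z (metres)."""
    a, b = ellipsoid
    e2 = 1 - (b * b) / (a * a)
    phi, lam = math.radians(lat), math.radians(lon)
    nu = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)
    x = (nu + h) * math.cos(phi) * math.cos(lam)
    y = (nu + h) * math.cos(phi) * math.sin(lam)
    z = (nu * (1 - e2) + h) * math.sin(phi)
    return x, y, z


def to_geodetic(x, y, z, ellipsoid):
    """Geocentric x, y, z (metres) to latitude/longitude in degrees."""
    a, b = ellipsoid
    e2 = 1 - (b * b) / (a * a)
    p = math.hypot(x, y)
    phi = math.atan2(z, p * (1 - e2))
    for _ in range(10):  # fixed-point iteration; converges in a few rounds
        nu = a / math.sqrt(1 - e2 * math.sin(phi) ** 2)
        phi = math.atan2(z + e2 * nu * math.sin(phi), p)
    return math.degrees(phi), math.degrees(math.atan2(y, x))


def osgb36_to_wgs84(lat, lon):
    """Approximate datum shift of an OSGB36 latitude/longitude to WGS84."""
    x, y, z = to_cartesian(lat, lon, AIRY_1830)
    s = 1 + S_PPM * 1e-6
    rx, ry, rz = (math.radians(r / 3600) for r in (RX, RY, RZ))
    x2 = TX + s * x - rz * y + ry * z
    y2 = TY + rz * x + s * y - rx * z
    z2 = TZ - ry * x + rx * y + s * z
    return to_geodetic(x2, y2, z2, WGS84)


print(osgb36_to_wgs84(52.6575703, 1.7179216))
```

The shift this produces is on the order of a hundred metres across Great Britain, which is the size of discrepancy described in this thread when OSGB36 values are plotted as if they were WGS84.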
The gridimage_geo database table, from the geograph dumps (already loaded by Multichill onto the Toolserver), contains the full eastings/northings for photographer location as accurate as geograph has it, many are 10m precision. This is more reliable than the wgs84_lat/long columns already used by GeographBot - due to the bug mentioned above. BarryHunter (talk) 17:08, 5 May 2010 (UTC)[reply]
I've corrected the column on Geograph, and will recreate the actual dump files shortly. However, I have put the changelog here: http://data.geograph.org.uk/tmp_fix_log.mysql.gz - which someone can use to create a bot to correct the coordinates. I've put in the old coords too, so the bot can replace them only if somebody hasn't already updated them on Wikimedia. I'd do it myself, but having never created such a bot, I would probably just wreak more havoc! 776 images below id 250000 were corrected, so that's a ballpark figure for the number of images affected on Wikimedia. (And yes, this is still our subject location.) BarryHunter (talk) 22:55, 5 May 2010 (UTC)
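The "replace only if nobody has touched it" rule described above might look like the sketch below. The changelog's real column layout is not shown in this discussion, so the tuple format, the tolerance and the function name are assumptions for illustration only.

```python
def apply_fix(current, old, new, tolerance=1e-5):
    """Return the coordinate a file page should carry after the fix.

    current: (lat, lon) presently on the file page
    old:     (lat, lon) from the faulty dump, as recorded in the changelog
    new:     (lat, lon) corrected value from the changelog
    The new value is used only when the page still shows the old dump value,
    so manual corrections made in the meantime are left alone.
    """
    unchanged = (
        abs(current[0] - old[0]) <= tolerance
        and abs(current[1] - old[1]) <= tolerance
    )
    return new if unchanged else current


# Hypothetical changelog entry and two possible page states.
old, new = (50.91975, -1.46789), (50.92146, -1.46758)
print(apply_fix((50.91975, -1.46789), old, new))  # still the dump value: fixed
print(apply_fix((50.92150, -1.46760), old, new))  # edited by hand: kept
```

A small tolerance is used rather than exact equality because the coordinates on the page may have been rounded when the template was written.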
One comment which is related here. Regarding subject / photo locations: We (Commons) care about the photo location for geolocation purposes, but the subject location for effective categorisation... (So if different, don't throw out the subject location data).--Nilfanion (talk) 23:03, 5 May 2010 (UTC)[reply]
I'll put the temp table on the toolserver too and use it to correct the incorrect geotags here. Don't know when. Multichill (talk) 13:19, 22 May 2010 (UTC)[reply]
I have just discovered that commons has a {{Object location}} tag- that may be useful. I really don't know any more, and I have never seen it in use. --ClemRutter (talk) 22:12, 22 May 2010 (UTC)[reply]
Ok, fixing geotags now. Multichill (talk) 15:27, 14 August 2010 (UTC)[reply]

Update May 2010


Hi guys, it's time for an update!

  • I spent some time on categorization. I'm now using the OSM source and the results seem to be much better. I'm thinking about combining both sources, but I'm not sure about that yet (the Gwynedd problem would happen again).
  • No news on new batches. No new torrents available yet and no word from the Geograph guys yet. When we're going to upload new batches, the categorization for these new batches should be much better because the category tree is much more extensive now.

Thanks everyone for helping out! Multichill (talk) 11:11, 23 May 2010 (UTC)[reply]

I'd be curious to know how many {{Check categories-Geograph}} are still in place.

A few problems I've noticed with the categorisation, which you may want to think about before the next run:

  1. Category:Calstock (in Cornwall) and Category:Bere Ferrers (in Devon) are adjacent and the border between them is very complex. Almost all the files I've seen for that area have identified the location as Calstock, and the county (for subject cats) as Devon. It would be nice if it could get it right some of the time...
  2. The current re-categorisation is (inappropriately) adding city categories for files well outside city boundaries (Example - >5 km from the city boundary).
One thing that would definitely be worth exploiting is the Ordnance Survey OpenData; in particular the Boundary-Line product. When the geolocation of the file is right, correct interpretation of that data would identify the most precise administrative region (typically the parish in England - so village level), which in turn would correctly categorise it (most of the time). If it was only used to county level depth, it would guarantee the Gwynedd issue doesn't recur.--Nilfanion (talk) 11:45, 23 May 2010 (UTC)[reply]
I've been chatting away at the HotCat pages in an attempt to speed up the cat checking process by making HotCat more template friendly, with little success though. --ClemRutter (talk) 15:28, 23 May 2010 (UTC)
208911 files are tagged with {{Check categories-Geograph}}. I'm thinking about not adding it to new uploads, what do you guys think?
As for the incorrect categories: my bot is only as good as its sources. If the Ordnance Survey OpenData is freely available and of better quality, it's certainly worth looking into.
But it's also possible to increase the quality of automatic categorization with the current tools:
  • I have a list of static mappings from geo id's to Commons categories. The full list of id's for the United Kingdom can be found here. I now changed the mapping of 2647716 (Gwynedd) from Category:Gwynedd to Category:Wales to prevent the flooding of the Gwynedd category. I will reread this page for more problematic locations (I remember something with Ohio) and will add them to the static mappings. Of course more static mappings can be added to improve categorization, it's probably useful to have all counties of the UK and Ireland mapped.
  • I have a list of imageclass to category mappings. It's probably worth checking the list of unmapped topics and map some more, but not a lot to gain here.
Multichill (talk) 10:15, 30 May 2010 (UTC)[reply]
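A static mapping of this kind is essentially a lookup table consulted before the default rule. The sketch below uses the two ids mentioned in this discussion (2647716 and 2655804); the function name and the default of using the place name as the category name are assumptions, not the bot's real behaviour.

```python
# Static overrides: geonames id -> Commons category.
STATIC_MAPPINGS = {
    2647716: "Category:Wales",              # Gwynedd, remapped to avoid flooding Category:Gwynedd
    2655804: "Category:Bethesda, Gwynedd",  # Bethesda, to avoid the ambiguous Category:Bethesda
}


def category_for_place(geoname_id, geoname_name):
    """Pick a category for a place, preferring a static override if one exists."""
    if geoname_id in STATIC_MAPPINGS:
        return STATIC_MAPPINGS[geoname_id]
    # Assumed default: a category named after the place itself.
    return f"Category:{geoname_name}"


print(category_for_place(2647716, "Gwynedd"))
print(category_for_place(2655804, "Bethesda"))
print(category_for_place(1, "Horsell"))  # hypothetical id with no override
```

Adding every county of the UK and Ireland to the table, as suggested above, would only mean adding more entries; no change to the lookup is needed.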

Improved category intersection


Hi guys, I updated the bot which works on the categories tagged with {{Intersect categories}}, see Template talk:Intersect categories#Subcategories!. This should really improve our ability to split out crowded categories. Multichill (talk) 12:11, 30 May 2010 (UTC)

Conwy


Hi Multichill. I've just come across something which needs sorting out pronto. Some time since I last visited it, about 200 files have been added by bot to Category:Conwy. Conwy is a town, not a county. Almost all of these [new] files belong in Category:Conwy County Borough or (preferably!) its subcategories. Please could you sort it out and try to make sure it doesn't happen again? As I write this I'm also now wondering if the same thing has happened in the case of other categories where the county is named after a town, e.g. Category:Caerphilly / Category:Caerphilly County Borough. I hope not, as I've spent a lot of time trying to get some order in the Wales cats, although I can't go everywhere and do everything, of course. Best wishes, Anatiomaros (talk) 21:05, 12 August 2010 (UTC)[reply]

Of the files I've checked just now, it seems that they are not new files but just "old" ones marked as needing categories checked. These included one which I'd sorted months ago but somehow must have forgotten to remove the note from, so although it was in three correct subcats (village name, *fields in, *rivers of), the [wrong!] parent category was still added by the bot. I've also had a look at Category:Wrexham and see that it's the same story there: a couple of hundred files of countryside scenes that belong in Category:Wrexham County Borough. So is the same thing happening with Bridgend/Bridgend County Borough and Caerphilly/... ? I'm afraid to look :-) 21:42, 12 August 2010 (UTC)
Ok. I'm sorting some stuff out.
Multichill (talk) 13:36, 14 August 2010 (UTC)[reply]
Ok, did that. Now the next step is to rebuild the tree under Category:Caerphilly County Borough because the naming is incorrect and some categories are under Category:Caerphilly (town). Multichill (talk) 14:27, 14 August 2010 (UTC)[reply]
Thank you very much! That should solve the problem for future uploads as well. Anatiomaros (talk) 16:19, 14 August 2010 (UTC)[reply]

ughhhhhh!

Did I speak too soon? The following categories have appeared in Category:Geography of Wales:

Category:Geography of Bridgend County Boroughhh
Category:Geography of Caerphilly County Boroughh
Category:Geography of Conwy County Boroughhh
Category:Geography of Wrexham County Boroughhh

They have subcats. I haven't had the time to look as I've just "popped in" now, late in the day, but maybe there are more? Simple and hopefully isolated error by MGA73bot2 but they obviously need deleting. Anatiomaros (talk) 23:58, 16 August 2010 (UTC)[reply]

Huhhhhhh? I guess that's a typo. Will nuke them when I've found the correct categories. Multichill (talk) 06:11, 17 August 2010 (UTC)[reply]
Oooops! --MGA73 (talk) 06:39, 17 August 2010 (UTC)[reply]

Greedy UK


On another note, I was checking out Category:Wells in the United Kingdom today when I noticed that wells in the Republic of Ireland had been included in that category (and Category:Wells in Ireland completely ignored). I don't think that will make a very good impression on the Irish, especially if this has happened with other categories... Anatiomaros (talk) 21:09, 12 August 2010 (UTC)[reply]

Wells is problematic. City, water wells, a bit of a mess. The whole Category:Wells by country tree should probably be renamed to Category:Water wells by country. Multichill (talk) 12:54, 14 August 2010 (UTC)[reply]
Ok, did that for the relevant categories now. Multichill (talk) 14:59, 14 August 2010 (UTC)[reply]
I think his issue was more that Ireland is not part of the UK (well, apart from the northern bit). -mattbuck (Talk) 15:31, 14 August 2010 (UTC)[reply]
That was indeed the point. Perhaps I could have made it clearer. This touches on a much wider point regarding our having Category:Ireland for the republic and Category:Ireland (island) for ... Ireland(!). I've brought this up on Category talk:Ireland but should probably get around to opening a discussion elsewhere. That whole category tree is very confusing and misleading and contains a number of glaring anomalies (IMHO of course). Some input would be good.
By the way, whilst I've no objection to having *Water wells... for *Wells, this does mean that the UK cats are now at variance with all the other *Wells cats. Anatiomaros (talk) 16:17, 14 August 2010 (UTC)[reply]
I know, the other categories still need to be renamed. Multichill (talk) 17:04, 14 August 2010 (UTC)[reply]

The big one


Earlier this week I got a hard disk containing all the 1.8 million Geograph images + a recent dump of the Geograph database. I'm transferring the files to the toolserver now, I updated my local copy of the Geograph database and changed the bot a bit to reflect some changes. I'm about ready to generate new batches to be imported. Multichill (talk) 17:09, 14 August 2010 (UTC)

As the old military expression goes: "Incoming!". How many batches are you planning and will there be intervals between them? Good luck and tnx again for all your hard work on this project - these images have really transformed our coverage of Wales and Britain at Welsh Wikipedia and I'm sure the same is or will be true with the other Wikipedia editions. Anatiomaros (talk) 18:14, 14 August 2010 (UTC)[reply]
I'm starting at image 250,000 and will be working in batches of 10,000 files. These batches are compiled and tarred up so one of the shell users can download them for import. I'll probably just keep one program running depending on the speed. I hope to be doing around one batch a day, but I'm not really sure.
For locations I'm combining the results of the two sources (Geonames and OSM). These locations + topics + a lot of filtering give the final categorization results. A lot of categorization issues have been solved so at least we shouldn't be running into these. If new categorization issues arise we'll just have to tackle them.
Commons:Batch uploading/Geograph/cats to clean up and Commons:Batch uploading/Geograph/top categories will be updated every once in a while to hunt down overflowing categories. Multichill (talk) 18:30, 14 August 2010 (UTC)[reply]
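The batching described above (start at image 250,000, 10,000 files per batch) can be sketched roughly as follows. This is only an illustration, not the bot's actual code: the function name `batch_ranges` is hypothetical, and it assumes that "image 250,000" refers to a Geograph image ID and that batches are contiguous ID ranges.

```python
from typing import Iterator, Tuple


def batch_ranges(start_id: int, end_id: int, size: int = 10_000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (first_id, last_id_exclusive) ranges of at most `size` IDs.

    Assumption: image IDs are consecutive integers; gaps in the real
    database would simply make some batches smaller than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for first in range(start_id, end_id, size):
        yield first, min(first + size, end_id)


# Example: three batches covering IDs 250,000 up to (not including) 275,000.
for first, last in batch_ranges(250_000, 275_000):
    print(f"batch {first}-{last - 1}")
```

Each range could then be compiled and tarred up as one package, which matches the one-batch-a-day rhythm mentioned above.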
One problem with the top category list is the number of X in county categories high on that list. For example, Category:Fields in Devon does not have meaningful subcategories (all there is is one geographic subset), so with the present structure should be incredibly bloated (and needs more precise subject categorisation).
It would be helpful if the lists were split into those that need more precise location categories (such as "X in England") from those that need more precise subject categorisation (such as Category:Cumbria or "X in County").--Nilfanion (talk) 18:56, 14 August 2010 (UTC)[reply]
Any news on when "The big one" is happening? 86.169.41.141 10:49, 18 October 2010 (UTC)
Commons:Wiki Loves Monuments took a lot of time. I'm currently generating new batches. These will be slowly uploaded in the next couple of weeks. Multichill (talk) 11:00, 18 October 2010 (UTC)[reply]
I'm keeping track at Commons:Batch uploading/Geograph/Progress. Multichill (talk) 17:30, 21 October 2010 (UTC)[reply]

New upload problems: Slough


Hi, have just stumbled across some newly uploaded images from the BOT and have come to the fourth one that has had Category:Slough added to images on the East Riding of Yorkshire/Lincolnshire boundary. The problem is that Slough is in Berkshire which is miles away. For example take a look at File:Adlingfleet - geograph.org.uk - 251390.jpg. Keith D (talk) 19:32, 27 August 2010 (UTC)[reply]

I can't reproduce this. My assumption is that one of the two sources incorrectly returned Slough for some time. Multichill (talk) 13:46, 28 August 2010 (UTC)[reply]
I'm removing Category:Slough from all files in Category:Images from Geograph needing category review as of 25 August 2010. Multichill (talk) 13:51, 28 August 2010 (UTC)[reply]

Same thing has happened a second time. Category:Slough currently contains tens of misplaced images. Looking at a few at random, all seem to have been uploaded on 14 Dec 2010. Sorting out manually is possible, but would take a long time. -- PeterJewell (talk) 17:13, 10 January 2011 (UTC)[reply]

More upload problems


Hi, I have noticed a batch of images of central London which have had the Category:Hertfordshire added to them, possibly related to the issue above? Here are the images I have found; I have recategorised the last two:
  • File:Houses of Parliament and the Thames - geograph.org.uk - 251103.jpg
  • File:Houses of Parliament from the River Thames - geograph.org.uk - 252435.jpg
  • File:Oxford Circus - geograph.org.uk - 254485.jpg
  • File:Cable Street - geograph.org.uk - 253565.jpg
  • File:Buckingham Palace and Victoria Memorial - geograph.org.uk - 251093.jpg
  • File:A view from the Hub - geograph.org.uk - 253933.jpg
Thanks for uploading so many images. 88.109.13.229 19:45, 30 August 2010 (UTC)

OSM thinks that City of Westminster is in Hertfordshire. Multichill (talk) 19:55, 30 August 2010 (UTC)[reply]
Then they are insane (and we should ignore that stupidity) :)--Nilfanion (talk)

Fordon


Hi, can you relocate the Geograph images in Category:Fordon to the correct category at Category:Fordon, East Riding of Yorkshire. Thanks. Keith D (talk) 17:25, 1 November 2010 (UTC)[reply]

✓ Done, more fun at User:Multichill/Zandbak. Multichill (talk) 19:00, 1 November 2010 (UTC)[reply]
Thanks. Keith D (talk) 21:02, 1 November 2010 (UTC)[reply]

Problems ahead


I have just made a comment on this deletion as the outcome of the discussion there could have serious implications for the import of Geograph images on to Commons. Keith D (talk) 18:02, 5 November 2010 (UTC)[reply]

Why exactly? Multichill (talk) 22:05, 5 November 2010 (UTC)[reply]
De minimis and FOP, etc. There was a comment that similar files from this and other projects might have to be deleted. Personally I think it's an over the top reaction but as you'll see from the DR others disagree. Anatiomaros (talk) 22:15, 5 November 2010 (UTC)[reply]
If they delete the image in question then a very similar image of the same plaque from Geograph, which would be loaded under the bulk loading, would need to be deleted under a similar rationale. If this is the case then each of the Geograph images would need to be examined to see if it is acceptable or not. Personally I think they are going too far but we shall see. Keith D (talk) 00:52, 6 November 2010 (UTC)
We're talking about over 2 million images. Of course some of these images will be deleted for various reasons. All part of the game. I'm not going to care about that. The copyright paranoid people are welcome to examine all the images after I've uploaded them. Multichill (talk) 11:20, 6 November 2010 (UTC)[reply]
Yes, it's no big deal as a proportion of our 2 million files. Inspecting them all could keep the "copyright paranoid" busy for a year or two - if they'd like to add a few categories whilst they're at it that would be great. :-) Anatiomaros (talk) 00:58, 7 November 2010 (UTC)[reply]

Categorisation


I'm inclined to say we need a page specifically to report bad categorisation on upload: This page is getting too long and ought to be about more general concerns. Another example: Images of St Thomas are getting placed into Category:St. Thomas (in the Caribbean!)--Nilfanion (talk) 12:25, 13 December 2010 (UTC)[reply]

We could create Commons:Categories needing disambiguation? At User:Multichill/Zandbak I have a nice list to start with. Multichill (talk) 14:32, 13 December 2010 (UTC)[reply]
That's a starting point yes. However, the proposed system would not capture everything because some categories are correctly located and still get irrelevant Geograph images dumped in them. A couple examples:
In both these cases, the bots really should not be applying the completely inappropriate location cats.
My initial suggestion would be to add some sort of sanity check: the simplest would be not to add a location category if it is not a subcategory of Category:United Kingdom or Category:Ireland. I realise that check would not work as stated (Category:United States and Category:Australia are both subcats of the UK), but the idea is viable to my mind - and doesn't need us to write a huge list of exceptions.--Nilfanion (talk) 22:31, 13 December 2010 (UTC)
Copied from User talk:Multichill
OK, I am getting a little irritated by GeographBot's stupidity on upload now. It will always put some images in the wrong locations as it will get the location wrong sometimes (especially as categories care about subject location not camera location) - that's not a problem. Putting rural snaps into the category of the nearest city (for example File:Valley below Membland - geograph.org.uk - 295892.jpg) is unfortunate, as is the bot failing to identify anything and just putting it in the county category. But in both cases, the categorisation isn't completely incorrect and whilst it should be avoided, the category it gets placed in is likely to be maintained by someone who can correctly categorise the file.
However, some of yesterday's batch have ended up in wildly incorrect categories. Geograph images should never be in Category:Melbourne or Category:Boston (but they are [4] and [5]). Not only are those categories severely wrong, their maintainers may not know what to do with a file that has no direct interest to them and they may cause problems for users. The first image I mentioned could be correctly used as a "Rural view near Plymouth", but if someone used this as a "Rural view near Boston, MA"...
(As you know) I've already mentioned this on Commons:Batch uploading/Geograph. However, if GeographBot continues to upload files and place them in categories for different countries, I am inclined to block it as malfunctioning.--Nilfanion (talk) 11:48, 15 December 2010 (UTC)
End of copy
I already do a lot of sanity checking to keep crap out. Undisambiguated categories are really problematic because it's very hard to make a distinction between the different possibilities. Fortunately the number of categories not disambiguated on purpose is very small, so we could just create a list of them and I'll add it as (another) blacklist. Do you happen to have an idea how high the error rate is for this specific error? Multichill (talk) 16:42, 18 December 2010 (UTC)
  • Comment Bad categories are annoying, but if we blocked all users who do not add the right categories we would have to block a lot of users. Just have a look at Category:Media needing categories.
Unlike many of the users who contributed to the category mentioned above, Multichill and other users do A LOT to try to find the right categories for Geograph files, so I see no reason why the bot should be blocked.
The bot does make some mistakes, but once the files have been uploaded it is possible to correct many of the errors with a bot. All we need is someone to report "I found some files in category xx and it is not even in Europe." and then all we need to do is to find the right category and move all the files from the bad category to the right category.
Sometimes only local users can tell if a photo should be in category x or category y, and as said above it is most likely that someone who can correctly categorise the file will do so.
Another alternative if we are not willing to accept some mistakes is to make the bot upload them in some "top category" and let users do all the categorization by hand.
So I suggest that we let the bot upload the files and every time a problem is found it is reported here so the bot can be fixed or a second bot can hunt and correct the errors. --MGA73 (talk) 17:03, 18 December 2010 (UTC)[reply]
  • The block bar for a bot is a lot lower, and this is part of why no one bot should do many different tasks: if it malfunctions in one role a block may be necessary but would stop it doing all the other things it does. Incorrect categorisation is different to insufficient or imprecise categorisation. A human who persistently adds a blatantly incorrect category to imagery would be liable to a block (for disruption); that's different from a human who is lazy and doesn't categorise at all.--Nilfanion (talk) 23:02, 18 December 2010 (UTC)
  • Comment I have a different take. Some place names in the English-speaking world are derived from a well-known UK place name. Melbourne, Derbyshire; New York, Lincolnshire; Washington, Co Durham are places that have been so honoured. Can someone write a simple bot that any user can launch on the root of a category tree (for example Category:Nova Scotia) that recursively steps through that tree, looking for filenames that contain the text geograph.org.uk? Finding one, it will edit the text, replacing the current category name with the text Category:Misplaced geograph image. (For instance, the bot in the example above will find Category:Halifax Regional Municipality and remove at least 13 rogue files.) If recursive scanning is too dangerous, then perhaps a manual confirm should be made before anything is written. --ClemRutter (talk) 19:52, 18 December 2010 (UTC)
  • My point really is how hard would it be to blacklist Category:United States and all subcategories (and Australia etc)? Identifying the correct category is more complex and a bot will always make errors. But if a category is related to the USA, it's definitely wrong for a Geograph image and shouldn't be added. The bots should upload to the most precise category possible, and then humans should fix it. We shouldn't use incorrect categorisation as an intermediate to aid maintenance. If that means high-level categories can/will get bloated that's unfortunate, but it is never incorrect (and future bot runs can go over the high-level categories to add a more precise location).
  • And speaking personally as a major maintainer of Category:Devon: I'd much prefer Geograph images of Devon that cannot be correctly located by bot being dumped in the county category instead of being scattered into incorrect location categories: I can fix them if they are there, but if they are in a random category I'll never find them.--Nilfanion (talk) 22:52, 18 December 2010 (UTC)[reply]
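The blacklist idea in the comments above (reject any candidate category that sits under Category:United States, Category:Australia, and so on) could be sketched roughly like this. It is a minimal illustration under stated assumptions, not GeographBot's code: the category graph is a toy dictionary (a real bot would have to fetch subcategories from the wiki), the names `descendants` and `filter_categories` are hypothetical, and Category:Massachusetts is included only to give the toy tree some depth.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def descendants(graph: Dict[str, List[str]], roots: Iterable[str],
                max_depth: Optional[int] = None) -> Set[str]:
    """Return the roots plus every category reachable below them.

    A breadth-first walk with a `seen` set, so loops in the category
    tree cannot cause endless recursion. `max_depth` limits how far
    down the walk goes (None means no limit).
    """
    seen = set(roots)
    queue = deque((root, 0) for root in seen)
    while queue:
        category, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for child in graph.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def filter_categories(candidates: Iterable[str], blacklist: Set[str]) -> List[str]:
    """Drop every candidate category that appears in the blacklist."""
    return [c for c in candidates if c not in blacklist]


# Toy category graph (parent -> subcategories), for illustration only.
toy_graph = {
    "Category:United States": ["Category:Massachusetts"],
    "Category:Massachusetts": ["Category:Boston"],
    "Category:Australia": ["Category:Melbourne"],
}
blacklist = descendants(toy_graph, ["Category:United States", "Category:Australia"])
print(filter_categories(["Category:Boston", "Category:Devon"], blacklist))
```

One caveat, given the observation above that Category:United States is itself reachable from Category:United Kingdom: an unlimited walk over the real tree might sweep in legitimate UK categories through odd parent links, so a depth limit or a manual review of the resulting list would probably be needed.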

I've now blocked the bot, see Commons:Administrators' noticeboard/Blocks and protections#GeographBot. I'd be happy for the block to be removed if there is a concrete fix in place to sort mis-categorised files (or prevent it happening in the first place (please?)).--Nilfanion (talk) 23:04, 22 December 2010 (UTC)

General feedback


Sorting Geograph images is really a three-stage process:

  1. Initial categorisation at/before upload
  2. Recategorisation by bot
  3. "Final" tweaking by human.

I have been doing extensive work on that "final" stage and can draw some conclusions from what I've seen:

  • In general, the subject categorisation (houses/fields/trees in X) is fine, though additional subject categories may need adding.
  • The location categories for images of cities, towns and villages are usually fine - though city cats are inappropriately added to the surrounding rural areas.
  • However, the error rate for categories of rural imagery is much higher.
  • The issue raised above with mis-categorised files is not that common in terms of raw numbers - the problem is the severity (if it is not localised to the correct neighbourhood it's much harder for users with "local" knowledge to sort).

Files can have an incorrect location cat in three ways: non-location categories being treated as a location, such as Category:Treen; the Melbourne situation; and clearly wrong locations within the UK (two examples: 1 - the village is several km from the photo and there are other villages between them, and Category:Corntown - a Welsh village with English content in the cat).

Incidentally when the file is miscategorised, I think the bot is already picking it up. In general, even if the location identified is "wrong" it still adds the correct x in county image - the example I just gave correctly placed it in Category:Moorlands in Devon, even though it applied a Somerset village category. If it merely identified this conflict and just added the Moorlands in Devon cat it would be correctly (if inadequately) categorised.
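The conflict check suggested above (keep the "x in county" category, drop a settlement category that belongs to a different county) might look something like the following. This is a sketch under assumptions: the function name, the lookup table from settlement category to county, and the placeholder "Category:ExampleVillage" are all hypothetical, since the thread does not name the Somerset village involved.

```python
from typing import Dict, Iterable, List


def drop_conflicting_locations(categories: Iterable[str], county: str,
                               settlement_county: Dict[str, str]) -> List[str]:
    """Keep categories unless they are settlements known to lie in another county.

    `county` is the county already established for the image (for example
    from its "x in county" category). Categories missing from
    `settlement_county` are treated as non-location categories and kept.
    """
    kept = []
    for category in categories:
        known_county = settlement_county.get(category)
        if known_county is not None and known_county != county:
            continue  # settlement lies in a different county: conflict, skip it
        kept.append(category)
    return kept


# Placeholder data: a Devon image that was also given a Somerset village.
lookup = {"Category:ExampleVillage": "Somerset"}
print(drop_conflicting_locations(
    ["Category:Moorlands in Devon", "Category:ExampleVillage"], "Devon", lookup))
```

The image would then be left in Category:Moorlands in Devon only, which is the "correctly (if inadequately) categorised" outcome described above.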

The error rate for rural imagery is significant, and is a natural result of the "nearest village" algorithm. In my last 100 edits in file space (almost all Geograph checks) ~20% corrected the location. If you bear in mind this is really a problem for rural stuff - if this sample of files is representative then the error rate on rural images may well be closer to 50%. These figures are high enough that I cannot trust the bot categorisation and have to verify manually.
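The step from "~20% of checked files" to "closer to 50% of rural files" depends on how much of the sample was rural, which is not stated above. A minimal sketch of the arithmetic, assuming that practically all location errors fall on rural files and that about 40% of the checked files were rural (the 40% is a hypothetical figure picked so that the numbers line up, not a reported one):

```python
def implied_rural_error_rate(overall_rate: float, rural_share: float) -> float:
    """Error rate among rural files if all errors are concentrated in rural files.

    overall_rate: fraction of all checked files whose location needed correcting.
    rural_share:  fraction of the checked files that were rural (an assumption).
    """
    if not 0 < rural_share <= 1:
        raise ValueError("rural_share must be in (0, 1]")
    return overall_rate / rural_share


# 20 corrections in 100 files; if 40 of those files were rural, that is 20 of 40.
print(implied_rural_error_rate(0.20, 0.40))
```

A smaller rural share would push the implied rate higher, and a larger one would pull it back towards 20%.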

Oh and what I mean by error in this context: the bot-identified location is not in the correct civil parish. For example, File:Haws near Butland Wood - geograph.org.uk - 274186.jpg is a view of Modbury parish (as can be verified from OS mapping) but was categorised on upload as being in the adjacent location of Kingston. It's correct to say the image is of Modbury; it is not correct to say it's of Kingston.--Nilfanion (talk) 10:42, 23 December 2010 (UTC)

Cleveland


I'm getting tired of cleaning up after the ill-behaved GeographBot, and it's making me very cranky. Because the bot is unable to determine that Cleveland, Ohio is on another continent, it's dumped hundreds of images into Category:Cleveland, Ohio over the past few weeks. I've dutifully removed the category each time. I eventually grew sick enough of mopping up that I created Category:Cleveland, England, as suggested above, despite the fact that nobody in England had bothered to do so. Now that the category exists, the bot is ignoring it entirely, and dropped another batch of around 130 images into Category:Cleveland, Ohio. This is getting ridiculous. - Eureka Lott 00:36, 31 December 2010 (UTC)[reply]

That's because someone redirected Category:Cleveland to Category:Cleveland, Ohio. This will move the images. Multichill (talk) 06:19, 31 December 2010 (UTC)[reply]
In this case, GeographBot probably should not be adding Cleveland at all, as it's a former county that does not exist today. That category should be populated by the relevant sub-categories. File:Borough Beck, Helmsley - geograph.org.uk - 331844.jpg looks typical - Cleveland doesn't get mentioned in either of the geonames or OSM datasets, and the bot is placing them correctly in the appropriate N Yorkshire village and subject categories. In this case just don't add the category at all.--Nilfanion (talk) 10:39, 31 December 2010 (UTC)
Oh and this problem was reported in July. This shouldn't be a problem with the current uploads (and Category:Cleveland should not be a disambiguation page).--Nilfanion (talk) 10:44, 31 December 2010 (UTC)[reply]
The disambiguation page was created today. Is there somewhere we should discuss this? - Eureka Lott 17:25, 31 December 2010 (UTC)[reply]
@ Eureka Lott, I understand your frustration. If you find other problems like this just leave a note here. Bots can clean up.
@ Nilfanion, this sounds strange. Until a permanent solution is found I think the best is to add the images to Category:Cleveland, England. Then users can either move them manually or a bot can try to find a better place. --MGA73 (talk) 11:09, 31 December 2010 (UTC)
The bot is finding the better place already - look at the images in Category:Cleveland, England - they all have more precise town/village level categories (but with accuracy issues, the 20% error rate mentioned in the previous section), so in this case removing Cleveland altogether is OK. It's probably worth humans adding the appropriate towns to the Cleveland cat. Not a clue why the bots are using Cleveland when it's not mentioned in the tools...--Nilfanion (talk) 11:20, 31 December 2010 (UTC)
I did report the problem here in April, but was brushed off by Multichill, as if it was a problem with the redirect and not his bot. - Eureka Lott 17:25, 31 December 2010 (UTC)[reply]
This upload is a big project and there are thousands of things to fix, so I'm not surprised if a few things are not fixed the first time they are reported. Well, in the future we can perhaps all do better. Also this is a Wiki so if a redirect causes problems just fix it :-) --MGA73 (talk) 13:09, 1 January 2011 (UTC)
Both of Multichill's proposed category changes were implemented (the creation of Category:Cleveland, England and the conversion of Category:Cleveland to a disambiguation page), yet the bot is still adding files to Category:Cleveland, Ohio. Is there any point to reporting problems here? - Eureka Lott 20:19, 2 January 2011 (UTC)[reply]
Hm... We can always try this [6] to remove the files in the wrong category while Multichill is trying to figure out what the problem is. --MGA73 (talk) 20:55, 2 January 2011 (UTC)[reply]
I asked Multichill yesterday and he said the problem is this: the Geograph upload is done in a number of uploads. Each upload "package" has to be compiled before upload. Some of the packages were prepared in August (before the problem was fixed). There are still a few of these "old" packages left. So one option is to delete the packages that have already been prepared and compile them again (it takes a lot of time). Another option is to upload the images and fix the problems when the packages are uploaded (takes much less time). So that is why the bot still uses the old category.
As you can see from the "trick" I did above it only takes a few minutes to move all the images from the wrong category to the right one once the images are uploaded. So we should just make a note of categories with problems and fix for new uploads and get a bot to clean up the old uploads. --MGA73 (talk) 18:28, 3 January 2011 (UTC)
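The clean-up after upload comes down to swapping one category link for another in each file page's wikitext. A minimal sketch of that text substitution follows; it is not the tool actually used for the "trick" above, the function name is hypothetical, and it deliberately ignores variations such as underscores in place of spaces or a lower-case first letter in the category name.

```python
import re


def replace_category(wikitext: str, old: str, new: str) -> str:
    """Replace [[Category:old]] with [[Category:new]], keeping any sort key.

    Only the "Category" namespace prefix is matched case-insensitively;
    the category name itself must match exactly.
    """
    pattern = re.compile(
        r"\[\[\s*(?i:Category)\s*:\s*" + re.escape(old) + r"\s*(\|[^\]]*)?\]\]"
    )
    # A function is used as the replacement so that backslashes in `new`
    # are never interpreted as regex escapes.
    return pattern.sub(
        lambda m: "[[Category:" + new + (m.group(1) or "") + "]]", wikitext
    )


page = "A beck in winter.\n[[Category:Cleveland, Ohio]]\n[[Category:Streams]]"
print(replace_category(page, "Cleveland, Ohio", "Cleveland, England"))
```

Fetching the pages in the wrong category and saving the edited text would be handled by whatever bot framework is in use; that part is left out here.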
Same concern with category:Moscow. I cannot imagine the reason why Scottish photos end up there but yes they do every week or so (example). ??? NVO (talk) 07:31, 3 January 2011 (UTC)[reply]
If you click on the link this OpenStreetMap tool in the category check box you can see that it says "<hamlet>Moscow</hamlet>" = en:Moscow, East Ayrshire / Category:Moscow, East Ayrshire. A bot can move the photos to the right category. --MGA73 (talk) 18:11, 3 January 2011 (UTC)
Yes, it can. No, please don't. NVO (talk) 22:11, 3 January 2011 (UTC)[reply]

Disambiguation problems


I want to attack the disambiguation problem. For that I need a list of problematic categories. Please add them to Commons:Batch uploading/Geograph/Disambiguation problems. I'll go through this page to see what categories were already mentioned, but I might miss some. There will be two approaches:

  1. Cleaning up after upload (for the already uploaded batches and the batches already compiled).
  2. Prevent new images from ending up in the wrong categories.

Multichill (talk) 11:11, 1 January 2011 (UTC)[reply]

This can only ever be a partial fix (but still helpful): For the simple reason that it is reactive not preventative, and it relies on users reporting the bad cats.
A less serious but more frequent problem is bad categories within the UK. For example, File:River Piall in Slade Park - geograph.org.uk - 275016.jpg and File:A48 - Brocastle - geograph.org.uk - 286704.jpg are close to two different Corntowns. In this case a straight disambiguation (to Corntown, Vale of Glamorgan and Corntown, Devon) is reasonable, but in some cases the major use is going to be overwhelmingly more important and so should not be disambiguated at all (Luton for instance). Some mechanism to report/handle these is also needed.
Incidentally the "when to disambiguate" question really should be handled by community-at-large, so I'll start discussion at VP later. I'm not convinced "disambiguate absolutely everything all the time" is optimal. Started discussion on this point at VP: COM:VP#Disambiguation of categories.--Nilfanion (talk) 23:04, 1 January 2011 (UTC)[reply]

Melbourne images


Can you please stop adding Geograph images to the Melbourne category which is in Australia! These images should be placed in Category:Melbourne, East Riding of Yorkshire. Can you move all of these across? Thanks. Keith D (talk) 18:52, 10 January 2011 (UTC)[reply]

Arras images


Can you please stop adding Geograph images to the Arras category which is in France! We have no settlement category for this and these images should be distributed into the appropriate categories Category:Etton, East Riding of Yorkshire, Category:Goodmanham or Category:North Newbald depending on the civil parish that the image is in. I have relocated the images from the recent upload. Thanks. Keith D (talk) 18:52, 10 January 2011 (UTC)

Slough images (again)


Same thing has happened a second time. Category:Slough currently contains tens of misplaced images. Looking at a few at random, all seem to have been uploaded on 14 Dec 2010. Sorting out manually is possible, but would take a long time. Please resolve this soon. Thank you. -- PeterJewell (talk) 13:37, 12 January 2011 (UTC)[reply]

Preston


Category:Preston is taking a hard hit- gathering images from Brighton, Kent, Devon and LB Brent- each of these counties has a hamlet of Preston that may be doing the damage. I have knocked off a few in passing. --ClemRutter (talk) 00:10, 13 January 2011 (UTC)[reply]

There are some for the town in the East Riding of Yorkshire that I have pulled from the cat, may be more. Keith D (talk) 12:04, 18 January 2011 (UTC)[reply]

Process of checking categories with HotCat


Just a comment for the next time we process a million images. I know of no way of removing the {{Geograph- please check cats}} template while just using HotCat- so I am just doing the obvious changes but failing to remove the tag- which is not wrong but doesn't need to be there.--ClemRutter (talk) 00:10, 13 January 2011 (UTC)[reply]

Characters in file names


The Commons filenames are derived from the file name on Geograph - most of the time these are ok, but some may need tweaking because they aren't that useful. No way that can be sorted out until human review of course.

However, there are a few non standard characters in Geograph file names, for example File:King’s Nympton, towards Highridge - geograph.org.uk - 267515.jpg. The use of the non-ASCII apostrophe makes linking/using the file more awkward and the character may not render properly for all users. Could that character just get mapped to the standard apostrophe ' ? (Incidentally, some archiving might be nice)--Nilfanion (talk) 23:17, 17 January 2011 (UTC)[reply]
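The mapping asked for above is a simple character substitution on the title before upload. A minimal sketch, assuming it is applied to the Geograph title when the Commons filename is built; the function name is hypothetical, and including the left single quotation mark alongside the right one is an assumption (only the right one is reported above).

```python
# Typographic quote characters mapped to the plain ASCII apostrophe.
APOSTROPHE_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark, as in the King’s Nympton title
    "\u2018": "'",  # left single quotation mark (assumed to need the same fix)
})


def normalise_title(title: str) -> str:
    """Return the title with typographic apostrophes replaced by ASCII ones."""
    return title.translate(APOSTROPHE_MAP)


print(normalise_title(
    "King\u2019s Nympton, towards Highridge - geograph.org.uk - 267515.jpg"))
```

Files already uploaded under the old names would still need renaming separately; this only covers names generated from now on.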

Gay Street


I'm confused as to where the images in Category:Gay Street should belong to. It's nice to see stretches of farmland in downtown Manhattan but ... :)) NVO (talk) 22:36, 23 January 2011 (UTC)[reply]

Oh, that's a good one! I've sorted the files, moving them to Category:Pulborough (or its subcat Category:North Heath) or Category:West Chiltington as appropriate, as these are the relevant civil parishes. Not one of those images can be described as an image of Gay Street, West Sussex (so no need for Category:Gay Street, West Sussex).--Nilfanion (talk) 00:01, 24 January 2011 (UTC)
Is there any mileage in suggesting that Manhattan should apply to become a twin town to Pulborough? There was an image showing common roots. :-) --ClemRutter (talk) 10:08, 24 January 2011 (UTC)[reply]
Properly disambiguated by now. --Foroa (talk) 12:14, 24 January 2011 (UTC)[reply]

Indefinitely on hold


Over the last year I spent a lot of time on this project. In this project we got a lot of images, it was a fun thing to do and I got a lot of positive feedback. Over the last couple of months this changed. I got a lot of negative feedback from a small group of people and barely any support from the people who like this project. With this topic we hit rock bottom; this project is no fun for me anymore so I'm going to waste my time on something else. I hope you're all happy with that. Multichill (talk) 12:30, 24 January 2011 (UTC)

100% support here- the images I am waiting for are after 733,000- with the most modern uploaded yesterday. I try to keep my head down when it comes to squabbling with rogue admins- I generate material on Wiki, and upload geotagged photos to Commons and a short spat on the Admin pages saps you of time and the will to live. There is only one criterion- will it be used- not whether some modern town has named itself after a village that has been in existence for two millennia and is then surprised if it receives images of its namesake in its cat! So all I am asking is that you restart the bot- but run it a lot lot faster- in return there is coffee on the stove and a warm welcome should you pass by --ClemRutter (talk) 14:33, 24 January 2011 (UTC) Rochester, Kent.
I'm sorry to hear this. I'm no admin and don't really have the necessary debating skills to do much about this, but know this: your work is massively appreciated by many silent users. I'd like to be able to change your mind but am unable to do anything about a few noisy nitpickers. Oxyman (talk) 16:22, 24 January 2011 (UTC)
...and double damn. Please reconsider, Multichill. I can well understand how you must feel but this is quite possibly one of the most important projects we have ever seen on Commons. I have used many GeographBot images on Welsh Wikipedia and have even created articles I'd probably not have started had I not come across an image here whilst categorising these files. Your work is transforming the Britain and Ireland geo-cats and is greatly appreciated by many of us, here and on the various wikipedia editions. Some problems with categories are just inevitable in a project of this size - and I've dealt with my share of them - but responses like "block the bot!" are not warranted. Between that and the fact that some people seem to be here on a mission to have as many images as possible deleted on the slightest of technicalities, I sometimes wonder what is becoming of Commons these days. Anatiomaros (talk) 20:02, 24 January 2011 (UTC)[reply]

Suggestion to resume upload


I would really like to see the rest of the files uploaded. In my opinion a few wrong categories are acceptable but sadly not all users share this idea.

So I suggest we upload all the files but do not categorize them with a bot. Instead they are placed in a category like Category:Uncategorized files from Geograph. Once the files are uploaded users can work on the files manually or perhaps someone can design a bot that can categorize some of the files (once they are on Commons EVERYONE can work on them).

So if you support this idea please add your name below and hopefully we can get Multichill to upload the files.

  1. Rapidly upload the remaining images, but don't categorize them, tag them with {{Uncategorized-Geograph}}
  2. Slowly work on the uncategorized Geograph files and improve the categorization algorithms even more
The code for the second part is public so anyone can run it or improve it. Multichill (talk) 20:38, 25 January 2011 (UTC)[reply]
+1 to this for what it's worth; I'd support a rapid, uncategorised upload if that means the exercise can be done more quickly (the categories are the bottleneck anyway - distributing that stage more widely seems like a plan).
Incidentally, there has been one major change in the background since the start of the uploading: The OS OpenData release. Furthermore, (I've just noticed) it has been converted into a reverse lookup system: This file gives this, which correctly identifies it as Modbury, unlike Geonames, which incorrectly goes for Kingston, whilst OSM picks up Lower Torr (a hamlet in Kingston parish). If a bot made use of this service, it would accurately identify the civil parish (in England) on 99.99% of occasions. As I've mentioned above, the civil parish is a "correct" location.
I can also see a method using that database, that (for England and Wales only unfortunately) will correctly categorise nearly all files and avoid all the bad location categories mentioned in all threads above. That is, create a matrix that links all civil parishes and communities to their MaPit ID (or the ONS one) and to their category on Commons. Then a bot could look up the coordinates on MaPit, extract the ID and get the correct category.--Nilfanion (talk) 21:15, 25 January 2011 (UTC)[reply]
  • One way for bots to do it is just to create categories for all grid squares in the right 500km2. After the upload is completed, go through and delete the empty squares. With additional thought, we can reduce the amount of pointless creations/deletions significantly: For example, all photographed squares in SX are north of SX3833, so no need to create any of SX**00 through to SX**32 (saving 3,300).
  • The grid squares are shown at [7] - 75 in all. Some of those have a very low number of squares with images: I've created all necessary cats for MC, OV, HW and HX manually. Some will need 10,000, 13 of them (NN, SE, SJ, SK, SO, SP, TL, SU, H, M, N, R and S) are nearly 100% land.
  • Alternative approach, could be to look at geograph's metadata and figure out which squares are needed that way? (I'm not sure on feasibility there).--Nilfanion (talk) 11:30, 7 February 2011 (UTC)[reply]
  • Considering that the Category:Images from the Geograph British Isles project needing categories by grid square has now been emptied (of course, there are still files which could be categorised better, but there is always room for improvement) and deleted, we might have reached a point when Geograph photos could be uploaded again automatically. However, I would like to suggest - similar to what others have written above - that all Geograph images in Commons should keep or be given a category corresponding to the grid square. This could be done automatically on upload, or for existing files it could be derived from the coordinates of the object. Administrative borders may change, but the position of an object does not - at least not perceptibly during our lifetimes, plate tectonics is too slow for that ;-). Also, the upload should pre-categorise files at least according to the county, and perhaps it might be possible to identify key features from the title or description of the photos, at least in a significant minority of the cases (such as "Railway lines in Wales", "Roads in England", or "Rivers of Scotland")? Hence, I cautiously support the resumption of the upload. --Schlosser67 (talk) 07:09, 25 May 2018 (UTC)[reply]
    Wow! I'm impressed. If we were to resume bulk uploading, I'd suggest that it be done quite slowly. We're eight years and 4 million photos behind, so I think it would be quite acceptable to take several years to process the backlog. I don't know if Multichill is in a position to revive GeographBot, but if not I might be able to construct a new bot that could help. It's a bit of a scary prospect, though. --bjh21 (talk) 16:55, 25 May 2018 (UTC)[reply]
  • Support, but I would oppose leaving any new uploads uncategorised, as many people checking through them are likely to look in the categories covering their expertise, not a meta-category containing every single one of the however-many-million images awaiting uploading. WT79 The Engineer (talk) 16:53, 19 June 2020 (UTC)[reply]
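
The pruning idea in the grid-square discussion above (skip rows of squares known to be empty before creating categories) can be sketched roughly as follows. This is only an illustration: the function name `grid_refs` and the `min_northing` parameter are made up for this sketch, and it produces bare grid references rather than any particular category name.

```python
def grid_refs(letters, min_northing=0):
    """Yield the 1 km grid references inside one lettered grid square.

    letters      -- grid letters, e.g. "SX" (or a single letter for Ireland)
    min_northing -- hypothetical cut-off: rows below this northing are skipped
    """
    for easting in range(100):
        for northing in range(min_northing, 100):
            # Two digits of easting followed by two digits of northing, e.g. SX3833
            yield f"{letters}{easting:02d}{northing:02d}"


# For SX, skipping northings 00 to 32 leaves 6,700 of the 10,000 squares.
sx_squares = list(grid_refs("SX", min_northing=33))
print(len(sx_squares))
```

A bot would still need to check which of the remaining squares actually hold images before creating anything; this only trims the obvious empty rows.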

Mangled image


I just wanted to let you know I have re-uploaded File:Rolls_Royce_Cars,_Hellens,_Much_Marcle_-_geograph.org.uk_-_583392.jpg from original source - the bot uploaded some binary data of the same length, but that was not the image. You might want to investigate this, if you have time.  « Saper // @talk »  01:57, 14 February 2011 (UTC)[reply]

Adding categories- going on a wikiwalk


During the last week I have been spending many pleasant and nostalgic hours walking in the Peak District-- all from the comfort of my computer. I am now at the stage of taking off my boots, having a pint and discussing the many things I have seen.

Geograph photos are brilliant... but back to categorisation and a few rambling thoughts! Firstly in lowland Britain many towns have no displaced photos but cathedral cities seem to include photos of every church in the diocese, maybe 50 or so from villages 40 miles away. Would it be acceptable to create a category Category:Churches in the diocese of Rochester then Cat-a-lot the whole batch there?

When using Cat-a-lot you cannot remove the geograph-template. Do we need to create Category:Geograph photos that have been examined but are awaiting a bot to remove the tag? It is time consuming to process them individually with Hot-Cat in (++) mode.

Secondly, up in the Peak District- I moved to Category:Glossop. I have previously worked and photographed here. I have spent time drilling down the sub cats so that the town had a sensible number of settlement related photographs. I forget the number of geophotos I have found here- maybe I have moved about 120. It is understandable as the town of Glossop is the last sizeable settlement below the hills in the Peak District National Park. One shot was 14km away and in all, it seems to cover about 150 sq km of moorland. OK finding the location of a shot called Peat bog.jpg or Bridge over troubled water was fun but... Well I have categorised most of them using existing Cats, such as Category:Bleaklow, Category:River Kinder and created a few more such as Category:Shelf Moor, Category:William Clough. It is here that I am starting to have questions about the correct way to cat moorland: I guarantee I will have got it wrong, and then what location category do you use?

Category:Moorlands above Glossop would be an easy way, though not quite accurate.
Category:Tintwistle CP uses the notation from the OS map, CP being civil parish- this could separate village shots from shots of the area that is nominally administered by a parish council that has taken its name from that village. (I can't find a Glossop CP- as urban areas are not parished and a parish council may choose a name other than that of the village where they meet; this is not a surprise, but is not BOT friendly). Should we do this? To an outsider CP could be seen as anti-wiki? Do we create artificial CP categories?
Category:Bleaklow, Category:Shelf Moor- does one put both, as Shelf Moor could be seen as a small part of Bleaklow? Is it better to classify the moor, or the brook, beck, bach or clough draining the moor, such as Category:Shelf Brook? I have gone for both. I put the brook as a subcat of the river it feeds into, and of Category:Streams in Derbyshire. Does one go further and create Category:Streams draining Bleaklow? Do we whinge about the cat Category:Moorlands in Derbyshire when the plural of Moorland is Moorland, and as a natural feature it should be Category:Moorland of Derbyshire?

Now we come to the other images on Google maps- the ones transferred more recently. Many advantages but a pain to categorise.

There is nothing there to work with as none of the Geograph cats have been transferred. Each image has to be processed manually. We have all the angst about the location of the square :Category:Tintwistle CP :Category:Crowden Great Brook as above, and then no clue what the image is: Footpath, Wooden bridge, Wheatear, Moorlands in Derbyshire, Peat, Heather. Is there sufficient geograph metadata to run a bot- to add all the categories other than location which seems to have been the biggest sticking point in the past?

So here are a few thoughts I like to share, before I put my boots back on again 'toddle yam o'er them moors' --ClemRutter (talk) 11:25, 12 February 2011 (UTC)[reply]

Now that's a lot of stuff. To get through the various component questions:
  1. The problem with cities is not related to religion and would probably not align with the diocese boundaries anyway, so that cat-a-lot idea isn't workable IMO. It is a result of the coding for cities being "greedy" and including a lot of the surrounding countryside. Some cities are much worse than others. Plymouth was exceptionally bad before I fixed it: all these were categorised to Plymouth, but most are not of the city.
  2. On the moorland / CP issue: Personally I like everything to be placed to the correct parish (or better), so all hamlet cats are included in their parish categories. This makes the CP a useful way to search and exploit via category intersects. This does mean that in some cases (eg Category:Peter Tavy), countryside stuff overwhelms the few snaps of the village proper. This could be handled by splitting the parish from the village, if this is done Category:Peter Tavy (parish) is better than Category:Peter Tavy CP ("Peter Tavy CP" doesn't exist but the CP called "Peter Tavy" does, but needs then disambiguating from the village), and this is not an artificial construct as it represents a real area. The unparished area of Glossop is an equally well defined area - just don't call it Glossop (parish) because it isn't :)
  3. As for the subject categorisation: Do what seems most appropriate. Create categories if there's an article on WP on the subject, which would benefit from it. Don't bother creating intersection categories unless/until the parents are too bloated, in which case do so. Don't lose sleep over it in any case, otherwise you'll go round in circles for months. In the case of Derbyshire, "Streams in the Peak District" is a natural sub-cat. Think about it in urban terms: At what point would you create Category:Darnley Road, Rochester?
  4. The recent transfers do include the grid square, which is better than nothing. The grid square readily gives the civil parish (or whatever), so potentially provides greater accuracy for the localisation - You could use Cat-a-Lot to move everything in Images from the Geograph British Isles project needing categories in grid SE2933 to Category:Leeds, which is where GeographBot would probably have put them. They would still need subject categories, so shouldn't have the "give me categories" tag removed. The bots were always pretty good with subject categorising, they (nearly) always get the <subject> in <county> category right. Extracting the subject tag from Geograph would help there, and could always be done by bot either on upload or later. Remember, the subject categorisation always benefits from a manual look, as Geograph only records one subject matter, and images may relate to multiple subjects - no bot can add the info when it just isn't there.

Think that covers it :)--Nilfanion (talk) 19:07, 12 February 2011 (UTC)[reply]

I think there is enough there to formulate an advice note.
New problem: I found a source square. Category:Images from the Geograph British Isles project needing categories in grid SJ9504- I found the target cat Category:Essington- I fired up Cat-a-lot- Selected all, and clicked on move, wheels whirred - and it processed all 13 files then spat up -
Done.
All pages are processed
Return to Page
The following files were skipped because the old category could not be found.
Then the names of 13 files
Please say it's just me! --ClemRutter (talk) 19:27, 13 February 2011 (UTC)
It hasn't gone away! --ClemRutter (talk) 01:10, 21 February 2011 (UTC)[reply]
Unfortunately, that's a problem with Cat-a-Lot. The grid square category is actually not on any of the file pages, but it is included via the {{Uncategorized-Geograph}} template. You may want to bring it up at Mediawiki talk:Gadget-Cat-a-lot.js and see if the developers of the tool can help.--Nilfanion (talk) 12:27, 24 February 2011 (UTC)[reply]
I see you have been over there in the past. I have added a request for help, as I think that with 1.5 million files to process we have a good case. You are right, there is nothing resembling a category- either cat-a-lot must be liberalised, or we need a BOT to trawl through the database --# searching gridref=AA9999 writing to file Category:OS grid AA9999- but BOT writing exceeds my pay grade. We can hope that User:DieBuche comes up with a solution. --ClemRutter (talk) 18:04, 24 February 2011 (UTC)

OK I did an auto-deletion request thing and didn't realise the deletion request would be here. The only reason I was suggesting the image should be deleted is because it appears to be exactly the same as File:PSndwnTB34.JPG, the uploader on Geograph obviously chose to upload it here to Commons as well and there's little point in both being on Commons. Editor5807speak 11:54, 23 February 2011 (UTC)[reply]

In future is it possible to copy & paste the geocoding from the Geograph image onto the image to be kept so that as much info as possible is retained? Oxyman (talk) 23:59, 23 February 2011 (UTC)[reply]
Fixed -- Common Good (talk) 19:52, 28 February 2011 (UTC)[reply]

Small Thumbnails


The bot seems to have uploaded a couple of small thumbnails rather than the actual images: File:Arrival at Aberystwyth - geograph.org.uk - 580828.jpg, File:Vale of Rheidol Railway - geograph.org.uk - 775530.jpg. Oxyman (talk) 18:31, 1 March 2011 (UTC)

Ah, the archive we sent to Wikimedia containing the images may contain a small number of thumbnails like this. The filenames would end in ..._60XX60.jpg. Ideally the bot should exclude any filename containing "X" or "x" characters (a normal image filename will never contain those letters). BarryHunter (talk) 16:37, 5 March 2011 (UTC)
Also medium sized thumbnails? like File:Ottendorf Green.JPG. - Category:Images of the Geograph British Isles project requiring attention seems a good place to put these Oxyman (talk) 00:39, 19 March 2011 (UTC)[reply]
Looks like that one was manually uploaded by Northmetpit, given the wrong licensing tag and missing the Geo-data. Keith D (talk) 12:02, 19 March 2011 (UTC)[reply]
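
The filename rule BarryHunter describes above (skip any archive filename containing "X" or "x") could be applied before upload with something like the sketch below. The function name and the example filenames are invented for illustration; only the rule itself comes from the discussion, and it is meant for filenames in the Geograph archive, not for titles on Commons.

```python
def looks_like_thumbnail(archive_filename):
    """Return True if an archive filename should be skipped as a thumbnail.

    Assumes, as stated in the thread, that full-size image filenames in the
    archive never contain the letter X in either case.
    """
    return "x" in archive_filename.lower()


# Hypothetical archive filenames, purely for illustration.
candidates = ["580828_1a2b3c4d.jpg", "580828_1a2b3c4d_60XX60.jpg"]
to_upload = [name for name in candidates if not looks_like_thumbnail(name)]
print(to_upload)
```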

Corrupt files


Please have a look at the following:

--DieBuche (talk) 15:56, 6 March 2011 (UTC)[reply]

The old OSGB36 WGS84 problem


While doing a little categorising of files on the Northumberland coast, it is easy to see some whose geotags are 112 m out. (They are in the sea!) Do we have an OSGBerror category we can Hotcat onto the image so it can be cleaned up by a bot later? Opinions? — Preceding unsigned comment added by ClemRutter (talk • contribs) 2011-03-12T01:09:45 (UTC)

Deletion Request Notification moved to sub page


To head off the problem of this page being swamped by DRNs, we have created a subpage Commons:Batch uploading/Geograph/Deletion requests, tweaked the links and moved existing DRNs.

Disambiguation This page is not for deletion request notifications of files uploaded by the GeographBot.

If you want to view the deletion requests go to: Commons:Batch uploading/Geograph/Deletion requests and watch that page.

--ClemRutter (talk) 10:17, 14 March 2011 (UTC)[reply]

Getting the remaining images categorized


Hi everyone. We have an awful lot of uncategorized Geograph images. So what to do next? The first batches of Geograph images were categorized like this:

  1. Get topic category (Geograph added a keyword to each image, we mapped these to Commons categories)
  2. Get location categories (lat+lon-> location tree like Europe, United Kingdom, England, etc etc)
  3. Intersect the topic category with the location categories
  4. Filter over-categorization
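
The four steps above might look roughly like this in code. It is a sketch under stated assumptions, not GeographBot's actual implementation: the function name, the "Topic in Place" naming pattern, the fallback behaviour and the example category names are all chosen for illustration.

```python
def categorize(topic, locations, existing_categories):
    """Pick categories for one image.

    topic               -- Commons topic category mapped from the Geograph keyword
    locations           -- location tree, most specific place first
    existing_categories -- set of category names assumed to exist on Commons
    """
    most_specific = locations[0]
    # Step 3: intersect the topic with each location, most specific first.
    for place in locations:
        candidate = f"{topic} in {place}"
        if candidate in existing_categories:
            # Step 4: keep only the deepest intersection; add the precise
            # place separately when the intersection is at a broader level.
            if place == most_specific:
                return [candidate]
            return [candidate, most_specific]
    # No intersection exists: fall back to the plain topic and place.
    return [topic, most_specific]


existing = {"Churches in Derbyshire", "Churches in England"}
print(categorize("Churches", ["Glossop", "Derbyshire", "England"], existing))
```

Both failure modes listed in the next part of this comment (wrong village, ambiguous names) would enter through the `locations` input, which is why the choice of location source matters so much.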

This works pretty well, but sometimes the system contains some errors:

  1. Wrong village: Boundaries between villages aren't very clear so we end up in the village next door
  2. Disambiguation problems: More than one place has the same name, image ends up in the wrong category

For the first problem it was suggested to base the location categorization on a different source than OpenStreetMap and geonames. I don't have time in the next couple of months to do this and I'm not under the impression that anyone else wants to do this so this option is not feasible. At least the image ends up close to the actual location so this shouldn't be too hard to fix over time. The second problem is very hard to tackle beforehand, but easy to fix when the bot is done. Say the bot puts the images in Category:A, but the images should be in Category:A (other A), you just have to put {{Intersect categories|A (other A)|Images from the Geograph British Isles project}} on Category:A and just wait for the bot to clean up. This only works if all Geograph images in Category:A need to be moved (for example when A is in the USA).

What do you think? Should I just fire up the bot again? With the current manual approach it will take forever to get the images categorized. Even worse is that the topic categories from Geograph are not used. Multichill (talk) 11:57, 7 May 2011 (UTC)[reply]

The topic categories need importing somehow. I am slowly working through the grid squares as this gets the right location in about half of cases immediately. I got many of the big city categories done quickly but it's a struggle for the rural ones. A 1-to-1 mapping of grid square to civil parish exists for many but not all cases.
The proposed fix to the disambiguation problem isn't perfect either - if A is a place in England, and other A is also a place in England it will fail. It would work if some county information can be extracted first. For example, if an image of Luton is in grid TLxxxx, it's Luton, Bedfordshire. If it's in TQxxxx it's Luton, Kent and so on. That ought to give a method for sanity checking.--Nilfanion (talk) 17:54, 7 May 2011 (UTC)
I think we should run the bot and accept some errors. The file contains a template saying that categories need to be checked so users should know that there is a risk that the categories are not 100 % ok. When a new and improved bot is ready in x months it could probably check the categories of all files that still have a "check categories" on. Meanwhile it is possible to fix manually with "Intersect categories" by working on all Geograph images that are in a category that is within the United Kingdom etc. Files that are categorized within the United Kingdom but in a wrong part of the United Kingdom could perhaps be fixed (semi) manually.
If there are known problems we could perhaps ask the bot to put the images in a category one step higher. Like "England" or "Ireland". --MGA73 (talk) 18:41, 7 May 2011 (UTC)[reply]
The most important point to me is retention of the grid reference (or similarly precise info such as lat/long to 0.01 degree) until we know the location is right, preferably as a category so we can use CatScan, Cat-a-Lot and all the other category based tools to fix things. Once that category info is thrown away it becomes a lot harder to work out the true location. And I'd oppose just resuming things as we did before the pause in December, without some measure in place to stop the "pollution" of blatantly incorrect categories like Category:New York or Category:Moscow.
To expand that second bit, my concern is the files never reaching the correct UK/Ireland category. I don't mind if the files temporarily go into non-UK location cats, but we have to find them. The specific cases before were typically the local maintainer finding Geograph images and complaining here. Other maintainers might just remove the misplaced Geograph imagery from "their" category, without attempting to correct and so we lose the locality info completely. IMO the minimum is some sort of error-detecting algorithm which reports somewhere and we can fix manually. We have a good chance of fixing all misplaced files then. Bot error-correction would be even better, to save the manual legwork, but it's detection that's the important bit.--Nilfanion (talk) 19:46, 7 May 2011 (UTC)
User:Nilfanion used IMO above, but his opinion is far from humble. He stops everyone else from progressing because of a very few miscategorisations; his opinion is in the minority but it seems it's the only opinion that counts. There is always going to be some miscategorisation in an undertaking of this scale. If the amount of miscategorisation can be reduced easily then it should be done, but it should not be necessary to write a new bot when there is a perfectly good one. I think the only realistic choice is either to run the existing bot, possibly with a few small alterations, or to continue with the manual process that in all likelihood will never actually get completed. I'd choose the former. Oxyman (talk) 02:49, 8 May 2011 (UTC)
I made a query to compare coordinates, you can find the result here. It shouldn't be too hard to build a tool around this to find images with articles where the coordinates are not in the vicinity. This way you can easily hunt down images which ended up in the wrong category. Multichill (talk) 10:37, 8 May 2011 (UTC)[reply]
Multiple CatScans can also work, but are very tedious (do them across every country..).
One substantive change I'd suggest is to stop the bot adding city categories to rural imagery - This one had Plymouth incorrectly added because of the OSM query, the Geonames query avoids that
And please keep that grid square info on the page: With it you can easily use the OS maps to work out the final correct location, something that can't be done with WGS84 lat/long info (the process becomes much more complex). I could process 100 images in 1 grid square in less time than it takes for 5 in a town category.--Nilfanion (talk) 12:59, 8 May 2011 (UTC)
The gridsquare information remains available at the Toolserver (u_multichill_geograph_p) or you can use a copy from http://data.geograph.org.uk/dumps/. Multichill (talk) 13:25, 8 May 2011 (UTC)[reply]
Can you just retain it in category format (as it is via {{Uncategorized-Geograph}})? It's the simplest option as it means all of Commons' existing category-based tools can be used. I'd say the best thing would be to add the category such as Category:Images from the Geograph British Isles project needing categories in grid SX4870 after the bot-generated cats (no point fretting about the name - it's still a temp cat), and if it's hard-coded Cat-a-Lot can grab it to use.--Nilfanion (talk) 15:44, 8 May 2011 (UTC)
Going back to the distance thing. I hacked up this list (based on query and result). It doesn't parse all coordinates yet, but it gives a good impression. This could be turned into a tool to hunt down images with incorrect categories. Multichill (talk) 12:34, 19 July 2011 (UTC)[reply]
There's certainly some potential there. One tweak I'd suggest is if the script thinks the coordinates are 0,0 (because it can't parse the coordinates, the en article has no coordinates or whatever) that it gives an error message instead of computing the distance from Africa to the UK. Another thing to try might be to filter the output: If the error is <100km it is likely to be correct; and >1000km is certainly wrong.--Nilfanion (talk) 00:24, 23 July 2011 (UTC)[reply]
Yep, the parser doesn't understand everything right now. This is just a proof of concept. This might be the basis for a useful and user-friendly tool. Multichill (talk) 08:30, 5 August 2011 (UTC)
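
The distance check discussed in this thread, with the two tweaks Nilfanion suggests (report unparsed 0,0 coordinates instead of measuring from them, and filter by distance), could be sketched as below. This is not the query or list linked above; the function names, the status strings and the sample coordinates are illustrative assumptions, and the 100 km and 1000 km thresholds are the ones proposed in the comment.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points (haversine formula)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def check_category(image_coords, article_coords):
    """Compare an image's coordinates with those of its category's article."""
    if article_coords == (0.0, 0.0):
        # Unparsed or missing coordinates: report it rather than measuring
        # the distance from a point off the coast of Africa.
        return "no usable coordinates"
    d = distance_km(*image_coords, *article_coords)
    if d < 100:
        return "likely correct"
    if d > 1000:
        return "certainly wrong"
    return "needs review"


# Approximate coordinates, for illustration only.
plymouth_image = (50.37, -4.14)
print(check_category(plymouth_image, (40.71, -74.01)))  # article about New York
```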

Date categories


How difficult would it be to automate the addition of month/year categories e.g. Category:July 2007 in England or Category:2005 in Wales? It seems that most Geograph images come with this information already provided. It's not as high a priority as the more sophisticated categories by location (and even e.g. distinguishing Wales/England and other borders is difficult near the edges, although should be "obvious" for most grid squares) but it is a task that could be undertaken at the same time. TheGrappler (talk) 03:16, 20 June 2011 (UTC)[reply]

What's the use of these categories? They will contain thousands of images. How does this help the user? Multichill (talk) 08:37, 20 June 2011 (UTC)[reply]
That's what I first thought, skeptically, when I noticed this category schema. But it is a widespread and standardized system used to classify thousands of images, and could be easily implemented for the Geograph uploads, so I think the onus is more on "why not?". We endeavour to make geolocation information both machine-readable and useful for end-users (e.g. by using geolocation templates as well as by-locality categories); it stands to reason we ought to do the same for dates too. This is likely to become more of an issue as Commons matures: for now we are used to most of our photos being from 2005 to 2011, and therefore being "contemporary". But by 2020, we will be looking back at our 2005 images as to some extent "historical" and ought to distinguish them from our photographs of the late 2010s! This is particularly true for photographs of places which change over time (rather than e.g. photos of animal species which are essentially timeless). In fact the Geograph image set contains quite a lot of images from the 1980s and 1990s (and even some from earlier!) which already raise the question of identifying and classifying their dates.
As for "how does it help the user?" ... I've become less skeptical since taking a look at e.g. Category:1983 in New York City or Category:1973 in England. There's obviously potential here: the by-date categories contain a fascinating record of change. In the long run I expect category union and intersection tools will provide the best way to browse by both date and location (e.g. to view "images of Lincolnshire between 1995 and 1999") so the number of files need not be an impediment to helpfulness for end-users. Alternatively editors may choose to split the categories such as "2004 in England" by creating subcategories as date-locality intersections (I'd suggest at a county level to start with; in some cases e.g. Category:2002 in Somerset this work is already underway). But the first step to getting there is to categorize the images at a basic level: judging from the current category scheme, "YYYY in England/Wales/etc" will do for images prior to 2000, and "MONTH YYYY in England/Wales/etc" for images in 2000 and later (so long as month data is available). TheGrappler (talk) 20:15, 20 June 2011 (UTC)
We're hitting the limits of the category system here. See User:Multichill/Next generation categories. The second point (Efficient intersections/searching) would be very nice for this. Multichill (talk) 11:21, 19 July 2011 (UTC)[reply]
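
The naming rule TheGrappler proposes above (year only before 2000, month and year from 2000 onwards when the month is known) is simple enough to sketch. The function name and signature are assumptions for illustration; whether each resulting category actually exists on Commons would still need checking.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def date_category(year, month, country):
    """Build a by-date category name.

    year    -- year the photo was taken
    month   -- month number 1-12, or None if Geograph has no month data
    country -- e.g. "England" or "Wales"
    """
    if year < 2000 or month is None:
        return f"Category:{year} in {country}"
    return f"Category:{MONTH_NAMES[month - 1]} {year} in {country}"


print(date_category(2007, 7, "England"))  # Category:July 2007 in England
print(date_category(1983, 5, "Wales"))    # Category:1983 in Wales
```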

The check categories template currently refers to Geograph, and to the Geonames and OSM databases. I've mentioned above that these additional tools make significant errors at times, inappropriately assigning rural locations to the nearest city, and assigning a rural picture to the nearest village - when the location is actually part of a different parish.

The tool http://mapit.mysociety.org/ is derived from the OS Boundary-Line datasets and provides a lat/long lookup. The key difference is that the information provided will correctly identify the relevant administrative areas - making it a better match for the category scheme. If it was added to the template (http://mapit.mysociety.org/point/4326/{{{2|}}},{{{1|}}}.html), this would be a more useful database for manual checks than the other two. This makes checking location a lot simpler, as only one click is required to get reliable info.

I could just be bold and add it myself, but I'd want to (a) see if there are objections and (b) bring this to the attention of others trying to categorise this stuff.--Nilfanion (talk) 10:36, 15 July 2011 (UTC)[reply]

Great suggestion! I added the link. Someone could also add automated harvesting from this source, but that would probably require setting up a mirror. Multichill (talk) 11:19, 19 July 2011 (UTC)[reply]
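
For anyone wanting to build the same lookup link outside the template, a minimal sketch follows. It assumes that template parameter 1 is the latitude and parameter 2 the longitude, so that the URL takes longitude first; the function name and the sample coordinates are illustrative. It only builds the link and makes no request to the service.

```python
def mapit_point_url(lat, lon):
    """Build a MaPit point-lookup link in the form quoted above.

    Assumption: the URL expects longitude before latitude (SRID 4326),
    matching the {{{2|}}},{{{1|}}} order in the template link.
    """
    return f"http://mapit.mysociety.org/point/4326/{lon},{lat}.html"


# Approximate coordinates near Modbury, for illustration only.
print(mapit_point_url(50.35, -3.89))
```

Automated harvesting, as Multichill notes, would probably need a local mirror rather than repeated calls to the public service.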

End of the batch


I'd like to know what the last file uploaded by GeographBot was. As in, what was the latest Geograph ID? The hard disk was supplied some time ago, and Geograph has had a lot of contributions since then. It's also worth saying that the recent uploads are likely higher value, due to Geograph allowing higher resolution imagery.

Once we know what the last file was, we can work out what to do with the remainder. Noting what the last one is (both here and on Geograph itself), and encouraging manual upload of the good stuff seems sensible at minimum; as well as considering further batches too. I'm inclined not to grab more files until we process the ones we have got though.--Nilfanion (talk) 21:04, 3 August 2011 (UTC)[reply]

My plan is to first get the images categorized by the bot so that we don't have a lot of uncategorized files left. I plan to send the hard disk to the Geograph guys again to get all the new files and the higher resolution versions uploaded since I did the first uploads. Don't expect this to happen anytime soon (maybe in a couple of months). Multichill (talk) 08:27, 5 August 2011 (UTC)[reply]
Yep, fair enough on that (I agree with your plan, and have no inclination to rushing). What is the last ID on the hard disk? I'd like to have indication of how far we've got and at what point we should upload from Geograph if we want specific imagery.--Nilfanion (talk) 10:17, 5 August 2011 (UTC)[reply]
I think it's 1806567. Multichill (talk) 12:00, 5 August 2011 (UTC)[reply]

Category errors - August 2011


I've noticed a few errors (caused by the usual lack of disambiguation confusion), for instance Category:Hope. Should I add any I see to Commons:Batch uploading/Geograph/Disambiguation problems? I see no point in making value judgements on whether or not "it should be disambiguated", just listing them is important.--Nilfanion (talk) 10:13, 5 August 2011 (UTC)[reply]

Non-standard characters in Geograph files


One major nuisance with Geograph I've noticed is the use of non-standard characters - and how GeographBot mangled them further on upload.

For instance, File:Rose Ash, St Peter’s church - geograph.org.uk - 272396.jpg has an awkard character in the file name and its description - instead of it getting converted to the equivalent ’ or the standard ASCII '. This will affect thousands, probably tens of thousands of files, as a result of this non-standard character. There are also complications caused by use of ! - look at Geograph files in Category:Westward Ho!.

There are probably several problematic characters like these, but I'm not sure how to trace (and fix) them.--Nilfanion (talk) 21:32, 20 October 2011 (UTC)[reply]

Restarting Geograph uploads

@Jura1, Mike Peel, Nemo bis, Donald Trung, Jheald, Christian Ferrer, Gone Postal, Crouch, Swale, MGA73, Nilfanion, and Bjh21: Not too long ago it was suggested to restart the Geograph upload. At the Wikidata conference 2019 I talked with several people about how this is a good opportunity to use structured data for a large scale upload. As you might have noticed, I've been adding structured data to the already uploaded Geograph files. Now I'm slowly shifting my focus to uploading new files. I want to go all in on structured data. So only {{Geograph from structured data}} + a category on the page (yes, still doing categories, people are very attached to categories, just doing the location category). The template grabs all the data from structured data. The template is far from done (Jarek is going to help to improve it). You can see it in action. All the information is in structured data using the several available properties. In my upload a long time ago I messed up the coordinates of the object and the photographer. I fixed that now. Based on the coordinates I use http://edwardbetts.com/geocode/ by User:Edward to look up the category and location of creation (P1071). For the depicts (P180) I use the subjects and tags added to Geograph which are mapped to Wikidata items at User:GeographBot/Tags. Could use some help there to expand the mapping.

My next steps are:

I want to start slowly with uploading focusing on the incremental addition of recent files. I'm thinking about having the bot run every night and upload the files to Commons that have been uploaded to Geograph two months ago. That way every day we get some geograph files. Based on these uploads things (code, reverse geocoding, etc.) can be improved. Once we get that to a good enough quality, I can have the bot work on the older Geograph files that haven't been uploaded already. What do you think? Who wants to help? Multichill (talk) 20:04, 7 March 2020 (UTC)[reply]

@Multichill: I like the idea of uploading more files :-)
I was away from Commons for some time and I think structured data is hard to work with. It's much easier the old way because you can see all the info on the page. Do we have a "Structured data for dummies"? ;-)
When I look at File:Cottages at Garneddwen - geograph.org.uk - 750937.jpg I see some errors like "Lua error in Module:Information/sandbox at line 297: Tried to read nil global lang." and "Error: Invalid parameters! (coordinates are missing or not numeric)"
I'm sure that can be fixed and I and other users will get used to structured data. I hope to see some new files soon!!! --MGA73 (talk) 20:17, 7 March 2020 (UTC)[reply]
It sounds good, and I'm happy to help. As I've said before, {{Structured Data}} already does most of what you want, and {{Geograph from structured data}} just forks off from that. Can we please focus on using a single template for this? Thanks. Mike Peel (talk) 20:19, 7 March 2020 (UTC)[reply]
  • Great. Thanks for doing that. I will try to help.Jura1 (talk) 22:30, 7 March 2020 (UTC)[reply]
  •  Support but why start with the recent photos? Why not start with the oldest that don't exist here (usually from 2010), making sure to avoid uploading the few that have previously been deleted? When I categorize things here I go to Geograph, select "Breakdown list" then "Year Submitted", and then work through the images in order up until 2009, since I know that not many from 2010 and after have images here. As far as categorization goes, I'd recommend using some of the tags (and similar categorization there) for categories here. Also, all but 1 CP has its own category here, so the images should be placed in CP categories too. Crouch, Swale (talk) 06:21, 8 March 2020 (UTC)[reply]
  • I like the approach of getting recent images first. A problem we have now is that somehow the UK seems to have stayed in the 2000s (imagewise when looking at some categories mostly with geograph images). Jura1 (talk) 09:34, 8 March 2020 (UTC)[reply]
  • Maybe it's worth doing a property for photographers at geograph (external-id).
    For photographers with many images, an item at Wikidata could be useful as well. Not sure about a good threshold. Jura1 (talk) 09:34, 8 March 2020 (UTC)[reply]
 Comment (Edit conflict) I'm neutral on whether doing this is a good idea, but I do have a few comments on how it should be done:
This is going to be a big job. Good luck! --bjh21 (talk) 11:32, 8 March 2020 (UTC)[reply]

@Multichill: Structured data is a better way of doing things but I don't think it should be used exclusively. The relevant "Geograph images in <county>" cat should be added, as should relevant "<tag> in <county>" cats. With the data itself, the Description field should import the description from Geograph where it is available, not be a clone of the title. More importantly, we need to be able to edit it and I cannot see how to do that. How would I change the description of this (to correct the grammar or add a French description or...)?

The reverse geocoding is polygon-based and seems robust, so it shouldn't introduce errors in the same way as the original upload. The area that should be identified needs to be thrashed out.

  • Wales: Community. All of Wales is covered by communities, which is the lowest tier of the admin hierarchy.
  • England: Civil Parish. Not all of England is in a Civil Parish. The first example on http://edwardbetts.com/geocode/ doesn't have a parish and correctly identifies the next available tier (the district). It would be good if it could identify the town (Shoreham-on-Sea). But Shoreham is an unparished area, which isn't an administrative area but a gap between areas.
  • Isle of Man: Parish
  • Northern Ireland: Townlands?. The counties are the best known subdivisions, but have no admin function (which are the districts). Within the counties, there are obsolete subdivisions of civil parishes and then townlands.
  • Ireland: Townlands?. The counties are split into municipal districts. Below that there are obsolete civil parishes and townlands. Townlands might be the best method for all of Ireland, but will need set-up work.
  • Scotland: ?. User:Edward's tool seems to pick up the Civil Parish. There are two issues with that: Firstly they are defunct with no administrative function. The second is they are not part of a strict hierarchy and never were, and without a hierarchy a proper fall-back is hard. However they are the only unit available below the massive council areas.

I am not sure how much of this structure is available. I expect we have all we need on OSM, Wikidata and Commons for England and Wales. The same is probably true for the Isle of Man. I'm pretty certain we don't have the structure in place for Ireland. OSI and OSNI have full, free data on townlands and civil parishes, so OSM can import that if they haven't already done so. We would also need the relevant Wikidata entries and Commons categories. As for Scotland... it's a mess, and I think a consensus needs to be found on the best approach.

@Edward: I don't get why this found a proper hierarchy then selected the best unit, while this only found one unit?

@Multichill: I'd recommend focussing imagery of England and Wales for initial uploads, as the geocoding structure is more or less ready and will give plenty of examples to resolve technical problems with the template etc. The other areas should wait until the structure is in place (which is likely to take a lot of work for Ireland).--Nilfanion (talk) 11:01, 8 March 2020 (UTC)[reply]

@Nilfanion: There's a problem with importing the Geograph descriptions. The descriptions on Geograph are licensed under CC BY-SA 2.0, just like the pictures. However, wikitext on Commons is required to be dual licensed under CC BY-SA 3.0 and GFDL, while structured data is required to be licensed under CC0. This means that any descriptions on Geograph can't be imported unless they fall below the threshold of originality. Technically the same applies to titles, but they're much more likely to be below TOO. --bjh21 (talk) 11:41, 8 March 2020 (UTC)[reply]
The same will apply to all the original uploads as well. That's also a major reason to keep the description on Commons and not in structured data. If I supply a description myself it is likely to be above the TOO, and on Commons I explicitly agree to our terms. If I place it into the structured data, it's not immediately apparent that I am agreeing to CC0, and I might not be willing to accept those terms, whereas if I provide it on Commons I get attribution rights etc.--Nilfanion (talk) 12:08, 8 March 2020 (UTC)[reply]

Thanks for the quick feedback. I will be traveling this week. When I get back I'll do some improvements based on the responses. Multichill (talk) 20:20, 9 March 2020 (UTC)[reply]

Ok, that took a bit longer than expected. Weird times. To address some of the points brought forward:

  • What to upload: I'll start with small batches anyway to see how it works out. Let's see how it evolves
  • @Bjh21: mentioned the license part. Solved that by adding the needed parts (title and author) as qualifiers to the license and displaying that in the attribution, see for example File:Old barns and renovated building, Restormel Farm - geograph.org.uk - 6000038.jpg. I'm not a huge fan of {{Credit line}}. Do you like how it looks?
  • credit_realname is a thing. Any idea how often it is used right now? Leaning towards just filtering these out because it's slightly unclear if this third party actually agreed to the license.
  • technically I think references (sources) are available, but I'm pretty sure I'll make some people cry if I start using it because the front-end is completely not ready for it. I corrected a couple of mistakes in the coordinates (looks like I mixed up the object and photographer in the original uploads)
  • Precision is still an open point. Do you know in what field it's hiding? What was the conversion formula again to get it from meters to whatever Wikibase is using again?
  • Structured data is new and scary. I'm using the title from Geograph to fill the caption in English. You can update these captions or add captions in another language. These are visible if your user interface is set to that other language. Let's see if in the next version we can also show some alternative languages. I'm staying clear of Geograph descriptions because of mentioned license issues (and also because I need to pull them from a different api)
  • I don't get the categories like Category:Geograph images in Dorset. These are completely different from what we normally do. We have a source category (Category:Images from Geograph Britain and Ireland in this case) and location categories (Category:Dorset where images get pushed down to categories like Category:Upwey, Dorset). What is the purpose of these categories?
  • As for the "<tag> in <county>" cats. I know everyone still loves their precious categories, but no, I'm not going to add these. It might hurt a bit at first (to me too, have been using them for 10+ years), but the end result will be so much better. Take a leap of faith here. Supporting and maintaining the old approach will delay and degrade the new approach.
  • I already had some fun with the reverse geocoding, and Edward also helped a lot. This part is going to be a bit of a puzzle. To explore this a bit I cleaned up Isle of Man (Q9676) based on this info. What needs to happen:
    • OpenStreetMap needs to have a shape for the area in the form of a relation ([8])
    • Wikidata needs to have an item for the area (Lezayre (Q3237456))
    • The relation on OpenStreetMap should have a Wikidata tag linking it to the Wikidata item. Something like http://overpass-turbo.eu/s/RDe can be used to query OpenStreetMap to find areas that don't have a link yet and also SPARQL at https://sophox.org/
    • The Item on Wikidata should have a OpenStreetMap relation ID (P402). Technically not needed if OSM has the link, but makes it easier for us to keep track of areas that are done and areas that need attention with queries like https://w.wiki/Ltd

Multichill (talk) 19:33, 2 April 2020 (UTC)[reply]
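The four linking requirements listed above can be turned into a simple per-area checklist. The following is only a sketch: the `Area` record, its field names and the relation ID in the usage lines are made up for illustration, and real data would have to come from OpenStreetMap and Wikidata queries such as the ones linked above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Area:
    """Hypothetical record of what is known about one administrative area."""
    name: str
    osm_relation: Optional[int] = None      # OSM relation ID, if a shape exists
    wikidata_item: Optional[str] = None     # Wikidata Q-id, if an item exists
    osm_wikidata_tag: Optional[str] = None  # value of the relation's wikidata tag
    wikidata_p402: Optional[int] = None     # OpenStreetMap relation ID (P402) on the item


def missing_links(area: Area) -> List[str]:
    """Return the steps from the list above that are still open for this area."""
    problems = []
    if area.osm_relation is None:
        problems.append("no OSM relation")
    if area.wikidata_item is None:
        problems.append("no Wikidata item")
    # The cross-links can only be checked once both sides exist.
    if area.osm_relation is not None and area.wikidata_item is not None:
        if area.osm_wikidata_tag != area.wikidata_item:
            problems.append("OSM relation lacks matching wikidata tag")
        if area.wikidata_p402 != area.osm_relation:
            problems.append("Wikidata item lacks matching P402")
    return problems


# Usage with a made-up relation ID (123456 is NOT the real ID for Lezayre).
lezayre = Area("Lezayre", osm_relation=123456, wikidata_item="Q3237456",
               osm_wikidata_tag="Q3237456")
print(missing_links(lezayre))
```

An empty list means the area is fully connected in both directions; anything else names the step that still needs attention.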

I took a look at User:GeographBot/Tags and most are done. As for those that do not exist: should they be created? --MGA73 (talk) 07:37, 3 April 2020 (UTC)[reply]
I'm not a native English speaker so some words I'm not sure about. I added a link to name on en.wiki and category on Commons. It is done by simple replacement so if tag is "Foo" the link to enwiki is "en:Foo" and link to Commons is "Category:Foo" / "Category:Foos". There is no checking if "Foos" is an actual word :-) --MGA73 (talk) 08:55, 3 April 2020 (UTC)[reply]
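The simple replacement described above amounts to something like the following sketch. The function name and the returned structure are hypothetical; only the "en:Foo", "Category:Foo" and "Category:Foos" patterns come from the comment.

```python
def tag_links(tag: str) -> dict:
    """Build candidate links for a Geograph tag by plain string replacement.

    No check is made that the targets exist, or that the naive plural
    ("Foos") is an actual word.
    """
    return {
        "enwiki": f"en:{tag}",
        "commons": [f"Category:{tag}", f"Category:{tag}s"],
    }


print(tag_links("Foo"))
```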
Nice! Some things like https://www.geograph.org.uk/tagged/subject:weather are just too abstract for us to be used in depicts (P180). You can just move them up and put a note like that behind it ([9]). Something went wrong with the sorting. That needs to be merged. I'll do that.
I only added the tags that have been used more than 500 times. I'll expand the list. Multichill (talk) 08:45, 4 April 2020 (UTC)[reply]
@Multichill: The sorting is easy to fix once we are done. Just copy to Excel or Word and press sort :-)
@everyone Is the link to en.wiki and Commons any help? --MGA73 (talk) 11:18, 4 April 2020 (UTC)[reply]
@Multichill: There are lots of tags that aren't in the curated subject: prefix. You can get the full list, with frequencies, in the tag_stat table on http://data.geograph.org.uk/dumps/. --bjh21 (talk) 13:45, 4 April 2020 (UTC)[reply]
@Multichill: Precision: This is hidden in the natgrlen and vgrlen fields in the link you gave (the latter is viewpoint_grlen in the database dumps). This represents the total number of digits in the grid reference. A 10-figure grid reference has a precision of 1 m, and removing two digits reduces the precision by a factor of 10, so that a 4-figure reference has a precision of 1 km. One wrinkle: if the use6fig flag is set, that means that whatever the precision, the grid references only have an accuracy of 100 m. use6fig was used for grid references derived from 1:50000 maps, where it was useful to position markers accurately on the map, but the map itself was likely to be quite badly inaccurate. You might find the code from Geograph Update Bot useful: the location_from_row and object_location_from_row functions convert a row of the Geograph database into a {{Location}} or {{Object location}} template. --bjh21 (talk) 11:18, 3 April 2020 (UTC)[reply]
Ok, let's see how this works out. I assume higher than 10 is not in the data? Using the precision guidelines and the info from the data model:
  • 10 -> 1 meter -> 0.00001°
  • 8 -> 10 meter -> 0.0001°
  • 6 -> 100 meter -> 0.001°
  • 4 -> 1000 meter -> 0.01°
  • use6fig -> 100 meter -> 0.001°
Like that? Multichill (talk) 17:01, 4 April 2020 (UTC)[reply]
@Multichill: I'd go for doubling those, since at 60°N (the northern tip of Shetland), 1 m is about 0.00002° in latitude. But yes, essentially that. The Geograph database schema doesn't allow to better than 1 m precision: eastings and northings are both integers. --bjh21 (talk) 18:05, 4 April 2020 (UTC)[reply]
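The mapping in this exchange can be written down as a short sketch. The function name and the `degree_factor` parameter are assumptions for illustration: the default of 0.00001° per metre follows the list above, and passing 0.00002 applies the doubling suggested for northern latitudes. The `use6fig` handling assumes the flag caps accuracy at 100 m without making a coarser reference any finer.

```python
def grid_precision(digits: int, use6fig: bool = False,
                   degree_factor: float = 1e-5):
    """Return (metres, degrees) precision for a grid reference.

    digits is the total number of figures (natgrlen / vgrlen): 10 figures
    means 1 m, and every two figures removed is a factor of 10 coarser.
    """
    if digits % 2 or not 2 <= digits <= 10:
        raise ValueError(f"unexpected grid reference length: {digits}")
    metres = 10 ** (5 - digits // 2)
    if use6fig:
        # Assumption: the reference is only accurate to 100 m, however
        # many figures were recorded.
        metres = max(metres, 100)
    return metres, metres * degree_factor


for figures in (10, 8, 6, 4):
    metres, degrees = grid_precision(figures)
    print(f"{figures} figures: {metres} m, about {degrees:g} degrees")
```

Calling `grid_precision(8, degree_factor=2e-5)` gives the doubled value for an 8-figure reference.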
@Multichill: Geograph images in county: These are explained at User:Faebot/Geograph#C and Commons:Bots/Work_requests/Archive 7#Project C: Adding UK counties/district categories. As far as I can tell, they were intended to make proper categorisation (into "subject in county" categories) easier without overwhelming the top-level county categories. --bjh21 (talk) 11:32, 3 April 2020 (UTC)[reply]
Ok. So for the intersections. Multichill (talk) 17:01, 4 April 2020 (UTC)[reply]
@Multichill: Attribution: I think your use of qualifiers on the licence is a sensible way to represent this in structured data. I'm not attached to {{Credit line}}: It's just the only way I found to represent author name and title separately. Having them appear in the licence template is (to me) obviously better. I don't think you need to duplicate the licence link there, though. Personally, I'd put the title in italics (as I did in {{Credit line}}) rather than in quotation marks, but that's just a matter of taste. Leaving out images with credit_realname=1 seems like a good plan. Checking the latest database dump, about 0.5% of relevant images (ID > 1800000) have the flag set. --bjh21 (talk) 20:09, 3 April 2020 (UTC)[reply]
As far as I can tell I just have "realname" in the api and no way to distinguish between the person who needs to be credited and the person who uploaded it. Would have to do a second api call to figure out the name attached to the user id. I always use "realname" so I should be good on the attribution side. Multichill (talk) 17:01, 4 April 2020 (UTC)[reply]
@Multichill: credit_realname is in the gridimage_extra table of the database dumps. --bjh21 (talk) 18:05, 4 April 2020 (UTC)[reply]
Not using any dumps, using the api. Multichill (talk) 19:36, 4 April 2020 (UTC)[reply]

I did a couple more test uploads, see Special:ListFiles/BotMultichill (example 1, 2 & 3). Multichill (talk) 21:46, 4 April 2020 (UTC)[reply]

@Bjh21: what do you think? Multichill (talk) 19:53, 6 April 2020 (UTC)[reply]
@Multichill: In no particular order: I think the precision on the co-ordinates of those examples should ideally be set to 0.0002 rather than 0.0001 (to allow for the fact that the British Isles are at about 55°N). The lack of references is annoying but tolerable. It's unfortunate that the data model requires you to specify a language for the titles. I'll need to do a fair amount of work on Geograph Update Bot to handle the new files, but that's not a problem. The edit summary for the upload should omit the "download" parameter. The rendering of the file description is pretty good. --bjh21 (talk) 22:39, 6 April 2020 (UTC)[reply]
@Bjh21: I had some bad experiences using non-standard precision because it doesn't seem to work very well in the user interface, so I'd rather keep it on the 001 format unless you feel really strongly about this.
Wait, what are you actually planning to do with Geograph Update Bot? My aim is to get everything right the first time, except for a small percentage of reverse geocoding cases where another pass is needed (next topic). Multichill (talk) 20:07, 7 April 2020 (UTC)[reply]
@Multichill: I don't care that much about getting the precision right, though it would be nice if the user interface could actually handle everything that the data model says is allowed. As for Geograph Update Bot, its primary role is to propagate changes from Geograph to Commons. In particular, it currently propagates newly-uploaded high-resolution images and changes to geocoding. Propagating high-resolution images requires parsing the wikitext on Commons to ensure that the new image is available under the same licence (in particular with the same author name and title). Once most of the Geograph images on Commons have their metadata in structured data rather than wikitext, Geograph Update Bot will need to be able to read and update that as well. --bjh21 (talk) 22:41, 7 April 2020 (UTC)[reply]
Sounds good. I see your bot is Pywikibot based. You can just copy the relevant code parts from https://github.com/multichill/toollabs/tree/master/bot/commons . The json format is not hard to work with, see for example for File:Fishergate Tower - geograph.org.uk - 6000247.jpg https://commons.wikimedia.org/wiki/Special:EntityData/M88953446.json . Should be easier than parsing wikitext. Multichill (talk) 17:41, 8 April 2020 (UTC)[reply]
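As a rough illustration of why the JSON is easier to handle than wikitext, here is a sketch that reads a caption and the item values of one property from an entity. The `SAMPLE` dictionary is a hand-written stand-in shaped like the usual Wikibase entity JSON, not a copy of the real M88953446 data, and the depicts value in it is a placeholder Q-id. The helper names are hypothetical.

```python
# Hand-written stand-in for what Special:EntityData returns (after JSON
# decoding); the statement value is a placeholder, not real data.
SAMPLE = {
    "entities": {
        "M88953446": {
            "type": "mediainfo",
            "id": "M88953446",
            "labels": {"en": {"language": "en", "value": "Fishergate Tower"}},
            "statements": {
                "P180": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P180",
                            "datavalue": {
                                "type": "wikibase-entityid",
                                "value": {"entity-type": "item",
                                          "id": "Q999999999"},
                            },
                        },
                        "type": "statement",
                        "rank": "normal",
                    }
                ]
            },
        }
    }
}


def caption(entity: dict, lang: str = "en"):
    """Return the caption (label) in the given language, or None."""
    labels = entity.get("labels") or {}
    return labels.get(lang, {}).get("value")


def item_values(entity: dict, prop: str) -> list:
    """Return the Q-ids used as main values for one property."""
    # An entity without statements may carry an empty list instead of a
    # dict, so fall back to an empty dict in that case.
    statements = entity.get("statements") or {}
    ids = []
    for statement in statements.get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" / "somevalue" snaks
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


entity = SAMPLE["entities"]["M88953446"]
print(caption(entity), item_values(entity, "P180"))
```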
@Multichill: Thank you. That was very helpful. Geograph Update Bot can now update structured data, and I've applied for that to be added to its tasks. --bjh21 (talk) 21:31, 13 April 2020 (UTC)[reply]

Reverse geocoding

@Multichill: The largest (and potentially most interesting) aspect is the reverse geocoding. The methodology looks like it will give good output, if it has good data feeding into it. It ought to be sorted before the mass uploads start to avoid any of the noise which happened first time around.
Can we set up a project page showing the various links for each location? This would include the OSM relation, Wikidata item, Commons category, and its parent location (like the county). That allows for a group effort to get the underlying data and links right, and makes data validation possible. A systematic approach here would help future projects in the UK and Ireland, and give a methodology for other countries too.
The output structured data would be improved by adding more info to the location field. eg 1 would be better if it stated "Lakes, Cumbria" instead of just "Lakes". The county gives useful context and should be immediately visible on the file page.
If the precision is too low (some Geograph files like this have 1km precision), it should not attempt to identify the location precisely. It's better to go for a higher level region than guess at the exact location.
What happens if there is no covering region? For instance, what would happen with this? I don't have a clue what is right there--Nilfanion (talk) 18:43, 7 April 2020 (UTC)[reply]
One example of bad location: File:Talsarnau Chapel - geograph.org.uk - 6000168.jpg. Why Snowdonia and not Talsarnau (which Edward's tool correctly picked up [10])?--Nilfanion (talk) 18:59, 7 April 2020 (UTC)[reply]
@Nilfanion: yes, the reverse geocoding is going to be a big fun puzzle. We should do a first pass to at least get a level of completeness, but we'll have to accept that for some files a second pass will be needed to add a more specific location.
Will reply in more detail later. In the mean time you might want to have a look at https://w.wiki/MHk . Replace parish of the Isle of Man (Q12780138) with whatever is relevant in the area. Most items should have Commons category (P373) and OpenStreetMap relation ID (P402). This will help the reverse geocoder make the right connections. Multichill (talk) 20:07, 7 April 2020 (UTC)[reply]
OK that works: Can export that to wikiformat. Reason I want to do so, is easier to see problems outside the queries and if on wiki can be shared. eg Running the query for community (Q2630741) shows 15% are missing. civil parish (Q1115575) shows 7% too many. As for townland (Q2151232) .. it only returns 2% of the total. A lot of work needed.--Nilfanion (talk) 20:47, 7 April 2020 (UTC)[reply]
@Nilfanion: Bit longer reply. Lakes is the correct label for Lakes (Q2510182) so that shouldn't be changed. Would be nice to get the description ("civil parish in Cumbria, UK") somehow, maybe when hovering over it with a mouse?
I asked Edward about the precision suggestion.
Some data or logic has been changed between when the file was uploaded and now. That's why currently the tool returns a different location than two days ago. Still improving the code.
Have a look at User:Multichill/Parishes of the Isle of Man for an example on how to setup a list. Multichill (talk) 18:11, 8 April 2020 (UTC)[reply]
The output on the Commons page should give the most useful context immediately to the casual human reader. Hovering over is the wrong approach. The way I'd get that would be to identify and present both wikidata entries for the precise and more general location. So not "Lakes, Cumbria" but "Lakes", "Cumbria". The second entry will be from one tier of the admin hierarchy and would be the fallback level if the precise one fails - so it still would have "Cumbria" even if it couldn't find "Lakes".
If you re-run with the existing code, assume it gives the right location now?--Nilfanion (talk) 19:23, 8 April 2020 (UTC)[reply]
How to show the location in the information box is purely a design thing. I was thinking of some ways to show the location field some time ago, but didn't really follow up. It might be nice to show a breadcrumbs-like approach, so something like "Lakes (Q2510182) (South Lakeland (Q1428514) -> Cumbria (Q23066) -> North West England (Q47967) -> England (Q21) -> United Kingdom (Q145))" (without the Q ids included). It should go up the tree and stop when located in the administrative territorial entity (P131) is the same as country (P17) on the starting point (or the first item up the tree that has it set). Maybe @Jarekt: can update the template?
Didn't actually run your assumption, but the geocoder is used one on one so that's going to be the output.
I've been testing some queries to find items to update with the focus currently on England. It looks like:
  • Every English parish exists on OpenStreetMap and has a relation
  • Every English parish has an item on Wikidata, but sometimes we have two items: one for the parish and one for the village, with a slightly unclear distinction (Cebuano imports)
  • Every English parish has a category here on Commons usually combined with the main village (I created those back in the days based on the English Wikipedia)
So we need to get the lot connected. All the (current) parish items on Wikidata should have Commons category (P373) and OpenStreetMap relation ID (P402). https://w.wiki/MN2 is still quite a lot to do here. Multichill (talk) 20:16, 8 April 2020 (UTC)[reply]
I think that sounds like right approach. I've done full analysis of the Welsh communities. There are 877 Welsh communities, but only 776 Wikidata entries with community (Q2630741). The discrepancy results from 133 communities missing entries, 18 with duplicate entries and 12 non-communities with Q2630741.
I have the full information for Wales (and will put it on-wiki tomorrow), and can then start work on correcting it. The same methodology will flag the precise problems for any other item we want to look at.
One I'm confused by... Tenby (Q558878) has community (Q2630741) but isn't found by this query?--Nilfanion (talk) 21:09, 8 April 2020 (UTC)[reply]
That's because of the rank, see d:Help:Ranking. Multichill (talk) 18:32, 9 April 2020 (UTC)[reply]
All the entries are equal rank, so it appears to be because the community is the third entry in P31. eg Haverfordwest was found correctly, is that because it was 2nd of 3? In which case how can we correct the queries?--Nilfanion (talk) 18:56, 9 April 2020 (UTC)[reply]
[11] :-)
The query I shared below ( https://w.wiki/MUY ) includes everything. Multichill (talk) 19:01, 9 April 2020 (UTC)[reply]
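For readers unfamiliar with ranks: queries using the direct ("truthy") form of a property only see the best-ranked statements, meaning preferred ones if any exist, otherwise normal ones, and never deprecated ones. The sketch below mimics that rule on made-up data; the statement list is hypothetical and "Q_OTHER" is a placeholder, so it does not show what was actually on the Tenby item.

```python
def truthy(statements: list) -> list:
    """Return the statements a direct (truthy) query would match."""
    live = [s for s in statements if s["rank"] != "deprecated"]
    preferred = [s for s in live if s["rank"] == "preferred"]
    # Preferred statements hide normal ones; without any preferred
    # statement, all normal ones are returned.
    return preferred or live


# Hypothetical instance-of values for one item.
p31 = [
    {"value": "Q_OTHER", "rank": "preferred"},
    {"value": "Q2630741", "rank": "normal"},  # community
]
print([s["value"] for s in truthy(p31)])
```

With one statement marked preferred, the community value is hidden from a truthy query even though it is still on the item; a query that walks the full statements sees both.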
I suspect the reason we have too many entries for English parishes is the defunct parishes incorrectly having Q1115575 (and will be able to confirm when I run that data).--Nilfanion (talk) 21:14, 8 April 2020 (UTC)[reply]
@Nilfanion: defunct parishes should have civil parish (Q1115575) too, should just have end time (P582) as a qualifier. Multichill (talk) 18:32, 9 April 2020 (UTC)[reply]
@Multichill: Will the addition of P582 stop them from being identified? And how do we apply that qualifier if we do not know the end date? Also, see User talk:Nilfanion/Areas/Communities for the breakdown of anomalies for Wales.--Nilfanion (talk) 18:38, 9 April 2020 (UTC)[reply]
The original query already filtered it out. Updated it a bit, see https://w.wiki/MUY . Multichill (talk) 18:54, 9 April 2020 (UTC)[reply]

Multichill, about the "Lakes (South Lakeland -> Cumbria -> North West England -> England -> United Kingdom)" sequence: it would not be hard to create but it would be expensive, since in this case I would have to load 5 entities, and some geographical entities can be big. Bio infoboxes and {{Wikidata/FamilyTree}} are doing something similar and some pages show errors since Lua takes too much time. That is why I usually try hard not to show any data that requires loading multiple entities. However, if the number of entities is like in this example and nobody calls the template a few hundred times on one page, then it might be fine. There is already Module:Taxontree, which loads the full taxon tree by following the parent taxon (P171) property. This would be similar but would follow a different property. --Jarekt (talk) 02:17, 9 April 2020 (UTC)[reply]

@Jarekt: the number of levels (entities) will be like this and generally less. Only some locations in Germany might have more levels. Would be great if you could build this! Multichill (talk) 18:38, 9 April 2020 (UTC)[reply]
Multichill I got a little prototype at Module:PropertyChain:
--Jarekt (talk) 21:51, 9 April 2020 (UTC)[reply]
@Jarekt: that looks nice! Let's continue on Module talk:PropertyChain. Multichill (talk) 08:30, 10 April 2020 (UTC)[reply]
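The breadcrumb walk discussed in this thread can be sketched outside Lua as follows. This is not the code of Module:PropertyChain; the `ITEMS` dictionary is a small in-memory stand-in that mirrors the example chain quoted above, and the function names are hypothetical. The walk follows P131 upwards and stops once it reaches the country (P17) of the starting item, or of the first item up the tree that has one.

```python
# In-memory stand-in for the items in the quoted example chain.
ITEMS = {
    "Q2510182": {"label": "Lakes", "P131": "Q1428514", "P17": "Q145"},
    "Q1428514": {"label": "South Lakeland", "P131": "Q23066", "P17": "Q145"},
    "Q23066": {"label": "Cumbria", "P131": "Q47967", "P17": "Q145"},
    "Q47967": {"label": "North West England", "P131": "Q21", "P17": "Q145"},
    "Q21": {"label": "England", "P131": "Q145", "P17": "Q145"},
    "Q145": {"label": "United Kingdom"},
}


def breadcrumbs(start: str, items: dict, max_depth: int = 10) -> list:
    """Follow P131 upwards from start and return the labels visited."""
    chain = []
    seen = set()
    country = None
    current = start
    # max_depth and seen guard against long or circular chains, which
    # also keeps the number of loaded entities bounded.
    while current is not None and current not in seen and len(chain) < max_depth:
        seen.add(current)
        item = items.get(current, {})
        chain.append(item.get("label", current))
        if country is None:
            country = item.get("P17")
        if current == country:
            break
        current = item.get("P131")
    return chain


def format_breadcrumbs(chain: list) -> str:
    """Render the chain as 'Place (Parent -> ... -> Country)'."""
    if len(chain) == 1:
        return chain[0]
    return f"{chain[0]} ({' -> '.join(chain[1:])})"


print(format_breadcrumbs(breadcrumbs("Q2510182", ITEMS)))
```

The depth limit reflects the cost concern raised above: each step corresponds to loading one more entity.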

The duplication caused by Cebuano looks to be problematic. Both enwiki and Commons usually don't separate the meanings, for very practical reasons relating to splitting the content. cebwiki is often the only Wikipedia to make the distinction between main settlement and parish. I can't see how to readily overcome that - technically they are separate but it's not productive to make a distinction?--Nilfanion (talk) 16:22, 10 April 2020 (UTC)[reply]

@Nilfanion: I started Commons:Geograph Britain and Ireland/Reverse geocoding to keep an overview of the reverse geocoding work. We can keep track of what parts need attention and what parts are done. We can use the talk page to discuss things related to reverse geocoding. Can you add the Wales part for Commons:Geograph Britain and Ireland/Reverse geocoding?
About the Cebuano items: That's a can of worms. If you go into that it will suck up a lot of time and energy. If it's even slightly similar to the Netherlands, then no clear-cut single solution exists for all of these. You have to investigate and clean up village (parish) by village (parish). I think our best course of action now is to add the missing Commons category (P373) and OpenStreetMap relation ID (P402) links to the relevant items to improve geocoding quality and just leave the rest in the existing state. Multichill (talk) 09:33, 11 April 2020 (UTC)[reply]
Will do. Regarding the Cebuano items, I think simplest solution is to use the single entry (ie the "village") for 99.9% of things, and leaving the second entry for as little as possible. Anything else requires a colossal amount of work for zero gain.--Nilfanion (talk) 14:02, 11 April 2020 (UTC)[reply]

Almost ready to upload

I think most points have been addressed. I spent quite a bit of time on the reverse geocoding because that was one of the main possible improvements. For England, Wales (not completely done), Scotland and Isle of Man, the robot will categorize images on the parish/community level. For Ireland it will be on the county level, but that's a relatively small number of images. Did I miss anything? Next step is to wake up User:GeographBot and I would like not to have huge discussions on Commons:Bots/Requests. Multichill (talk) 21:28, 3 May 2020 (UTC)[reply]

I think this needs some discussion with the wider community, possibly at the Village Pump rather than BRFA. The original Geograph batch upload wasn't exactly uncontroversial, and this is 2½ times the size. I also think wiping or archiving Commons:Batch uploading/Geograph and replacing it with a clear description of what's happening now would be good. --bjh21 (talk) 23:14, 3 May 2020 (UTC)[reply]
I agree this needs community discussion, at the VP before going to BRFA. Unfortunately as things stand I will have to oppose this. We will have a significant proportion of files placed in the wrong location category (cross-boundary images). I note that the problem of miscategorised files is precisely what caused the controversy the first time around.
The template needs significant tweaking to avoid misleading file descriptions on the affected files, but that is less important.--Nilfanion (talk) 05:46, 4 May 2020 (UTC)[reply]
I wonder how many images are "cross-boundary images". Are we talking about a couple of images per Geograph dataset or is that an issue affecting a significant number of images? If it is just a few images then we can manually adjust the categories and file descriptions. --Jarekt (talk) 16:16, 4 May 2020 (UTC)[reply]
The Devon dataset should give an indicator as to the proportion of cross-boundary files, and a decent guess would be 1%. The problem with Geograph is the sheer size of the dataset. A 1% error rate means tens of thousands of images needing correction, and finding those errors means sifting through millions. It is simply not practical to correct such things manually.
If the bot simply uses the subject location to feed the category and file description, then the correct information is supplied automatically and no manual correction is needed.
It still won't be perfect as there will be occasional problems on Geograph itself. That will be a manual task as it is impossible to automate that.--Nilfanion (talk) 16:53, 4 May 2020 (UTC)[reply]
I bet the bot will do a much better job than users manually uploading the files. Categories are changed all the time. It's a normal procedure. Someone does the best he or she can and then someone else changes the category a bit and eventually it ends in a permanent category. Well as permanent as it can get on Commons.
I bet in many cases it is not a big problem if the category is a bit wrong. For example I needed a picture of a hospice one day and I did not care where it was. I just needed a hospice. And if I use a photo I take the information from the description not the category.
I think it is overkill to refuse to let someone upload files to Commons just because the category is not 100% correct. If wrong categories are a big problem then let's just put it all in "The United Kingdom" and the problem is solved! A pretty bad solution if you ask me.
You could even argue that we do not need categories on Commons. With all the relevant metadata users can define their own searches. And with coordinates it is possible to place pictures on a map and users can zoom in to the location they want to see.
We should of course do our best to get it right but let's not be extreme. --MGA73 (talk) 20:03, 4 May 2020 (UTC)[reply]
But we can so easily get it right by a simple tweak! So why on earth wouldn't we.--Nilfanion (talk) 20:43, 4 May 2020 (UTC)[reply]
The subject location is already being used. If the photographer and object coordinates return different locations, both get added. I added this after the Devon run on your request. Problem solved? Multichill (talk) 09:50, 5 May 2020 (UTC)[reply]
No - Please add the subject location, regardless of whether it is the same as the photographer or a different location. That solves it :) Also use the subject location, not the photographer location, for the category.--Nilfanion (talk) 09:52, 5 May 2020 (UTC)[reply]
I'm not sure it's in line with Commons:Depicts. If you manage to get consensus in favor of it (on Commons talk:Depicts), I'll do it. Multichill (talk) 10:21, 5 May 2020 (UTC)[reply]
@Multichill: How do you mean? The depicted location is NOT the same as the camera location, so cannot be covered by it. They are different locations for different purposes.--Nilfanion (talk) 10:51, 5 May 2020 (UTC)[reply]

The description is also a problem at the moment: It's not obvious how to go about editing it. Please see Template talk:Geograph from structured data#Description.--Nilfanion (talk) 08:31, 5 May 2020 (UTC)[reply]

  • I think with structured data, the images get quite a lot of metadata that wouldn't be there otherwise. It's nice to also try to get location categories done, but if these aren't seen as helpful, I suppose one could skip that part. Topical categories are generally much more important than location based categories, at least when coordinates are present. Jura1 (talk) 16:48, 6 May 2020 (UTC)[reply]
@Jura1: As proposed, these files do not have any topical categories on upload. Multichill rejected the idea of adding <tag> in <county> categories. IMO that would be the best way to add the topic, as just adding the <tag> category (ie Category:Churches) floods those high-level categories with thousands of files.
As for the location categories, they are useful in their own right, and also facilitate better topic identification. eg If we know a photo shows a church, and we know which village it's in, that makes it easy to ID the church. That's why I want to ensure the subject location (and not the camera location) is applied as a category.--Nilfanion (talk) 07:53, 7 May 2020 (UTC)[reply]

@Jarekt: , @Multichill: I have now analysed 8,269 images for 79 Devon parishes. These are the first alphabetically and seem a representative sample. 599 (7%) of these have a mismatch between the location of creation (P1071) tag and the categories. These are the "cross-boundary" files, as the categories are based on the subject. Scaling this up to ALL Geograph files this gives a crude estimate of 450,000 discrepancies. That is worse than I thought, and is far beyond what could reasonably be corrected manually. This can mostly be avoided if the bot takes the category from the subject location. Using the subject location also reduces the odds of the geocoding failing for offshore images, as the subject is probably on land even if the camera isn't. It also is a strong indicator that adding the location to the P180 will add value.--Nilfanion (talk) 11:43, 7 May 2020 (UTC)[reply]
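
For anyone wanting to check the scaling in the comment above, a minimal Python sketch follows. The sample figures (599 of 8,269) are quoted from the comment; the total number of Geograph files is not stated there, so `ASSUMED_TOTAL_FILES` is only a placeholder chosen to be roughly consistent with the quoted 450,000.

```python
# Sketch of the back-of-the-envelope scaling described above.
# ASSUMPTION: the total file count is not given in the comment;
# 6.2 million is a placeholder, not a figure from Geograph.
ASSUMED_TOTAL_FILES = 6_200_000


def estimate_mismatches(sample_mismatches, sample_size, total_files):
    """Scale a sample mismatch rate up to a whole dataset.

    Returns (rate, estimated_count).
    """
    rate = sample_mismatches / sample_size
    return rate, round(rate * total_files)


if __name__ == "__main__":
    rate, estimate = estimate_mismatches(599, 8269, ASSUMED_TOTAL_FILES)
    print(f"sample mismatch rate: {rate:.1%}")
    print(f"estimated mismatches: {estimate:,}")
```

The estimate is only as good as the assumption that the first 79 Devon parishes are representative of the whole of Britain and Ireland.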

  • Nilfanion Okay if we for one second start all over. If we have a picture of a church and that church is located on X-street and the photographer is taking the photo from the distance from Y-road. I'm an old school dude, so I would make it like this:
Description The "Holy Church of Wiki" on X-street (taken from Y-road), Wiki City, England.
Date
Source Own work
Author MGA73
What info would you like to be stored and how (categories, Pxxx etc. Anything you think)? You can ignore camera type, date, source, author and license. --MGA73 (talk) 12:10, 7 May 2020 (UTC)[reply]
I'd go for a similar description with my own photo, only real difference is I'd add a county as well as the town/city. It also suggests we should be smarter with the bot-applied description. The bot is only extracting the Geograph title, but as it is finding the location it could also add this to the description. The vast majority of own-work uploads give a location in the description text, so why don't we do the same here?
If the church doesn't have its own category/WD entry then I'd have:
  • Any topical categories (eg it's a Holy Church of Wiki, it was built in 2009 etc). This probably means <tag> in <location> intersect categories.
  • The most precise location category available (the street, or the city, or the county).
  • The most relevant P180 entries (it's a church)
  • A Pxxx entry for the church's location
  • Camera coordinates when available
Ideally the church would have its own category and WD entry. In this case all of this can be stored on the church's category and WD entry. The category and WD entry would include all the other things, including a P276 for its location. All the file then has is the category and P180 for the church.
That's the information I would expect to have (and this would apply to a photo of any subject). There will be other supplementary information as well, such as: If the photo is black and white, the camera location, it was taken at night, it is an FP, etc. That info could be stored as Wikidata properties or categories. This information would be recorded on the file, whether it has a category or not, as it is independent of the subject.
In terms of Geograph, the bot can determine the location of the subject (it's in Wiki-City) and the type of subject (it's a church) from the information supplied on Geograph. It cannot identify the specific subject shown, but by providing location and type a human can readily work out the precise subject (the Holy Church of Wiki in Wiki-City). I'm thinking the bot could supply the following initial information:
Description <Geograph title> (<subject location>, <county>, <country>)
Date
Source Geograph
Author Geographer
With the following categories and Pxx properties: Geograph tags -> relevant P180 entries; Subject location -> location category and Pxxx statement; Geograph tags -> <tag> in <county> categories. All other properties, including the P1071, are nice to have, but less important to the process of identification.--Nilfanion (talk) 13:25, 7 May 2020 (UTC)[reply]
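
The description format proposed above could be sketched roughly as follows. This is only an illustration of the "<Geograph title> (<subject location>, <county>, <country>)" pattern; the function name and the behaviour when a location part is missing are assumptions, not anything GeographBot is known to do.

```python
def build_description(title, subject_location=None, county=None, country=None):
    """Append whichever location parts are known to the Geograph title.

    Hypothetical helper: unknown parts (None or empty) are skipped, and
    the bare title is returned when no location is known at all.
    """
    parts = [p for p in (subject_location, county, country) if p]
    if not parts:
        return title
    return f"{title} ({', '.join(parts)})"


if __name__ == "__main__":
    # Invented example values, following the "Holy Church of Wiki" example.
    print(build_description("Holy Church of Wiki", "Wiki City", "Wikishire", "England"))
    print(build_description("Holy Church of Wiki"))
```

Skipping unknown parts rather than failing means a geocoding miss would still leave the file with its plain Geograph title.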
Nilfanion okay for a moment I thought you would add the location to P180 but that should be The Holy Wiki Church? At least that is how I read Commons:Depicts.
Just like if we have 2 famous paintings hanging on the same wall they would have same location but they should not have the same Depicts.
I think if we were to create Commons today we would not put several topics in the same field. We would probably have one “location of object” and “location of photographer” and “location of creation” if it was a painting for example created in Italy but now located in England. --MGA73 (talk) 18:05, 7 May 2020 (UTC)[reply]
If the location of the subject is used for any location-based category, and is displayed on the description page in some form I am happy.
I think this whole saga highlights the need for an extra type of location within Wikidata. We sorely need "location of depicted object": That is the location of an object depicted in an artwork, whether that work of art is a photo, a painting or a sculpture. "Location of creation" is problematic itself, as an artwork can be started in one spot and finished in another. For instance a painting that is sketched at the location but finished in the studio, or a photo that receives creative post-processing at the photographer's house.--Nilfanion (talk) 19:03, 7 May 2020 (UTC)[reply]
Yes it can be a mess. But if what you in fact are suggesting is that we create "location of depicted object" then perhaps the place for it is Village pump? I can see benefits of it but I think it will make some problems if we try to implement it here via the back door. --MGA73 (talk) 19:14, 7 May 2020 (UTC)[reply]
I agree on that, a general consensus is needed to get an additional property approved.
With regards to this batch upload, I suggest: The subject's location needs to be used to avoid the massive problem I highlight above. This can be done by using the subject's location as a category, and ideally adding it to the description as well. There is NO need to add it to any wikidata property. However if/when the new property is approved, it can then be added to the bot's task or done separately.--Nilfanion (talk) 20:08, 7 May 2020 (UTC)[reply]

The above has got a bit confusing for me. Is it planned to add these to topical categories or not? WT79 (talk) 10:17, 10 July 2020 (UTC)[reply]

Great Bath (Bath)

The bot has uploaded dozens if not hundreds of photos of (the city of) Bath to Category:Great_Bath_(Bath).

That category says *The Great Bath is the main bath at the Roman baths in Bath, Somerset*. Photos of the train station or local supermarket do not belong there. — Preceding unsigned comment added by Stevekeiretsu (talk • contribs) 22:43, 24 March 2022 (UTC)[reply]

I've dealt with one part of the problem, which was that Aquae Sulis (Q622992) had a Commons category (P373) of "Great Bath (Bath)", which was obviously wrong. I've changed it to "Aquae Sulis" to match the sitelink. That won't help very much, though: it'll just cause the pictures to end up in Category:Aquae Sulis instead. --bjh21 (talk) 10:01, 25 March 2022 (UTC)[reply]
@Stevekeiretsu: I think this is fully fixed now. I've added a rough end time (P582) to Aquae Sulis (Q622992) and the reverse geocoder now emits Category:Bath, Somerset instead. --bjh21 (talk) 10:23, 25 March 2022 (UTC)[reply]
Nice one :) Stevekeiretsu (talk) 20:18, 25 March 2022 (UTC)[reply]

The Circus, Bath

Also in Bath, I see a lot of images from the centre of Bath were put into c:Category:The Circus, Bath and tagged with depicts (P180) and location of creation (P1071) = The Circus (Q893229). I've cleared them out from the category (though not changed the structured data), but perhaps it would be good to make sure that this won't happen again. The Circus (Q893229) is a particular circle of terraced houses in Bath, not an area of the city. Jheald (talk) 12:11, 17 March 2023 (UTC)[reply]

Query for cases: https://w.wiki/6TBb Currently 410 examples with location of creation (P1071) = The Circus (Q893229) that are not / no longer in c:Category:The Circus, Bath.
Earliest was M101697558 (17 March 2021), most recent M129635686 (16 March 2023) -- Jheald (talk) 12:40, 17 March 2023 (UTC)[reply]
The underlying problem here is that Bath isn't an administrative area in its own right; it's just part of the unparished area of Bath and North East Somerset, which OSM has at admin_level=6. And the reverse geocoder doesn't generally return categories at that level (the City of London being a special exception). So it has to fall back on less reliable approaches. The fallback is to search Wikidata for nearby instances of human settlement (Q486972). And The Circus (Q893229) is one of those, by way of housing development (Q5916199) → residential area (Q674950) → neighborhood (Q123705) → human settlement (Q486972). The reverse geocoder prefers closer items, but with a distance limit dependent on the type of item, so anywhere closer to The Circus (Q893229) than to Bath (Q22889) (but within 1km of The Circus (Q893229)), it will prefer The Circus (Q893229). Unfortunately quite a lot of Bath is within 1km of The Circus (Q893229).
As far as I can see, this isn't a problem of having the wrong data in Wikidata (unlike the last one). The problem is that the reverse geocoder doesn't really know how big The Circus (Q893229) is and its guess is very wrong. I think there are two reasonable ways to solve this: make the reverse geocoder more clever (Edward?), or give the reverse geocoder better options by making sure that other instances of human settlement (Q486972) in Bath are properly represented and located in Wikidata. --bjh21 (talk) 14:01, 17 March 2023 (UTC)[reply]
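
To make the failure mode described above concrete, here is a rough Python sketch of a "nearest item wins, subject to a per-type distance limit" rule. It is not the reverse geocoder's actual code. The type names, the 10 km limit for cities and the coordinates are illustrative assumptions; only the 1 km limit for The Circus comes from the comment above.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    qid: str
    label: str
    lat: float
    lon: float
    kind: str  # hypothetical type label, e.g. "neighbourhood" or "city"


# ASSUMPTION: illustrative limits in km. Only the 1 km figure is taken
# from the discussion; the real geocoder's table of limits is not shown.
MAX_DISTANCE_KM = {"neighbourhood": 1.0, "city": 10.0}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pick_nearest(lat, lon, candidates):
    """Return the closest candidate that lies within its own type's limit."""
    best, best_distance = None, None
    for candidate in candidates:
        distance = haversine_km(lat, lon, candidate.lat, candidate.lon)
        if distance > MAX_DISTANCE_KM.get(candidate.kind, 0.0):
            continue  # too far away for an item of this type
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    return best


# Approximate, illustrative coordinates only.
CANDIDATES = [
    Candidate("Q893229", "The Circus", 51.3861, -2.3638, "neighbourhood"),
    Candidate("Q22889", "Bath", 51.3811, -2.3590, "city"),
]

if __name__ == "__main__":
    # A point near, but not at, The Circus is still attributed to it.
    print(pick_nearest(51.3850, -2.3640, CANDIDATES).label)
```

Under a rule like this, the item has no notion of its own extent: a small terrace and a whole neighbourhood get the same catchment radius if they fall under the same type.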
I've changed the SPARQL query used by the geocoder to avoid items that are architectural ensemble (Q1497375). edward (talk) 09:31, 19 March 2023 (UTC)[reply]

Descriptions

In the original block of geograph uploads, the whole of the explanatory text for each image was dropped into the Description field of the {{Information}} template. But now, with the structured data uploads, only the actual title is appearing there. The explanatory text can be very useful for searches, often containing key terms like street-names that can be incredibly helpful for refining the categorisation; as well as often being valuably informative and interesting text about the objects depicted in its own right.

Would it be possible to review the recent uploads and add the further text, perhaps again (as per before) as direct wikitext in the information template? Thanks, Jheald (talk) 12:20, 17 March 2023 (UTC)[reply]

I dropped it because it often contained whole stories and other irrelevant information and no, I'm not adding wikitext at all. Multichill (talk) 12:20, 18 March 2023 (UTC)[reply]
@Multichill: These 'whole stories' are often informative and useful. They frequently can also be the only place that describes what the image actually is of, if it has been given a title that is jokey or arcane. They may contain words that make all the difference between finding or not finding the image with search. And if not having them means I have to click through to Geograph for every image to see if it has extra information, then that is far from irrelevant to me as an editor or a user.
It was reasonable as a first stab to make {{Geograph from structured data}} only draw from structured data; but if fields can't be annotated or added to or overwritten with additional wikitext, that probably breaks policy, and may not be helpful in encouraging editors to love SDC.
So on both counts I think it would be useful to reconsider. Jheald (talk) 20:07, 18 March 2023 (UTC)[reply]

Non-alphabetic characters

Any chance of a tweak so that non-alphanumeric characters (including common punctuation marks such as ?, #, & etc.) are imported unchanged, rather than all being converted to a caret "^"? Dave.Dunford (talk) 10:12, 22 August 2023 (UTC)[reply]


Is this still valid? I can't see this option in geograph. Are there recommended tools that can do it (https://url2commons.toolforge.org/ ?) -Bogger (talk) 09:28, 4 October 2023 (UTC)[reply]

@Bogger: The link is still there for me. Go to for example https://www.geograph.org.uk/photo/7613483, scroll down to below the description and there's a line that says "© Copyright Richard Law and licensed for reuse under this Creative Commons Licence." In that line, "reuse" is the link in question. It's also repeated in the row of links below with the label "Licensing". Would referring people to that link be more obvious? --bjh21 (talk) 10:23, 4 October 2023 (UTC)[reply]

Geographbot still using incorrect and quite bizarre categories

I'm still finding Geographbot is fairly routinely using incorrect and frankly, quite bizarre, geographical categories for its uploads. The volumes are such that I'm struggling to keep up with correcting these in just one small area, and looking at other areas, it seems that people have just given up. Let me talk you through an example.

Geographbot seems to have decided to categorise a large swathe of Reading as being 'in' Reading Abbey. Now in real life Reading Abbey is an architectural and historical site, it isn't a suburb or settlement. It is represented by a Wikidata item of Wikidata:Q2482838, which shows it to be an instance of Abbey and Church Ruin and nothing else. If you look on OpenStreetMap, you will see it (under the name Abbey Ruins (599374827)) has quite a constrained area of roughly 100m x 30m. On OpenStreetMap, Abbey Ruins (599374827) is also linked to Q2482838 on Wikidata.

Yet GeographBot has just decided that a modern office block about 500m to the south-east of the site is part of Reading Abbey (Hanover House). It puts the photo in Category:Reading Abbey and uses Q2482838 as the 'location of creation' in the structured data. This despite the fact that Commons flags the latter allocation with a contemporary constraint failure ("The entities M143535269 and Reading Abbey should be contemporary to be linked through location of creation, but the latest end value of Reading Abbey is 1538Gregorian and the earliest start value of M143535269 is 18 February 2019"). [Just for the record, I think that the Wikidata entry's latest end value is arguably wrong, as the ruins still exist, but that is a whole different issue.]

In a previous comment on this subject, I was advised to paste the latitude and longitude into https://edwardbetts.com/geocode/. I've done this and can confirm that this tool also thinks that Hanover House is part of Reading Abbey. But that doesn't make Geographbot correct, it makes both of them equally wrong.

As I don't have any insight into how Geographbot makes these decisions, I'm at a loss as to what to do. Can somebody explain and come up with a plan as to what we need to do to (a) stop this happening and (b) automatically fix all the errors that have already been made? I think the scale of existing errors is too massive to fix manually; if the worst comes to the worst we may have to think of deleting all GeographBot uploaded images, or all that have not been subsequently edited by a human editor, and start again. -- Chris j wood (talk) 16:19, 5 January 2024 (UTC)[reply]

@Chris j wood: GeographBot consults precisely that reverse-geocoding tool, so it's not surprising that they agree. The geocoder's code is published at https://github.com/EdwardBetts/uk-geocode/tree/master/geocode. It has several different ways of finding a suitable Wikidata item, and usually when it finds an utterly inappropriate category the problem lies in its last fallback, where it looks for anything nearby that's an instance of (P31) human settlement (Q486972) without an end time (P582). That query is in geosearch.sparql and you can run it in WDQS if you replace {{lat}} and {{lon}} appropriately.
But if I do that with (51.453426851916, -0.95871423577414), which the geocoder says maps to Reading Abbey (Q2482838), the query quite correctly doesn't turn up Reading Abbey (Q2482838) at all. So I'm a bit at a loss as to why the geocoder gives the result it does. Maybe Edward can help (and maybe the geocoder could show its working).
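
For readers who want to try this kind of lookup themselves, the sketch below fills `{{lat}}` and `{{lon}}` placeholders in a query template and returns text that can be pasted into WDQS. The query text here is NOT a copy of geosearch.sparql. It is an assumed approximation of the fallback described above (nearby items that are instances of human settlement, with no end time), and the 1 km radius and the use of subclass traversal are likewise assumptions.

```python
# ASSUMPTION: an illustrative stand-in for the geocoder's query, written
# for the Wikidata Query Service; the real geosearch.sparql may differ.
QUERY_TEMPLATE = """
SELECT ?item ?itemLabel ?distance WHERE {
  SERVICE wikibase:around {
    ?item wdt:P625 ?location .
    bd:serviceParam wikibase:center "Point({{lon}} {{lat}})"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "1" .
    bd:serviceParam wikibase:distance ?distance .
  }
  ?item wdt:P31/wdt:P279* wd:Q486972 .
  FILTER NOT EXISTS { ?item wdt:P582 ?end_time . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?distance
"""


def fill_query(lat, lon, template=QUERY_TEMPLATE):
    """Substitute the {{lat}} and {{lon}} placeholders with coordinates."""
    return template.replace("{{lat}}", str(lat)).replace("{{lon}}", str(lon))


if __name__ == "__main__":
    # The coordinates quoted in the comment above.
    print(fill_query(51.453426851916, -0.95871423577414))
```

Note that WKT points are written longitude first, which is why `{{lon}}` precedes `{{lat}}` inside `Point(...)`.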
The current GeographBot has already uploaded 4 million pictures and it has fewer than 2 million to go, so I think fixing previous uploads is more important than trying to improve things for future ones. There's an obvious broad structure of bot that could help: find files uploaded by GeographBot whose categories are untouched by human hand; make them better. I might try to code the first part of that this weekend, then as we find ways to implement "make them better", they could be added to the bot's abilities. --bjh21 (talk) 18:26, 5 January 2024 (UTC)[reply]
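
The "first part" mentioned above (finding uploads whose categories are untouched) might be sketched like this. The revision records are a hypothetical, simplified structure, oldest first, each with a `user` and a list of `categories`. How those would actually be fetched from Commons is left out, and treating "untouched" as "identical category set in every later revision" is just one possible reading.

```python
def categories_untouched(revisions, bot_name="GeographBot"):
    """Return True if the file was uploaded by the bot and every later
    revision carries exactly the categories the bot originally set.

    `revisions` is a hypothetical list of dicts, oldest first, e.g.
    {"user": "GeographBot", "categories": ["Reading Abbey"]}.
    """
    if not revisions or revisions[0]["user"] != bot_name:
        return False
    original = frozenset(revisions[0]["categories"])
    return all(frozenset(rev["categories"]) == original for rev in revisions[1:])


if __name__ == "__main__":
    # Invented example history for illustration only.
    history = [
        {"user": "GeographBot", "categories": ["Reading Abbey"]},
        {"user": "ExampleUser", "categories": ["Reading Abbey"]},
    ]
    print(categories_untouched(history))
```

A stricter variant could also require that no human has edited the file at all, which would match the "not subsequently edited by a human editor" wording used earlier in this section.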
Here's the query: https://w.wiki/8kGi edward (talk) 20:38, 5 January 2024 (UTC)[reply]
@Chris j wood and Bjh21: Thanks for the bug report.
I've changed the code to ignore abbeys. This example now returns Category:East Reading.
See https://edwardbetts.com/geocode/detail?lat=51.453&lon=-0.958
The reason this was happening is because abbey (Q160742) is a subclass of administrative division (Q56061).
The code has moved, it is now self-hosted here: https://git.4angle.com/edward/geocode
I'll add a note to github with the location change.
This is the fix: https://git.4angle.com/edward/geocode/commit/001468d267bbb2a6f52daea063171eca8ae6964f
edward (talk) 20:51, 5 January 2024 (UTC)[reply]
@Edward: It would be even more useful to update the note at the bottom of https://edwardbetts.com/geocode/ to reflect the new location. --bjh21 (talk) 01:34, 6 January 2024 (UTC)[reply]
@Bjh21: Good catch! I've fixed this. Thanks edward (talk) 11:43, 6 January 2024 (UTC)[reply]

@Edward and Bjh21: - Thanks for your swift responses and fixes -- Chris j wood (talk) 19:46, 6 January 2024 (UTC)[reply]

Uploading from Geograph sister sites

I hope all is well. First off, great work with maintaining the bot and keeping its batch uploading going. I was wondering if there are any plans to upload the images on the Geograph sister sites, Geograph Channel Islands, Geograph Deutschland, and Geograph Ireland, after GeographBot finishes with the backlog from Geograph Britain and Ireland? Thanks again for the great work and have a great day! -- DiscoA340 (talk) 22:35, 10 January 2024 (UTC)[reply]