Commons:British Library/Mechanical Curator collection/georeferencing campaign/next steps

From Wikimedia Commons, the free media repository

In late 2014 over 50,000 maps and plans were identified in the million images uploaded to Flickr by the British Library. Since then, more than half of those images have been georeferenced using the BL Georeferencer.

A key next step is to start getting images onto Wikimedia Commons. This should also give a much stronger basis for reaching out to the Wikipedia and Wikimedia Commons communities, with images that can be used in projects, and may also spur further georeferencing.

Principal challenges that need to be addressed to enable this include: the form of the information page that should accompany each map; the upload mechanics; identification of the subject of the maps; matching to (or creating) appropriate Commons content categories; and deciding how and what to track through further "maintenance" categories.

A good initial set to work on may be the approximately 6000 maps currently placed as being of or within the UK.

Overview / State of play

As of early December 2017:

  • About 30,000 maps and plans have now been georeferenced.
    • This progress page gives counts of maps still to do, by book subject-area; cf also the corresponding page counting maps already done.
      (The count above also includes about 3000 images in addition to those shown on the progress page, that were georeferenced as an initial set).
  • Once maps have been georeferenced, their coordinates allow a zoom level (ie the extent of the map) to be identified, and the map to be approximately placed in a hierarchy by administrative area.
  • The information can be re-arranged into a tree, to break out the maps by location.
  • Further identification can be added by comparing bounding boxes, and (at a building level) by looking up on Wikidata.
  • All this allows maps to be grouped together, with indicative categorisation.
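
The re-arrangement into a tree mentioned above can be sketched as follows. This is only a minimal illustration, assuming each map has already been given a location path by the identification step; the map IDs, the paths and the reserved "_maps" key are made up for the example and are not taken from the project's scripts.

```python
def build_tree(placements):
    """Group map IDs into a nested dict keyed by successive levels of a location path.

    `placements` is an iterable of (map_id, path) pairs, where `path` is a tuple
    such as ("United Kingdom", "England", "Norfolk", "Norwich").
    Maps attached directly to a node are stored under the reserved key "_maps".
    """
    tree = {}
    for map_id, path in placements:
        node = tree
        for level in path:
            node = node.setdefault(level, {})
        node.setdefault("_maps", []).append(map_id)
    return tree


def count_maps(node):
    """Total number of maps at or below a node of the tree."""
    total = len(node.get("_maps", []))
    for key, child in node.items():
        if key != "_maps":
            total += count_maps(child)
    return total


# Hypothetical example data (not real map IDs).
placements = [
    ("map-001", ("United Kingdom", "England", "Norfolk", "Norwich")),
    ("map-002", ("United Kingdom", "England", "Norfolk")),
    ("map-003", ("United Kingdom", "Scotland")),
]
tree = build_tree(placements)
```

A map that could only be placed at county level (like the hypothetical "map-002") simply stays on the county node, which mirrors the idea of leaving uncertain maps at the parent level.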

Still to do

  • Many maps lack names on the BL Georeferencer, which should be added before upload. The links to these are highlighted on the batch pages.
  • An upload script still needs to be written.
  • Finalisation is needed on some aspects of the description template. See Template talk:Map for discussion, including some proposed new fields, with reference to Category:MC map migration test set.
  • Post-upload, systematic work will be needed to supply/correct/refine missing or incorrect categorisations and descriptions.
  • Roll out to the rest of the world.
    • Much more to georeference. Currently about 50% complete, compared to about 80% for the UK.
    • Input needed re appropriate hierarchies to refine analyses -- eg appropriate (historical?) geographical units for categories, appropriate towns and cities to split out.

Useful general advice

Zoom levels

To browse the kinds of things that are depicted in the maps, it's useful to step through them by OSM zoom level. The following links are for UK maps. More specific maps, or maps for other parts of the world, can be obtained by modifying the URL appropriately. (See also examples here.)

    • 5 : British Isles
    • 6 : Great Britain / England / Scotland
    • 7 : Scotland / Wales / N.Ireland / regions (eg: Yorkshire)
    • 8 : Larger counties (eg Cornwall, Devon); some sectional maps
    • 9 : Counties / parts of counties (cf 9 West Midlands)
    • 10 : Districts and wider areas; often railway route maps; Greater London / Inner London
    • 11 : ...
    • 12 : ... ; Central London
    • 13 : Cities: eg Leicester, Norwich, Bath ;
    • 14 : Towns and smaller cities: eg Margate, Chichester, St Albans
    • 15 : Further towns and city centres
    • 16 : Larger castles, railway stations, city districts; Tower of London
    • 17 : Abbeys, cathedrals, more castles
    • 18 : More cathedrals and castles
    • 19 : Larger churches
    • 20 : Churches
    • 21 : Towers, small buildings
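
For orientation, the relationship between a map's ground extent and an OSM-style zoom level can be sketched as below. This is an assumption-laden illustration, not the calculation actually used for the levels listed above: it assumes standard Web Mercator tiles of 256 px, and the viewport width (`viewport_px`) and the function name are hypothetical choices.

```python
import math

# Equatorial circumference used by Web Mercator, in metres.
EARTH_CIRCUMFERENCE_M = 40_075_016.686


def zoom_for_extent(extent_m, lat_deg, viewport_px=512, tile_px=256):
    """Largest integer zoom at which `extent_m` metres of ground, measured
    east-west at latitude `lat_deg`, still fits across `viewport_px` pixels.

    `viewport_px` is an assumed display width, not a value from this page.
    """
    if extent_m <= 0:
        raise ValueError("extent_m must be positive")
    # Ground resolution at zoom 0 (metres per pixel) at this latitude.
    res0 = EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat_deg)) / tile_px
    # Resolution halves with each zoom level, so we need
    # res0 / 2**z * viewport_px >= extent_m, i.e. z <= log2(res0 * viewport_px / extent_m).
    zoom = math.floor(math.log2(res0 * viewport_px / extent_m))
    return max(0, zoom)
```

Halving the extent raises the result by one level, and doubling the assumed viewport does the same. The cos(latitude) term lowers the result at higher latitudes compared with the same extent at the equator, so the absolute numbers depend on the convention chosen.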


Identification

Work-in-progress page: /uk analysis

This page shows the combination of a number of approaches, for maps identified as relating to the UK:

  1. The output from Nominatim is used to place the maps in a hierarchy by location and zoom level, with luck identifying country, region, county, and town/city where appropriate.
    The 5-point Nominatim voting approach can make mistakes -- eg under-identifying features that are thin and angled (eg Cornwall, Northamptonshire); or occasionally over-identifying based on the content nearest the middle of the map (eg British Isles maps identified as Scotland). It's not great at identifying the smallest-level features (where it may prefer nearby streets). Naming of levels can sometimes be inconsistent, as OSM labels may have been modified over the (more than) two years the geoparser has been running. The system only identifies current regions/counties etc, not historic ones. Also, if a map seems to straddle the border between two features, then it has been left at the parent level in the hierarchy. But, overall, the system seems to work reasonably well, giving a good classification that can be used as a backbone for further effort; and mis- or under-classifications are typically still "close" in hierarchy to the optimum, so manual adjustment (post-upload) of what it provides may usually be limited to some slight refinement.
  2. Matching based on bounding boxes: eg "British Isles", "England and Wales", maps of whole counties, islands.
    Useful to identify maps that represent the whole of some feature, rather than just an area contained within it. The script takes the IDs of example maps for a list of features, and identifies maps with similar bounding boxes. Occasionally produces false positives, eg river basins covering a similar area to a county, but mostly results seem pretty reliable. Worth checking for false negatives, by looking at a few maps with a similar Nominatim + zoom classification, and seeing whether they should have been matched -- if so, then they can be added to the list of examples.
  3. Matching based on Wikidata coordinates.
    The Wikidata matching is fairly simple-minded. The nearest Wikidata item with a Commons category within 200m is selected, or failing that the nearest item. This appears to work reasonably well at scales 18 and above (maps covering up to about 250m); but is much more variable at scale 17 (maps covering ~ 500m) -- eg "Phoenix Computer" for the Cambridge New Museums Site; or a church for the market place. For larger features (scale 16 and below), this simple approach is not so good, because it tends to find features that are too small-scale. Might be worth a special run for battlefields (though limited by the extent that these are identified as such in Wikidata, and have coordinates). Identification is at the mercy of what is in Wikidata (eg listed buildings) or isn't (lots of stuff), and there may be multiple hits. Even at the finest scales identification will require checking. But it should be good enough for the provisional selection of thematic batches to be uploaded together; with luck, mis-identifications will stick out clearly after upload.
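
The selection rule in step 3 (nearest item with a Commons category within 200m, otherwise the nearest item) might look roughly like this. It is a sketch only: the candidate record layout, the function names and the placeholder item IDs are assumptions, and it works on candidates that have already been fetched rather than querying Wikidata itself.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of mean Earth radius."""
    radius_m = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def pick_item(map_lat, map_lon, candidates, radius_m=200.0):
    """Choose a candidate item for a map centred at (map_lat, map_lon).

    `candidates` is a list of dicts with keys "qid", "lat", "lon" and
    "commons_category" (a string, or None if the item has no category).
    Prefers the nearest item with a Commons category within `radius_m`;
    failing that, returns the nearest item of any kind (or None if empty).
    """
    if not candidates:
        return None

    def dist(c):
        return haversine_m(map_lat, map_lon, c["lat"], c["lon"])

    with_cat = [c for c in candidates if c.get("commons_category") and dist(c) <= radius_m]
    pool = with_cat or candidates
    return min(pool, key=dist)


# Hypothetical candidates; the IDs are placeholders, not real Wikidata items.
candidates = [
    {"qid": "Q_near_no_cat", "lat": 52.2001, "lon": 0.12, "commons_category": None},
    {"qid": "Q_with_cat", "lat": 52.2010, "lon": 0.12, "commons_category": "Example category"},
]
chosen = pick_item(52.2, 0.12, candidates)
```

Here the item with a category wins even though another item is closer, which is the behaviour that can produce surprising matches at scale 17.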

Identification by bounding-box or Wikidata may be strong enough to propose a positive description for the subject of the map. But in many cases, particularly at transitional scales (smaller than a county, but larger than a cathedral), identification will be weak, and will require manual improvement. Such subjects might include maps of battlefields, river-basins or river-courses, transport links, parishes etc. These will mostly require manual description; although that may be straightforward, once we have a title for the map.
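
One way to compare bounding boxes is sketched below. The page does not say which similarity measure the script uses, so intersection-over-union (IoU) and the 0.8 threshold are assumptions made for illustration, as are the function names and the example boxes.

```python
def bbox_iou(a, b):
    """Intersection-over-union of two boxes given as (west, south, east, north)
    in degrees. Ignores boxes that cross the antimeridian."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def best_feature(map_bbox, examples, threshold=0.8):
    """Return the feature whose example boxes best match `map_bbox`,
    or None if no example reaches `threshold`.

    `examples` maps a feature name to a list of example bounding boxes,
    eg taken from maps already known to show the whole feature.
    """
    best_name, best_score = None, threshold
    for name, boxes in examples.items():
        for box in boxes:
            score = bbox_iou(map_bbox, box)
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


# Hypothetical example boxes, in arbitrary units.
examples = {"feature A": [(0, 0, 2, 2.1)], "feature B": [(5, 5, 6, 6)]}
match = best_feature((0, 0, 2, 2), examples)
```

A measure like this cannot tell a county from a river basin of similar extent, which is consistent with the false positives noted above. Adding further example boxes to a feature's list is the route for reducing false negatives.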

Upload

It is proposed to process the initial set via a succession of batched uploads of maps grouped by a theme (eg UK county maps, British Isles maps, plans of cathedrals...), using the UK identification above and starting with the best-identifiable subjects.
This will allow batch-by-batch quality-control of the various maintenance issues below (in particular, the addition of map titles, which might be best done before upload); and will also produce a useful structure for browsing what has been uploaded.

The original dream was to eventually move to uploading in near-real time, following georeferencing. This might be possible, for very well identified maps; though the map would definitely need a title.

TO DO:
Need to identify which maps have been uploaded already

Upload mechanism

  • GLAMwiki Toolset (GWT)? But GWT produces quite a lot of mess (eg HTML escapes for special characters rather than UTF-8 &c), that does need clean-up post-upload (or, at least, it used to)
  • Or some other alternative? eg bespoke script?
Best way forward might be to develop from one of the scripts at Commons:Command-line upload -- eg perhaps base on Nichalp's script, with elements of flickrrippr ?
  • Any automated cron-job uploading of new content would presumably need a bot mechanism. But (if the automated identification/fixing/description/categorisation that would be required is possible at all) that would be considerably into the future.
  • Uploads should also be noted on Flickr -- a comment might be the most flexible mechanism.
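
If a bespoke script is used, the HTML-escape problem mentioned above could be avoided by decoding metadata before it is written to a description page. The following is a small sketch using Python's standard `html.unescape`; the function name and the sample string are invented for illustration.

```python
import html


def clean_metadata(text):
    """Decode HTML character references (eg &amp;, &#39;, &acirc;) into plain
    Unicode and collapse runs of whitespace into single spaces."""
    decoded = html.unescape(text)
    return " ".join(decoded.split())


# Made-up sample string, not taken from the collection.
sample = clean_metadata("Ch&acirc;teau  d&#39;If &amp; environs")
```

This performs a single decoding pass, so text that has been escaped twice would need the function applied again.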

Filenames

  • suggest eg of the form: File:Author (1842) p1.106 - title needed.jpg (may also need to distinguish multiple files from same page).
  • DEFAULTSORT to organise by publication date, then book, then volume, then page -- as the general ingestion script for Mech Curator images has been doing
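
The naming pattern and sort key suggested above could be generated along these lines. This is a sketch under stated assumptions: "p1.106" is read here as volume 1, page 106; the `part` suffix for several files from one page, the zero-padding widths and the `book_id` argument are all hypothetical choices rather than the convention used by the existing ingestion script.

```python
import re

# Characters that MediaWiki page titles cannot contain.
ILLEGAL_TITLE_CHARS = re.compile(r"[#<>\[\]|{}]")


def make_filename(author, year, volume, page, title=None, part=None, ext="jpg"):
    """Build a name of the form 'File:Author (1842) p1.106 - title needed.jpg'.

    `part` (eg "b") is an assumed way to tell apart multiple files from one page.
    """
    label = title if title else "title needed"
    page_ref = f"p{volume}.{page}" + (part or "")
    name = f"{author} ({year}) {page_ref} - {label}"
    name = ILLEGAL_TITLE_CHARS.sub("-", name)
    name = re.sub(r"\s+", " ", name).strip()
    return f"File:{name}.{ext}"


def defaultsort(year, book_id, volume, page):
    """Sort key ordering by publication year, then book, then volume, then page.

    Numbers are zero-padded so that string order agrees with numeric order.
    """
    key = f"{year:04d} {book_id} {volume:02d} {page:04d}"
    return "{{DEFAULTSORT:" + key + "}}"


example_name = make_filename("Author", 1842, 1, 106)
example_sort = defaultsort(1842, "book42", 1, 106)
```

Zero-padding matters because DEFAULTSORT keys are compared as strings, so an unpadded page 9 would otherwise sort after page 106.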

Description pages

TO DO:
Migrate already-uploaded maps to the {{Map}} template, to confirm how it would look, and whether any fine-tuning is needed.

Content categories

Commons categories can be unpredictable and tricky, as their refinement has been organic and haphazard, subdivided (or not) to whatever has given a reasonably human friendly size of about 5 to 100 images. Particular hierarchies may also as yet be quite under-developed (eg the "Old maps hierarchy" for a particular area).

Probably need to impose a structure, or at the very least undertake extensive clean-up.

Category for location

For England a reasonable hierarchy is now in place, one that the hierarchy produced by the identification process can match reasonably well.

Observations:

  • Region categories. The region categories that have been created in the above hierarchy are for present-day regions, to accommodate the information obtained from Nominatim. (Though they are intended to be construed broadly, in a somewhat fuzzy way, rather than strictly.) It could be objected that these are anachronistic, given the dates of most of the maps.
  • County categories. In most cases in England, these can do double duty for both historical and current counties; though some special care may be needed for relationships around e.g. the West Midlands, Merseyside, Greater Manchester -- where appropriate, towns or images should be double-categorised, both for the modern county and the ancient one. For Cumbria and Sussex, images should perhaps be diffused to the smaller counties (Cumberland/Westmorland and East/West Sussex), though they may initially be auto-categorised for the modern one.
  • Categories for whole entities vs accumulation categories for maps within those entities. Categories at the region, county, city/town, and buildings-by-location level are expected to be accumulation categories, gathering all maps located within those areas for which no more specific category exists. In contrast, categories such as the "British Isles", and county maps, are intended for maps of the whole entity (cf: Category:Old maps of whole Wales (alone)). Something similar should probably be created for Scotland, eg Category:Old maps of Scotland (entire), and perhaps England & Wales (entire).
QUESTION: Does a 'whole entity' category also make sense for maps of towns and cities -- a separate category for maps of the whole town, vs for parts of it?
(Something which coordinates alone may not determine: what was the whole city in 1600 may now be but a small precinct). An alternative may be to break off most smaller maps into local sub-categories, eg old maps by present-day London Borough. But (at least for London) many maps may straddle Borough boundaries, yet still cover a smaller area than the whole city. Perhaps the best thing is just to wait, and see if any categories become over-full -- this is easy enough to see from the small number of parent categories.
QUESTION: Some of these 'accumulation' categories could also/alternatively be split by zoom level. Does this make sense?
Again, maybe best to leave maps that this applies to to a later batch, ie "remaining old maps of region X" or "... of county Y", and look and see what we have then. An initial rough categorisation can be refined later (though 'zoom level' may not be so accessible, post-upload).
There is also Category:Plans of castles in England, part of Category:Plans of castles in the United Kingdom, part of Category:Maps of archaeological sites in the United Kingdom
Also we have Category:Historic railway station layout maps of the United Kingdom. There's also a category Category:Garden plans, part of Category:Maps of gardens and parks, Category:Landscape plans, Category:Architectural drawings by object (including theatres, cemeteries, parks, urban squares). Also Category:Plans, which includes Category:site plans.
I think it probably makes sense to stick to plan A, of a single category for old maps and plans for buildings and structures; and then to classify further by adding specific categories post-upload. As usual, worth concentrating images of this type into their own batch, and then considering them together.

Category for the subject itself

eg for cathedral, castle, house, battlefield etc (if we have such a category ?)

  • Primary category, eg Category:Westminster Abbey. If we have a map of it, and the subject has an entry on Wikidata, then we certainly ought to have a primary category for the entity -- one should be created, if it does not already exist.
  • 'Maps/Plans of ...' category, eg Category:Plans of Westminster Abbey. There may already be a category for maps or plans of the entity. Wikidata property P3722 exists to record this, but it is not populated. Category:Floor plans of buildings in England and others also exist, as noted above, but these too may not be fully populated.
These sources could be mined systematically. On the other hand, if we're uploading batch by batch thematically, it may make as much sense just to place all items in their primary category, then go through those cats quickly by hand, to see if a sub-cat exists and/or how full the primary category already is, and so whether or not it would be worth creating a subcategory. For certain classes of entity that we are likely to have many photographs of -- eg cathedrals -- it may be appropriate to make "plans of X" categories as a matter of course. For smaller churches perhaps not. Castles (or castle-sites) might need to be reviewed case-by-case.
  • 'Old maps/plans of ...', eg Category:Old plans of Westminster Abbey. In a few cases there may even be a further sub-category "old maps and plans of entity X". Creation of this should probably be driven by how full the category "Old maps and plans of buildings and structures in town Y / county Z" becomes, and whether or how it makes sense to extract sub-categories out of it.
  • 'History of ...', eg Category:History of Westminster Abbey. A category "History of X" may exist, even if map/plan categories do not. Unnecessary to categorise an old map or plan in this category, if it is in the main category for X, or in a maps/plans category. But it might make sense to categorise an "old maps/plans" category there.

Category for maps of a particular type of thing, within a particular location

Even if an entity does not have a map category of its own, there may well be categories for maps/plans of entities of that type over a particular area, eg England, United Kingdom etc. These might be found in subcategories of Category:Maps of the United Kingdom, Category:Maps of the United Kingdom by theme, Category:Maps of England, Category:Maps by theme, etc. In some cases there may already be subcategories in Category:Old maps of England.

It may be a good idea to add these categories immediately post-upload if possible, before moving on to the next batch. It's hard to reassemble such groupings from geographically dispersed map categories. But on the other hand, the categories for each batch upload should group similar-ish maps, even if they are not categorised immediately, so these should make it not too difficult to go back and add further thematic categories at a future point.

Particular examples:

ISSUE: Lumping and splitting. It may not always be clear whether it is better to split 'old maps' out of such categories, or keep all maps on the theme together in a single category.
For example, for Category:Maps of the Great Western Railway, does it make more sense to present all the maps together, or to split out the 'old maps' ? Sometimes it is easier to see everything in the one category, rather than having to jump between two to find the map of interest. On the other hand, sometimes old maps are qualitatively different -- older (or sometimes contemporaneous) scholarship, different presentation styles &c, so it can make sense to make the split.
It is probably advisable to be slow to make such splits -- they can always be achieved later relatively quickly, eg with Cat-a-lot.

Source categories

Up until now, categories have not been created, unless there were several images to go with them.
But it may make sense to change this, particularly now that Wikidata has become so gung-ho about creating items for books, which could be linked to the categories. Creating a category, and then a Wikidata item, with e.g. an OCLC number increases the chance the images will get matched with other sources. It also is a good thing to be able to easily see what other images are available from the same source.
An objection might be that it would flood the books category above, making it harder to find the gold-standard uploads prepared by Metilsteiner. This could be averted, by creating a 'highlights' category Category:Showcase books from the British Library Mechanical Curator collection.
Book category links should be added to the Synoptic Index pages. In the same way that we distinguish books with complete coverage with a 100% marker, it may be worth distinguishing books with rather few images, eg < 10%. In both cases, the percentages should ignore images that would not be uploaded -- eg fillers, decorative capitals, etc.
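
The coverage figure described above could be computed along these lines. This is a sketch only: the record layout, the kind labels in `EXCLUDED_KINDS` and the function names are assumptions, since the page does not say how image types are recorded.

```python
# Assumed labels for image kinds that would never be uploaded.
EXCLUDED_KINDS = {"filler", "decorative capital"}


def coverage_percent(images):
    """Percentage of a book's uploadable images that have been uploaded.

    `images` is a list of dicts with keys "kind" and "uploaded" (bool).
    Images whose kind is in EXCLUDED_KINDS are ignored entirely.
    Returns None if the book has no uploadable images.
    """
    eligible = [img for img in images if img["kind"] not in EXCLUDED_KINDS]
    if not eligible:
        return None
    uploaded = sum(1 for img in eligible if img["uploaded"])
    return 100.0 * uploaded / len(eligible)


def coverage_marker(percent, low_threshold=10.0):
    """Marker text for an index page: '100%' for complete coverage,
    '<10%' (by default) for sparse coverage, otherwise None."""
    if percent is None:
        return None
    if percent >= 100.0:
        return "100%"
    if percent < low_threshold:
        return f"<{low_threshold:g}%"
    return None
```

Excluding fillers and decorative capitals before dividing means a book is not marked as sparse merely because most of its images are ornaments.
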
Would be worth investigating whether there is any easy way to pick up eg OCLC numbers for the books, or VIAFs for the authors, that could aid matching.
Probably the book categories should be migrated, to use the {{Book}} template.
  • Categories for authors, cartographers, individual artists, etc.
In general this information is only easily available at a per-book level. If we are going to add categories for every book, Commons procedure would be to put the author-categorisation on the book category, not the images.

Refinement, tracking, and maintenance categories

The maps on Flickr may need quite a lot of refinement:

Pre-upload

  • title refinement -- probably 70% of maps do not have titles
    best to do this before upload ? -- then (i) BL gets data too; (ii) avoids messiness of renaming here

Batch-by-batch

  • rotation correction -- with luck most maps should be correctly rotated, but some may not be
  • category refinement
    a.k.a. spotting identification fails -- should be quite quick to push geographic cats by 1 up/down hierarchy if needed.
    would be good to spot as many maps-by-theme cats as possible at this stage
  • date depicted
    in general this will have to be spotted & put in by hand
  • description refinement
    how much of a description is needed, if we have title + depicts? But depicts will need to be updated, for mis-grouped maps
    • including matching to a Wikidata item -- likely to be all or none for a particular batch.
  • language of map ?

QUESTION: Are maintenance cats useful, for images to be removed from? Or can refinement progress be tracked well enough from the batch's main cat, and/or page of subjects?

Based on the subjects in each batch

  • consideration for Wikidata property d:Property:P1621 ("detail map")
  • inclusion in Wikipedia articles
    perhaps track via batch overview page, listing subjects?

Longer term

  • cropping
  • colour rebalancing
? leave people to do these as they wish, in the normal way, without pushing
? no particular need for (per-batch?) maintenance cat -- should be visible enough from thumbnails in batch cat?

etc?

  • projection and orientation
could be possible to determine automatically from georeferencing data.
cf https://web.natur.cuni.cz/~bayertom/ and https://web.natur.cuni.cz/~bayertom/images/publications/proj_detection1.pdf

Other cats?

  • upload date ??
probably unnecessary, if we have the batch categories

TO DO

current

  • Use UK+IE counties as a pilot set
  • Try to get VIAFs for authors, or match to existing categories -- see /books and authors
    • Open Library lookup Done for pilot: a few VIAFs, some more with b-dates/d-dates
    • Look up manifest from Universal Viewer, then Explore Catalogue --> more b-dates/d-dates ?
      • Look up in VIAF, try to match
    • Identify existing categories that are for books, try to match to Wikidata.
  • Projections --> estimation of projection, scale, orientation
    • Do maps in template test set by hand
    • Compile and install C++ code
    • Try to understand rotation estimates
  • Rename Kreiger maps
    • Quarry query to pick up links to redirects
      -- not going well: even [1] is slow.
      Instead, add a temporary cat to files to be renamed, then look at usage with PetScan --> easier

more

  • Create staging pages for initial upload batches -- underway: see Category:MC upload prep pages
    • GB county maps; GB cathedrals; GB castles; GB churches; GB, England, Wales, Scotland; GB towns+cities; remaining buildings in GB towns/cities; in Eng counties...
    • Still to add in: Northern Ireland, London -- allocation to London boroughs could be improved -- Ireland; UK dependencies
    • Add sectioning -- toc might be nice
    • Investigate missing book authors -- investigate remaining books missing authors, also where author+year has more than one hit; ? also whether other abbrevs for books are already in use
    • Missing book links -- identify the most common books for this; & investigate. Probably either could not map from Flickr vol number to vol string from BL viewer; or down to ambiguity: single sysnum + volume matching to multiple arks.
    • Add indicative categories.
  • See how well Wikidata can do for identification. -- 799 matches for UK; though some may not be great... getting better
    • Towns? Battlefields? -- use WD to look for towns / battles? & compare to titles -- or look for 'battle' etc in title.
  • Analysis tweaks
    • Rejoin (some) modern counties / unitary authorities into traditional areas
    • Build category structure for Scotland -- scoped out, anyway
  • Create analysis pages for Europe & other continents. -- see Category:MC_map_identification-in-progress
  • Create analysis page for paired maps
  • Migrate description pages for a test set of sample existing maps to use {{Map}} template see Category:MC map migration test set
    • Mass-migrate description pages for existing maps (see below)
    • Make BL Georeferencer link more prominent ? Additional georef option for external ? raised, plus more, at Template talk:Map Done. A few questions still open.
  • Recalculate zoom levels & see how many would need to be changed. -- 4357 would be changed: almost all reduced by one. Possibly overdoes maps with high latitudes -- should maybe scale back. Reflect new values to Flickr
  • Create, test & get approval for batch uploader. Include rotation. -- Permission requests opened

Existing maps

more

  • Add georeferencing link + bounding box to Goad maps. Narrow their descriptions. -- won't make nice maps of bounding boxes though
  • Consider also migrating OS drawings to {{Map}} template. Add bounding boxes.
  • Migrate book categories to use {{Book}} template ?
    • Check completeness of books with book cat
    • Create a showcase books category
    • Create book cats for books with many images. (Try to estimate how many missing?)
    • Update synoptic index pages

Georeferencer anomalies

  • Georeferencing progress anomalies -- see anomalies page
    • Maps tagged "no_georef" need to be taken into account
    • Retag images with Klokan 404s from "to_georef" to "klokan_404"
    • Compare "has_georef" list from tagging with list from Fusion Table
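
The last comparison above amounts to a set difference between two lists of image IDs. A minimal sketch, assuming both lists have already been exported as plain sequences of IDs (the function name, the result keys and the sample IDs are illustrative):

```python
def compare_id_lists(tagged_ids, table_ids):
    """Compare IDs tagged "has_georef" with IDs present in the exported table.

    Returns the IDs found in only one source (sorted) plus the size of the overlap.
    """
    tagged, table = set(tagged_ids), set(table_ids)
    return {
        "only_tagged": sorted(tagged - table),
        "only_in_table": sorted(table - tagged),
        "in_both": len(tagged & table),
    }


# Made-up IDs for illustration.
report = compare_id_lists(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```

IDs that appear in only one source are the anomalies worth listing on the anomalies page.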