User:Fæ/Project list/UK legislation

From Wikimedia Commons, the free media repository

Shortcut: COM:UK legislation project

Coronavirus health regulations for travelling to Wales, 15 Oct 2020
Proxy voting in Scotland, 14 Aug 2020

UK legislation project

Introduction

This batch upload project began as a request for the mass upload of freely available UK Government documents published at legislation.gov.uk. The initial data feed used is https://www.legislation.gov.uk/2020/data.feed, with 2020 then swapped out for earlier years.
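As a rough illustration of the year swap described above, the feed address for each year can be generated from a template. The function name and the year range are illustrative assumptions, not taken from the upload script.

```python
# Template based on the feed URL quoted above; only the year changes.
FEED_TEMPLATE = "https://www.legislation.gov.uk/{year}/data.feed"


def feed_urls(first_year, last_year):
    """Return one feed URL per year, newest year first."""
    return [FEED_TEMPLATE.format(year=year)
            for year in range(last_year, first_year - 1, -1)]


urls = feed_urls(2018, 2020)
```

Working from the newest year backwards mirrors the order in which the project started with 2020 and then moved to earlier years.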

Though in theory pure text documents are out of scope, these have specific and highly valuable use as formal 'snapshots' for development of various language Wikipedias about UK legislation, such as for BREXIT (e.g. Withdrawal Agreement) or COVID-19 (e.g. COVID-19 Very High Alert regulations for England) and are highly unlikely to ever be separately imported to Wikisource.

A list of files in this batch upload project can be shown using incategory:"Images uploaded by Fæ" insource:legislation.gov.uk filemime:pdf. As of 28 Oct 2020 there were 42,860 documents and as of 29 Jun 2021 there were 326,150, so estimates were originally way under!

The data feed includes legal records going back to the 13th Century (year 1267), however these are not scans of original documents but "legally meaningful" records for lawyers and legal historians, so for archaic documents these have been retyped and scanned or printed to file as 'modern' PDFs. Naturally apart from significant legislative events, as the records go back in time the number of documents generally decreases, so while for 2005 there are 5,877 documents available, for 1905 there are just 8 documents.

There are many files that are trivially small in filesize, such as this SSI speed restriction order at 20KB and 1 page long, though much rarer large documents are included, like this UKPGA 1907 sheriff courts act at 45MB and 446 pages long. These uploads are therefore not expected to be impeded by the large PDF upload issue.

Brief asides
  • Total project size would be 111,748 if every document were uploaded and every item had a PDF. This may not be desirable if minor amendments are not useful, though it could probably be argued that it is better to have everything and then make these housekeeping/curation choices.
  • Though there are summaries and titles in non-English languages, like Welsh, these are not separately marked and so cannot be placed in non-en text blocks.
  • Revoked legislation appears indexed by the year of original enactment. This may lead to confusion if trying to use the legislation.gov.uk feed to discover, or freshly upload, legislation recently withdrawn.

Example searches

As different types of legislation are sub-categorized by region or by varying legislature over time, finding legislation without knowing the year, or without being limited to a narrow type of legal instrument, can be time consuming. Using generic searches by topic to add extra categorization will rely on interested volunteers adding value with extra categories. Some example searches:

Related projects

Copyright

No separate or specific licensing is included in the data feeds. OGL version 3 is the current default for all legislative publications on the official website without specified exceptions, so {{OGL3}} is applied even though there are many documents that in fact pre-date OGL itself.

The legislation.gov.uk main web pages include the copyright statement: All content is available under the Open Government Licence v3.0 except where otherwise stated. This site additionally contains content derived from EUR-Lex, reused under the terms of the Commission Decision 2011/833/EU on the reuse of documents from the EU institutions. For more information see the EUR-Lex public statement on re-use.

Consequently, documents falling under the EUR-Lex terms may need to be swapped from OGL to a more accurate template statement about their copyright. This has been left as a housekeeping decision rather than delaying the upload, as there is no doubt that reuse is allowed and encouraged: the EUR-Lex terms are stated to be identical to CC-BY-4.0 with CC0 applying to metadata, which is potentially a "weaker" requirement than OGL3.

Configuration

Dogs Act of 1906, addressing police powers for stray dogs, use of dog collars and that dead cattle must be buried to avoid attracting dogs.

Filenames have their default format based on the title provided in the data feed:

<title> (<maintype> <document number>).pdf

Or for unacceptably long titles (over 240 bytes triggers the filename-toolong API error):

<maintype> <document number>.pdf

Where language variants exist:

<title> (<maintype> <document number> <language code>).pdf

The example File:Dogs Act 1906 (UKPGA Edw7-6-32).pdf shows that the maintype is a United Kingdom Public General Act and that the document reference includes a regnal code for Edward VII, indicating that it was first authorized by Parliament during that King's reign, with the rest of the number being a sequential number. The document number format varies by maintype and by year, so is not especially reliable for indexing or formatting more generally.

Note that where "/" is found in the title, these appear to be consistently used for squashing together language variants. Only the text before the slash is taken as a file title, however the full title is included on the image page.
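The naming rules above can be sketched as a single function. This is a minimal illustration rather than the project's actual code: the function name is hypothetical, and keeping the language code in the short fallback name is an assumption made here so that two language variants would not collide.

```python
MAX_FILENAME_BYTES = 240  # limit quoted above for the filename-toolong API error


def build_filename(title, maintype, doc_number, lang=None):
    """Build a Commons filename following the formats listed above."""
    # Only the text before the first "/" is used as the file title.
    short_title = title.split("/")[0].strip()
    doc_id = f"{maintype} {doc_number}"
    if lang:
        doc_id = f"{doc_id} {lang}"
    name = f"{short_title} ({doc_id}).pdf"
    # The limit is in bytes, so measure the UTF-8 encoding, not the characters.
    if len(name.encode("utf-8")) > MAX_FILENAME_BYTES:
        # Assumption: the fallback keeps the language code, if any.
        name = f"{doc_id}.pdf"
    return name


dogs_act = build_filename("Dogs Act 1906", "UKPGA", "Edw7-6-32")
too_long = build_filename("x" * 300, "EUR", "2020-1181")
```

Falling back to the bare document code, rather than truncating, avoids producing a cut-off title that reads as a different sentence.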

Categories

The parent category is Category:Legislation of the United Kingdom.

Unfortunately the values for "subject" and "category" in the data feed appear inconsistent, with many documents missing them altogether. Later housekeeping may be able to use these for fine categorization, such as filtering out highways legislation.

The main link (self) contains an implicit hierarchy under maintype like uksi, ssi, wsi which are used to map to Commons sub-categories.

Python array mapping feed links to maintypes and filename abbreviations
siArr = [
	['/ukmo/', 'UK Ministerial Orders', 'UKMO'],
	['/uksi/', 'United Kingdom Statutory Instruments', 'UKSI'],
	['/ukpga/','United Kingdom Public General Acts', 'UKPGA'],
	['/ukcm/', 'United Kingdom Church Measures', 'UKCM'],
	['/ukci/', 'United Kingdom Church Instruments', 'UKCI'],
	['/uksro/','UK Statutory Rules and Orders', 'UKSRO'],
	['/ukla/', 'UK Local Acts', 'UKLA'],
	['/ssi/',  'Scottish Statutory Instruments', 'SSI'],
	['/sdsi/', 'Scottish Statutory Instruments', 'SSI'],
	['/asp/',  'Scottish Acts', 'ASP'],
	['/aosp/', 'Acts of the Old Scottish Parliament', 'AOSP'],
	['/wsi/',  'Welsh Statutory Instruments', 'WSI'],
	['/mwa/',  'Measures of the National Assembly for Wales', 'MWA'],
	['/anaw/', 'Welsh National Assembly Acts', 'ANAW'],
	['/asc/',  'Welsh Parliament Acts', 'ASC'],
	['/eur/',  'European Union Regulations', 'EUR'],
	['/eudn/', 'European Union Decisions', 'EUD'],
	['/eudr/', 'European Union Directives', 'EUDR'],
	['/nisi/', 'Northern Ireland Orders in Council', 'NISI'],
	['/nisr/', 'Northern Ireland Statutory Rules', 'NISR'],
	['/nia/',  'Northern Ireland Acts', 'NIA'],
	['/nisro/','Northern Ireland Statutory Rules and Orders', 'NISRO'],
	['/aip/',  'Acts of the Old Irish Parliament', 'AIP'],
	['/apni/', 'Acts of the Northern Ireland Parliament', 'APNI'],
	['/apgb/', 'Acts of the Parliament of Great Britain', 'APGB'],
	['/aep/',  'Acts of the English Parliament', 'AEP'],
	]
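A lookup against this array might look like the following sketch. Only a few rows are repeated here to keep the example self-contained, and the function name is a hypothetical one chosen for illustration.

```python
# A short excerpt of the array above; the full list has many more rows.
SI_SUBSET = [
    ['/uksi/', 'United Kingdom Statutory Instruments', 'UKSI'],
    ['/ukpga/', 'United Kingdom Public General Acts', 'UKPGA'],
    ['/ssi/', 'Scottish Statutory Instruments', 'SSI'],
    ['/wsi/', 'Welsh Statutory Instruments', 'WSI'],
]


def classify(link, table=SI_SUBSET):
    """Return (category name, filename abbreviation) for a feed link, or None."""
    for fragment, category, abbreviation in table:
        # The surrounding slashes stop '/nisr/' matching inside '/nisro/', etc.
        if fragment in link:
            return category, abbreviation
    return None


result = classify("http://www.legislation.gov.uk/id/ssi/2020/217")
```

Returning None for an unrecognised path lets the caller skip the entry rather than guess a category.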

The resulting mapping is to:

Also of interest Acts of Parliament of the United Kingdom relating to the European Communities and the European Union

These are being created as new categories with these (plural) terms when content is uploaded under them. At first glance this may be in conflict with existing categories, however the naming of current categories is comparatively muddied with historical terminology or less precise language than that used by legislation.gov.uk. Later merging and reforming may be needed based on consensus. An example of pre-existing categories is Category:The Statutes of the United Kingdom of Great Britain and Ireland.

{{DEFAULTSORT}} is applied to all files using the document id, which means that under any category they should display in the order of maintype, year or regnal period, document number.

Metadata

Some documents have a detailed XML metadata file available; however, as this is inconsistent, only the data in the main data feed is parsed. Within this there are optional elements, which are only added to the image page when found:

  • Subjects and Categories, with Categories only displayed if not identical to the Subjects list
  • ISBN
  • Summary is used to populate description, when this is unavailable the title is repeated
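The handling of these optional elements can be sketched as below. The function and the dictionary keys are hypothetical names for illustration; the real image page template is not reproduced here.

```python
def describe(title, summary=None, subjects=(), categories=(), isbn=None):
    """Collect only the optional fields that are actually present."""
    summary = (summary or "").strip()
    # When no summary is available the title is repeated as the description.
    fields = {"description": summary or title}
    if subjects:
        fields["subjects"] = list(subjects)
    # Categories are shown only when they differ from the Subjects list.
    if categories and list(categories) != list(subjects):
        fields["categories"] = list(categories)
    if isbn:
        fields["isbn"] = isbn
    return fields


blank = describe("The A83 Trunk Road Order 2020", summary="")
slip = describe("Correction Slip 1", summary="These Regulations provide amendments.",
                subjects=["MENTAL CAPACITY"], categories=["MENTAL CAPACITY"],
                isbn="9780338012487")
```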

Taking a dive into one example: the data feed is accessed by default at 20 entries at a time, and this is page 46 of the data feed. One entry is for the A83 Trunk Road, with these details:

<entry>
	<id>http://www.legislation.gov.uk/id/ssi/2020/217</id>
	<title>The A83 Trunk Road (Arrochar) (Temporary 30mph Speed Restriction) Order 2020</title>
	<link rel="self" href="http://www.legislation.gov.uk/id/ssi/2020/217"/>
	<link href="http://www.legislation.gov.uk/ssi/2020/217/made"/>
	<link rel="alternate" type="application/pdf" href="http://www.legislation.gov.uk/ssi/2020/217/pdfs/ssi_20200217_en.pdf" title="PDF"/>
	<link rel="http://purl.org/dc/terms/tableOfContents" type="application/xml" href="http://www.legislation.gov.uk/ssi/2020/217/contents/made" title="Table of Contents"/>
	<author><name/></author>
	<updated>2020-07-18T00:59:59+01:00</updated>
	<published>2020-07-17T03:00:43.454253+01:00</published>
	<ukm:DocumentMainType Value="ScottishStatutoryInstrument"/><ukm:Year Value="2020"/><ukm:Number Value="217"/>
	<summary/>
</entry>

Walking through, we can see the unique id link, from which we can generate the filename unique id of SSI 2020-217, and deduce that the sub-category Category:Scottish Statutory Instruments should be applied. Using the title with the id gives the full filename, which can be tested for pre-existence on Commons before going further. The link with title="PDF" gives us the document source; if it is not found the entry is skipped, as we are only uploading PDFs. The published date is truncated to provide the document date, while in this case the ukm schema provides nothing not already deduced. For this entry the summary is blank, so the title has to be used to populate the description. For this example, the uploaded file at SSI 2020-217 shows EXIF data including an author of "Transport Scotland". Sadly there appears to be no way to obtain this from the data feed, nor even by scraping the full legislation page, which only includes the PDF itself as content. No examples have been noticed so far that populate the author in the data feed, so this is not currently tested for.
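The walk-through can be sketched with the standard library XML parser. This is an illustration under assumptions: the feed is taken to use the standard Atom namespace (the excerpt above does not show its declaration), the sample is a cut-down copy of the entry, and upper-casing the path segment stands in for the abbreviation lookup in the maintype array.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # assumed namespace of the feed

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
  <id>http://www.legislation.gov.uk/id/ssi/2020/217</id>
  <title>The A83 Trunk Road (Arrochar) (Temporary 30mph Speed Restriction) Order 2020</title>
  <link rel="alternate" type="application/pdf" href="http://www.legislation.gov.uk/ssi/2020/217/pdfs/ssi_20200217_en.pdf" title="PDF"/>
  <published>2020-07-17T03:00:43.454253+01:00</published>
  <summary/>
</entry>
</feed>"""


def parse_entry(entry):
    """Return the upload details for one entry, or None if it has no PDF link."""
    entry_id = entry.findtext(ATOM + "id")
    title = entry.findtext(ATOM + "title")
    # After "/id/" comes the maintype, then year (or regnal code) and number.
    maintype, *rest = entry_id.split("/id/")[1].split("/")
    doc_id = f"{maintype.upper()} {'-'.join(rest)}"
    pdf = next((link.get("href") for link in entry.findall(ATOM + "link")
                if link.get("title") == "PDF"), None)
    if pdf is None:
        return None  # only PDFs are uploaded
    summary = (entry.findtext(ATOM + "summary") or "").strip()
    return {
        "doc_id": doc_id,
        "pdf": pdf,
        "date": entry.findtext(ATOM + "published")[:10],  # truncate to the day
        "description": summary or title,  # blank summary falls back to title
    }


info = parse_entry(ET.fromstring(SAMPLE).find(ATOM + "entry"))
```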

Corrections

A special case is "corrections", often titled "Correction Slip", which may have entries like:

<entry>
	<id>http://www.legislation.gov.uk/nisr/2020/57/pdfs/nisrcs_20200057_en_001.pdf</id>
	<title>Correction Slip 1</title>
	<link rel="self" href="http://www.legislation.gov.uk/nisr/2020/57/pdfs/nisrcs_20200057_en_001.pdf"/>
	<link href="http://www.legislation.gov.uk/nisr/2020/57/pdfs/nisrcs_20200057_en_001.pdf"/>
	<author><name/></author>
	<updated>2020-08-26T00:59:59+01:00</updated>
	<published>2020-06-05T00:00:00+01:00</published>
	<ukm:DocumentMainType Value="NorthernIrelandStatutoryRule"/><ukm:Year Value="2020"/><ukm:Number Value="57"/><ukm:ISBN Value="9780338012487"/><ukm:CreationDate Date="2020-03-31"/><ukm:Subject Value="MENTAL CAPACITY"/>
	<category term="MENTAL CAPACITY"/>
	<summary>These Regulations provide amendments to the Mental Capacity (Deprivation of Liberty) (No. 2) Regulations (Northern Ireland) 2019 (the principal Regulations) during an emergency period where significant proportion of the workforce is unavailable to carry out the statutory functions.</summary>
</entry>

These are not currently uploaded, as the source link does not have a title of "PDF" or "Print Version". If thought sufficiently useful they may be added as a special case, though extra design work would be needed: the id would need special parsing, and to be meaningful each slip should be cross-linked with its 'parent' document where it exists.

Welsh

Where documents exist in Welsh and English variants, "en" and "we" are added to the identity (not our choice, this is how the pdfs are named at source) and files uploaded as separate documents.

<link rel="alternate" type="application/pdf" href="http://www.legislation.gov.uk/wsi/2020/519/pdfs/wsi_20200519_we.pdf" title="PDF"/>
<link rel="alternate" type="application/pdf" href="http://www.legislation.gov.uk/wsi/2020/519/pdfs/wsi_20200519_en.pdf" title="PDF"/>

The same Commons page text is used. It is unknown if the data feed is fully consistent in this way of presenting the variations in PDFs, or if languages other than Welsh are handled this way.
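One way to pick out the language code is to read it from the end of the PDF name, as in this sketch. The regular expression and function names are assumptions for illustration, and keeping only the first PDF per language is a simplification.

```python
import re

# Two lower-case letters between the last underscore and ".pdf".
LANG_SUFFIX = re.compile(r"_([a-z]{2})\.pdf$")


def language_code(pdf_url):
    """Return the language suffix of a source PDF name, or None if absent."""
    match = LANG_SUFFIX.search(pdf_url)
    return match.group(1) if match else None


def first_pdf_per_language(pdf_urls):
    """Map each language code to the first PDF link that carries it."""
    chosen = {}
    for url in pdf_urls:
        code = language_code(url)
        if code is not None:
            chosen.setdefault(code, url)
    return chosen


links = [
    "http://www.legislation.gov.uk/wsi/2020/519/pdfs/wsi_20200519_we.pdf",
    "http://www.legislation.gov.uk/wsi/2020/519/pdfs/wsi_20200519_en.pdf",
]
by_language = first_pdf_per_language(links)
```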

Example WSI 2020-519 en and WSI 2020-519 we.

Some early uploads missed this wrinkle and only uploaded the first language version (Welsh), the second version being added in a later run. Consequently naming may be inconsistent, but could be harmonized if someone wanted to do the renames; example WSI 2020-1096 and WSI 2020-1096 en.

For unknown reasons many older Northern Ireland Statutory Rules and Orders have source PDF names that use the _en suffix, though as far as can be seen, only English versions were published.

Queen's Printer Versions

There are different scenarios where a "Queen's Printer" PDF version may exist and be different to the PDFs that are defined as the "revised" or "as enacted" versions. Some Queen's Printer versions may be old scans which have been updated with OCR data and may have been word processed to create the "as enacted" or "revised" version and searchable web pages, while others are interpreted to be the multi-language version in English and Welsh, though in effect there are 3 Queen's Printer versions in this scenario. These are being retrospectively added to relevant categories where they exist and may be of additional educational use. The naming adds the suffix "qp" to the identity.

The PDF links for alternative Queen's Printer Versions are not in the <entry> within the data.feed. To find these, the script checks current Commons categories for already uploaded documents. Using the source link given, it then navigates to the readable web page at legislation.gov.uk and checks the web html source for likely looking links to pdf documents. Appropriate links are described as 'Queen's Printer' but there is some variation in how this is done for different legal instruments. Where new documents exist, the new Commons image pages use a gallery tag to link back to the existing file and add a direct source link to the new PDF which may help with later searches. By using a gallery, the original will automatically show a link to the new upload in the 'File usage on Commons' section.
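The link check might be sketched with the standard library HTML parser as below. The sample markup and its paths are invented for illustration and the real pages vary, so the simple text match on "Queen's Printer" is an assumption that would need loosening in practice.

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collect (href, link text) pairs for anchors that point at PDFs."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def queens_printer_links(html):
    """Return hrefs of PDF links whose text mentions the Queen's Printer."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return [href for href, text in finder.links
            if "queen's printer" in text.lower()]


# Hypothetical page fragment, not copied from legislation.gov.uk.
SAMPLE_HTML = """
<a href="/example/pdfs/example_en.pdf">Original Print PDF</a>
<a href="/example/pdfs/example_qp.pdf">Queen's Printer Version</a>
<a href="/example/contents">Table of Contents</a>
"""
found = queens_printer_links(SAMPLE_HTML)
```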

Comparison of Queen's Printer original with "revised" or "as enacted" versions

Design and technical matters

The site legislation.gov.uk was not whitelisted at the time of project start; a request is pending at Phab:T265690. The PDFs appear to be small, so this does not represent a problem for local processing.

Where there are long titles these may result in proposed filenames of over 255 bytes, the Commons maximum length. These will revert to a super short name of the bare document code as artificially trimming the title may result in unpredictably misleading sentences. Example Commission Regulation (EU) 2020-1181 of 7 August 2020 correcting certain language versions of Directive 2007-46-EC of the European Parliament and of the Council establishing a framework for the approval of motor vehicles and their trailers, and of systems, components and separate technical units intended for such vehicles (Framework Directive), correcting certain language versions of Commission Regulation (EU) No 582-2011 implementing and amending Regulation (EC) No 595-2009 of the European Parliament and of the Council with respect to emissions from heavy duty vehicles (Euro VI) and amending Annexes I and III to Directive 2007-46-EC of the European Parliament and of the Council, and correcting the Danish language version of Commission Regulation (EU) 2017-2400 implementing Regulation (EC) No 595-2009 of the European Parliament and of the Council as regards the determination of the CO2 emissions and fuel consumption of heavy-duty vehicles and amending Directive 2007-46-EC of the European Parliament and of the Council and Commission Regulation (EU) No 582-2011 (Text with EEA relevance) (EUR 2020-1181).pdf becomes EUR 2020-1181.pdf

The upload script is written in Python using Pywikibot. The upload process is highly robust, so internet outages or other upload interruptions are handled gracefully. Commons API warnings are not ignored, which means that there should be no duplicates uploaded, and warnings such as the same filename being in use with a different file type should cause the upload to be skipped. The design is re-run friendly, so possibly transient errors like mime type rejections may be countered by a few re-runs.

The script is manually launched and can be restricted to a particular year or given a year range to run through. Full automation of refreshes for the current year may be reasonable once the data feed is white-listed so direct url upload is possible. At that point it could be run on the WMF cloud as a Faebot cron job for a weekly or even daily refresh.

Outages during the run are likely. This is running on a "classic" X220 ThinkPad made in 2011 and purchased second-hand for £100 in 2018. This tends to overheat when more than around 4 background processes are left running, or indeed if anything accidentally gets in the way of the outlet fan. At the time of writing, this longish term upload project is running in parallel on the same machine with the batch uploads of Städel museum, similarly reliant on local disk access. The long overall runtime may be affected by any internet outages or drop-outs, which does happen in unplanned ways with the home broadband supplier, or home electricity tripping out. Ref m:Hardware_donation_program/Fæ where a second-hand laptop request was raised back in September 2019, and would be of assistance with these projects.

Known errors, bugs

  • chunk-too-small These uploads give an unexpected offset and the API returns a chunk-too-small error. It's unclear why these are rejected, a possible cause being the unexpected format within the PDF, or an unusual bug with the WMF chunked upload process when handling small PDFs. Unfortunately this does not appear fixable even after a repeated download. Example Edw7-7-17 pdf link. A work-around is being used to trim off the last byte from the binary file, as these appear consistently to be (redundant?) line feed characters. See Category:Work around of byte trimmed for chunk-too-small API error.
    With the completion of the upload-by-url whitelisting of phab:T266117, this issue may have been resolved by always uploading direct to the server.
  • verification-error Some files return bad format errors, apparently recognizing the PDF as a text/html document; it is unclear why. These non-trivial errors may be transient, local glitches with downloading, or anti-bot measures by gov.uk; for example NISR 2020-90 gave WARNING: API error verification-error: Files of the MIME type "text/html" are not allowed to be uploaded but on a second attempt a couple of minutes later was uploaded without incident. The work-around has been to automate up to 2 more fresh downloads when the API returns this error, giving up after the third attempt; this appears to resolve most of the incidents.
  • multiple PDFs Apparently very rarely, an entry may have multiple same-language documents referenced. An example is EUR 1965-19 which has 7 PDFs listed in the entry; a suspect list is found with incategory:European_Union_Regulations intitle:"_en.pdf". In these cases the first document only will be uploaded. This is potentially fix-able but seems so rare at the moment that individual documents will be left to be done manually, with automation only if working around this known error is worth the extra programming time, in particular this would mean adding to the naming schema to allow for document sequences.
  • internal_api_error_UploadChunkFileException Another rare error, these are ignorable, possible upload server glitches, as the uploads appear to succeed on automated reattempt.
  • bad quality scans Some of the older documents are just bad quality. For example UKSI 1965-529 is a not-great old photocopy which has been apparently retrospectively scanned in on a flatbed, UKSI 1954-832 is both a poor scan and has a later digital format issue making it render in bad proportions, and Race Relations Act 1965 appears to have been scanned with a faulty light. Most look better when viewed in a PDF viewer, and are not automatically fixable as the legislation site curation service is out of our reach. However it may be possible for documents of high educational interest to find better copies and ensure the image page cross-links to the more usable versions as they become available.
  • titleblacklist-custom-hidden-char is a rare error, possibly due to corruptions in naming at source. Not currently fixed. Example "The 91â„2% Ulster Development Bonds Order (Northern Ireland) 1974 (NISR 1974-56).pdf"
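Two of the work-arounds listed above, the byte trim and the fresh-download retries, can be sketched as follows. The exception class, function names and simulated responses are hypothetical, and checking that the last byte really is a line feed before trimming is a slightly more cautious variant than the work-around as described.

```python
class VerificationError(Exception):
    """Stand-in for the API's verification-error response (hypothetical)."""


def trim_trailing_linefeed(data: bytes) -> bytes:
    """Drop one trailing line feed byte, if present."""
    return data[:-1] if data.endswith(b"\n") else data


def upload_with_fresh_downloads(download, upload, max_attempts=3):
    """Return the attempt number that succeeded, or None after giving up."""
    for attempt in range(1, max_attempts + 1):
        data = download()  # a fresh download for every attempt
        try:
            upload(data)
        except VerificationError:
            continue
        return attempt
    return None


# Simulated run: the first two downloads come back as HTML, the third as a PDF.
responses = iter([b"<html>busy</html>", b"<html>busy</html>", b"%PDF-1.4 ...\n"])


def fake_download():
    return next(responses)


def fake_upload(data):
    if not data.startswith(b"%PDF"):
        raise VerificationError("text/html is not allowed")


attempts_needed = upload_with_fresh_downloads(fake_download, fake_upload)
```

Passing the download and upload steps in as callables keeps the retry policy separate from the network code, which makes it easy to exercise with fakes as shown.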

Slurping

A (local) cron job has been running since 20 Oct 2020 which 'slurps' the data feed for the current year. This automatically stops when a page (20 documents) of the data feed results in no new upload attempts, which should work as the feed is sorted by descending publication date. The presumption is that UK Gov legal feeds should not be redesigned without breaking a lot of external dependencies, so a slurper task would probably run without needing maintenance for a few years.
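The stopping rule can be sketched as a loop over feed pages. The three callables, the page cap and the simulated feed are hypothetical stand-ins for the real fetching, Commons existence check and upload steps.

```python
def slurp(fetch_page, is_new, upload, max_pages=500):
    """Upload new entries page by page; stop at the first page with none."""
    uploaded = 0
    for page_number in range(1, max_pages + 1):
        new_entries = [e for e in fetch_page(page_number) if is_new(e)]
        if not new_entries:
            # The feed is newest first, so older pages should hold nothing new.
            break
        for entry in new_entries:
            upload(entry)
            uploaded += 1
    return uploaded


# Simulated feed of document ids, newest first, two per page for brevity.
pages = {
    1: ["SSI 2020-400", "SSI 2020-399"],
    2: ["SSI 2020-398", "SSI 2020-397"],
    3: ["SSI 2020-396"],
}
on_commons = {"SSI 2020-399", "SSI 2020-398", "SSI 2020-397", "SSI 2020-396"}
attempted = []
count = slurp(lambda n: pages.get(n, []),
              lambda e: e not in on_commons,
              attempted.append)
```

In this simulated run page 2 contains nothing new, so the loop stops without fetching page 3.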

Slurped uploads include an upload comment pointing to this section. Slurping is scheduled for once a day and, as the numbers will probably be tiny (<10 a day?), runs not under a flagged bot account but under one where the uploads might be noticed if something is going wrong.

Source code available at github, but this was written to be quick, not reusable by others.

In 2020, the most recent slurped uploads can be 'cheaply' found using this search, which sorts by descending creation date. For example, this shows that 17 newly published documents were slurped in the early hours of 29 Oct 2020 and these were of multiple different types (WSI, UKSI, NISR, SSI, EUDN, EUR).