User:Fæ/Project list/Irish legislation

From Wikimedia Commons, the free media repository

Shortcut: COM:Irish legislation project

Introduction

This project is a mass batch upload of legislation available as PDFs from http://www.irishstatutebook.ie. The project is currently (first week of November) at the design stage.

The project source domain has been whitelisted for URL uploads; see Phab:T267193.

Related projects

Configuration

The credit template {{The electronic Irish Statute Book}} is added to all uploads in this project. It assists tracking and ensures that the latest understanding of the copyright release and official status of the publications is easy to update across the collection.

Naming scheme

.

Categories

.

Design notes

The pre-1922 acts are not part of this upload. These are indexed at http://www.irishstatutebook.ie/eli/pre-acts.html, where HTML versions have been created. Although these can be printed from the web page version, there are no word-processed PDF versions available, and plain prints of web pages are not suitable for Wikimedia Commons.

Unfortunately the same is true for acts through to the 1990s. For example, for 1993 only one of forty documents has a PDF version. Only documents with PDFs are relevant to this project.

The irishstatutebook site uses a URL hierarchy of the form /eli/2020/act/13/, which provides a quick model for creating an identity number on Commons. Embedded within the web page metadata are DC (Dublin Core) values. These consistently use "esib", written out as "electronic Irish Statute Book (eISB)", but this does not distinguish Acts of the Oireachtas from Statutory Instruments. The exemplar for showing associated XML is http://www.irishstatutebook.ie/eli/2017/act/8/enacted/en/xml. Unfortunately the XML appears to be a bit of an afterthought, and the schema looks like a way of recycling the HTML pages. For example, "title" in the exemplar XML is "COURTS ACT 2017", while the metadata in the HTML shows a title of "Courts Act 2017" in title case.

Care needs to be taken in interpreting these fields. For example, the HTML has an embedded meta property eli:publisher of "Houses of the Oireachtas", while the same document has a meta name DC.Publisher of "Office of the Attorney General"; presumably one refers to the publisher of the document and the other to the publisher of the website itself.
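
One way to keep the two kinds of field apart while harvesting is to collect meta tags keyed by "name" separately from those keyed by "property". The following is a minimal sketch using only the Python standard library; the class name, the function name and the sample HTML fragment are illustrative assumptions (the fragment is hand-written around the two values quoted above, not copied from the site).

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags, keeping name= and property= fields apart."""

    def __init__(self):
        super().__init__()
        self.by_name = {}       # e.g. DC.* fields
        self.by_property = {}   # e.g. eli:* fields

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        # A repeated key overwrites the earlier value in this sketch.
        if attr.get("name"):
            self.by_name[attr["name"]] = content
        if attr.get("property"):
            self.by_property[attr["property"]] = content


def collect_meta(html_text):
    """Return (by_name, by_property) dictionaries for an HTML string."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.by_name, parser.by_property


# Hand-written fragment for illustration only.
SAMPLE_HTML = """
<head>
<meta property="eli:publisher" content="Houses of the Oireachtas" />
<meta name="DC.Publisher" content="Office of the Attorney General" />
</head>
"""

by_name, by_property = collect_meta(SAMPLE_HTML)
print(by_property["eli:publisher"])
print(by_name["DC.Publisher"])
```

Keeping two dictionaries means a later step has to choose explicitly which publisher value to use, rather than one silently overwriting the other.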

Schema

The official schema is described at http://www.irishstatutebook.ie/pdf/ELI_URI_schema.pdf, which includes some real examples. Key elements to identify are:

  • year
  • type; like act, ca, prv, si
  • natural identifier; which may be the sequential number within the year
  • version; like enacted, made
  • language; en/eng or ga/gle, ref ISO 639-1 (two-letter codes) and ISO 639-2 (three-letter codes)

These appear to be deducible from the given URL, without needing to parse the HTML source. Both the document identifier on Commons and a mapping to categories can be based on these elements as an index.
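
As a minimal sketch of deducing these elements from the URL alone, the function below splits the path into its segments. It assumes the segment order /eli/year/type/natural identifier/version/language seen in the two example URLs on this page; the names EliParts and parse_eli_url are hypothetical, and the official schema PDF may allow forms this sketch does not handle.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class EliParts(NamedTuple):
    year: int
    doc_type: str             # e.g. "act", "si"
    natural_id: str           # kept as text; may not always be a plain number
    version: Optional[str]    # e.g. "enacted", "made"; None if not in the URL
    language: Optional[str]   # e.g. "en", "ga"; None if not in the URL


def parse_eli_url(url: str) -> EliParts:
    """Split an /eli/... URL into its key elements (assumed segment order)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 4 or segments[0] != "eli":
        raise ValueError(f"not an ELI document URL: {url}")
    year, doc_type, natural_id = segments[1:4]
    version = segments[4] if len(segments) > 4 else None
    language = segments[5] if len(segments) > 5 else None
    # int() raises ValueError if the year segment is not numeric.
    return EliParts(int(year), doc_type, natural_id, version, language)


parts = parse_eli_url(
    "http://www.irishstatutebook.ie/eli/2017/act/8/enacted/en/xml"
)
print(parts)
```

The resulting tuple could serve as the index for both the Commons identifier and the category mapping; how those are actually formatted is left open here.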

Data fallbacks

The website publications have changed over time. Consequently the availability of XML pages, HTML metadata and even PDFs falls off for earlier years. The workflow therefore needs fallbacks to ensure that as many documents as possible can be uploaded, while aiming to harvest reasonable information.

The main loop is to go through a year range. Then for a given year the different types of document are cycled through (si, act) to generate the main catalogue page:

From this each document's web page can be discovered:

The fallback for data can then be:

  1. If the HTML page has good metadata fields embedded, these can generate the title, description and publication date.
  2. If the metadata is missing, the XML page can be used. This does not have a description text, but SEC1 appears to consistently contain the introductory document text, which can be used as a substitute.
  3. Should the XML not exist, we probably have to give up trying to automate the upload of that document.
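
The fallback order above can be sketched as a single selection function. This is an illustration under stated assumptions, not the project's code: the function name, the dictionary keys ("title", "description", "date", "sec1") and the rule that "good" metadata means a non-empty title are all hypothetical, and fetching or parsing the pages is assumed to happen elsewhere.

```python
from typing import Optional


def choose_metadata(html_meta: Optional[dict],
                    xml_fields: Optional[dict]) -> Optional[dict]:
    """Pick upload metadata from the best available source, or None."""
    # 1. Prefer the embedded HTML metadata when it looks usable
    #    (assumption: usable means a non-empty title).
    if html_meta and html_meta.get("title"):
        return {
            "source": "html",
            "title": html_meta["title"],
            "description": html_meta.get("description"),
            "date": html_meta.get("date"),
        }
    # 2. Fall back to the XML page; it has no description text,
    #    so the SEC1 text stands in for it.
    if xml_fields and xml_fields.get("title"):
        return {
            "source": "xml",
            "title": xml_fields["title"],
            "description": xml_fields.get("sec1"),
            "date": xml_fields.get("date"),  # may be absent
        }
    # 3. Neither source is usable: skip automated upload.
    return None


# Hypothetical inputs for illustration.
print(choose_metadata(None, {"title": "COURTS ACT 2017", "sec1": "An Act to ..."}))
print(choose_metadata(None, None))
```

Returning None for the third case lets the main loop log the document for manual review instead of uploading it with poor metadata.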

Copyright

All documents explicitly refer to http://www.irishstatutebook.ie/eli/open-data.html, both for the website content and the PDFs. This confirms that the "Creative Commons Attribution 4.0 International License" applies to the content.