Commons:GLAMwiki Toolset Project/Technical Architecture/Maintain versioned Metadata Sets


Maintain versioned Metadata Sets


Why

  1. Complete transparency of the provenance of an upload
    • By uploading the actual metadata set that was used to create the media files, the basis of the upload is completely transparent to any Wikimedia Commons user.
  2. Re-use of metadata sets without re-upload
    • If an improved mapping is created, the metadata of all the uploaded media files can be updated without actually re-uploading the metadata set.
  3. Transparency of metadata set updates
    • If the original metadata set is uploaded in its complete form and later updated (with new or deleted media items, or updated metadata for existing items), MediaWiki's built-in versioning lets users and re-users of the metadata set transparently choose which version to use, or revert to an older version if a newer one turns out to be corrupt.
  4. Richer/more expressive metadata
    • The original metadata is very often richer and more expressive than what can be captured in the default Wikimedia Commons templates - even the quite rich Artwork template. So while a mapping may be lossy (whether due to incompatible data models or simply human error), the original metadata remains available and a richer mapping can be performed later without a full re-upload of all metadata.
  5. Job queue throttling
    • Having the metadata set within the wiki makes it easier to throttle the creation of job queue jobs.
  6. New Commons templates
    • Let's say a GLAM uploads a number of historical maps using Artwork as the target template. Later, a specific Map template is introduced by the Commons community. A GLAM-tool user could then simply re-map that map collection to the Map template, again without the need for an actual re-upload of metadata.
  7. Connecting with Wikidata:Data collaborators
    • We thought that having the actual metadata would be something the Wikimedia movement would be interested in; see Wikidata bulkuploads.


Requirements

  • File sizes
    • can range from a few kilobytes to several megabytes.
    • the maximum file size is 100 MB.
  • XML Schema - not required.
  • XML DTD - not required.
  • When metadata is in XML format, each record must consist of a single parent element with multiple child elements (see the sketch after this list).
    • The lang= attribute is the only XML attribute currently used; it is handled without user interaction.
  • There is no need to display the Metadata sets in the wiki.
  • There is no need to edit the Metadata sets in the wiki.
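
To illustrate the record structure required above, here is a minimal sketch in Python; the element names and values are hypothetical, and only the lang= attribute carries meaning for the tool:

    # One parent element per record, each containing multiple child
    # elements; element names and values here are hypothetical.
    import xml.etree.ElementTree as ET

    sample = """
    <records>
      <record>
        <title lang="nl">Het melkmeisje</title>
        <title lang="en">The Milkmaid</title>
        <creator>Johannes Vermeer</creator>
      </record>
    </records>
    """

    for record in ET.fromstring(sample).findall('record'):
        for child in record:
            print(child.tag, child.get('lang'), child.text)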


Potential Solutions

  1. Upload the Metadata Set to the File: namespace
    • upload metadata files to the wiki and store them as files, using the user name as part of a unique identifier; thus Rijksmuseum.xml uploaded by Dan-nl becomes File:Rijksmuseum-Dan-nl.xml.
    • add the category GWToolset Metadata Sets to the uploaded metadata files so that Metadata Sets can be found on the page Category:GWToolset_Metadata_Sets.
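    • A rough sketch of this approach through the MediaWiki web API is given below; it assumes an already-authenticated requests.Session, and the extension itself would do the equivalent server-side in PHP:

      # Upload a metadata set to the File: namespace, appending the
      # uploader's user name to the file name and categorising the file
      # so it appears in Category:GWToolset_Metadata_Sets.
      import requests

      API = 'https://commons.wikimedia.org/w/api.php'

      def upload_metadata_set(session, user, path):
          # fetch a CSRF token for the authenticated session
          token = session.get(API, params={
              'action': 'query', 'meta': 'tokens',
              'type': 'csrf', 'format': 'json',
          }).json()['query']['tokens']['csrftoken']

          # Rijksmuseum.xml uploaded by Dan-nl becomes
          # File:Rijksmuseum-Dan-nl.xml
          name = path.rsplit('/', 1)[-1]
          filename = name[:-len('.xml')] + '-' + user + '.xml'

          with open(path, 'rb') as f:
              return session.post(API, data={
                  'action': 'upload', 'filename': filename,
                  'text': '[[Category:GWToolset Metadata Sets]]',
                  'token': token, 'format': 'json',
              }, files={'file': f}).json()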
  2. Use ContentHandler
    • upload metadata files to the wiki and store them with ContentHandler as TextContent under the GWToolset namespace, using the user name to create a unique path; thus Rijksmuseum.xml uploaded by Dan-nl becomes GWToolset:Metadata_Sets/Dan-nl/Rijksmuseum.xml
    • Metadata Sets can then be found on the page Special:PrefixIndex/GWToolset:Metadata_Sets or Special:PrefixIndex/GWToolset:Metadata_Sets/Dan-nl
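    • Sketched the same way, this variant saves the raw XML as the text of a wiki page instead of a file; it assumes the extension registers the GWToolset namespace with a content model that accepts raw XML:

      # Save a metadata set as TextContent on a per-user subpage in the
      # GWToolset namespace.
      import requests

      API = 'https://commons.wikimedia.org/w/api.php'

      def save_metadata_set(session, user, path):
          token = session.get(API, params={
              'action': 'query', 'meta': 'tokens',
              'type': 'csrf', 'format': 'json',
          }).json()['query']['tokens']['csrftoken']

          # Rijksmuseum.xml uploaded by Dan-nl is stored at
          # GWToolset:Metadata_Sets/Dan-nl/Rijksmuseum.xml
          title = 'GWToolset:Metadata_Sets/%s/%s' % (user, path.rsplit('/', 1)[-1])
          with open(path, encoding='utf-8') as f:
              return session.post(API, data={
                  'action': 'edit', 'title': title, 'text': f.read(),
                  'token': token, 'format': 'json',
              }).json()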
  3. “What I would suggest is a hybrid approach. The metadata file gets uploaded, and is stored using FileBackend class. (There's a couple extensions that store "files" without them being a file page. For example the Score extension stores the rendered files on the server, but its not attached to any file page). Once the xml file is on the server, use ContentHandler to make a new content type that stores a reference to the file [instead of the original file] (probably in the form of a mediawiki virtual file url).”
    -- http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/71536/focus=71839
    -- Bawolff (talk) 19:19, 31 Jul 2013 (UTC)
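    • What the reference content object might hold is sketched below; the JSON layout and the mwstore:// path are illustrative assumptions, not a defined format:

      # Build the reference object for the new content type: a pointer
      # to the stored file (as a MediaWiki virtual file URL) plus enough
      # data to verify it. The raw XML itself lives in the file backend.
      import hashlib
      import json

      def make_reference(user, filename, xml_bytes):
          return json.dumps({
              'file': 'mwstore://local-backend/gwtoolset/%s/%s' % (user, filename),
              'sha1': hashlib.sha1(xml_bytes).hexdigest(),
              'size': len(xml_bytes),
          }, indent=2)

      print(make_reference('Dan-nl', 'Rijksmuseum.xml', b'<records/>'))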
  4. Break up the metadata records and store each one as a ContentHandler object in a new namespace, e.g. Original Metadata, pointing to a reference ContentHandler object in Original Metadata Sets. The concept is similar to how the new Campaign namespace is implemented.
    • the initial upload would store the XML file in the upload stash
    • a batch job would process the stash file, converting it into corresponding objects in the ContentHandler namespaces
    -- discussion with Multichill (talk) 07:03, 27 Sep 2013 (UTC)
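    • The batch step might look roughly as follows; the namespace and page-title conventions are hypothetical:

      # Split a stashed metadata set into one object per record. On the
      # PHP side each record would become a ContentHandler object in the
      # Original Metadata namespace, indexed by a set-level object in
      # Original Metadata Sets.
      import xml.etree.ElementTree as ET

      def split_records(user, set_name, xml_text):
          records = {}
          for i, record in enumerate(ET.fromstring(xml_text), start=1):
              title = 'Original Metadata:%s/%s/%d' % (user, set_name, i)
              records[title] = ET.tostring(record, encoding='unicode')
          index_title = 'Original Metadata Sets:%s/%s' % (user, set_name)
          return index_title, records

      index, records = split_records(
          'Dan-nl', 'Rijksmuseum', '<records><record/><record/></records>')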