Commons:Structured data/Overview


Since its inception, Wikimedia Commons has been one of the most successful platforms within the Wikimedia community for sharing knowledge. It has long been known, however, that a number of underlying technical hurdles have prevented it from becoming an ideal platform for other knowledge communities, potential partners, and our own Wikimedia communities. Addressing these hurdles means creating features that allow both Wikimedia communities and external partners to take full advantage of the platform, including multilingual descriptions and categories, holistic APIs, and better metadata management. All of these would, by extension, substantially improve broader public access to Commons through features like multilingual search.

Starting in 2014, the Foundation began an initial exploration of the concept of structured data on Commons to address these requests. That exploration identified most of the core architectural features needed to improve Commons and found that these features fit well within the roadmap for the Wikidata software, Wikibase, supported by WMDE. At that point, the 2014 project was slowed and delayed until Wikibase offered more robust infrastructure. During 2016, Wikibase reached that point, allowing for an initial demonstration of how Structured Data on Commons could work.

In October 2016, WMF and Wikimedia Deutschland announced a funding agreement that would provide multi-year support for Wikidata, including funding for backend support for integrating Wikidata into Wikimedia Commons. WMF’s long-term roadmap includes a plan to build tools to make this infrastructure usable by the Commons and GLAM communities. However, the full body of this work would not be resourced for several years.

Recently, WMF was approached by a grantmaker to apply for funding that could accelerate this work into a shorter, three-year time frame. The potential funding may cover the cost of some work that is already in WMF’s annual plan and budget. WMF’s fundraising team generally does not seek funding for projects outside of WMF’s annual plan. However, this is a unique opportunity in which a funder approached us with an interest in accelerating the work on this existing project. To be clear and to set expectations appropriately, we have only been asked to apply. We may receive funding or we may not.

NOTE: This is NOT a new project proposal. Currently, this page discusses an October 2016 update from the WMF on possibly expediting the project.

What are we publishing here?

This page is a synopsis of the potential accelerated roadmap/plan we expect to share with potential funders for this work. We haven’t included the whole document, because it is still under internal discussion and review, and because the document will change depending on whom we seek or receive resources from. Additionally, much of the material in the document explains the context of structured data on Commons to unfamiliar audiences (including, but not limited to, Commons’ scope; why linked structured metadata is important for multimedia, especially cultural heritage multimedia; and comparisons to other platforms). We have included here:

  • Our high-level roadmap for developing the project, if we get resources to expedite the work
  • Identified benefits and risks of the project.
  • The expected opportunity for changes in Commons.

If/when we gain the necessary external support for the accelerated project, we will share a fuller version (barring sensitive information) with the community.

Why do we need your comment? How can you help?

The Commons volunteer community has long asked for features that require a database-like structure that would let them describe the media more fully. For example, multilingual categories, which would allow more non-English-speaking volunteers to effectively tag and find images, have been a long-term conversation in our community. (See an early-2004 discussion, a 2008 blog post, the 2009 GLAM-WIKI Recommendations, another in 2010, a question on Quora, the 2015 Community Wishlist and more recent discussions.)

It is our continued hope that Structured data on Commons provides a foundation for solving those problems in general.

We want to provide an open window of time for members of the Wikimedia communities to provide feedback on this particular project and are seeking feedback on the following:

  • Do you see this expedited roadmap as a worthy undertaking?
  • What roadblocks, risks or challenges do you anticipate with accelerating such a project?
  • Does the current information on the page here accurately represent the role of the communities, especially the Wikimedia Commons and Wikidata communities, in engaging with such a software project?
  • How would you like to support this project?
  • If you would like to stay up to date with this proposal and/or Structured data on Commons more generally, we recommend signing up for this Newsletter

We plan to leave initial commenting on this proposal open through November 9.

High level description

We have put together a set of materials that lays out why and how structured data works within the Wikimedia projects, and the effects of lacking an effective structured data infrastructure on Wikimedia Commons, reflecting the known issues described at Commons:Structured data. Here we highlight some key elements of those materials, focusing on the actual framework for working with the Wikimedia community.

Background

All freely licensed photos, audio, and video files on Wikipedia are stored on the Wikimedia Foundation site called Wikimedia Commons. Started in 2004, Commons now has 34 million media files – photos, audio, and video – and it continues to rapidly grow every year: Contributors added about five million new files last year. Its media files are easily discoverable through platforms like Google search.

Thousands of volunteers integrate these media files from Commons into our Wikimedia projects, like Wikipedia, to illustrate our content and share that media with the public. These files are typically 1) personal photography and media uploaded by individuals; 2) freely licensed media files from locations on the Internet like Flickr, YouTube, open access journals, and other repositories; or 3) donations from institutions and organizations with substantial media collections, like UNESCO, NASA, and the British Library.

Wikimedia Commons operates on MediaWiki, the same software that powers Wikipedia. MediaWiki was developed primarily for writing and hosting text like Wikipedia, so Commons’ millions of media files don’t have machine-readable metadata. Instead, each media file is accompanied by free-form, non-structured descriptions that are not consistently machine-readable. This makes it difficult to link the media there to other educational resources on the Wikimedia sites, like Wikipedia or Wikisource. It also makes it difficult for other websites to take advantage of the media – to link to it and connect the media to their own content. Moreover, unstructured data makes it more difficult for Internet users to search for this media – to find the photos, video, and audio that would be ideal to reuse but are essentially unseen, hidden because the media’s details, the words that describe the media, are incomplete and disconnected from the wider Internet.

Structured data – and Wikidata – are changing the way people across the Internet can access information from repositories like Wikimedia Commons. One way to think of structured data: It’s a kind of DNA that explains information in a much more integral way. Structured data provides meaning for a media file from multiple angles. It offers multiple ways to search for that media, and multiple ways to understand that media. Unstructured data only tells part of that media’s history. Unstructured data diminishes the value of information contextualizing the media – and makes it much harder to find the image, audio, or video. For much of Commons’ content, it’s as if the media files don’t exist at all, since they’re so difficult to find on the Internet, and difficult to find on Commons. We’re changing that.

The Structured Data on Commons project will provide the means to transform the information throughout Wikimedia Commons media files from free text into machine-readable data, so that Commons’ millions of media files are much easier to view, translate, find, edit, curate, use, and reuse.
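The difference between free text and machine-readable data can be illustrated with a minimal sketch. It is not the actual Commons or Wikibase schema: the `MediaRecord` class, its field names, the file names, and the item IDs are hypothetical placeholders written in the style of Wikidata identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class MediaRecord:
    """Hypothetical structured record for one media file (not the real schema)."""
    filename: str
    # Language code -> free-text caption in that language.
    captions: dict = field(default_factory=dict)
    # Language-neutral item IDs (placeholders, styled like Wikidata IDs).
    depicts: set = field(default_factory=set)


def find_by_text(records, word):
    """Free-text search: only matches captions that contain the literal word."""
    word = word.lower()
    return [r.filename for r in records
            if any(word in c.lower() for c in r.captions.values())]


def find_by_item(records, item_id):
    """Structured search: matches the item ID regardless of caption language."""
    return [r.filename for r in records if item_id in r.depicts]


records = [
    MediaRecord("Cat_on_sofa.jpg", {"en": "A cat on a sofa"}, {"Q_CAT"}),
    MediaRecord("Gato_en_la_calle.jpg", {"es": "Un gato en la calle"}, {"Q_CAT"}),
    MediaRecord("Harbour.jpg", {"en": "Harbour at dusk"}, {"Q_HARBOUR"}),
]

print(find_by_text(records, "cat"))     # finds only the English-captioned file
print(find_by_item(records, "Q_CAT"))   # finds both cat files
```

In this sketch the text search misses the Spanish-captioned file, while the identifier-based search finds both, which is the kind of multilingual discovery the project aims to enable.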

Known benefits

In the proposed work, we highlight the following outcomes:

  • Benefit 1:
Categories and metadata would be more easily multilingual on Commons. This allows successful completion of the 2015 Community Wishlist request at: 2015 Community Wishlist Survey/Commons#Allow categories in Commons in all languages
  • Benefit 2:
Developers may expand the project even more. Structured data on Commons provides the fundamental infrastructure required for consistent use of Commons data through APIs and other machine readable endpoints, so that developers both within and from outside the Wikimedia Community can create consistent, reusable, and reliable software that edits, helps with reuse, and allows analysis of Commons media and its associated data.
  • Benefit 3:
With better Commons search capability, contributors can more effectively illustrate Wikimedia content.
  • Benefit 4:
A better Commons search allows for better user experience for the broader reader community.
  • Benefit 5:
Better-structured Commons data allows for easier and simpler partnerships with content providers, especially knowledge collecting/sharing organizations.

We also identify other expected benefits to our broader community from improving Wikimedia Commons, including but not limited to:

  • Ease of GLAM and other knowledge partner participation, especially in communities not supported by organizations like Europeana and DPLA
  • Greater discovery and reuse by external communities beyond Wikimedia Projects

Known risks

Thus far we have documented a number of identified risks:

  • Risk 1 (High):
The community review and discussion of structured data on Commons could identify large unanticipated technical changes that require extended development time.
  • Risk 2 (Medium):
Community adoption of features into existing workflows is obstructed by community politics. However, the Wikidata team has demonstrated effective adoption of Wikidata features both in Wikidata and the broader use of Wikidata in other projects, including Commons and dozens of language Wikipedias. Applying lessons learned of transparent planning and community management in the Wikidata context to Structured Data on Commons will be a necessary part of its success.
  • Risk 3 (Medium):
The backlog of tools and software features that break during the infrastructure change could impede effective integration of all of the new features planned for structured data on Commons.
  • Risk 4 (Low):
Relationships between WMF and/or WMDE and the Commons community change in some fundamental way that makes community members skeptical of the initiative.
  • Risk 5 (Low):
Complexities in feature development and roadmaps make it difficult to clearly define the boundary between the Wikidata work currently funded by WMF and future work.
  • Risk 6 (High):
Proposed changes require significant changes to Commons’ underlying schema and software. This could disrupt existing re-users of our content and builders of community and third-party tools. To some degree we need to mitigate this by building migration paths or compatibility layers and by anticipating a long window of engaging with and supporting changes to that software.

Project Structure

There are already commitments by both WMDE and WMF to work on structured data on Commons. Wikimedia Deutschland has already committed to starting some of this work through planned work on Wikidata. Moreover, both the Discovery and Multimedia teams have made commitments related to creating basic infrastructure for using Commons structured data.

The timeline below is based on the current status of this work and how we would expect an accelerated timeline to look if this grant were awarded to WMF. The project would involve a three-year process, broadly divided into three focuses:

Focus 1 - building the initial infrastructure
Focus 2 - integrating additional features into Commons and Wikimedia projects
Focus 3 - active engagement of the broader community and supporting additional tool and infrastructure development.

Should we not receive additional funding through the current opportunity, this timeline would need to be updated to take into account any progress made at a slower rate between now and the funding opportunity.

Year 1: Infrastructure | Year 2: Integration | Year 3: Engagement
Wikidata: Test and prototype metadata as structured data, integration of structured data interface, review labels and vocabulary | Wikidata: Integration of structured data interface, Wikidata UI concepts
Tools: Exploration & Testing | Tools: Deployment | Tools: Iteration and maintenance
Media Viewer License Display
Search & Query: Explore features | Search & Query: Deployment | Search & Query: Iteration and maintenance
Community: Review concepts & data models | Community: Feature launch feedback | Community: Feedback and curation support
Community: Support migration of tools, especially other community media import tools, like PattyPan | Community: Encourage tool development for engagement, cleanup and contribution
Partners: Review data models | Partners: Promote features and seek new potential partners | Partners: Encourage tool development for engagement, cleanup and contribution

How Commons Content Could Change

With this plan, after two years, the infrastructure will be in place to migrate parts of Commons’ millions of media files to structured data, and after three years that migration should be well under way as we solidify the infrastructure. Already, a sizable number of the images are ready for migration. As noted in the introduction, Magnus Manske, an expert in how Commons and Wikidata content is structured, estimates that the Commons community could instantly migrate between 16 and 19 million files into a Structured Commons format if they had the right infrastructure and a fairly standard piece of software. We're providing the infrastructure with this project, which frees the community to apply the relatively easy software – an example of the symbiosis that happens between Wikimedia and our projects' volunteers.

How soon after this initial migration would the other files be converted? Most of the remaining challenges would be related to solvable considerations around template/data features of existing Commons content. The “X-factor” is the capacity of volunteers and bot tools to migrate the material. Whenever you migrate a large set of material from one data structure (semi-structured, historical Commons data stored in wikitext) to another (structured data on Commons), there is a long tail of data-cleaning where, for a sizable window of time, a large portion of the content stored within the file will have a mix of structured data that’s easy to clean and integrate, and unstructured (or “uncleaned”) data that requires clean-up by a combination of semi-automated tools and human evaluation.
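The "long tail" of data cleaning described above can be sketched as a triage step. This is a minimal illustration, not the project's migration tooling: the function name `triage_dates` and the sample file names and values are hypothetical, and the rule "an ISO 8601 date counts as clean" is an assumption chosen only to keep the example small.

```python
from datetime import date


def triage_dates(raw_dates):
    """Split raw wikitext date values into cleanly parsed and needs-review sets.

    raw_dates maps a file name to the free-text date string found in its
    description. Values that parse as ISO 8601 dates (YYYY-MM-DD) are treated
    as ready for migration; everything else is queued for semi-automated tools
    or human evaluation.
    """
    migrated, needs_review = {}, {}
    for filename, raw in raw_dates.items():
        try:
            migrated[filename] = date.fromisoformat(raw.strip())
        except ValueError:
            # Keep the original string so a reviewer can interpret it.
            needs_review[filename] = raw
    return migrated, needs_review


# Hypothetical sample values.
sample = {
    "A.jpg": "2012-05-14",
    "B.jpg": "circa 1900",
    "C.jpg": "14 May 2012",
}
migrated, needs_review = triage_dates(sample)
print(sorted(migrated))       # files whose date parsed cleanly
print(sorted(needs_review))   # files left for cleanup
```

A real migration would need far more permissive parsing; the point of the sketch is only that each field ends up in one of two queues, and the second queue is what stretches the timeline.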

It may take five or even 10 years for the majority of Commons’ media files to be moved over into a Structured Data format. We can’t offer an exact timeframe since it depends on the input of the Commons community. But we are optimistic. Every month, more than 7,000 Commons contributors make five or more edits on Commons, and every month more than 1,300 Commons contributors make 100 or more edits on Commons. Commons and Wikidata each have about 7,000 active editors. Whether on Commons, Wikipedia, or another Wikimedia project, our sites are populated with passionate contributors. A key is designing software, tools, community support, and incentives for contributors to participate in the Structured Data on Commons project. That’s what we’re doing.

Once the infrastructure is in place, we anticipate that the Commons community will have the capacity to migrate certain data types, including “Categories.” Almost every image on Commons is placed in content-based categories, which are as close as Commons comes to tags. This information is ripe for synchronization with Wikidata identifiers.

More than 1.5 million (of the 4.2 million) of these categories have already been synchronized with Wikidata items, which makes them prime candidates for this migration. Many of these categories also have subcategories, which could be examined systematically with the right tools. Additional data items on almost every file could more readily be migrated:

  • Copyright statements
  • The majority of “creators of works” fields (some of which are dependent on major technical infrastructure planned for this grant)
  • The majority of upload date fields
  • The majority of creation date fields
  • The identities of uploaders of files (which requires major technical infrastructure changes planned for in this grant)

Subsections of Commons content that, with higher confidence, can be migrated, in large part, to structured data, include:

  • More than 31 million media files use a version of the Information template that provides some basic information that can be moved into structured data, some of which will need cleanup for consistency to create uniform data. These templates include several types of data ripe for migration:
      • More than 24 million of these files have language-identified file descriptions (captions), which are already encoded in a way that allows correct integration into structured data.
      • 6.2 million of these files have geo-coordinates that could accurately describe the location of the picture and be integrated into structured data.
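To illustrate why language-identified descriptions are comparatively easy to migrate, here is a minimal sketch that extracts them from wikitext. It assumes descriptions are wrapped in language templates of the form `{{en|1=...}}` or `{{en|...}}`; the function name `extract_captions` and the sample text are hypothetical, and the pattern deliberately ignores nested templates and would also match any other short lowercase template name.

```python
import re

# Matches {{xx|text}} or {{xx|1=text}}, where xx is a short lowercase
# language code, optionally with subtags (for example "pt-br").
LANG_TEMPLATE = re.compile(
    r"\{\{\s*([a-z]{2,3}(?:-[a-z]+)*)\s*\|\s*(?:1\s*=\s*)?(.*?)\s*\}\}",
    re.DOTALL,
)


def extract_captions(description):
    """Return a language-code -> caption mapping from a wikitext description."""
    return {lang: text for lang, text in LANG_TEMPLATE.findall(description)}


# Hypothetical description field from an Information template.
sample = "{{en|1=Harbour at dusk}} {{de|Hafen am Abend}}"
captions = extract_captions(sample)
print(captions)

# Descriptions without language templates yield nothing and would need
# manual or tool-assisted language identification instead.
print(extract_captions("Harbour at dusk"))
```

Descriptions that already carry a language code map directly onto multilingual fields, whereas untagged free text falls into the cleanup backlog.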

Other subsections of Commons that have been curated with more robust description metadata could be migrated more quickly by communities. And these subsections would be the most useful parts of the initial migration. They include*:

(*-These numbers do not indicate all files of that genre – just the ones curated with data in the best standard)

The Wikimedia Foundation’s UploadWizard, cross-wiki upload, core upload, and related upload campaign tools maintained by the Foundation, and the mass upload tools being developed by the volunteer GLAM-Wiki Community (principally PattyPan and GLAMPipe), will need adjustments to be compatible with structured data on Commons. These tools are all built around the assumption that media should have structured data and descriptions, so it's a matter of adjusting the data pathways through our API/framework and expanding the fields available in the upload forms.