Commons:Structured data/Overview

From Wikimedia Commons, the free media repository

Since its founding, Wikimedia Commons has been one of the most successful platforms within the Wikimedia community for sharing knowledge. However, it has long been known that a number of underlying technical hurdles have prevented it from becoming an ideal platform for other knowledge communities, potential partners, and our own Wikimedia communities. Meeting these technical challenges would require building features that allow both the Wikimedia communities and external partners to take full advantage of the platform, including multilingual descriptions and categories, holistic APIs, and better metadata management. All of this would substantially improve public access to Commons through features such as multilingual search.

In 2014, the Foundation began an initial exploration of the concept of structured data on Commons to meet these needs. That exploration identified most of the core architectural features needed to improve Commons, and found that these features fit well into the roadmap for Wikibase, the software behind Wikidata, which is supported by WMDE. At that point, in 2014, the project was slowed and postponed until Wikibase offered a more stable infrastructure. During 2016 we reached that point, which made a first demonstration of how structured data works on Commons possible.

In October 2016, WMF and Wikimedia Deutschland announced a funding agreement that provides multi-year support for Wikidata, including funding for the backend support needed to integrate Wikidata into Wikimedia Commons. The WMF’s long-term roadmap includes a plan to develop tools that make this infrastructure usable for the Commons and GLAM communities. However, the sum of this work would not be available for several years.

Recently, a funder approached the WMF and invited it to apply for funding that could accelerate this work into a shorter, three-year timeframe. The potential funding could cover the costs of some work that is already included in the WMF’s annual plan and budget. The WMF fundraising team does not generally seek funding for projects outside the WMF annual plan. However, this is a unique opportunity in which a funder approached us to accelerate work on this existing project. To be clear, and to set expectations appropriately: we have only been asked to apply. We may receive funding, or we may not.

NOTE: This is NOT a new project proposal. This page currently discusses an October 2016 update from the WMF about a possible acceleration of the project.

What is being published here?

This page represents a synopsis of the potential accelerated roadmap/plan we expect to share with potential funders for this work. We haven’t included the whole document, because it is still under internal discussion and review, and it will change depending on whom we seek or receive resources from. Additionally, much of the material in the document explains the context of structured data on Commons to unfamiliar audiences (including, but not limited to, Commons’ scope, why linked structured metadata is important for multimedia, especially cultural heritage multimedia, and comparisons to other platforms). We have included here:

  • Our high-level roadmap for developing the project, if we get resources to expedite the work
  • Identified benefits and risks of the project.
  • The expected opportunity for changes in Commons.

If/when we gain the necessary external support for the accelerated project, we will share a fuller version (barring sensitive information) with the community.

Why do we need your comment? How can you help?

The Commons volunteer community has long asked for features that require a database-like structure that would let them describe the media more fully. For example, multilingual categories, which would allow more non-English-speaking volunteers to effectively tag and find images, have been a long-running conversation in our community. (See an early-2004 discussion, a 2008 blog post, the 2009 GLAM-WIKI Recommendations, another in 2010, a question on Quora, the 2015 Community Wishlist, and more recent discussions.)

It is our continued hope that Structured data on Commons provides a foundation for solving those problems in general.

We want to provide an open window of time for members of the Wikimedia communities to provide feedback on this particular project and are seeking feedback on the following:

  • Do you see this expedited roadmap as a worthy undertaking?
  • What roadblocks, risks or challenges do you anticipate with accelerating such a project?
  • Does the current information on the page here accurately represent the role of the communities, especially the Wikimedia Commons and Wikidata communities, in engaging such a software project?
  • How would you like to support this project?
  • If you would like to stay up to date with this proposal and/or Structured data on Commons more generally, we recommend signing up for this Newsletter

We plan to leave initial commenting on this proposal open through November 9.

High level description

We have put together a set of materials that lays out why and how structured data works within the Wikimedia projects, and the effects of lacking an effective structured data infrastructure on Wikimedia Commons. These materials reflect the known issues described at Commons:Structured data. Here we highlight some key elements of those materials, focusing on the actual framework for working with the Wikimedia Community.

Background

All freely licensed photos, audio, and video files on Wikipedia are stored on the Wikimedia Foundation site called Wikimedia Commons. Started in 2004, Commons now has 34 million media files – photos, audio, and video – and it continues to rapidly grow every year: Contributors added about five million new files last year. Its media files are easily discoverable through platforms like Google search.

Thousands of volunteers integrate these media files from Commons into our Wikimedia projects, like Wikipedia, to illustrate our content and share that media with the public. These files are typically 1) personal photography and media uploaded by individuals; 2) freely licensed media files from locations on the Internet like Flickr, YouTube, open access journals, and other repositories; or 3) donations from institutions and organizations with substantial media collections, like UNESCO, NASA, and the British Library.

Wikimedia Commons operates on MediaWiki, the same software that powers Wikipedia. MediaWiki was developed primarily for writing and hosting text like Wikipedia, so Commons’ millions of media files don’t have machine-readable metadata. Instead, each media file is accompanied by free-form, non-structured descriptions that are not consistently machine-readable. This makes it difficult to link the media there to other educational resources on the Wikimedia sites, like Wikipedia or Wikisource. It also makes it difficult for other websites to take advantage of the media – to link to it and connect the media to their own content. Moreover, unstructured data makes it more difficult for Internet users to search for this media – to find the photos, video, and audio that would be ideal to reuse but are essentially unseen, hidden because the media’s details, the words that describe the media, are incomplete and disconnected from the wider Internet.

Structured data – and Wikidata – are changing the way people across the Internet can access information from repositories like Wikimedia Commons. One way to think of structured data: It’s a kind of DNA that explains information in a much more integral way. Structured data provides meaning for a media file from multiple angles. It offers multiple ways to search for that media, and multiple ways to understand that media. Unstructured data only tells part of that media’s history. Unstructured data diminishes the value of information contextualizing the media – and makes it much harder to find the image, audio, or video. For much of Commons’ content, it’s as if the media files don’t exist at all, since they’re so difficult to find on the Internet, and difficult to find on Commons. We’re changing that.

The Structured Data on Commons project will provide the means to transform the information throughout Wikimedia Commons media files from free text into machine-readable data, so that Commons’ millions of media files are much easier to view, translate, find, edit, curate, use, and reuse.

Known benefits

In the proposed work, we highlight the following outcomes:

Benefit 1:

Categories and metadata would be more easily multilingual on Commons. This allows successful completion of the 2015 Community Wishlist request at: 2015 Community Wishlist Survey/Commons#Allow categories in Commons in all languages

Benefit 2:

Developers may expand the project even more. Structured data on Commons provides the fundamental infrastructure required for consistent use of Commons data through APIs and other machine readable endpoints, so that developers both within and from outside the Wikimedia Community can create consistent, reusable, and reliable software that edits, helps with reuse, and allows analysis of Commons media and its associated data.

Benefit 3:

With better Commons search capability, contributors can more effectively illustrate Wikimedia content.

Benefit 4:

A better Commons search allows for better user experience for the broader reader community.

Benefit 5:

Better-structured Commons data allows for easier and simpler partnerships with content providers, especially knowledge collecting/sharing organizations.

We also identify other expected benefits of Wikimedia Commons’ improvement to our broader community, including but not limited to:

  • Ease of GLAM and other knowledge partner participation, especially in communities not supported by organizations like Europeana and DPLA
  • Greater discovery and reuse by external communities beyond Wikimedia Projects

Known risks

Thus far we have documented a number of identified risks:

Risk 1 (High):

The community review and discussion of structured data on Commons could identify large unanticipated technical changes that require extended development time.

Risk 2 (Medium):

Community adoption of features into existing workflows could be obstructed by community politics. However, the Wikidata team has demonstrated effective adoption of Wikidata features both in Wikidata and in the broader use of Wikidata in other projects, including Commons and dozens of language Wikipedias. Applying the lessons learned there about transparent planning and community management to Structured Data on Commons will be a necessary part of its success.

Risk 3 (Medium):

The backlog of tools and software features that break during the infrastructure change could impede effective integration of all of the new features planned for structured data on Commons.

Risk 4 (Low):

Relationships between WMF and/or WMDE and the Commons community change in some fundamental way that makes community members skeptical of the initiative.

Risk 5 (Low):

Complexities in feature development and roadmaps could make it hard to clearly define the boundary between Wikidata work as currently funded by WMF and future work.

Risk 6 (High):

The proposed changes require significant changes to Commons’ underlying schema and software. This could disrupt existing re-users of our content and builders of community and third-party tools. To some degree we need to mitigate this by building migration paths or compatibility layers, and by anticipating a long window of engaging with and supporting changes to that software.

Project structure

There are already commitments by both WMDE and WMF to work on structured data on Commons. Wikimedia Deutschland has already committed to starting some of this work through planned work on Wikidata. Moreover, both the Discovery and Multimedia teams have made commitments related to creating basic infrastructure for using Commons structured data.

The timeline below is based on the current status of this work and on how we would expect an accelerated timeline to look if this grant were awarded to WMF. The project would involve a three-year process, broadly broken into three focuses:

Focus 1 - building the initial infrastructure
Focus 2 - integrating additional features into Commons and Wikimedia projects
Focus 3 - active engagement of the broader community and supporting additional tool and infrastructure development.

Should we not receive additional funding through the current opportunity, this timeline would need to be updated to take into account any progress made at a slower rate between now and the funding opportunity.

Year 1: Infrastructure | Year 2: Integration | Year 3: Engagement
Wikidata: Test and prototype metadata as structured data, integration of structured data interface, review labels and vocabulary | Wikidata: Integration of structured data interface, Wikidata UI concepts
Tools: Exploration & Testing | Tools: Deployment | Tools: Iteration and maintenance
Media Viewer License Display
Search & Query: Explore features | Search & Query: Deployment | Search & Query: Iteration and maintenance
Community: Review concepts & data models | Community: Feature launch feedback | Community: Feedback and curation support
Community: Support migration of tools, especially other community media import tools, like PattyPan | Community: Encourage tool development for engagement, cleanup and contribution
Partners: Review data models | Partners: Promote features and seek new potential partners | Partners: Encourage tool development for engagement, cleanup and contribution

How Commons Content Could Change

With this plan, after two years, the infrastructure will be in place to migrate parts of Commons’ millions of media files to structured data, and after three years that migration should be well under way as we solidify the infrastructure. Already, a sizable number of the images are ready for migration. As noted in the introduction, Magnus Manske, an expert in how Commons and Wikidata content is structured, estimates that the Commons community could instantly migrate between 16 and 19 million files into a Structured Commons format if they had the right infrastructure and a fairly standard piece of software. We're providing the infrastructure with this project, which frees the community to apply the relatively easy software – an example of the symbiosis that happens between Wikimedia and our projects' volunteers.

How soon after this initial migration would the other files be converted? Most of the remaining challenges would be related to solvable considerations around template/data features of existing Commons content. The “X-factor” is the capacity of volunteers and bot tools to migrate the material. Whenever you migrate a large set of material from one data structure (semi-structured, historical Commons data stored in wikitext) to another (structured data on Commons), there is a long tail of data-cleaning where, for a sizable window of time, a large portion of the content stored within the file will have a mix of structured data that’s easy to clean and integrate, and unstructured (or “uncleaned”) data that requires clean-up by a combination of semi-automated tools and human evaluation.

It may take five or even 10 years for the majority of Commons’ media files to be moved over into a Structured Data format. We can’t offer an exact timeframe, since it depends on the input of the Commons community. But we are optimistic. Every month, more than 7,000 Commons contributors make five or more edits on Commons, and every month more than 1,300 Commons contributors make 100 or more edits on Commons. Commons and Wikidata each have about 7,000 active editors. Whether on Commons, Wikipedia, or another Wikimedia project, our sites are populated with passionate contributors. A key is designing software, tools, community support, and incentives for contributors to participate in the Structured Data on Commons project. That’s what we’re doing.

Once the infrastructure is in place, we anticipate that the Commons community will have the capacity to migrate certain data types, including “Categories.” Almost every image on Commons is placed in content-based categories, which are as close as Commons comes to tags. This information is ripe for synchronization with Wikidata identifiers.

More than 1.5 million (of the 4.2 million) of these categories have already been synchronized with Wikidata items, which makes them prime candidates for this migration. Many of these categories also have subcategories, which could be examined systematically with the right tools. Additional data items on almost every file could more readily be migrated:

  • Copyright statements
  • The majority of “creators of works” fields (some of which depend on major technical infrastructure planned for this grant)
  • The majority of upload date fields
  • The majority of creation date fields
  • The identities of uploaders of files (which requires major technical infrastructure changes planned for this grant)
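
To put the quoted figures in proportion, the following is a rough back-of-the-envelope sketch. It uses only numbers stated on this page (34 million files, an estimated 16 to 19 million instantly migratable files, and 1.5 million of 4.2 million categories synchronized). The constant names and the `share` helper are illustrative assumptions, and because the source figures are rounded ("more than", "about"), the results are approximate.

```python
# Figures quoted on this page, in millions. They are rounded in the text,
# so the percentages below are approximations, not exact statistics.
TOTAL_FILES = 34.0
MIGRATABLE_LOW, MIGRATABLE_HIGH = 16.0, 19.0
TOTAL_CATEGORIES = 4.2
SYNCED_CATEGORIES = 1.5


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


category_share = share(SYNCED_CATEGORIES, TOTAL_CATEGORIES)
file_share_low = share(MIGRATABLE_LOW, TOTAL_FILES)
file_share_high = share(MIGRATABLE_HIGH, TOTAL_FILES)

print(f"Categories already synchronized with Wikidata: about {category_share}%")
print(f"Files estimated as ready to migrate: about {file_share_low}% to {file_share_high}%")
```

On these numbers, roughly a third of categories and about half of all files would be candidates for the initial migration.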

Subsections of Commons content that, with higher confidence, can be migrated, in large part, to structured data, include:

  • More than 31 million media files using a version of the Information Template that provides some basic information that can be moved into structured data, some of which will need cleanup for consistency to create uniform data. Included among these templates are several types of data ripe for migration:
      • Of these, more than 24 million language-identified file descriptions (captions), which are encoded correctly for integration into structured data.
      • Of these images, 6.2 million files have location geo-coordinates that could accurately describe the location of the picture and be integrated into structured data.
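
As an illustration of why language-identified descriptions and geo-coordinates are comparatively easy to migrate, here is a minimal sketch of pulling both out of file-page wikitext. The sample wikitext, the regular expressions, and the `extract_structured` function are all hypothetical and simplified for this example. Real Commons pages vary far more (nested templates, alternative coordinate formats), and this is not the project's migration tooling.

```python
import re

# Hypothetical, simplified file-page wikitext in the style of an Information template.
SAMPLE_WIKITEXT = """{{Information
|description={{en|1=A lighthouse at dusk}}{{de|1=Ein Leuchtturm in der Abenddämmerung}}
|date=2016-10-01
|author=Example User
}}
{{Location|54.18|7.89}}"""

# Matches language-tagged descriptions such as {{en|1=...}}.
# Assumption: the description text contains no nested templates.
LANG_DESC = re.compile(r"\{\{([a-z]{2,3}(?:-[a-z]+)?)\|(?:1=)?([^{}]*)\}\}")

# Matches decimal coordinates such as {{Location|54.18|7.89}}.
LOCATION = re.compile(r"\{\{Location\|(-?\d+(?:\.\d+)?)\|(-?\d+(?:\.\d+)?)\}\}")


def extract_structured(wikitext):
    """Return language-keyed captions and (lat, lon) coordinates, if present."""
    captions = {lang: text.strip() for lang, text in LANG_DESC.findall(wikitext)}
    match = LOCATION.search(wikitext)
    coords = (float(match.group(1)), float(match.group(2))) if match else None
    return {"captions": captions, "coordinates": coords}


record = extract_structured(SAMPLE_WIKITEXT)
print(record)
```

Fields that follow a consistent pattern like this can be converted mechanically, while free-form text that does not match any pattern is the part that needs semi-automated tools and human review.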

Other subsections of Commons that have been curated with more robust description metadata could be migrated more quickly by communities. And these subsections would be the most useful parts of the initial migration. They include*:

(*-These numbers do not indicate all files of that genre – just the ones curated with data in the best standard)

The Wikimedia Foundation’s UploadWizard, cross-wiki upload, core upload, and related upload campaign tools maintained by the Foundation, and mass upload tools being developed by the volunteer GLAM-Wiki Community (principally Patty Pan and GLAMPipe) will need adjustments to be compatible with structured data on Commons. These tools are all built around the assumption that media should have structured data and descriptions, so it's a matter of adjusting the data pathways through our API/framework and expanding the fields available in the upload forms.