User:John Cummings/Development


Development

Wikimedia Commons operates on MediaWiki, the same software that powers Wikipedia. MediaWiki was developed primarily for hosting text, as on Wikipedia. As a result, each media file on Commons is typically accompanied by plain-text descriptions (wikitext, templates) and categories. These are usually available in only one language, mostly English, and, most importantly, are not consistently machine-readable.

Structured metadata allows the files to be accessible in a robust, consistent, structured and linked format: a format that allows software to understand, on a large scale, what the metadata fields mean (structured) and to connect them to other databases on the internet, putting them in a broader context (linked). Structured metadata is also more granular and easier to translate than unstructured data.

This switch makes it possible to use Commons' media in new ways, and makes the files on Commons much easier to view, search, edit, curate or organize, use and reuse, in many languages. Development of Structured Data on Commons is tracked on Phabricator.

Structured Data on Commons is powered by the Wikibase software. Data can be accessed via API; see mw:Wikibase/API for documentation.
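As a minimal sketch of what API access can look like, the snippet below only builds a request URL and does not contact the server. It assumes that the structured data of a file lives in a MediaInfo entity whose ID is the letter M followed by the file page's numeric page ID, and that the standard Wikibase `wbgetentities` action is used; the page ID `12345` is a made-up placeholder.

```python
from urllib.parse import urlencode

# Public API endpoint of Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def mediainfo_url(page_id: int) -> str:
    """Build a wbgetentities URL for the MediaInfo entity of a file page.

    Assumption: the MediaInfo entity ID is "M" + the page ID of the file page.
    """
    params = {
        "action": "wbgetentities",
        "ids": f"M{page_id}",
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


# 12345 is a hypothetical page ID used only for illustration.
print(mediainfo_url(12345))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; see mw:Wikibase/API for the authoritative parameter list.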

The extension used for structured data is WikibaseMediaInfo.

Wikimedia Commons, a sister project of Wikipedia, is a collection of more than 60 million free media files. The Structured Data on Wikimedia Commons project converts information about these files to a structured, machine-readable format, making the files easier to view, search, edit, organize and re-use, in many languages.

This is implemented with Wikibase, the same technology that is used for Wikidata.

Wikimedia community members and staff from the Wikimedia Foundation (WMF) and Wikimedia Deutschland (WMDE, home of the Wikidata team) worked on this project from 2017 until the end of 2019. During that period, through the Structured Data on Commons project, Wikimedia Commons gained access to new infrastructure and tools. These help Commons' community of contributors to provide machine-readable, structured data about the media files, in addition to the free text (in wikitext markup) with which the files were described before.

The Structured Data on Commons project was funded by a grant from the Alfred P. Sloan Foundation. Read more about the grant application and the community consultation that preceded it.


FAQ

What kind of structured data should I add? How should I describe a certain file in structured data?

  • For best practices on Depicts statements, see Commons:Depicts.
  • For inspiration and examples of other statements, please check the Modeling pages. Feel free to ask questions there and contribute to the documentation yourself. The data model of Structured Data on Commons is a work in progress and is designed by the Wikimedia Commons community.

How can I find files that already have structured data?

You can use the Wikimedia Commons search function to find files with structured data. A few examples:
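One possible illustration, offered as a sketch rather than as the page's original examples: the search engine on Commons understands a `haswbstatement:` keyword that matches files carrying a given statement. The helper below builds a search URL for files with a Depicts statement (property P180) pointing at a Wikidata item; `Q146` (house cat) is only an example value, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

COMMONS_INDEX = "https://commons.wikimedia.org/w/index.php"


def depicts_search_url(item_id: str) -> str:
    """Return a Commons search URL for files that depict the given Wikidata item."""
    # P180 is the "depicts" property; the keyword syntax is haswbstatement:P180=<item>.
    query = f"haswbstatement:P180={item_id}"
    return f"{COMMONS_INDEX}?{urlencode({'search': query})}"


# Q146 (house cat) is used purely as an example item.
print(depicts_search_url("Q146"))
```

The same query text, `haswbstatement:P180=Q146`, can also be typed directly into the search box on Commons.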

How can I add structured data to many files at once?

You can use the tools mentioned on this page; see above.

Other questions?

See also: Commons:International Image Interoperability Framework.

When was this developed?

Development of Structured Data on Commons took place between 2017 and the end of 2019.

Throughout the three-year project, the team encouraged and supported volunteers and partners who wanted to build tools that allowed active and diverse editing, maintenance, conversion and cleanup of the files on Commons.

It was expected that, after 3 years, 5 million media files on Wikimedia Commons would contain some structured metadata – probably more. Currently, on account of community processes, we are in the 'long tail' period of several additional years in which all files on Commons are being slowly described with some structured data.

Who is working on this?

Structured Commons is a collaboration between developers, the communities of Wikimedia Commons, Wikidata, Wikipedia, and sister projects, and partners and allies of the Wikimedia movement.

The developer team consists of staff from both the Wikimedia Foundation and Wikimedia Deutschland. Community developers (tool developers, bot operators, developers at partner organizations) can also play a large role in this project. All the developed features are conceptualized, created, tested and improved in close collaboration with the community of active contributors to Commons and Wikidata, as well as Wikipedia and sister projects. We also warmly welcome active feedback from cultural institutions (GLAMs – Galleries, Libraries, Archives and Museums).


About Wikimedia Commons

Wikimedia Commons is Wikimedia's free media repository. As of summer 2020, it contains more than 60 million media files, which are typically:
  1. personal photography and media uploaded by individuals;
  2. freely licensed media files uploaded to Commons from locations on the internet like Flickr, YouTube, open access journals, and other repositories;
  3. uploads from institutions and organizations with substantial media collections, like UNESCO, NASA, and the British Library.

Information about files on Wikimedia Commons was enhanced with structured data, powered by Wikidata.

Wikidata, the structured data repository in the Wikimedia ecosystem, debuted in 2012. Wikidata is overseen by a team at Wikimedia Deutschland (WMDE) and serves as a foundation for Structured Data on Commons.

Wikidata's software, Wikibase, does not store articles in wikitext; instead, it stores concepts (called 'items'). These items contain many 'statements' that describe, for instance, the item's title, its relations to other items, and dates.
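The item-and-statement model can be pictured with a deliberately simplified data structure. This is a sketch, not Wikibase's actual storage format: the class and method names are invented for illustration, and real statements also carry qualifiers, references and typed values.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One claim about an item: a property ID and a value (simplified to a string)."""
    property_id: str
    value: str


@dataclass
class Item:
    """A concept with multilingual labels and a list of statements."""
    item_id: str
    labels: dict[str, str] = field(default_factory=dict)
    statements: list[Statement] = field(default_factory=list)

    def values_for(self, property_id: str) -> list[str]:
        """Return all values stated for the given property."""
        return [s.value for s in self.statements if s.property_id == property_id]


# Example: Q42 (Douglas Adams) is an instance of (P31) human (Q5).
adams = Item(
    item_id="Q42",
    labels={"en": "Douglas Adams"},
    statements=[Statement("P31", "Q5")],
)
print(adams.values_for("P31"))
```

Because values point at other item IDs rather than at free text, software can follow these links without parsing prose.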

The structured data on Wikidata is freely re-usable across Wikimedia sites and by third parties. Additionally, computers can easily process and understand it. Because of this flexibility, Wikidata is increasingly used in STEM[1] fields, but also in cultural heritage and the humanities. Moreover, because of its support of broad translation, data entered in Wikidata in one language can immediately be made available in many other languages as well. The integration of structured data in Wikimedia Commons happens through the integration of Wikibase, and metadata from Wikidata, into file descriptions on Wikimedia Commons.

Models??


Schematic of the main user groups of Structured Commons

Roughly, nine groups or types of people will use Structured Commons. Each type has distinct needs and workflows.

The needs of these users are outlined in user stories.

Outline of user stories, version September 29, 2017

Viewers: They mainly visit Commons to find particular free files. They most likely never edit. Within this type are Remixers (people who will download, edit, and remix the images in their own new creations) and Embedders (bloggers, reporters, etc. who embed unaltered work on other pages).

Casual Uploaders: They actively (on average at least once every month) upload one image at a time (which may or may not be their own). These are amateurs who probably take most photos with their phones.

Batch Uploaders: They upload 20 or more images at a time using a batch upload tool. They are typically someone associated with a GLAM project, but not always.

Wikimedia enhancers: They are users on various wiki projects who search for images to use on those projects (Wikipedia, Wikivoyage, etc.).

Photographers: They are pros or semi-pros who actively upload their own images to Commons. They are generally using DSLR/mirrorless cameras but may use phones in a pinch.

Editors: They actively edit media information on Commons for the sake of accuracy, completeness, or maintaining site quality.

Curators: They actively categorize, group, and label images to make things organized and easy to find. They may also be involved in picking featured, quality, and valued images.

Tool Builders: They are volunteer developers who write and release software to supplement Commons functionality or fill in functional gaps.

Admins: They are users who have special abilities to enforce the rules and primarily act as site enforcers of copyright policies and social norms. They may or may not actively donate media.



Historically, information about media files on Wikimedia sites has been stored in wikitext. The lack of a database structure with structured data makes Wikimedia Commons difficult to search, confusing to users, and impractical for new feature development. Today, many media repositories use highly structured and machine-readable data, including many media databases in the cultural sector such as Europeana and RKDimages. Meanwhile, Wikimedia Commons relies on a patchwork of plain-text data embedded in thousands of overlapping templates and a set of categories, mostly in English only, that are often incompatible with other sites or tools.

For a long time, many members of the Wikimedia Commons community have asked for the implementation of features that require a database-like structure, that let them describe the media on Commons more fully.

For instance, the community has been discussing multilingual categories on Commons for a long time. Without multilingual categories, it is difficult, if not impossible, for non-English-speaking volunteers and end users to tag and find media on Commons.

Some earlier discussions and thoughts about structured data on Wikimedia Commons and on multilingual categories can be found at

2012-2014

In 2012, Wikidata - Wikimedia's free knowledge base and a sister project of Wikipedia - was founded and built upon Wikibase software, which stores versioned structured data in a central repository. Wikibase, supported by Wikimedia Deutschland (WMDE), offers a practical way to maintain structured data in MediaWiki, the software that powers all Wikimedia projects. It is widely considered to be a useful tool to support the growth of the free knowledge movement. Since the inception of Wikidata and Wikibase, many Wikimedia community members have proposed to use this mechanism to store and retrieve media metadata on Wikimedia Commons.

In 2013, the Wikimedia Foundation's multimedia team hosted a number of roundtable discussions with community members, asking what it should focus on in coming years. In each roundtable, the top request from participants was to implement structured data on Commons, even if that topic was not on the agenda to begin with. Some community members pointed out the difficulty of searching on Commons, others pointed at the lack of multilingual categories. Many suggested that categories could be complemented with more granular topics that could be linked to Wikidata's multilingual knowledge base.

In 2014 the Wikimedia Foundation started to explore the concept of structured data to address these concerns. The team identified most of the core architectural features that were needed to improve Wikimedia Commons, and discovered that these features would fit well in the roadmap of Wikibase (described above). At this point, the 2014 project was slowed and delayed until Wikibase offered more robust infrastructure. In 2016 this point was reached with an initial demonstration of how structured data on Commons could work: a first demo of so-called mediainfo entities, a new entity type.

Developments from 2016

In October 2016, WMF and WMDE announced a funding agreement that would provide multi-year support for Wikidata, including backend support for integrating Wikidata into Wikimedia Commons. This funding agreement was supplemented in late 2016 by a $3 million external grant from the Alfred P. Sloan Foundation, which made it possible to develop structured data functionality in Wikimedia Commons in an accelerated three-year period (2017-2019).


Why is this important? Why do we put effort and invest resources in this project?

Wikimedia Commons is an immensely valuable place for advancing human knowledge: it is one of the most comprehensive sites on the internet that serves as a repository for entirely free media. Organizing this knowledge better, making it easier to find, and making it fully multilingual, vastly increases its value to the rest of the world. Structured Data on Commons has many concrete benefits for Wikimedia editors, for the Wikimedia movement at large, for organizations with which we build partnerships, and for people and organizations that wish to reuse content from Commons across the web.

Who pays for the work on this project?

In October 2016, the Wikimedia Foundation and Wikimedia Deutschland announced a funding agreement that would provide multi-year support for Wikidata, including backend support for integrating Wikidata into Wikimedia Commons. This funding agreement was supplemented in late 2016 by a $3 million external grant from the Alfred P. Sloan Foundation, which made it possible to develop structured data functionality in Wikimedia Commons in an accelerated three-year period (2017-2019).

Read more about the grant application and the community consultation that preceded it.

What is this project's timeline?

Development on this project, financed by the Alfred P. Sloan Foundation, took place between 2017 and the end of 2019. It built upon many earlier discussions that had taken place since 2004 (see the project history). A timeline of development on Structured Commons is available on the development page.

Will Structured Data also be made available for other Wikimedia projects (for instance on Wikisource, for fair use images on English Wikipedia...)?

The team focuses on Wikimedia Commons first. This is also the project for which we received funding from the Alfred P. Sloan Foundation.

However, we hope to extend it to other projects at some point. The software development for implementing structured data on Commons is based on Wikibase (used for wikidata.org) and is being written with compatibility with other Wikimedia projects in mind. Information from Wikidata is already being used for sitelinks in Wikipedia and for some templates. Structured file description information from Commons should, similarly, be available where files from Commons are used on Wikimedia projects and elsewhere. Additional support for Wikimedia projects is out of scope for the specific work done in the Structured Data on Wikimedia Commons project, but we have it on the list of longer-term opportunities.

About metadata and structured data on Commons

How long will it take until Wikimedia Commons is entirely converted to Structured Data?

Quite a bit of time! By the end of 2019, not all media files on Wikimedia Commons will be complemented with structured data yet. The team will work with the Commons community so that the community can add structured data to approximately 5 million files by that point. The further conversion process will likely take more years.

Some numbers to give an idea of the low-hanging fruit

In August 2017, Wikimedia Commons contained 42.5 million files.

Magnus Manske has created an online tool, Commonsedge, which roughly indicates whether the metadata templates of a Commons file can easily be transferred to machine-readable data. According to results from this tool, around 60% of Commons files are ready for conversion, which usually means that they are described with templates that are technically convertible to structured data.

Converting files to structured data will probably be most straightforward for the following types of files.

Newly uploaded files

The Structured Data on Commons project will work to allow newly uploaded files to be enhanced with structured data, by working on/with the most frequently used upload tools so that they support structured data as well: the UploadWizard, cross-wiki upload, upload campaign tools (e.g. for Wiki Loves Monuments), and mass upload tools developed by the volunteer community (Pattypan, GLAMpipe, and others).

Isn't information on Wikimedia Commons structured already?

(answer under construction; examples/screenshots to be added)

Even without Structured Data, many files on Wikimedia Commons are indeed already described with some structure: categories and templates (various information templates, license templates, source/creator/institution templates, and more).

However, this is not fully structured data that can be consistently read and understood by machines.

Structured metadata contains a logical structure that is uniform, and explicitly expressed. This allows data—in this case media files on Wikimedia Commons—to be accessible in a robust, consistent, structured and linked format: a format that allows software to understand, on a large scale, what the metadata fields mean (structured) and to connect them to other databases on the internet, putting them in a broader context (linked). Structured metadata is also more granular and easier to translate than unstructured data.

Will Commons still have templates? Will templates on Commons disappear?

It will still be possible to work with templates on Wikimedia Commons. The Structured Data on Commons team will not replace wikitext—including templates—on the software/technical side. It is likely that some information in current templates (such as creator and institution info) will easily be converted to structured data from an early stage. When this (structured) information can easily and multilingually be searched via the improved search functionality of Structured Commons, the Wikimedia Commons community will likely need to have a discussion whether it's worthwhile to also keep the same information in a wikitext template.

Will Commons still have categories? Will Commons categories disappear?

The Structured Data on Commons team will NOT remove the ability to create and edit categories. We keep all existing systems in place and it will still be possible to work with categories. What we do want to do is build a system that serves many of the use cases for categories in a different (and profoundly multilingual) way. Structured metadata on Commons will update search on Wikimedia Commons so that it will also be possible to search multilingually across different criteria.

For instance, with structured data and its new search functionality, users will be able to search for a painting (Wikidata/Gemälde/cuadro/peinture/絵画作品/...) that depicts a pheasant (Wikidata/Fasan/faisán/faisan/コウライキジ/...) in any language, leading to the same search results. If someone uploads an additional painting of a pheasant and describes it properly with structured data, that painting will then also immediately show up in those same search results, even if the uploader is not aware of the existence of categories and has not added any.
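The idea that queries in different languages lead to the same results can be shown with a toy model. Everything here is hypothetical: the concept IDs `Q_PAINTING` and `Q_PHEASANT` are placeholders rather than real Wikidata IDs, the file names are invented, and the real search engine works very differently. The point is only that labels in any language resolve to one language-independent concept before matching.

```python
# Placeholder concept IDs with labels in several languages (labels taken from the text above).
LABELS = {
    "Q_PAINTING": {"en": "painting", "de": "Gemälde", "es": "cuadro", "fr": "peinture"},
    "Q_PHEASANT": {"en": "pheasant", "de": "Fasan", "es": "faisán", "fr": "faisan"},
}

# Invented files, each described by a set of concept IDs instead of free text.
FILES = {
    "File:Pheasant painting A.jpg": {"Q_PAINTING", "Q_PHEASANT"},
    "File:Pheasant photo.jpg": {"Q_PHEASANT"},
}


def concept_for(term):
    """Resolve a search term in any known language to its concept ID, or None."""
    needle = term.casefold()
    for concept_id, labels in LABELS.items():
        if any(label.casefold() == needle for label in labels.values()):
            return concept_id
    return None


def search(*terms):
    """Return the files described with every concept the terms refer to."""
    wanted = {concept_for(term) for term in terms}
    if None in wanted:
        return []  # at least one term is unknown
    return sorted(name for name, concepts in FILES.items() if wanted <= concepts)


print(search("painting", "pheasant"))
print(search("Gemälde", "Fasan"))
```

Adding a new file with the same two concepts would make it appear in both result lists at once, without anyone assigning it to a category.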

How can the information in Commons categories be used for structured data on Commons?

Wikimedia Commons contains roughly 6,066,000 categories. (source 1) (source 2) (checked on November 9, 2017)

Until now, without fully structured data, categories have been the best instrument to 'tag' media files on Commons and to organize them. The Commons category system is multihierarchical: it is a tree-like structure in which each node (branch) can have multiple parents and multiple children.
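A multihierarchical system of this kind is a graph in which a category may have several parents. The sketch below walks upward from one category and collects every ancestor. The parent relationships are invented for illustration and are not taken from the real Commons category tree.

```python
# Hypothetical category graph: each category maps to its parent categories.
PARENTS = {
    "Bridges in India": ["Bridges by country", "Buildings in India"],
    "Bridges by country": ["Bridges"],
    "Buildings in India": ["India"],
    "Bridges": [],
    "India": [],
}


def ancestors(category, parents=PARENTS):
    """Return every category reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against revisiting shared ancestors or cycles
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("Bridges in India")))
```

Because one category sits under both a topic branch and a location branch, a single path from a root never describes it fully, which is part of why the system is hard to interpret automatically.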

A lot of (often detailed) information is stored in Commons categories.

Commons categories are extremely varied in topic and purpose:

  • they range from single-topic 'tags' (example: George Washington) to highly complex intersection categories (example)
  • they are used in a variety of different 'relationships' to a media file, e.g. to simply describe what is depicted in a media file, or to indicate creators or sources of media files;
  • they sometimes contain copyright-related information (example: CC-BY-SA-2.0), or information about the creation or upload process behind the media file (example: Uploaded with Mobile/Web)
  • and they are sometimes purely used for structural and maintenance purposes (example: Mérimée with PA parameter).

This enormous variation and complexity makes categories problematic to treat as structured data: in APIs, it is impossible to distinguish between these different uses and to make that diversity reliably machine-readable. It is therefore highly recommended that as much as possible of the information currently contained in categories also be translated into more refined and semantically correct structured data.

The Structured Commons team helps the Wikimedia community to work towards this transition by, among other things, assisting in the creation and maintenance of volunteer-driven conversion tools.

Will 'old' ('unstructured') information on Commons be removed?

The Structured Data on Commons team designs the project in such a way that no information is taken away from Wikimedia Commons. The team only adds features that might slowly supplement and perhaps replace existing ones. When to add and replace data is the Commons community's own decision. In the team’s current roadmap:

  • Categories are not touched. Certain features of Structured Data on Commons might, at some point, overlap with categories (most notably combined categories such as "Bridges in India" or "1988 in Lima"). It is up to the Commons community to decide whether structured data, at some point, will make these categories obsolete.
  • File pages will continue to hold wikitext. Structured data is inserted as an addition to this.
  • The {{Information}} template (with its many variations, such as {{Artwork}}, {{Photograph}} and many more) is not touched. Many (if not all) template parameters could be migrated, but deciding how and when is up to the Commons community.
  • The edit history of the file description page is not touched.
  • How the binary files are stored is not affected by this project.
  • The upload history is not touched.

How will usernames be stored in Structured Data?

Media files on Wikimedia Commons will be described with concepts from Wikidata (e.g. people, institutions, species, places). But not every topic on Wikimedia Commons is notable enough to merit its own Wikidata item, and a lot of metadata will be described as strings (plain text), not Wikidata items. This probably applies to most usernames on Wikimedia Commons.

How will dates be stored in Structured Data?

Media files on Wikimedia Commons will be described with concepts from Wikidata (e.g. people, institutions, species, places). Dates of media on Commons will be described with date-related properties from Wikidata, and with the time data type. See the "Time" section at d:Special:ListDatatypes.
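To make the time data type more concrete, here is a sketch of how a calendar date could be expressed as a Wikibase time value. It assumes the JSON layout used by Wikibase (a `time` string with a leading sign, a numeric `precision` where 11 means day-level precision, and a `calendarmodel` URI pointing at the proleptic Gregorian calendar item); the helper name is hypothetical, and d:Special:ListDatatypes remains the authoritative description.

```python
from datetime import date

# Wikidata item for the proleptic Gregorian calendar.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"
# Precision code for "day" in the Wikibase time data type.
DAY_PRECISION = 11


def wikibase_time(d: date) -> dict:
    """Express a calendar date as a Wikibase-style time value with day precision."""
    return {
        "time": f"+{d.year:04d}-{d.month:02d}-{d.day:02d}T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": DAY_PRECISION,
        "calendarmodel": GREGORIAN,
    }


print(wikibase_time(date(2017, 9, 29)))
```

Storing the precision next to the timestamp lets a value state that only the year or month is known, which free-text dates cannot express consistently.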


Partners and allies

Is this project interesting for external parties like researchers, tech companies, developers...?

Yes. As soon as a significant number of media files on Wikimedia Commons are described via structured data, a large corpus or knowledge base of media files will become freely available (on Commons and via an API) for building applications, for re-use and research purposes.

Why do cultural institutions (GLAMs - Galleries, Libraries, Archives and Museums) and heritage organizations play a large role in this project?

The vision statement of the Wikimedia movement - Imagine a world in which every single human being can freely share in the sum of all knowledge - is closely aligned with the mission of public knowledge and cultural institutions. And indeed, cultural organisations (GLAMs) have already donated millions of media files to Wikimedia Commons in the past, and continue to do so. The Wikimedia movement’s wikis are major portals for educational and heritage material, and Wikipedia is very often a first stop for the public when someone is learning and researching. Offering media on Wikimedia Commons stimulates re-use of cultural and heritage collections on Wikipedia and beyond.

Without structured data, Wikimedia Commons does not have the refined APIs and Linked Open Data technology that meet the needs of cultural institutions, which regularly share their collections through structured, open end-points for both reuse and aggregation into hubs like DPLA and Europeana. Even when GLAM organizations, STEM organizations, and other sharers of educational and heritage material do not have the capacity to host content or provide utilities for reuse on their own digital platforms, they do not consistently choose Wikimedia Commons as a site to upload their content. Instead, these organizations choose platforms like Flickr or commercial-vendor-controlled digital platforms. In part, this is because Commons does not provide the kinds of robust data structures and APIs needed for monitoring changes to the data, so that the institutions can benefit from the Wikimedia community’s improvements to that data.

Structured Commons expands Commons with features that are central to cultural heritage and media sharing software, but does so with the focus on openness and collaboration that makes our projects, like Wikidata, widely useful for GLAM communities. By closely consulting cultural heritage organizations, especially those that have been strong partners with us in the past or want to partner with us more, we can make sure that Structured Commons becomes a better platform for encouraging partner collaboration in the future, one that grows our impact on public knowledge.

For more information about how GLAMs work with Commons and digitized content, see the GLAM portal on outreach.wikimedia.org.

About Wikidata

Why is Wikidata used in this project?

Wikidata is the free and open, multilingual knowledge base of the Wikimedia projects, that can be edited, read and re-used by humans and machines. Its data is available under a CC0 license. Wikidata describes, structures and interrelates all the concepts about which there is a Wikipedia article (in any language), and has much more data than that! Wikidata acts as central storage for structured data that can be re-used and improved across Wikimedia projects and beyond.

Wikidata's software, Wikibase, is in fact a set of two extensions to MediaWiki, the software that powers all Wikimedia projects. In the Structured Data on Commons project, Wikibase—with data from Wikidata—is integrated in file pages on Wikimedia Commons.

Where can I learn more about Wikidata?

The best introduction to Wikidata, for absolute beginners, is probably Asaf Bartov's three-hour (!) presentation. It is worth your time!

Short videos (YouTube links)

Text-based introductions include:

If you want to have good examples of the structure of data, check one of the showcase items (10min+) https://www.wikidata.org/wiki/Wikidata:Showcase_items

The Wikidata Query Service offers one of the most powerful ways to search and re-use Wikidata's data.
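As a small sketch of how the Query Service is typically used, the snippet below composes a SPARQL query for items that are an instance of (P31) painting (Q3305213) and turns it into a request URL for the public endpoint. It only builds strings and sends nothing; the function names are hypothetical, and the query is kept intentionally tiny.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def paintings_query(limit: int = 5) -> str:
    """Return a SPARQL query listing a few items that are instances of 'painting'."""
    return (
        "SELECT ?item ?itemLabel WHERE { "
        "?item wdt:P31 wd:Q3305213 . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } '
        f"}} LIMIT {limit}"
    )


def query_url(sparql: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"


print(paintings_query(3))
print(query_url(paintings_query(3)))
```

The query text can also be pasted into the Query Service's web interface, which is usually the easier way to experiment.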

A few examples of tools that reuse Wikidata data:


Tools, gadgets, bots and workflows

I am a tool / bot developer. What consequences does Structured Commons have for me?

The Structured Data on Commons team designs the project in such a way that no information is taken away from Wikimedia Commons. The team only adds features that might slowly supplement and replace existing ones. This addition and replacement is the Commons community's own decision.

This means that, in a technical sense, existing workflows, bots and tools on Wikimedia Commons should still work with already uploaded files. However, the underlying API of Wikimedia Commons might change considerably, and an increasing number of files on Commons will be enriched with structured data. As soon as Structured Commons is rolled out, many volunteers and partners might start uploading media files to Commons that are only described with structured data, not with wikitext. So, increasingly, it will make sense to create new tools, update current ones, or both.

Many volunteer developers are already aware of these changes, are taking updates to their tools into account, and keep an eye on developments. If you are a tool developer or bot operator who is not yet informed and is curious how the changes to Commons might affect you, don't hesitate to ask: help and support are available! The tool and bot page is specifically designed for you; feel free to ask any questions there and to request assistance. Besides volunteer developers from the Wikimedia community, Structured Commons' community liaison Sandra can also support you.

Getting involved and staying up to date

I am a Wikimedia community member. How can I help and be involved in this project?

Check the Get involved page! There are many ways to contribute - by providing feedback, helping others, translating content...

I represent a cultural / educational / research institution. I would like to get involved in Structured Data on Commons. How can I get in touch?

The Get involved page includes information on how to engage with this project as a representative from a cultural or knowledge institution. You can also get directly in touch with Sandra Fauconnier, community liaison and GLAM contact for Structured Data on Commons: sfauconnier@wikimedia.org

How can I stay up to date?

  1. Science, technology, engineering, and mathematics.