Commons:Guide to batch uploading

A map of the common steps in a batch upload to Wikimedia Commons

Batch uploading or data ingestion is uploading multiple files in an automated manner. This guide aims to explain how to do this. See also Commons:Batch uploading for more information. For information on forming a relationship with a content partner in order to obtain content to upload see Commons:Guide to content partnerships. For help evaluating the workflow of batch uploads, see the learning pattern on Meta.

Before you start: licensing and permissions

Before you even think about batch uploading a set of files, be 100% sure that all the files are free and fall within the project scope. Make sure you know who the artists are and when they died (or at least which century they lived in), and use creator templates so that others can easily verify the year of death. If you need to arrange permission, please do so before starting the batch upload. In most cases each batch upload will use a single license for the whole batch. Which license is appropriate depends on a variety of factors, especially whether the images are of 2D objects or 3D objects.

2D works

2D works include paintings, drawings, photographs, and documents. If you intend to use one of the licenses below, make sure the images are actually 2D: crop the frames of paintings if present, and do not upload photographs of rooms with murals or frescoes if they show architectural features, unless you provide a separate license for the photographer. Some appropriate licenses for 2D works include:

  • {{PD-Art}} - use for images of 2D artworks by artists who died more than 70 years ago.
  • {{Licensed-PD-Art}} - use for images of 2D artworks by artists who died more than 70 years ago when the photographer of the work has explicitly released their photographs under a free license.
  • {{PD-scan}} - use for scans or photocopies of 2D works by authors who died more than 70 years ago.

All of these templates can be passed a sub-license as a parameter. For example, if you are uploading images of paintings and you know that all of the artists died at least 100 years ago, you could use {{PD-Art|PD-old-100}}.
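
As a rough sketch of how a script might choose the sub-license, the function below compares the artist's year of death with the year of upload. The function name is hypothetical, {{PD-old-70}} is assumed here as the sub-license for the 70-year case, and the strict year comparison is a deliberately conservative assumption. It does not replace a human check of the applicable copyright rules.

```python
from typing import Optional


def pd_art_tag(death_year: int, upload_year: int) -> Optional[str]:
    """Return a {{PD-Art}} tag with a sub-license, or None if the work may still be protected."""
    years_since_death = upload_year - death_year
    # Strict comparisons are an assumption: they err on the side of not tagging.
    if years_since_death > 100:
        return "{{PD-Art|PD-old-100}}"
    if years_since_death > 70:
        return "{{PD-Art|PD-old-70}}"
    return None


print(pd_art_tag(1890, 2024))  # {{PD-Art|PD-old-100}}
print(pd_art_tag(1940, 2024))  # {{PD-Art|PD-old-70}}
print(pd_art_tag(1980, 2024))  # None
```

Files for which the function returns None should be left out of the batch until their status has been checked by hand.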

3D works

3D works include sculptures, buildings, paintings with artistic frames, coins, and some textiles. In most cases, two copyrights will be involved in these images: a copyright on the original work and a copyright on the photograph itself. (See Commons:Freedom of panorama for exceptions.) When you upload the images, be sure to specify the licensing for both the photograph and the work depicted in the photograph. See, for example, this photo of a 3D artwork from the Walters Art Museum. 3D works generally don't require specialized licensing tags. You can usually use standard PD-Old tags for the works and Creative Commons tags for the photographs.

Prerequisites

Get the files

Before you upload anything, you need to have the files. You can either store the files locally or have URLs pointing to the exact location of the source files. Each URL should deep link directly to the jpg/ogg/... file. If all or some of the images need to be altered, for example to crop the frames of paintings or remove watermarks, it is easier to download all the images first and store them locally.
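
A minimal sketch of the download step, assuming each deep link ends in the file name. The function names, the example URL and the "downloads" directory are hypothetical; only the Python standard library is used.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def local_path_for(url: str, directory: str) -> Path:
    """Derive a local file path from the last path segment of a deep-link URL."""
    name = unquote(PurePosixPath(urlparse(url).path).name)
    if not name:
        raise ValueError(f"URL has no path segment to use as a file name: {url}")
    return Path(directory) / name


def download(url: str, directory: str) -> Path:
    """Download one file into the directory and return where it was stored."""
    target = local_path_for(url, directory)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Hypothetical URL, used only to show the path handling (no network access).
print(local_path_for("https://example.org/images/A%20123.jpg", "downloads").name)  # A 123.jpg
```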

Get the metadata

Get the metadata so that you have enough information to construct the filename, description and categories later. For artworks, for example, you might need authors, titles, techniques, dates, institutions holding the artwork, etc. Metadata might be:

  • provided by the websites, see for example here
  • provided by the GLAM institutions collaborating in the upload process (for example, if a museum is using TMS, they could export their data as an SQL file)
  • scraped from the website: page scraping
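
Whatever the source, it helps to bring the metadata into one record per file before building filenames and descriptions. The sketch below assumes the metadata has been exported as CSV; the column names and sample rows are invented for illustration and are not taken from any real export.

```python
import csv
import io

# Hypothetical export; real column names depend on the institution's system.
SAMPLE_EXPORT = """identifier,title,creator,date,institution
A-001,Portrait of a woman,Unknown artist,1905,Example Museum
A-002,Harbour at dusk,Unknown artist,1910,Example Museum
"""


def read_records(csv_text: str) -> list:
    """Parse CSV text into a list of dicts with whitespace-trimmed values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value or "").strip() for key, value in row.items()}
        for row in reader
    ]


def missing_fields(record: dict, required=("identifier", "title", "date")) -> list:
    """List the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


records = read_records(SAMPLE_EXPORT)
print(len(records), records[0]["title"])  # 2 Portrait of a woman
```

Running missing_fields over every record before the upload is a cheap way to spot files that cannot yet be named or described.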

Rate limits

Normal users on Commons are rate limited to 380 uploads per 72 minutes. Users granted image-reviewer, patroller, or autopatrolled status have a rate limit of 999 uploads per second. Users can apply for these user rights at COM:RFR. If you are an accountcreator, bot, bureaucrat, sysop or translationadmin, there are no rate limits.

To test your rate limits, click here, then go to query > userinfo > ratelimits > upload.
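
The same check can be scripted. The sketch below builds a MediaWiki API request for meta=userinfo with uiprop=ratelimits, follows the path described above in a decoded response, and turns a limit into a pause between uploads. The sample response is illustrative: the "user" key and the "hits"/"seconds" fields are assumptions about the response shape, and the request must be sent from a logged-in session to reflect your own account.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def ratelimit_query_url() -> str:
    """Build the API URL that asks for the current user's rate limits."""
    params = {"action": "query", "meta": "userinfo", "uiprop": "ratelimits", "format": "json"}
    return API_URL + "?" + urlencode(params)


def upload_limits(response: dict) -> dict:
    """Follow query > userinfo > ratelimits > upload in a decoded API response."""
    return response.get("query", {}).get("userinfo", {}).get("ratelimits", {}).get("upload", {})


def seconds_between_uploads(hits: int, seconds: int) -> float:
    """Average pause that keeps a steady upload rate within the limit."""
    return seconds / hits


# Illustrative response; the values mirror the limit quoted for normal users
# (380 uploads per 72 minutes, i.e. 4320 seconds).
sample = {"query": {"userinfo": {"ratelimits": {"upload": {"user": {"hits": 380, "seconds": 4320}}}}}}
limit = upload_limits(sample)["user"]
print(round(seconds_between_uploads(limit["hits"], limit["seconds"]), 2))  # 11.37
```

For a normal user this works out to roughly one upload every 11 to 12 seconds.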

Tools

A handout that describes the main strengths and applications for tools available for GLAM-Wiki documentation for uploading to commons as of March 2017.

Historically, uploads were done using bots programmed by Wikimedians. More user-friendly tools now exist that simplify part or all of the upload process.

  • Pattypan and ComeOn! are desktop applications optimised for GLAM uses, which use a spreadsheet to upload files from your hard drive.
  • OpenRefine, a flexible data wrangling tool, is also suitable for GLAM uploads. It takes a large number of data formats (including XML and data from APIs). OpenRefine supports editing and uploading structured data.
  • (Not working since Nov. 2021) Commonist works with older systems and non-Commons MediaWiki instances.
  • (Deprecated as of 2020) The GLAMwiki Toolset (also known as the GWToolset), an on-wiki tool that allows you to batch upload files from a structured data source, such as an XML file.
  • Offroader is an Android app that transcodes mov and mp4 to webm, splits video files larger than 4 GiB, checks the SHA-1 before starting an upload, can do revision uploads, continues uploads after an interruption (CSRF token, logout, changed filekey, internet failure, device crash), checks file integrity after publication, and can create a log with error messages for developers.

See Commons:Upload tools for other choices.

Tools which can alter already existing files en masse are useful for adding details to files if something was forgotten during upload.

  • Cat-a-lot can alter the categories of large amounts of files in seconds

Process

Open a subpage at Commons:Batch uploading to discuss the upload. On this page you can describe what you're uploading, get feedback and document progress.

Check for duplicates

Before bothering to construct all the information, check whether the file already exists on Commons.

  1. Calculate the SHA-1 hash of the file. (In PHP, you can use the sha1_file() function.)
  2. Ask the API whether a file with that hash exists.
  3. Skip the file if a file with the same hash already exists, or verify that the current description is in the correct format and add metadata information if needed.
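
The steps above can be sketched in Python as follows. The hash is computed with the standard hashlib module, and the lookup uses the MediaWiki API's list=allimages query with the aisha1 parameter. The function names are hypothetical, and the code only builds the request URL and reads a decoded response; sending the request is left to whichever HTTP client or bot framework you use.

```python
import hashlib
import io
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def sha1_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def duplicate_query_url(sha1_hex: str) -> str:
    """Build the API URL that lists files with the given SHA-1 hash."""
    params = {"action": "query", "list": "allimages", "aisha1": sha1_hex, "format": "json"}
    return API_URL + "?" + urlencode(params)


def existing_titles(response: dict) -> list:
    """Return the titles of files reported in a decoded allimages response."""
    return [image["title"] for image in response.get("query", {}).get("allimages", [])]


# Demonstration on in-memory bytes; for a real file use open(path, "rb").
file_hash = sha1_of_stream(io.BytesIO(b"abc"))
print(file_hash)  # a9993e364706816aba3e25717850c26c9cd0d89d
print(duplicate_query_url(file_hash))
```

An empty list from existing_titles means no file with that hash was reported, so the upload can go ahead.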

Naming

See also Commons:File naming

Define a file naming convention that makes sense, while making sure filenames are unique and not already used at Commons. You might want to include:

  • Title and/or brief description (up front, so that when names are clipped we can still figure out what the file is about)
  • Year
  • Source/Institution name
  • Accession number/Record identifier - Adding the unique identifier used by the source institution allows easier linking and increases the probability that the filename is unique.
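
One possible convention built from the elements listed above, as a sketch only. The layout "Title (Year) - Institution - Identifier.ext", the set of replaced characters and the 240-byte length cap are assumptions made for this example, so check them against Commons:File naming before relying on them. The sample values are invented.

```python
import re

# Characters assumed to be unsafe in file titles for this sketch.
UNSAFE = re.compile(r"[#<>\[\]|{}:/\\]")
MAX_BYTES = 240  # assumed upper bound on the encoded filename length


def clean(text: str) -> str:
    """Replace unsafe characters with '-' and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", UNSAFE.sub("-", text)).strip()


def build_filename(title: str, year: str, institution: str, identifier: str, extension: str) -> str:
    """Put the title first and shorten only the title if the name is too long."""
    title = clean(title)
    suffix = f" ({clean(year)}) - {clean(institution)} - {clean(identifier)}.{extension}"
    while title and len((title + suffix).encode("utf-8")) > MAX_BYTES:
        title = title[:-1].rstrip()
    return title + suffix


print(build_filename("Portrait of a woman", "1905", "Example Museum", "A/123", "jpg"))
# Portrait of a woman (1905) - Example Museum - A-123.jpg
```

Keeping the identifier at the end and trimming only the title means that even clipped names stay unique.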

Descriptions and templates

Descriptions should be derived from the metadata, especially if extensive metadata is provided by a partner institution. To store the information successfully on Commons, the metadata needs to be converted into wikitext format.
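
As an illustration of that conversion, the sketch below fills the {{Information}} template from one metadata record. The record keys and sample values are hypothetical, the values are assumed to be plain text, and a batch of artworks would more likely call for a more specific template agreed during review.

```python
def escape_value(value) -> str:
    """Replace pipes, which would otherwise be read as parameter separators."""
    return str(value).replace("|", "{{!}}")


def information_wikitext(record: dict, license_tag: str) -> str:
    """Render a file description page from a metadata record and a license tag."""
    lines = [
        "=={{int:filedesc}}==",
        "{{Information",
        "|description={{en|1=" + escape_value(record["description"]) + "}}",
        "|date=" + escape_value(record["date"]),
        "|source=" + escape_value(record["source"]),
        "|author=" + escape_value(record["author"]),
        "}}",
        "",
        "=={{int:license-header}}==",
        license_tag,
    ]
    return "\n".join(lines)


# Invented sample record, for illustration only.
record = {
    "description": "Portrait of a woman, oil on canvas",
    "date": "1905",
    "source": "Example Museum, accession number A-123",
    "author": "Unknown artist",
}
print(information_wikitext(record, "{{PD-Art|PD-old-100}}"))
```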

Categories

See also Commons:Categories

The files you're going to upload need to be integrated into the category tree on Commons. Categories are a common way for users to find files. As full a categorization as possible is key to getting your images seen by the widest audience.

Types of categories

Broadly speaking there are two types of categories a file can have: categories related to the topic of the file, and the ones related to the origin of the file.

Consider the picture on the right. It's a reproduction of a 1910s photograph of men in a Pic-Pic car, from the collection of the Swiss Federal Archives.

The topic categories for this picture are Pic-Pic vehicles, 1910s photographs and Switzerland in World War I (mechanized vehicles).

The origin categories are CH-BAR Collection First World War Switzerland and Media contributed by the Swiss Federal Archives.

Origin categories can be further divided into tracking categories and source categories. Tracking categories are of little use to 'normal users', but are essential for tracking all content from a source. For example, the BaGLAMa tool uses the tracking category to provide monthly page views.

A tracking category can also indicate that the reproduction is 'officially donated'. Reproductions (especially of public domain works) can be found widely on the internet and uploaded by any Commons user. In the case of a batch upload, the source is usually 'verified' (e.g. via an API or a datadump). It therefore makes sense to have both a tracking category such as Media contributed by <institution> and a source category indicating the collection of that institution such as Collections of <institution>.

For files that 'belong together' it also makes sense to create an overarching source category, such as Decorative arts in the Louvre - Room 19. This category should be properly placed in the correct hierarchy (e.g. it's a child of Decorative arts in the Louvre).

Note that categories on Commons are far from standardised and can be unpredictable. In general, category names are in English, but there are many cases of inconsistent naming and errors.

Putting it into action

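To summarize: every file you upload should have:

  • A tracking category, such as Media contributed by <institution>
  • A source category, such as Collections of <institution>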

Your files can have:

  • As many topic categories as possible from the metadata. These may include, as appropriate: date (Category:1905 in France), location, place of creation, artist, type of object, style, material, technique, subject, etc. Ideally use as precise a category as possible within each tree (check what sub-categories there are), and if a large number of files - say over 20, but sometimes fewer - are going to be added to a category, it is often best to either spread them among subcategories, or create a new sub-category for them (as was done with subcategories of Category:Photographs by Paolo Monti).
  • {{subst:chc}} to add {{Check categories}} automatically (HotCat and others will remove it automatically when categories are altered manually). And/or a «To check» category for the post-upload maintenance (see Category:To be checked).
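
A small sketch of how the category block for one file might be assembled. The function name and argument layout are hypothetical; the category names in the usage example are the ones given for the Swiss Federal Archives photograph earlier in this section.

```python
def category_wikitext(topic, tracking, source, needs_check: bool = True) -> str:
    """Render category links, origin categories first, without duplicates."""
    seen = set()
    lines = []
    for raw in list(tracking) + list(source) + list(topic):
        name = raw.replace("_", " ").strip()
        if name.startswith("Category:"):
            name = name[len("Category:"):].strip()
        if name and name not in seen:
            seen.add(name)
            lines.append("[[Category:" + name + "]]")
    if needs_check:
        # Asks for a human review of the categories after upload.
        lines.insert(0, "{{subst:chc}}")
    return "\n".join(lines)


print(category_wikitext(
    topic=["Pic-Pic_vehicles", "1910s photographs"],
    tracking=["Media contributed by the Swiss Federal Archives"],
    source=["CH-BAR Collection First World War Switzerland"],
))
```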

Do a test upload

Upload a few images, and ask for feedback on the Commons:Batch uploading subpage. Reviewers will point out or come up with some crazy and unheard-of templates that you will want to integrate. You will certainly have to go through several iterations before you have the green light. Please be patient: it is better to get it right the first time than to fix uploads afterwards.

If you want to experiment with tools or new templates safely, you can set up an account on the beta cluster. This is a mirror of Wikimedia Commons where you will not disrupt the live environment if things go wrong. See http://commons.wikimedia.beta.wmflabs.org and this explanation.

Create a new user for your upload bot

If you do not already have one, you will need to request a bot account at Commons:Bots/Requests.

Do the real upload

Although upload bots can be written in several different languages and with different existing frameworks, most bots so far have been based on the Python Pywikibot framework. You can also reuse code shared by other batch uploaders.
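
Independently of the framework, the core of an upload bot is a loop that skips duplicates, paces itself and records failures. The sketch below is framework-neutral: upload and already_exists are hypothetical callables that you would implement with your chosen framework (for example a wrapper around its upload call), and the stand-ins shown here exist only so that the example runs without network access.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_batch(
    items: Iterable[str],
    upload: Callable[[str], None],
    already_exists: Callable[[str], bool],
    min_interval: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Upload items one by one and return (uploaded, skipped, failed)."""
    uploaded, skipped, failed = [], [], []
    for item in items:
        if already_exists(item):
            skipped.append(item)
            continue
        try:
            upload(item)
            uploaded.append(item)
        except Exception as error:  # keep going so one bad file does not stop the batch
            failed.append((item, str(error)))
        sleep(min_interval)  # pause after every attempt to stay under the rate limit
    return uploaded, skipped, failed


# Stand-in uploader that rejects one file type, to show the failure path.
def fake_upload(name: str) -> None:
    if name.endswith(".tif"):
        raise ValueError("unsupported in this demo")


result = run_batch(
    ["a.jpg", "b.jpg", "c.tif"],
    upload=fake_upload,
    already_exists=lambda name: name == "b.jpg",
    sleep=lambda seconds: None,
)
print(result)  # (['a.jpg'], ['b.jpg'], [('c.tif', 'unsupported in this demo')])
```

Keeping the three result lists makes it easy to report progress on the batch upload subpage and to retry only the failed files.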

If your files are too big, or for very large batches, you can request a server-side upload.

Documentation

Set up a Commons page to describe the project.

See also