Commons:Batch uploading/Tropenmuseum


Tropenmuseum

The Tropenmuseum donated about 2100 images related to Suriname and will donate a lot more images in the future (see Commons:Tropenmuseum). GerardM did the communication part, Multichill did the uploading/technical part.

Suriname

The first batch I got consisted of 2100 images related to Suriname and the Maroons. I received a DVD containing the images and a Microsoft Access database containing the metadata. I created a user ODBC connection in Windows and used pyodbc to connect to it from Python. The code is a combination of custom code, pywikipedia and functions I copied from previous projects (Deutsche Fotothek & WLANL). The filenames were already in the right form and contained a unique identifier, so I had my bot loop over the files and for each file:

  1. Extract the unique id
  2. Using the identifier pull all relevant info from the database
  3. Generate a description
  4. Generate temp categories
  5. Generate a SHA-1 hash and check for duplicates
  6. If the file doesn't exist yet, upload the file using KITbot
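
Steps 1, 3 and 5 can be sketched roughly as follows. This is not the code from my svn, just a minimal illustration using the standard library. The filename pattern (`TMnr` followed by the identifier), the record keys (`id`, `nl`, `en`) and the function names are assumptions made for the example; the database lookup and the actual upload are left out, and the duplicate check against Commons is represented by a plain set of known hashes.

```python
import hashlib
import re

# Assumed filename form: "... TMnr 10001234.jpg" (hypothetical pattern).
ID_PATTERN = re.compile(r"TMnr[ _](\w+)")


def extract_id(filename):
    """Step 1: return the identifier embedded in the filename, or None."""
    match = ID_PATTERN.search(filename)
    return match.group(1) if match else None


def build_description(record):
    """Step 3: build a wikitext description from a metadata record.

    `record` is a dict with hypothetical keys "id", "nl" and "en".
    Only the languages that are actually present are included.
    """
    descriptions = []
    for lang in ("nl", "en"):
        text = record.get(lang)
        if text:
            descriptions.append("{{%s|1=%s}}" % (lang, text))
    return "\n".join([
        "{{Information",
        "|description=" + "".join(descriptions),
        "|source=Tropenmuseum, object %s" % record["id"],
        "}}",
    ])


def sha1_hex(data):
    """Step 5: hexadecimal SHA-1 digest of the file contents (bytes)."""
    return hashlib.sha1(data).hexdigest()


def is_new(data, known_hashes):
    """Step 5: True if no file with the same content hash is known yet."""
    return sha1_hex(data) not in known_hashes
```

In the real run the set of known hashes would be replaced by a lookup on Commons, and the record would come from the Access database via pyodbc.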

Of course you can find the source in my svn.

The provided metadata was excellent. It contains descriptions in one (Dutch) or more (English) languages and was very useful for generating temp categories. All the images are placed in Category:Images from the Tropenmuseum and a bunch of temp categories. Images have to be copied from these temp categories to real categories. It turned out we don't have a lot of Suriname-related images, so I pretty much had to build a category tree from the ground up. This is a lot of work, but images end up in very good topic categories. It also improves the chance of images ending up in multiple relevant topic categories (in previous batch uploads, images got stuck at only one category). This is a lot of moving around, but that's just a job for a bot. This mapping causes a lot of over-categorized images, but this can easily be fixed with the recategorization bot (imagerecat.py -cat:Images_from_the_Tropenmuseum -onlyfilter).

For the next part we have to figure out how to get people to categorize the images, because I don't feel like doing this all alone. Users only have to map temp cats onto topic categories; the actual moving is done by a bot. Not sure how to make this easy for other users. Multichill (talk) 11:41, 16 September 2009 (UTC)
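
To make the "users map, bot moves" idea concrete, here is a minimal sketch of what the bot side could look like. The function name, the data layout (a dict from temp category to a list of topic categories) and the category names in the usage example are assumptions for illustration, not the actual bot code.

```python
def apply_mapping(image_categories, mapping):
    """Replace mapped temp categories with their topic categories.

    `mapping` maps a temp category name to a list of topic categories.
    Temp categories without a mapping are kept, so they stay visible
    for later review. Order is preserved and duplicates are dropped.
    """
    result = []
    for category in image_categories:
        for target in mapping.get(category, [category]):
            if target not in result:
                result.append(target)
    return result
```

For example, with a made-up mapping:

```python
mapping = {
    "Paramaribo (temp)": ["Paramaribo"],
    "Boten (temp)": ["Boats in Suriname", "Paramaribo"],
}
apply_mapping(["Paramaribo (temp)", "Boten (temp)"], mapping)
# → ['Paramaribo', 'Boats in Suriname']
```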

Indonesia

Yesterday Gerard and I visited the Tropenmuseum. We got 35,000 images and a database with all the metadata. I slightly modified the program I used for Suriname and fired up the bot. Modifications:

  • Other database name and other table names
  • Changed the regular expression to find the id of the file
  • Removed some encoding bugs
  • Filtering the temporary categories to get rid of the completely useless categories right away
  • Added <!--{{id|1=To be translated}}--> so Indonesian translations can be added later.
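
The filtering of temporary categories could look something like the sketch below. The criteria used here (a stop list of generic keywords and a minimum length) and the example keywords are assumptions for illustration; the real filter may use different rules.

```python
# Hypothetical stop list of keywords that make useless categories.
USELESS_KEYWORDS = {"overig", "diversen", "onbekend"}


def filter_temp_categories(keywords, stoplist=USELESS_KEYWORDS, min_length=3):
    """Return the keywords worth turning into temporary categories.

    Drops very short keywords, keywords on the stop list
    (case-insensitive) and duplicates, preserving the original order.
    """
    kept = []
    for keyword in keywords:
        cleaned = keyword.strip()
        if len(cleaned) < min_length:
            continue
        if cleaned.lower() in stoplist:
            continue
        if cleaned not in kept:
            kept.append(cleaned)
    return kept
```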

The upload will probably be finished tomorrow. Then comes the hard part: categorization. I added temporary categories again, but this time I got some data from the Tropenmuseum describing the structure of these categories so I can build a tree. I will first do this for the geography tree. Multichill (talk) 22:19, 26 November 2009 (UTC)
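
A rough sketch of how such a tree could be built, assuming the structure data can be read as (child, parent) pairs. That format, the function names and the place names in the usage example are assumptions; the data from the Tropenmuseum may well be laid out differently.

```python
from collections import defaultdict


def build_tree(pairs):
    """Build lookup tables from an iterable of (child, parent) pairs.

    Returns (children, parent_of): a dict from parent to its list of
    children, and a dict from child to its parent.
    """
    children = defaultdict(list)
    parent_of = {}
    for child, parent in pairs:
        children[parent].append(child)
        parent_of[child] = parent
    return children, parent_of


def path_to_root(term, parent_of):
    """Return the chain from `term` up to the top of its tree."""
    path = [term]
    seen = {term}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:  # guard against cycles in the source data
            break
        path.append(parent)
        seen.add(parent)
    return path
```

With made-up geography data:

```python
pairs = [("Java", "Indonesia"), ("Batavia", "Java"), ("Sumatra", "Indonesia")]
children, parent_of = build_tree(pairs)
path_to_root("Batavia", parent_of)
# → ['Batavia', 'Java', 'Indonesia']
```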

Categorization

Moved to Commons:Tropenmuseum#Categorization to avoid redundancy

Opinions first part

  • Making categorization easier: How about doing something like what was done with the Fotothek upload? Like creating the temporary categories with a commons delinker link and a suggested category, waiting for a user to review it. And where are all these categories stored? I mean, where can I find a list of all the temporary categories with how many files they contain, so I could check for a better category name, also for the delinker? Automatic Dutch to English translation would also make it a lot easier, instead of going to Google and translating... BTW, the upload is already finished, right?--Diaa abdelmoneim (talk) 00:05, 18 September 2009 (UTC)

3rd batch

A third batch is expected sometime in February 2010, but this might be much later. Until then we have plenty of images to keep us all busy. Multichill (talk) 22:57, 20 December 2009 (UTC)

Objects due to arrive soon

Just got an email. The next batch is in the (snail) mail now. It consists of 6000 photos of objects in the collection of the Tropenmuseum. I'm probably going to upload these in the next couple of days. Multichill (talk) 15:58, 16 June 2010 (UTC)

We had some problems, but now I'm uploading new images. Multichill (talk) 19:39, 27 July 2010 (UTC)