Commons:Batch uploading/Brooklyn Museum/HowTo
How do I process this batch job?
First, sorry for my bad English ;) This is a short documentation of how I do this job. I hope it helps a bit if you start your own bot. You should be a Linux or Unix user to understand this. Currently I have to use Xubuntu (though I dislike it), so the following was done on Xubuntu.
I currently work in these steps:
- analyse the website and find the best way to extract the images and the metadata
- write some scripts with a lot of loops, sed and grep commands, and then download everything I need (images & metadata)
- (if needed, parse the metadata again, check and format it) and create a script to upload everything
- run upload tests and do the upload (on another headless machine)
First step: I tested the API and found out that I can get all information about an object by its itemId. The simplest way to get all itemIds is to parse the search results. I wrote this simple bash script:
#!/bin/bash
outfile=itemIds.txt
# item count / 30 + 1
for i in {0..136} ; do
  index="`expr $i \* 30`"
  echo "Page #${i} ..."
  lynx --source "http://www.brooklynmuseum.org/opencollection/search/?type=object&start_index=${index}&q=africa*&prev_q=&x=25&y=14" \
    | tr '[\n\r\t]' ' ' \
    | sed 's/<div /\n<div /g' \
    | grep 'item-info' \
    | grep -v 'item-info-no-image' \
    | grep '/opencollection/objects/[0-9]*/' \
    | cut -d '"' -f 2 \
    | cut -d '/' -f 4 >> ${outfile}.tmp
done
cat ${outfile}.tmp | sort | uniq > ${outfile}
rm ${outfile}.tmp
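The start_index arithmetic in the loop above can be checked in isolation: the search API pages in steps of 30 items, so page i starts at index i*30, and {0..136} covers 137 pages, i.e. up to 4110 results. A tiny sketch (the page numbers here are arbitrary examples):

```shell
#!/bin/bash
# each search page holds 30 items; page i therefore starts at offset i*30
for i in 0 1 2 136 ; do
  index="`expr $i \* 30`"
  echo "page ${i} starts at index ${index}"
done
```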
After this I found only 1568 objects with images. OK, I made a script to download the XML data for each object into a single file (I suggest using sub-folders for each object). You will need an API key, which you can get here. I tuned the parameters a little to get the highest resolution and all the other fine information.
#!/bin/bash
apikey="<insert your api key>"
cat itemIds.txt | while read item ; do
  # create folder in 'files'
  mkdir "files/${item}"
  echo "ItemId: ${item} ..."
  # get xml as-is
  lynx -source "http://www.brooklynmuseum.org/opencollection/api/?method=collection.getItem&version=1&api_key=${apikey}&item_type=object&item_id=${item}&image_results_limit=20&include_html_style_block=true&max_image_size=1536" > files/${item}/${item}.xml
done
To analyse the available licences I wrote this script (extractRights.sh) and then piped it to sort.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
  cat "${file}" | tr '[\r\n]' ' ' \
    | sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' \
    | awk -F\" '{print $2}'
done
bash extractRights.sh | sort | uniq -c
This is the result:
      1 1.0
     80 copyright_artist_or_artists_estate
   1450 creative_commons_by_nc
     37 no_known_copyright_restrictions
Now that I know the keywords for the licences I can use (creative_commons_by_nc and no_known_copyright_restrictions), I wrote a script to remove all files that are not under one of these licences.
Hint: Currently there is a mistake on the museum's side. They mark images as CC-BY on the website but the same images as CC-BY-NC in the API. We are sure they mean CC-BY in the API too, so 'creative_commons_by_nc' in the API means 'creative_commons_by'.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
  rightstype="`cat \"${file}\" | tr '[\r\n]' ' ' | sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  if [ "$rightstype" != "creative_commons_by_nc" ] && [ "$rightstype" != "no_known_copyright_restrictions" ] ; then
    rm "$file"
    rmdir "`dirname $file`"
  fi
done
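After this cleanup it is worth re-running the rights extraction as a sanity check: any value other than the two kept keywords means something slipped through. A minimal, self-contained sketch (the demo tree and its two XML stubs are fabricated; the real check would run against ./files):

```shell
#!/bin/bash
# build a tiny fake tree in the layout used above: one folder and XML per object
mkdir -p demo/100 demo/200
printf '%s' '<object rightstype="creative_commons_by_nc"/>' > demo/100/100.xml
printf '%s' '<object rightstype="no_known_copyright_restrictions"/>' > demo/200/200.xml

# collect every rightstype value and drop the two we expect;
# anything left over is a file the cleanup script missed
unexpected="$(find demo -type f -name '*.xml' -exec grep -ho 'rightstype="[^"]*"' {} \; \
  | cut -d '"' -f 2 | sort -u \
  | grep -v -e '^creative_commons_by_nc$' -e '^no_known_copyright_restrictions$')"

if [ -z "${unexpected}" ] ; then
  echo "rights check: clean"
else
  echo "rights check: unexpected values: ${unexpected}"
fi
rm -r demo
```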
I did the same with the attribute 'collection' to keep only items that are in the Arts of Africa collection.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
  collection="`cat \"${file}\" | tr '[\r\n]' ' ' | sed '/collection/s/\(.*collection=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  if [ "$collection" != "Arts of Africa" ] ; then
    rm "$file"
    rmdir "`dirname $file`"
  fi
done
OK, let's count:
find ./files -type f -name '*.xml' | wc -l
OK, that is 1392 objects now.
I wrote a bash script that extracts all information from the XML files, puts each piece into its own file, and downloads and renames the images. I know that bash is not the perfect scripting language for this job, but I like to play around with it and it is easy to develop with. Do not wonder why I put every piece of information into a single file (I love to work with single files); this makes the upload script small and easy to develop.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
  echo "Process ${file} ..."
  xml="`cat \"${file}\" | tr '[\r\n]' ' '`"

  # extract all information with some grep and sed magic
  # please do not try to understand this unless you are a little bit crazy ;)
  id="`echo \"${xml}\" | sed 's/id=/\nid=/g' | grep "id=" | head -n 1 | sed '/id/s/\(.*id=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  title="`echo \"${xml}\" | sed 's/title=/\ntitle=/g' | grep "title=" | head -n 1 | sed '/title/s/\(.*title=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  uri="`echo \"${xml}\" | sed 's/uri=/\nuri=/g' | grep "uri=" | head -n 1 | sed '/uri/s/\(.*uri=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  accession_number="`echo \"${xml}\" | sed 's/accession_number=/\naccession_number=/g' | grep "accession_number=" | head -n 1 | sed '/accession_number/s/\(.*accession_number=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  object_date="`echo \"${xml}\" | sed 's/object_date=/\nobject_date=/g' | grep "object_date=" | head -n 1 | sed '/object_date/s/\(.*object_date=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  medium="`echo \"${xml}\" | sed 's/medium=/\nmedium=/g' | grep "medium=" | head -n 1 | sed '/medium/s/\(.*medium=\)\(.*\)/\2/' | awk -F\\" '{print $2}' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed -e 's/<[^>]*>//g'`"
  dimensions="`echo \"${xml}\" | sed 's/dimensions=/\ndimensions=/g' | grep "dimensions=" | head -n 1 | sed '/dimensions/s/\(.*dimensions=\)\(.*\)/\2/' | awk -F\\" '{print $2}' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed -e 's/<[^>]*>//g'`"
  credit_line="`echo \"${xml}\" | sed 's/credit_line=/\ncredit_line=/g' | grep "credit_line=" | head -n 1 | sed '/credit_line/s/\(.*credit_line=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  classification="`echo \"${xml}\" | sed 's/classification=/\nclassification=/g' | grep "classification=" | head -n 1 | sed '/classification/s/\(.*classification=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  description="`echo \"${xml}\" | sed 's/description=/\ndescription=/g' | grep "description=" | head -n 1 | sed '/description/s/\(.*description=\)\(.*\)/\2/' | awk -F\\" '{print $2}' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed -e 's/<[^>]*>//g'`"
  location="`echo \"${xml}\" | sed 's/location=/\nlocation=/g' | grep "location=" | head -n 1 | sed '/location/s/\(.*location=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  label="`echo \"${xml}\" | sed 's/label=/\nlabel=/g' | grep "label=" | head -n 1 | sed '/label/s/\(.*label=\)\(.*\)/\2/' | awk -F\\" '{print $2}' | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed -e 's/<[^>]*>//g'`"
  #collection="`echo \"${xml}\" | sed 's/collection=/\ncollection=/g' | grep "collection=" | head -n 1 | sed '/collection/s/\(.*collection=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  #rightstype="`echo \"${xml}\" | sed 's/rightstype=/\nrightstype=/g' | grep "rightstype=" | head -n 1 | sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  markings="`echo \"${xml}\" | sed 's/markings=/\nmarkings=/g' | grep "markings=" | head -n 1 | sed '/markings/s/\(.*markings=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  dynasty="`echo \"${xml}\" | sed 's/dynasty=/\ndynasty=/g' | grep "dynasty=" | head -n 1 | sed '/dynasty/s/\(.*dynasty=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  signed="`echo \"${xml}\" | sed 's/signed=/\nsigned=/g' | grep "signed=" | head -n 1 | sed '/signed/s/\(.*signed=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
  period="`echo \"${xml}\" | sed 's/period=/\nperiod=/g' | grep "period=" | head -n 1 | sed '/period/s/\(.*period=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"

  if [ "$id" != "" ] ; then
    echo -n "$id" > $file.id
  fi
  if [ "$title" != "" ] ; then
    echo -n "$title" > $file.title
  fi
  if [ "$uri" != "" ] ; then
    echo -n "$uri" > $file.uri
  fi
  if [ "$accession_number" != "" ] ; then
    echo -n "$accession_number" > $file.accession_number
  fi
  if [ "$object_date" != "" ] ; then
    echo -n "$object_date" > $file.object_date
  fi
  if [ "$medium" != "" ] ; then
    echo -n "$medium" > $file.medium
  fi
  if [ "$dimensions" != "" ] ; then
    echo -n "$dimensions" > $file.dimensions
  fi
  if [ "$credit_line" != "" ] ; then
    echo -n "$credit_line" > $file.credit_line
  fi
  if [ "$classification" != "" ] ; then
    echo -n "$classification" > $file.classification
  fi
  if [ "$description" != "" ] ; then
    echo -n "$description" > $file.description
  fi
  if [ "$label" != "" ] ; then
    echo -n "$label" > $file.label
  fi
  if [ "$location" != "" ] ; then
    echo -n "$location" > $file.location
  fi
  #if [ "$collection" != "" ] ; then
  #  echo -n "$collection" > $file.collection
  #fi
  #if [ "$rightstype" != "" ] ; then
  #  echo -n "$rightstype" > $file.rightstype
  #fi

  # others ###################################################
  if [ "$markings" != "" ] ; then
    echo "* Markings: $markings" >> "$file.other"
  fi
  if [ "$signed" != "" ] ; then
    echo "* Signed: $signed" >> "$file.other"
  fi
  if [ "$dynasty" != "" ] ; then
    echo "* Dynasty: $dynasty" >> "$file.other"
  fi
  if [ "$period" != "" ] ; then
    echo "* Period: $period" >> "$file.other"
  fi

  # artists (different values)
  echo "${xml}" | sed 's/<artist /\n<artist /g' | grep '<artist ' | while read artist ; do
    artist_role="`echo \"${artist}\" | sed '/role/s/\(.*role=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
    artist_name="`echo \"${artist}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
    echo "* ${artist_role}: ${artist_name}" >> "$file.other"
  done

  # geolocations (different values)
  echo "${xml}" | sed 's/<geolocation /\n<geolocation /g' | grep '<geolocation ' | while read geolocation ; do
    geolocation_name="`echo \"${geolocation}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
    geolocation_type="`echo \"${geolocation}\" | sed '/location_type/s/\(.*location_type=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
    echo "* ${geolocation_type}: ${geolocation_name}" >> $file.other
  done

  # images
  image_count=0
  echo "${xml}" | sed 's/<image uri=/\n<image uri=/g' | grep '<image uri=' | sed 's/\/size[0-9]\//\/size4\//g' | while read image ; do
    image_link="`echo \"${image}\" | sed 's/uri=/\nuri=/g' | grep "uri=" | head -n 1 | sed '/uri/s/\(.*uri=\)\(.*\)/\2/' | awk -F\\" '{print $2}'`"
    image_color="`echo \"${image}\" | sed 's/is_color=/\nis_color=/g' | grep "is_color=" | head -n 1 | sed '/is_color/s/\(.*is_color=\)\(.*\)/\2/' | awk -F\\" '{print $2}' | grep 'true'`"
    image_xray="`echo \"${image_link}\" | grep '_xrs_\|_xray_' &> /dev/null && echo \"true\"`"
    image_name="`basename \"${image_link}\"`"
    image_ext="`echo \"${image_name}\" | rev | cut -d '.' -f 1 | rev | tr '[A-Z]' '[a-z]'`"
    image_count=`expr ${image_count} + 1`
    if [ "$image_count" -gt "1" ] ; then
      upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`_(${image_count}).${image_ext}"
    else
      upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`.${image_ext}"
    fi
    echo "> Download ${image_name} ..."
    wget "${image_link}" -O "files/${id}/${upload_name}" &> "files/${id}/${upload_name}.log" || echo "ERROR!" >> "files/${id}/${upload_name}.log"
    echo "File:${upload_name}" >> "$file.gallery"
    if [ "${image_link}" != "" ] ; then
      echo -n "$image_link" > "files/${id}/${upload_name}.link"
    fi
    if [ "${image_name}" != "" ] ; then
      echo -n "$image_name" > "files/${id}/${upload_name}.name"
    fi
    if [ "${image_color}" != "" ] ; then
      echo -n "$image_color" > "files/${id}/${upload_name}.color"
    fi
    if [ "${image_xray}" != "" ] ; then
      echo -n "$image_xray" > "files/${id}/${upload_name}.xray"
    fi
  done
done
The extraction script produces a file listing like the following for each object folder.
$ ls -l
total 564
-rw-rw-r-- 1 xxx xxx   2238 Oct 15 20:27 2910.xml
-rw-rw-r-- 1 xxx xxx      6 Oct 20 13:05 2910.xml.accession_number
-rw-rw-r-- 1 xxx xxx      9 Oct 20 13:05 2910.xml.classification
-rw-rw-r-- 1 xxx xxx     14 Oct 20 13:05 2910.xml.collection
-rw-rw-r-- 1 xxx xxx     56 Oct 20 13:05 2910.xml.credit_line
-rw-rw-r-- 1 xxx xxx    547 Oct 20 13:05 2910.xml.description
-rw-rw-r-- 1 xxx xxx     52 Oct 20 13:05 2910.xml.dimensions
-rw-rw-r-- 1 xxx xxx     70 Oct 20 13:05 2910.xml.gallery
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 2910.xml.id
-rw-rw-r-- 1 xxx xxx     18 Oct 20 13:05 2910.xml.medium
-rw-rw-r-- 1 xxx xxx     31 Oct 20 13:05 2910.xml.object_date
-rw-rw-r-- 1 xxx xxx     87 Oct 20 13:05 2910.xml.other
-rw-rw-r-- 1 xxx xxx     22 Oct 20 13:05 2910.xml.rightstype
-rw-rw-r-- 1 xxx xxx      5 Oct 20 13:05 2910.xml.title
-rw-rw-r-- 1 xxx xxx     63 Oct 20 13:05 2910.xml.uri
-rw-rw-r-- 1 xxx xxx 161155 Mar 10  2012 Brooklyn_Museum_22.233_Stool_(2).jpg
-rw-rw-r-- 1 xxx xxx     80 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.link
-rw-rw-r-- 1 xxx xxx    927 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.log
-rw-rw-r-- 1 xxx xxx     13 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.name
-rw-rw-r-- 1 xxx xxx 162477 Mar 15  2012 Brooklyn_Museum_22.233_Stool.jpg
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.color
-rw-rw-r-- 1 xxx xxx     91 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.link
-rw-rw-r-- 1 xxx xxx    932 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.log
-rw-rw-r-- 1 xxx xxx     24 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.name
Hint: Depending on the source there can be unusable filenames with %XX characters, double dots or double underscores. You should find these before the upload and rename all dependent files accordingly, otherwise the upload fails silently with the upload script. (You can pipe all output to a logfile and analyse it afterwards, e.g. python pywikipedia/upload.py ... &>> alluploads.log)
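A quick way to spot such filenames before uploading is a grep over the file tree. A self-contained sketch (the demo directory and both filenames are fabricated; the real check would run against ./files):

```shell
#!/bin/bash
# build two fabricated example files: one clean, one with all three problems
mkdir -p demo
touch 'demo/Brooklyn_Museum_22.233_Stool.jpg'        # fine
touch 'demo/Brooklyn_Museum_22.233__Stool%20a..jpg'  # %XX, double dot, double underscore

# flag %XX escapes, double dots and double underscores in one pass
bad="$(find ./demo -type f | grep -E '%[0-9A-Fa-f]{2}|\.\.|__')"
echo "bad filenames: ${bad}"
rm -r demo
```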
Now we can upload the files, using pywikipedia and this upload script:
find ./files/ -name '*.jpg' | while read file ; do
  if ! grep -m 1 "^${file}$" upload.log &> /dev/null ; then
    path="`dirname \"${file}\"`"
    number="`basename \"${path}\"`"
    filename="`basename \"${file}\"`"
    id="`cat \"${path}/${number}.xml.id\" 2> /dev/null`"
    uri="`cat \"${path}/${number}.xml.uri\" 2> /dev/null`"
    accession_number="`cat \"${path}/${number}.xml.accession_number\" 2> /dev/null`"
    medium="`cat \"${path}/${number}.xml.medium\" 2> /dev/null`"
    dimensions="`cat \"${path}/${number}.xml.dimensions\" 2> /dev/null`"
    credit_line="`cat \"${path}/${number}.xml.credit_line\" 2> /dev/null`"
    image_link="`cat \"${file}.link\" 2> /dev/null`"
    image_name="`cat \"${file}.name\" 2> /dev/null`"

    # prepare title
    if test -e "${path}/${number}.xml.title" ; then
      title="{{en|`cat \"${path}/${number}.xml.title\" 2> /dev/null`}}"
    else
      title=""
    fi

    # prepare date
    if test -e "${path}/${number}.xml.object_date" ; then
      if grep "^[0-9]*th century$" "${path}/${number}.xml.object_date" &> /dev/null ; then
        yy="`cat \"${path}/${number}.xml.object_date\" | sed 's/[a-zA-Z]//g' | sed 's/[ ]*//g'`"
        object_date="{{other_date|century|${yy}}}"
      else
        object_date="{{en|`cat \"${path}/${number}.xml.object_date\" 2> /dev/null`}}"
      fi
    else
      object_date=""
    fi

    # prepare description (the line break and the empty line in the environment variable are important)
    description="`cat \"${path}/${number}.xml.description\" 2> /dev/null`"
    label="`cat \"${path}/${number}.xml.label\" 2> /dev/null`"
    if [ "${description}" != "" ] && [ "${label}" != "" ] ; then
      description="{{en|${description}}}

{{en|${label}}}"
    else
      if [ "${description}" == "" ] && [ "${label}" == "" ] ; then
        description="${title}"
      else
        description="{{en|${description}${label}}}"
      fi
    fi

    # prepare location
    location="`cat \"${path}/${number}.xml.location\" 2> /dev/null`"
    if test -e "${path}/${number}.xml.location" ; then
      location="{{Brooklyn Museum location|collection=africa}} ${location}"
    else
      location="{{Brooklyn Museum location|collection=africa}}"
    fi

    # prepare additional notes
    notes=""
    if test -e "${path}/${number}.xml.other" 2> /dev/null ; then
      notes="`cat \"${path}/${number}.xml.other\" | sed 's/ place / Place /g' 2> /dev/null`"
    else
      notes=""
    fi

    # add gallery if more than one image (the line breaks in the environment variables are important)
    image_count="`cat \"${path}/${number}.xml.gallery\" 2> /dev/null | wc -l`"
    if [ "${image_count}" -gt "1" ] ; then
      gallery="<gallery>
`cat \"${path}/${number}.xml.gallery\" 2> /dev/null`
</gallery>"
    else
      gallery=""
    fi

    # add categories for b&w or x-ray (the line breaks in the environment variables are important)
    add_categories=""
    if test -e "${file}.xray" ; then
      add_categories="
[[Category:X-rays of objects]]"
    else
      if ! test -e "${file}.color" ; then
        add_categories="
[[Category:Black and white photographs]]"
      fi
    fi

    # upload...
    echo "Uploading $filename => "
    starttime=$(date +"%s")
    yes N | python pywikipedia/upload.py -simulate -keep -filename:$(unknown) -noverify ${file} "{{Artwork
 | Artist           = {{unknown}}
 | Title            = ${title}
 | Year             = ${object_date}
 | Description      = ${description}
 | Technique        =
 | Dimensions       = ${dimensions}
 | Institution      = {{Institution:Brooklyn Museum}}
 | Location         = ${location}
 | Credit_line      = ${credit_line}
 | Inscriptions     =
 | Notes            = ${notes}
 | Source           = [http://www.brooklynmuseum.org/opencollection/objects/${id} Online Collection] of [[w:Brooklyn Museum|Brooklyn Museum]]; Photo: Brooklyn Museum, [${image_link} ${image_name}]
 | accession number = [http://www.brooklynmuseum.org/opencollection/objects/${id} ${accession_number}]
 | Permission       = {{WikiAfrica/Brooklyn Museum}}
 | Other_versions   = ${gallery}
}}
[[Category:African art in the Brooklyn Museum]]
[[Category:Import by User:Slick-o-bot/Brooklyn Museum]]${add_categories}" && echo "${file}" >> upload.log

    # set throttle (means: $throttle uploads per minute)
    throttle=4
    stoptime=$(date +"%s")
    uploadtime=$(($stoptime-$starttime))
    sleep=`expr \( 60 - ${throttle} \* ${uploadtime} \) / \( ${throttle} - 1 \)`
    if [[ ${sleep} -lt 0 ]] ; then sleep=0 ; fi
    echo "-----------------------------------------------------------------"
    echo ">> upload time was ${uploadtime} seconds, sleeping ${sleep} seconds"
    echo "-----------------------------------------------------------------"
    sleep ${sleep}
  fi
done
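The throttle arithmetic at the end of the script can be checked on its own: with throttle=4 uploads per minute and the last upload taking T seconds, the pause between uploads is (60 - 4*T) / (4 - 1). For a 5-second upload (an arbitrary example time) that gives (60 - 20) / 3 = 13 seconds:

```shell
#!/bin/bash
# pause so that roughly $throttle uploads happen per minute;
# a negative result (slow uploads) is clamped to no pause at all
throttle=4
uploadtime=5
sleep=`expr \( 60 - ${throttle} \* ${uploadtime} \) / \( ${throttle} - 1 \)`
if [ ${sleep} -lt 0 ] ; then sleep=0 ; fi
echo "would sleep ${sleep} seconds"
```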