User:Slick/VRIN Crawler
This bash script acts as a VRIN crawler. It does not use the API, so it is of limited general use. It runs over the given categories, extracts the VRIN from each image page, and writes the results to a list. It does not check whether a VRIN is valid (and sometimes it picks up garbage), but it works fine for me. Because it parses the HTML, it may need changes whenever the layout/templates on Commons change. It only crawls the given categories, not their sub-categories.
It can be useful if you want to:
- find already uploaded pictures by VRIN
- search for duplicates on Commons by VRIN
Depending on your internet connection, it can run for several days over the categories given below.
#!/bin/bash

# Categories to crawl (sub-categories are not followed).
categories="
Category:Files_created_by_the_United_States_Armed_Forces_with_known_IDs
Category:Files_created_by_the_United_States_Air_Force_with_known_IDs
Category:Files_created_by_the_United_States_Army_with_known_IDs
Category:Files_created_by_the_United_States_Coast_Guard_with_known_IDs
Category:Files_created_by_the_United_States_Department_of_Defense_with_known_IDs
Category:Files_created_by_the_United_States_Marine_Corps_with_known_IDs
Category:Files_created_by_the_United_States_Navy_with_known_IDs
"

file="`tempfile`"

for category in ${categories} ; do
    # Fetch the first page of the category listing.
    lynx --source "http://commons.wikimedia.org/wiki/${category}" > ${file}
    while [ "`cat ${file} 2> /dev/null`" != "" ] ; do
        # Extract the /wiki/File: links from the gallery boxes.
        cat ${file} 2> /dev/null \
            | tr '\n\r\t' ' ' \
            | sed 's/ [ ]*/ /g' \
            | sed 's/<li /\n<li /g' \
            | grep "^<li" \
            | grep 'class="gallerybox"' \
            | sed 's/href=/\nhref=/g' \
            | grep "^href" \
            | cut -d '"' -f 2 \
            | grep "^/wiki/File:" \
            | sort | uniq \
            | while read image ; do
                # Fetch the file page, strip the HTML tags, and pull the ID
                # out of the "This Image was released by ... ID ..." credit line.
                id="`lynx --source \"http://commons.wikimedia.org${image}\" | tr '\n\r\t' ' ' | sed 's/ [ ]*/ /g' | sed 's/This Image was released by/\nThis Image was released by/g' | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | grep -m 1 '^This Image was released by' | sed 's/ID/\nID/g' | grep -m 1 '^ID ' | cut -d ' ' -f 2`"
                echo "> ${id} http://commons.wikimedia.org${image}"
                echo "> ${id} http://commons.wikimedia.org${image}" >> idCrawl.log
            done
        # Follow the "next 200" pagination link, if there is one
        # (un-escaping &amp; in the link target).
        page="`cat ${file} 2> /dev/null | tr '\n\r\t' ' ' | sed 's/ [ ]*/ /g' | sed 's/<a /\n<a /g' | grep '^<a ' | grep '>next 200<' | sed 's/href=/\nhref=/g' | grep '^href' | cut -d '"' -f 2 | sed 's/\&amp;/\&/g'`"
        if [ "${page}" != "" ] ; then
            lynx --source "http://commons.wikimedia.org${page}" > ${file}
        else
            rm ${file}
        fi
    done
done

test -e ${file} && rm ${file}
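The gallery-link extraction pipeline from the script can be exercised offline on a small, made-up HTML fragment (nothing is fetched from Commons here; the sample markup only mimics the gallerybox list items):

```shell
# Hypothetical sample of the category page markup the script parses.
html='<li class="gallerybox"><a href="/wiki/File:Example1.jpg">a</a></li>
<li class="gallerybox"><a href="/wiki/File:Example2.jpg">b</a></li>
<li class="other"><a href="/wiki/Category:Other">c</a></li>'

# Same pipeline as in the script: flatten to one line, split on <li,
# keep gallerybox items, split on href=, keep the /wiki/File: targets.
links="$(echo "${html}" \
    | tr '\n\r\t' ' ' \
    | sed 's/ [ ]*/ /g' \
    | sed 's/<li /\n<li /g' \
    | grep '^<li' \
    | grep 'class="gallerybox"' \
    | sed 's/href=/\nhref=/g' \
    | grep '^href' \
    | cut -d '"' -f 2 \
    | grep '^/wiki/File:' \
    | sort | uniq)"

echo "${links}"
```

Only the two gallerybox file links survive; the category link is filtered out.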
If you own or create a tool that does the same job and/or does it better, please send me a message.
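For comparison, here is a minimal sketch of how the same category crawl could be done with the MediaWiki API instead of HTML scraping. It assumes curl and jq are available; `build_url` and `crawl_category` are made-up helper names, and extracting the VRIN from each file page is not covered:

```shell
# Hypothetical API-based alternative: list all files in a category via
# action=query&list=categorymembers instead of parsing rendered HTML.
api="https://commons.wikimedia.org/w/api.php"

# Build the categorymembers query URL, with an optional continuation token.
build_url() {
  local category="$1" cont="$2"
  local url="${api}?action=query&list=categorymembers&cmtitle=${category}&cmtype=file&cmlimit=500&format=json"
  [ -n "${cont}" ] && url="${url}&cmcontinue=${cont}"
  echo "${url}"
}

# Page through one category, printing one file title per line.
crawl_category() {
  local category="$1" cont="" json
  while : ; do
    json="$(curl -s "$(build_url "${category}" "${cont}")")"
    echo "${json}" | jq -r '.query.categorymembers[].title'
    cont="$(echo "${json}" | jq -r '.continue.cmcontinue // empty')"
    [ -z "${cont}" ] && break
  done
}

# Usage (needs network access):
# crawl_category Category:Files_created_by_the_United_States_Navy_with_known_IDs
```

Unlike the HTML scraper, this keeps working when the Commons page layout or templates change, since the API response format is stable.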