User:Slick/VRIN Crawler

This bash script acts as a VRIN crawler. It does not use the API, so it is not really meant for serious use. It runs over the given categories, extracts the VRIN from each image page, and writes it to a list. It does not check whether the result is a valid VRIN (sometimes it picks up crap), but it works fine for me. Because it parses the HTML, it may need changes whenever the layout/templates on Commons change. It does not crawl sub-categories, only the categories given.
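
If you want to weed out the crap afterwards, a rough plausibility filter can be run over the result list (idCrawl.log, written by the script below). The pattern here is only my approximation of the usual shape of these IDs (six-digit date, service letter, five alphanumeric characters, three- or four-digit sequence number); treat it as a sketch, not a real validity check:

# Keep only log lines whose second field roughly matches the assumed
# YYMMDD-S-XXXXX-NNN(N) shape; everything else is probably crap.
grep -E '^> [0-9]{6}-[A-Z]-[0-9A-Z]{5}-[0-9]{3,4} ' idCrawl.log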

It can be useful if you want to

  • find already uploaded pictures by their VRIN
  • search for duplicates on Commons by VRIN (see the sketch after this list)
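
For the duplicate search, something like the following works on idCrawl.log, the list the script below writes. It relies on the `> <id> <url>` format of those lines; entries where no VRIN was found have an empty second field and are dropped first:

# Print every VRIN that was extracted for more than one file,
# together with all matching log lines.
cut -d ' ' -f 2 idCrawl.log | grep -v '^$' | sort | uniq -d | while read id ; do
	echo "duplicate: ${id}"
	grep " ${id} " idCrawl.log
done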

Depending on your internet connection, it runs for some days (with the categories given below).

#!/bin/bash
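# Categories to crawl; sub-categories are not followed.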
categories="
	Category:Files_created_by_the_United_States_Armed_Forces_with_known_IDs
	Category:Files_created_by_the_United_States_Air_Force_with_known_IDs
	Category:Files_created_by_the_United_States_Army_with_known_IDs
	Category:Files_created_by_the_United_States_Coast_Guard_with_known_IDs
	Category:Files_created_by_the_United_States_Department_of_Defense_with_known_IDs
	Category:Files_created_by_the_United_States_Marine_Corps_with_known_IDs
	Category:Files_created_by_the_United_States_Navy_with_known_IDs
"

# Temporary file that holds the HTML of the current listing page.
file="`mktemp`"
# ${categories} is split on whitespace, one category per iteration.
for category in ${categories} ; do

	# Fetch the first listing page of the category.
	lynx --source "http://commons.wikimedia.org/wiki/${category}" > "${file}"

	# Loop as long as the temp file still holds a page; it is removed
	# below once there is no further "next 200" page.
	while [ "`cat ${file} 2> /dev/null`" != "" ] ; do

		# Flatten the page, split it so each <li> starts a new line, keep the
		# gallery boxes, and extract the unique /wiki/File: links from them.
		cat ${file} 2> /dev/null | tr '\n\r\t' ' ' | sed 's/ [ ]*/ /g' | sed 's/<li /\n<li /g' | grep "^<li" | grep 'class="gallerybox"' | sed 's/href=/\nhref=/g' | grep "^href" | cut -d '"' -f 2 | grep "^/wiki/File:" | sort | uniq | while read image ; do
		
			# Fetch the file page, strip all HTML tags, find the sentence
			# "This Image was released by ... ID ..." and take the word after "ID".
			id="`lynx --source \"http://commons.wikimedia.org${image}\" | tr '\n\r\t' ' ' | sed 's/ [ ]*/ /g' | sed 's/This Image was released by/\nThis Image was released by/g' | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | grep -m 1 '^This Image was released by' | sed 's/ID/\nID/g' | grep -m 1 '^ID ' | cut -d ' ' -f 2`"

			# Print the hit and append it to the result list.
			echo "> ${id} http://commons.wikimedia.org${image}"
			echo "> ${id} http://commons.wikimedia.org${image}" >> idCrawl.log

		done

		# Look for the "next 200" paging link; in the raw HTML its href encodes
		# "&" as "&amp;", so decode that before requesting the next page.
		page="`cat ${file} 2> /dev/null | tr '\n\r\t' ' ' | sed 's/ [ ]*/ /g' | sed 's/<a /\n<a /g' | grep '^<a ' | grep '>next 200<' | sed 's/href=/\nhref=/g' | grep '^href' | cut -d '"' -f 2 | sed 's/&amp;/\&/g'`"
		if [ "${page}" != "" ] ; then
			lynx --source "http://commons.wikimedia.org${page}" > "${file}"
		else
			# No further page: remove the temp file so the while loop ends.
			rm "${file}"
		fi

	done

done
# Clean up the temp file if it is still around (e.g. when a category fetch failed).
test -e "${file}" && rm "${file}"
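
Because the fragile part is the HTML parsing, the same category listing could also be fetched from the MediaWiki API, which the script above deliberately does not use. Here is a minimal sketch for one category, assuming curl and jq are installed; the query parameters are the standard list=categorymembers ones, and the loop follows the cmcontinue continuation token until the listing is exhausted:

#!/bin/bash
# Sketch: list all files of one category via the MediaWiki API instead of parsing HTML.
api="https://commons.wikimedia.org/w/api.php"
category="Category:Files_created_by_the_United_States_Navy_with_known_IDs"
# Fixed query arguments; --data-urlencode takes care of special characters.
args=( -s -G "${api}"
	--data-urlencode "action=query"
	--data-urlencode "format=json"
	--data-urlencode "list=categorymembers"
	--data-urlencode "cmtitle=${category}"
	--data-urlencode "cmtype=file"
	--data-urlencode "cmlimit=500" )
cmcontinue=""
while : ; do
	if [ -n "${cmcontinue}" ] ; then
		json="$(curl "${args[@]}" --data-urlencode "cmcontinue=${cmcontinue}")"
	else
		json="$(curl "${args[@]}")"
	fi
	# One file title per line.
	echo "${json}" | jq -r '.query.categorymembers[].title'
	# Read the continuation token; empty means the listing is done.
	cmcontinue="$(echo "${json}" | jq -r '.continue.cmcontinue // empty')"
	[ -z "${cmcontinue}" ] && break
done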

If you own or create a tool that does the same job and/or does it better, please send me a message.