User:Multichill/Imagecopy
Wikimedia Commons wants to be the central place for free images and other media. That's a nice goal, but before we try to conquer the world, we'd better start with the Wikimedia Universe. The different Wikimedia wikis still contain a lot of free files which should be moved to Commons (stats). Not all files can be moved, but a lot of them can. Take for example en:Category:All free media, which contains 480,000 files. This is a huge job that can't be done entirely by hand, nor fully automatically. Semi-automatic is the approach.
Requirements
- Accurate, errors get stuff deleted, we don't want that
- To the point, we don't want to drown in redundant information
- Fast, a lot to move
- Easy to use, to have more users help out
New bot
To match these requirements I wrote a bot based on the lessons learned from previous tools. The bot is called imagecopy_self.py and is part of pywikipedia. The first version of the bot focuses on self-published works. It won't work on other files. That is over 300,000 files, so it should keep us busy for a while. The bot will try to figure out the fields of {{Information}} based on the current information template and will fall back to free text, first uploader and first upload date. Category suggestions are from CommonSense and some filtering is applied. The bot is currently in beta so still full of bugs (no, not really, just be careful ;-) ).
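The fallback order described above can be sketched roughly as follows. This is not the code of imagecopy_self.py, only an illustration in modern Python. The `FileInfo` class, the `build_fields` function and the `default_source` parameter are hypothetical names made up for this example, and the page does not say what the default for the source field is.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Hypothetical container for what is known about a local file page."""
    information: dict        # parameters parsed from {{Information}}, may be empty
    free_text: str           # page text outside any template
    first_uploader: str      # user name of the first uploader
    first_upload_date: str   # date of the first upload, e.g. "2010-08-07"


def build_fields(info: FileInfo, default_source: str = "") -> dict:
    """Prefer values from {{Information}}; fall back to defaults when empty.

    default_source is an assumption: the caller decides what to use when
    the template has no source.
    """
    def pick(key: str, default: str) -> str:
        # Whitespace-only values count as empty.
        value = info.information.get(key, "").strip()
        return value if value else default

    return {
        "description": pick("description", info.free_text.strip()),
        "date": pick("date", info.first_upload_date),
        "source": pick("source", default_source),
        "author": pick("author", info.first_uploader),
    }
```

A file with a filled description but a blank date would keep its description and get the first upload date; the operator can still correct every field before the upload.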
Approaches
300,000 files is a lot of work. For each subset you have two approaches:
- Cherry picking: Just do the good and easy files. If a file takes too much time or something is wrong, just skip it. According to the Pareto principle we should be able to move a lot of files in not too much time with this approach.
- Complete: When you're done the files are either moved to Commons, marked as non-free or deleted. This takes a lot of time and you might have to deal with some upset users.
Subsets
To be able to cope with the large number of files it's good to work on certain subsets:
- Date based categories (like the subcategories of en:Category:User-created public domain images)
- Files uploaded by a user
Issues
Open
- Date should maybe be i18ned
- Source should maybe be i18ned
- Use a regex to fuzzy extract a date when only upload date is available (example)
- {{Pd-self}} isn't caught correctly, see en:File:BHF-177 structure.png. Multichill (talk) 21:54, 8 August 2010 (UTC)
- Looks like the re.ignorecase is ignored. According to the manual in 2.7 this should function correctly. Multichill (talk) 21:58, 8 August 2010 (UTC)
- Found it and fixed in r8814. Multichill (talk) 19:32, 30 December 2010 (UTC)
- Okay, this time I think it's a real error ;). Seems that the user interface won't load properly under Mac OS X 10.6.2 (using Python 2.6.1). See a video I uploaded here (you'll probably have to choose the HD version to be able to see the text in the Terminal window) for what happens when I try to run the program. - EdoDodo talk 13:48, 9 August 2010 (UTC)
- Can you try if imagecopy.py works? That one uses tk too. Multichill (talk) 14:06, 9 August 2010 (UTC)
- imagecopy.py works as far as the interface is concerned, but it's the script that was giving me the "Segmentation fault" error. - EdoDodo talk 15:24, 9 August 2010 (UTC)
- For what it's worth, running imagecopy_enwp.py results in identical behavior to imagecopy_self.py. - EdoDodo talk 15:28, 9 August 2010 (UTC)
- "View in browser" opens latest revision (example http://en.wikipedia.org/w/index.php?title=File:Falko-Traber-Zuckerhut-Rio1.jpg&redirect=no&useskin=monobook&oldid=368959779) and not just the file (example http://en.wikipedia.org/wiki/File:Falko-Traber-Zuckerhut-Rio1.jpg). I suggest it just opens the file. --MGA73 (talk) 14:51, 14 August 2010 (UTC)
- I suggest that en:Template:duplicate file info and all the info in it is not transferred to Commons. Bot operator should check first if any of the info is relevant. --MGA73 (talk) 15:07, 14 August 2010 (UTC)
- The placeholder template is also a self but it needs a tweak to work. --MGA73 (talk) 20:42, 15 August 2010 (UTC)
- Perhaps add an option so we can ask the bot to skip images that are not in use. --MGA73 (talk) 18:46, 30 December 2010 (UTC)
- Template "move to commons" was not removed here: [1] --MGA73 (talk) 18:48, 30 December 2010 (UTC)
- Should be fixed in r8814. Multichill (talk) 19:31, 30 December 2010 (UTC)
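The open item about fuzzily extracting a date when only the upload date is available could look something like the sketch below. It is an assumption about one possible approach, not the bot's implementation. `DATE_RE` and `guess_date` are hypothetical names, and only English month names and ISO dates are recognised.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# The flag is passed to re.compile, so it never has to be handed to
# re.sub positionally, where the fourth argument is count.
DATE_RE = re.compile(
    r"\b(?:(?P<day>\d{1,2})\s+)?(?P<month>" + MONTHS + r")\s+(?P<year>(?:19|20)\d{2})\b"
    r"|\b(?P<iso>(?:19|20)\d{2}-\d{2}-\d{2})\b",
    re.IGNORECASE,
)


def guess_date(text, upload_date):
    """Return the first date-like string in text, else the upload date."""
    match = DATE_RE.search(text)
    if match is None:
        return upload_date
    return match.group(0)
```

The matched text is returned unchanged, so the operator still decides whether it is really the creation date. A false hit such as "may 2009" in running prose is possible, which is one reason to keep this as a suggestion rather than an automatic fill.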
Fixed
- Should prefetch descriptions and put them in a queue
- Done. Multichill (talk) 16:46, 7 August 2010 (UTC)
- Should put files to upload in a queue
- Done. Multichill (talk) 16:46, 7 August 2010 (UTC)
- Always fetch default fields
- Now fetching it if the information based fields are empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
- If date is left empty, use default field
- Now fetching it if the information based field is empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
- If source is left empty, use default field
- Now fetching it if the information based field is empty. Multichill (talk) 16:46, 7 August 2010 (UTC)
- If author is left empty, use default field
- Now fetching it if the information based field is empty. Multichill (talk) 14:44, 8 August 2010 (UTC)
- If filename already exists, don't lose all filled out fields
- Should be fixed now. Multichill (talk) 16:46, 7 August 2010 (UTC)
- Add a line in the code saying "This is still in test - please report any errors to xxx".
- Done. Multichill (talk) 16:46, 7 August 2010 (UTC)
- Enwp seems to use a location field in {{Information}}
- Is now added to description if found. Multichill (talk) 16:26, 8 August 2010 (UTC)
- Get fields from current information template with pywikipedia code (not regex)
- Done. Multichill (talk) 16:26, 8 August 2010 (UTC)
- Remove 1= in cases like this.
- Looks like ==Licensing:== was causing this and not the 1=. Made the regex more flexible. Multichill (talk) 16:26, 8 August 2010 (UTC)
- If author=<uploader>, use default field
- Done. Multichill (talk) 16:26, 8 August 2010 (UTC)
- Should not add images to Category:People by name
- Done. Blacklisted the category. Multichill (talk) 07:17, 30 December 2010 (UTC)
- It seems that templates like "cite book" in the information template make the bot create a lot of blank fields.
- I think I partially fixed this when I implemented the new parser code for {{Information}}. Needs to be tested. Multichill (talk) 16:26, 8 August 2010 (UTC)
- Yeah looks like it works File:TSC-Frequency-Signs-Childhood.png. --MGA73 (talk) 14:12, 14 August 2010 (UTC)
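The first two fixed items, prefetching descriptions into a queue and queueing the uploads, follow a producer/consumer pattern. A minimal sketch of the prefetch half, assuming a background thread and the standard library `queue` module, is shown here. The names `prefetch`, `review_all`, `fetch_description` and `review` are hypothetical and not taken from the bot.

```python
import queue
import threading


def prefetch(titles, fetch_description, out_queue):
    """Producer: fetch descriptions in the background and queue them."""
    for title in titles:
        out_queue.put((title, fetch_description(title)))
    out_queue.put(None)  # sentinel: nothing more to fetch


def review_all(titles, fetch_description, review, maxsize=5):
    """Consumer: review each file while the next ones are already loading."""
    pending = queue.Queue(maxsize=maxsize)  # bounded, so prefetching stays ahead by at most maxsize
    worker = threading.Thread(
        target=prefetch, args=(titles, fetch_description, pending), daemon=True
    )
    worker.start()
    results = []
    while True:
        item = pending.get()
        if item is None:
            break
        title, description = item
        results.append(review(title, description))
    worker.join()
    return results
```

The point of the design is that the operator does not wait on the network between files. An upload queue could mirror this, with the reviewed files as the producer side and the upload as the consumer.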
Forget for now
- Do not "blank" fields if there is an {{ifd}} on the image (Example File:20100404 205709 BacteriaFromKitchenSponge.jpg).
- Don't really understand this one. Need another example to test. Multichill (talk) 16:26, 8 August 2010 (UTC)
- Insert info from outside of a {{Information}} in "Description" or below. (Example File:20100404 205709 BacteriaFromKitchenSponge.jpg)
- This is hard to implement. Might add a lot of crap to the description field. Multichill (talk) 16:26, 8 August 2010 (UTC)