Commons:Batch uploading/Navy News Service
The FEMA request got me started. The US Navy has about 75,000(!) images available at http://www.navy.mil/view_photos_top.asp just waiting to be copied to Commons. I wrote a bot based on the FEMA upload.
- The bot loops over all the images.
- From the META fields I get the url, long description and short description
- A regex extracts the date from the long description
- A regex extracts the author from the long description
- A regex extracts the location from the long description
- The title is constructed based on the url and the short description
- Image is uploaded and ends up in one of these categories
This is just a general overview. The source is available here. Multichill (talk) 16:48, 16 October 2009 (UTC)
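The extraction steps listed above could be sketched roughly as follows. This is not the bot's actual source: the caption layout, the example URL and the names `parse_caption` and `build_title` are assumptions made up for illustration.

```python
import re

# Assumed caption layout (hypothetical, for illustration only):
#   "<location> (<date>) -- <text> U.S. Navy photo by <author> (RELEASED)"
LOCATION_DATE = re.compile(r"^(?P<location>.+?)\s*\((?P<date>[^()]+)\)\s*(?:--|–)")
AUTHOR = re.compile(r"U\.S\. Navy photo by (?P<author>.+?)\s*(?:\(|$)")
# Photo IDs in this batch look like 000629-N-5686B-001.
PHOTO_ID = re.compile(r"\d{6}-[A-Z]-\d{4}[A-Z]-\d{3}")


def parse_caption(long_description):
    """Return date, location and author; a field is None when its regex fails."""
    fields = {"date": None, "location": None, "author": None}
    head = LOCATION_DATE.search(long_description)
    if head:
        fields["location"] = head.group("location").strip()
        fields["date"] = head.group("date").strip()
    credit = AUTHOR.search(long_description)
    if credit:
        fields["author"] = credit.group("author").strip()
    return fields


def build_title(url, short_description, prefix=""):
    """Combine the photo ID found in the URL with the short description."""
    match = PHOTO_ID.search(url)
    photo_id = match.group(0) if match else ""
    parts = [prefix, photo_id, short_description.strip()]
    return " ".join(p for p in parts if p) + ".jpg"
```

With this layout a caption whose date is not in brackets leaves `date` and `location` as `None`, so a real bot would need further patterns for such captions.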
Opinions
- There is a template for US Navy images, {{ID-USMil}}. You could use this, or create one only for the US Navy, and add it in the source.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
- Looks nice. I'll probably use it for the next files. Multichill (talk) 17:30, 19 October 2009 (UTC)
- I'm not sure if the ID should be stated first. Like on File:000629-N-5686B-001 Sailor Returns Home.jpg I think US Navy should be before the numbering.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
- Sure, so it would be File:US Navy 000629-N-5686B-001 Sailor Returns Home.jpg in this case. Multichill (talk) 17:30, 19 October 2009 (UTC)
- Some images like File:020121-N-5563S-003 .50-Caliber Machine Gun.jpg don't have date and location. This is because the date isn't in brackets. It is however between ")" and "--" or ")" and "–". I also don't know why the location isn't grabbed...--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
- Looks like I have to improve the regex to catch these cases. Both date and location use the same regex for matching. Multichill (talk) 17:30, 19 October 2009 (UTC)
- You don't need the ID in the description. Create or use a source template for the upload where the ID is stated and a link to the site is given.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
- I do to prevent naming collisions. Multichill (talk) 17:30, 19 October 2009 (UTC)
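The case raised above, where the date sits between ")" and "--" (or "–") instead of inside brackets, could be caught by trying a second pattern when the first one fails. A minimal sketch: `extract_date` is a hypothetical helper and the patterns are assumptions, not the bot's real regex.

```python
import re

DASH = r"(?:--|–)"
# Date in brackets directly before the dash: "... (Jan. 21, 2002) -- ..."
BRACKETED = re.compile(r"\(([^()]*\d{4})\)\s*" + DASH)
# Date between a closing bracket and the dash: "... (CVN 75) Jan. 21, 2002 -- ..."
UNBRACKETED = re.compile(r"\)\s*([^()]*?\d{4})\s*" + DASH)


def extract_date(long_description):
    """Try the bracketed form first, then the unbracketed fallback."""
    for pattern in (BRACKETED, UNBRACKETED):
        match = pattern.search(long_description)
        if match:
            return match.group(1).strip()
    return None
```

Requiring a four-digit year at the end of the captured text is what stops a hull number such as "(CVN 75)" from being mistaken for a date.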
Ok. Bot is changed to include the suggestions. Now it's running again. Multichill (talk) 19:50, 21 October 2009 (UTC)
- One small problem: it doesn't seem to like image descriptions with quotation marks in them, and so cuts off partway through, e.g. File:US Navy 071227-N-4014G-037 An MH-60S Seahawk assigned to the .jpg; File:US Navy 071227-N-6125G-184 ailors attached to the Nimitz-class aircraft carrier USS Harry S. Truman (CVN 75) enjoy a USO concert preformed by the band .jpg. Shimgray (talk) 21:18, 23 October 2009 (UTC)
- Ah, an escaping problem. This probably only happens to a couple of images. We can always move them to a better name if the current name is not clear. Multichill (talk) 21:58, 23 October 2009 (UTC)
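The exact cause of the truncation is not stated here. One way to make titles robust against quotation marks and HTML entities is to normalise the description before using it in a file name. A minimal sketch, with `clean_title_text` as a hypothetical helper; replacing double quotes by apostrophes is an assumed convention, not a Commons rule.

```python
import html
import re

# Characters that MediaWiki does not allow in page titles.
FORBIDDEN = re.compile(r"[#<>\[\]|{}]")
# Straight and curly double quotes, and the curly apostrophe, become a plain apostrophe.
QUOTES = str.maketrans({'"': "'", "\u201c": "'", "\u201d": "'", "\u2019": "'"})


def clean_title_text(text):
    """Decode entities, unify quote characters and drop forbidden characters."""
    text = html.unescape(text)       # "&ldquo;" becomes a curly quote
    text = text.translate(QUOTES)    # all quote variants become "'"
    text = FORBIDDEN.sub("", text)   # remove characters a title cannot contain
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_title_text('fires a "Sea Sparrow" missile')` gives `fires a 'Sea Sparrow' missile`.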
I worked my way through some of the aircraft carrier categories. Interesting! Still a lot of additional categorization to do though.
- 1. For one carrier, generally several temporary "aboard USS .." categories could be combined into one.
- 2. The ship based temporary categories seem more helpful than the ones for stable locations, e.g. "Arabian Gulf".
- 3. For the captions, maybe {{Original caption}} could have been used.
- 4. Minor point: given the small size of the license tag, it could have been included directly into {{Information}}.
- 5. It might be worth going through the descriptions by bot to wikify names of units, ships, etc., linking them to the corresponding articles at en.wp
-- User:Docu at 06:43, 25 October 2009 (UTC), edited 06:59, 25 October 2009 (UTC), 08:10, 25 October 2009 (UTC)
- 1. You should probably move it to topic categories right away. Maybe you could use a bot.
- 2. Stable locations only seem to be useful for photos on land.
- 3. That could have been used, but I didn't.
- 4. That could have been done.
- 5. Nice to see people working on this! Multichill (talk) 11:18, 25 October 2009 (UTC)
- I mentioned 3 and 4 mainly for future uploads. BTW I made a bot request at Commons:Bots/Requests/vertrepbot. -- User:Docu at 16:19, 26 October 2009 (UTC)
Hey Multichill, thanks for uploading all those Navy pics, I'm sorting through them now, looking for possible FP candidates.
A few things I found while I was looking through them:
- 1. Some images seem to have had something go wrong with their title during the upload; For example this one and this one. I'm assuming you meant to have 's around some words ('Sea Sparrow') but something's gone awry. You might want to fix it before you move on to the Army upload.
- 2. There also appears to be at least one categorisation error. If you take a look at Category:General views of USS Kearsarge (CV-33) and Category:Aboard USS Kearsarge (CV-33), you'll see a number of pictures of another ship of the same name; Category:USS Kearsarge (LHD-3). They are 2 different ships (the original Kearsarge was an Essex class carrier, scrapped in 1974), and while I'm happy to re-categorise them (I'll be helping with the whole batch), is there something you can do to prevent this sort of thing happening in future?
- 3. Though you stated at the Village Pump that you'd built in a duplicate checker, I've found quite a few duplicates as I've browsed the batch. For example, your file duplicates File:USS Port Royal (CG 73) aground.jpg and File:USS Port Royal grounded.jpg. Also, this one and File:AAV Embarking.jpg. There are a few others I've spotted as well, though I didn't note them down.
Hope this helps.
Sarcastic ShockwaveLover (talk) 22:09, 26 October 2009 (UTC)
- Hi Sarcastic ShockwaveLover,
- 1.: "^ldquo" in the title seems to come from "“" (the HTML entity &ldquo;) in the description.
- 2.: Thanks for noticing. It should be fixed now. It was correct when Multichill uploaded it. ;)
- 3.: If you look at the file size of this file, you will notice that File:USS Port Royal (CG 73) aground.jpg isn't a duplicate, but a scaled-down version. File:USS Port Royal (CG 73) aground.jpg should be tagged with {{Duplicate}} for deletion. The new file is an improvement over the old one. I found a few too and tagged the old ones for deletion.
- -- User:Docu at 14:41, 27 October 2009 (UTC)
- I'd rather you didn't delete this one, I rotated it and cropped it to correct the tilt, I'm planning on nominating it for FP status. Sarcastic ShockwaveLover (talk) 08:57, 28 October 2009 (UTC)
- 3. Yes, I listed it under "other versions" instead. Looking closer at it, it doesn't appear to be an exact duplicate or scaled-down version. The few images that slip through the bot's check are some where the file was edited (and not even scaled down), e.g. this one and File:AAV Embarking.jpg. -- User:Docu at 12:33, 28 October 2009 (UTC), edited 18:22, 28 October 2009 (UTC)
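This behaviour is what one would expect if the duplicate check compares files byte for byte, for example by hash: a rotated, cropped or rescaled copy has different bytes and passes unnoticed. A minimal sketch under that assumption (how the bot actually checks is not stated in this thread); the function names and the sample title are hypothetical, and short byte strings stand in for image files.

```python
import hashlib


def sha1_hex(data):
    """Hex SHA-1 digest of the raw file bytes."""
    return hashlib.sha1(data).hexdigest()


def find_exact_duplicate(data, known_files):
    """known_files maps SHA-1 hex digest -> title of an existing file.

    Returns the existing title for a byte-identical file, else None.
    Edited versions are not detected by this kind of check.
    """
    return known_files.get(sha1_hex(data))
```

Catching edited copies would need a perceptual comparison of the image content, which is outside the scope of this sketch.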
- The maximum length of file names that are being used seems to be 231 chars. While sometime in the distant future all filenames may have to be that long to be unique, I wonder if we couldn't [have] kept them shorter in the meantime. -- User:Docu at 17:34, 28 October 2009 (UTC) (inserted "have" on 04:42, 30 October 2009 (UTC))
- That would mean a mammoth renaming effort. It's already going to be huge just categorising them. That said, I think files like this one should be renamed. Also, perhaps we could put the categorisation/cleanup effort on the front page, much like that large German upload a few months back? It might help get some more people involved. I've categorised about 150 images so far using HotCat (thank God for that tool), but that's just a drop in the proverbial bucket. Sarcastic ShockwaveLover (talk) 11:58, 29 October 2009 (UTC)
- No, I don't think we should move them. The advantages of the current file names are that they are generally descriptive titles and it's the title the Navy published it with.
- The ^ldquo,/^rdquo,/^rsquo, could be fairly easy to fix (by an adminbot), there are approx. 3500 (Special:Search/^ldquo, OR ^rdquo, OR ^rsquo, prefix:File:US Navy 0). As we generally don't do cosmetics on file names, we could leave them that way though.
- The categorization part should be easier once my bot has created additional categories (see here). I probably should get to work on that.
- Besides these hardware based categories, there is still much to be done to create categories for specific events/operations etc. (e.g. Category:Vertical replenishment). It's fairly easy to build temporary categories from search results. One just needs to go through the category afterwards and remove a few false positives; most categories of FEMA officials were done that way. What generally threw it off were images of "A. on the phone with B." or "A. B. and C. (not pictured) attending Z.", but they were easy to sort. If you want me to prepare you some temporary categories to review, I'd be glad to do so. -- User:Docu at 04:42, 30 October 2009 (UTC)
- Please and thank you! Sarcastic ShockwaveLover (talk) 12:10, 30 October 2009 (UTC)
- I did the test run for the bot. BTW which searches would you want me to put into temporary categories? E.g. I used something like this to extend that cat. -- User:Docu at 15:29, 30 October 2009 (UTC)
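Building a temporary category from search results, as described above, could be sketched like this. The function name, the input format (a mapping of file title to description) and the list of suspect phrases are assumptions for illustration; the two phrases are the false-positive patterns mentioned in the thread.

```python
import re

# Phrases that, per the discussion above, tended to produce false positives.
SUSPECT = re.compile(r"\(not pictured\)|on the phone with", re.IGNORECASE)


def split_search_hits(term, descriptions):
    """Return (likely, needs_review) lists of titles whose description mentions term."""
    likely, needs_review = [], []
    for title, text in descriptions.items():
        if term.lower() not in text.lower():
            continue  # not a search hit at all
        if SUSPECT.search(text):
            needs_review.append(title)  # probably a false positive, check by hand
        else:
            likely.append(title)
    return likely, needs_review
```

The `needs_review` list is only a hint for manual sorting; the temporary category still has to be checked by a person afterwards.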
- I found two incomplete uploads (the only ones so far):
- -- User:Docu at 18:43, 31 October 2009 (UTC)
- It might be a bug in preview/thumb, looks ok in full resolution. -- User:Docu at 18:46, 31 October 2009 (UTC)
- I can confirm that they really are incomplete (also in full resolution). But it's the same at the source; only the preview at the source is fine. I can't fix it. Cropping would be the only solution. --Slick (talk) 07:56, 15 September 2012 (UTC)
- BTW These days, Emijrpbot is fixing the date format on this batch (sample: [1]) -- User:Docu at 14:00, 9 January 2010 (UTC)
- That's very nice. Multichill (talk) 14:16, 9 January 2010 (UTC)
- Can I archive this?--Diaa abdelmoneim (talk) 08:30, 25 April 2010 (UTC)
- The question is if the categorization has to be cleaned up before archiving or not.
Both the Starr batch and the 1st Geograph upload still have quite a few things to clean up, but the initial upload is done and further files could be in a new request. The first of these two had been archived, the second one not.
The Navy news one still has some 4000 location categories, some of which should be merged, others removed (I merged approx. 100 of these into 30 one or two weeks ago). Avron is doing quite a lot of categorization on these, but personally I had lost interest sometime last year. -- User:Docu at 08:40, 25 April 2010 (UTC)
- Looks like most images get categorized by ship. I'm now adding temporary ship categories to help in this process. Multichill (talk) 08:53, 1 May 2010 (UTC)
- Some of the location categories were already in the form "Images from US Navy, Location Aboard <ship name>", with tons of spelling variations. Many of these were merged into "Aboard <ship name>" categories. Would be great if you'd help with that too. -- User:Docu at 09:14, 1 May 2010 (UTC)
- I spent some time on categorization. I first added a lot of temp ship categories and then moved images to real ship categories. I now changed the upload bot to first try to find a real ship category, fall back to a temp ship category, or add a location category if no ship is found. I also nuked a lot of not so useful location categories (mainly seas). The aboard categories still have to be done. The same strategy could probably be applied. So the next step in big categorization is to either match a temp category with a real category (if it makes sense) or empty it out and nuke it (if the category doesn't make sense). Probably makes sense to start with the biggest temp categories, who wants to help? Multichill (talk) 11:36, 23 May 2010 (UTC)
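The fallback order described above (real ship category, then temporary ship category, then location category) might look like this in outline. The regex, the `TEMP_PREFIX` value and the function name are illustrative assumptions; only the location prefix follows the category form quoted earlier in this thread.

```python
import re

# Matches e.g. "USS Harry S. Truman (CVN 75)" or "USS Kearsarge (LHD-3)".
SHIP = re.compile(r"USS\s+(?P<name>[A-Z][A-Za-z.' ]+?)\s+\((?P<hull>[A-Z]+)[ -]?(?P<num>\d+)\)")
TEMP_PREFIX = "Temporary ship category, "          # hypothetical naming
LOCATION_PREFIX = "Images from US Navy, Location "  # form quoted in the thread


def choose_category(description, existing_categories, location=None):
    """Pick a real ship category, else a temp ship category, else a location one."""
    match = SHIP.search(description)
    if match:
        # Normalise the hull number so "LHD 3" and "LHD-3" name the same ship.
        ship = f"USS {match['name']} ({match['hull']}-{match['num']})"
        if ship in existing_categories:
            return ship
        return TEMP_PREFIX + ship
    if location:
        return LOCATION_PREFIX + location
    return None
```

Including the hull number in the lookup is what keeps ships that share a name apart, such as the two ships named USS Kearsarge mentioned above.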
Assigned to | Progress | Bot name |
---|---|---|
Multichill | Finished the initial upload, now resyncing and categorization | BotMultichillT |