Commons:Batch uploading/Navy News Service

From Wikimedia Commons, the free media repository

Navy News Service

The FEMA request got me started. The US Navy has about 75,000(!) images available at http://www.navy.mil/view_photos_top.asp just waiting to be copied to Commons. I wrote a bot based on the FEMA upload.

  • The bot loops over all the images.
  • From the META fields I get the URL, long description and short description.
  • A regex extracts the date from the long description.
  • A regex extracts the author from the long description.
  • A regex extracts the location from the long description.
  • The title is constructed based on the URL and the short description.
  • The image is uploaded and ends up in one of these categories.
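
The extraction and title steps above could be sketched roughly as follows. This is only an illustration, not the bot's actual source: the caption layout, the sample values, the regexes, the example.org URL and the names `parse_description` and `build_title` are all assumptions made for this example.

```python
import re

# Assumed caption layout: "LOCATION (date) - text. U.S. Navy photo by AUTHOR (RELEASED)"
LOCATION_DATE_RE = re.compile(r"^\s*(?P<location>[^()]+?)\s*\((?P<date>[^()]+)\)")
AUTHOR_RE = re.compile(
    r"U\.S\. Navy photo by (?P<author>.+?)\s*(?:\(RELEASED\)|$)", re.IGNORECASE
)


def parse_description(long_description):
    """Return date, author and location from a long description (None if not found)."""
    fields = {"date": None, "author": None, "location": None}
    match = LOCATION_DATE_RE.search(long_description)
    if match:
        fields["location"] = match.group("location")
        fields["date"] = match.group("date")
    match = AUTHOR_RE.search(long_description)
    if match:
        fields["author"] = match.group("author").rstrip(". ")
    return fields


def build_title(url, short_description):
    """Combine the image ID taken from the URL with the short description."""
    # Assumes the file name in the URL is the image ID (hypothetical URL layout).
    image_id = url.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return "%s %s.jpg" % (image_id, short_description.strip())
```

For instance, `build_title("http://example.org/photos/000629-N-5686B-001.jpg", "Sailor Returns Home")` would give a title in the "ID first" form discussed under Opinions.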

This is just a general overview. The source is available here. Multichill (talk) 16:48, 16 October 2009 (UTC)

Opinions

  1. There is a template for US Navy images, {{ID-USMil}}. You could use it, or create one just for the US Navy, and add it in the source.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
    Looks nice. I'll probably use it for the next files. Multichill (talk) 17:30, 19 October 2009 (UTC)
  2. I'm not sure the ID should come first. On File:000629-N-5686B-001 Sailor Returns Home.jpg, for example, I think "US Navy" should come before the number.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
    Sure, so it would be File:US Navy 000629-N-5686B-001 Sailor Returns Home.jpg in this case. Multichill (talk) 17:30, 19 October 2009 (UTC)
  3. Some images, like File:020121-N-5563S-003 .50-Caliber Machine Gun.jpg, don't have a date and location. This is because the date isn't in brackets. It is, however, between ")" and "--" or between ")" and "–". I also don't know why the location isn't grabbed...--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
    Looks like I have to improve the regex to catch these cases. Both date and location use the same regex for matching. Multichill (talk) 17:30, 19 October 2009 (UTC)
  4. You don't need the ID in the description. Create or use a source template for the upload where the ID is stated and a link to the site is given.--Diaa abdelmoneim (talk) 17:49, 16 October 2009 (UTC)
    I do, to prevent naming collisions. Multichill (talk) 17:30, 19 October 2009 (UTC)
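
One way the regex fix discussed under point 3 might look is a second pattern that is tried when the bracketed form fails. This is a hedged sketch, not the bot's real code: the patterns, the requirement that a date ends in a four-digit year, the invented sample captions and the function name `extract_location_and_date` are all assumptions.

```python
import re

# Primary form (assumed): "LOCATION (date) ...", where the date ends in a 4-digit year.
BRACKETED_RE = re.compile(r"^\s*(?P<location>[^()]+?)\s*\((?P<date>[^()]*\d{4})\)")
# Fallback form: the date sits between a closing ")" and "--" or an en dash.
UNBRACKETED_RE = re.compile(
    r"^\s*(?P<location>.+?\))\s*(?P<date>[^()]*?\d{4})\s*(?:--|\u2013)"
)


def extract_location_and_date(long_description):
    """Try the bracketed form first, then the ') date --' form; (None, None) if neither fits."""
    for pattern in (BRACKETED_RE, UNBRACKETED_RE):
        match = pattern.search(long_description)
        if match:
            return match.group("location"), match.group("date")
    return None, None
```

Requiring a year keeps a hull number such as "(CVN 00)" from being mistaken for a bracketed date, which is why the first pattern falls through to the second for captions like "At sea aboard USS Example (CVN 00) Jan. 21, 2002 -- ...".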

OK. The bot has been changed to include the suggestions and is now running again. Multichill (talk) 19:50, 21 October 2009 (UTC)

  1. One small problem: it doesn't seem to like image descriptions with quotation marks in them, and so cuts off partway through, e.g. File:US Navy 071227-N-4014G-037 An MH-60S Seahawk assigned to the .jpg; File:US Navy 071227-N-6125G-184 ailors attached to the Nimitz-class aircraft carrier USS Harry S. Truman (CVN 75) enjoy a USO concert preformed by the band .jpg. Shimgray (talk) 21:18, 23 October 2009 (UTC)
    Ah, an escaping problem. This probably only happens to a couple of images. We can always move them to a better name if the current name is not clear. Multichill (talk) 21:58, 23 October 2009 (UTC)
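
The exact cause of the truncation isn't stated here, but a title builder that tolerates quotation marks could look something like the sketch below. The function name `clean_title`, the choice to turn double quotes into apostrophes, the set of replaced characters, the length cap and the sample description are all assumptions for illustration.

```python
import re

# Characters MediaWiki rejects in page titles, plus ":" and "/", which are risky in file names.
FORBIDDEN_RE = re.compile(r"[#<>\[\]|{}:/]")


def clean_title(image_id, short_description, max_length=200):
    """Build 'US Navy <ID> <description>.jpg' without cutting off at quotation marks."""
    # Keep the quoted words; only swap the double quotes for apostrophes.
    text = short_description.replace('"', "'")
    text = FORBIDDEN_RE.sub("-", text)
    text = re.sub(r"\s+", " ", text).strip()
    title = "US Navy %s %s" % (image_id, text)
    # Cap the length before adding the extension.
    return title[:max_length].rstrip() + ".jpg"
```

With this approach a description such as `An MH-60S Seahawk assigned to the "Example Squadron" lands` keeps the words after the opening quote instead of ending at "assigned to the".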

I worked my way through some of the aircraft carrier categories. Interesting! Still a lot of additional categorization to do though.

  1. For one carrier, generally several temporary "aboard USS .." categories could be combined into one.
  2. The ship-based temporary categories seem more helpful than the ones for stable locations, e.g. "Arabian Gulf".
  3. For the captions, maybe {{Original caption}} could have been used.
  4. Minor point: given the small size of the license tag, it could have been included directly into {{Information}}.
  5. It might be worth going through the descriptions by bot to wikify names of units, ships, etc., linking them to the corresponding articles at en.wp.

-- User:Docu at 06:43, 25 October 2009 (UTC), edited 06:59, 25 October 2009 (UTC), 08:10, 25 October 2009 (UTC)

  1. You should probably move it to topic categories right away. Maybe you could use a bot.
  2. Stable locations only seem to be useful for photos on land.
  3. That could have been used, but I didn't.
  4. That could have been done.
Nice to see people working on this! Multichill (talk) 11:18, 25 October 2009 (UTC)
I mentioned 3 and 4 mainly for future uploads. BTW I made a bot request at Commons:Bots/Requests/vertrepbot. -- User:Docu at 16:19, 26 October 2009 (UTC)


Hey Multichill, thanks for uploading all those Navy pics, I'm sorting through them now, looking for possible FP candidates.

A few things I found while I was looking through them:

  • 1. Some images seem to have had something go wrong with their title during the upload, for example this one and this one. I'm assuming you meant to have 's around some words ('Sea Sparrow'), but something's gone awry. You might want to fix it before you move on to the Army upload.

Hope this helps.

Sarcastic ShockwaveLover (talk) 22:09, 26 October 2009 (UTC)

Hi Sarcastic ShockwaveLover,
  • 1.: "^ldquo" in the title seems to come from "&ldquo;" in the description.
  • 2.: Thanks for noticing. It should be fixed now. It was correct when Multichill uploaded it. ;)
  • 3.: If you look at the file size of this file, you will notice that File:USS Port Royal (CG 73) aground.jpg isn't a duplicate, but a scaled-down version. File:USS Port Royal (CG 73) aground.jpg should be tagged with {{Duplicate}} for deletion. The new file is an improvement over the old one. I found a few more too and tagged the old ones for deletion.
-- User:Docu at 14:41, 27 October 2009 (UTC)
  • The maximum length of file names that are being used seems to be 231 chars. While sometime in the distant future all filenames may have to be that long to be unique, I wonder if we couldn't [have] kept them shorter in the meantime. -- User:Docu at 17:34, 28 October 2009 (UTC) (inserted "have" on 04:42, 30 October 2009 (UTC))
    • That would mean a mammoth renaming effort. It's already going to be huge just categorising them. That said, I think files like this one should be renamed. Also, perhaps we could put the categorisation/cleanup effort on the front page, much like that large German upload a few months back? It might help get some more people involved. I've categorised about 100 to 150 images so far using HotCat (thank God for that tool), but that's just a drop in the proverbial bucket. Sarcastic ShockwaveLover (talk) 11:58, 29 October 2009 (UTC)
  • No, I don't think we should move them. The advantages of the current file names are that they are generally descriptive titles and it's the title the Navy published it with.
  • The ^ldquo,/^rdquo,/^rsquo, could be fairly easy to fix (by an adminbot), there are approx. 3500 (Special:Search/^ldquo, OR ^rdquo, OR ^rsquo, prefix:File:US Navy 0). As we generally don't do cosmetics on file names, we could leave them that way though.
  • The categorization part should be easier once my bot has created additional categories (see here). I probably should get to work on that.
  • Besides these hardware-based categories, there is still much to be done to create categories for specific events, operations, etc. (e.g. Category:Vertical replenishment). It's fairly easy to build temporary categories from search results. One just needs to go through the category afterwards and remove a few false positives; most categories of FEMA officials were done that way. What generally threw it off were images of "A. on the phone with B." or "A., B. and C. (not pictured) attending Z.", but they were easy to sort. If you want me to prepare some temporary categories for you to review, I'd be glad to do so. -- User:Docu at 04:42, 30 October 2009 (UTC)
-- User:Docu at 18:43, 31 October 2009 (UTC)
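
If the ^ldquo,/^rdquo,/^rsquo, leftovers were to be fixed after all, working out the corrected name for each file could be as simple as the sketch below. It only computes the new title; the actual moves would still need a bot with the right permissions. The token forms, the choice of a plain apostrophe as replacement, the function name `fix_entity_debris` and the sample title are assumptions.

```python
import re

# Assumed form of the leftovers: "^ldquo," "^rdquo," "^rsquo," (trailing comma optional).
ENTITY_DEBRIS = {"ldquo": "'", "rdquo": "'", "rsquo": "'"}
DEBRIS_RE = re.compile(r"\^(ldquo|rdquo|rsquo),?")


def fix_entity_debris(title):
    """Replace leftover entity names in a file title with a plain apostrophe."""
    return DEBRIS_RE.sub(lambda match: ENTITY_DEBRIS[match.group(1)], title)
```

Titles without any of these tokens are returned unchanged, so the function could be run over the whole search result safely.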
  • Can I archive this?--Diaa abdelmoneim (talk) 08:30, 25 April 2010 (UTC)
    • The question is if the categorization has to be cleaned up before archiving or not.
      Both the Starr batch and the 1st Geograph upload still have quite a few things to clean up, but the initial upload is done and further files could be in a new request. The first of these two had been archived, the second one not.
      The Navy News one still has some 4000 location categories, some of which should be merged and others removed (I merged approx. 100 of these into 30 one or two weeks ago). Avron is doing quite a lot of categorization on these, but personally I lost interest sometime last year. -- User:Docu at 08:40, 25 April 2010 (UTC)
      • Looks like most images get categorized by ship. I'm now adding temporary ship categories to help in this process. Multichill (talk) 08:53, 1 May 2010 (UTC)
        • Some of the location categories were already in the form "Images from US Navy, Location Aboard <ship name>", with tons of spelling variations. Many of these were merged into "Aboard <ship name>" categories. Would be great if you'd help with that too. -- User:Docu at 09:14, 1 May 2010 (UTC)
      • I spent some time on categorization. I first added a lot of temp ship categories and then moved images to real ship categories. I have now changed the upload bot to first try to find a real ship category, fall back to a temp ship category, or add a location category if no ship is found. I also nuked a lot of not-so-useful location categories (mainly seas). The aboard categories still have to be done; the same strategy could probably be applied. So the next step in the big categorization is to either match a temp category with a real category (if it makes sense) or empty it out and nuke it (if the category doesn't make sense). It probably makes sense to start with the biggest temp categories. Who wants to help? Multichill (talk) 11:36, 23 May 2010 (UTC)
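
The fallback order described in the last comment (real ship category, then temp ship category, then location category) can be sketched as below. This is an illustration only: the lookup table, the ship-name regex, the category name patterns and the function name `choose_category` are hypothetical, and a real bot would look up existing categories on the wiki rather than in a hard-coded dictionary.

```python
import re

# Hypothetical mapping from ship names as written in captions to existing categories.
REAL_SHIP_CATEGORIES = {
    "USS Harry S. Truman (CVN 75)": "USS Harry S. Truman (CVN-75)",
}
# Assumed caption form for ships: "USS <name> (<hull type> <number>)".
SHIP_RE = re.compile(r"\b(USS [A-Z][\w .'-]*? \([A-Z-]+ \d+\))")


def choose_category(long_description, location):
    """Pick a real ship category, else a temp ship category, else a location category."""
    ship = SHIP_RE.search(long_description)
    if ship:
        name = ship.group(1)
        if name in REAL_SHIP_CATEGORIES:
            return REAL_SHIP_CATEGORIES[name]
        # Temp ship category; the naming pattern is made up for this sketch.
        return "Images from US Navy, ship %s" % name
    if location:
        return "Images from US Navy, location %s" % location
    return None
```

Matching a temp category to a real one later then amounts to adding an entry to the mapping and moving the files.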
Assigned to: Multichill
Progress: Finished the initial upload, now resyncing and categorization
Bot name: BotMultichillT