Commons:Batch uploading/World Digital Library

From Wikimedia Commons, the free media repository

Images from World Digital Library

New site with PD-images - http://www.wdl.org. Contains 1170 items --Butko (talk) 06:52, 22 April 2009 (UTC)

User:Sj has shown interest in working on this upload. Looks like a very nice collection. Some points:
  • The items have an id (http://www.wdl.org/en/item/100/), so easy to loop over
  • The description of the items is available in a lot of languages, we should use that
  • Lots of metadata is available, this should make categorization easier
  • One item can contain multiple files. We should be aware of that
  • Files are available in the tiff file format. We should either have tiff thumbnails or upload tiff and a jpg version (transcoding!)
  • Experience and code gained with the usgov uploads should be (re)used
  • Some items have curator videos, might be fun to upload too
Multichill (talk) 14:13, 8 November 2009 (UTC)
Aside: There's a lot of interest in using data from how these images are used in encyclopedia articles, and how traffic is driven to the original archives, to inspire more libraries to take part in WDL. +sj + 14:14, 8 November 2009 (UTC)

Any progress? -- RE rillke questions? 18:29, 4 June 2012 (UTC)

Thanks for the reminder. They've done a batch of updates recently; I'll see if I can get a dump next week before finding a suitable scraper. --SJ+ 06:52, 21 June 2012 (UTC)
Hi there. I'm the Wikipedian in Residence at the World Digital Library. Some content is already found here: Category:Images from the World Digital Library. What you do is up to you, but WDL asked that facilitating mass uploads and encouraging extensive uploading not be part of my scope this year. This is at the request of the majority of their partners. Of course, I can't control what others do if something is in the public domain, and I do upload occasional images. Do note: not all content on WDL is public domain. There is post-1923 content (per US law) on the site, and much of that content was not created by federal/government entities. So make sure you check each page accordingly. (This includes content from the Florida State Library, for example.) Sarah (talk) 04:06, 13 July 2013 (UTC)

2014 Uploads

@Sj, SarahStierch, and Rillke: I am revisiting these based on an (independent) email request earlier this week from 維基小霸王. Progress below. It may take 5 years but we get there eventually. :-) -- Fæ (talk) 11:16, 28 February 2014 (UTC)

Thank you, Fæ.--維基小霸王 (talk) 10:40, 3 April 2014 (UTC)
Technical, comments
Format
File:<title[<=200 characters]> WDL<id>.{png, pdf, jpg}

Where <id> is the WDL database number. The title may be trimmed to below 200 characters by breaking off sentences; some titles are long because additional sentences add details. The title is taken from the English version of the page but may include non-English characters such as the under-dot and over-bar in "Ṭahmāsp", so the Commons file name relies on UTF-8 rather than being limited to ASCII.
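
The naming rule above can be sketched roughly as follows. This is a minimal illustration, not the uploader's actual script: the function name build_filename, the sentence-splitting rule (cutting at the last ". ") and the hard cut used as a fallback are all assumptions.

```python
def build_filename(title: str, wdl_id: int, ext: str, limit: int = 200) -> str:
    """Return '<title> WDL<id>.<ext>', trimming the title to at most `limit` characters.

    Hypothetical helper: whole trailing sentences are dropped first, as the
    page describes; the hard cut is only a fallback for one very long sentence.
    """
    title = title.strip()
    # Drop trailing sentences one at a time while the title is still too long.
    while len(title) > limit and ". " in title:
        title = title.rsplit(". ", 1)[0]
    # Fallback (assumption): cut a single over-long sentence at the limit.
    title = title[:limit].rstrip()
    # Python strings are Unicode, so characters such as those in "Ṭahmāsp"
    # count as one character each and pass through unchanged.
    return f"{title} WDL{wdl_id}.{ext}"
```

For example, build_filename("Treatise on Geometry", 7107, "pdf") gives "Treatise on Geometry WDL7107.pdf".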

First (and possibly only) run limited to photographs which are single items unless in pdf format. Other formats such as mp3 files exist on the site and would require pre-processing; whether this is worth the time to batch automate will depend on volumes and interest.

Samples
Copyright issues
  1. File:Bombed Copy of “Defensor pacis” WDL11254.png WDL was impossible to filter automatically. There is no date for the photograph, nor any detail about the photographer, only details of the object. The same photograph (in a different tint and size) appears at kb.dk (The Royal Library of Denmark) and may be CC-BY-NC-ND under the website terms; however, the absence of a copyright status or back-link when released through WDL may indicate that this general term was not intended to apply to this individual photo.
Completeness issues
  1. Books like File:On Aristotle’s “On the Heavens” WDL7106.jpg are incomplete: only the first page is uploaded.--維基小霸王 (talk) 04:59, 3 March 2014 (UTC)
    I think the easiest solution is to identify these as a backlog list and then create them by hand using the existing text with the original then being deleted as an inferior duplicate. This example turns into:
    On Aristotle’s “On the Heavens” WDL7106 V1.pdf & On Aristotle’s “On the Heavens” WDL7106 V2.pdf
    Others fixed
    Please add any more to be processed to Images uploaded by Fæ (check needed) as they will then appear in this live list.
Rotation issues

Book scans like File:Treatise on Geometry WDL7107.pdf require rotation.--維基小霸王 (talk) 09:42, 3 March 2014 (UTC)

I have raised this on Commons:Bots/Work_requests#Rotating_books as someone may have created a tool to do this already. -- Fæ (talk) 09:52, 3 March 2014 (UTC)
Resolution issues

The resolution of book scans like File:Theater of Instruments and Machines WDL4305.pdf is too low to make out the text. Resolution can be improved by reconstructing the PDFs from pictures, although the result is still not very clear.--維基小霸王 (talk) 01:40, 4 March 2014 (UTC)

I agree that a resolution of 675px is disappointingly small but is just about usable. I am not sure it would be worth the effort of restitching a new book unless the scanned pages were significantly improved and the available WDL png versions in this case are only 836px wide. These may be a case of attempting to find better scans elsewhere and re-uploading with new files rather than bothering with the existing WDL copy.
Checking all the books in Books from the World Digital Library, 93%+ have a width of over 800px, in fact a large proportion are over 2,000px. List for analysis below.
If any are definitely unusable, I encourage raising a deletion request under COM:SCOPE and we can discuss if there are possible replacements or if repairs are worth the effort. -- Fæ (talk) 09:48, 4 March 2014 (UTC)
I ran a little experiment with File:Theater of Instruments and Machines WDL4305.pdf by bot-downloading all the png versions of the pages, converting to very high quality jpegs and compiling them into a new pdf. The result was changing from an 8MB file to a 252MB file, so a bit more compression would probably be okay, though a reader can now see the pages at 2,000px across rather than 675px. This is all expensive in volunteer time and a bit too complex to automate, so I would only expect to do this work for a handful of desirable books. -- Fæ (talk) 20:12, 9 March 2014 (UTC)
Progress
Assigned to, task Progress Category
Fæ, mapping to {{Artwork}} using BeautifulSoup in Python, including capturing all available languages ('en', 'fr', 'ru', 'ar', 'es', 'pt', 'zh'). I was particularly pleased to get both Chinese and Arabic working on the image pages. Status:    Done -
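
As a rough illustration of the mapping step, the sketch below builds {{Artwork}} wikitext from per-language strings. It is not the script that was used: the helper names, the choice of fields (title, description, date, source) and the use of {{en|1=...}}-style language templates are assumptions, and the BeautifulSoup scraping that would produce the input dictionaries is left out.

```python
# The seven WDL languages listed on this page.
LANGS = ("en", "fr", "ru", "ar", "es", "pt", "zh")


def multilingual(texts: dict) -> str:
    """Wrap each available translation in a Commons language template."""
    return "\n".join(
        "{{" + lang + "|1=" + texts[lang] + "}}"
        for lang in LANGS
        if texts.get(lang)  # skip languages with no text
    )


def artwork_wikitext(titles: dict, descriptions: dict, date: str, source: str) -> str:
    """Assemble a minimal {{Artwork}} block (hypothetical field selection)."""
    return "\n".join([
        "{{Artwork",
        " |title = " + multilingual(titles),
        " |description = " + multilingual(descriptions),
        " |date = " + date,
        " |source = " + source,
        "}}",
    ])
```

Keeping the language order fixed makes the output stable between runs, which helps when comparing re-generated pages.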
Fæ, intelligent categorization filter based on WDL location and keywords. (Considered adding a grandparent/parent/child check on the category list, but this could be a later Faebot project to apply to many batch uploads and need not be a dependency here as the categories appear "reasonable".) Status:    Done -
Fæ, intelligent licensing to skip post-1923 works.

May miss some files due to WDL layout inconsistencies. The filter skips an item when its creator or date field contains a year from 1923 to 2019, that is, when re.search("192[3-9]|19[3-9]\d|20[0-1]\d", creator['en']) or re.search("192[3-9]|19[3-9]\d|20[0-1]\d", date) matches. Because the creator field usually includes life dates, this will cause some Public Domain images to be skipped (such as the Orville Wright item of 1903 below, where the creator's death year 1948 matches), but the law of diminishing returns on programmer time applies...

Examples of filtered images

Source on WDL

File:Pay Off of Spec—the Good Old Times WDL11199.png 
**********************************************************************
Date/Creator seems to have a date problem
 Creator: White, Miles, 1914-2000 
 Date: 1952 

Source on WDL

File:Snapshot from an Airplane- The Harbor of Copenhagen WDL11255.png 
**********************************************************************
Date/Creator seems to have a date problem
 Creator: Damgaard, Holger, 1870–1945 
 Date: 1913 

Source on WDL

File:Telegram from Orville Wright in Kitty Hawk, North Carolina, to His Father Announcing Four Successful Flights, 1903 December 17 WDL11372.png 
**********************************************************************
Date/Creator seems to have a date problem
 Creator: Wright, Orville, 1871-1948 
 Date: 1903-12-17 

Source on WDL

File:Kirka Sharif, the Shrine Where the Mantle of the Prophet is Preserved WDL11531.png 
**********************************************************************
Date/Creator seems to have a date problem
 Creator: Simpson, Benjamin, Sir, 1831-1923 
 Date: {{between|1878|1880}}
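
The date filter described above can be reproduced with a few lines of Python. The regular expression is the one quoted on this page; the function name is_filtered and its signature are assumptions made for illustration.

```python
import re

# Pattern quoted on this page: matches any year from 1923 to 2019.
POST_1922 = re.compile(r"192[3-9]|19[3-9]\d|20[0-1]\d")


def is_filtered(creator_en: str, date: str) -> bool:
    """Return True when the item would be skipped by the date filter.

    Hypothetical wrapper: either field containing a matching year is enough.
    """
    return bool(POST_1922.search(creator_en) or POST_1922.search(date))
```

Applied to the examples listed above, is_filtered("Wright, Orville, 1871-1948", "1903-12-17") is True even though the item dates from 1903, because the death year in the creator field matches. This is the over-filtering accepted as a trade-off.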

Licenses being used are not great; however, with the ample metadata being included in the Artwork template, this might be easy to refine if there are suggested improvements.

Status:    Done -
Fæ, upload first batch photographs (~1,500 done)

During the first 400 or so I found a number of bugs/improvements (such as improved categorization). A couple of uploads may end up getting deleted due to uncertain copyright; however, the date filter should now be adequate for the rest of the run.

Upload second batch in parallel: books (max 877)

Status:    In progress Catscan report
{Images from the World Digital Library & Images uploaded by Fæ}
The WDL category started with 250 images already uploaded.

Second category specifically for books (pdf) Books from the World Digital Library

Fæ, decide what to do about uploads over 100MB, for example 10630, which is over 180MB in size.

These have to be handled manually as there is no readily available batch process; list below.

Log of WDL pdf/book files larger than 100MB
  1. The Qur'an in the Earliest Printed Version 210MB
  2. The First Folio of Shakespeare WDL11290.pdf 156MB
  3. Book of the Holy Gospel of Our Lord and God Jesus Christ WDL9917.pdf 103MB
  4. The Lincoln Bible WDL11358.pdf 367MB
    File:The Spiritual Couplets.pdf [1] Manuscript copy appears to have been created in 2006.
  5. File:An Examination of the Talents Required for the Sciences WDL10630.pdf 182MB
  6. El melopeo y maestro- Treatise on the Theory and Practice of Music WDL10633.pdf 358MB
  7. Account of the Composition of the Human Body
    http://www.wdl.org/en/item/10631/
    PDF 102.2 MB
  8. Kiev with Its Oldest School, the Academy
    http://www.wdl.org/en/item/10644/
    PDF 144.6 MB
  9. Complete Book on the Judgment of the Stars
    http://www.wdl.org/en/item/10671/
    PDF 124.3 MB
  10. Commentary on the Chapter Nine of the Book of Medicine Dedicated to Mansur
    http://www.wdl.org/en/item/10672/
    PDF 105.2 MB
  11. Compendium of Medical Texts by Mesue, with Additional Writings by Various Authors
    http://www.wdl.org/en/item/10674/
    PDF 111.5 MB
  12. The Greater Luminary
    http://www.wdl.org/en/item/10677/
    PDF 173.7 MB
  13. The Seven Books on the Therapeutic Method, Which Is the Art of Curing, by John of Damascus from the Decapolis, Major Medical Authority among the Arabs
    http://www.wdl.org/en/item/10678/
    PDF 110.4 MB
  14. Rosary and Service Dedicated to the Blessed Virgin Mary and Other Devotions Combined in Honor of the Most Holy Trinity and in Worship of the Most Venerable Queen of the Heavens
    http://www.wdl.org/en/item/11299/
    PDF 162.9 MB
  15. City of God
    http://www.wdl.org/en/item/11301/
    PDF 128.9 MB
  16. Book of Effects of Drugs
    http://www.wdl.org/en/item/11307/
    PDF 112.3 MB
  17. Arabia- The Cradle of Islam
    http://www.wdl.org/en/item/11757/
    PDF 102.2 MB
  18. Life in the Desert, or, Recollections of Travel in Asia and Africa
    http://www.wdl.org/en/item/11762/
    PDF 138.9 MB
  19. The Penetration of Arabia- A Record of the Development of Western Knowledge Concerning the Arabian Peninsula
    http://www.wdl.org/en/item/11764/
    PDF 101.8 MB
  20. Personal Narrative of a Year's Journey through Central and Eastern Arabia (1862-63)
    http://www.wdl.org/en/item/11765/
    PDF 125.2 MB
  21. From the Indus to the Tigris
    http://www.wdl.org/en/item/11793/
    PDF 126.9 MB
  22. Narrative of a Journey into Khorasan, in the Years 1821 and 1822
    http://www.wdl.org/en/item/11797/
    PDF 231.7 MB
  23. Strolls Around Tobol'sk in 1830
    http://www.wdl.org/en/item/196/
    PDF 110.6 MB
  24. Guide to the Great Siberian Railway
    http://www.wdl.org/en/item/198/
    PDF 351.6 MB
  25. The Amazon and Madeira Rivers- Sketches and Descriptions from the Note-Book of an Explorer
    http://www.wdl.org/en/item/211/
    PDF 104.9 MB
  26. Through the Brazilian Wilderness, by Theodore Roosevelt- With Illustrations from Photographs by Kermit Roosevelt and Other Members of the Expedition
    http://www.wdl.org/en/item/215/
    PDF 153.6 MB
  27. A Journal of Captain Cook's Last Voyage to the Pacific Ocean, and in Quest of a North-West Passage Between Asia & America, Performed in the Years 1776, 1777, 1778, and 1779
    http://www.wdl.org/en/item/23/
    PDF 183.8 MB
  28. White Isles of the South Sea- History of the Apostolic Vicariate of the Gilbert and Ellice Archipelagoes, by Father Fernand Hartzer
    http://www.wdl.org/en/item/2364/
    PDF 254.6 MB
  29. Funafuti; Or Three Months on a Coral Island- An Unscientific Account of a Scientific Expedition
    http://www.wdl.org/en/item/2365/
    PDF 121.6 MB
  30. Nepal and the Himalayan Countries
    http://www.wdl.org/en/item/2368/
    PDF 113.9 MB
  31. An Account of the Kingdom of Nepal
    http://www.wdl.org/en/item/2372/
    PDF 236.3 MB
  32. Journal of a Tour through Part of the Snowy Range of the Himala Mountains, and to the Sources of the Rivers Jumna and Ganges
    http://www.wdl.org/en/item/2373/
    PDF 473.2 MB
  33. The History of Genghizcan the Great, First Emperor of the Antient Moguls and Tartars
    http://www.wdl.org/en/item/2378/
    PDF 339.8 MB
  34. Germany and Its Colonies- Travels through the Empire and Its Overseas Possessions, with the Collaboration of Arthur Achleitner, Johannes Biernatzki, et al.
    http://www.wdl.org/en/item/2383/
    PDF 542.3 MB
  35. Bhotan and the Story of the Dooar War
    http://www.wdl.org/en/item/2387/
    PDF 389.9 MB
  36. A Narrative of the Mission Sent by the Governor-General of India to the Court of Ava in 1855, with Notices of the Country, Government, and People
    http://www.wdl.org/en/item/2391/
    PDF 363.1 MB
  37. Portuguese Possessions in Oceania
    http://www.wdl.org/en/item/2398/
    PDF 371.5 MB
  38. The History of Persia
    http://www.wdl.org/en/item/2399/
    PDF 334.3 MB
  39. Description of Egypt. First Edition. Antiquities, Descriptions, Volume One
    http://www.wdl.org/en/item/2410/
    PDF 288.8 MB
  40. Description of Egypt. First Edition. Antiquities, Descriptions, Volume Two
    http://www.wdl.org/en/item/2411/
    PDF 332.0 MB
  41. Description of Egypt. First Edition. Antiquities, Essays, Volume One
    http://www.wdl.org/en/item/2412/
    PDF 312.4 MB
  42. Description of Egypt. First Edition. Antiquities, Essays, Volume Two
    http://www.wdl.org/en/item/2413/
    PDF 119.6 MB
  43. Description of Egypt. First Edition. Modern State, Volume One
    http://www.wdl.org/en/item/2414/
    PDF 433.3 MB
  44. Description of Egypt. First Edition. Modern State, Volume Two
    http://www.wdl.org/en/item/2415/
    PDF 306.9 MB
  45. Description of Egypt. First Edition. Modern State, Volume Two (Additional)
    http://www.wdl.org/en/item/2416/
    PDF 301.7 MB
  46. Description of Egypt. First Edition. Natural History, Volume One
    http://www.wdl.org/en/item/2417/
    PDF 356.0 MB
  47. Description of Egypt. First Edition. Natural History, Volume Two
    http://www.wdl.org/en/item/2418/
    PDF 324.3 MB
  48. Geographical Description and Governmental Administration and Settlement of the Spanish Colonies in the Gulf of Guinea
    http://www.wdl.org/en/item/2425/
    PDF 288.1 MB
  49. Narrative of an Expedition to Explore the River Zaire, Usually Called the Congo, in South Africa, in 1816
    http://www.wdl.org/en/item/2476/
    PDF 378.3 MB
  50. The History of the Caribby-Islands
    http://www.wdl.org/en/item/252/
    PDF 243.0 MB
  51. The African West and Catholic Missions, Congo and Oubangi
    http://www.wdl.org/en/item/2530/
    PDF 229.4 MB
  52. Explorations in Africa, By Dr. David Livingstone, and Others, Giving a Full Account of the Stanley-Livingstone Expedition of Search, under the Patronage of the New York Herald, as Furnished by Dr. Livingstone and Mr. Stanley
    http://www.wdl.org/en/item/2564/
    PDF 300.4 MB
  53. Mister Johann Anderson...Reports on Iceland, Greenland, and the Davis Strait for the Proper Use of the Sciences and Commerce
    http://www.wdl.org/en/item/2574/
    PDF 301.4 MB
  54. A Voyage Down the Amoor- With a Land Journey through Siberia, and Incidental Notices of Manchooria, Kamschatka, and Japan
    http://www.wdl.org/en/item/26/
    PDF 216.4 MB
  55. The Constitution of India
    http://www.wdl.org/en/item/2672/
    PDF 224.4 MB
  56. A New, Authentic, and Complete Collection of Voyages Round the World- Undertaken and Performed by Royal Authority, Containing a New, Authentic, Entertaining, Instructive, Full, and Complete Historical Account of Captain Cook's First, Second, Third, and Last Voyages, Undertaken by Order of His Present Majesty
    http://www.wdl.org/en/item/2674/
    PDF 808.9 MB
  57. The Special Features of French Antarctica, Otherwise Called America, and of Several Lands and Islands Discovered in Our Time
    http://www.wdl.org/en/item/271/
    PDF 178.2 MB
  58. Sunday Book
    http://www.wdl.org/en/item/273/
    PDF 286.7 MB
  59. Aesop's Fables
    http://www.wdl.org/en/item/28/
    PDF 179.0 MB
  60. A Voyage Round the World, Including an Embassy to Muscat and Siam in 1835, 1836, and 1837
    http://www.wdl.org/en/item/2884/
    PDF 471.8 MB
  61. Travels in South Africa in the Years 1849 to 1857
    http://www.wdl.org/en/item/2924/
    PDF 374.0 MB
  62. Laszlo Magyar's Travels in Southern Africa Between 1849 and 1857
    http://www.wdl.org/en/item/2925/
    PDF 495.2 MB
  63. Commentary of Hugo of Sienna on the First (Book) of the Canon of Avicenna Together with His Questions
    http://www.wdl.org/en/item/2999/
    PDF 195.7 MB
  64. Early Writings of Carl von Linne
    http://www.wdl.org/en/item/3038/
    PDF 161.2 MB
  65. Ibn Battuta's Rihla
    http://www.wdl.org/en/item/7470/
    PDF 100.2 MB
  66. Foreign Relations of the United States, 1894. Appendix 2- Affairs in Hawaii
    http://www.wdl.org/en/item/7767/
    PDF 247.4 MB
  67. The Lango- A Nilotic Tribe of Uganda
    http://www.wdl.org/en/item/7769/
    PDF 124.2 MB
  68. Through the Dark Continent
    http://www.wdl.org/en/item/7770/
    PDF 149.4 MB
  69. A. M. Mackay- Pioneer Missionary of the Church Missionary Society to Uganda
    http://www.wdl.org/en/item/7774/
    PDF 114.1 MB
  70. Amadis of Gaul
    http://www.wdl.org/en/item/8980/
    PDF 623.7 MB
  71. An Account of a Selection of Plants of America
    http://www.wdl.org/en/item/8981/
    PDF 354.1 MB
  72. An Account of a Selection of Plants of America
    http://www.wdl.org/en/item/8982/
    PDF 178.8 MB
  73. The Pilgrimage of Alpha (Manuel Ancizar) in the Northern Provinces of New Granada, 1850-51
    http://www.wdl.org/en/item/8984/
    PDF 236.7 MB
  74. Saint Ignatius of Loyola, Founder of the Society of Jesus- Heroic Poem
    http://www.wdl.org/en/item/8985/
    PDF 351.6 MB
  75. The Gospel of St. Matthew
    http://www.wdl.org/en/item/94/
    PDF 120.1 MB
  76. The Comprehensive Book on Medicine
    http://www.wdl.org/en/item/9553/
    PDF 125.7 MB
  77. Earthquakes of India- Volume I
    http://www.wdl.org/en/item/9693/
    PDF 157.2 MB
  78. Description of Malta
    http://www.wdl.org/en/item/9911/
    PDF 112.6 MB
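
A log like the one above can be turned back into structured data to pick out the files that need manual handling. The sketch below assumes the three-line entry layout used for most of the list (numbered title, item URL, "PDF <size> MB"); the function names and the 100MB default are illustrative, and the single-line entries at the top of the list are not covered.

```python
import re

# Size line as it appears in the log, e.g. "PDF 102.2 MB".
SIZE = re.compile(r"PDF\s+([\d.]+)\s*MB")


def parse_log(text: str) -> list:
    """Return (title, url, size_mb) tuples from title / URL / size triples."""
    entries = []
    title = url = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        size = SIZE.match(line)
        if line.startswith("http"):
            url = line
        elif size:
            entries.append((title, url, float(size.group(1))))
            title = url = None
        else:
            # Strip the leading list number ("15. ") from the title line.
            title = re.sub(r"^\d+\.\s*", "", line)
    return entries


def over_limit(entries: list, limit_mb: float = 100.0) -> list:
    """Entries larger than the limit, largest first."""
    return sorted((e for e in entries if e[2] > limit_mb), key=lambda e: -e[2])
```

Sorting largest first puts the files most likely to fail a normal upload at the top of the manual backlog.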
Status:    Backlog -
Fæ, re-work 'zoomified' objects such as File:From Tobol'sk to Obdorsk WDL181.jpg where there are multiple images. Currently only the first has been taken, however these appear to be a small proportion of the total. In this particular example the WDL provided no pdf version. I assembled one by downloading the 33 png files used by the "zoomify" tool and repackaged them into a pdf. See File:From Tobol'sk to Obdorsk WDL181.pdf.

195 files identified to be checked. Same issue as raised in discussion. Backlog created here.

Status:    1% done -
Fæ, many of the images have links to an original library source (in particular images provided by the US Library of Congress). It should be possible to go through these and check if an original very high resolution TIFF is available and can be uploaded as an alternative version. Based on the previous NARA batch upload, the Commons Community preferred to have both a high resolution jpeg/png which is easily used elsewhere, as well as a larger or sometimes extremely large tiff, with the categorization on the usable file with a pointer in the other_versions parameter to the on-wiki tiff.

Where a tiff is available and does not already appear on Commons, it would be ideal to upload it as an alternative in this way.

Note: in some cases the original link no longer works. This is the case for the Biblioteca Nacional Digital of Brazil, and it may be impossible to find an original scan.

Example
    1. File:Chinese Military Officer, in Official Uniform and Summer Straw Hat. China, 1874-75 WDL1905.png, 1,024 × 1,455 px, description in 7 languages
    2. LOC catalogue (A): http://www.loc.gov/pictures/resource/cph.3b40155, tiff available at 1,081 × 1,536 px
    3. LOC detailed record (B) linked from A: http://www.loc.gov/pictures/item/92521091, in English only
    4. Given the LCCN deduced from B, we can pull the XML record - http://lccn.loc.gov/92521091/mods
    1. File:Declaration of Independence. In Congress, July 4, 1776, a Declaration by the Representatives of the United States of America, in General Congress Assembled. WDL109.jpg - see earlier issue of multiple pages, this is only 1 page out of 2 to be uploaded, width 1,024px, 7 languages
    2. LOC display (A): http://lcweb2.loc.gov/cgi-bin/ampage?collId=rbc3&fileName=rbc0001_2004pe76546page.db (4435 × 5465px)
    3. LOC catalog, linked from A: http://lccn.loc.gov/2003576546, English only
    4. XML record - http://lccn.loc.gov/2003576546/mods
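
The last step in both examples, going from a Library of Congress record URL to its MODS XML record, follows a simple pattern that can be sketched as below. The helper names are assumptions, and the sketch only builds the URL as shown in the examples; it does not fetch anything or check that the record exists.

```python
from urllib.parse import urlparse


def lccn_from_url(url: str) -> str:
    """Take the trailing numeric path segment of a LOC record URL as the LCCN."""
    last = urlparse(url).path.rstrip("/").split("/")[-1]
    if not last.isdigit():
        raise ValueError(f"no numeric LCCN at the end of {url!r}")
    return last


def mods_url(lccn: str) -> str:
    """Build the MODS record URL in the form used in the examples above."""
    return f"http://lccn.loc.gov/{lccn}/mods"
```

For the first example, mods_url(lccn_from_url("http://www.loc.gov/pictures/item/92521091")) gives "http://lccn.loc.gov/92521091/mods".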

I will ponder the path to take here. Where a source exists in higher resolution online, it will invariably be in a library collection which may itself be worth setting up as its own batch upload project, rather than being uploaded piecemeal because some items happen to be published on the WDL.

Status:    On backlog -