User:Fæ/Imagehash

From Wikimedia Commons, the free media repository

If you want to jump into some source code, see a generic search routine on GitHub.

For the generic proposal to make image hashes available on the WMF hosted wikicommons database, see Phab:T167947.

Introduction and design notes

The two images are digital illustrations rather than photographs. With 95% of the image perfectly identical, the difference hash gives them the same value. Relying on image hashes to make decisions about collections with large numbers of digital derivatives would be inadvisable.

This project is a series of experiments using image hashing. Refer to User_talk:Fæ#Image_hashes_-_tests_on_PD_US_collections for original background.

The EXIF problem

The Commons API makes SHA1 values available for every file. Unfortunately, SHA1 is a cryptographic hash, so two SHA1 values cannot be compared to judge how close two files appear to be; if two images are identical apart from a minor change to the invisible EXIF data, they will not be identified as similar. The original idea of hashing images was to make "non-EXIF" SHA1 values available by creating a new, purely visual image file: stripping the EXIF data from the image and calculating the unique SHA1 value of the result. This could then be used to check whether any images are digitally identical apart from amendments to EXIF data. It was implemented in Python by downloading the full image and holding it in memory using standard Python Imaging Library commands, then saving the in-memory image as a default JPEG export; because this is a recompression of the pure rendered image, the export carries no EXIF data. This is a useful routine, but it misses duplicates with minor visual changes, such as adjustments to the light histogram or colour saturation, changes which are often seen in duplicated uploads.
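A minimal sketch of such a non-EXIF SHA1 routine, assuming Pillow (the maintained fork of the Python Imaging Library) is installed; the function name is illustrative rather than taken from the original script:

```python
import hashlib
import io

from PIL import Image  # Pillow, the maintained fork of PIL


def non_exif_sha1(image_bytes: bytes) -> str:
    """Return a SHA1 of the rendered image only, ignoring EXIF metadata."""
    img = Image.open(io.BytesIO(image_bytes))
    buffer = io.BytesIO()
    # Re-saving as a default JPEG recompresses the pure rendered pixels
    # and writes no EXIF block, so metadata-only differences disappear.
    img.convert("RGB").save(buffer, format="JPEG")
    return hashlib.sha1(buffer.getvalue()).hexdigest()
```

Because the result depends on the JPEG encoder and its default settings, these values are only comparable between files processed by the same routine, unlike the SHA1 values served by the Commons API.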

The similarity problem

The Fleuron upload project has a specific need to match close, but not identical, images. The open source ImageHash library provides different options for creating 64-bit 'fingerprints' for images, which can then be compared to give a similarity value. The perceptual hash worked very well in matching small black and white Fleuron images, which are scans of prints from different books and so vary in line thickness, with various types of distortion introduced by the original printing process and by the book scanning process, where pages may sit at various angles to the camera.

Choosing an image hash algorithm

The difference hash relies on matching gradients in a grayscaled version of the image. The thumbnail used is tiny, just 9×8 pixels. Consequently, when experimenting with the hash there was no point in downloading the full size images; instead an 80px wide thumbnail was used as the hash input. Early tests used 320px, 160px or 80px, but this made no apparent difference to the quality of the result. For this reason the hashing is not fully generic: a different project might start with the full image, or another size, and could return a different (albeit close) hash value. A consequence of this choice is that the initial image should be larger than the chosen thumbnail size; though this could be worked around, there is probably almost no value in testing tiny images.

Based on later sample tests, the most effective image hash is the pHash, which has become the default on these projects and is applied to 80px wide thumbnails. The pHash gives more consistent results across thumbnail sizes; no algorithm reliably generates the same hash from a thumbnail as from the full size image, and results may vary between thumbnail sizes. The estimate is that the pHash is probably around 90% to 95% accurate in finding visually identical images, so may miss up to 10% of potential duplicates. For this reason both pHash and dHash are used in later projects, with the dHash acting as a fall-back if the pHashes differ while the dHashes do not. It is estimated that this should give an accuracy of 98% or more for identifying visual duplicates.

Testing hash and thumbnail size

The sample tests below show that the hash most likely to stay consistent across thumbnail sizes is the pHash (perception hash), with the others showing varying, lesser consistency. The sample set was based on large sized paintings in the Wellcome collection.

Unfortunately no algorithm reliably returns a thumbnail hash that matches the full size image hash, even when large thumbnails such as 1280px wide are selected. The good news is that if the pHash is used, a 160px or even 80px wide thumbnail seems highly reliable when compared against the results using 640px wide.

The tests indicate that pHash may be a better choice for Commons than dHash when examined for reliability across thumbnail sizes. A variation of up to 2 bits in a hash when compared across thumbnail sizes may be expected. This indicates that a 2 bit variation in general use of hashes may be normal in around 5% to 10% of cases depending on how the samples are chosen; so if a greater than 90% accuracy is wanted, more than one hash type may be useful to increase reliability.

Test sets for all 4 available image hash types over a range of thumbnail sizes

Colour key:

  • GREEN The hash for this size thumbnail is identical to the hash for the full size image.
  • RED The hash is different to the full size image hash.
  • BLUE The hash matches the 640px width thumbnail hash, but not the full size image.

The technique whose results show the most green values is the most desirable.

dHash, pHash, average_hash and wHash: each table listed 20 sample files against thumbnail widths of 80, 160, 320, 640 and 1280px, with each cell coloured per the key above. (The colour-coded results do not survive in this text rendering.)

Casebook

Finding duplicates for Department of Defense and other U.S. PD images

Category:Faebot identified duplicates

Using resolution

The idea for this housekeeping exercise was to check through the 12,900 DoD images in Category:Images from DoD uploaded by Fæ (check needed) to discover duplicates. Duplicates among DoD files are a well known problem, as uploads from different U.S. military sources such as navy.mil and dvidshub.net often have different EXIF data; the EXIF comments are probably generated as part of the image management database before publishing on their websites. A secondary issue is that though DVIDS has become the 'official' agent for all DoD images, it is inconsistent in its use of the unique military identity, COM:VIRIN, which is often only partly displayed on the catalogue entries.

The workflow is:

  1. Take an image from the source category.
  2. Read its height and width.
  3. Run a general Commons search for filew:{} fileh:{} filemime:image/jpeg insource:"PD-USGov-" (with the width and height filled into the placeholders); this returns up to 10,000 images with the right types of PD-USGov licenses and with identical resolutions.
  4. If the returns are greater than 9,990, skip the rest as testing a partial return will potentially miss candidate duplicates.
  5. Create dHash values for all images in the search set.
  6. Compare all dHash values against each other, creating a set of sets of identical values.
  7. For the sets of equal hash images, categorize into Category:Faebot identified duplicates.
  8. Start again, remembering the image sizes already tested.
Note: We may work around the 10,000 search limit, but comparing 10,000 images pairwise takes as many as ~49m operations, and as the number of comparisons grows with the square of the number of images, the way the comparisons are done would need rethinking for sample spaces of 100,000 or more.
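The quadratic growth only bites when hashes are compared pairwise; for step 6, grouping identical values with a dictionary needs just one pass over the set (a sketch with illustrative names):

```python
from collections import defaultdict


def pairwise_comparisons(n: int) -> int:
    """Comparisons needed to test every pair among n images."""
    return n * (n - 1) // 2


def group_identical(hashes: dict[str, str]) -> list[set[str]]:
    """Bucket filenames by identical hash value in a single pass,
    returning only the buckets holding more than one file."""
    buckets: dict[str, set[str]] = defaultdict(set)
    for filename, h in hashes.items():
        buckets[h].add(filename)
    return [files for files in buckets.values() if len(files) > 1]
```

For 10,000 images, pairwise_comparisons(10_000) gives 49,995,000, matching the ~49m figure; the dictionary approach sidesteps this cost when only zero-difference matches are wanted.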

The result of this workflow is that the DoD category acts as a "seed" set of images, and though most returned duplicates are DoD photographs, the sizes happen to coincide with other non-military PD-USGov collections, such as NARA and NOAA, and those duplicates are also categorized.

A disadvantage of this slicing by image resolution is that duplicates with non-identical resolutions will be missed. However, for the original problem of changed EXIF values it remains a reasonable solution for files where no existing Commons tool or method could reliably identify duplicates. The results have been near 100% reliable, with no significant patterns of false identification.

A possible improvement is to check the image page of an identified duplicate to see if it already cross-links to the duplicate photos. Where an image may have been digitally enhanced or corrected, there may be reasons to keep the original and the derivative. Evidence of preexisting links would be a good reason to skip the suspected duplicate.

Example: Triple duplicate of 2013 Moore tornado damage.

This triple duplicate was created because the final row of the image has suffered damage. A detailed examination of the originals shows that only the first image, from Flickr, has no damage. The two damaged versions were uploaded under different VIRINs on the military image database, making it virtually impossible to identify these duplicates using existing automated methods. All three have different EXIF data as the field "Image title" has been altered to match changing descriptions on the military database, consequently even if these files were undamaged, they would have different SHA1 values.

Example: Quadruple non-identical duplicate of a photograph of the seizure of suspected pirates by USS Kidd.
This photograph was uploaded in four variations from four different sources, by four different mass uploaders, over a period of four years:

  1. File:US Navy 120105-N-ZZ999-002 A U.S. Navy SH-60S Sea Hawk helicopter provides support to a visit, board, search and seizure (VBSS) team in a 7-meter r.jpg 2012-01, BotMultichillT from http://www.navy.mil/view_image.asp?id=113077
  2. File:Flickr - Official U.S. Navy Imagery - Sailors from USS Kidd assist an Iranian-flagged fishing dhow..jpg 2012-11, Matanya from http://www.flickr.com/photos/usnavy/6647349195/
  3. File:A U.S. Navy SH-60S Sea Hawk helicopter provides support to a visit, board, search and seizure (VBSS) team in a rigid-hull inflatable boat assigned to guided missile destroyer USS Kidd (DDG 100) in the Arabian 120105-N-ZZ999-002.jpg 2013-04, Fae from defenseimagery.mil (deadlink)
  4. File:120105-N-ZZ999-002 (6668113027).jpg 2016-09, Jasonanaggie from https://www.flickr.com/photos/39955793@N07/6668113027/

Progress: Over 3 days of running the script, 4,000 duplicates (2,000 duplicate pairs) were identified using this technique, surprisingly after just 216 sampled seed images. Though sizes will repeat as the category is worked through, so the number of 'hits' will gradually drop off, 98% of the check category still remains to be processed.

Using part-VIRIN

Source code at GitHub

Though DVIDS uploads may have either no allocated VIRIN (example) or only a partial VIRIN on the catalogue page, it is still reasonable to slice the possible returns by the start of the VIRIN identity. The first 6 digits are based on the date (day) taken, and the next letter indicates the military branch. The advantages of this approach are:

  • different sized uploads of the same image might be discovered
  • the number of returns for each search is normally from 1 to ~100, far below 10,000, the built-in limit for Pywikibot searches of this type
  • the number of cross-comparisons decreases dramatically due to the limited number of returns
  • the searches are unique with no overlap, so when testing a large category we can be assured that each image is examined only once
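Because each prefix covers one day and one service letter, the search slices can be generated mechanically. A sketch; the set of service letters here is an assumption for illustration, not taken from the source scripts:

```python
from datetime import date, timedelta

# Assumed subset of VIRIN service codes, e.g. A (Army), F (Air Force),
# M (Marine Corps), N (Navy); the real scripts may use a fuller set.
SERVICE_LETTERS = "AFMN"


def virin_prefixes(start: date, days: int):
    """Yield non-overlapping part-VIRIN search prefixes such as
    '090726-A-', one per day and service letter."""
    for offset in range(days):
        d = start + timedelta(days=offset)
        for letter in SERVICE_LETTERS:
            yield f"{d:%y%m%d}-{letter}-"
```

Since no two prefixes overlap, every image is examined exactly once as a large category is worked through.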

The "seed" category taken for this duplicate finding task was Images from DoD uploaded by Fæ, which starts with over 300,000 images. This is such a large category that finding visual duplicates manually would be a completely unrealistic expectation.

Examples:

  • "090726-A-" filemime:image/jpeg insource:"PD-USGov-" search gives 10 images, of which two were found to be duplicates: One and Two. Both uploads were from dvidshub, but under different image identities and with slightly varying EXIF data.
  • "081024-N-" filemime:image/jpeg insource:"PD-USGov-" search returned 52 matches with one duplicate pair: One and Two, one from navy.mil in 2009 and the other from dvidshub.net in 2015.

Using pHash and dHash

Experimenting with different sized thumbnails shows that hash calculations may vary by a couple of bits for the same image, leading to an expectation that a single hash might only detect ~90% of duplicates if a hash difference of zero is required. Once a thumbnail is loaded into memory, the additional clock time to calculate both a pHash and a dHash is minuscule compared to the time spent waiting for the thumbnail to download over an internet connection. Consequently the dHash was added as a back-up check to the pHash for DoD collections, giving a likely 98%+ success rate in discovering visual duplicates, if earlier estimates were accurate.
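The combined rule can be expressed with a plain Hamming distance over the hex fingerprints (a sketch; the actual scripts' logic may differ in detail):

```python
def hamming(hash1: str, hash2: str) -> int:
    """Bitwise difference between two hex-encoded image hashes."""
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


def likely_duplicate(phash1: str, phash2: str,
                     dhash1: str, dhash2: str) -> bool:
    """A pHash match is the primary test; an exact dHash match is the
    fall-back when the pHashes vary by a bit or two."""
    return hamming(phash1, phash2) == 0 or hamming(dhash1, dhash2) == 0
```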

This was seen to be an effective improvement, with previously run part-VIRIN searches returning new matches based on zero-difference in dHashes where pHashes did not match.

Example: File:Defense.gov News Photo 010914-F-8006R-006.jpg (defense.gov, 2012) was identified as a duplicate relying on dHash values; the match was to images previously identified using pHash matching, so the other duplicates were skipped. This file is especially problematic: File:911-pentagon-3days.jpg (defenseimagery.mil, 2009) has a different resolution and saturation, and File:Pentagon on 9.11 - 2.jpg (navy.mil, 2008) is not matched as a duplicate at all, and was instead found by searching for the VIRIN, possibly because the method of lightening the photograph created non-trivial changes to the image hash value.

Implementing non-EXIF SHA1 check

Category:Faebot identified identical duplicates

With prospective identical duplicates identified using Imagehash, these make perfect candidates for attempting a "non-EXIF" SHA1 calculation. Where two files are digitally identical apart from EXIF data, the standard {{duplicate}} process can be followed, choosing the file with the latest upload date, or weighing other factors, to select the duplicate for deletion.

Taking a real example to illustrate workflow:

  1. Identify File1 and File2 from the home category
  2. For both files dHash = 00181b19147afcf8
  3. Running the more computationally expensive non-EXIF calculation gives SHA1=677deaaaaff0a198a801e1ae782eb394a93c4ed4 for both files (note that the SHA1 values for the full original files are different); the workflow would skip this pair if the values were not identical
  4. Checking dates shows File1 has a timestamp in 2013, earlier than File2 in 2016, so add 1 to File1's score
  5. Checking information templates shows File2 is using {{milim}}, so add 2 to File2's score for having a better layout
  6. Consequently the second file is put up for deletion with a duplicate template, which points to the lower scoring file. The duplicate is moved to Category:Faebot identified identical duplicates until it gets deleted, and the main duplicate category is removed from the file which should remain.

Other factors which change the score are whether www.defenseimagery.mil is referenced (-2), as links to that site are redirected to a meaningless top level http://www.dimoc.mil page, and whether there is linkrot (-1 for each dead link).
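The scoring can be sketched as a simple function (weights taken from the description above; the function name and signature are illustrative):

```python
def file_score(is_earliest: bool, uses_milim: bool,
               links_defenseimagery: bool, dead_links: int) -> int:
    """Score one file of an identical duplicate pair using the factors
    described above."""
    score = 0
    if is_earliest:
        score += 1   # earliest upload timestamp
    if uses_milim:
        score += 2   # {{milim}} information template gives a better layout
    if links_defenseimagery:
        score -= 2   # www.defenseimagery.mil links redirect uselessly
    score -= dead_links  # -1 for each rotted link
    return score
```

In the worked example File1 scores 1 for the earlier timestamp and File2 scores 2 for using {{milim}}; the second file is put up for deletion with the {{duplicate}} template pointing at the lower scoring file.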

The rate of discovery of pure identical images, i.e. only varying by EXIF data, was around 20% of the total number of images found by image hashing. It can be safely presumed that the majority of the non-identical duplicates which have identical hashes vary because of "enhancements" to light levels or saturation. A small minority have been matched which are specific digital enhancements of a separate primary version, such as by removing marks or target hairs.

NOAA illustrating duplicates with saturation differences

The two original images have identical dHash values but vary both by colour saturation and by EXIF data. These duplicates were found during the search for 'US PD' duplicates, which is constrained to testing only images of the same resolution; at the time of analysis, 12 duplicate pairs from NOAA had been identified this way.

The difference is likely to have been introduced during NOAA's release to Flickr, possibly by an automatic enhancement on transfer. A very common reason for non-identical duplicates being created is uploads from Flickr rather than the original archives. Even if the EXIF data were stripped off these files, they would remain impossible to match by SHA1 values. These images were uploaded to Commons in 2009 and 2012 by different contributors.

This result encouraged trying a sample search for "insource:/51647007@N08|noaa.gov/ crab", to test whether Flickr uploads have a consistent pattern of duplicates with changes. This gave 156 images in the search, of which 3 pairs are of interest as duplicates:

The first two pairs of duplicates have slight variations in light levels between the Flickr source and the NOAA library version. The second of these pairs differs in size, 1,590×1,230 versus 2,850×2,205 pixels, showing that the image hash technique is well suited to automatically scanning for the same photograph uploaded at different resolutions.

The last pair are a puzzle, as it is unclear why the difference hash finds any difference between them; a simple run finding exact hash matches would miss this pair. Examination in GIMP was unable to highlight differences. These are especially interesting as one is a Featured Picture on 3 different projects and has significant global use.

Closing down this particular rabbit-hole, a search was run for "insource:'NOAA' insource:/51647007@N08|noaa.gov/i filew:>899 filemime:image/jpeg", returning 8,682 images for testing. The hashing started at 18:16 and ended at 23:29, an average of 2.16 seconds to create each image hash.[1] After 20 minutes of running the comparisons (~64m+ operations), 128 duplicate pairs had been discovered and added to Category:Faebot identified duplicates. Among these was the first noticeable non-duplicate, File:Betsy from TIROS VIII Spac0010-repair.jpg and File:Hurricane Betsy.jpg, where one is a deliberately digitally cleaned up derivative.
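The quoted average is simple to verify from the start and end times (the date below is arbitrary; only the 5 hour 13 minute span matters):

```python
from datetime import datetime

# 18:16 to 23:29 on the same day, per the run described above.
start = datetime(2017, 6, 1, 18, 16)
end = datetime(2017, 6, 1, 23, 29)

seconds_per_hash = (end - start).total_seconds() / 8682
# ~2.16 seconds on average to create each image hash
```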

In addition to the 128 duplicate pairs, there were many near matches where the dHash varied by 1 or 2, such as where enhanced versions have been created or where maps have slightly varying details. The list is at NOAA near matches. The worst near matches appear to be photographs of sunsets, where there are no clear lines. This may be why a combination of at least two types of hash is beneficial for more accurate attempts to find near matches.

Using difference hash to find non-identical 18th century printed versions of a graphic

The Fleuron project has a specific challenge: how to "cluster" many scanned variations of the same printed graphic. Doing this "manually" would be impossible due to the expected size of the project, more than 200,000 scanned images. Though the scans are being categorized into year subcategories, a printer's fleurons might be found in different books and in different years over a 100 year period. By experimentation, if two matches have a difference in dHash values of less than 12 they are likely to be fairly similar, and with a difference of 8 or less they are likely to be variations of the same source image or scene.
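These thresholds translate directly into a comparison helper (a sketch; the constant names are illustrative):

```python
def hamming(h1: str, h2: str) -> int:
    """Bitwise difference between two hex-encoded dHash values."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")


SAME_SOURCE = 8   # at or below: likely variations of one source graphic
SIMILAR = 12      # below: likely fairly similar


def classify_pair(h1: str, h2: str) -> str:
    """Label a pair of fleuron scans by their dHash difference."""
    d = hamming(h1, h2)
    if d <= SAME_SOURCE:
        return "same source"
    if d < SIMILAR:
        return "similar"
    return "unrelated"
```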

As an illustration of a project wide search, a standard Commons search for "intitle:Fleuron fashion" returns 80 images within the Fleuron project, with matches in the title or text for the word "fashion". Setting the maximum dHash variation to 6 returned no duplicates; however, nine sets of interesting matches were discovered across several books. Sampling these:

Here 4 copies of the same fleuron are matched, two with differences of just 1 in the dHash. Automatic clustering may be sensible for very close matching, say less than 3, if not with categories then by adding galleries of similar fleurons to image pages.
These two figures are matched, though there are significant differences from distortion, either caused by page angle or printing variation. The two figures are from different prints of the book, so within-book searching would not have discovered these.
Here two fleurons have been matched, they happen to be in the same book, but it is worth highlighting how different they are in contrast/printed line thickness. The dHash has still matched them closely, with a variation of just 3, presumably because the overall shape is the same.

Glossary

Imagehash
A digital fingerprint for an image, which can be used to tell whether two images are similar. In this case study an open source standard module provides the hashes (source), and it is thought that the image searching algorithms used by Google, TinEye and others work in a very similar way. The Python ImageHash library provides four types of hashing algorithm, all returning fingerprints which can be shown as 16 hexadecimal digits. A key benefit of the image hash is that two fingerprint values can be compared to see how similar two images appear, with a difference of zero meaning they are almost identical. See User:Fæ/Project_list/Fleuron#Imagehash for a project where it was useful to find fairly similar images.

The four types of imagehash available are:

  1. average hashing (aHash), where pixel values of a grey-scaled thumbnail are averaged.
  2. perception hashing (pHash), which applies a discrete cosine transform to a grey-scaled thumbnail so that comparisons are made on frequencies, the same technique used in JPEG compression.
  3. difference hashing (dHash), using differences in brightness to create a fingerprint based on gradients.
  4. wavelet hashing (wHash), also frequency based, similar to perception hashing but applying a discrete wavelet transform, which may give more accurate results.
SHA1
A globally recognized type of unique digital fingerprint for any type of file, which can be displayed as 40 hexadecimal digits. The Wikimedia Commons API makes these available so that any file can be checked to see if it already exists on Commons, or if the same file has been deleted in the past; the standard upload wizard does precisely this, giving warnings if detected. The SHA1 value can only be used to tell whether two files are digitally identical; it gives no insight into whether files are almost identical. Refer to mw:API:Imageinfo and SHA1.