User:AbdealiJK/file-metadata

This page documents the file-metadata application, which is my GSoC 2016 project in Wikimedia/pywikibot. The crux of the project is to categorize images on Wikimedia Commons using a smart bot, mainly with EXIF data and computer vision techniques. It is a revamp of the older catimages script that was run by User:DrTrigonBot.

We have run some of the analysis routines on a large number of files on Wikimedia Commons, and the results can be found at User:AbdealiJKTravis/logs.

Installation

Using pypi (recommended)

Step 1: Install the build dependencies using the distribution's package manager.

Note: The dependencies listed below are meant for installing wheels (pre-built Python packages). To ensure that your pip can handle wheels, upgrade it to a recent version; pip 7.x or higher is recommended.

For Ubuntu or Debian:

$ sudo apt-get install pkg-config exiftool libmagickwand-dev libzbar-dev default-jre libav-tools python-dev cmake libboost-python-dev python-virtualenv

For Fedora, CentOS or RHEL:

$ sudo yum install pkgconfig perl-Image-ExifTool ImageMagick-devel zbar-devel java-1.8.0-openjdk ffmpeg python-devel cmake boost-python boost-devel python-virtualenv

Step 2: Next, create a virtual environment with virtualenv, in which the package can be installed, and activate it:

$ virtualenv venv
$ source venv/bin/activate

The virtual environment adds its name as a prefix to the command line prompt, which shows whether the command you are running is inside a virtual environment or not. It is recommended to update pip inside the virtual environment to avoid issues and to use Python wheels, which make the installation much faster:

(venv)$ pip install -U pip
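Besides looking at the prompt, the same check can be made from inside Python. The following is a minimal sketch, not part of file-metadata; the helper name in_virtualenv is made up for illustration. It relies on the attributes that virtualenv and the standard venv module set on the sys module.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and the stdlib venv module keep the original
    # location in sys.base_prefix, which then differs from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print(in_virtualenv())
```

If this prints False after activation, the python on your PATH is probably not the one from venv/bin.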

Step 3: Install the latest stable version of file-metadata (v0.2.0 at the time of writing) with pip:

(venv)$ pip install file-metadata

Using the system's package manager

If you wish to install file-metadata with the help of the distribution's package manager, no virtual environment is required. This is normally faster because many packages do not need to be compiled when pip installs them.

Step 1: Install the dependencies using the distribution's package manager. For Ubuntu or Debian:

$ sudo apt-get install openjdk-7-jre python-pip python-setuptools python-appdirs python-magic python-numpy python-scipy python-matplotlib python-wand python-skimage python-zbar cmake libboost-python-dev libjpeg-dev libz-dev

Step 2: Some packages are still not available from the distribution's package manager. Use pip to install file-metadata and the remaining dependencies:

$ sudo pip install file-metadata

Usage

To use the package, you first need a file, which can be any media file. Let us first download an example QR code from Wikimedia Commons:

$ wget https://upload.wikimedia.org/wikipedia/commons/5/5b/Qrcode_wikipedia.jpg -O qrcode.jpg

And now, let us create a File object from this:

>>> from file_metadata.generic_file import GenericFile
>>> qr = GenericFile.create('qrcode.jpg') 

Notice that when creating the file, the class automatically finds the best type of class to analyze the file. In this case, it automatically detects that the file is an image file, and uses the ImageFile class:

>>> qr.__class__.__name__
'ImageFile'

Now, to find the analysis routines supported for the file, help(qr) can be checked. All routines beginning with analyze_ perform analysis. As the example we have is a QR code, let us use analyze_barcode_zxing():

>>> qr.analyze_barcode_zxing()
{'zxing:Barcodes': [{'data': 'http://www.wikipedia.com',
   'format': 'QR_CODE',
   'points': [(50.0, 316.0), (50.0, 52.0), (314.0, 52.0), (278.0, 280.0)],
   'raw_data': 'http://www.wikipedia.com'}]}

This tells us the bounding box of the barcode (points), the data it contains (http://www.wikipedia.com), and that the format of the barcode is QR_CODE.
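The points are returned as individual (x, y) coordinates. If an axis-aligned rectangle is needed, for example to crop the barcode region, it can be derived from them. The following is a minimal sketch that works on a copy of the result shown above, so it runs without file-metadata installed; the helper bounding_box is made up for illustration and is not part of the library.

```python
# Result copied from the analyze_barcode_zxing() call shown above.
result = {
    'zxing:Barcodes': [{
        'data': 'http://www.wikipedia.com',
        'format': 'QR_CODE',
        'points': [(50.0, 316.0), (50.0, 52.0), (314.0, 52.0), (278.0, 280.0)],
        'raw_data': 'http://www.wikipedia.com',
    }]
}


def bounding_box(points):
    """Return (left, top, right, bottom) enclosing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Several barcodes may be found in one image, so iterate over the list.
for barcode in result['zxing:Barcodes']:
    box = bounding_box(barcode['points'])
    print('%s %s %s' % (barcode['format'], barcode['data'], box))
```

For the example above, bounding_box returns (50.0, 52.0, 314.0, 316.0).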

Similarly, to check the MIME type, the analysis routine analyze_mimetype() can be used:

>>> qr.analyze_mimetype()
{'File:MIMEType': 'image/jpeg'}
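A script will often want to branch on the kind of file it is dealing with. The following is a small sketch of how the returned dictionary could be used for that; it works on a copy of the result shown above, and the helper split_mimetype is made up for illustration.

```python
# Result copied from the analyze_mimetype() call shown above.
result = {'File:MIMEType': 'image/jpeg'}


def split_mimetype(mimetype):
    """Split a 'type/subtype' string into a (type, subtype) tuple."""
    major, _, minor = mimetype.partition('/')
    return major, minor


major, minor = split_mimetype(result['File:MIMEType'])
if major == 'image':
    print('image file with subtype ' + minor)
```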

In a pywikibot script

If you plan on using a bot on Commons, this can be done using pywikibot. Some pywikibot scripts are bundled with file-metadata. These are:

  • wikibot-filemeta-simple - A simple bot script using file-metadata, meant as an example.
  • wikibot-filemeta-log - Creates elaborate log pages like User:AbdealiJKTravis/logs/newimages/20160716 where all details can be seen.

To run these, you can simply use the executables provided. For example, to run wikibot-filemeta-log on every 5th image uploaded on 2016 Jul 17th, you can do:

$ wikibot-filemeta-log -logname:newimages/20160716 -limitsize:50 -logevents:upload,,20160717,20160716 -skip:5

More help about the scripts can be found with:

$ wikibot-filemeta-log -help
OR
$ wikibot-filemeta-simple -help

Contributing

Bugs and feature requests can be raised on the issues page, https://github.com/AbdealiJK/file-metadata/issues. Every issue created helps us improve the library! So don't hesitate to open one if you have questions or run into trouble when running the library.

As a developer

Developers can find our source code in the GitHub repository, https://github.com/AbdealiJK/file-metadata. To contribute, create a fork of the repository and make a pull request with your patch.

As a tester

Testers who wish to make the library better by testing the application on different OSes and in different conditions can find our test suite in the tests folder on GitHub. We use pytest to run our tests, and the dependencies required to run them are listed in the test-requirements.txt file.

To run the tests, clone the GitHub repository, cd into it and do:

$ pip install .
$ pip install -r test-requirements.txt
$ python -m pytest

As a documenter

To contribute to the documentation, feel free to edit the user pages within User:AbdealiJK/file-metadata and also to contribute to the docstrings, README, etc. in the GitHub repository. If you have suggestions on how to improve the documentation (and don't feel like writing the documentation yourself), feel free to open an issue on GitHub to discuss, or leave a note on my talk page - User talk:AbdealiJK.

FAQ

Which Python versions and OSes are supported?

Currently, only Linux with Python 2.7 is tested regularly and supported. Although the library is written to support Python 3.x and other platforms, these are not tested regularly on Travis and bugs may creep in. Hence, we do not guarantee that it works on anything other than Linux with Python 2.7 at this time.

What can this library do?

The library aims to provide analysis of images, audio and documents which are uploaded to Wikimedia Commons. It can find the MIME type (analyze_mimetype()), fetch EXIF data (analyze_exiftool()), detect barcodes in images (analyze_barcode_zxing()), detect faces (analyze_faces()) and so on. You can see some statistics of the script being run on a large number of files at User:AbdealiJKTravis/logs.

How do I find more analysis methods to use?

The analysis applicable to a file depends on the file type itself. Hence, the simplest way to identify the analysis methods applicable to a file is to create a file object and check its help text:

>>> from file_metadata.generic_file import GenericFile
>>> _file = GenericFile.create('./path/to/file')
>>> help(_file)

Then look for all methods beginning with "analyze" in the help text provided.
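Instead of reading through the help text, the method names can also be filtered programmatically. The following is a minimal sketch; the helper list_analysis_methods and the FakeFile class are made up for illustration, the latter only so that the block runs without file-metadata installed.

```python
def list_analysis_methods(obj):
    """Return the sorted names of callable attributes starting with 'analyze_'."""
    return sorted(
        name for name in dir(obj)
        if name.startswith('analyze_') and callable(getattr(obj, name))
    )


class FakeFile(object):
    """Hypothetical stand-in for an object returned by GenericFile.create()."""

    def analyze_mimetype(self):
        return {'File:MIMEType': 'image/jpeg'}

    def analyze_barcode_zxing(self):
        return {}

    def fetch(self):
        # Not an analysis routine, so it is filtered out.
        return None


print(list_analysis_methods(FakeFile()))
```

With the library installed, pass the object returned by GenericFile.create('./path/to/file') instead of FakeFile().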

pkg_resources.DistributionNotFound: cython>=0.21 (or similar)

While installing as described, you may encounter this error. It can be solved in different ways:

  1. make a virtualenv (recommended; this way you do not have to use sudo all the time)
  2. uninstall the cython system package
  3. force pip to install its package along with the cython system package:
    $ python -m pip install --ignore-installed cython
    (this is what DrTrigon did, see https://phabricator.wikimedia.org/T136985#2358607 - not sure yet whether this is a good solution...)
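To see whether the requirement from the error message is satisfied in the current interpreter, before or after trying one of the fixes, it can be checked with pkg_resources, which ships with setuptools. The following is a rough sketch; the helper name requirement_satisfied is made up for illustration, and newer setuptools releases deprecate pkg_resources, so the import is guarded.

```python
def requirement_satisfied(requirement):
    """Return True/False for a requirement string, or None if it cannot be checked."""
    try:
        import pkg_resources
    except ImportError:
        # setuptools (or its pkg_resources module) is not available here.
        return None
    try:
        pkg_resources.require(requirement)
    except (pkg_resources.DistributionNotFound, pkg_resources.VersionConflict):
        # Either nothing is installed, or the installed version is unsuitable.
        return False
    return True


# Requirement string taken from the error message above.
print(requirement_satisfied('cython>=0.21'))
```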

Do I need to update the system pip ?

Using an older pip version can cause several kinds of strange behavior (e.g. not using wheels, having issues with the system cython on Ubuntu, etc.). We recommend updating pip, as that will ensure a much faster and easier installation procedure.
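To find out whether the pip available to the current interpreter already meets the recommendation of version 7.x or higher, its version string can be inspected. The following is a minimal sketch; the helper names are made up for illustration, and only the major version number is compared.

```python
def pip_major_version(version_string):
    """Extract the major version number from a string such as '8.1.2'."""
    return int(version_string.split('.')[0])


def is_recommended_pip(version_string, minimum_major=7):
    """Return True if the version is at least the recommended major version."""
    return pip_major_version(version_string) >= minimum_major


try:
    import pip
    print('pip %s recommended: %s' % (pip.__version__,
                                      is_recommended_pip(pip.__version__)))
except ImportError:
    print('pip is not importable in this interpreter')
```

Running pip --version on the command line gives the same version number.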

Updating pip in a virtualenv is much easier, and hence a better (and recommended) option. To update pip in a virtual environment, just do:

(venv)$ pip install -U pip

To update pip for the system in Python 2.7, use get-pip. For Ubuntu or Debian:

$ sudo apt-get update
$ sudo apt-get purge python-pip
$ sudo apt-get autoremove
$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py