User:AbdealiJK/file-metadata/GSoC2016

This page is the final report for the GSoC 2016 project: "Port catimages.py to pywikibot-core".

Student: User:AbdealiJK
Mentors: User:Jayvdb, User:DrTrigon
Original project description: phab:T66838
Project proposal submitted: phab:T129611
Weekly reports of the project: phab:T133762
Source Code: github:file-metadata, pypi:file-metadata

Aim

The aim of the project was to port the functionality provided by the catimages.py script in pywikibot-compat to pywikibot-core. The catimages.py script used several methods to identify the categories an image belongs to, including reading metadata such as EXIF and using computer vision to detect faces.

While doing this, some of the key aims were:

To make the code more stable, as the earlier catimages was considered more of a proof of concept
To use the latest frameworks and libraries when possible rather than the older ones used in the earlier script
To ensure that all dependencies were well supported with unittests and continuous integration and had an active maintainer
To clean up the catimages code and add CI and unit tests so that other contributors can extend it easily
To make catimages user friendly with an easy installation procedure to get it up and running

Work done during project

As part of the project, we created a new GitHub repository that contains all the code of the project. This GitHub repository is going to be the official location of the catimages script.

Contributions to catimages: v0.2.0 of the script was released by the end of the project (github release, pypi release). Statistics of contributions during the project can be seen at github:graphs:contributors
Documentation on running the code, and for people who wish to contribute, can be found at https://commons.wikimedia.org/wiki/User:AbdealiJK/file-metadata
Contributions to other open source projects: During the project, we also contributed to and helped out in other projects that we were considering as dependencies (through commits or discussions on issues):
- dlib - commits, PRs, issues
- scikit-image - commits, PRs, issues
- pillow - commits, PRs, issues
- matplotlib - commits, PRs, issues
- pytest - commits, PRs, issues
- Yaafe - PRs, issues

Final Results

The script was run on seven consecutive days, and the categories it found, along with related statistics, are documented here. These statistics were obtained by running the log-bot binary provided by file-metadata on ToolLabs.

The detailed logs can be seen at:

20 Jul 2016 - On uncategorized files, On uploaded files
21 Jul 2016 - On uncategorized files, On uploaded files
22 Jul 2016 - On uncategorized files, On uploaded files
23 Jul 2016 - On uncategorized files, On uploaded files
24 Jul 2016 - On uncategorized files, On uploaded files
25 Jul 2016 - On uncategorized files, On uploaded files
26 Jul 2016 - On uncategorized files, On uploaded files

Notes:

The script ran on every 5th file in the case of uploaded files, because the total number of files is too large (~10000 per day)
To understand the quality of the categorization, the categories were grouped into 3 buckets:
- Type: Categories that describe what type of image it is and also other metadata about the image (For example JPEG file, Graphic, Taken with <camera model>)
- Content: Categories that describe what is being seen in the image (For example, Faces)
- Location: Categories related to the location the image was taken (using GPS)
The tables rendered below may not be optimal for small displays
The uncategorized images were analyzed on 19 Aug 2016. Some of the images may no longer be uncategorized.
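
The sampling and bucketing described in the notes above can be sketched roughly as follows. This is only an illustration and not the code used by file-metadata or log-bot: the function names, the `BUCKET_OF` mapping and the sample data are all hypothetical, and the sketch assumes that a bucket's percentage means "share of analyzed files that received at least one category in that bucket".

```python
from itertools import islice


def sample_every_nth(items, n=5):
    """Return every n-th item of an iterable, starting with the first."""
    return list(islice(items, 0, None, n))


def bucket_coverage(results, bucket_of):
    """Percentage of files with at least one category in each bucket.

    results:   dict mapping file name -> list of category names
    bucket_of: dict mapping category name -> "Type", "Content" or "Location"
    (Both structures are assumptions made for this sketch.)
    """
    counts = {"Type": 0, "Content": 0, "Location": 0}
    for categories in results.values():
        # A file counts once per bucket, however many categories it has there.
        seen = {bucket_of[c] for c in categories if c in bucket_of}
        for bucket in seen:
            counts[bucket] += 1
    total = len(results)
    return {b: (100.0 * k / total if total else 0.0) for b, k in counts.items()}


# Hypothetical category-to-bucket assignment, for illustration only.
BUCKET_OF = {
    "JPEG files": "Type",
    "PNG files": "Type",
    "Graphics": "Type",
    "Human faces": "Content",
    "Norway": "Location",
}

# Made-up results for four files.
RESULTS = {
    "a.jpg": ["JPEG files", "Human faces", "Norway"],
    "b.png": ["PNG files", "Graphics"],
    "c.png": ["PNG files"],
    "d.jpg": ["JPEG files", "Human faces"],
}

uploads = [f"file{i}.jpg" for i in range(12)]
sampled = sample_every_nth(uploads)      # every 5th upload
coverage = bucket_coverage(RESULTS, BUCKET_OF)
print(sampled)
print(coverage)
```

Under this reading, a Type value of 100 simply means that every analyzed file received at least one Type category, such as its file format.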

Buckets captured over days

[Charts: one panel for all uploaded images and one for uncategorized images. Each plots the Type, Content and Location buckets for every day from 20 Jul to 26 Jul, on a vertical axis running up to 100.]

All uploaded images:
Bucket	20 Jul 2016	21 Jul 2016	22 Jul 2016	23 Jul 2016	24 Jul 2016	25 Jul 2016	26 Jul 2016
Type	100	100	100	100	100	100	100
Content	21.24	18.49	19.12	22.26	13.50	23.16	18.03
Location	9.75	6.97	10.27	8.41	11.02	11.38	11.34

Uncategorized images:
Bucket	20 Jul 2016	21 Jul 2016	22 Jul 2016	23 Jul 2016	24 Jul 2016	25 Jul 2016	26 Jul 2016
Type	100	100	100	100	100	100	100
Content	18.76	30.01	26.71	11.57	38.62	20.94	31.85
Location	6.69	12.50	5.02	6.22	3.11	2.64	4.55
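
To compare the two tables above at a glance, the daily values can be summarized per bucket. The snippet below is a small sketch that copies the Content and Location rows from the two tables and computes their mean, minimum and maximum over the seven days; the variable names are our own, and the Type rows are left out because they are 100 on every day.

```python
from statistics import mean

# Daily values copied from the two tables above (20 Jul to 26 Jul 2016).
rows = {
    "first table, Content": [21.24, 18.49, 19.12, 22.26, 13.50, 23.16, 18.03],
    "first table, Location": [9.75, 6.97, 10.27, 8.41, 11.02, 11.38, 11.34],
    "second table, Content": [18.76, 30.01, 26.71, 11.57, 38.62, 20.94, 31.85],
    "second table, Location": [6.69, 12.50, 5.02, 6.22, 3.11, 2.64, 4.55],
}

# Mean, minimum and maximum of each row over the seven days.
summary = {
    name: {"mean": round(mean(values), 2), "min": min(values), "max": max(values)}
    for name, values in rows.items()
}

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']}, min={stats['min']}, max={stats['max']}")
```

The spread between minimum and maximum shows how much the Content and Location shares vary from day to day, which is worth keeping in mind when reading any single day's figures.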

Files analyzed and distinct categories used per day

[Charts: one panel for all uploaded images and one for uncategorized images. Each plots the files analyzed and the distinct categories used for every day from 20 Jul to 26 Jul; the vertical axis runs up to 3,000 in the first panel and up to 1,500 in the second.]

All uploaded images:
	20 Jul 2016	21 Jul 2016	22 Jul 2016	23 Jul 2016	24 Jul 2016	25 Jul 2016	26 Jul 2016
Files	1558	1449	1537	2044	1496	1977	1525
Categories	352	306	349	360	383	401	358

Uncategorized images:
	20 Jul 2016	21 Jul 2016	22 Jul 2016	23 Jul 2016	24 Jul 2016	25 Jul 2016	26 Jul 2016
Files	986	1016	1134	916	963	1098	901
Categories	251	208	233	216	195	185	245

Examples

Here are a few examples of the results obtained by the script:

Faces

File: File:Ole Amund Sveen scora 2-2-målet mot Bodø-Glimt 24. juli 2016.jpg
Time taken: 27.48045 sec
Categories (8): Category:Groups of people in Norway, {{GPS EXIF}}, Category:Norway, Category:JPEG files, Category:Human faces, {{Created with Adobe Photoshop}}, Category:Unidentified people in Norway, Category:Taken with Pentax K-3 II

Football kits

File: File:Kit right arm cdequidad16t.png
Time taken: 2.62712 sec
Categories (3): Category:Graphics, Category:Football kit sleeves, Category:PNG files

File: File:Kit socks utrecht1516h.png
Time taken: 2.60331 sec
Categories (3): Category:Graphics, Category:PNG files, Category:Football kit socks
