User:Multichill/Categorization bot

From Wikimedia Commons, the free media repository

Idea for a new categorization bot.

The current bot (imagerecat.py) depends on Commonsense, which is often slow or, even worse, broken. The new bot should be standalone.

At the moment about 40,000 files are uncategorized but in use on some project (see query and result). It would be nice if a bot could categorize these files.

The bot should work on a list of images (for example, the list of used uncategorized files). For each image:

  • See where the image is used.
  • For each article, see if it contains a Commonscat link.
  • If the article doesn't contain a Commonscat link, work through the categories and interwikis of the article.
  • The process should be recursive, with finding a link to Commons as the stop condition.
  • There should be some sort of scoring
    • The earlier a link to Commons is found, the higher the score
    • Multiple hits for the same category combine into a higher score
  • There should be some kind of filtering
    • Blacklist
    • Skip hidden categories
    • (etc etc)
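
The steps above could be sketched roughly as follows. This is only an illustration, not a worked-out design: the function name `suggest_categories`, the in-memory `pages` dictionary standing in for real wiki lookups, the `1 / (depth + 1)` scoring formula and the `max_depth` limit are all assumptions made for the example. A real bot would fetch usage, Commonscat links, categories and interwikis from the wikis themselves.

```python
from collections import deque


def suggest_categories(usage_pages, pages, blacklist=frozenset(),
                       hidden=frozenset(), max_depth=3):
    """Return (category, score) pairs, best first.

    `pages` is a hypothetical lookup table: title -> {"commonscat": name
    or None, "links": [titles of categories and interwikis]}.
    """
    scores = {}
    start = list(dict.fromkeys(usage_pages))  # drop duplicate usages
    seen = set(start)
    queue = deque((title, 0) for title in start)

    while queue:
        title, depth = queue.popleft()
        page = pages.get(title)
        if page is None:
            continue

        category = page.get("commonscat")
        if category is not None:
            # Stop condition: a link to Commons was found on this branch.
            if category not in blacklist and category not in hidden:
                # Assumed scoring: earlier hits weigh more, and repeated
                # hits for the same category add up.
                scores[category] = scores.get(category, 0.0) + 1.0 / (depth + 1)
            continue

        if depth >= max_depth:
            continue  # assumed safety limit so the search always ends

        # No Commonscat link: follow the categories and interwikis.
        for neighbour in page.get("links", []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))

    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data for one image used on two pages.
pages = {
    "en:Rijksmuseum": {"commonscat": "Rijksmuseum", "links": []},
    "nl:Nachtwacht": {
        "commonscat": None,
        "links": ["nl:Categorie:Rijksmuseum", "en:The Night Watch"],
    },
    "nl:Categorie:Rijksmuseum": {"commonscat": "Rijksmuseum", "links": []},
    "en:The Night Watch": {"commonscat": "The Night Watch", "links": []},
}
usage = ["en:Rijksmuseum", "nl:Nachtwacht"]

print(suggest_categories(usage, pages))
# [('Rijksmuseum', 1.5), ('The Night Watch', 0.5)]
```

The search is written as a breadth-first loop rather than literal recursion, so that pages closest to the image are visited first and a page is never processed twice. The blacklist and the set of hidden categories are passed in by the caller; how they would be built is left open here.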