Commons:Bots/Requests/FastilyBot


FastilyBot (talk · contribs)

Operator: Fastily (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: I maintain several database reports at toolforge:fastilybot-reports. Unfortunately, the WMF devs have decided to disable cross-wiki table joins (without a replacement), which severely impacts my ability to maintain these reports. As a result, several of my enwp bot tasks (specifically those that assist admins with files transferred from enwp to Commons) will be unable to run. I'm requesting the bot flag so that I can make large queries via the API in a performance-friendly way (e.g. reducing the overall number of HTTP round-trips by requesting 5000 entries at a time instead of 500). This bot task will never make any edits to Commons, so any single edit made by the bot would be a valid reason to block it.
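
To illustrate the read-only access pattern, here is a minimal sketch of paginating a large query via the API, assuming a plain requests session; the allpages listing is illustrative rather than the bot's actual query, and authentication (which the higher limit requires) is omitted:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def all_titles(namespace=6):
    """Yield every page title in a namespace, following API continuation.

    Without apihighlimits, aplimit=max resolves to 500 titles per request;
    with the bot flag it resolves to 5000, cutting HTTP round-trips tenfold.
    """
    session = requests.Session()
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    while True:
        data = session.get(API, params=params).json()
        yield from (p["title"] for p in data["query"]["allpages"])
        if "continue" not in data:
            break
        params.update(data["continue"])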

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): Weekly

Maximum edit rate (e.g. edits per minute): None

Bot flag requested: (Y/N): Y

Programming language(s): Python

Thanks, FASTILY 03:15, 2 April 2021 (UTC)

Discussion
  • It seems astonishing that "volunteer usability" rates so low in the WMF dev decisions to drop critical tables and make existing table indexes ever more convoluted in the asserted need to keep the architecture stable, in this case letting the physical break-up override the unified virtual environment. Would it be viable to generate separate tables, perhaps on a separate volunteer-run database, for these missing parts and missing indexes? Even if they were radically laggy, say up to a week out of date, they would probably be almost as useful in the long term. I ask because, in my experience, building API reporting as an alternative to SQL can be an order of magnitude slower and massively more processing-expensive. -- (talk) 08:15, 2 April 2021 (UTC)
    I agree that a solution at the database level would be preferable, but I don't see it happening. That aside, I'd like to avoid having more read-only bots here and would prefer to handle these cases with the account creator flag, adding apihighlimits to it. Does that make sense? --Krd 08:35, 2 April 2021 (UTC)
    The account creator right does not contain apihighlimits, otherwise I would have already requested it. Please see Special:ListGroupRights. -FASTILY 10:41, 2 April 2021 (UTC)
    That's why I said "adding apihighlimits to it". That should be arrangeable, IMHO. If it takes too long, we can assign interim bot rights. --Krd 11:57, 2 April 2021 (UTC)
    Oh okay, that makes sense. Thanks for clarifying -FASTILY 21:07, 2 April 2021 (UTC)
    https://phabricator.wikimedia.org/T279226 --Krd 07:21, 3 April 2021 (UTC)

@Fastily: I'd like to make some clarifications here. Cross-database joins were not disabled for the sake of disabling them. Previously, the Toolforge database replicas ran on a single database server, which is the only reason cross-database joins were possible at all. With the new change, the replicas were split by database section across multiple database servers to improve performance and scalability, which has actually had a pretty positive impact (for instance, https://xtools.wmflabs.org/globalcontribs/Martin_Urbanec used to take minutes to load; now it's pretty fast).

I would also like to note that it should still be possible to do "cross-wiki joins" at the application level. For instance, if you want images uploaded to Commons by cswiki admins, you can first get the usernames of the cswiki admins (SELECT user_name FROM user_groups JOIN user ON ug_user=user_id WHERE ug_group="sysop") and then run a second query, SELECT actor_name, log_title FROM logging JOIN actor ON actor_id=log_actor WHERE log_type='upload' AND actor_name IN (...), replacing ... with the usernames fetched in the previous step. Putting this all into a small bash script:

urbanecm@tools-sgebastion-07  ~/tmp
$ cat script.sh
#!/bin/bash

# Fetch cswiki sysop usernames; drop the header row, wrap each name in
# double quotes, and join them into a comma-separated list
usernameList=$(echo 'SELECT user_name FROM user_groups JOIN user ON ug_user=user_id WHERE ug_group="sysop";' | sql cswiki | sed 1d | sed 's/^/"/g;s/$/"/g' | tr '\n' ',' | sed 's/,$//g')
# Feed that list into a second query against the Commons replica
echo "SELECT actor_name, log_title FROM logging JOIN actor ON actor_id=log_actor WHERE log_type='upload' AND actor_name IN ($usernameList) LIMIT 10" | sql commonswiki
urbanecm@tools-sgebastion-07  ~/tmp
$ time bash script.sh
actor_name      log_title
Mormegil        Coat_of_arms_of_the_Czech_Republic.png
Mormegil        Small_coat_of_arms_of_the_Czech_Republic.png
Mormegil        Flag_of_the_president_of_the_Czech_Republic.png
Mormegil        Seal_of_the_Czech_Republic.png
Mormegil        Small_coat_of_arms_of_the_Czech_Republic.png
Mormegil        Coat_of_arms_of_the_Czech_Republic.png
Mormegil        Knight_crop.JPG
Mormegil        George_W._Bush.jpeg
Mormegil        Blue_Screen_Phone.jpg
Mormegil        Nagasakibomb.jpg

real    0m0.694s
user    0m0.145s
sys     0m0.058s
urbanecm@tools-sgebastion-07  ~/tmp
$

I'm not sure which kinds of cross-wiki joins you usually need to perform, but hopefully this serves as a quick example of how to do a sort of "cross-database join". It actually performs better, too: when I changed the query to COUNT(*) and measured how long it ran, it took about 6 minutes, whereas the equivalent real cross-database join on the old replicas had been running for 8 minutes and still hadn't finished. --Martin Urbanec (talk) 12:27, 4 April 2021 (UTC)

Hi Martin Urbanec, thanks for sharing. My use case for cross-wiki "joins" is specifically related to generating the following database reports:
  • enwp files with a duplicate on Commons
  • enwp file description pages shadowing a file description page on Commons
  • enwp files shadowing a Commons file where the enwp file is not a duplicate of the Commons file
  • orphaned file talk pages on enwp (where a file with the same title does not exist on Commons)
Needless to say, this is difficult to do client-side because the datasets are *so* large. I'm investigating workarounds (one client-side approach is sketched below), and this BRFA is one possible solution. -FASTILY 02:16, 6 April 2021 (UTC)
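
To make the scale problem concrete, here is a hypothetical client-side version of the shadowing report, built on the same paginated listing idea as the sketch above; every name and endpoint here is illustrative:

import requests

def file_titles(api_url):
    """Yield every title in the File: namespace (ns 6) of the given wiki."""
    session = requests.Session()
    params = {"action": "query", "list": "allpages", "apnamespace": 6,
              "aplimit": "max", "format": "json"}
    while True:
        data = session.get(api_url, params=params).json()
        yield from (p["title"] for p in data["query"]["allpages"])
        if "continue" not in data:
            break
        params.update(data["continue"])

# A shadowed title is a File: page that exists on both wikis. Holding tens
# of millions of Commons titles in memory is exactly the scale problem
# described above; apihighlimits at least cuts the request count tenfold.
enwp = set(file_titles("https://en.wikipedia.org/w/api.php"))
commons = set(file_titles("https://commons.wikimedia.org/w/api.php"))
shadowed = sorted(enwp & commons)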
Actually, I have tried to do the same tasks myself on ru.wiki and haven't found any approach other than the one proposed in this request. I also do it through API requests from my bot account, but it is quite surprising (at least to me) that we don't have dedicated special pages with this information: it would be easier to gather (something run, for example, once a day on the server side), and it would be much easier for the community to process. @Martin Urbanec is there any blocker, or should this be posted as a task to Phabricator? I am sure it would be helpful for most big wiki projects, not just en.wiki or ru.wiki with their somewhat imperfect tools for this. rubin16 (talk) 09:31, 8 April 2021 (UTC)
I see, thanks for the context. I obviously don't object to this; I just wanted to note the other possible solutions I see. A DB-only way would be dumping the tables to your personal database (on the tools-db host) and processing them there, but this is fine as well.
@Rubin16: The project you suggest does sound like a cool thing to have. I don't see a technical blocker that would make it impossible. You could probably use the updateSpecialPages.php framework for that, similar to Special:MostLinkedTemplates and other such pages.
Best, Martin Urbanec (talk) 15:17, 8 April 2021 (UTC)
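
For reference, a rough sketch of the "dump the tables to your own database" idea mentioned above, assuming Toolforge's shared replica credentials; the Wiki Replicas host names follow the current naming scheme, and the ToolsDB database name is a placeholder:

import os
import pymysql  # available on Toolforge; credentials come from ~/replica.my.cnf

CNF = os.path.expanduser("~/replica.my.cnf")

# Source: the enwiki replica. Destination: a personal database on ToolsDB
# (s12345__reports is a placeholder for your own tool's database).
replica = pymysql.connect(host="enwiki.analytics.db.svc.wikimedia.cloud",
                          database="enwiki_p", read_default_file=CNF)
toolsdb = pymysql.connect(host="tools.db.svc.wikimedia.cloud",
                          database="s12345__reports", read_default_file=CNF)

# SSCursor streams rows from the server instead of buffering them all
with replica.cursor(pymysql.cursors.SSCursor) as src, toolsdb.cursor() as dst:
    dst.execute("CREATE TABLE IF NOT EXISTS enwp_files"
                " (title VARBINARY(255) PRIMARY KEY)")
    src.execute("SELECT page_title FROM page WHERE page_namespace = 6")
    while True:
        rows = src.fetchmany(10000)  # copy in batches to bound memory use
        if not rows:
            break
        dst.executemany("INSERT IGNORE INTO enwp_files (title) VALUES (%s)", rows)
toolsdb.commit()

Once both dumps live on ToolsDB, the original cross-database joins can be run there locally, at the cost of some lag.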
phab:T279688 rubin16 (talk) 16:20, 8 April 2021 (UTC)

Bot flag now set. Can be converted to another sufficient user right later, if required. --Krd 07:16, 9 April 2021 (UTC)

Thanks! -FASTILY 21:54, 9 April 2021 (UTC)