Commons:Bots/Requests/PeerBot


PeerBot (talk · contribs)

Operator: Mitar (talk · contributions · Statistics · Recent activity · block log · User rights log · uploads · Global account information)

Bot's tasks for which permission is being sought: The bot is used to fetch related data when processing Wikidata dumps. This data, together with Wikipedia dumps, is used to populate the experimental search engine index Mitar is working on.

Automatic or manually assisted: Runs automatically

Edit type: Runs intermittently

Maximum edit rate (e.g. edits per minute): No edits

Bot flag requested: No (or not sure; is apihighlimits associated with the bot flag?)

Programming language(s): Python

Mitar (talk) 06:18, 12 May 2021 (UTC)

Discussion

I am requesting bot permission because I would like to be able to do API queries with more than 500 IDs listed in the pageids parameter. From my understanding (after reading this page), bots can have apihighlimits, which is what I would like to ask for. I am still developing both the bot and my experimental search engine (all of it will be open source), but I could work slightly faster if I could fetch related data from Commons with a higher API limit. Mitar (talk) 06:18, 12 May 2021 (UTC)
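For illustration, a minimal sketch of the kind of batched query being described. The endpoint and parameters are the standard MediaWiki Action API; the batch size and User-Agent string are illustrative assumptions, since the per-request cap on pageids depends on whether apihighlimits is granted:

```python
# Minimal sketch of batched pageids queries against the Commons API.
# Assumptions: BATCH is illustrative (the actual cap depends on whether
# apihighlimits is granted); the User-Agent is a placeholder.
import requests

API = "https://commons.wikimedia.org/w/api.php"
BATCH = 500  # would be higher with apihighlimits

def fetch_pages(pageids):
    session = requests.Session()
    session.headers["User-Agent"] = "PeerBot (example sketch)"
    for i in range(0, len(pageids), BATCH):
        batch = pageids[i:i + BATCH]
        resp = session.get(API, params={
            "action": "query",
            "format": "json",
            "pageids": "|".join(str(p) for p in batch),
        })
        resp.raise_for_status()
        yield from resp.json()["query"]["pages"].values()
```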

How close is the project to completion? If it's in the starting stages, I think database dumps should be enough. Live updates become necessary only when it's close to production. --EugeneZelenko (talk) 14:22, 12 May 2021 (UTC)
I do not need live updates, but I do need to fetch additional metadata. For example, a dump has an item which references a file on Wikimedia Commons by ID. I need to fetch the URL of the thumbnail for that file and store it, so that I can show it in search results. I want to do that for all Wikidata items, and that is currently a bottleneck for me. Being able to increase this to 5000 entries per call would remove the bottleneck and let me update the index to a new dump faster. (The dump has just the file ID, not the file revision, so I have to check whether there is a newer file when I am processing a newer dump.) Mitar (talk) 17:48, 12 May 2021 (UTC)
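A sketch of the thumbnail lookup described above, using prop=imageinfo with iiurlwidth (which returns a thumburl) and a timestamp to detect newer file revisions. The thumbnail width and User-Agent are illustrative choices, not values from the request:

```python
# Sketch: fetch thumbnail URLs (and upload timestamps) for file page IDs.
# Assumptions: width=320 and the User-Agent are illustrative placeholders.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def thumbnail_urls(file_pageids, width=320):
    resp = requests.get(API, params={
        "action": "query",
        "format": "json",
        "pageids": "|".join(str(p) for p in file_pageids),
        "prop": "imageinfo",
        "iiprop": "url|timestamp",   # timestamp helps detect newer file versions
        "iiurlwidth": width,         # asks the API for a thumbnail URL
    }, headers={"User-Agent": "PeerBot (example sketch)"})
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return {
        pid: info["imageinfo"][0].get("thumburl")
        for pid, info in pages.items()
        if "imageinfo" in info  # skip missing/deleted pages
    }
```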
@Mitar: Can we see the result of the generated searches somewhere? --Krd 08:28, 19 July 2021 (UTC)
@Mitar:  ? --Krd 07:20, 4 August 2021 (UTC)
Not yet. I put the project on hold waiting for the decision here, because the implementation depends on it. After I finish the implementation, I am planning to make it public and available to everyone. Mitar (talk) 19:32, 12 August 2021 (UTC)
I sadly cannot follow. You said the code is not complete enough to show anything, but you are already blocked by the API limits? If we cannot see anything and only have the short description above, how can we make a decision here? --Krd 07:40, 18 August 2021 (UTC)
Hm. I started implementing this, and while doing so I hit the rate limit, so I now have to decide on the design going forward. If I can get a bot flag, I can use one design which is nicer and more real-time; otherwise I have to use another design that relies on database dumps. Further implementation depends on the decision here. I thought this decision would be relatively easy, given that the bot will only be reading things and not writing, so it will not create much extra work for the community. Once I know which direction to take, I can finish the implementation and show both the result and the code. Mitar (talk) 21:23, 31 August 2021 (UTC)
If a really large number of queries would be needed to implement this via the API, I'd strongly suggest doing it via the database on Labs, or via dumps. Different opinions welcome. --Krd 15:08, 5 September 2021 (UTC)
Correct me if I am wrong, but using the database on Labs would mean that I have to run the bot itself on Labs, no? Dumps have the downside that they are not real-time, so I could not update the search index as changes happen, which is what I hope to be able to do. I do not think many queries will be necessary; my main hope with the bot flag is to be able to list more `ids` in `pageids`. That would in fact allow me to do fewer queries: one, but a larger one. Mitar (talk) 16:19, 26 September 2021 (UTC)
I don't think the bot is required to run on Labs. You can perhaps tunnel DB connections through ssh or something similar. (I don't know if that is allowed; please double check before doing so!) --Krd 06:53, 1 October 2021 (UTC)
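For reference, a hedged sketch of what such a tunneled connection could look like, assuming a Toolforge account and the Wiki Replicas hostname pattern from the Toolforge documentation; the page ID queried is a hypothetical placeholder:

```python
# Sketch: query the Wiki Replicas over an SSH tunnel instead of the API.
# Assumes a Toolforge account and a tunnel opened beforehand, e.g.:
#   ssh -N -L 3306:commonswiki.analytics.db.svc.wikimedia.cloud:3306 login.toolforge.org
# Credentials come from the replica.my.cnf file that Toolforge provides.
import os.path
import pymysql

conn = pymysql.connect(
    host="127.0.0.1",  # local end of the SSH tunnel
    port=3306,
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    database="commonswiki_p",
)
page_id = 12345  # hypothetical page ID, for illustration only
with conn.cursor() as cur:
    # Example: check the latest revision of a file page by ID,
    # which is the kind of freshness check discussed above.
    cur.execute(
        "SELECT page_latest, page_touched FROM page WHERE page_id = %s",
        (page_id,),
    )
    print(cur.fetchone())
```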

Stale. --Krd 14:13, 9 November 2021 (UTC)

Reopened per request. --Krd 10:20, 12 February 2022 (UTC)
Thanks. Since I made this request, I have made some progress: I have open-sourced the code (though it is not yet deployed anywhere) and implemented it to use dumps. But neither dumps nor SQL tables provide all the data I need, so I still have to use API calls. Being able to do more requests in parallel would help me speed up indexing, so I would like to renew my request for the apihighlimits flag for the PeerBot user. Mitar (talk) 11:45, 12 February 2022 (UTC)
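A sketch of the parallel fetching pattern described here, issuing several API batches concurrently. The worker count and batch size are illustrative assumptions and would in practice need to respect the API etiquette guidelines:

```python
# Sketch: fetch several pageids batches in parallel to speed up indexing.
# Assumptions: batch_size and workers are illustrative; the real batch cap
# depends on whether apihighlimits is granted.
from concurrent.futures import ThreadPoolExecutor
import requests

API = "https://commons.wikimedia.org/w/api.php"

def fetch_batch(batch):
    resp = requests.get(API, params={
        "action": "query",
        "format": "json",
        "pageids": "|".join(str(p) for p in batch),
    }, headers={"User-Agent": "PeerBot (example sketch)"})
    resp.raise_for_status()
    return resp.json()["query"]["pages"]

def fetch_parallel(pageids, batch_size=500, workers=4):
    batches = [pageids[i:i + batch_size]
               for i in range(0, len(pageids), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pages in pool.map(fetch_batch, batches):
            yield from pages.values()
```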

Approved for apihighlimits. Please report when it is no longer needed. --Krd 05:47, 19 March 2022 (UTC)