Commons:Bots/Requests/Embedded Data Bot (alteration)
Embedded Data Bot (talk · contribs) (alteration)
Operator: Zhuyifei1999 (talk) and Steinsplitter (talk)
Bot's tasks for which permission is being sought: Alteration of Commons:Bots/Requests/Embedded Data Bot (adminbot): Remove condition #3 (require hit on Special:AbuseFilter/166 before deletion)
Pro:
- A lot more abusive uploads will be deleted faster, reducing manual work. Compare deleted contribs and delete log (note that the bot is now coded to tag before deleting in any case, because the description of the embedded data may get very long due to the recursive detection I added a few hours ago)
- The abuse filter (AF) can be hidden for anti-abuse purposes, though this may not be necessary if the bot no longer depends on it anyway
Con:
- Dunno... The deletion false positive rate will rise by a negligible (?) amount. I've yet to see a single false positive of "having embedded data" that also passes both condition #1 (file ending determined by a parser) and condition #2 (MIME sniffing of the embedded part shows an archive)
- And I might get even lazier about blocking accounts, since blocking usernames does little anyway, with all the VPNs being used to evade blocks :(
Automatic or manually assisted: Automatic unsupervised
Edit type (e.g. Continuous, daily, one time run): Continuous via RC
Maximum edit rate (e.g. edits per minute): 6 per min
Bot flag requested: (Y/N): N
Programming language(s): Python: Pywikibot
Zhuyifei1999 (talk) 14:26, 4 April 2017 (UTC)
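For readers unfamiliar with the two checks that would remain, conditions #1 and #2 from the request above can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the bot's actual code: it handles PNG only (chosen because its chunk layout makes the end of the file easy to locate), the magic-number table is a small illustrative subset, and the names `png_end_offset`, `sniff_archive`, `should_delete` and `make_chunk` are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Magic numbers of a few common archive formats (illustrative, not exhaustive).
ARCHIVE_MAGIC = {
    b"PK\x03\x04": "zip",
    b"Rar!\x1a\x07": "rar",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"\x1f\x8b": "gzip",
}


def png_end_offset(data):
    """Condition #1 (sketch): walk the PNG chunks and return the offset just
    past the IEND chunk, or None if the data is not a well-formed PNG."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        pos += 8 + length + 4  # chunk header + payload + CRC
        if pos > len(data):
            return None  # truncated chunk
        if chunk_type == b"IEND":
            return pos
    return None


def sniff_archive(blob):
    """Condition #2 (sketch): return an archive type name if the blob starts
    with a known archive magic number, else None."""
    for magic, name in ARCHIVE_MAGIC.items():
        if blob.startswith(magic):
            return name
    return None


def should_delete(data):
    """True only if there is trailing data after the parsed end of the image
    AND that trailing data looks like an archive."""
    end = png_end_offset(data)
    if end is None or end >= len(data):
        return False  # unparseable, or nothing after the image
    return sniff_archive(data[end:]) is not None


def make_chunk(chunk_type, payload=b""):
    """Helper for the demo below: build one PNG chunk with a valid CRC."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


if __name__ == "__main__":
    clean = PNG_SIGNATURE + make_chunk(b"IHDR", b"\x00" * 13) + make_chunk(b"IEND")
    print(should_delete(clean))                     # no trailing data
    print(should_delete(clean + b"PK\x03\x04..."))  # trailing data that looks like a ZIP
```

The sketch deliberately requires both checks to agree before returning `True`, mirroring the point made in the request that a false positive would have to pass the parser and the sniffing step at once. It does not verify chunk CRCs or handle any format other than PNG.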
Discussion
- Unlike other bots, where minor alterations may be made without going through this full process, I thought a full request would be better here because of the bot's adminbot nature. It will be implemented once it is approved. --Zhuyifei1999 (talk) 14:26, 4 April 2017 (UTC)
- Support Sounds great, and would make all of our jobs easier in the long run... TJH2018 talk 15:16, 4 April 2017 (UTC)
- The way this works seems established, so there are no issues I can see here. I would like there to be a better general understanding of why these files are on the increase and what drives this change. Someone mentioned that Wikipedia Zero is part of the increase, due to people wanting to fileshare on the cheap, but I don't know if this is backed up by hard facts. It would make an excellent tech blog post if someone wanted to invest time in writing it up. Thanks --Fæ (talk) 10:50, 6 April 2017 (UTC)
- It looks periodic to me. Angola case (audio/video), then Bangladesh case (video), then Myanmar / Telenor case (video & embedded data) (though I can't see the OTRS stuff), and now a somewhere-Arabic-speaking case (embedded data). I suspect that it's a cycle: some place starts spreading the word about free piracy here => mass abuse => a lot of work gets done against them => things slowly quiet down. I don't know all the facts about this (nor am I a good writer), so it's better left to someone else --Zhuyifei1999 (talk) 11:46, 6 April 2017 (UTC)
- I think checkusers should be involved. Could the bot keep a history and detect any kind of data patterns, such as the alphabets/scripts used in file names and descriptions? --EugeneZelenko (talk) 14:06, 6 April 2017 (UTC)
- Special:DeletedContributions/Embedded_Data_Bot can be used as the logs. Currently the languages in the filenames appear to include French, English, Arabic (both in their own script and latinized), and random characters. --Zhuyifei1999 (talk) 14:27, 6 April 2017 (UTC)
- But it's necessary to analyze the deleted files. Maybe we should consider machine learning to find correlations? --EugeneZelenko (talk) 14:03, 7 April 2017 (UTC)
- Machine learning is unfortunately not within my ability :/ --Zhuyifei1999 (talk) 14:52, 7 April 2017 (UTC)
- I think it would be a good idea to describe this task on Wikitech-l or a dedicated bot owners' list. Maybe somebody else has solved, or considered solving, similar problems? --EugeneZelenko (talk) 15:02, 9 April 2017 (UTC)
- See phab:T48921 and phab:T12847 (WMF-ers commented in these tickets as well); it's not that easy. And I don't imagine many bot owners know how to do machine learning. --Zhuyifei1999 (talk) 15:16, 9 April 2017 (UTC)
- Machine learning typically needs a large number of files (ABF 166 has only 1,549 hits). It is also a statistical method, which means false positives. Time would be better spent improving the bot's parsers. --Dispenser (talk) 18:37, 10 April 2017 (UTC)
- My suggestion was to clarify material for checkusers. If the parser can be good enough for this task, that would be great. --EugeneZelenko (talk) 14:12, 14 April 2017 (UTC)
- Support I only envisioned ABF 166 as a way to make manual review feasible. Integrated into a bot like this, it only prevents deleting 1) small files (pirated games), 2) uploads by sleeper accounts, or 3) uploads by other legitimized throwaway accounts. --Dispenser (talk) 21:42, 6 April 2017 (UTC)
No objections, minor issues will be addressed. Approved. --Krd 06:31, 24 April 2017 (UTC)