Commons talk:SPARQL query service/Upcoming General Availability release

From Wikimedia Commons, the free media repository

SUL authentication?

Will the authentication be a new separate system or will it be a seamless SUL experience? Ainali (talk) 18:36, 16 December 2021 (UTC)[reply]

Mandatory authentication considered harmful

Regarding this part of the announcement:

The biggest change to user behavior will be the requirement for user authentication to use all endpoints.

To my recollection, this is the only instance of needing to be authenticated to experience the main corpus of Wikimedia content. So this is a major policy shift. There are a number of concerns:

Endangered species? Are publicly viewable knowledge graphs like this at risk with WCQS locked up behind an authentication system?
  • Restricted reading. We've come so far in finally spending the time to model and add millions of statements to Commons (huzzah!) and we finally have a usable query service for it (huzzah!) and then for the last mile, we're restricting access to it by instituting authentication? For a community that has "open by default" as an ethos, it feels like such an "own goal" misstep here.
  • Tools implications. I'm thinking about the number of tools, scripts, and utilities that utilize SPARQL queries via Wikidata/WDQS and have given us tremendous capabilities... and the same approach or set of activities cannot be realized for WCQS because of this constraint. We should not underestimate the headache of having to implement OAuth2 for each and every SPARQL query. I'm also puzzled how a service that has not even launched yet has to be this closed when none of our other APIs and services have started this way. Other tool creators have shared this common concern at this Phabricator thread T290300.
  • Public perception. In terms of public outreach, especially for our GLAM-Wiki work, this is hard to swallow and reconcile with what we are evangelizing. As we are asking cultural and heritage partners to open up their collections and to share their metadata, we are doing so with the expectation of showcasing the benefits of open knowledge to the world. Or we thought we were. With this WCQS policy, every mention of "open content" and "open access" will require an asterisk. This will introduce an asymmetry in contributing content and experiencing its benefits.
  • Alternative solutions. I am sympathetic to the complex support issues when any service is made available for public access, whether it's the Mediawiki API or a SPARQL endpoint. However, our "open by default" ethos is a core tenet for the movement and for equitable access to knowledge. Like-minded entities like openverse have found ways to have different tiers of access, while not requiring API keys. We should bend over backward to find "least restrictive" solutions such as throttling or limiting call frequency before we completely block access with mandatory authentication.

Thanks. - Fuzheado (talk) 19:59, 16 December 2021 (UTC)[reply]

Agree, filed phab:T297995. Multichill (talk) 17:01, 18 December 2021 (UTC)[reply]
 Support - We really shouldn't be restricting access to knowledge like this. Husky (talk to me) 22:18, 18 December 2021 (UTC)[reply]
 Support - Plenty of other fallback options. Perhaps worth avoiding this [anti]pattern in betas as well, so that those other options are consistently tested. --SJ+ 23:21, 18 December 2021 (UTC)[reply]
 Neutral I'm aware of the benefits that required authentication could bring (I read them in the Project Page). I want to mention that it will also add significant friction for a user to experiment with WCQS, an important step in gaining the user's interest in the project.
Speaking for myself, I'm an undergraduate student in Computer Science. I was introduced to Wikidata in my database class by being shown query examples. I was able to experiment with them by just copying them into WDQS, and this gained my interest in Wikidata. I'm sure that if, at that time, I had had to create an account, I would probably have just paid attention to the class instead of experimenting on my own, which could have ended up with me not being so interested in Wikidata.
Rdrg109 (talk) 16:24, 20 December 2021 (UTC)[reply]
 Support I also believe the big draw of Wikidata is WDQS. Jimman2003 (talk) 18:39, 22 December 2021 (UTC)[reply]
 Support Additionally, the authentication process often doesn't work and a 403 error is produced. The Wikidata Query Service, without authentication, doesn't have this problem.--Pere prlpz (talk) 09:28, 21 March 2023 (UTC)[reply]

Edit latency + other metrics of use

Is edit latency the only service objective tracked? Is it the top priority for the WD and WC services, or just a natural thing to track first?

What about...

  • % of queries that time out,
  • % of sessions that end without a successful query,
  • % of sessions that end without entering a query at all,
  • average Time To First Query (e.g., time taken to log in if needed, load the interface, and enter and run a query),
  • others?

From the feedback at the Wikidata conference, some people commented that they were not latency-sensitive. Some comments on the ticket listed here suggest the same (including from tool devs, people who teach with these queries, and editors). --SJ+ 00:01, 19 December 2021 (UTC)[reply]

Lessons for scaling future federated endpoints?

I appreciated this passage, though I don't fully understand the implications:

Implementing user authentication helps us avoid these same issues on WCQS, and gives us more precise tools for pinpointing and banning user queries causing issues that can lead to service outages. In the future, the API Gateway will give us additional tooling that will help us improve the ease of using WCQS.
  • What are the available tools for pinpointing problematic queries; how do these get better w/ persistent auth? What could a gateway do that can't be done now?
  • Say WCQS requires auth, and some communities need an option without it. What would you estimate are the primary requirements for maintaining an auth-free endpoint? Say w/ high edit-latency, but similarly monitoring queries to stop those causing issues before they take down the service.

Warmly, --SJ+ 01:25, 19 December 2021 (UTC)[reply]


Follow up response to WCQS Authentication

Thank you all for your responses, questions and continued dedication to WCQS – we (WMF Search and other teams involved with WCQS) appreciate hearing your thoughts around the authentication announcement for the upcoming WCQS production release, and acknowledge that it is an issue currently being escalated for further discussion (likely after WMF staff return from vacation in January). In the meantime, we’ll try our best to give a little more context below around why we decided to implement authentication.

The rationale for WCQS authentication is a bit more complex than just blocking abusive users. It is motivated by a strategy centered on equitable access and protected reliability.

Our experience with Wikidata Query Service (WDQS), which serves as a base for how WCQS is built, makes us very concerned about exposing such a fragile service to the internet without some constraints to ensure stability, security and functionality.

Knowledge graphs are hard. WDQS as it currently exists is very brittle, subject to crashing, slowdowns and various issues. It is a service that exposes extremely powerful capabilities to users, with the possibility of any user abusing them (here we use “abusing” in a broad sense): using the service for things it was not designed for and is not supposed to support, sending more load than is reasonable, etc.

Knowledge is more equitably accessible when query abusers are prevented from inundating the service to the point where other users are unable to access it or experience long wait times. This gives users a consistent experience regardless of skills, location, etc. Wikipedia and other wiki projects already comprise part of the knowledge backbone of the internet, and our goal is for this foundation to grow over time, with large corporate entities using our services – we do not want a world in which a user like Google or Apple can prevent equitable access for everyone else.

“Abusive” activities can cause, and have caused, the service to fail, requiring days to weeks to detect the failure, identify a likely source, take action to resolve the issue, and potentially reload the data (there is an element of luck in each of these steps). Not only are these issues a problem for users, but they prevent the Search team currently responsible for maintaining and scaling the query services from properly scaling the query service infrastructure and building new features that improve its general usability and functionality.

If we had to start such a service today, we might not do it at all, because it is very hard to sustain. Since WDQS and WCQS already exist, we choose to support them into the future.

One of the things we are struggling with on WDQS is managing load from users. We have some level of throttling / rate limiting to prevent a single user from overloading the service. But without authentication, we are limited to relying on things like user-agent and IP addresses, which are a fairly primitive proxy for actual users.

For WCQS, we want to start from the beginning with authentication, so that when problems start to occur (and they will if this service has any kind of success), we are ready to react. Adding authentication after the fact is really hard as it would break all existing tools and workflows. So while it is not strictly necessary right now to have authentication on WCQS, it is very likely to be crucial in the future.

Our goal has never been to restrict access for the sake of restricting access. It has been to ensure fairness and reliability.

Q: Is it possible to commit to a lower Service Level Objective (SLO) if it means not having authentication?

Technically yes. However, without any tools like authentication to manage abusive behavior, it is unlikely we will be able to commit to any meaningful SLO at all, as all reliability maintenance would need to be investigated manually and handled individually. As we have already seen, this can result in multiple days of downtime, with the potential to take even longer as the nature of the problem and team availability dictate.

Q: Is it possible to have two levels of access - full access for authenticated users, and rate-limited access for everyone else?

This is a great idea, but it is technically very difficult (potentially impossible). We want to rate-limit individual users, so that everyone puts the same load on the service and no single user can overload the system. For example, we want to allow each user to make 10 requests per second, but no more than that - but to implement this restriction, we need to be able to link a request to a user. And that means authentication.

Putting aside technical difficulty for a moment, this is the kind of feature idea we would like to be able to build for a better overall service for users, but currently do not have the bandwidth to address, due to the maintenance attention currently required to keep our query services from falling over. Having a more reliable service would allow our teams more time to properly design and develop features like this.
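For illustration, the per-user limit described in the answer above (e.g., 10 requests per second per user) is typically implemented with a token bucket keyed by some client identity. A minimal sketch, assuming an in-memory bucket per key (the class and names here are illustrative, not WMF's actual throttling code):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second per key, with bursts up to `burst`."""

    def __init__(self, rate=10.0, burst=10.0):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # key -> (remaining tokens, timestamp of last request)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill tokens in proportion to the time elapsed since the last request.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

The whole debate reduces to what `key` is: an authenticated user ID gives one bucket per person, while an IP or IP+User-Agent key conflates users behind shared addresses and lets a single user spread load across many addresses.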

Q: Can you monitor queries instead of users to prevent issues before they take down the service?

No. We wish this were possible, but the very power of SPARQL has so far made it impossible to detect problematic queries before they are executed. Once a problematic query is executed, it can lock up Blazegraph and cause it to become unresponsive – because we are unable to kill the query, we are forced to restart Blazegraph manually each time, causing user-wide disruption and taking a lot of emergency time and energy from the WMF team to resolve (at the expense of other improvements, features and projects).

Thanks, and stay tuned for follow up conversations. MPham (WMF) (talk) 15:05, 20 December 2021 (UTC)[reply]

"we want to allow each user to make 10 requests per second, but no more than that - but to implement this restriction, we need to be able to link a request to a user. And that means authentication."
Of course it doesn't! It requires identity, but it doesn't require strong identity, and it certainly doesn't require authentication. If this is just for rate limiting, you can bind it to something as rudimentary as IP space. Andy Dingley (talk) 16:29, 20 December 2021 (UTC)[reply]
We are currently using a combination of IP addresses and User-Agent for throttling / rate limiting on WDQS. This solution is highly problematic. Multiple users can be using the same IP address (school proxy, mobile connections, etc.), and a single user can use multiple IP addresses. We've seen workloads from AWS or Google Cloud using a number of IPs. We have already had instances where, in an emergency, we had to block an entire cloud provider because of what looked like a single misbehaving user. This is obviously not something we want to replicate. GLederrey (WMF) (talk) 18:35, 20 December 2021 (UTC)[reply]
If you have to, there are techniques for sticking a temporary token onto the client (a temporary cookie, and their many variants). Yes, a simplistic IP-only solution is simplistic. But they're all better than requiring authentication. Andy Dingley (talk) 18:49, 20 December 2021 (UTC)[reply]
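One way to read the suggestion above: the rate limiter needs a stable key, not a verified identity, so an anonymous session token stored in a cookie could serve as that key. A minimal sketch under that assumption (the cookie name and helper are hypothetical; as noted in the discussion, a client can simply discard the cookie, so this is weaker than authentication but adds no login friction):

```python
import secrets

SESSION_COOKIE = "wcqs_session"  # hypothetical cookie name

def session_key(cookies, set_cookie):
    """Return a per-client key for rate limiting: reuse the session cookie
    if the client presents one, otherwise mint an anonymous random token and
    ask the client to store it. No login is involved; the token exists only
    to give the rate limiter a stable key per client."""
    token = cookies.get(SESSION_COOKIE)
    if token is None:
        token = secrets.token_urlsafe(16)
        set_cookie(SESSION_COOKIE, token)
    return token
```

A client that honors cookies then gets one consistent rate-limit bucket across requests, regardless of how many IP addresses it uses; a client that drops the cookie degrades back to IP-based limiting.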
"WDQS as it currently exists is very brittle, subject to crashing, slowdowns and various issues."
"If we had to start such a service today, we might not do it at all, because it is very hard to sustain."
This, combined with the rest of the post, makes it sound like WDQS is incredibly problematic to maintain and has a ton of downtime, but to most end users WDQS works like a charm and provides a ton of value. Abbe98 (talk) 17:00, 20 December 2021 (UTC)[reply]
Speaking as a Search team SRE (Search owns WDQS and WCQS), WDQS is incredibly problematic from an operational perspective. We've had at least a few total or near-total outages per year (exs: https://wikitech.wikimedia.org/wiki/Incident_documentation/2020-07-23_wdqs-outage and https://wikitech.wikimedia.org/wiki/Incident_documentation/2020-09-02_wdqs-outage), each of which requires frantically spelunking through logs/visualizations to try to guess which [unauthenticated, of course] users/queries are likely causing Blazegraph to lock up. Then we start banning the user agents and hoping that we chose the right user to ban. But that's just the full outages we've dealt with; we also more regularly have single hosts get their Blazegraph locked up, which can be seen in our time series metrics as the host failing to report metrics (for example, if you glance at wdqs1006 during today's window you'll see it failing to report its triple count for several hours: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&var-cluster_name=wdqs&from=1639955298258&to=1640030089050). A single host dropping offline for several hours isn't the end of the world (user impact is pretty minimal) but it's worth noting as a sign of Blazegraph's propensity to fall into deadlock under our normal usage patterns.
WDQS provides incredible value to the community and it's a very impactful and frankly very "cool" service. It is just unfortunately much harder to scale than something like Elasticsearch, which is almost infinitely horizontally scalable and has much more bounded computational needs than those of a knowledge graph engine like Blazegraph. Basically, what I'm trying to convey is that the appearance of the service working like a charm and providing a ton of value to the everyday user is not the same thing as the internal reality of what it actually takes to keep the service running and healthy from the operators' perspective, particularly when understood in the context of our team having multiple different services/projects/new features to balance.
Thanks for your perspective. My intention is/was to provide a bit of context on what it looks like behind the scenes to be operating these services. I hope you found it helpful. RKemper (WMF) (talk) 20:09, 20 December 2021 (UTC)[reply]
Thank you! The example showing how wdqs1006 locked up today illustrates your point very well, as it shows that the struggle is ongoing day-to-day. This is exactly the type of context the communication has been missing. Abbe98 (talk) 21:30, 20 December 2021 (UTC)[reply]
I am sorry to read that WDQS is incredibly problematic from an operational perspective. My motivation to work on Wikidata is dependent on the query service. Without it I can’t monitor changes or gaps. On Commons I would do more with SDOC if the query service was easy to use in combination with WDQS. I hope these issues can be resolved so that we can all enjoy a robust query service for both Wikidata and Commons. Jane023 (talk) 18:47, 21 December 2021 (UTC)[reply]
Question: what will the consequences of this be for federated SPARQL queries? (including the mission-fundamental case of queries federated between WCQS and WDQS in either direction, but also queries on WCQS and WDQS requested as sub-queries by external SPARQL endpoints). Has the team sketched out thoughts for this? My (very limited) understanding is that the current SPARQL standard does not offer any facility for access tokens etc. to be passed in alongside sub-query requests (cf. e.g. this thread on the draft SPARQL 1.2 'issues for consideration' board). Does the team have a cunning plan to nevertheless make this possible? Jheald (talk) 19:26, 21 December 2021 (UTC)[reply]
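To make the federation concern above concrete: in a federated query, the outer endpoint (not the client) issues the SERVICE sub-request, so any token attached by the client authenticates only the first hop. A hypothetical sketch of the client-side request (the endpoint URLs are the public ones; the bearer token and the query itself are illustrative assumptions):

```python
from urllib.parse import urlencode
from urllib.request import Request

# A federated query: run on WDQS, with a sub-query delegated to WCQS.
QUERY = """
SELECT ?file ?item WHERE {
  SERVICE <https://commons-query.wikimedia.org/sparql> {
    ?file wdt:P180 ?item .   # 'depicts' statements live on Commons
  }
  ?item wdt:P31 wd:Q5 .      # filter to humans on Wikidata
}
"""

def build_request(endpoint, query, token=None):
    """Build the HTTP request a client would send to the *outer* endpoint.
    The Authorization header covers only this hop; SPARQL 1.1 defines no
    standard way to forward credentials along with the SERVICE sub-request."""
    headers = {"Accept": "application/sparql-results+json"}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers=headers)

req = build_request("https://query.wikidata.org/sparql", QUERY, token="EXAMPLE")
```

If the sub-query direction were reversed (outer query on WCQS, SERVICE call to WDQS), the authenticated hop would be the WCQS one, and the anonymous SERVICE request to WDQS would still work today only because WDQS does not require authentication.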
 Support I support user authentication, or some form of it that allows the team to know who is causing service disruptions for all. It's a small price to pay to ensure free availability for all. Since the Search team has looked into this already quite thoroughly and stated they see no alternative to knowing which specific users are issuing problematic queries that can bring down the service for everyone, I think that to ensure minimum availability to all users, authentication should be required. Without some form of individual identification through an authentication mechanism of their choosing, the team says that service degradation can still occur, but we would not have a clear signal about who might be causing a problem, to help correct the behavior or address that user's needs. Folks, this is no different than other API service providers with public endpoints requiring an API user key or token for identification in case it's needed to contact that user. Being anonymous with your queries is still an option for anyone who desires it: run your own service, or download the dumps and query with Wikibase or something like KGTK--Thadguidry (talk) 21:07, 28 January 2022 (UTC)[reply]
Late reply, but wanted to address this: "It's a small price to pay to ensure free availability for all." I disagree. It may be still free as in beer, but it's not truly free as in freedom. The barrier to entry is significant and introduces the rare case in the Wikimedia movement where login is required to read content. That is a drastic shift in ethos, and a very surprising one. - Fuzheado (talk) 10:01, 21 April 2022 (UTC)[reply]