Commons:SPARQL query service/API endpoint

Overview

Because WCQS is an authenticated service, it is currently not straightforward to use as an API endpoint in the way you might expect if you are familiar with Wikidata's unauthenticated endpoint.

Nuts and Bolts

Authentication is managed through two cookies:

  • wcqsOauth - A long-lived cookie that WCQS uses to authenticate the user against commons.wikimedia.org. This cookie is used to refresh the wcqsSession cookie.
  • wcqsSession - A short-lived cookie holding a JWT that grants query access to WCQS. This cookie is provided in a 307 redirect to any request that has a valid wcqsOauth cookie but no valid wcqsSession cookie.

Practical usage

  • The wcqsOauth cookie must be provided by the programmer.
  • The wcqsSession cookie will be issued every few hours in a 307 redirect based on the wcqsOauth cookie.
  • Bots will need to use an HTTP library with a cookie jar capable of tracking the Set-Cookie headers for wcqsSession and providing the value in future requests (see the sketch below).
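
To see this mechanism in action, here is a minimal Python sketch (assuming a valid token is stored in the WCQS_AUTH_TOKEN environment variable) that disables redirect following so the 307 response and its Set-Cookie header are visible:

import os
import requests

# With redirects disabled, a request carrying only wcqsOauth should come
# back as a 307 whose Set-Cookie header holds the short-lived wcqsSession JWT.
resp = requests.post(
    'https://commons-query.wikimedia.org/sparql',
    data={'query': 'SELECT ?file WHERE { ?file wdt:P180 wd:Q42 . } LIMIT 1'},
    headers={'User-Agent': 'DocExampleBot/1.0', 'Accept': 'application/json'},
    cookies={'wcqsOauth': os.environ['WCQS_AUTH_TOKEN']},
    allow_redirects=False,
)
print(resp.status_code)                  # expect 307
print(resp.headers.get('Set-Cookie'))    # expect wcqsSession=<JWT>; ...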

FAQ

How do I get a wcqsOauth cookie?

There is currently no easy way to access the cookie programmatically; it must be retrieved manually from a web browser after visiting https://commons-query.wikimedia.org.

The wcqsOauth cookie is provided in a Set-Cookie header by WCQS after authenticating the user against MediaWiki. Unfortunately, there are no APIs that expose this value.

Once a browser has successfully authenticated, open the browser's settings menu, view the cookies for the site (typically found under privacy settings), and look for the wcqsOauth cookie. The value should consist of two hexadecimal strings separated by a period (example: b8d15147d1cfed1129f1f7f38e2acb03.9f8b5d50d1861bf2198ead63ff345c4a94986c61).
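
As a quick sanity check, a token can be tested against that shape. This is an illustrative sketch only; the exact token format is not formally documented:

import re

def looks_like_wcqs_token(token: str) -> bool:
    # Two hexadecimal strings joined by a single period, as in the example above.
    return re.fullmatch(r'[0-9a-f]+\.[0-9a-f]+', token) is not None

print(looks_like_wcqs_token(
    'b8d15147d1cfed1129f1f7f38e2acb03.9f8b5d50d1861bf2198ead63ff345c4a94986c61'))  # True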

Below are step-by-step directions for both Firefox and Chrome:

Firefox
  • Visit https://commons-query.wikimedia.org/ and make sure you are logged in
  • From the menu, choose Tools -> Browser Tools -> Web Developer Tools (or press F12).
  • An inspector should pop up at the bottom.
  • Click on the "Storage" tab and expand "Cookies."
  • There you should find "https://commons-query.wikimedia.org"
  • Copy the value of the "wcqsOauth" cookie
Chrome
  • Visit https://commons-query.wikimedia.org/ and make sure you are logged in
  • Go to Settings or Preferences
  • Choose "Privacy and security"
  • Choose "Cookies and other site data"
  • Choose "Set all cookies and site data"
  • In the "Search cookies" box, type "commons-query"
  • Click on commons-query.wikimedia.org (may say "3 cookies")
  • Expand the entry with wcqsOauth
  • Copy the cookie in the "Content" field

How long is the wcqsOauth cookie valid?

The wcqsOauth cookie is valid until the user visits Special:OAuthManageMyGrants and revokes access to the Wikimedia Commons Query Service. Logging out of Commons has no effect on this.

Why am I getting an HTML page as a response?

This typically occurs when the provided wcqsOauth cookie is invalid and the HTTP request was redirected to commons.wikimedia.org to approve a new session. To resolve, verify that the wcqsOauth cookie is being sent and is populated with an appropriate value.
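
One way to detect this in code is to check the Content-Type before parsing. A sketch, assuming the response object from the Python requests example below:

# SPARQL results arrive as JSON; an HTML body means authentication failed.
content_type = response.headers.get('Content-Type', '')
if 'html' in content_type:
    raise RuntimeError(
        'Received an HTML page instead of query results; '
        'check that the wcqsOauth cookie holds a valid value')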

Why is my HTTP client in a redirect loop?

This typically occurs when the HTTP client does not parse Set-Cookie headers and provide them in follow-up requests. To resolve, verify that a cookie jar has been configured in the HTTP client library.
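
With the requests-based session from the Python example below, the jar can be inspected after the first query to confirm it is working (sketch):

# requests.Session tracks Set-Cookie headers automatically; after one
# successful query the jar should hold the short-lived session cookie.
if session.cookies.get('wcqsSession') is None:
    raise RuntimeError('Cookie jar never captured wcqsSession; check client configuration')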

Are there rate limits or other controls?

Full question: Since the user is authenticated and can be tracked, are there rate limits or other controls?

There are no automated controls in place related to the authentication. The system has the ability to reject requests from specific authenticated users, and users will be added to this list on an as-needed basis. This is intended to replace the method used on WDQS, where we attempt to ban abusive requests using either IP address ranges or User-Agent strings. Bans are not in any way automated; historically, WDQS bans have been applied when the system is falling over due to a single over-aggressive consumer.

Are users tracked?

The username associated with the authenticated session is included in backend request logging. It is recorded in the performer.user_text field of the /sparql/query schema, in the wcqs-external.sparql-query event stream.

Can I do authorization in one click?

Full question: If you must be logged in via browser because the user needs to click the "allow" button, can a URL be generated that can open the authorization popup on Commons in one click, or must the user load WCQS and then allow it to redirect them to the permission screen?

No. Behind the scenes, WCQS uses MediaWiki's OAuth integration. To create a new OAuth grant, WCQS needs to send a signed request to MediaWiki, parse the response, then send the user to Commons with specific query string parameters provided by MediaWiki in the response to the initial signed request. This can only be done by visiting WCQS.

Show me the code

All examples assume the wcqsOauth cookie value from above has been stored in the environment as WCQS_AUTH_TOKEN.

PHP

cURL requires cookies to be stored in a file on disk.

<?php

$endpoint = 'https://commons-query.wikimedia.org/sparql';
$query = <<<EOD
    SELECT ?file WHERE {
        ?file wdt:P180 wd:Q42 .
    }
EOD;

$ch = curl_init();
curl_setopt_array( $ch, [
    CURLOPT_URL => $endpoint,
    CURLOPT_USERAGENT => 'DocExampleBot/1.0',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => http_build_query( [
        'query' => $query,
    ] ),
    CURLOPT_HTTPHEADER => [
        'accept: application/json',
    ],
    // Provide the long-lived wcqsOauth cookie on every request.
    CURLOPT_COOKIE => 'wcqsOauth=' . getenv( 'WCQS_AUTH_TOKEN' ),
    // The cookie jar captures the short-lived wcqsSession cookie issued in
    // the 307 redirect so it can be reused on subsequent requests.
    CURLOPT_COOKIEJAR => __DIR__ . '/cookie.txt',
    CURLOPT_COOKIEFILE => __DIR__ . '/cookie.txt',
    // Follow the 307 redirect that issues wcqsSession; cURL preserves the
    // POST method and body on 307 responses.
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
]);

$response = curl_exec( $ch );

print("Results\n");
var_dump( json_decode( $response, true ) );

Python

Using the Python requests library. The session should be reused for all subsequent requests.

from http.cookiejar import Cookie
import os
import requests
from urllib.parse import urlparse

def init_session(endpoint, token):
    domain = urlparse(endpoint).netloc
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'DocExampleBot/1.0',
    })
    # http.cookiejar.Cookie takes a long list of positional fields; the ones
    # that matter here are name ('wcqsOauth'), value (token), domain, and path.
    session.cookies.set_cookie(Cookie(
        0, 'wcqsOauth', token, None, False, domain, False, False, '/', True,
        False, None, True, None, None, {}))
    return session

ENDPOINT = 'https://commons-query.wikimedia.org/sparql'
session = init_session(ENDPOINT, os.environ['WCQS_AUTH_TOKEN'])
response = session.post(
    url=ENDPOINT,
    data={'query': """
        SELECT ?file WHERE {
          ?file wdt:P180 wd:Q42 .
        }
    """},
    headers={'Accept': 'application/json'}
)
response.raise_for_status()

print('Results')
print(response.json())
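
Because the session's cookie jar now holds a valid wcqsSession cookie, follow-up queries reuse it without another redirect round-trip. A sketch of a second query against the same session (the Q146 query is illustrative):

# Reuse the same session; the cookie jar already holds wcqsSession.
response = session.post(
    url=ENDPOINT,
    data={'query': 'SELECT ?file WHERE { ?file wdt:P180 wd:Q146 . } LIMIT 10'},
    headers={'Accept': 'application/json'},
)
response.raise_for_status()
print(response.json())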

For Python, the python-dotenv package may be useful for managing the environment variable if it is stored in a ".env" file in the same directory as your script.

from dotenv import load_dotenv  # add this line
load_dotenv()  # load the environment, including the OAuth token

Pywikibot

This requires that the Pywikibot user is logged in to Wikimedia Commons.

import pywikibot
from pywikibot.data import sparql

# Log in to Wikimedia Commons
site = pywikibot.Site('commons', 'commons')
site.login()

# Define the SPARQL query
query = """
SELECT ?item ?described_url WHERE {
  ?item wdt:P7482 wd:Q74228490 .       # P7482 (source of file) = Q74228490 (file available on the internet)
  ?item p:P7482 ?statement .
  ?statement pq:P973 ?described_url .
} LIMIT 5
"""

# Set up the SPARQL endpoint and entity URL
# Note: https://commons-query.wikimedia.org requires the user to be logged in
entity_url = 'https://commons.wikimedia.org/entity/'
endpoint = 'https://commons-query.wikimedia.org/sparql'

# Create a SparqlQuery object
query_object = sparql.SparqlQuery(endpoint=endpoint, entity_url=entity_url)

# Execute the SPARQL query and retrieve the data
data = query_object.select(query, full_data=True)

# Convert each SPARQL result's mediainfo URI to a pywikibot.Page() object
for row in data:
    page_id = int(row['item'].getID().replace('M', ''))
    pages = list(site.load_pages_from_pageids([page_id]))
    if len(pages) == 1:
        page = pages[0]
        print(page)