Tabulae [tǽbjəliː], “tables” in Latin, is a tool for combining multiple SPARQL queries into well-prepared tabular datasets for downstream applications. In particular, it is intended for use with the extensive tooling ecosystem of the machine-learning field. Tabulae creates base tables from SPARQL queries (called Layer 1), and then combines and processes these tables in DuckDB to form new tables (called Layer 2). The resulting table dumps can be hosted statically on any HTTP(S) server, along with a web UI that lets users browse a summary of the tables. For convenience, the dumps are provided in several formats: you can open them in spreadsheets, import them into Jupyter notebooks, or query them directly over the network with DuckDB.
First of all, we need to create a queries directory. This directory contains two subfolders, named layer1 and layer2. Place SPARQL queries in layer1 and SQL queries for DuckDB in layer2. Layer 1 queries must have the .rq extension; Layer 2 queries must have the .sql extension. In both cases, the file name without the suffix becomes the table name as-is. A typical directory structure looks like this:
queries
├── layer1
│   ├── chembl_compound_atc_classification.rq
│   ├── chembl_compound_uniprot_via_activity_assay.rq
│   └── uniprot_tissue.rq
└── layer2
    └── combined.sql
And this is exactly what we are going to build in this tutorial.
We are going to create a SPARQL query. Here, we use the endpoint provided by rdfportal.org to query the ChEMBL database for “atcClassification”, as follows:
# chembl_compound_atc_classification.rq
# Endpoint: https://rdfportal.org/ebi/sparql
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?chembl_compound_id ?label ?atc
WHERE {
  ?Molecule a cco:SmallMolecule ;
    skos:prefLabel ?label ;
    cco:atcClassification ?atc .
  BIND (STRAFTER(STR(?Molecule), "http://rdf.ebi.ac.uk/resource/chembl/molecule/") AS ?chembl_compound_id)
}
The # Endpoint: declaration at the top of the file is a magic comment that Tabulae recognizes specially: the SPARQL query is issued to this endpoint. The declaration is required in every SPARQL query.
The variable names of the bindings returned by the SELECT query become the column names of the generated table. In addition, each column's type is inferred automatically (provided that all SPARQL results have the same type and that the type is supported by Tabulae). Where inference is not possible, the column defaults to VARCHAR.
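Once the tables have been built (the build step is shown later in this tutorial), you can check what was inferred by opening the generated dist/layer1.duckdb file and running DESCRIBE. For this particular table, where every binding is a plain literal, the expected result is three VARCHAR columns:
-- Inspect the inferred column types of a Layer 1 table
-- (run inside `duckdb --readonly dist/layer1.duckdb`):
DESCRIBE chembl_compound_atc_classification;
-- chembl_compound_id  VARCHAR
-- label               VARCHAR
-- atc                 VARCHAR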
Let's create another query, as shown in the following listing:
# chembl_compound_uniprot_via_activity_assay.rq
# Endpoint: https://rdfportal.org/ebi/sparql
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX tax: <http://identifiers.org/taxonomy/>
SELECT DISTINCT
  ?chembl_compound_id
  ?uniprot_id
  ?conf_score
WHERE {
  ?Molecule a cco:SmallMolecule ;
    cco:hasActivity/cco:hasAssay [
      a cco:Assay ;
      cco:targetConfScore ?conf_score ;
      cco:hasTarget/skos:exactMatch [
        cco:taxonomy tax:9606 ;
        skos:exactMatch ?uniprot
      ]
    ] .
  ?uniprot a cco:UniprotRef .
  BIND (STRAFTER(STR(?Molecule), "http://rdf.ebi.ac.uk/resource/chembl/molecule/") AS ?chembl_compound_id)
  BIND (STRAFTER(STR(?uniprot), "http://purl.uniprot.org/uniprot/") AS ?uniprot_id)
}
# Paginate: 1000000
This query has the text # Paginate: 1000000 at the end. This is also a magic comment. When it is specified, Tabulae parses the query and fetches the results page by page, rewriting the query with OFFSET and LIMIT using the given page size (one million in this example). Note that magic comments only need to start at the beginning of a line; they can be placed on any line.
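Conceptually, the rewritten queries look roughly like this (a sketch of the OFFSET/LIMIT rewriting described above, not Tabulae's literal output):
# Sketch: first page
SELECT DISTINCT ?chembl_compound_id ?uniprot_id ?conf_score
WHERE { ... }
LIMIT 1000000 OFFSET 0
# Sketch: second page
# ... the same query with LIMIT 1000000 OFFSET 1000000, and so on,
# until a page returns fewer rows than the page size.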
Let's add one more query, as follows. This is the last one:
# uniprot_tissue.rq
# Endpoint: https://rdfportal.org/sib/sparql
PREFIX core: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprot: <http://purl.uniprot.org/uniprot/>
PREFIX db: <http://purl.uniprot.org/database/>
PREFIX chu: <http://purl.uniprot.org/proteomes/UP000005640#Unplaced>
PREFIX ch: <http://purl.uniprot.org/proteomes/UP000005640#Chromosome%20>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?uniprot_id ?tissue ?tissue_label
WHERE {
  VALUES ?human_proteome { ch:1 ch:2 ch:3 ch:4 ch:5 ch:6 ch:7 ch:8 ch:9 ch:10
                           ch:11 ch:12 ch:13 ch:14 ch:15 ch:16 ch:17 ch:18 ch:19
                           ch:20 ch:21 ch:22 ch:X ch:Y ch:MT chu: }
  ?uniprot core:proteome ?human_proteome ;
    core:isolatedFrom ?tissue .
  ?tissue skos:prefLabel ?tissue_label .
  BIND (STRAFTER(STR(?uniprot), "http://purl.uniprot.org/uniprot/") AS ?uniprot_id)
}
We have now created the queries for the three Layer 1 tables. Let's try running the build. In this example, we use Podman to run the container:
mkdir dist # Create a directory to store the results
podman run -it --rm -v ./queries:/work/queries -v ./dist:/work/dist ghcr.io/dbcls/tabulae build
It may take a while. The queries are executed and the tables are generated, and the results are placed under the dist directory. The dist directory should look something like this:
dist
├── assets
│   ├── index-Duuvv_a3.js
│   ├── index--oYnr9KL.css
│   ├── sparql-ClhyTxQt.js
│   ├── sql-D8pCRXw6.js
│   └── wasm-CG6Dc4jp.js
├── index.html
├── layer1
│   ├── chembl_compound_atc_classification.csv
│   ├── chembl_compound_atc_classification.parquet
│   ├── chembl_compound_atc_classification.tsv
│   ├── chembl_compound_uniprot_via_activity_assay.csv
│   ├── chembl_compound_uniprot_via_activity_assay.parquet
│   ├── chembl_compound_uniprot_via_activity_assay.tsv
│   ├── uniprot_tissue.csv
│   ├── uniprot_tissue.parquet
│   └── uniprot_tissue.tsv
├── layer1.duckdb
├── layer2
├── layer2.duckdb
└── manifest.json
It includes the files for the web front end and dumps of each table in several formats, as well as the DuckDB database files.
In order to check the results, we need to serve the contents of the dist directory as a static website. Here, we start an nginx container with Podman:
podman run -it --rm --name tabulae-nginx -v ./dist:/usr/share/nginx/html:ro -p 8080:80 docker.io/library/nginx
Open localhost:8080 in your web browser. The Web UI should then be displayed.
The Web UI shows only the Layer 2 tables by default. You will need to turn on the “Show Layer 1 tables” switch in order to see the three Layer 1 tables you created earlier.
If you have updated, added or deleted queries, just re-run the above command. Tabulae checks the timestamps of the query files and only re-runs the queries that have been updated (this matters because Layer 1 tables usually take a long time to generate). Layer 2 tables, by contrast, are always regenerated.
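Because the check is timestamp-based, you should be able to force a single Layer 1 query to be re-run by bumping its modification time (an assumption based on the behavior just described, not a documented option):
touch queries/layer1/uniprot_tissue.rq  # mark this query as updated
podman run -it --rm -v ./queries:/work/queries -v ./dist:/work/dist ghcr.io/dbcls/tabulae build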
Next, let's create a Layer 2 table by combining these three tables.
Layer 2 queries must have a .sql extension; as before, the file name without the suffix becomes the name of the created table.
Write the following query:
-- queries/layer2/combined.sql
SELECT *
FROM 'chembl_compound_atc_classification'
  NATURAL JOIN 'chembl_compound_uniprot_via_activity_assay'
  NATURAL JOIN 'uniprot_tissue';
This is where you write your DuckDB queries. In this context you can see the Layer 1 tables (but not other Layer 2 tables; this restriction keeps dependency management simple). The table produced by the statement in this file becomes a Layer 2 table, and the column names and types are carried over as they are.
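Layer 2 queries are not limited to joins. As a sketch, the following hypothetical file would publish an aggregate table; the file name, the threshold, and the assumption that conf_score was inferred as a numeric type are all illustrative:
-- queries/layer2/tissue_compound_counts.sql (hypothetical file name)
-- Count distinct compounds per tissue, keeping high-confidence assays only.
SELECT
  t.tissue_label,
  COUNT(DISTINCT a.chembl_compound_id) AS n_compounds
FROM 'chembl_compound_uniprot_via_activity_assay' AS a
JOIN 'uniprot_tissue' AS t USING (uniprot_id)
WHERE a.conf_score >= 8  -- illustrative threshold; CAST first if the column is VARCHAR
GROUP BY t.tissue_label
ORDER BY n_compounds DESC;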
Run tabulae build again:
podman run -it --rm -v ./queries:/work/queries -v ./dist:/work/dist ghcr.io/dbcls/tabulae build
Then, if you reload localhost:8080, you should see the Layer 2 table you just created.
Any query that DuckDB can execute works here, so you can also publish data obtained from sources other than SPARQL as Layer 2 tables.
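For example, since DuckDB can read files over HTTP, a Layer 2 query could enrich a Layer 1 table with an external CSV. The URL below is a hypothetical placeholder, and whether the build environment permits outbound HTTP access is an assumption:
-- queries/layer2/with_external.sql (hypothetical)
-- Join a Layer 1 table with a CSV fetched over HTTP;
-- assumes the CSV has a chembl_compound_id column.
SELECT *
FROM 'chembl_compound_atc_classification'
JOIN read_csv_auto('https://example.org/extra_annotations.csv')
  USING (chembl_compound_id);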
Using Tabulae, you can integrate multiple tables like this to create and publish tables for specific applications.
If you want to keep the trial-and-error loop for building a Layer 2 table short, you can also try out queries directly in the DuckDB CLI. The Layer 1 tables are included in dist/layer1.duckdb, so if the DuckDB CLI is installed you can query them in read-only mode as follows (D is the DuckDB prompt):
❯ duckdb --readonly dist/layer1.duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
D SELECT * FROM 'chembl_compound_atc_classification' LIMIT 5;
┌────────────────────┬────────────┬─────────┐
│ chembl_compound_id │   label    │   atc   │
│      varchar       │  varchar   │ varchar │
├────────────────────┼────────────┼─────────┤
│ CHEMBL1027         │ TIAGABINE  │ N03AG06 │
│ CHEMBL1089         │ PHENELZINE │ N06AF03 │
│ CHEMBL115          │ INDINAVIR  │ J05AE02 │
│ CHEMBL11672        │ ROQUINIMEX │ L03AX02 │
│ CHEMBL1171837      │ PONATINIB  │ L01EA05 │
└────────────────────┴────────────┴─────────┘
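For example, you can sanity-check the three-way join from this read-only session before committing it to a file (illustrative, not captured output):
-- Run inside `duckdb --readonly dist/layer1.duckdb`:
SELECT COUNT(*) AS n_rows
FROM 'chembl_compound_atc_classification'
NATURAL JOIN 'chembl_compound_uniprot_via_activity_assay'
NATURAL JOIN 'uniprot_tissue';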
If you have a good query, just place it as a file under queries/layer2/ and rebuild.
The Layer 2 tables are likewise in dist/layer2.duckdb, so you can use them in the same way.
If you make the dist directory publicly available over HTTP(S), it can be used by a variety of clients. You can also query it directly from DuckDB:
❯ duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL httpfs;
D ATTACH 'http://localhost:8080/layer1.duckdb';
D USE layer1;
D SHOW TABLES;
┌────────────────────────────────────────────┐
│                    name                    │
│                  varchar                   │
├────────────────────────────────────────────┤
│ chembl_compound_atc_classification         │
│ chembl_compound_uniprot_via_activity_assay │
│ uniprot_tissue                             │
└────────────────────────────────────────────┘
D SELECT * FROM 'chembl_compound_atc_classification' LIMIT 5;
┌────────────────────┬────────────┬─────────┐
│ chembl_compound_id │   label    │   atc   │
│      varchar       │  varchar   │ varchar │
├────────────────────┼────────────┼─────────┤
│ CHEMBL1027         │ TIAGABINE  │ N03AG06 │
│ CHEMBL1089         │ PHENELZINE │ N06AF03 │
│ CHEMBL115          │ INDINAVIR  │ J05AE02 │
│ CHEMBL11672        │ ROQUINIMEX │ L03AX02 │
│ CHEMBL1171837      │ PONATINIB  │ L01EA05 │
└────────────────────┴────────────┴─────────┘
Of course, you can query an individual table dump:
D SELECT * FROM 'http://localhost:8080/layer1/chembl_compound_atc_classification.parquet' LIMIT 5;
┌────────────────────┬────────────┬─────────┐
│ chembl_compound_id │   label    │   atc   │
│      varchar       │  varchar   │ varchar │
├────────────────────┼────────────┼─────────┤
│ CHEMBL1027         │ TIAGABINE  │ N03AG06 │
│ CHEMBL1089         │ PHENELZINE │ N06AF03 │
│ CHEMBL115          │ INDINAVIR  │ J05AE02 │
│ CHEMBL11672        │ ROQUINIMEX │ L03AX02 │
│ CHEMBL1171837      │ PONATINIB  │ L01EA05 │
└────────────────────┴────────────┴─────────┘
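The CSV and TSV dumps listed in the dist tree above can be read over HTTP in the same way, for example with DuckDB's read_csv_auto (output omitted here):
D SELECT * FROM read_csv_auto('http://localhost:8080/layer1/chembl_compound_atc_classification.csv') LIMIT 5;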
You can find these URLs from the Web UI.
If you want to retrieve these tables from a browser, you must set the CORS headers appropriately. Note that DuckDB also has a WebAssembly build, and the CORS headers are likewise required when accessing the tables from it.
Here is a simple configuration example based on https://enable-cors.org/server_nginx.html. Please adapt it to your needs:
user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    gzip  on;

    server {
        listen       80;
        listen  [::]:80;
        server_name  localhost;

        location / {
            root   /usr/share/nginx/html;
            index  index.html index.htm;

            if ($request_method = 'OPTIONS') {
                add_header 'Access-Control-Allow-Origin' '*';
                add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS';
                #
                # Custom headers and headers various browsers *should* be OK with but aren't
                #
                add_header 'Access-Control-Allow-Headers' 'DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range';
                #
                # Tell client that this pre-flight info is valid for 20 days
                #
                add_header 'Access-Control-Max-Age' 1728000;
                add_header 'Content-Type' 'text/plain; charset=utf-8';
                add_header 'Content-Length' 0;
                return 204;
            }
            if ($request_method = 'POST') {
                add_header 'Access-Control-Allow-Origin' '*' always;
                add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
                add_header 'Access-Control-Allow-Headers' 'DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range' always;
                add_header 'Access-Control-Expose-Headers' 'Content-Length,Content-Range' always;
            }
            if ($request_method = 'GET') {
                add_header 'Access-Control-Allow-Origin' '*' always;
                add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
                add_header 'Access-Control-Allow-Headers' 'DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range' always;
                add_header 'Access-Control-Expose-Headers' 'Content-Length,Content-Range' always;
            }
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   /usr/share/nginx/html;
        }
    }
}
When starting the container with Podman, mount this configuration file as follows:
podman run -it --rm --name tabulae-nginx -v ./dist:/usr/share/nginx/html:ro -v ./nginx.conf:/etc/nginx/nginx.conf:ro -p 8080:80 docker.io/library/nginx
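You can then verify that the CORS headers are actually being sent, for example with curl (this performs a GET and prints only the response headers; with the configuration above active, you should see an Access-Control-Allow-Origin: * line):
# Fetch a dump and filter the CORS-related response headers
curl -s -D - -o /dev/null http://localhost:8080/layer1/chembl_compound_atc_classification.parquet | grep -i '^access-control'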