Via this comment on Hacker News I started exploring the ClickHouse Playground. It's really cool, and among other things it allows CORS-enabled API hits that can query a decade of history from the GitHub events archive in less than a second.
ClickHouse is an open source column-oriented database, originally developed at Yandex but spun out into a separate, VC-funded company in 2021. It's designed for big data analytical queries - in a similar space to HBase, BigQuery and DuckDB.
It turns out it can do that trick with HTTP range queries where you can point it at the URL to a Parquet or .native.zst
file (ClickHouse native format, optionally compressed using Facebook Zstandard) and run queries without downloading the entire file first.
The ClickHouse Playground is a free hosted environment for trying out ClickHouse. You can access it here:
https://play.clickhouse.com/play?user=play
Try this query, taken from the ClickHouse Everything You Always Wanted To Know About GitHub (But Were Afraid To Ask) tutorial:
SELECT count() FROM github_events WHERE event_type = 'WatchEvent'
The github_events
table contains a copy of the GH Archive - a project that archives and makes available the public GitHub timeline, I think using data from the public events API. GH Archive then makes that data available as compressed newline-delimited JSON in this bucket. The archive stretches back to February 2011, and is constantly updated.
The ClickHouse demo table is continually updated with the latest archived data, by this script, which runs every 10 minutes.
You can do all sorts of fun stuff with it. Here's my recent activity acrosss all of GitHub:
SELECT
created_at,
actor_login,
repo_name,
event_type,
title
FROM
github_events
WHERE
actor_login = 'simonw'
AND repo_name != 'simonw/disaster-data'
ORDER BY
created_at DESC
LIMIT
100
This link executes that query - note how it includes a base64 encoded copy of the SQL query following the #
in the URL.
There are 77 tables total in the Playground instance - you can get a list of them like this:
SELECT database, name FROM system.tables
You can access the API via curl
like this:
curl 'https://play.clickhouse.com/' \
-X POST \
-H 'Authorization: Basic cGxheTo=' \
--data-raw $'
SELECT created_at, actor_login, repo_name
FROM github_events
WHERE event_type = \'WatchEvent\'
ORDER BY created_at DESC LIMIT 100'
This defaults to returning TSV without column headers, like this:
2023-01-01 03:59:59 Willmac16 nlohmann/json
2023-01-01 03:59:59 Samrose-Ahmed Stebalien/stash-rs
2023-01-01 03:59:57 CodePromoter aplus-framework/image
To get back data in JSON instead, add ?default_format=JSON
to the URL. Here I'm piping that through jq
to pretty print it:
curl 'https://play.clickhouse.com/?default_format=JSON' \
-X POST \
-H 'Authorization: Basic cGxheTo=' \
--data-raw $'
SELECT created_at, actor_login, repo_name
FROM github_events
WHERE event_type = \'WatchEvent\'
ORDER BY created_at DESC LIMIT 1' | jq
Output:
{
"meta": [
{
"name": "created_at",
"type": "DateTime"
},
{
"name": "actor_login",
"type": "LowCardinality(String)"
},
{
"name": "repo_name",
"type": "LowCardinality(String)"
}
],
"data": [
{
"created_at": "2023-01-01 03:59:59",
"actor_login": "Willmac16",
"repo_name": "nlohmann/json"
}
],
"rows": 1,
"rows_before_limit_at_least": 341429632,
"statistics": {
"elapsed": 0.925636889,
"rows_read": 341540363,
"bytes_read": 10567093585
}
}
More format options are documented here.
This pattern works for running queries from JavaScript. CORS is enabled - I pasted this into the Firefox DevTools console on https://www.example.com/ and it returned the results I expected:
r = await fetch("https://play.clickhouse.com/?user=play", {
method: "POST",
body: `SELECT
created_at,
event_type,
actor_login,
repo_name,
number,
title,
body
FROM
github_events
WHERE
actor_login = 'simonw'
ORDER BY
created_at desc
LIMIT
100
FORMAT JSON`,
});
d = await r.json();
Here I'm using FORMAT JSON
at the end of the query itself, and passing the requested user as ?user=play
rather than sending an Authorization
header.