Shortest path algorithm between pages on Wikipedia. Let's say that you want to get from the page Exploding animal to Cucumber. This thing will find the set of shortest paths between them and return each path as a list of page titles.
- Six Degrees of Wikipedia by Jacob Wenger - The bi-directional BFS algorithm is based on this project. There are a lot of differences between the implementations: SDOW is in Python and uses SQLite; this is in Rust and uses a bespoke memory-mapped graph database that I made. SDOW's import process uses shell scripts rather than a parser combinator library, and as a result it is currently broken. Nonetheless, it's pretty brilliant.
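Since the heart of the search is a bi-directional BFS, here is a generic sketch of the idea over a plain in-memory adjacency map. This is an illustration only, not the code in this repository, which walks the memory-mapped graph database described below and collects the full set of shortest paths rather than just the distance:

```rust
use std::collections::{HashMap, HashSet};

/// Bi-directional BFS: expand one BFS level at a time from both ends and
/// stop as soon as the two visited sets touch. Generic sketch only.
fn shortest_distance(
    outgoing: &HashMap<u32, Vec<u32>>, // page ID -> pages it links to
    incoming: &HashMap<u32, Vec<u32>>, // page ID -> pages that link to it
    source: u32,
    target: u32,
) -> Option<u32> {
    if source == target {
        return Some(0);
    }

    // Expand one full BFS level; returns true if a newly discovered node
    // was already seen by the search coming from the other direction.
    fn expand(
        frontier: &mut Vec<u32>,
        seen: &mut HashSet<u32>,
        other_seen: &HashSet<u32>,
        edges: &HashMap<u32, Vec<u32>>,
    ) -> bool {
        let mut next = Vec::new();
        let mut met = false;
        for &v in frontier.iter() {
            for &n in edges.get(&v).into_iter().flatten() {
                if seen.insert(n) {
                    met |= other_seen.contains(&n);
                    next.push(n);
                }
            }
        }
        *frontier = next;
        met
    }

    let mut fwd_seen = HashSet::from([source]);
    let mut bwd_seen = HashSet::from([target]);
    let mut fwd_frontier = vec![source];
    let mut bwd_frontier = vec![target];
    let mut depth = 0;

    while !fwd_frontier.is_empty() && !bwd_frontier.is_empty() {
        depth += 1;
        // Forward step from the source, following outgoing links.
        if expand(&mut fwd_frontier, &mut fwd_seen, &bwd_seen, outgoing) {
            return Some(depth);
        }
        depth += 1;
        // Backward step from the target, following incoming links.
        if expand(&mut bwd_frontier, &mut bwd_seen, &fwd_seen, incoming) {
            return Some(depth);
        }
    }
    None // the two searches never met: no path exists
}
```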
There are two files that comprise the adjacency list database of edges:
Vertex Adjacency List
An array of records of edges
in this format:
null ⇒ 0 as u32
vertex ⇒ u32
vertexes ⇒ vertex+
edges ⇒ outgoing_vertexes, null, incoming_vertexes, null
To determine which vertex each record belongs to, see below (`vertex_al_ix`).
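For concreteness, here is a minimal sketch of how one such record could be decoded. Little-endian byte order is an assumption of this sketch, not something the format description above specifies:

```rust
/// A decoded `vertex_al` record: the outgoing and incoming neighbour IDs.
/// In the file, each list is terminated by the "null" vertex (a zero u32).
struct EdgeRecord {
    outgoing: Vec<u32>,
    incoming: Vec<u32>,
}

/// Read u32s starting at `*pos` until the 0 terminator, advancing `*pos`.
fn read_list(buf: &[u8], pos: &mut usize) -> Vec<u32> {
    let mut ids = Vec::new();
    loop {
        let bytes: [u8; 4] = buf[*pos..*pos + 4].try_into().unwrap();
        *pos += 4;
        let id = u32::from_le_bytes(bytes);
        if id == 0 {
            break; // the null vertex ends this list
        }
        ids.push(id);
    }
    ids
}

/// Decode the record that starts at `buf[offset..]`.
fn decode_record(buf: &[u8], offset: usize) -> EdgeRecord {
    let mut pos = offset;
    let outgoing = read_list(buf, &mut pos);
    let incoming = read_list(buf, &mut pos);
    EdgeRecord { outgoing, incoming }
}
```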
Vertex Adjacency List Index
An array of `u64` values indexed by vertex ID. Each `u64` is the offset into `vertex_al` at which the edge data is located. So, to load the record for the page with ID `1337`:
- Read the eight bytes at offset `1337 * 8` in `vertex_al_ix` ⇒ `offset_bytes`
- Decode `offset_bytes` into a `u64` ⇒ `offset`
- If `offset == 0`, the page either does not exist, or has no edges attached to it.
- Otherwise, read the record at `offset` in `vertex_al` (see above).
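Putting the two files together, the lookup described above might look roughly like this. This sketch reuses `EdgeRecord` and `decode_record` from the previous snippet and assumes both files are available as byte slices (e.g. via memory-mapping) with little-endian encoding:

```rust
/// Look up the edge record for `page_id`. Returns None if the page does not
/// exist or has no edges. `ix` holds the bytes of vertex_al_ix and `al`
/// holds the bytes of vertex_al.
fn edges_for_page(ix: &[u8], al: &[u8], page_id: u32) -> Option<EdgeRecord> {
    // Each index entry is eight bytes, so entry N lives at byte N * 8.
    let entry_start = page_id as usize * 8;
    let offset_bytes: [u8; 8] = ix.get(entry_start..entry_start + 8)?.try_into().ok()?;
    let offset = u64::from_le_bytes(offset_bytes);
    if offset == 0 {
        return None; // missing page, or a page with no edges
    }
    Some(decode_record(al, offset as usize))
}
```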
If you want to spot-check some data in MySQL, it's faster to import the dumps without indexes. First define the tables thusly:
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT 0,
`pl_from_namespace` int(11) NOT NULL DEFAULT 0,
`pl_target_id` bigint(20) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
CREATE TABLE `redirect` (
`rd_from` int(8) unsigned NOT NULL DEFAULT 0,
`rd_namespace` int(11) NOT NULL DEFAULT 0,
`rd_title` varbinary(255) NOT NULL DEFAULT '',
`rd_interwiki` varbinary(32) DEFAULT NULL,
`rd_fragment` varbinary(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
CREATE TABLE `page` (
`page_id` int(8) unsigned NOT NULL,
`page_namespace` int(11) NOT NULL DEFAULT 0,
`page_title` varbinary(255) NOT NULL DEFAULT '',
`page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_random` double unsigned NOT NULL DEFAULT 0,
`page_touched` binary(14) NOT NULL,
`page_links_updated` varbinary(14) DEFAULT NULL,
`page_latest` int(8) unsigned NOT NULL DEFAULT 0,
`page_len` int(8) unsigned NOT NULL DEFAULT 0,
`page_content_model` varbinary(32) DEFAULT NULL,
`page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
CREATE TABLE `linktarget` (
`lt_id` bigint(20) unsigned NOT NULL,
`lt_namespace` int(11) NOT NULL,
`lt_title` varbinary(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
Then read in the dumps, skipping the DDL:
pv enwiki-*-pagelinks.sql | tail -n +34 | mysql wiki
pv enwiki-*-page.sql | tail -n +45 | mysql wiki
pv enwiki-*-redirect.sql | tail -n +38 | mysql wiki
pv enwiki-*-linktarget.sql | tail -n +34 | mysql wiki
Then build indexes:
create index page_title_ix on page (page_title);
create index page_title_id on page(page_id);
create index page_title_namespace on page(page_namespace);
To export for later:
select * from page into outfile '/tmp/wiki-page-dump';
select * from pagelinks into outfile '/tmp/wiki-pagelinks-dump';
select * from redirect into outfile '/tmp/wiki-redirect-dump';
select * from linktarget into outfile '/tmp/wiki-linktarget-dump';
Then compress and such:
sudo mv /tmp/wiki-*-dump ~/data
sudo chown $(id -u):$(id -g) ~/data/wiki-*-dump
zstd -T0 ~/data/wiki-*-dump
Then to import (assuming wiki-page-dump is on the server at some location):
LOAD DATA INFILE 'wiki-page-dump' INTO TABLE page;
LOAD DATA INFILE 'wiki-pagelinks-dump' INTO TABLE pagelinks;
LOAD DATA INFILE 'wiki-redirect-dump' INTO TABLE redirect;
LOAD DATA INFILE 'wiki-linktarget-dump' INTO TABLE linktarget;
There are two SQLite databases:
- master.db - this is the persistent database with an unlimited lifetime. It is used to log performance and usage.
- graph.db - this is generated by the build routine, and it also contains a cache. The cache lives here (rather than in master.db) because it becomes useless when the rest of the graph is updated, so it is naturally discarded when graph.db is regenerated.
To find long paths:
SELECT
sv.title AS 'source_title',
tv.title AS 'target_title',
coalesce(json_array_length(json_extract(path_data, '$[0]')),0) AS degrees
FROM path p
INNER JOIN vertexes sv ON p.source_page_id=sv.id
INNER JOIN vertexes tv ON p.target_page_id=tv.id
WHERE degrees >= 5
ORDER BY degrees DESC
LIMIT 20;
The build process is a bit involved, so here's a high-level overview of how it's done in code.
- `tool` downloads the latest Wikipedia dump from here. Specifically, the files are:
  - `enwiki-[date]-pagelinks.sql.gz`
  - `enwiki-[date]-page_props.sql.gz` (not currently used)
  - `enwiki-[date]-page.sql.gz`
  - `enwiki-[date]-redirect.sql.gz`
  - `enwiki-[date]-linktarget.sql.gz`
- Read the `page` table, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a page ID and a page title. In batches, write these records to SQLite; they go in the `vertexes` table.
- Read `pagelinks`, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a source page ID and a "link target" ID. Send each record down a channel.
- Because `pagelinks` specifies a link target rather than a target page ID, we want to resolve the targets to page IDs. We do this by iterating over the channel from the above step, combining the targets to look up into chunks in order to reduce the number of queries, then looking up the page IDs in SQLite (see the sketch after this list). We then write the source and destination IDs to a very simple flat-file adjacency list implemented in `edge_db.rs`.
- This edge list is then duplicated: one copy for the incoming edges, and one for the outgoing edges. This is because the algorithm is bi-directional, and the `vertex_al` file (see "Graph Database" above and the last step below) contains both incoming and outgoing edges.
- Both of these edge data files are then sorted by their respective page IDs.
- We then read the sorted edge data files and write them to the `vertex_al` file. We also write an index file, `vertex_al_ix`, which is an array of offsets into the `vertex_al` file indexed by page ID. This is used to quickly look up the edges for a given page ID.
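As a rough illustration of the chunked lookup in the fourth step, here's a sketch using `rusqlite`. The `target_to_page` table and its column names are invented for this example (the real lookup goes through the build's SQLite tables); the point is just that batching targets into one `IN (...)` query is much cheaper than one query per link:

```rust
use rusqlite::{params_from_iter, Connection, Result};

/// Resolve one chunk of link targets to page IDs with a single query,
/// instead of issuing one query per edge. The `target_to_page` table and
/// its columns are hypothetical names used only for this sketch.
fn resolve_chunk(conn: &Connection, chunk: &[i64]) -> Result<Vec<(i64, i64)>> {
    if chunk.is_empty() {
        return Ok(Vec::new());
    }
    // Build "?,?,?,..." with one placeholder per target in the chunk.
    let placeholders = vec!["?"; chunk.len()].join(",");
    let sql = format!(
        "SELECT target_id, page_id FROM target_to_page WHERE target_id IN ({placeholders})"
    );
    let mut stmt = conn.prepare(&sql)?;
    let rows = stmt.query_map(params_from_iter(chunk.iter()), |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
    })?;
    rows.collect()
}
```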
Prior to July 1, 2024, `linktarget` was unnecessary, and the way to generate the list of links (from Page ID to Page ID) was:
- Read the `page` table, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a page ID and a page title. In batches, write these records to SQLite; they go in the `vertexes` table.
- Read `pagelinks`, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contained a source page ID and a target page title. Send each record down a channel.
- Because `pagelinks` only specified a target page title (but a source page ID), we wanted to resolve the target pages to IDs. We did this by iterating over the channel from the above step, combining the titles to look up into chunks in order to reduce the number of queries, then looking up the IDs in SQLite. We then wrote the source and destination IDs to a very simple flat-file adjacency list implemented in `edge_db.rs`.
- This edge list was then duplicated: one copy for the incoming edges, and one for the outgoing edges. This is because the algorithm is bi-directional, and the `vertex_al` file (see "Graph Database" above and the last step below) contains both incoming and outgoing edges.
- Both of these edge data files were then sorted by their respective page IDs.
- We then read the sorted edge data files and wrote them to the `vertex_al` file. We also wrote an index file, `vertex_al_ix`, which is an array of offsets into the `vertex_al` file indexed by page ID. This is used to quickly look up the edges for a given page ID.
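Both the current and the pre-2024 pipelines end with the same final step: streaming the sorted edge files into `vertex_al` while emitting `vertex_al_ix`. Here's a rough sketch of that step. The little-endian encoding and the leading padding word are assumptions of this sketch; the real code in the repository is authoritative:

```rust
use std::io::{self, Write};

/// Write `vertex_al` and `vertex_al_ix` from edge lists that are already
/// grouped and sorted by vertex ID. `edges_of` stands in for reading the two
/// sorted edge files: it yields (outgoing, incoming) neighbour IDs for a
/// given vertex.
fn write_graph(
    vertex_al: &mut impl Write,
    vertex_al_ix: &mut impl Write,
    max_vertex_id: u32,
    edges_of: impl Fn(u32) -> (Vec<u32>, Vec<u32>),
) -> io::Result<()> {
    // Offset 0 means "missing page / no edges", so pad vertex_al with one
    // u32 so that no real record can ever start at offset 0.
    vertex_al.write_all(&0u32.to_le_bytes())?;
    let mut offset: u64 = 4;

    for id in 0..=max_vertex_id {
        let (outgoing, incoming) = edges_of(id);
        if outgoing.is_empty() && incoming.is_empty() {
            // No record for this ID: its index entry is 0.
            vertex_al_ix.write_all(&0u64.to_le_bytes())?;
            continue;
        }
        // Index entry: where this vertex's record starts in vertex_al.
        vertex_al_ix.write_all(&offset.to_le_bytes())?;
        // Record: outgoing vertexes, null, incoming vertexes, null.
        for list in [&outgoing, &incoming] {
            for &v in list {
                vertex_al.write_all(&v.to_le_bytes())?;
            }
            vertex_al.write_all(&0u32.to_le_bytes())?;
            offset += 4 * (list.len() as u64 + 1);
        }
    }
    Ok(())
}
```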