Shortest path algorithm between pages on Wikipedia. Let's say that you want to get from the page Exploding animal to Cucumber. This thing will find the set of shortest paths between them and return each path as a list of page titles.
- Six Degrees of Wikipedia by Jacob Wenger - The bi-directional BFS algorithm is based on this project. There are a lot of differences between the implementations: SDOW is in Python and uses SQLite; this is in Rust and uses a bespoke memory-mapped graph database that I made. SDOW's import process uses shell scripts rather than a parser combinator library, and as a result it is currently broken. Nonetheless, it's pretty brilliant.
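Since the heart of the search is a bi-directional BFS, here is a generic sketch of the idea over a plain in-memory adjacency map. This is an illustration only, not the code in this repository, which walks the memory-mapped graph database described below and collects the full set of shortest paths rather than just the distance:

```rust
use std::collections::{HashMap, HashSet};

/// Bi-directional BFS: expand one BFS level at a time from both ends and
/// stop as soon as the two visited sets touch. Generic sketch only.
fn shortest_distance(
    outgoing: &HashMap<u32, Vec<u32>>, // page ID -> pages it links to
    incoming: &HashMap<u32, Vec<u32>>, // page ID -> pages that link to it
    source: u32,
    target: u32,
) -> Option<u32> {
    if source == target {
        return Some(0);
    }

    // Expand one full BFS level; returns true if a newly discovered node
    // was already seen by the search coming from the other direction.
    fn expand(
        frontier: &mut Vec<u32>,
        seen: &mut HashSet<u32>,
        other_seen: &HashSet<u32>,
        edges: &HashMap<u32, Vec<u32>>,
    ) -> bool {
        let mut next = Vec::new();
        let mut met = false;
        for &v in frontier.iter() {
            for &n in edges.get(&v).into_iter().flatten() {
                if seen.insert(n) {
                    met |= other_seen.contains(&n);
                    next.push(n);
                }
            }
        }
        *frontier = next;
        met
    }

    let mut fwd_seen = HashSet::from([source]);
    let mut bwd_seen = HashSet::from([target]);
    let mut fwd_frontier = vec![source];
    let mut bwd_frontier = vec![target];
    let mut depth = 0;

    while !fwd_frontier.is_empty() && !bwd_frontier.is_empty() {
        depth += 1;
        // Forward step from the source, following outgoing links.
        if expand(&mut fwd_frontier, &mut fwd_seen, &bwd_seen, outgoing) {
            return Some(depth);
        }
        depth += 1;
        // Backward step from the target, following incoming links.
        if expand(&mut bwd_frontier, &mut bwd_seen, &fwd_seen, incoming) {
            return Some(depth);
        }
    }
    None // the two searches never met: no path exists
}
```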
There are two files that comprise the adjacency list database of edges:
Vertex Adjacency List
An array of records of edges
in this format:
null ⇒ 0 as u32
vertex ⇒ u32
vertexes ⇒ vertex+
edges ⇒ outgoing_vertexes, null, incoming_vertexes, null
To determine which vertex each record belongs to, see below (`vertex_al_ix`).
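For concreteness, here is a minimal sketch of how one such record could be decoded. Little-endian byte order is an assumption of this sketch, not something the format description above specifies:

```rust
/// A decoded `vertex_al` record: the outgoing and incoming neighbour IDs.
/// In the file, each list is terminated by the "null" vertex (a zero u32).
struct EdgeRecord {
    outgoing: Vec<u32>,
    incoming: Vec<u32>,
}

/// Read u32s starting at `*pos` until the 0 terminator, advancing `*pos`.
fn read_list(buf: &[u8], pos: &mut usize) -> Vec<u32> {
    let mut ids = Vec::new();
    loop {
        let bytes: [u8; 4] = buf[*pos..*pos + 4].try_into().unwrap();
        *pos += 4;
        let id = u32::from_le_bytes(bytes);
        if id == 0 {
            break; // the null vertex ends this list
        }
        ids.push(id);
    }
    ids
}

/// Decode the record that starts at `buf[offset..]`.
fn decode_record(buf: &[u8], offset: usize) -> EdgeRecord {
    let mut pos = offset;
    let outgoing = read_list(buf, &mut pos);
    let incoming = read_list(buf, &mut pos);
    EdgeRecord { outgoing, incoming }
}
```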
Vertex Adjacency List Index
An array of `u64` values indexed by vertex ID. Each `u64` is the offset into `vertex_al` at which the edge data is located. So, to load the record for the page with ID `1337`:
- Read the eight bytes at offset `1337 * 8` in `vertex_al_ix` ⇒ `offset_bytes`
- Decode `offset_bytes` into a `u64` ⇒ `offset`
- If `offset == 0`, the page either does not exist, or has no edges attached to it.
- Otherwise, read the record at `offset` in `vertex_al` (see above).
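Putting the two files together, the lookup described above might look roughly like this. This sketch reuses `EdgeRecord` and `decode_record` from the previous snippet and assumes both files are available as byte slices (e.g. via memory-mapping) with little-endian encoding:

```rust
/// Look up the edge record for `page_id`. Returns None if the page does not
/// exist or has no edges. `ix` holds the bytes of vertex_al_ix and `al`
/// holds the bytes of vertex_al.
fn edges_for_page(ix: &[u8], al: &[u8], page_id: u32) -> Option<EdgeRecord> {
    // Each index entry is eight bytes, so entry N lives at byte N * 8.
    let entry_start = page_id as usize * 8;
    let offset_bytes: [u8; 8] = ix.get(entry_start..entry_start + 8)?.try_into().ok()?;
    let offset = u64::from_le_bytes(offset_bytes);
    if offset == 0 {
        return None; // missing page, or a page with no edges
    }
    Some(decode_record(al, offset as usize))
}
```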
If you want to spot-check some data in MySQL, it's faster to import the dumps without indexes. First define the tables thusly:
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT 0,
`pl_from_namespace` int(11) NOT NULL DEFAULT 0,
`pl_target_id` bigint(20) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
CREATE TABLE `redirect` (
`rd_from` int(8) unsigned NOT NULL DEFAULT 0,
`rd_namespace` int(11) NOT NULL DEFAULT 0,
`rd_title` varbinary(255) NOT NULL DEFAULT '',
`rd_interwiki` varbinary(32) DEFAULT NULL,
`rd_fragment` varbinary(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;
CREATE TABLE `page` (
`page_id` int(8) unsigned NOT NULL,
`page_namespace` int(11) NOT NULL DEFAULT 0,
`page_title` varbinary(255) NOT NULL DEFAULT '',
`page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
`page_random` double unsigned NOT NULL DEFAULT 0,
`page_touched` binary(14) NOT NULL,
`page_links_updated` varbinary(14) DEFAULT NULL,
`page_latest` int(8) unsigned NOT NULL DEFAULT 0,
`page_len` int(8) unsigned NOT NULL DEFAULT 0,
`page_content_model` varbinary(32) DEFAULT NULL,
`page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
CREATE TABLE `linktarget` (
`lt_id` bigint(20) unsigned NOT NULL,
`lt_namespace` int(11) NOT NULL,
`lt_title` varbinary(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;
Then read in the dumps, skipping the DDL:
pv enwiki-*-pagelinks.sql | tail -n +34 | mysql wiki
pv enwiki-*-page.sql | tail -n +45 | mysql wiki
pv enwiki-*-redirect.sql | tail -n +38 | mysql wiki
pv enwiki-*-linktarget.sql | tail -n +34 | mysql wiki
Then build indexes:
create index page_title_ix on page (page_title);
create index page_title_id on page(page_id);
create index page_title_namespace on page(page_namespace);
To export for later:
select * from page into outfile '/tmp/wiki-page-dump';
select * from pagelinks into outfile '/tmp/wiki-pagelinks-dump';
select * from redirect into outfile '/tmp/wiki-redirect-dump';
select * from linktarget into outfile '/tmp/wiki-linktarget-dump';
Then compress and such:
sudo mv /tmp/wiki-*-dump ~/data
sudo chown $(id -u):$(id -g) ~/data/wiki-*-dump
zstd -T0 ~/data/wiki-*-dump
Then to import (assuming wiki-page-dump is on the server at some location):
LOAD DATA INFILE 'wiki-page-dump' INTO TABLE page;
LOAD DATA INFILE 'wiki-pagelinks-dump' INTO TABLE pagelinks;
LOAD DATA INFILE 'wiki-redirect-dump' INTO TABLE redirect;
LOAD DATA INFILE 'wiki-linktarget-dump' INTO TABLE linktarget;
There are two SQLite databases:
- master.db - this is the persistent database with an unlimited lifetime. It is used to log performance and usage.
- graph.db - this is generated by the build routine, and it also contains a cache. The cache lives here (rather than in master.db) because it becomes useless when the rest of the graph is updated, so it is naturally discarded when graph.db is regenerated.
To find long paths:
SELECT
sv.title AS 'source_title',
tv.title AS 'target_title',
coalesce(json_array_length(json_extract(path_data, '$[0]')),0) AS degrees
FROM path p
INNER JOIN vertexes sv ON p.source_page_id=sv.id
INNER JOIN vertexes tv ON p.target_page_id=tv.id
WHERE degrees >= 5
ORDER BY degrees DESC
LIMIT 20;
The build process is a bit involved, so here's a high-level overview of how it's done in code.
- `tool` downloads the latest Wikipedia dump from here. Specifically, the files are:
  - `enwiki-[date]-pagelinks.sql.gz`
  - `enwiki-[date]-page_props.sql.gz` (not currently used)
  - `enwiki-[date]-page.sql.gz`
  - `enwiki-[date]-redirect.sql.gz`
  - `enwiki-[date]-linktarget.sql.gz`
- Read the `page` table, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a page ID and a page title. In batches, write these records to SQLite; they go in the `vertexes` table.
- Read `pagelinks`, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a source page ID and a "link target" ID. Send each record down a channel.
- Because `pagelinks` specifies a link target rather than a target page ID, we want to resolve the targets to page IDs. We do this by iterating over the channel from the above step, combining the targets to look up into chunks in order to reduce the number of queries, then looking up the page IDs in SQLite (see the sketch after this list). We then write the source and destination IDs to a very simple flat-file adjacency list implemented in `edge_db.rs`.
- This edge list is then duplicated: one copy for the incoming edges, and one for the outgoing edges. This is because the algorithm is bi-directional, and the `vertex_al` file (see "Graph Database" above and the last step below) contains both incoming and outgoing edges.
- Both of these edge data files are then sorted by their respective page IDs.
- We then read the sorted edge data files and write them to the `vertex_al` file. We also write an index file, `vertex_al_ix`, which is an array of offsets into the `vertex_al` file indexed by page ID. This is used to quickly look up the edges for a given page ID.
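As a rough illustration of the chunked lookup in the fourth step, here's a sketch using `rusqlite`. The `target_to_page` table and its column names are invented for this example (the real lookup goes through the build's SQLite tables); the point is just that batching targets into one `IN (...)` query is much cheaper than one query per link:

```rust
use rusqlite::{params_from_iter, Connection, Result};

/// Resolve one chunk of link targets to page IDs with a single query,
/// instead of issuing one query per edge. The `target_to_page` table and
/// its columns are hypothetical names used only for this sketch.
fn resolve_chunk(conn: &Connection, chunk: &[i64]) -> Result<Vec<(i64, i64)>> {
    if chunk.is_empty() {
        return Ok(Vec::new());
    }
    // Build "?,?,?,..." with one placeholder per target in the chunk.
    let placeholders = vec!["?"; chunk.len()].join(",");
    let sql = format!(
        "SELECT target_id, page_id FROM target_to_page WHERE target_id IN ({placeholders})"
    );
    let mut stmt = conn.prepare(&sql)?;
    let rows = stmt.query_map(params_from_iter(chunk.iter()), |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?))
    })?;
    rows.collect()
}
```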
Prior to July 1, 2024, `linktarget` was unnecessary, and the way to generate the list of links (from Page ID to Page ID) was:
- Read the `page` table, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a page ID and a page title. In batches, write these records to SQLite; they go in the `vertexes` table.
- Read `pagelinks`, decompressing and parsing it on the fly. Filter out irrelevant namespaces (anything except 0). Each record contained a source page ID and a target page title. Send each record down a channel.
- Because `pagelinks` only specified a target page title (but a source page ID), we wanted to resolve the target pages to IDs. We did this by iterating over the channel from the above step, combining the titles to look up into chunks in order to reduce the number of queries, then looking up the IDs in SQLite. We then wrote the source and destination IDs to a very simple flat-file adjacency list implemented in `edge_db.rs`.
- This edge list was then duplicated: one copy for the incoming edges, and one for the outgoing edges. This is because the algorithm is bi-directional, and the `vertex_al` file (see "Graph Database" above and the last step below) contains both incoming and outgoing edges.
- Both of these edge data files were then sorted by their respective page IDs.
- We then read the sorted edge data files and wrote them to the `vertex_al` file. We also wrote an index file, `vertex_al_ix`, which is an array of offsets into the `vertex_al` file indexed by page ID. This is used to quickly look up the edges for a given page ID.
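Both the current and the pre-2024 pipelines end with the same final step: streaming the sorted edge files into `vertex_al` while emitting `vertex_al_ix`. Here's a rough sketch of that step. The little-endian encoding and the leading padding word are assumptions of this sketch; the real code in the repository is authoritative:

```rust
use std::io::{self, Write};

/// Write `vertex_al` and `vertex_al_ix` from edge lists that are already
/// grouped and sorted by vertex ID. `edges_of` stands in for reading the two
/// sorted edge files: it yields (outgoing, incoming) neighbour IDs for a
/// given vertex.
fn write_graph(
    vertex_al: &mut impl Write,
    vertex_al_ix: &mut impl Write,
    max_vertex_id: u32,
    edges_of: impl Fn(u32) -> (Vec<u32>, Vec<u32>),
) -> io::Result<()> {
    // Offset 0 means "missing page / no edges", so pad vertex_al with one
    // u32 so that no real record can ever start at offset 0.
    vertex_al.write_all(&0u32.to_le_bytes())?;
    let mut offset: u64 = 4;

    for id in 0..=max_vertex_id {
        let (outgoing, incoming) = edges_of(id);
        if outgoing.is_empty() && incoming.is_empty() {
            // No record for this ID: its index entry is 0.
            vertex_al_ix.write_all(&0u64.to_le_bytes())?;
            continue;
        }
        // Index entry: where this vertex's record starts in vertex_al.
        vertex_al_ix.write_all(&offset.to_le_bytes())?;
        // Record: outgoing vertexes, null, incoming vertexes, null.
        for list in [&outgoing, &incoming] {
            for &v in list {
                vertex_al.write_all(&v.to_le_bytes())?;
            }
            vertex_al.write_all(&0u32.to_le_bytes())?;
            offset += 4 * (list.len() as u64 + 1);
        }
    }
    Ok(())
}
```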