Add US zips data and data-loader/ module
dgroomes committed Jan 9, 2023
1 parent 2ef084c commit dc836b1
Showing 17 changed files with 29,936 additions and 43 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1 +1,3 @@
.idea/
.gradle/
build/
85 changes: 50 additions & 35 deletions README.md
@@ -31,6 +31,33 @@ in the same Postgres database. We want to answer these questions:
Apache AGE is rapidly evolving and is strategically invested in relational databases, so I'm hopeful that this area
will get richer over time).

This project uses US geographies as its data domain. Specifically, we'll model ZIP codes, their containing city, and
their containing state. This creates a tree-like structure. That alone is not complex enough to warrant a graph data
model, so let's make it more interesting and also model "state adjacencies". For example, Minnesota neighbors
Wisconsin.
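
Concretely, the relational model might look something like the following sketch. The `cities` and `zip_codes` shapes
match the `INSERT` statements in the `data-loader` program, while `states` and `state_adjacencies` are guesses at what
the `postgres-init/` scripts define.

```sql
-- A sketch of the relational model (not necessarily the exact DDL in postgres-init/).
CREATE TABLE states
(
    code varchar(2) PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE cities
(
    id         serial PRIMARY KEY,
    city_name  text       NOT NULL,
    state_code varchar(2) NOT NULL REFERENCES states (code)
);

CREATE TABLE zip_codes
(
    zip_code   varchar(5) PRIMARY KEY,
    city_id    int NOT NULL REFERENCES cities (id),
    population int NOT NULL
);

-- The part that breaks the tree shape. For example, the row ('MN', 'WI') says that
-- Minnesota neighbors Wisconsin.
CREATE TABLE state_adjacencies
(
    state_code_a varchar(2) NOT NULL REFERENCES states (code),
    state_code_b varchar(2) NOT NULL REFERENCES states (code),
    PRIMARY KEY (state_code_a, state_code_b)
);
```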

This project uses Docker to run a Postgres database pre-installed with Apache AGE.

This project is a multi-module Gradle build containing Java programs that load the initial domain data, migrate the
data from a relational shape to a graph shape, and query the data.

Here is a breakdown of the components of this project:

* `docker-compose.yml` and `postgres-init/`
* This is the Docker-related stuff. The Docker Compose file defines the Postgres container and mounts the `postgres-init/`
directory into the container. The `postgres-init/` directory contains the SQL scripts that initialize the database
with the relational schema and the US state data.
* `data-loader/`
* `data-loader/` is a Gradle module. It defines a Java program that loads the ZIP code and city data from the
`zips.jsonl` file.
* `data-migrator/`
* NOT YET IMPLEMENTED
* `data-migrator/` is a Gradle module. It defines a Java program that migrates the relational data to a graph data
model.
* `data-queryer/`
* NOT YET IMPLEMENTED
* `data-queryer/` is a Gradle module. It defines a Java program that queries the graph data using Cypher (see the sketch below).
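
To give a flavor of the end goal, here is a hypothetical query of the kind `data-queryer` might eventually run: Cypher
embedded in a SQL query, which is how Apache AGE works. The graph name, node label, and edge type are invented for
illustration.

```sql
-- Hypothetical: find the states adjacent to Minnesota in a graph named 'us_geographies'.
SELECT *
FROM cypher('us_geographies', $$
    MATCH (:State {code: 'MN'})-[:ADJACENT_TO]-(neighbor:State)
    RETURN neighbor.code
$$) as (neighbor_code agtype);
```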


## Background

@@ -76,52 +103,36 @@ Follow these instructions to get up and running with a graph database, some sample
* ```shell
docker-compose up --detach
```
* As part of the startup procedure, the relational schema is created and the US state data gets loaded.
3. Load the ZIP code and city data.
* ```shell
./gradlew :data-loader:run
```
* It will look something like the following. (A spot-check query is shown after these instructions.)
* ```text
00:19:46 [main] INFO dataloader.Main - Loading ZIP code data from the local file into Postgres ...
00:20:22 [main] INFO dataloader.Main - Loaded 25,701 cities and 29,353 ZIP codes.
```
4. Migrate the relational data to a graph model.
* NOT YET IMPLEMENTED
* ```shell
./gradlew :data-migrator:run
```
5. Query the graph data.
* NOT YET IMPLEMENTED
* ```shell
./gradlew :data-queryer:run
```
* Read the Java source code to understand the Cypher queries.
6. When you're done, stop the database.
* ```shell
docker-compose down
```
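
To spot-check the loaded data, start a psql session and count some rows. The `docker exec` command comes from this
repository's setup; the table and column names come from the `data-loader` source.

```shell
docker exec --interactive --tty cypher-playground-postgres-1 psql --username postgres
```

```sql
-- Top 5 states by number of ZIP codes (a quick sanity check after step 3).
SELECT c.state_code, count(*) AS zip_count
FROM zip_codes z
JOIN cities c ON c.id = z.city_id
GROUP BY c.state_code
ORDER BY zip_count DESC
LIMIT 5;
```
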
## Notes
The [AGE manual](https://age.apache.org/age-manual) is great. Here are some quotes.
> Cypher uses a Postgres namespace for every individual graph. It is recommended that no DML or DDL commands are
> executed in the namespace that is reserved for the graph.
@@ -155,7 +166,7 @@ General clean-ups, TODOs and things I wish to implement for this project:
* [ ] Consider building from source; might not be worth it.
* [ ] Use the Apache AGE Viewer to visualize the graph.
* [x] DONE Bring in an interesting set of example data as *relational data*. Consider the ZIP code data of my other projects.
* [My other project `dgroomes/mongodb-playground`](https://github.com/dgroomes/mongodb-playground) has ZIP data. I'll
bring it into this project and import it, maybe as CSV?
* [ ] Write a relational-to-graph migration program to port the data from relational SQL tables to an AGE graph. This
@@ -165,6 +176,10 @@ General clean-ups, TODOs and things I wish to implement for this project:
that relate to objects that look like XYZ" and then "sum up the numeric field ABC and find the top 10 results". I want
to compare and contrast Cypher with SQL but be fair and give them challenges that they are suited to.
* [ ] Write SQL queries over the graph data. They should engage the cyclic nature of the data.
* [ ] This isn't related to Cypher or AGE at all, but I'd like to maybe write a stored procedure so I can load the data
in a batch. Right now the bottleneck is that I need a full round trip just to insert one city and get its surrogate
key back. I know Hibernate's trick is to generate its own keys instead of letting Postgres do it. I think it reserves
a block range of keys from a sequence or something? I suppose that's clever. (A sketch of the idea follows this list.)
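
As a reference for that idea, here is a minimal sketch of reserving a block of surrogate keys up front. It assumes the
`cities.id` column is backed by a sequence named `cities_id_seq`, which is the Postgres convention for a `serial`
column but is a guess here.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class CityIdReserver {

    // Reserve 'count' surrogate keys in one round trip. With the keys in hand, the city
    // and ZIP code inserts can go through addBatch()/executeBatch() instead of a round
    // trip per city.
    static List<Long> reserveCityIds(Connection connection, int count) throws SQLException {
        var ids = new ArrayList<Long>(count);
        try (var stmt = connection.prepareStatement(
                "SELECT nextval('cities_id_seq') FROM generate_series(1, ?)")) {
            stmt.setInt(1, count);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;
    }
}
```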


## Reference
7 changes: 7 additions & 0 deletions buildSrc/build.gradle.kts
@@ -0,0 +1,7 @@
plugins {
`kotlin-dsl`
}

repositories {
gradlePluginPortal()
}
16 changes: 16 additions & 0 deletions buildSrc/src/main/kotlin/common.gradle.kts
@@ -0,0 +1,16 @@
// This is a "pre-compiled script plugin", or a "convention plugin". See the Gradle docs: https://docs.gradle.org/current/samples/sample_convention_plugins.html#compiling_convention_plugins

plugins {
java
}

java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(17))
}
}

repositories {
mavenCentral()
}

15 changes: 15 additions & 0 deletions data-loader/build.gradle.kts
@@ -0,0 +1,15 @@
plugins {
id("common")
application
}

dependencies {
implementation(libs.logback)
implementation(platform(libs.jackson.bom))
implementation(libs.jackson.databind)
runtimeOnly(libs.postgres.jdbc)
}

application {
mainClass.set("dataloader.Main")
}
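
The `libs.*` accessors imply a Gradle version catalog, which isn't part of this change set. A minimal sketch of how it
might be declared in `settings.gradle.kts` follows; the coordinates are the usual ones for these libraries, but the
versions and the root project name are illustrative guesses.

```kotlin
rootProject.name = "cypher-playground"

include(":data-loader")

dependencyResolutionManagement {
    versionCatalogs {
        create("libs") {
            library("logback", "ch.qos.logback:logback-classic:1.4.5")
            library("jackson-bom", "com.fasterxml.jackson:jackson-bom:2.14.1")
            // The BOM supplies the version, so 'jackson-databind' is declared without one.
            library("jackson-databind", "com.fasterxml.jackson.core", "jackson-databind").withoutVersion()
            library("postgres-jdbc", "org.postgresql:postgresql:42.5.1")
        }
    }
}
```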
96 changes: 96 additions & 0 deletions data-loader/src/main/java/dataloader/Main.java
@@ -0,0 +1,96 @@
package dataloader;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.PropertyNamingStrategies;
import com.fasterxml.jackson.databind.json.JsonMapper;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

import static java.util.stream.Collectors.groupingBy;

/**
* Read the 'zips.jsonl' file line-by-line, parse each line as a JSON object, insert city rows as needed, and insert
* a ZIP code row for each line.
*/
public class Main {

public static void main(String[] args) throws SQLException {
var log = LoggerFactory.getLogger(Main.class);
log.info("Loading ZIP code data from the local file into Postgres ...");
record Zip(String zipCode, String cityName, String stateCode, int population) {}
record City(String name, String stateCode) {}

Map<City, List<Zip>> citiesToZips;
// Read the ZIP code data from the local JSON file. The cities are also inferred from the ZIP code data.
{
JsonMapper jsonMapper = JsonMapper.builder().propertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE).build();

File zipsFile = new File("zips.jsonl");
if (!zipsFile.exists()) {
String msg = "The 'zips.jsonl' file could not be found (%s). You need to run this program from the root of the 'data-loader' module.".formatted(zipsFile.getAbsolutePath());
throw new RuntimeException(msg);
}

try (Stream<String> zipsJsonLines = Files.lines(zipsFile.toPath())) {
citiesToZips = zipsJsonLines.map(zipJson -> {
JsonNode zipNode;
try {
zipNode = jsonMapper.readTree(zipJson);
} catch (JsonProcessingException e) {
throw new IllegalStateException("Failed to deserialize the JSON representing a ZIP code", e);
}

return new Zip(zipNode.get("_id").asText(), zipNode.get("city").asText(), zipNode.get("state").asText(), zipNode.get("pop").asInt());
}).collect(groupingBy(zip -> new City(zip.cityName, zip.stateCode)));
} catch (IOException e) {
throw new RuntimeException("There was an error while reading the ZIP data from the file.", e);
}
}

// Insert the ZIP and city data into the database.
try (var connection = DriverManager.getConnection("jdbc:postgresql:postgres", "postgres", null);
var insertCityStmt = connection.prepareStatement("INSERT INTO cities (city_name, state_code) VALUES (?, ?) returning id");
var insertZipStmt = connection.prepareStatement("INSERT INTO zip_codes (zip_code, city_id, population) VALUES (?, ?, ?)")) {

int loadedCities = 0;
int loadedZips = 0;

for (Map.Entry<City, List<Zip>> cityListEntry : citiesToZips.entrySet()) {
City city = cityListEntry.getKey();
int cityId;
{
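// One round trip per city: we need the generated surrogate key back before we can
// insert the city's ZIP codes. (This is the batching bottleneck noted in the README's TODO list.)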
log.trace("Inserting city {} ...", city);
insertCityStmt.setString(1, city.name);
insertCityStmt.setString(2, city.stateCode);
ResultSet resultSet = insertCityStmt.executeQuery();
if (!resultSet.next()) throw new IllegalStateException("Expected a result set but didn't find one.");
cityId = resultSet.getInt("id");
loadedCities++;
}

List<Zip> zips = cityListEntry.getValue();
for (Zip zip : zips) {
log.trace("Inserting ZIP {} ...", zip);
insertZipStmt.setString(1, zip.zipCode);
insertZipStmt.setInt(2, cityId);
insertZipStmt.setInt(3, zip.population);
insertZipStmt.execute();
loadedZips++;
}
}
log.info("Loaded %,d cities and %,d ZIP codes.".formatted(loadedCities, loadedZips));
} catch (SQLException e) {
throw new RuntimeException("Something went wrong while inserting data into the Postgres database", e);
}
}
}
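
For context, each line of `zips.jsonl` has the shape of the well-known MongoDB ZIP code sample data set. The program
reads only the `_id`, `city`, `state`, and `pop` fields; a line looks something like the following (the values are
illustrative).

```json
{"_id": "55401", "city": "MINNEAPOLIS", "loc": [-93.27, 44.98], "pop": 4545, "state": "MN"}
```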
15 changes: 15 additions & 0 deletions data-loader/src/main/resources/logback.xml
@@ -0,0 +1,15 @@
<!-- For help with Logback configuration, see http://logback.qos.ch/manual/configuration.html -->
<configuration>

<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>

<root level="info">
<appender-ref ref="STDOUT"/>
</root>

<logger name="dataloader" level="info"/>
</configuration>