Add US zips data and data-loader/ module
dgroomes committed Jan 9, 2023
1 parent 2ef084c commit dc836b1
Showing 17 changed files with 29,936 additions and 43 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -1 +1,3 @@
.idea/
.gradle/
build/
85 changes: 50 additions & 35 deletions README.md
@@ -31,6 +31,33 @@ in the same Postgres database. We want to answer these questions:
Apache AGE is rapidly evolving and is strategically invested in relational databases, so I'm hopeful that this area
will get richer over time).

This project uses US geographies as its data domain. Specifically, we'll model ZIP codes, their containing city, and
their containing state. This creates a tree-like structure. That alone is not complex enough to warrant a graph data
model, so let's make it more interesting and also model "state adjacencies". For example, Minnesota neighbors
Wisconsin.
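
Concretely, the relational model might look something like the following sketch. The `cities` and `zip_codes` shapes
match the `INSERT` statements in the `data-loader` program, while `states` and `state_adjacencies` are guesses at what
the `postgres-init/` scripts define.

```sql
-- A sketch of the relational model (not necessarily the exact DDL in postgres-init/).
CREATE TABLE states
(
    code varchar(2) PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE cities
(
    id         serial PRIMARY KEY,
    city_name  text       NOT NULL,
    state_code varchar(2) NOT NULL REFERENCES states (code)
);

CREATE TABLE zip_codes
(
    zip_code   varchar(5) PRIMARY KEY,
    city_id    int NOT NULL REFERENCES cities (id),
    population int NOT NULL
);

-- The part that breaks the tree shape. For example, the row ('MN', 'WI') says that
-- Minnesota neighbors Wisconsin.
CREATE TABLE state_adjacencies
(
    state_code_a varchar(2) NOT NULL REFERENCES states (code),
    state_code_b varchar(2) NOT NULL REFERENCES states (code),
    PRIMARY KEY (state_code_a, state_code_b)
);
```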

This project uses Docker to run a Postgres database pre-installed with Apache AGE.

This project is a multi-module Gradle build containing Java programs that load the initial domain data, migrate the
data from a relational shape to a graph shape, and query the data.

Here is a breakdown of the components of this project:

* `docker-compose.yml` and `postgres-init/`
* This is the Docker-related stuff. The Docker Compose file defines the Postgres container and mounts the `postgres-init/`
directory into the container. The `postgres-init/` directory contains the SQL scripts that initialize the database
with the relational schema and the US state data.
* `data-loader/`
* `data-loader/` is a Gradle module. It defines a Java program that loads the ZIP code and city data from the
`zips.jsonl` file.
* `data-migrator/`
* NOT YET IMPLEMENTED
* `data-migrator/` is a Gradle module. It defines a Java program that migrates the relational data to a graph data
model.
* `data-queryer/`
* NOT YET IMPLEMENTED
* `data-queryer/` is a Gradle module. It defines a Java program that queries the graph data using Cypher (see the sketch below).
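
To give a flavor of the end goal, here is a hypothetical query of the kind `data-queryer` might eventually run: Cypher
embedded in a SQL query, which is how Apache AGE works. The graph name, node label, and edge type are invented for
illustration.

```sql
-- Hypothetical: find the states adjacent to Minnesota in a graph named 'us_geographies'.
SELECT *
FROM cypher('us_geographies', $$
    MATCH (:State {code: 'MN'})-[:ADJACENT_TO]-(neighbor:State)
    RETURN neighbor.code
$$) as (neighbor_code agtype);
```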


## Background

@@ -76,52 +103,36 @@ Follow these instructions to get up and running with a graph database, some sample
* ```shell
docker-compose up --detach
```
* As part of the startup procedure, the relational schema is created and the US state data gets loaded.
3. Load the ZIP code and city data.
* ```shell
./gradlew :data-loader:run
```
* It will look something like the following. (A spot-check query is shown after these instructions.)
* ```text
00:19:46 [main] INFO dataloader.Main - Loading ZIP code data from the local file into Postgres ...
00:20:22 [main] INFO dataloader.Main - Loaded 25,701 cities and 29,353 ZIP codes.
```
4. Migrate the relational data to a graph model.
* NOT YET IMPLEMENTED
* ```shell
./gradlew :data-migrator:run
```
5. Query the graph data.
* NOT YET IMPLEMENTED
* ```shell
./gradlew :data-queryer:run
```
* Read the Java source code to understand the Cypher queries.
6. When you're done, stop the database.
* ```shell
docker-compose down
```
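
To spot-check the loaded data, start a psql session and count some rows. The `docker exec` command comes from this
repository's setup; the table and column names come from the `data-loader` source.

```shell
docker exec --interactive --tty cypher-playground-postgres-1 psql --username postgres
```

```sql
-- Top 5 states by number of ZIP codes (a quick sanity check after step 3).
SELECT c.state_code, count(*) AS zip_count
FROM zip_codes z
JOIN cities c ON c.id = z.city_id
GROUP BY c.state_code
ORDER BY zip_count DESC
LIMIT 5;
```
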
## Notes
The [AGE manual](https://age.apache.org/age-manual) is great. Here are some quotes.
> Cypher uses a Postgres namespace for every individual graph. It is recommended that no DML or DDL commands are
> executed in the namespace that is reserved for the graph.
@@ -155,7 +166,7 @@ General clean-ups, TODOs and things I wish to implement for this project:
* [ ] Consider building from source; might not be worth it.
* [ ] Use the Apache AGE Viewer to visualize the graph.
* [x] DONE Bring in an interesting set of example data as *relational data*. Consider the ZIP code data of my other projects.
* [My other project `dgroomes/mongodb-playground`](https://github.com/dgroomes/mongodb-playground) has ZIP data. I'll
bring it into this project and import it, maybe as CSV?
* [ ] Write a relational-to-graph migration program to port the data from relational SQL tables to an AGE graph. This
@@ -165,6 +176,10 @@ General clean-ups, TODOs and things I wish to implement for this project:
that relate to objects that look like XYZ" and then "sum up the numeric field ABC and find the top 10 results". I want
to compare and contrast Cypher with SQL but be fair and give them challenges that they are suited to.
* [ ] Write SQL queries over the graph data. They should engage the cyclic nature of the data.
* [ ] This isn't related to Cypher or AGE at all, but I'd like to maybe write a stored procedure so I can load the data
in a batch. Right now the bottleneck is that I need a full round trip just to insert one city and get its surrogate
key back. I know Hibernate's trick is to generate its own keys instead of letting Postgres do it. I think it reserves
a block range of keys from a sequence or something? I suppose that's clever. (A sketch of the idea follows this list.)
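
As a reference for that idea, here is a minimal sketch of reserving a block of surrogate keys up front. It assumes the
`cities.id` column is backed by a sequence named `cities_id_seq`, which is the Postgres convention for a `serial`
column but is a guess here.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class CityIdReserver {

    // Reserve 'count' surrogate keys in one round trip. With the keys in hand, the city
    // and ZIP code inserts can go through addBatch()/executeBatch() instead of a round
    // trip per city.
    static List<Long> reserveCityIds(Connection connection, int count) throws SQLException {
        var ids = new ArrayList<Long>(count);
        try (var stmt = connection.prepareStatement(
                "SELECT nextval('cities_id_seq') FROM generate_series(1, ?)")) {
            stmt.setInt(1, count);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;
    }
}
```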


## Reference
7 changes: 7 additions & 0 deletions buildSrc/build.gradle.kts
@@ -0,0 +1,7 @@
plugins {
`kotlin-dsl`
}

repositories {
gradlePluginPortal()
}
16 changes: 16 additions & 0 deletions buildSrc/src/main/kotlin/common.gradle.kts
@@ -0,0 +1,16 @@
// This is a "pre-compiled script plugin", or a "convention plugin". See the Gradle docs: https://docs.gradle.org/current/samples/sample_convention_plugins.html#compiling_convention_plugins

plugins {
java
}

java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(17))
}
}

repositories {
mavenCentral()
}

15 changes: 15 additions & 0 deletions data-loader/build.gradle.kts
@@ -0,0 +1,15 @@
plugins {
id("common")
application
}

dependencies {
implementation(libs.logback)
implementation(platform(libs.jackson.bom))
implementation(libs.jackson.databind)
runtimeOnly(libs.postgres.jdbc)
}

application {
mainClass.set("dataloader.Main")
}
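
The `libs.*` accessors imply a Gradle version catalog, which isn't part of this change set. A minimal sketch of how it
might be declared in `settings.gradle.kts` follows; the coordinates are the usual ones for these libraries, but the
versions and the root project name are illustrative guesses.

```kotlin
rootProject.name = "cypher-playground"

include(":data-loader")

dependencyResolutionManagement {
    versionCatalogs {
        create("libs") {
            library("logback", "ch.qos.logback:logback-classic:1.4.5")
            library("jackson-bom", "com.fasterxml.jackson:jackson-bom:2.14.1")
            // The BOM supplies the version, so 'jackson-databind' is declared without one.
            library("jackson-databind", "com.fasterxml.jackson.core", "jackson-databind").withoutVersion()
            library("postgres-jdbc", "org.postgresql:postgresql:42.5.1")
        }
    }
}
```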
96 changes: 96 additions & 0 deletions data-loader/src/main/java/dataloader/Main.java
@@ -0,0 +1,96 @@
package dataloader;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.PropertyNamingStrategies;
import com.fasterxml.jackson.databind.json.JsonMapper;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

import static java.util.stream.Collectors.groupingBy;

/**
* Read the 'zips.jsonl' file line-by-line, parse each line as a JSON object, insert city rows as needed, and insert
* a ZIP code row for each line.
*/
public class Main {

public static void main(String[] args) throws SQLException {
var log = LoggerFactory.getLogger(Main.class);
log.info("Loading ZIP code data from the local file into Postgres ...");
record Zip(String zipCode, String cityName, String stateCode, int population) {}
record City(String name, String stateCode) {}

Map<City, List<Zip>> citiesToZips;
// Read the ZIP code data from the local JSON file. The cities are also inferred from the ZIP code data.
{
JsonMapper jsonMapper = JsonMapper.builder().propertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE).build();

File zipsFile = new File("zips.jsonl");
if (!zipsFile.exists()) {
String msg = "The 'zips.jsonl' file could not be found (%s). You need to run this program from the root of the 'data-loader' module.".formatted(zipsFile.getAbsolutePath());
throw new RuntimeException(msg);
}

try (Stream<String> zipsJsonLines = Files.lines(zipsFile.toPath())) {
citiesToZips = zipsJsonLines.map(zipJson -> {
JsonNode zipNode;
try {
zipNode = jsonMapper.readTree(zipJson);
} catch (JsonProcessingException e) {
throw new IllegalStateException("Failed to deserialize the JSON representing a ZIP code", e);
}

return new Zip(zipNode.get("_id").asText(), zipNode.get("city").asText(), zipNode.get("state").asText(), zipNode.get("pop").asInt());
}).collect(groupingBy(zip -> new City(zip.cityName, zip.stateCode)));
} catch (IOException e) {
throw new RuntimeException("There was an error while reading the ZIP data from the file.", e);
}
}

// Insert the ZIP and city data into the database.
try (var connection = DriverManager.getConnection("jdbc:postgresql:postgres", "postgres", null);
var insertCityStmt = connection.prepareStatement("INSERT INTO cities (city_name, state_code) VALUES (?, ?) returning id");
var insertZipStmt = connection.prepareStatement("INSERT INTO zip_codes (zip_code, city_id, population) VALUES (?, ?, ?)")) {

int loadedCities = 0;
int loadedZips = 0;

for (Map.Entry<City, List<Zip>> cityListEntry : citiesToZips.entrySet()) {
City city = cityListEntry.getKey();
int cityId;
{
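// One round trip per city: we need the generated surrogate key back before we can
// insert the city's ZIP codes. (This is the batching bottleneck noted in the README's TODO list.)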
log.trace("Inserting city {} ...", city);
insertCityStmt.setString(1, city.name);
insertCityStmt.setString(2, city.stateCode);
ResultSet resultSet = insertCityStmt.executeQuery();
if (!resultSet.next()) throw new IllegalStateException("Expected a result set but didn't find one.");
cityId = resultSet.getInt("id");
loadedCities++;
}

List<Zip> zips = cityListEntry.getValue();
for (Zip zip : zips) {
log.trace("Inserting ZIP {} ...", zip);
insertZipStmt.setString(1, zip.zipCode);
insertZipStmt.setInt(2, cityId);
insertZipStmt.setInt(3, zip.population);
insertZipStmt.execute();
loadedZips++;
}
}
log.info("Loaded %,d cities and %,d ZIP codes.".formatted(loadedCities, loadedZips));
} catch (SQLException e) {
throw new RuntimeException("Something went wrong while inserting data into the Postgres database", e);
}
}
}
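
For context, each line of `zips.jsonl` has the shape of the well-known MongoDB ZIP code sample data set. The program
reads only the `_id`, `city`, `state`, and `pop` fields; a line looks something like the following (the values are
illustrative).

```json
{"_id": "55401", "city": "MINNEAPOLIS", "loc": [-93.27, 44.98], "pop": 4545, "state": "MN"}
```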
15 changes: 15 additions & 0 deletions data-loader/src/main/resources/logback.xml
@@ -0,0 +1,15 @@
<!-- For help with Logback configuration, see http://logback.qos.ch/manual/configuration.html -->
<configuration>

<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>

<root level="info">
<appender-ref ref="STDOUT"/>
</root>

<logger name="dataloader" level="info"/>
</configuration>