Releases: neo4j-field/bigquery-connector
0.6.1
0.6.0
This release includes the following changes:
- Improved error handling
- Surfaces the `neo4j_action` parameter so that a new database can also be created
- Arrow connection information is now auto-discovered using the Neo4j Bolt connection, meaning the `neo4j_host` parameter is now replaced by `neo4j_uri`, which expects an ordinary Neo4j URI
- Added support for model validation
- Added pattern support for the GDS->BigQuery direction; a single run can now include multiple patterns
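The auto-discovery change can be pictured as deriving the Arrow endpoint from the ordinary Bolt URI rather than from a separate host setting. A minimal sketch, assuming the Arrow Flight service runs on the same host as Bolt and on GDS's default Arrow port (8491); the helper name is hypothetical, not the connector's actual code:

```python
from urllib.parse import urlparse

def discover_arrow_endpoint(neo4j_uri: str, arrow_port: int = 8491):
    """Derive an Arrow Flight endpoint from an ordinary Neo4j URI.

    Illustrative only: assumes the Arrow service shares the Bolt host
    and listens on GDS's default Arrow port.
    """
    parsed = urlparse(neo4j_uri)
    if parsed.scheme not in ("neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    return parsed.hostname, arrow_port

host, port = discover_arrow_endpoint("neo4j+s://demo.databases.neo4j.io")
```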
0.5.1 - 🤏 Fix bug with write-backs for tiny datasets
If using a very small dataset (say a graph projection of < 10,000 nodes) and trying to write data back to BigQuery, we had the potential to trigger a `RuntimeError` because of a call to the finalizing method on a `BigQuerySink` when there's no defined BigQuery stream name.
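A hedged sketch of the guard described above; the class and method names echo the release note, but the body is illustrative rather than the connector's actual implementation:

```python
class BigQuerySink:
    """Illustrative sketch: skip finalization when no write stream was
    ever opened, since tiny datasets may finish before the first batch
    creates one."""

    def __init__(self):
        self.stream_name = None  # set once the first batch opens a stream

    def write_batch(self, rows):
        if rows and self.stream_name is None:
            # placeholder id; a real sink would get this from BigQuery
            self.stream_name = "streams/example"
        return len(rows)

    def finalize(self):
        # Before the fix, finalizing with stream_name unset could raise
        # a RuntimeError; guarded, it becomes a harmless no-op.
        if self.stream_name is None:
            return None
        return self.stream_name
```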
0.5.0 - 🤪 Bumping 3rd Party Dependencies
This version mostly handles updates to the following dependencies:
- `google-dataproc-templates` -- for some reason they pulled a package/release from PyPI again 🤬!!!
- `neo4j_arrow` -- updates to v0.5.0 to pull in fixes for database creation for self-managed GDS Enterprise users.
0.4.0 - ↩️ Write-backs to BigQuery
Initial support for streaming data back to BigQuery from Neo4j AuraDS (or self-managed GDS) using a new template: `Neo4jGDSToBigQueryTemplate`.
Supports streaming of both nodes (with or without properties) and relationships/edges (with or without properties). Any properties are stored in the resulting BigQuery table using a JSON field for flexibility.
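To picture how heterogeneous graph properties fit into a fixed table schema, here is a minimal sketch of mapping a streamed node to a row whose properties collapse into one JSON value; the column names are assumptions for illustration, not the template's actual schema:

```python
import json

# A node as it might arrive off the stream (illustrative shape).
node = {"node_id": 42, "labels": ["Person"], "properties": {"name": "Ada", "age": 36}}

# Arbitrary per-node properties are serialized into a single JSON
# column, so the table schema stays fixed regardless of the graph model.
row = {
    "node_id": node["node_id"],
    "labels": node["labels"],
    "properties": json.dumps(node["properties"]),
}
```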
0.3.1 - Field Filtering Fix (F^3)
Primarily a fix for supporting field filters, i.e. targeting fields in BigQuery tables based on the graph model.
While one could argue it's a new feature, the feature exists in the underlying neo4j_arrow module and wasn't properly wired in, so I consider this a bugfix 😉.
0.3.0 - 🤫 It's a Secret
Refinements since the initial prototype:
- supports using Google Secret Manager 🤫 to store the Neo4j password and any other settings
- switches to using native `ARRAY<STRING>` types for the `node_tables` and `edge_tables` inputs
- bug fix 🐛 where the provided graph name wasn't overriding the one in the graph model file
- updated docs/README 📄
The stored procedure signature is now reduced a bit to something cleaner.
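The Secret Manager integration comes down to reading settings such as the Neo4j password via a fully qualified secret version name. A sketch of that resource-name format (the project and secret ids below are examples only):

```python
# Resource-name format expected by Google Secret Manager when accessing
# a secret version; the ids here are illustrative examples.
def secret_version_name(project: str, secret_id: str, version: str = "latest") -> str:
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"

name = secret_version_name("my-gcp-project", "neo4j-password")
# With the google-cloud-secret-manager client this name would be passed
# to client.access_secret_version(name=name).
```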
0.2.0 - BQ to Neo4j Prototype
Initial functional prototype of using a BigQuery Stored Procedure for Apache Spark (say that 5 times fast) to lift and shift ~50 GiB dataset from BigQuery into Neo4j AuraDS using Neo4j's Apache Arrow Flight service.
🥳 It Lives
🤗 What's Known to Work
- Pre-engineered datasets compliant with Neo4j GDS should lift/shift fine. This means a pre-designed node id space. Currently the import job does not address that, but there's a back-of-a-napkin design on my desk for it.
- Pushing 10s of GiBs of data works great staying in-region. Making sure the BQ dataset, Apache Spark "connection", and the AuraDS instance are co-located in the same region (e.g. `europe-west1`) keeps the throughput high without becoming network-bound. (I've observed maybe an order of magnitude drop in throughput going from US to EU.)
⚠️ Currently Known Gotchas
- The Dataproc "template" works great in Dataproc Serverless (that's where it was originally built) but requires some odd hacks to get running under BigQuery. There are currently some...issues...just running the same container image because of how BigQuery orchestrates the Apache Spark environment.
- We need to document all the cloud setup steps for this, specifically all the IAM roles that need to be granted to the service account used by BigQuery's Dataproc runner and access permissions to the Docker image. (As well as how to publish/host it.)
🔥Hot Topics for the Next Release
- Reading back from AuraDS into BigQuery using the Storage Write API
- Integration with GCP Secret Manager to get rid of plain-text passwords (gross)
- Clean up the stored proc inputs (currently all BigQuery `STRING` args), e.g. take the table list as an `ARRAY<STRING>`, and maybe tuck seldom-used config options into a BigQuery `STRUCT` like GDS does to make the procedure signature shorter.
0.1.0 - Initial Prototype
- Can recreate the GraphConnect 2022 demo in under 10 minutes total runtime (including orchestration).
- Includes a user-agent string identifying Neo4j as the cloud partner driving the consumption of BigQuery data.
- Tested with both self-managed GDS on GCE and AuraDS.