Tool to generate sample graphs for Neo4j - mostly used to synthesize large graphs for testing.
- You will need Python 3 and a compatible version of pip installed.
- Then run
pip3 install -r requirements.txt
to obtain the dependencies.
- If you do not have a yaml module installed, you may also need to run
pip3 install pyyaml
A sample graph can be generated using generate_graph_data.py:
python3 generate_graph_data.py examples/small-test.conf
This will generate node and relationship files and output to either:
- csv
- gzip
- parquet (useful for direct graph build via Graph Data Science Library)
If csv or gzip is selected, header files and an importCommand.sh script are also generated; the script can be run to import the data via neo4j-admin.
Configuration is done via the yaml file - here are some example configs
- general directory setup/output
records_per_file: [no. of records per file] - defines how many files get created (total no. of records / records per file); something to play with for larger datasets - supports both 1000000 and 1,000,000 formats
df_row_limit: [max no. of rows in dataframe] - helps with memory management when creating files with a large no. of records; rows are created up to this limit and then appended to the file
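As a rough illustration of how these two settings interact, the sketch below parses both number formats and estimates how many files and in-memory batches a run would need. The helper names (parse_count, plan_output) are hypothetical and rounding up is an assumption; the tool itself may behave differently.

```python
import math


def parse_count(value):
    """Accept 1000000, "1000000" or "1,000,000" and return an int."""
    return int(str(value).replace(",", ""))


def plan_output(total_records, records_per_file, df_row_limit):
    """Estimate (no. of files, batches per file), assuming rounding up."""
    total = parse_count(total_records)
    per_file = parse_count(records_per_file)
    limit = parse_count(df_row_limit)
    files = math.ceil(total / per_file)
    # Each file is assumed to be built in chunks of at most `limit` rows
    # that are appended to the file one after another.
    batches_per_file = math.ceil(min(per_file, total) / limit)
    return files, batches_per_file


print(plan_output("1,000,000", 250000, "100,000"))  # (4, 3)
```

A smaller df_row_limit means more append steps per file but a lower peak memory footprint.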
This section gives the parameters that will be written to a shell script in the data generation folder
- options - any option that is available to admin-import can be placed in this section
- neo4j - this can be used to add the drop/create database commands to the shell script that will be run using cypher-shell
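To make the idea of the options section concrete, here is a minimal sketch of how a mapping of options could be turned into command-line flags for the generated shell script. The function name render_options and the example values are assumptions for illustration only; they are not taken from the tool, and the valid flags depend on your neo4j-admin version.

```python
def render_options(options):
    """Turn a mapping of option names to values into --name=value flags."""
    flags = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Booleans are written in lowercase, e.g. --flag=true
            flags.append(f"--{name}={str(value).lower()}")
        else:
            flags.append(f"--{name}={value}")
    return flags


# Example values only; check the neo4j-admin documentation for your version.
example = {"skip-bad-relationships": True, "delimiter": ","}
print(" ".join(render_options(example)))
```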
Repeating section for each node to be generated, common settings
label: [label for node]
no_to_generate: [no. of nodes to generate] - supports both 1000000 and 1,000,000 formats
start_id: optional - a start id to increment from; useful when generating incremental sets, e.g. start numbering at this value (if not specified, numbering starts at 1)
id_property_name: [name for id property] - if not present, the id column defaults to id; if specified, the id will be imported with this property name
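The effect of start_id on incremental sets can be sketched as follows. This is a minimal illustration that assumes the first generated id equals start_id; node_ids is a hypothetical helper, not part of the tool.

```python
def node_ids(no_to_generate, start_id=1):
    """Return sequential ids beginning at start_id (numbering defaults to 1)."""
    count = int(str(no_to_generate).replace(",", ""))
    return range(start_id, start_id + count)


# First run: ids 1..1000. Second, incremental run continues where it stopped.
first_batch = node_ids("1,000")
second_batch = node_ids("1,000", start_id=first_batch[-1] + 1)
print(first_batch[0], first_batch[-1], second_batch[0], second_batch[-1])
# 1 1000 1001 2000
```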
Repeating section for each additional label to specify, useful for e.g. generating additional occasional labels like 'flagged'
labels:
  name: [name for section, gets used in df column names]
  values: [list of labels to select from]
  probability: [list of probabilities for the values] - if not specified defaults to random
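Selecting an additional label from values with an optional probability list might look like the sketch below. It assumes that "random" means a uniform choice when no probabilities are given; pick_labels and the example label names are hypothetical.

```python
import random


def pick_labels(values, count, probability=None, seed=None):
    """Pick `count` labels from `values`, weighted by `probability` if given."""
    rng = random.Random(seed)
    # weights=None makes random.choices select uniformly.
    return rng.choices(values, weights=probability, k=count)


picked = pick_labels(["Flagged", "Reviewed"], count=10, probability=[0.1, 0.9], seed=42)
print(picked)
```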
Repeating section for each relationship to be generated, common settings
label: [label for relationship]
no_to_generate: [no. of relationships to generate] - supports both 1000000 and 1,000,000 formats
ratio_to_generate: [ratio of relationships to generate] - can be used instead of no_to_generate; the ratio is applied to the no_to_generate value of the source label
source_node_label: [label of source node]
target_node_label: [label of target node]
rel_multiplier: [takes a random value between lower and upper and generates that no. of relationships for a source/target]
  lower: 1
  upper: 28
start_id: optional - a start id to increment from; useful when generating incremental sets, e.g. start numbering at this value (if not specified, numbering starts at 1)
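One possible reading of ratio_to_generate and rel_multiplier is sketched below. Both helpers are hypothetical, and the exact semantics (truncating the ratio result, choosing targets uniformly at random, applying the multiplier per source node) are assumptions rather than documented behavior.

```python
import random


def relationship_count(source_no_to_generate, no_to_generate=None, ratio_to_generate=None):
    """Resolve how many relationships to create from a count or a ratio."""
    if no_to_generate is not None:
        return int(str(no_to_generate).replace(",", ""))
    source_total = int(str(source_no_to_generate).replace(",", ""))
    # Assumption: the ratio scales the source label's no_to_generate.
    return int(source_total * ratio_to_generate)


def fan_out(source_ids, target_ids, lower, upper, seed=None):
    """Yield (source, target) pairs with lower..upper relationships per source."""
    rng = random.Random(seed)
    for source in source_ids:
        for _ in range(rng.randint(lower, upper)):
            yield source, rng.choice(target_ids)


print(relationship_count("1,000", ratio_to_generate=0.5))  # 500
pairs = list(fan_out([1, 2, 3], [10, 11], lower=1, upper=28, seed=0))
print(len(pairs))
```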
Each node/relationship block can have a repeating properties section. Each property can have:
name: [name of property]
type: [type of property] - each type has its own set of configuration/behavior
output_type: [data type to be used in admin-import header] - used when the generated type differs from the import type, e.g. an email is generated by its own type, but its type for admin-import is string. Also applies to lists, where a value is randomly selected.
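The role of output_type can be illustrated with a small sketch that builds a name:type column header, which is the general shape of property headers for admin-import. The set of types treated as valid as-is and the fallback to string are assumptions made for this example; header_field is a hypothetical helper.

```python
# Hypothetical mapping: generator types whose name is assumed to be usable
# as an admin-import type unchanged. Everything else falls back to string.
SAME_AS_IMPORT_TYPE = {"int", "float", "boolean", "date", "datetime"}


def header_field(name, prop_type, output_type=None):
    """Build a 'name:type' column header for one property."""
    if output_type is None:
        output_type = prop_type if prop_type in SAME_AS_IMPORT_TYPE else "string"
    return f"{name}:{output_type}"


print(header_field("age", "int"))                            # age:int
print(header_field("contact", "email", output_type="string"))  # contact:string
print(header_field("created", "datetime", output_type="int"))  # created:int
```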
int
will generate random int between lower and upper values
lower: [lowest int]
upper: [highest int]
float
will generate random float between lower and upper values, with the no. of decimal places defined by precision
lower: [lowest float]
upper: [highest float]
precision: [no. of decimal places]
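The int and float types can be approximated with the standard library as shown below. This is a minimal sketch, assuming both bounds are inclusive and that precision is applied by rounding; the function names are hypothetical.

```python
import random


def random_int(lower, upper, rng=random):
    """Random int with lower <= value <= upper."""
    return rng.randint(lower, upper)


def random_float(lower, upper, precision, rng=random):
    """Random float between lower and upper, rounded to `precision` places."""
    return round(rng.uniform(lower, upper), precision)


print(random_int(1, 100))
print(random_float(0.0, 1.0, precision=3))
```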
boolean
will randomly assign true/false
date
will generate random date between lower and upper values - split down to ymd to avoid pesky US/European dates
lower:
  year: 2022
  month: 1
  day: 1
upper:
  year: 2023
  month: 1
  day: 1
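Building a random date from the split year/month/day bounds could be sketched like this. The helper random_date is hypothetical, and treating both bounds as inclusive is an assumption.

```python
import random
from datetime import date, timedelta


def random_date(lower, upper, rng=random):
    """Pick a random date between two {'year', 'month', 'day'} mappings."""
    start = date(lower["year"], lower["month"], lower["day"])
    end = date(upper["year"], upper["month"], upper["day"])
    offset = rng.randint(0, (end - start).days)
    return start + timedelta(days=offset)


picked = random_date(
    {"year": 2022, "month": 1, "day": 1},
    {"year": 2023, "month": 1, "day": 1},
)
# ISO 8601 output (YYYY-MM-DD) sidesteps the US/European ordering issue.
print(picked.isoformat())
```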
datetime/epoch
will generate random datetime between lower and upper values - split down to individual fields to avoid pesky US/European dates - can be output either as a datetime string (default) or, for epoch, as an int by setting output_type: int
lower:
  year: 2022
  month: 1
  day: 1
  hour: 1
  minute: 1
  second: 1
upper:
  year: 2023
  month: 1
  day: 1
  hour: 1
  minute: 1
  second: 1
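The two output forms of the datetime/epoch type can be sketched as follows. This assumes UTC, second-level resolution, and that missing time fields default to 0; none of these are confirmed by the tool, and the helper names are hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone


def to_datetime(bound):
    """Convert a mapping of date/time parts into an aware UTC datetime."""
    return datetime(
        bound["year"], bound["month"], bound["day"],
        bound.get("hour", 0), bound.get("minute", 0), bound.get("second", 0),
        tzinfo=timezone.utc,
    )


def random_datetime(lower, upper, rng=random):
    """Pick a random datetime between the two bounds (inclusive)."""
    start, end = to_datetime(lower), to_datetime(upper)
    span = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span))


lower = dict(year=2022, month=1, day=1, hour=1, minute=1, second=1)
upper = dict(year=2023, month=1, day=1, hour=1, minute=1, second=1)
value = random_datetime(lower, upper)
print(value.isoformat())       # datetime string form
print(int(value.timestamp()))  # epoch seconds, for output_type: int
```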
list
will randomly select a value from the list and pass it to admin-import - if the values are not strings, define the type for admin-import using the output_type field
values: [list of values to select from]
probability: [list of probabilities for the values] - if not specified defaults to random
array
will create an array of random ints of the specified size, within the lower/upper bounds
size: [size of array (no. of elements to generate)]
lower: [lowest int]
upper: [highest int]
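An array property could be generated and written to a CSV cell roughly as below. The helper names are hypothetical; the ";" separator matches the default array delimiter of neo4j-admin import, but whether the tool uses that default is an assumption.

```python
import random


def random_int_array(size, lower, upper, rng=random):
    """Return `size` random ints, each with lower <= value <= upper."""
    return [rng.randint(lower, upper) for _ in range(size)]


def to_import_cell(values, array_delimiter=";"):
    """Join array values into a single CSV cell for import."""
    return array_delimiter.join(str(v) for v in values)


arr = random_int_array(size=5, lower=1, upper=10)
print(to_import_cell(arr))
```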
name
uses the fake.name() Faker function to generate random data
email
uses the fake.company_email() Faker function to generate random data
phone
uses the fake.phone_number() Faker function to generate random data
ssn
uses the fake.ssn() Faker function to generate random data
ip
uses the fake.ipv4() Faker function to generate random data
- All valid admin-import types are supported, as per the documentation: https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import/#import-tool-header-format-properties