Duplicate records in Person_email_EmailAddress ? #388
Comments
Thanks for reporting this! Indeed, there seem to be duplicate email addresses, both with --explode-attrs (first run below) and without it (second run below):

export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
--cores $(nproc) \
--memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
-- \
--format csv \
--scale-factor ${SF} \
--mode bi \
--output-dir out-sf${SF} \
--generate-factors \
--explode-edges \
--explode-attrs

$ head out-sf0.003/graphs/csv/bi/singular-projected-fk/initial_snapshot/dynamic/Person_email_EmailAddress/part-*.csv
creationDate|PersonId|EmailAddressId
2010-01-03T15:10:31.499+00:00|14|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]

export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
--cores $(nproc) \
--memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
-- \
--format csv \
--scale-factor ${SF} \
--mode bi \
--output-dir out-sf${SF} \
--generate-factors \
--explode-edges

$ head -n3 out-sf0.003/graphs/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-*.csv
creationDate|id|firstName|lastName|gender|birthday|locationIP|browserUsed|language|email
2010-01-03T15:10:31.499+00:00|14|Hossein|Forouhar|male|1984-03-11|77.245.239.11|Firefox|fa;ku;en|[email protected]
2010-01-31T13:13:03.929+00:00|16|Jan|Zakrzewski|female|1986-07-05|31.41.169.140|Chrome|pl;en|[email protected];[email protected];[email protected];[email protected];[email protected]

According to the specification, the email addresses are supposed to be stored in a set, so the values should be unique. Therefore, this is a bug in the generator and we'll take a look. That said, the BI workload currently does not make use of the email addresses, so this is not a critical problem for benchmark runs.
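To confirm the uniqueness violation on a generated file, one option is to count (PersonId, EmailAddressId) pairs in the pipe-delimited output. The following is a minimal sketch, not part of Datagen: the function name and the sample rows (including the example.org addresses) are made up for illustration, and only the column names are taken from the header shown above.

```python
import csv
import io
from collections import Counter


def find_duplicate_pairs(lines):
    """Return {(PersonId, EmailAddressId): count} for pairs seen more than once.

    `lines` is any iterable of text lines (an open file or a StringIO) in the
    pipe-delimited layout of Person_email_EmailAddress.
    """
    reader = csv.DictReader(lines, delimiter="|")
    counts = Counter((row["PersonId"], row["EmailAddressId"]) for row in reader)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical sample rows; the addresses are placeholders, not generator output.
sample = io.StringIO(
    "creationDate|PersonId|EmailAddressId\n"
    "2010-01-03T15:10:31.499+00:00|14|a@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|b@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|b@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|c@example.org\n"
)

duplicates = find_duplicate_pairs(sample)
print(duplicates)
```

To run it against real output, pass an open file handle for one of the part-*.csv files instead of the sample. An empty result would mean the pair behaves as a key in that file.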
Thanks Gabor. Keeping the future state in mind, I am only using the Spark-based Datagen for all cases, including interactive queries. For now, I will work around this by de-duplicating inside the DB.
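As an alternative to de-duplicating inside the database, the rows could be cleaned before loading. The sketch below is only an illustration under assumptions: both helper names are hypothetical, rows are assumed to be dictionaries keyed by column name (as csv.DictReader produces), and the composite email column is assumed to be a semicolon-separated list as in the Person output above.

```python
def dedupe_email_field(field, sep=";"):
    """Drop repeated addresses from a separator-joined list, keeping first-seen order."""
    if not field:
        return field
    # dict.fromkeys preserves insertion order and discards later repeats.
    return sep.join(dict.fromkeys(field.split(sep)))


def dedupe_rows(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows in the exploded (Person_email_EmailAddress) layout.
rows = [
    {"creationDate": "2010-01-31T13:13:03.929+00:00", "PersonId": "16", "EmailAddressId": "b@example.org"},
    {"creationDate": "2010-01-31T13:13:03.929+00:00", "PersonId": "16", "EmailAddressId": "b@example.org"},
    {"creationDate": "2010-01-31T13:13:03.929+00:00", "PersonId": "16", "EmailAddressId": "c@example.org"},
]

cleaned = dedupe_rows(rows, ["PersonId", "EmailAddressId"])
print(len(cleaned))
print(dedupe_email_field("b@example.org;c@example.org;b@example.org"))
```

Keeping the first occurrence preserves the original row order, which makes the cleaned file easy to diff against the generator output.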
I found the culprit: in the old (Hadoop-based) Datagen, the email attribute was a This was changed to a
I was under the impression that [PersonId + EmailAddressId] in the Person_email_EmailAddress dataset is a natural primary key. However, I see that this is not true, and I wanted to check whether these duplicates are by design or unexpected. As an example, here's a snippet from one such file (part-00001-87466142-687b-472e-aa0a-85ceaa2f5416-c000.csv), where I see duplicates (as indicated by the >> symbol inline):