Duplicate records in Person_email_EmailAddress ? #388

Closed
arvindshmicrosoft opened this issue May 1, 2022 · 3 comments

@arvindshmicrosoft
Contributor

I was under the impression that [PersonId + EmailAddressId] in the Person_email_EmailAddress dataset is a natural primary key. However, I see that this is not the case. I wanted to check whether these duplicates are by design or unexpected. As an example, here's a snippet from one such file (part-00001-87466142-687b-472e-aa0a-85ceaa2f5416-c000.csv), where I see duplicates (as indicated by the >> symbol inline):

>> 2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
>> 2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
>> 2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
>> 2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
>> 2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
>> 2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
>> 2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
>> 2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
2012-05-22T09:19:56.885+00:00|30786325585391|[email protected]
2012-05-22T09:19:56.885+00:00|30786325585391|[email protected]
2011-05-13T01:56:12.531+00:00|17592186054855|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
@szarnyasg
Member

szarnyasg commented May 1, 2022

Thanks for reporting this! Indeed, there seem to be duplicate email addresses in both the composite and the singular formats:

export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
    --cores $(nproc) \
    --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
    ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --output-dir out-sf${SF} \
    --generate-factors \
    --explode-edges \
    --explode-attrs
$ head out-sf0.003/graphs/csv/bi/singular-projected-fk/initial_snapshot/dynamic/Person_email_EmailAddress/part-*.csv 
creationDate|PersonId|EmailAddressId
2010-01-03T15:10:31.499+00:00|14|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
    --cores $(nproc) \
    --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
    ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --output-dir out-sf${SF} \
    --generate-factors \
    --explode-edges
$ head -n3 out-sf0.003/graphs/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-*.csv
creationDate|id|firstName|lastName|gender|birthday|locationIP|browserUsed|language|email
2010-01-03T15:10:31.499+00:00|14|Hossein|Forouhar|male|1984-03-11|77.245.239.11|Firefox|fa;ku;en|[email protected]
2010-01-31T13:13:03.929+00:00|16|Jan|Zakrzewski|female|1986-07-05|31.41.169.140|Chrome|pl;en|[email protected];[email protected];[email protected];[email protected];[email protected]

According to the specification, the email addresses are supposed to be stored in a set, so the values should be unique. Therefore, this is a bug in the generator and we'll take a look. That said, the BI workload currently does not make use of the email addresses, so this is not a critical problem for benchmark runs.
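To check whether [PersonId + EmailAddressId] is unique in a given output file, a small script along these lines could be used. This is only a sketch: the function name `find_duplicate_keys` and the sample rows (including the example.org addresses) are made up for illustration, and only the pipe-delimited layout and column names are taken from the output shown above.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(lines):
    """Return {(PersonId, EmailAddressId): count} for keys seen more than once."""
    reader = csv.DictReader(lines, delimiter="|")
    counts = Counter((row["PersonId"], row["EmailAddressId"]) for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample in the same layout as the Person_email_EmailAddress files.
sample = io.StringIO(
    "creationDate|PersonId|EmailAddressId\n"
    "2010-01-31T13:13:03.929+00:00|16|a@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|a@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|b@example.org\n"
    "2010-02-12T22:05:24.513+00:00|32|c@example.org\n"
)

duplicates = find_duplicate_keys(sample)
print(duplicates)
```

For a real run, an open file handle for one of the part-*.csv files can be passed in place of the in-memory sample.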

@arvindshmicrosoft
Contributor Author

Thanks Gabor. Keeping the future state in mind, I am using only the Spark-based Datagen for all cases, including interactive queries. For now, I will work around this by de-duplicating inside the DB.
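One possible shape of such a workaround is sketched below. The thread does not say which database is in use, so SQLite (via Python's built-in `sqlite3` module) stands in here purely as an assumption, and the table names `person_email_raw` and `person_email` as well as the sample rows are hypothetical. The idea is to load the raw rows into a staging table and copy one row per [PersonId + EmailAddressId] into a table that enforces the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging table without constraints, holding rows exactly as generated.
conn.execute(
    "CREATE TABLE person_email_raw "
    "(creationDate TEXT, PersonId INTEGER, EmailAddressId TEXT)"
)

# Hypothetical rows, including duplicates like those reported above.
rows = [
    ("2010-01-31T13:13:03.929+00:00", 16, "a@example.org"),
    ("2010-01-31T13:13:03.929+00:00", 16, "a@example.org"),
    ("2010-01-31T13:13:03.929+00:00", 16, "b@example.org"),
    ("2010-02-12T22:05:24.513+00:00", 32, "c@example.org"),
]
conn.executemany("INSERT INTO person_email_raw VALUES (?, ?, ?)", rows)

# Target table that enforces the expected natural key.
conn.execute(
    "CREATE TABLE person_email ("
    "creationDate TEXT, PersonId INTEGER, EmailAddressId TEXT, "
    "PRIMARY KEY (PersonId, EmailAddressId))"
)

# Keep one row per key; MIN(creationDate) picks a single date per key.
conn.execute(
    "INSERT INTO person_email "
    "SELECT MIN(creationDate), PersonId, EmailAddressId "
    "FROM person_email_raw GROUP BY PersonId, EmailAddressId"
)

deduped = conn.execute(
    "SELECT PersonId, EmailAddressId FROM person_email "
    "ORDER BY PersonId, EmailAddressId"
).fetchall()
print(deduped)
```

Grouping by the key rather than using `SELECT DISTINCT` over all columns keeps the insert from failing if the same key ever appears with different creation dates.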

@szarnyasg szarnyasg added this to the Milestone 4 milestone May 3, 2022
@szarnyasg
Member

I found the culprit: in the old (Hadoop-based) Datagen, the email attribute was a Set:

https://github.com/ldbc/ldbc_snb_datagen_hadoop/blob/5ba94c3c0873c397c32b6e3480951f4ea7973982/src/main/java/ldbc/snb/datagen/entities/dynamic/person/Person.java#L72

This was changed to a List without consideration for potential duplicates. This should be easy to fix.
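The difference between the two collection types can be illustrated with a short Python sketch. This is not the Datagen code (which runs on the JVM) and not the actual fix; the function names and sample addresses are hypothetical. It only shows how a list keeps repeated picks while set-like handling collapses them.

```python
def emails_as_list(candidates):
    """List semantics: every generated address is kept, repeats included."""
    return list(candidates)


def emails_as_ordered_set(candidates):
    """Set semantics: each address appears once, in first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(candidates))


# Hypothetical addresses picked for one person, with repeats.
picked = ["a@example.org", "b@example.org", "a@example.org", "a@example.org"]

as_list = emails_as_list(picked)
as_set = emails_as_ordered_set(picked)
print(len(as_list), len(as_set))
```

On the JVM side, an insertion-ordered set such as `java.util.LinkedHashSet` would give comparable behaviour, though whether the generator should preserve order is a design choice the thread does not settle.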

@szarnyasg szarnyasg self-assigned this May 3, 2022