Duplicate records in Person_email_EmailAddress ? #388

arvindshmicrosoft · 2022-05-01T02:22:10Z

I was under the impression that [PersonId + EmailAddressId] in the Person_email_EmailAddress dataset is a natural primary key. However, I see that this is not true. I wanted to check whether these duplicates are by design or unexpected. As an example, here's a snippet from one such file (part-00001-87466142-687b-472e-aa0a-85ceaa2f5416-c000.csv), where I see duplicates (as indicated by the >> symbol inline):

>> 2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
>> 2012-06-24T02:26:04.702+00:00|30786325582847|[email protected]
>> 2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
>> 2011-04-08T11:33:36.428+00:00|15393162789375|[email protected]
>> 2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
>> 2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
2011-11-16T04:45:14.538+00:00|24189255820403|[email protected]
>> 2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
>> 2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
2012-10-04T04:07:20.524+00:00|35184372089235|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
>> 2010-03-03T07:34:45.087+00:00|495|[email protected]
2012-05-22T09:19:56.885+00:00|30786325585391|[email protected]
2012-05-22T09:19:56.885+00:00|30786325585391|[email protected]
2011-05-13T01:56:12.531+00:00|17592186054855|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
>> 2010-06-29T13:38:07.653+00:00|4398046512623|[email protected]
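For anyone who wants to check their own output files, here is a minimal Python sketch that counts how often each (PersonId, EmailAddressId) pair occurs in a pipe-separated file of this shape. The function name `find_duplicate_keys` and the sample rows are made up for illustration; the column positions are assumed from the snippet above (creationDate first, then the two key columns).

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(lines, has_header=False):
    """Return {(PersonId, EmailAddressId): count} for keys seen more than once."""
    reader = csv.reader(lines, delimiter="|")
    if has_header:
        next(reader, None)  # skip a header row if the file has one
    # Column 0 is creationDate; columns 1 and 2 form the candidate key.
    counts = Counter((row[1], row[2]) for row in reader if row)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows (IDs and addresses are placeholders, not Datagen output).
sample = io.StringIO(
    "2012-06-24T02:26:04.702+00:00|101|alice@example.org\n"
    "2012-06-24T02:26:04.702+00:00|101|alice@example.net\n"
    "2012-06-24T02:26:04.702+00:00|101|alice@example.org\n"
    "2010-03-03T07:34:45.087+00:00|102|bob@example.org\n"
)
duplicates = find_duplicate_keys(sample)
print(duplicates)
```

If the pair really were a primary key, the returned dictionary would be empty. To run it on a real file, pass an open file handle instead of the `StringIO` object.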

szarnyasg · 2022-05-01T11:42:44Z

Thanks for reporting this! Indeed, there seem to be duplicate email addresses in both the composite and the singular formats:

export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
    --cores $(nproc) \
    --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
    ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --output-dir out-sf${SF} \
    --generate-factors \
    --explode-edges \
    --explode-attrs

$ head out-sf0.003/graphs/csv/bi/singular-projected-fk/initial_snapshot/dynamic/Person_email_EmailAddress/part-*.csv 
creationDate|PersonId|EmailAddressId
2010-01-03T15:10:31.499+00:00|14|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-01-31T13:13:03.929+00:00|16|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]
2010-02-12T22:05:24.513+00:00|32|[email protected]

export SF=0.003
rm -rf out-sf${SF}/
tools/run.py \
    --cores $(nproc) \
    --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
    ./target/ldbc_snb_datagen_${PLATFORM_VERSION}-${DATAGEN_VERSION}.jar \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --mode bi \
    --output-dir out-sf${SF} \
    --generate-factors \
    --explode-edges

$ head -n3 out-sf0.003/graphs/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-*.csv
creationDate|id|firstName|lastName|gender|birthday|locationIP|browserUsed|language|email
2010-01-03T15:10:31.499+00:00|14|Hossein|Forouhar|male|1984-03-11|77.245.239.11|Firefox|fa;ku;en|[email protected]
2010-01-31T13:13:03.929+00:00|16|Jan|Zakrzewski|female|1986-07-05|31.41.169.140|Chrome|pl;en|[email protected];[email protected];[email protected];[email protected];[email protected]

According to the specification, the email addresses are supposed to be stored in a set, so the values should be unique. Therefore, this is a bug in the generator and we'll take a look. That said, the BI workload currently does not make use of the email addresses, so this is not a critical problem for benchmark runs.
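To illustrate the set property on the composite format, here is a small Python sketch that reports which addresses are repeated within a single semicolon-separated email field. The helper name `repeated_emails` and the example addresses are hypothetical; the `;` separator is assumed from the Person output shown above.

```python
def repeated_emails(email_field, separator=";"):
    """Return the addresses that occur more than once, in first-seen order."""
    seen = set()
    repeated = []
    for address in email_field.split(separator):
        if address in seen and address not in repeated:
            repeated.append(address)
        seen.add(address)
    return repeated


# Hypothetical field values (placeholders, not Datagen output).
violating = repeated_emails("jan@example.org;jan@example.net;jan@example.org")
conforming = repeated_emails("jan@example.org;jan@example.net")
print(violating, conforming)
```

A field that behaves like a set yields an empty list; any non-empty result points to a row affected by this bug.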

arvindshmicrosoft · 2022-05-01T15:26:42Z

Thanks Gabor. Keeping the future state in mind, I am using only the Spark-based Datagen for all cases, including the interactive queries. For now, I will just work around this by de-duplicating inside the DB.
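As an alternative to de-duplicating inside the DB, the rows could also be filtered before loading. The following Python sketch keeps the first row per (PersonId, EmailAddressId) key and drops the rest. The function name `dedupe_rows` and the sample data are hypothetical, and it assumes the set of keys fits in memory.

```python
import csv
import io


def dedupe_rows(src, dst, key_columns=(1, 2)):
    """Copy pipe-separated rows from src to dst, keeping the first row per key.

    Returns the number of rows that were dropped as duplicates.
    """
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, delimiter="|", lineterminator="\n")
    seen = set()
    dropped = 0
    for row in reader:
        if not row:
            continue  # ignore blank lines
        key = tuple(row[i] for i in key_columns)
        if key in seen:
            dropped += 1
            continue
        seen.add(key)
        writer.writerow(row)
    return dropped


# Hypothetical input (addresses are placeholders, not Datagen output).
source = io.StringIO(
    "creationDate|PersonId|EmailAddressId\n"
    "2010-01-31T13:13:03.929+00:00|16|jan@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|jan@example.org\n"
    "2010-01-31T13:13:03.929+00:00|16|jan@example.net\n"
)
target = io.StringIO()
dropped = dedupe_rows(source, target)
print(dropped)
print(target.getvalue())
```

The header row passes through unchanged because its key is unique. For large scale factors, a `SELECT DISTINCT` or a unique constraint in the DB is likely the more practical route, since this sketch holds every key in a Python set.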

szarnyasg · 2022-05-03T22:00:35Z

I found the culprit: in the old (Hadoop-based) Datagen, the email attribute was a Set:

https://github.com/ldbc/ldbc_snb_datagen_hadoop/blob/5ba94c3c0873c397c32b6e3480951f4ea7973982/src/main/java/ldbc/snb/datagen/entities/dynamic/person/Person.java#L72

This was changed to a List without consideration for potential duplicates. This should be easy to fix.

szarnyasg added this to the Milestone 4 milestone May 3, 2022

szarnyasg self-assigned this May 3, 2022

szarnyasg closed this as completed in 59d0032 May 3, 2022
