Handle PARSynthesizer model if sequence_index is missing #114

lajohn4747 · 2024-05-17T16:43:42Z

resolves sdv-dev/SDV#1972
CU-86b08wr44

When the sequence index is missing, par.py adds a constant column to allow for modeling, as seen here. The added context column does not exist in the data, though, which causes KeyErrors. This PR adds a check to prevent those failures.

amontanez24 · 2024-05-17T17:32:19Z

deepecho/sequences.py

@@ -181,7 +182,8 @@ def assemble_sequences(
        groupby_columns = entity_columns[0] if len(entity_columns) == 1 else entity_columns
        for _, sequence in data.groupby(groupby_columns):
            sequence.drop(entity_columns, axis=1, inplace=True)
-            if context_columns:
+            missing_columns = [col for col in context_columns if col not in sequence.columns]
+            if context_columns and not missing_columns:


Should we check the other columns that are in sequence instead of skipping over them? Or is the fake column the only one in context_columns?
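A minimal sketch of the two behaviors being compared here, for illustration only. It assumes a plain dict of column lists stands in for the pandas DataFrame that holds one entity's rows, and the helper names `context_all_or_nothing` and `context_present_only` are hypothetical, not part of deepecho. The first helper mirrors the guard in the diff (skip the context entirely if any declared column is absent); the second keeps whichever context columns are actually present.

```python
def context_all_or_nothing(sequence, context_columns):
    """Mirror the guard in the diff: skip context if any column is absent."""
    missing_columns = [col for col in context_columns if col not in sequence]
    if context_columns and not missing_columns:
        # Context is assumed constant per entity, so the first row is enough.
        return [sequence[col][0] for col in context_columns]
    return []


def context_present_only(sequence, context_columns):
    """Alternative raised in review: keep the context columns that do exist."""
    present = [col for col in context_columns if col in sequence]
    return [sequence[col][0] for col in present]


# One entity's rows after grouping; 'placeholder' was declared as context
# but never added to the data (hypothetical column names).
sequence = {'region': ['EU', 'EU'], 'value': [1.0, 2.0]}
context_columns = ['region', 'placeholder']

print(context_all_or_nothing(sequence, context_columns))  # []
print(context_present_only(sequence, context_columns))    # ['EU']
```

With the all-or-nothing guard, a single missing placeholder column also discards real context such as `region`, which is the concern behind the question above.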

frances-h

Why are missing context columns being passed to PAR in the first place? My understanding is that the UUID column gets added so that we have a dummy column for the context synthesizer. Presumably it should have either (1) been added to the data as a dummy constant column or (2) not been passed along to PAR at all.

frances-h · 2024-05-17T18:47:35Z

Looking into this more, I think the problem is actually that we're adding the UUID column to self._extra_context_columns. This attribute should only be used for context generated when transforming/preprocessing the data. We'll need to modify how we create the metadata for the context synthesizer so that the UUID column gets added there.

lajohn4747 · 2024-05-20T16:47:53Z

Why does the UUID column need to be added for modeling purposes? It seems the issue is resolved and all tests pass (except for a unit test that checks for the added column) when I remove the added UUID column, so I am not sure it is still needed.

frances-h · 2024-05-20T17:05:22Z

I think the problem is that we can't fit on an empty dataframe, so when there are no context columns we have to add a dummy column to correctly create the context model without erroring out.
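A rough sketch of that idea, under stated assumptions: the context table is represented as a plain dict of column lists rather than a DataFrame, and `with_placeholder_context` and the column name `placeholder_context` are hypothetical (the thread refers to the real one as the UUID column). It is not the SDV implementation; it only shows a constant column being added when there is nothing else to fit on.

```python
def with_placeholder_context(context, n_entities, name='placeholder_context'):
    """Return a copy of `context` that has at least one column to fit on."""
    if context:
        # Real context columns exist, so no placeholder is needed.
        return dict(context)
    # No context columns: add one constant value per entity.
    return {name: [0] * n_entities}


empty_context = {}
padded = with_placeholder_context(empty_context, n_entities=3)
print(padded)  # {'placeholder_context': [0, 0, 0]}
```

Because the placeholder exists only in the context table, it would need to be kept out of the list of context columns looked up in the sequence data, which is the mismatch discussed earlier in this thread.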

lajohn4747 · 2024-05-20T22:57:54Z

Moved fix to https://github.com/sdv-dev/SDV/pull/2019/files

Handle PARSynthesizer model if sequence_index is missing

2ab5ce7

lajohn4747 requested a review from a team as a code owner May 17, 2024 16:43

lajohn4747 requested review from frances-h and amontanez24 and removed request for a team May 17, 2024 16:43

amontanez24 reviewed May 17, 2024

frances-h reviewed May 17, 2024

lajohn4747 requested review from frances-h and amontanez24 May 20, 2024 16:48

lajohn4747 closed this May 20, 2024
