Column order #123

mikeknep · 2023-06-14T16:51:39Z

Main user-facing change: We now match the source column order in transformed_{table}.csv and synth_{table}.csv output files.

Internally, we now use list more consistently when working with columns. A few places still use set, but only where we genuinely want set semantics, i.e. where we don't want duplicates: checking which columns are safe for ancestral seeding, and determining which columns to drop for independent pre-processing.
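To illustrate the list-versus-set distinction, here is a minimal sketch rather than the code from this PR. The helper name `ordered_subset` and the column names are hypothetical. It shows a set used only for membership checks while a list carries the source order.

```python
from typing import Iterable


def ordered_subset(source_columns: list[str], keep: Iterable[str]) -> list[str]:
    """Return the columns in `keep`, ordered as they appear in `source_columns`.

    Hypothetical helper for illustration only.
    """
    keep_set = set(keep)  # set semantics: membership only, duplicates collapse
    return [col for col in source_columns if col in keep_set]


# Hypothetical source column order for a table.
source_columns = ["id", "name", "email", "created_at"]

# A set is fine here: we only ask "should this column be dropped?"
columns_to_drop = set(["email", "name", "email"])

# The list comprehension keeps the source order for whatever remains.
remaining = [col for col in source_columns if col not in columns_to_drop]

# The order of the `keep` argument does not matter; the source order wins.
reordered = ordered_subset(source_columns, ["created_at", "id"])
```

Iterating a set directly gives no ordering guarantee, which is why the ordered result is always derived from the list.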

We only specify the columns parameter on df.to_csv(...) calls when writing the final output files from transforms or synthetics. For interim files (mainly on the synthetics side: pre-processed data sources for training, and seeds for ancestral generation) we do not care about the order. We also could not easily specify it even if we wanted to, because the columns in those files are not identical to the source: independent omits some, while ancestral omits some, adds others, and renames some.
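As a rough sketch of the final-output case, the `columns` argument of pandas `DataFrame.to_csv` both selects and orders the columns written. The data and variable names below are made up for illustration and are not taken from this repository.

```python
import io

import pandas as pd

# Hypothetical source column order for a table.
source_columns = ["id", "name", "city"]

# Hypothetical synthetic output whose columns came back in a different order.
synth_df = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "id": [1, 2], "name": ["a", "b"]}
)

# Write to an in-memory buffer here; a real call would pass a file path.
buffer = io.StringIO()
synth_df.to_csv(buffer, columns=source_columns, index=False)

# The header row follows source_columns, not the DataFrame's own order.
header = buffer.getvalue().splitlines()[0]
```

Without `columns=...`, the file would follow the DataFrame's own column order, which is acceptable for interim files where order does not matter.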

mikeknep added 5 commits June 14, 2023 11:16

Change type from set to list

e821aaa

Add columns to backup

eebf997

Trips test setup matches sql script order

74a964c

update some tests from set to list

0532cfe

Specify columns on final outputs to preserve order

fe799a2

mikeknep requested review from johntmyers and gracecvking June 14, 2023 16:51

johntmyers approved these changes Jun 14, 2023

mikeknep merged commit 6b450d6 into main Jun 14, 2023

mikeknep deleted the column-order branch June 14, 2023 18:35

mikeknep mentioned this pull request Jun 20, 2023

Backup cols #129

Merged
