-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add detect_from_csvs and detect_from_dataframes methods to MultiTableMetadata #1533
Conversation
Codecov ReportPatch coverage:
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. Additional details and impacted files@@ Coverage Diff @@
## master #1533 +/- ##
==========================================
+ Coverage 96.40% 96.42% +0.01%
==========================================
Files 49 49
Lines 3982 3999 +17
==========================================
+ Hits 3839 3856 +17
Misses 143 143
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
sdv/metadata/multi_table.py
Outdated
the values are the dataframes. | ||
""" | ||
if not data or not all(isinstance(df, pd.DataFrame) for df in data.values()): | ||
raise ValueError('The provided dictionary must contain only pandas DataFrame objects') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End with a period.
sdv/metadata/multi_table.py
Outdated
|
||
Args: | ||
data (dict): | ||
Dictionary of ``pandas.DataFrame`` objects where the keys are the table names and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify to `Dictionary of table names to dataframes.
sdv/metadata/multi_table.py
Outdated
folder_name (str): | ||
Name of the folder to detect the metadata from. | ||
|
||
Raises: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never write raises
in docstrings. I guess for consistency it would make sense not to add it?
sdv/metadata/multi_table.py
Outdated
Raises: | ||
ValueError: If no CSV files are detected in the folder. | ||
""" | ||
csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could validate the folder name exists if you want, up to you.
sdv/metadata/multi_table.py
Outdated
raise ValueError(f"No CSV files detected in the folder '{folder_name}'") | ||
|
||
for filename in csv_files: | ||
table_name = filename[:-4] # Removing the .csv extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do this directly in line 388.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine here because I need the filename
and the table_name
after
sdv/metadata/multi_table.py
Outdated
csv_files = [filename for filename in os.listdir(folder_name) if filename.endswith('.csv')] | ||
|
||
if not csv_files: | ||
raise ValueError(f"No CSV files detected in the folder '{folder_name}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End with period, pretty much all our error messages do, so it makes sense to be consistent.
|
||
# Run and Assert | ||
expected_message = re.escape("No CSV files detected in the folder '{}'".format(tmp_path)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New line not necessary.
} | ||
} | ||
}, | ||
'relationships': [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will add the logic to detect the relationships in another issue, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
sdv/metadata/multi_table.py
Outdated
if os.path.exists(folder_name) and os.path.isdir(folder_name): | ||
csv_files = [ | ||
filename for filename in os.listdir(folder_name) if filename.endswith('.csv') | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using os.path
we can use Pathlib
which is more powerful and will improve the code down the line.
For example:
Path().rglob(f'{folder_name}/*.csv')
will return a list with all the paths of csv files-
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really good advice, thanks! Done in 4c37e19.
sdv/metadata/multi_table.py
Outdated
raise ValueError(f"No CSV files detected in the folder '{folder_name}'.") | ||
|
||
for filename in csv_files: | ||
table_name = filename[:-4] # Removing the .csv extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if changed to pathlib
you can directly access the name of the file.
sdv/metadata/multi_table.py
Outdated
|
||
for filename in csv_files: | ||
table_name = filename[:-4] # Removing the .csv extension | ||
csv_file = os.path.join(folder_name, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't need this since pathlib
will have the full path already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻 LGTM!
|
||
metadata = MultiTableMetadata() | ||
|
||
with tempfile.TemporaryDirectory() as temp_dir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use the tmp_path fixture instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done in 8cf7f02
sdv/metadata/multi_table.py
Outdated
@@ -354,11 +370,32 @@ def detect_table_from_csv(self, table_name, filepath): | |||
""" | |||
self._validate_table_not_detected(table_name) | |||
table = SingleTableMetadata() | |||
data = table._load_data_from_csv(filepath) | |||
table._detect_columns(data) | |||
table.detect_from_csv(filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there was a reason this was the way it was. Something to do with avoiding undesired printing maybe? Idk if we want to change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need this change because _load_data_from_csv
no longer exists in SingleTableMetadata
. I added an integration test in 922aad0 for the detect_table_from_csv
which fails in the master branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It got moved here
Line 168 in 387e9dd
def load_data_from_csv(filepath, pandas_kwargs=None): |
I think we don't want to call
detect_from_csv
because if the error gets raised it saysSDV/sdv/metadata/single_table.py
Lines 287 to 289 in 387e9dd
raise InvalidMetadataError( | |
'Metadata already exists. Create a new ``SingleTableMetadata`` ' | |
'object to detect from other data sources.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the explanation, done in ef79cf8
@@ -1443,7 +1443,7 @@ def test_detect_table_from_csv(self, single_table_mock, log_mock): | |||
# Setup | |||
metadata = MultiTableMetadata() | |||
fake_data = Mock() | |||
single_table_mock.return_value._load_data_from_csv.return_value = fake_data | |||
single_table_mock.return_value.detect_from_csv.return_value = fake_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments! 🚢 📦
ef79cf8
to
ab8352e
Compare
Resolve #1520