"Dataset could not be previewed" #1847
Comments
The code looks right to me. @SajidAlamQB, since you added 4 dataset previews recently, did you encounter this error? Is there anything you think looks incorrect?
Maybe you could try setting a default value for
Setting a default value avoids that error, although there must be something fishy happening here: `kedro-viz/package/kedro_viz/models/flowchart.py`, lines 806 to 811 at `3eeb565`
On the other hand, now I get another error, but that's unrelated to this issue:
Hi @astrojuanlu, thank you for raising the issue. I am able to reproduce the bug on my side. The documentation for creating a custom dataset seems to create a dictionary, but the values inside it are of type `pandas.Series` or `polars.Series` (in your case). I created a `CustomDataset` mimicking the `excel_dataset` implementation and modified the `preview` method as below:

```python
def preview(self, nrows, ncolumns, filters) -> TablePreview:
    dataset_copy = self._copy()
    dataset_copy._load_args["nrows"] = nrows
    filtered_data = dataset_copy.load()
    for column, value in filters.items():
        filtered_data = filtered_data[filtered_data[column] == value]
    subset = filtered_data.iloc[:nrows, :ncolumns]
    df_dict = {}
    for column in subset.columns:
        df_dict[column] = subset[column]
    return df_dict
```

Catalog:

```yaml
shuttles:
  type: demo_project.extras.datasets.custom_dataset.CustomDataset
  filepath: ${_base_location}/01_raw/shuttles.xlsx
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 5
        ncolumns: 2
        filters: {
          engine_type: Quantum
        }
```

This gives me the below preview data:

```
preview={'id': 0    63561
1    36260
2    57015
Name: id, dtype: int64, 'shuttle_location': 0    Niue
1    Anguilla
2    Russian Federation
Name: shuttle_location, dtype: object},
```

So, when FastAPI tries to serialize this response, it rightly throws a `PydanticSerializationError`. I tried to correct the `preview` method as below (change: `df_dict[column] = subset[column].to_dict()`):

```python
def preview(self, nrows, ncolumns, filters) -> TablePreview:
    dataset_copy = self._copy()
    dataset_copy._load_args["nrows"] = nrows
    filtered_data = dataset_copy.load()
    for column, value in filters.items():
        filtered_data = filtered_data[filtered_data[column] == value]
    subset = filtered_data.iloc[:nrows, :ncolumns]
    df_dict = {}
    for column in subset.columns:
        df_dict[column] = subset[column].to_dict()
    return df_dict
```

Now the response is a valid dict like:

```
preview={'id': {0: 63561, 1: 36260, 2: 57015}, 'shuttle_location': {0: 'Niue', 1: 'Anguilla', 2: 'Russian Federation'}}
```

Though the error is resolved, there needs to be further discussion with the team to check whether the dictionary needs to be in a certain format so that it will be previewed, as I am seeing a blank box in the UI with the above preview response. An example preview response (`TablePreview`) which works:

```
preview={'index': [0, 1, 2, 3, 4], 'columns': ['id', 'company_rating', 'company_location', 'total_fleet_count', 'iata_approved'], 'data': [[35029, '100%', 'Niue', 4.0, 'f'], [30292, '67%', 'Anguilla', 6.0, 'f'], [19032, '67%', 'Russian Federation', 4.0, 'f'], [8238, '91%', 'Barbados', 15.0, 't'], [30342, nan, 'Sao Tome and Principe', 2.0, 't']]}
```

Next Steps:
Thank you
@ravi-kumar-pilla for the TablePreview, the UI expects the format below:
|
Thanks @jitu5, I think we should document the expected `{key: value}` pairs for all the supported preview types -
I will create a ticket to document the expected schema, and also see if we can enforce it in the backend so that the FE always gets what is required. For this ticket, I would like to add an except block to catch `PydanticSerializationError` and inform users to return a valid dictionary of `{key: value}` pairs. Thank you
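A hedged sketch of what such a check could look like on the backend, using only the stdlib for illustration (the real code would catch `PydanticSerializationError` from pydantic; the helper name and message here are made up):

```python
import json


def serialize_preview(preview: dict) -> str:
    """Hypothetical backend helper: surface a clear, user-facing message
    when a dataset's preview() returns values that cannot be serialized,
    instead of letting the raw serialization error bubble up."""
    try:
        return json.dumps(preview)
    except TypeError as exc:
        raise ValueError(
            "Dataset preview() must return a JSON-serializable dictionary "
            "of {key: value} pairs (lists, dicts, numbers, strings - not "
            "pandas or polars Series). Original error: %s" % exc
        ) from exc
```

The point of the wrapper is the error message: the user learns what their `preview` method should have returned, rather than seeing an opaque serialization traceback.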
Thanks, Ravi. Great debugging. I now realize this is happening because, while our TablePreview accepts a generic dict, our frontend only supports pandas DataFrames for displaying tabular data and not other formats such as Polars. We could add validation in the backend, but I do feel our frontend needs to be more versatile to handle different tabular formats. Update: we could check with Vizro how they do this in the BE.
The FE logic to distinguish between Polars and pandas dataframes is quite straightforward. Ideally, we could natively support both formats. My question is: do we anticipate more formats? If so, maybe we could use Plotly to preview data tables as well, instead of our own custom table preview?
I guess the main problem I see is that, as a user, it's not clear what the |
Yes, I think TablePreview is very generic when it is actually only meant for pandas. So I am proposing we rename TablePreview to PandasTablePreview and create another NewType called PolarsTablePreview, which also returns a dict. The only difference is that the FE would then have a different way of displaying each dict as a table. What do you all think? @astrojuanlu, @ravi-kumar-pilla, @jitu5
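For reference, the proposal could be sketched as below (the `PolarsTablePreview` alias is hypothetical at this point; only `TablePreview` exists in `kedro_datasets._typing`):

```python
from typing import NewType

# Two aliases over a plain dict; only the annotation differs, which the
# frontend could use to pick the right rendering component.
PandasTablePreview = NewType("PandasTablePreview", dict)
PolarsTablePreview = NewType("PolarsTablePreview", dict)


def preview_pandas() -> PandasTablePreview:
    # NewType adds no runtime behavior: the call returns the dict as-is.
    return PandasTablePreview({"index": [0], "columns": ["id"], "data": [[1]]})
```

Because `NewType` is erased at runtime, it documents intent but cannot enforce the dict's shape, which is exactly the limitation discussed later in this thread.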
@rashidakanchwala If the logic to distinguish between Polars and pandas dataframes is straightforward in the FE, then based on the type we can create a separate component for each type to display.
To clarify: say I change my `preview` method to this:

```python
def preview(self, nrows: int) -> PolarsTablePreview:
    subset = self._load().head(nrows)
    df_dict = {}
    for column in subset.columns:
        df_dict[column] = subset[column]
    return df_dict
```

My understanding is that someone needs to make sure the output of the function can be serialized.
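One simple way to make such a per-column dict serializable, sketched here for illustration: convert each column to a plain list before returning (this assumes each value is an iterable of scalars, as both pandas and polars Series are).

```python
def to_serializable_preview(columns: dict) -> dict:
    """Hypothetical helper: replace Series-like column values with plain
    lists so the resulting dict survives JSON serialization."""
    return {name: list(values) for name, values in columns.items()}


# Tuples stand in for Series here; real code would pass Series objects.
preview = to_serializable_preview(
    {"id": (63561, 36260), "shuttle_location": ("Niue", "Anguilla")}
)
```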
I think ideally the user would be helped by the IDE to return a proper object. How can we achieve this, so that the journey is clearer?
I agree. @ravi-kumar-pilla's ticket here, #1884, proposes what you mentioned. At the time we decided to use NewType, we did not anticipate running into this; if we need to enforce validation, we might have to change the NewTypes into classes, which we did consider before.
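For illustration, changing the NewType into a class that validates on construction could look roughly like this (a hypothetical sketch, not code from the actual proposal):

```python
from dataclasses import dataclass


@dataclass
class TablePreview:
    """Hypothetical class-based replacement for the NewType, so the
    expected schema can be enforced when the object is created."""
    index: list
    columns: list
    data: list

    def __post_init__(self):
        # Reject malformed previews early, at the user's code, rather
        # than failing later during response serialization.
        if any(len(row) != len(self.columns) for row in self.data):
            raise ValueError("each row in 'data' must have one value per column")
```

With a class, the IDE can also autocomplete the fields, which addresses the "help the user return a proper object" point above.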
I was trying to prove whether my
and I get this on the CLI
Now, if I change the code to be this:
I get a perfectly empty table. So there are, at least, 4 problems:
Hey @astrojuanlu, sorry it's taken a bit to get back to this. Could you try this instead:
The expected format in the FE, as @jitu5 mentioned, is: preview={
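That shape has three keys, `index`, `columns`, and `data`, as the working examples elsewhere in this thread show. A stdlib-only sketch of building it from plain rows (the helper name is made up for illustration):

```python
def make_table_preview(rows, columns):
    """Build the dict shape the Kedro-Viz frontend renders as a table."""
    return {
        "index": list(range(len(rows))),    # row labels, 0..n-1
        "columns": list(columns),           # column headers
        "data": [list(row) for row in rows],  # one list per row
    }


preview = make_table_preview(
    rows=[(35029, "Niue"), (30292, "Anguilla")],
    columns=["id", "company_location"],
)
```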
Thanks @SajidAlamQB, I confirm that this works. Minor tweak:

```yaml
# catalog.yml
companies:
  type: kedro_preview.datasets.CustomCSVDataset
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      preview_args:
        nrows: 5
```

```python
from kedro_datasets.polars import CSVDataset
from kedro_datasets._typing import TablePreview


class CustomCSVDataset(CSVDataset):
    def preview(self, nrows: int) -> TablePreview:
        subset = self._load().head(nrows)
        return {
            "index": list(range(len(subset))),
            "columns": list(subset.columns),
            "data": subset.to_numpy().tolist(),
        }
```

It is my understanding then that the docs are wrong? https://docs.kedro.org/projects/kedro-viz/en/latest/preview_custom_datasets.html#extend-preview-to-custom-datasets

In other news, this confirms my intuition that there's nothing pandas-specific to this functionality, so we don't need to introduce `PolarsTablePreview`.

As a user, I think it would have been easier if I could have seen an error not on the Kedro-Viz validation side, but in my own code. Something like:
But then again, as discussed in kedro-org/kedro-plugins#504, it's not entirely clear where this validation logic should live...
Yes @astrojuanlu, you are correct. I believe the key points for this issue are:
I have a draft PR that adds a validation check for the expected data format in the kedro-viz backend, #2070, and opened a PR to update the documentation, #2074. We can discuss these solutions in the upcoming PS session.
Update from PS Session on Dataset Previews: We’ve identified two key action items to address the ongoing issue with custom dataset previews:
Closing this for now as docs have been updated, we can follow the spike on this new issue: #2090. |
Description
I tried adding preview support to a custom dataset like this:
And yet I got an error:
What am I doing wrong?
Your Environment
Include as many relevant details as possible about the environment you experienced the bug in:
Checklist