Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Dataset could not be previewed" #1847

Closed
1 task
astrojuanlu opened this issue Apr 8, 2024 · 19 comments
Closed
1 task

"Dataset could not be previewed" #1847

astrojuanlu opened this issue Apr 8, 2024 · 19 comments
Assignees

Comments

@astrojuanlu
Copy link
Member

astrojuanlu commented Apr 8, 2024

Description

I tried adding preview support to a custom dataset like this:

diff --git a/conf/base/catalog.yml b/conf/base/catalog.yml
index 5a9adb8..aef45cf 100644
--- a/conf/base/catalog.yml
+++ b/conf/base/catalog.yml
@@ -2,6 +2,10 @@ submissions-raw:
   type: reddit_summarizer.datasets.DeltaPolarsDataset
   filepath: s3://reddit-submissions/submissions-raw
   credentials: minio_object_store
+  metadata:
+    kedro-viz:
+      preview_args:
+        nrows: 5
 
 submissions-summaries:
   type: reddit_summarizer.datasets.DeltaPolarsDataset
diff --git a/src/reddit_summarizer/datasets/delta_polars_dataset.py b/src/reddit_summarizer/datasets/delta_polars_dataset.py
index e6f0fff..a9a2724 100644
--- a/src/reddit_summarizer/datasets/delta_polars_dataset.py
+++ b/src/reddit_summarizer/datasets/delta_polars_dataset.py
@@ -3,6 +3,8 @@ import typing as t
 import polars as pl
 from kedro.io import AbstractDataset
 
+from kedro_datasets._typing import TablePreview
+
 
 class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
     """``DeltaDataset`` loads/saves data from/to a Delta Table using an underlying
@@ -48,3 +50,10 @@ class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
             storage_options=self._storage_options,
             metadata=self._metadata,
         )
+
+    def preview(self, nrows: int) -> TablePreview:
+        subset = self._load().head(nrows)
+        df_dict = {}
+        for column in subset.columns:
+            df_dict[column] = subset[column]
+        return df_dict

And yet I got an error:

[04/08/24 09:52:40] WARNING  'submissions-raw' could not be previewed. Full exception: TypeError: DeltaPolarsDataset.preview() missing 1 required positional argument:      flowchart.py:814
                             'nrows'

What am I doing wrong?

Your Environment

Include as many relevant details as possible about the environment you experienced the bug in:

  • Web browser system and version:
  • Operating system and version:
  • NodeJS version used (if relevant):
  • Kedro version used (if relevant): 0.19.3
  • Python version used (if relevant):

Checklist

  • Include labels so that we can categorise your issue
@rashidakanchwala
Copy link
Contributor

The code looks right to me. @SajidAlamQB , since u added 4 dataset previews recently, did you encounter this error? Is there anything u think looks incorrect?

@SajidAlamQB
Copy link
Contributor

Maybe you could try setting a default value for nrows in the method definition nrows: int = 5.

@astrojuanlu
Copy link
Member Author

Setting a default value avoids that error, although there must be something fishy happening here:

preview_args = (
cls.data_node.get_preview_args() if cls.data_node.viz_metadata else None
)
if preview_args is None:
return cls.dataset.preview()
return cls.dataset.preview(**preview_args)

On the other hand, now I get another error, but that's unrelated to this issue:

[04/08/24 16:31:28] ERROR    Exception in ASGI application                                                                      httptools_impl.py:424
                                                                                                                                                     
                             ╭────────────────────────────── Traceback (most recent call last) ───────────────────────────────╮                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/uvicorn/protocols/http/httptools_impl.py:419 in run_asgi                            │                      
                             │                                                                                                │                      
                             │   416 │   # ASGI exception wrapper                                                             │                      
                             │   417 │   async def run_asgi(self, app: "ASGI3Application") -> None:                           │                      
                             │   418 │   │   try:                                                                             │                      
                             │ ❱ 419 │   │   │   result = await app(  # type: ignore[func-returns-value]                      │                      
                             │   420 │   │   │   │   self.scope, self.receive, self.send                                      │                      
                             │   421 │   │   │   )                                                                            │                      
                             │   422 │   │   except BaseException as exc:                                                     │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/uvicorn/middleware/proxy_headers.py:84 in __call__                                  │                      
                             │                                                                                                │                      
                             │   81 │   │   │   │   │   port = 0                                                              │                      
                             │   82 │   │   │   │   │   scope["client"] = (host, port)  # type: ignore[arg-type]              │                      
                             │   83 │   │                                                                                     │                      
                             │ ❱ 84 │   │   return await self.app(scope, receive, send)                                       │                      
                             │   85                                                                                           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/fastapi/applications.py:1054 in __call__                                            │                      
                             │                                                                                                │                      
                             │   1051 │   async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:       │                      
                             │   1052 │   │   if self.root_path:                                                              │                      
                             │   1053 │   │   │   scope["root_path"] = self.root_path                                         │                      
                             │ ❱ 1054 │   │   await super().__call__(scope, receive, send)                                    │                      
                             │   1055 │                                                                                       │                      
                             │   1056 │   def add_api_route(                                                                  │                      
                             │   1057 │   │   self,                                                                           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/applications.py:123 in __call__                                           │                      
                             │                                                                                                │                      
                             │   120 │   │   scope["app"] = self                                                              │                      
                             │   121 │   │   if self.middleware_stack is None:                                                │                      
                             │   122 │   │   │   self.middleware_stack = self.build_middleware_stack()                        │                      
                             │ ❱ 123 │   │   await self.middleware_stack(scope, receive, send)                                │                      
                             │   124 │                                                                                        │                      
                             │   125 │   def on_event(self, event_type: str) -> typing.Callable:  # type: ignore[type-arg]    │                      
                             │   126 │   │   return self.router.on_event(event_type)  # pragma: nocover                       │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/middleware/errors.py:186 in __call__                                      │                      
                             │                                                                                                │                      
                             │   183 │   │   │   # We always continue to raise the exception.                                 │                      
                             │   184 │   │   │   # This allows servers to log the error, or allows test clients               │                      
                             │   185 │   │   │   # to optionally raise the error within the test case.                        │                      
                             │ ❱ 186 │   │   │   raise exc                                                                    │                      
                             │   187 │                                                                                        │                      
                             │   188 │   def format_line(                                                                     │                      
                             │   189 │   │   self, index: int, line: str, frame_lineno: int, frame_index: int                 │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/middleware/errors.py:164 in __call__                                      │                      
                             │                                                                                                │                      
                             │   161 │   │   │   await send(message)                                                          │                      
                             │   162 │   │                                                                                    │                      
                             │   163 │   │   try:                                                                             │                      
                             │ ❱ 164 │   │   │   await self.app(scope, receive, _send)                                        │                      
                             │   165 │   │   except Exception as exc:                                                         │                      
                             │   166 │   │   │   request = Request(scope)                                                     │                      
                             │   167 │   │   │   if self.debug:                                                               │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/middleware/exceptions.py:62 in __call__                                   │                      
                             │                                                                                                │                      
                             │   59 │   │   else:                                                                             │                      
                             │   60 │   │   │   conn = WebSocket(scope, receive, send)                                        │                      
                             │   61 │   │                                                                                     │                      
                             │ ❱ 62 │   │   await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)          │                      
                             │   63 │                                                                                         │                      
                             │   64 │   def http_exception(self, request: Request, exc: Exception) -> Response:               │                      
                             │   65 │   │   assert isinstance(exc, HTTPException)                                             │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/_exception_handler.py:64 in wrapped_app                                   │                      
                             │                                                                                                │                      
                             │   61 │   │   │   │   handler = _lookup_exception_handler(exception_handlers, exc)              │                      
                             │   62 │   │   │                                                                                 │                      
                             │   63 │   │   │   if handler is None:                                                           │                      
                             │ ❱ 64 │   │   │   │   raise exc                                                                 │                      
                             │   65 │   │   │                                                                                 │                      
                             │   66 │   │   │   if response_started:                                                          │                      
                             │   67 │   │   │   │   msg = "Caught handled exception, but response already started."           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/_exception_handler.py:53 in wrapped_app                                   │                      
                             │                                                                                                │                      
                             │   50 │   │   │   await send(message)                                                           │                      
                             │   51 │   │                                                                                     │                      
                             │   52 │   │   try:                                                                              │                      
                             │ ❱ 53 │   │   │   await app(scope, receive, sender)                                             │                      
                             │   54 │   │   except Exception as exc:                                                          │                      
                             │   55 │   │   │   handler = None                                                                │                      
                             │   56                                                                                           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/routing.py:758 in __call__                                                │                      
                             │                                                                                                │                      
                             │   755 │   │   """                                                                              │                      
                             │   756 │   │   The main entry point to the Router class.                                        │                      
                             │   757 │   │   """                                                                              │                      
                             │ ❱ 758 │   │   await self.middleware_stack(scope, receive, send)                                │                      
                             │   759 │                                                                                        │                      
                             │   760 │   async def app(self, scope: Scope, receive: Receive, send: Send) -> None:             │                      
                             │   761 │   │   assert scope["type"] in ("http", "websocket", "lifespan")                        │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/routing.py:778 in app                                                     │                      
                             │                                                                                                │                      
                             │   775 │   │   │   match, child_scope = route.matches(scope)                                    │                      
                             │   776 │   │   │   if match == Match.FULL:                                                      │                      
                             │   777 │   │   │   │   scope.update(child_scope)                                                │                      
                             │ ❱ 778 │   │   │   │   await route.handle(scope, receive, send)                                 │                      
                             │   779 │   │   │   │   return                                                                   │                      
                             │   780 │   │   │   elif match == Match.PARTIAL and partial is None:                             │                      
                             │   781 │   │   │   │   partial = route                                                          │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/routing.py:299 in handle                                                  │                      
                             │                                                                                                │                      
                             │   296 │   │   │   │   )                                                                        │                      
                             │   297 │   │   │   await response(scope, receive, send)                                         │                      
                             │   298 │   │   else:                                                                            │                      
                             │ ❱ 299 │   │   │   await self.app(scope, receive, send)                                         │                      
                             │   300 │                                                                                        │                      
                             │   301 │   def __eq__(self, other: typing.Any) -> bool:                                         │                      
                             │   302 │   │   return (                                                                         │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/routing.py:79 in app                                                      │                      
                             │                                                                                                │                      
                             │    76 │   │   │   │   response = await run_in_threadpool(func, request)                        │                      
                             │    77 │   │   │   await response(scope, receive, send)                                         │                      
                             │    78 │   │                                                                                    │                      
                             │ ❱  79 │   │   await wrap_app_handling_exceptions(app, request)(scope, receive, send)           │                      
                             │    80 │                                                                                        │                      
                             │    81 │   return app                                                                           │                      
                             │    82                                                                                          │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/_exception_handler.py:64 in wrapped_app                                   │                      
                             │                                                                                                │                      
                             │   61 │   │   │   │   handler = _lookup_exception_handler(exception_handlers, exc)              │                      
                             │   62 │   │   │                                                                                 │                      
                             │   63 │   │   │   if handler is None:                                                           │                      
                             │ ❱ 64 │   │   │   │   raise exc                                                                 │                      
                             │   65 │   │   │                                                                                 │                      
                             │   66 │   │   │   if response_started:                                                          │                      
                             │   67 │   │   │   │   msg = "Caught handled exception, but response already started."           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/_exception_handler.py:53 in wrapped_app                                   │                      
                             │                                                                                                │                      
                             │   50 │   │   │   await send(message)                                                           │                      
                             │   51 │   │                                                                                     │                      
                             │   52 │   │   try:                                                                              │                      
                             │ ❱ 53 │   │   │   await app(scope, receive, sender)                                             │                      
                             │   54 │   │   except Exception as exc:                                                          │                      
                             │   55 │   │   │   handler = None                                                                │                      
                             │   56                                                                                           │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/starlette/routing.py:74 in app                                                      │                      
                             │                                                                                                │                      
                             │    71 │   │                                                                                    │                      
                             │    72 │   │   async def app(scope: Scope, receive: Receive, send: Send) -> None:               │                      
                             │    73 │   │   │   if is_async_callable(func):                                                  │                      
                             │ ❱  74 │   │   │   │   response = await func(request)                                           │                      
                             │    75 │   │   │   else:                                                                        │                      
                             │    76 │   │   │   │   response = await run_in_threadpool(func, request)                        │                      
                             │    77 │   │   │   await response(scope, receive, send)                                         │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/fastapi/routing.py:296 in app                                                       │                      
                             │                                                                                                │                      
                             │    293 │   │   │   │   │   │   │   response_args["status_code"] = current_status_code          │                      
                             │    294 │   │   │   │   │   │   if sub_response.status_code:                                    │                      
                             │    295 │   │   │   │   │   │   │   response_args["status_code"] = sub_response.status_code     │                      
                             │ ❱  296 │   │   │   │   │   │   content = await serialize_response(                             │                      
                             │    297 │   │   │   │   │   │   │   field=response_field,                                       │                      
                             │    298 │   │   │   │   │   │   │   response_content=raw_response,                              │                      
                             │    299 │   │   │   │   │   │   │   include=response_model_include,                             │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/fastapi/routing.py:160 in serialize_response                                        │                      
                             │                                                                                                │                      
                             │    157 │   │   │   )                                                                           │                      
                             │    158 │   │                                                                                   │                      
                             │    159 │   │   if hasattr(field, "serialize"):                                                 │                      
                             │ ❱  160 │   │   │   return field.serialize(                                                     │                      
                             │    161 │   │   │   │   value,                                                                  │                      
                             │    162 │   │   │   │   include=include,                                                        │                      
                             │    163 │   │   │   │   exclude=exclude,                                                        │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/fastapi/_compat.py:147 in serialize                                                 │                      
                             │                                                                                                │                      
                             │   144 │   │   ) -> Any:                                                                        │                      
                             │   145 │   │   │   # What calls this code passes a value that already called                    │                      
                             │   146 │   │   │   # self._type_adapter.validate_python(value)                                  │                      
                             │ ❱ 147 │   │   │   return self._type_adapter.dump_python(                                       │                      
                             │   148 │   │   │   │   value,                                                                   │                      
                             │   149 │   │   │   │   mode=mode,                                                               │                      
                             │   150 │   │   │   │   include=include,                                                         │                      
                             │                                                                                                │                      
                             │ /Users/juan_cano/Projects/QuantumBlackLabs/workshop-kedro-huggingface/.venv/lib/python3.11/sit │                      
                             │ e-packages/pydantic/type_adapter.py:333 in dump_python                                         │                      
                             │                                                                                                │                      
                             │   330 │   │   Returns:                                                                         │                      
                             │   331 │   │   │   The serialized object.                                                       │                      
                             │   332 │   │   """                                                                              │                      
                             │ ❱ 333 │   │   return self.serializer.to_python(                                                │                      
                             │   334 │   │   │   __instance,                                                                  │                      
                             │   335 │   │   │   mode=mode,                                                                   │                      
                             │   336 │   │   │   by_alias=by_alias,                                                           │                      
                             ╰────────────────────────────────────────────────────────────────────────────────────────────────╯                      
                             PydanticSerializationError: Unable to serialize unknown type: <class                                                    
                             'polars.series.series.Series'>    

@rashidakanchwala rashidakanchwala moved this from Inbox to Backlog in Kedro-Viz Apr 8, 2024
@rashidakanchwala rashidakanchwala moved this from Backlog to Todo in Kedro-Viz Apr 15, 2024
@ravi-kumar-pilla ravi-kumar-pilla self-assigned this Apr 26, 2024
@ravi-kumar-pilla ravi-kumar-pilla moved this from Todo to In Progress in Kedro-Viz Apr 26, 2024
@ravi-kumar-pilla
Copy link
Contributor

Hi @astrojuanlu ,

Thank you for raising the issue. I am able to produce the bug on my side. The documentation for creating a custom dataset seems to create a dictionary but the value inside it is of type pandas.series or polars.series (in your case).

I created a CustomDataset mimicking the excel_dataset implementation and modified the preview method as below -

    def preview(self, nrows, ncolumns, filters) -> TablePreview:
      dataset_copy = self._copy()
      dataset_copy._load_args["nrows"] = nrows
      filtered_data = dataset_copy.load()

      for column, value in filters.items():
          filtered_data = filtered_data[filtered_data[column] == value]
      subset = filtered_data.iloc[:nrows, :ncolumns]
      df_dict = {}
      for column in subset.columns:
          df_dict[column] = subset[column]
      
      return df_dict

catalog -

shuttles:
  type: demo_project.extras.datasets.custom_dataset.CustomDataset
  filepath: ${_base_location}/01_raw/shuttles.xlsx
  metadata:
    kedro-viz:
      layer: raw
      preview_args:
        nrows: 5
        ncolumns: 2 
        filters: {
          engine_type: Quantum 
        } 

This gives me the below preview data -

preview={'id': 0    63561
1    36260
2    57015
Name: id, dtype: int64, 'shuttle_location': 0                  Niue
1              Anguilla
2    Russian Federation
Name: shuttle_location, dtype: object},

So, when FAST API tries to searialize this response it throws the error PydanticSerializationError rightly. I tried to correct the preview method as below - (change - df_dict[column] = subset[column].to_dict())

  def preview(self, nrows, ncolumns, filters) -> TablePreview:
      dataset_copy = self._copy()
      dataset_copy._load_args["nrows"] = nrows
      filtered_data = dataset_copy.load()

      for column, value in filters.items():
          filtered_data = filtered_data[filtered_data[column] == value]
      subset = filtered_data.iloc[:nrows, :ncolumns]
      df_dict = {}
      for column in subset.columns:
          df_dict[column] = subset[column].to_dict()
      
      return df_dict

Now the response is a valid dict like -

 preview={'id': {0: 63561, 1: 36260, 2: 57015}, 'shuttle_location': {0: 'Niue', 1: 'Anguilla', 2: 'Russian Federation'}}

Though the error is resolved, there needs to be further discussion with the team to check if the dictionary needs to be in a certain format, so that it will be previewed as I am seeing a blank box in the UI with the above preview response.

image

An example preview response (TablePreview) which works -

preview={'index': [0, 1, 2, 3, 4], 'columns': ['id', 'company_rating', 'company_location', 'total_fleet_count', 'iata_approved'], 'data': [[35029, '100%', 'Niue', 4.0, 'f'], [30292, '67%', 'Anguilla', 6.0, 'f'], [19032, '67%', 'Russian Federation', 4.0, 'f'], [8238, '91%', 'Barbados', 15.0, 't'], [30342, nan, 'Sao Tome and Principe', 2.0, 't']]}

Next Steps:

  1. Confirm with the team (@jitu5 , @SajidAlamQB ) if there needs to be any specific format for the preview dictionary to be viewed in the UI ?
  2. Should the user be responsible to return a valid dictionary key-value pairs, from the preview method ? (cc: @rashidakanchwala )
  3. In case the user returns an invalid dictionary key-value pairs fast api raises the exception as it did. Should we show a custom error message for the user or leave it like this ? I think we should catch (PydanticSerializationError) for now and show the user some custom message which would allow them to check the preview data being returned.

Thank you

@jitu5
Copy link
Contributor

jitu5 commented Apr 30, 2024

Confirm with the team (@jitu5 , @SajidAlamQB ) if there needs to be any specific format for the preview dictionary to be viewed in the UI ?

@ravi-kumar-pilla for the TablePreview, UI expect in below format

preview={
    'index': number[], 
    'columns': string[], 
    'data': any[][]  // List[List[Any]] 
}

@ravi-kumar-pilla
Copy link
Contributor

Thanks @jitu5 , I think we should document the expected {key:value} pairs for all the supported preview types -

TablePreview = NewType("TablePreview", dict)
ImagePreview = NewType("ImagePreview", bytes)
PlotlyPreview = NewType("PlotlyPreview", dict)
JSONPreview = NewType("JSONPreview", dict)

I will create a ticket to document the expected schema and also see if we can enforce that in the backend so that, FE always gets what is required.

For this ticket, I would like to add an except block to catch PydanticSerializationError and inform the users to return a valid dictionary {key:value} pair. Thank you

@rashidakanchwala
Copy link
Contributor

rashidakanchwala commented Apr 30, 2024

Thanks, Ravi. Great debugging. I now realize this is happening because while our TablePreview accepts a generic dict, our frontend only supports Pandas DataFrames for displaying tabular data and not any other formats such as Polars.

We could add validation in the backend, but I do feel our frontend needs to be more versatile to handle different tabular formats.

Update - we could check with Vizro, how they do this in the BE.

@rashidakanchwala
Copy link
Contributor

The FE logic to distinguish between Polars and Pandas dataframe is quite straightforward.

Ideally, we could natively support both formats. My question is, do we anticipate more formats? If so, maybe we could use Plotly to preview data tables as well, instead of our own custom table preview?

@astrojuanlu?

@astrojuanlu
Copy link
Member Author

I guess the main problem I see is that, as a user, it's not clear what the preview() method should return. I initially did the df_dict thing because I copy-pasted it from somewhere else without really understanding how it works.

@rashidakanchwala
Copy link
Contributor

Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.

and create another NewType called PolarsTablePreview which also returns a dict.

The only difference is now in the FE, we have a different way of displaying both dicts as tables. What do you all think? @astrojuanlu , @ravi-kumar-pilla , @jitu5

@jitu5
Copy link
Contributor

jitu5 commented May 7, 2024

@rashidakanchwala If the logic to distinguish between Polars and Pandas dataframe is straightforward at FE then bases on the type, we can create a separate components for each type to display.

@astrojuanlu
Copy link
Member Author

Yes, I think TablePreview is very generic when it is actually only meant for Pandas. So I am proposing we rename TablePreview to PandasTablePreview.

and create another NewType called PolarsTablePreview which also returns a dict.

To clarify: if I change my preview function to return the PolarsTablePreview you're proposing and I don't change the body, would it work?

    def preview(self, nrows: int) -> PolarsTablePreview:
        subset = self._load().head(nrows)
        df_dict = {}
        for column in subset.columns:
            df_dict[column] = subset[column]
        return df_dict

My understanding is that someone needs to make sure the output of the function can be serialized.

  1. If this is the user responsibility, then the object returned from preview should have some constraints. Where should these be enforced?
  2. If the user can return whatever, then somebody else should take care of serializing that. But my understanding is that there are limits to how smart we can be about the object returned by preview, which takes me back to 1.

I think ideally the user would be helped by the IDE to return a proper object. How can we achieve this, so the journey is more clear?

@rashidakanchwala
Copy link
Contributor

I agree, @ravi-kumar-pilla's ticket here #1884 proposes what you mentioned.

At the time we decided to use NewType, we did not anticipate running into this; if we need to enforce validation - we might have to change the NewTypes into classes that we did consider before.

@astrojuanlu astrojuanlu moved this from In Progress to Todo in Kedro-Viz May 8, 2024
@astrojuanlu
Copy link
Member Author

I was trying to prove whether TablePreview could work with a Polars dataframe, but the truth is that I cannot even make it work with a Pandas dataframe...

my preview function is now

    def preview(self, nrows: int=5) -> TablePreview:
        subset = self._load().head(nrows).to_pandas()
        df_dict = {}
        for column in subset.columns:
            df_dict[column] = subset[column]
        return df_dict

and I get this on the CLI

PydanticSerializationError: Unable to serialize unknown type: <class                                  
                             'pandas.core.series.Series'> 

Now, if I change the code to be this:

df_dict[column] = subset[column].to_list()  # Serializable

I get a perfectly empty table:

image

So there are, at least, 4 problems:

  1. The preview function is called without arguments?
  2. The return type from the preview function does not match the schema that the frontend expects
  3. It doesn't work for Pandas dataframes as explained in the docs https://docs.kedro.org/projects/kedro-viz/en/stable/preview_custom_datasets.html
  4. Even after fixing serialization issues, the frontend shows an empty table

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Sep 2, 2024

I was trying to prove whether TablePreview could work with a Polars dataframe, but the truth is that I cannot even make it work with a Pandas dataframe...

    def preview(self, nrows: int=5) -> TablePreview:
        subset = self._load().head(nrows).to_pandas()
        df_dict = {}
        for column in subset.columns:
            df_dict[column] = subset[column]
        return df_dict

Hey @astrojuanlu sorry its taken a bit to get back to this, could you try this instead.

    def preview(self, nrows: int=5) -> TablePreview:
        subset = self._load().head(nrows).to_pandas()
        preview_data = {
            'index': list(subset.index),  # List of row indices
            'columns': list(subset.columns),  # List of column names
            'data': subset.values.tolist()  # List of rows, where each row is a list of values
        }
        return preview_data

The expected format in FE is this as @jitu5 mentioned:

preview={
'index': number[],
'columns': string[],
'data': any[][] // List[List[Any]]
}

@astrojuanlu
Copy link
Member Author

Thanks @SajidAlamQB, I confirm that this works. Minor tweak:

# catalog.yml
companies:
  type: kedro_preview.datasets.CustomCSVDataset
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      preview_args:
        nrows: 5
from kedro_datasets.polars import CSVDataset

from kedro_datasets._typing import TablePreview


class CustomCSVDataset(CSVDataset):
    def preview(self, nrows: int) -> TablePreview:
        subset = self._load().head(nrows)
        return {
            "index": list(range(len(subset))),
            "columns": list(subset.columns),
            "data": subset.to_numpy().tolist(),
        }

It is my understanding then that the docs are wrong? https://docs.kedro.org/projects/kedro-viz/en/latest/preview_custom_datasets.html#extend-preview-to-custom-datasets

In other news, this confirms my intuition that there's nothing pandas-specific to this functionality, so we don't need to introduce PandasTablePreview, PolarsTablePreview etc.

As a user, I think it would have been easier if I could have seen an error not on Kedro-Viz validation side, but on my own code. Something like:

[09/02/24 16:48:24] WARNING  'companies' could not be previewed. Full exception: ValidationError: 1 validation error for NGTablePreview                                              flowchart.py:701
                             index                                                                                                                                                                   
                               Input should be a valid list [type=list_type, input_value=1, input_type=int]                                                                                          

But then again, as discussed in kedro-org/kedro-plugins#504 it's not entirely clear where this validation logic should live...

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Sep 3, 2024

Yes @astrojuanlu you are correct I believe the Key Points for this issue are:

  • Expected Data Format: The frontend expects the preview data in a specific format for TablePreview (a dictionary with "index", "columns", and "data" keys), which wasn't initially clear.

  • Incorrect Documentation: There seems to be a mistake between what the documentation is saying and what is actually required by the frontend.

I have a draft PR that adds validation check for the expected data format in kedro-viz backend, #2070 and opened a PR to update the documentation, #2074. We can discuss these solutions in the upcoming PS session.

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Sep 5, 2024

Update from PS Session on Dataset Previews:

We’ve identified two key action items to address the ongoing issue with custom dataset previews:

  1. [Spike] Create a Proof of Concept (PoC) for Type Hints in the IDE:
    By using a type dict or other typing tools, see if the IDE will be able to provide hints that lets users know about the required structure of the preview dictionary (e.g., index, columns, and data keys for TablePreview). Assess how effective the PoC is in helping users keep to the expected preview format and reducing the likelihood of errors due to incorrect return types or structures.

  2. [Docs] Enhance Documentation for Dataset Previews:
    Expand the current documentation to provide detailed examples of the expected return format for various preview types (e.g., TablePreview, ImagePreview, PlotlyPreview, JSONPreview).
    Document the required {key: value} pairs for the relevant preview types.

@SajidAlamQB
Copy link
Contributor

Closing this for now as docs have been updated, we can follow the spike on this new issue: #2090.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

5 participants