Add read_custom_data to DocumentDataset #574

Open
wants to merge 1 commit into main

Conversation

praateekmahajan (Collaborator)

Description

We need the ability to read arbitrary datasets using a user-defined function (UDF).
The UDF should accept the following arguments (a sketch follows this list):
- files: A list of file paths.
- file_type: The type of the file to read (in case you want to handle different file types differently).
- backend: pd / cudf.
- add_filename: True / False / str; in the str case the UDF can resolve the filename column via from nemo_curator.utils.distributed_utils import _resolve_filename_col.
- columns: Can be used to define the set of columns to return (the output columns are always sorted by name).
- input_meta: Can be used for typecasting if needed.
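A minimal sketch of what such a UDF could look like for JSONL files, assuming a pandas backend. The helper name read_jsonl_udf and the exact behavior of _resolve_filename_col are assumptions for illustration, not part of this PR:

```python
import pandas as pd

from nemo_curator.utils.distributed_utils import _resolve_filename_col


def read_jsonl_udf(files, file_type, backend, add_filename, columns, input_meta):
    """Illustrative UDF: read a list of JSONL files into a single pandas DataFrame."""
    # A complete UDF would also honor backend == "cudf" by using cudf instead of pandas.
    dfs = []
    for path in files:
        df = pd.read_json(path, lines=True, dtype=input_meta)
        if add_filename:
            # Assumption: _resolve_filename_col maps True / a str to the filename column name.
            df[_resolve_filename_col(add_filename)] = path
        dfs.append(df)
    out = pd.concat(dfs, ignore_index=True)
    if columns is not None:
        out = out[columns]
    # Output columns are always sorted by name, per the description above.
    return out[sorted(out.columns)]
```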

Usage

# Add snippet demonstrating usage
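The snippet is left as a placeholder in the PR; below is a hedged sketch of one possible call pattern. Only the UDF keywords come from the description above; the classmethod argument names (input_files, read_func_single_partition) are assumptions for illustration:

```python
from nemo_curator.datasets import DocumentDataset

# read_jsonl_udf is the illustrative UDF sketched in the Description above.
dataset = DocumentDataset.read_custom_data(
    input_files=["data/part-000.jsonl", "data/part-001.jsonl"],  # assumed keyword name
    file_type="jsonl",
    backend="pandas",  # or "cudf"; exact accepted values per the final API
    add_filename=False,
    columns=["id", "text"],
    input_meta={"id": str, "text": str},
    read_func_single_partition=read_jsonl_udf,  # assumed keyword name
)

print(dataset.df.head())
```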

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
praateekmahajan added the gpuci (Run GPU CI/CD on PR) label on Feb 25, 2025

ryantwolf (Collaborator) left a comment:


Good with me.


sarahyurick (Collaborator) left a comment:


LGTM, added minor comments for your consideration.

@@ -155,6 +155,73 @@ def read_pickle(
)
)

@classmethod
def read_custom_data(

Suggested change:
- def read_custom_data(
+ def read_custom(

to more closely match read_json, read_parquet, read_pickle?

expected_df[["embedding", "id"]], # because we sort columns by name,
)

# Test multiple files per partition

Should this comment be for the test above?
