Allow extending the load_dataset parameters in custom tasks inheriting AbsTask #299

Conversation


@gariepyalex gariepyalex commented Mar 29, 2024

Currently, it is only possible to pass a path and revision to load_dataset. This is fairly limiting, and forces users to implement their own AbsTask.load_data function, which relies on the internals of the library.

This PR allows specifying any parameter supported by datasets.load_dataset in custom tasks. Currently, the metadata is specified as such:

    my_task = TaskMetadata(
        name="MyTask",
        hf_hub_name="org/dataset",
        revision="1.0",
        ...
    )

This is hard to extend, as we would need to add a new key to that Pydantic object for each load_dataset parameter we want to support.

This PR proposes to instead have the following structure:

    my_task = TaskMetadata(
        name="MyTask",
        dataset={
            "path": "org/dataset",
            "revision": "123"
        },
        ...
    )

This allows users to add any key they want (dataset config name, token, etc.). All key/value pairs are passed to load_dataset. Note that this is done in a backward-compatible manner: a Pydantic validator accepts the old parameters and populates the dataset dictionary.
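For illustration, such a backward-compatibility validator could look roughly like this (a minimal Pydantic v2 sketch with illustrative names, not the exact mteb implementation):

```python
# Hypothetical sketch: a validator that maps the legacy hf_hub_name/revision
# fields onto the new `dataset` dictionary. Names are illustrative only.
from typing import Any, Optional

from pydantic import BaseModel, model_validator


class TaskMetadata(BaseModel):
    name: str
    dataset: Optional[dict[str, Any]] = None
    # Deprecated fields, accepted for backward compatibility.
    hf_hub_name: Optional[str] = None
    revision: Optional[str] = None

    @model_validator(mode="before")
    @classmethod
    def _populate_dataset(cls, values: dict[str, Any]) -> dict[str, Any]:
        if (
            isinstance(values, dict)
            and values.get("dataset") is None
            and values.get("hf_hub_name")
        ):
            # Old-style metadata: build the dataset dict from the legacy keys
            # so downstream code only ever reads `dataset`.
            values["dataset"] = {
                "path": values["hf_hub_name"],
                "revision": values.get("revision"),
            }
        return values
```

With this in place, both the old `hf_hub_name="org/dataset"` style and the new `dataset={...}` style produce the same `dataset` dictionary.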

Migration

We've included in the PR a migration of all the built-in tasks to avoid logging a deprecation warning when running them. This is what causes most of the line changes.

Refactoring

While migrating the tasks to the new metadata_dict, we noticed that, much of the time, AbsTask.load_data was overridden for the sole purpose of inserting a dataset_transform call. This pattern was so common that it led to a lot of duplication. We propose instead to call dataset_transform from AbsTask, with the default being a no-op.

Bug fixes

In a few places, the revision of the dataset was not passed to load_dataset, leading to a discrepancy between the revision in the metadata and the actual loaded data. These were fixed, and I indicated the locations of these issues in the PR.

Testing

I see that the repo contains test suites for all abstract tasks. Please let me know if any additional testing is required from our end.

cc. @gbmarc1

@gariepyalex gariepyalex force-pushed the allow-to-add-load-dataset-parameters branch from 59abc83 to 9a25877 on March 31, 2024
@KennethEnevoldsen
Contributor

This PR might also relate to the discussion in #301 about creating two dataset sources (mteb org + original dataset).

@gariepyalex
Contributor Author

I'll address the comments some time tomorrow!

@MartinBernstorff
Contributor

This seems like a very pragmatic change, with an extremely clean PR description. Nice job!

Moving the load_dataset logic and calling dataset_transform from the AbsTask might make some dataset instances slightly less clear for new contributors. AFAICT it's a deduplication vs. indirection/coupling trade-off, and I think it's very worth it in this instance 👍

As a side-note: I've been looking into codemodification-programming for cases like this, something like gritql. If you have any experience here, I'd love to hear about it in a discussion.

@gariepyalex gariepyalex force-pushed the allow-to-add-load-dataset-parameters branch from 0ed134c to db79794 on April 2, 2024
@gariepyalex
Contributor Author

I've been looking into codemodification-programming for cases like this, something like gritql. If you have any experience here, I'd love to hear about it in a discussion.

Very interesting, this is much fancier than my Vim macros I ended up using 🤣

We've addressed all the comments and should be ready to go! We've also added additional testing for load_data.

cc. @gbmarc1

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) April 2, 2024 16:53
@KennethEnevoldsen
Contributor

Perfect @gariepyalex, I have set the tests to run; assuming they pass, it will be merged. Thanks again for the contribution! If you want to participate in the upcoming MMTEB, feel free to add your names and 2 points to the points sheet (if not, feel free to ignore this part).

@KennethEnevoldsen KennethEnevoldsen merged commit 953780d into embeddings-benchmark:main Apr 2, 2024
5 checks passed
MartinBernstorff pushed a commit that referenced this pull request Apr 10, 2024
…eriting AbsTask (#299)

* Allow extending the load_dataset parameters

* format

* Fix test

* remove duplicated logic from AbsTask, now handled in the metadata

* add tests

* remove comments, moved to PR

* format

* extend metadata dict from super class

* Remove additional load_data

* test: adding very high level test

* Remove hf_hub_name and add test

* Fix revision in output file

---------

Co-authored-by: gbmarc1 <[email protected]>