Allow extending the load_dataset parameters in custom tasks inheriting AbsTask #299

Conversation


@gariepyalex gariepyalex commented Mar 29, 2024

Currently, it is only possible to pass a path and revision to load_dataset. This is fairly limiting, and forces users to implement their own AbsTask.load_data function, which relies on the internals of the library.

This PR allows specifying any parameter supported by datasets.load_dataset in custom tasks. Currently, the metadata is specified as such:

    my_task = TaskMetadata(
        name="MyTask",
        hf_hub_name="org/dataset",
        revision="1.0",
        ...
    )

This is hard to extend, as we would need to add a new key to that Pydantic object for each load_dataset parameter we want to support.

This PR proposes to instead have the following structure:

    my_task = TaskMetadata(
        name="MyTask",
        dataset={
            "path": "org/dataset",
            "revision": "123"
        },
        ...
    )

This allows users to add any key they want (dataset config name, token, etc.). All key/value pairs are passed to load_dataset. Note that this is done in a backward-compatible manner: a Pydantic validator accepts the old parameters and populates the dataset dictionary.
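For illustration, such a backward-compatibility validator could look roughly like this (a minimal Pydantic v2 sketch with illustrative names, not the exact mteb implementation):

```python
# Hypothetical sketch: a validator that maps the legacy hf_hub_name/revision
# fields onto the new `dataset` dictionary. Names are illustrative only.
from typing import Any, Optional

from pydantic import BaseModel, model_validator


class TaskMetadata(BaseModel):
    name: str
    dataset: Optional[dict[str, Any]] = None
    # Deprecated fields, accepted for backward compatibility.
    hf_hub_name: Optional[str] = None
    revision: Optional[str] = None

    @model_validator(mode="before")
    @classmethod
    def _populate_dataset(cls, values: dict[str, Any]) -> dict[str, Any]:
        if (
            isinstance(values, dict)
            and values.get("dataset") is None
            and values.get("hf_hub_name")
        ):
            # Old-style metadata: build the dataset dict from the legacy keys
            # so downstream code only ever reads `dataset`.
            values["dataset"] = {
                "path": values["hf_hub_name"],
                "revision": values.get("revision"),
            }
        return values
```

With this in place, both the old `hf_hub_name="org/dataset"` style and the new `dataset={...}` style produce the same `dataset` dictionary.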

Migration

We've included in the PR a migration of all the built-in tasks to avoid logging a deprecation warning when running them. This is what causes most of the line changes.

Refactoring

While migrating the tasks to the new metadata_dict, we noticed that, much of the time, AbsTask.load_data was overridden for the sole purpose of inserting a dataset_transform call. This pattern was so common that it led to a lot of duplication. We propose instead to call dataset_transform from AbsTask, with the default being a no-op.

Bug fixes

In a few places, the revision of the dataset was not passed to load_dataset, leading to a discrepancy between the revision in the metadata and the actual loaded data. These were fixed, and I indicated the locations of these issues in the PR.

Testing

I see that the repo contains test suites for all abstract tasks. Please let me know if any additional testing is required from our end.

cc. @gbmarc1

@gariepyalex gariepyalex force-pushed the allow-to-add-load-dataset-parameters branch from 59abc83 to 9a25877 on March 31, 2024
@KennethEnevoldsen
Contributor

This PR might also relate to the discussion in #301 about creating two dataset sources (mteb org + original dataset).

@gariepyalex
Contributor Author

I'll address the comments some time tomorrow!

@MartinBernstorff
Contributor

This seems like a very pragmatic change, with an extremely clean PR description. Nice job!

Moving the load_dataset logic and calling dataset_transform from the AbsTask might make some dataset instances slightly less clear for new contributors. AFAICT it's a deduplication vs. indirection/coupling trade-off, and I think it's very worth it in this instance 👍

As a side-note: I've been looking into codemodification-programming for cases like this, something like gritql. If you have any experience here, I'd love to hear about it in a discussion.

@gariepyalex gariepyalex force-pushed the allow-to-add-load-dataset-parameters branch from 0ed134c to db79794 on April 2, 2024
@gariepyalex
Contributor Author

I've been looking into codemodification-programming for cases like this, something like gritql. If you have any experience here, I'd love to hear about it in a discussion.

Very interesting, this is much fancier than my Vim macros I ended up using 🤣

We've addressed all the comments and should be ready to go! We've also added additional testing for load_data.

cc. @gbmarc1

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) April 2, 2024 16:53
@KennethEnevoldsen
Contributor

Perfect @gariepyalex, I have set the tests to run; assuming they pass, it will be merged. Thanks again for the contribution! If you want to participate in the upcoming MMTEB, feel free to add your names and 2 points to the points sheet (if not, feel free to ignore this part).

@KennethEnevoldsen KennethEnevoldsen merged commit 953780d into embeddings-benchmark:main Apr 2, 2024
5 checks passed
MartinBernstorff pushed a commit that referenced this pull request Apr 10, 2024
…eriting AbsTask (#299)

* Allow extending the load_dataset parameters

* format

* Fix test

* remove duplicated logic from AbsTask, now handled in the metadata

* add tests

* remove comments, moved to PR

* format

* extend metadata dict from super class

* Remove additional load_data

* test: adding very high level test

* Remove hf_hub_name and add test

* Fix revision in output file

---------

Co-authored-by: gbmarc1 <[email protected]>