[Feature Request]: Split up embedding_functions.py
#1965
Comments
Hi @atroyn, hope you are doing great. I was looking for something else and landed on this issue. It will be my first contribution to this repo, so I'm fairly new to the code base. However, the changes feel pretty straightforward. Are you happy for me to pick this up?
Hi @nablabits, happy for you to take a crack at it. Please add me as a reviewer on your PR when it's ready.
Hey @atroyn, just a quick update on this as I'm halfway through the changes, to make sure we are on the same page:
Key Bits
chromadb/utils
├── embedding_functions
│ ├── __init__.py
│ ├── amazon_bedrock_embedding_function.py
│ ├── cohere_embedding_function.py
│ ├── google_embedding_function.py
│ ├── huggingface_embedding_function.py
Other Nits
PR approach
This PR will contain a great many changes, so to make it easier to review, what I have in mind is:
Sorry for this bible up here 😅
Hi @nablabits - this seems like the right approach to me. Another approach to producing clean self-contained changes might be to use stacked PRs with something like https://graphite.dev/ (we use this at Chroma), but I am also happy to follow along with your approach for stacking commits.
I agree with default living in
Regarding complaining from the pre-commit hooks, it's likely that a lot of those complaints are due to type errors. If we can also clean up the type errors in the EFs as part of these changes that would be a huge W.
Oh wow, thanks for sharing, this graphite thing looks really neat and the learning curve doesn't seem very steep. I'm not quite sure, though, whether this issue is the best place to use this tool for the first time, as I may end up messing everything up 😬. So I'd say we are better off if I try it on some personal project with a smaller piece of work first; then, on future contributions 🤞, I can open those stacked PRs as needed. Sure thing, each EF will live in its own separate file 👍
## Description of changes

- It addresses this issue: #1965

There are a number of things going on in this PR. I've split the changes into meaningful commits to facilitate the review.

- The first commit creates the empty module and renames the old `embedding_functions.py` ffc5e91
- Then, there's one commit per embedding function moved, so the reviewer can check side by side that the contents were moved word for word.
- The above was done skipping the linters to avoid noise, so after those commits there's one that lints the files, which should be pretty innocuous. 6ad7598
- However, the linting over the onnx embedding function felt quite sensitive to me, so I decided to put it in a separate commit 8f08d60

Besides, there are a few docstrings and some tests that I'm working on as discussed [here](#1965 (comment)), for which I will open a follow-up PR to avoid noise here.

@atroyn, can you please take a look?

## Test plan

*How are these changes tested?*

I launched the whole test suite a couple of times and found that it took a long time to complete (in about 1h I had only covered 35%). The second time I launched it, the computer even died, but I'm not really sure whether that was because of these tests alone or because of something else happening at that time. I didn't know that there were tests for JS or Rust; I will run them next and tick the checkboxes as appropriate. I've put the outcomes in [this comment](#2034 (comment))

- [x] Tests for this feature pass locally with `pytest` for python,
- [ ] The whole test suite passes locally.
- [ ] `yarn test` for js,
- [ ] `cargo test` for rust

## Documentation Changes

*Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?*

In principle all the docstrings are as they were, and users' implementations won't be affected. However, as said before, I'm working on a follow-up PR that will add a few more tests and edit docstrings where appropriate.
--------- Co-authored-by: atroyn <[email protected]>
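
The claim that users' implementations won't be affected can be illustrated with a small sketch. It assumes, without confirmation from the PR itself, that the package's `__init__.py` re-exports each class from its own file so that the old import path keeps resolving. The package name `demo_embedding_functions`, the module `toy_embedding_function.py`, and the class `ToyEmbeddingFunction` are all hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package with one file per (hypothetical) embedding function."""
    pkg = root / "demo_embedding_functions"
    pkg.mkdir()
    # One module per embedding function; this class is a stand-in, not Chroma code.
    (pkg / "toy_embedding_function.py").write_text(
        "class ToyEmbeddingFunction:\n"
        "    def __call__(self, input):\n"
        "        return [[float(len(doc))] for doc in input]\n"
    )
    # The package __init__ re-exports the class, so callers can still import it
    # from the package itself rather than from the new per-function module.
    (pkg / "__init__.py").write_text(
        "from .toy_embedding_function import ToyEmbeddingFunction\n"
        "__all__ = ['ToyEmbeddingFunction']\n"
    )


def load_demo_package():
    """Build the demo package in a temp directory and import it."""
    root = Path(tempfile.mkdtemp())
    build_demo_package(root)
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module("demo_embedding_functions")
```

Calling `load_demo_package().ToyEmbeddingFunction()` resolves the class through the package name alone, which is the property a re-exporting `__init__.py` would preserve for existing imports.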
Describe the problem
The embedding functions module is a single file which is getting really unwieldy. https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions.py
For example, it's difficult to land PRs like #1447 because they need auxiliary utilities, which probably shouldn't live in a global module.
Describe the proposed solution
Split up the file into one file per EF, which allows us to bundle the required utilities / helpers with each function separately.
Alternatives considered
Not doing this, though I think we will eventually have to split the file anyway.
Importance
would make my life easier
Additional Information
No response