Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Split up embedding_functions.py #1965

Closed
atroyn opened this issue Apr 3, 2024 · 5 comments
Closed

[Feature Request]: Split up embedding_functions.py #1965

atroyn opened this issue Apr 3, 2024 · 5 comments
Labels
by-chroma enhancement New feature or request good first issue Good for newcomers

Comments

@atroyn
Copy link
Contributor

atroyn commented Apr 3, 2024

Describe the problem

The embedding functions module is a single file which is getting really unwieldy. https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions.py

For example, it's difficult to land PR's like #1447 because they need auxiliary utilities, which probably shouldn't live in a global module.

Describe the proposed solution

Split up the file into one file per EF, which allows us to bundle the required utilities / helpers with each function separately.

Alternatives considered

Not doing this, which I think we will eventually have to do.

Importance

would make my life easier

Additional Information

No response

@atroyn atroyn added enhancement New feature or request good first issue Good for newcomers labels Apr 3, 2024
@nablabits
Copy link
Contributor

Hi @atroyn hope you are doing great. I was looking for something else and suddenly landed in this issue. It will be my first contribution to this repo, so I'm fairly new to the code base. However, the changes feel pretty straightforward, are you happy for me to pick this up?

@atroyn
Copy link
Contributor Author

atroyn commented Apr 9, 2024

Hi @nablabits, happy for you to take a crack at it. Please add me as a reviewer on your PR when it's ready.

@nablabits
Copy link
Contributor

nablabits commented Apr 13, 2024

Hey @atroyn, just a quick update on this as I'm halfway through the changes to make sure we are on the same page:

Key Bits

  • I'm creating a module called embedding_functions whose __init__ file will import the classes contained in the individual files. This way we won't break the existing implementations of the users. Something like this:
chromadb/utils
├── embedding_functions
│   ├── __init__.py
│   ├── amazon_bedrock_embedding_function.py
│   ├── cohere_embedding_function.py
│   ├── google_embedding_function.py
│   ├── huggingface_embedding_function.py
  • I've placed the classes of google —GooglePalmEmbeddingFunction, GoogleGenerativeAiEmbeddingFunction, GoogleVertexEmbeddingFunction— and, HF —HuggingFaceEmbeddingFunction, HuggingFaceEmbeddingServer— in the same file
  • It feels to me that DefaultEmbeddingFunction should live in __init__ but I don't have a strong opinion on that.
  • I see that there are few tests for the embeddings. I appreciate that it's hard because they involve imports that are not required by requirements.txt but we can still test things like these exceptions. I'm happy to address them unless you tell me otherwise.

Other Nits

  • To sort the imports I will use isort (which is not a requirement) unless you tell me otherwise.
  • I saw that there are docstrings and strings over the 88 chars defined for black which is expected because black makes no effort in reflowing them. I'm a bit paranoid about that, so unless you tell me otherwise I will reflow them to fit the limitation.
  • I have spotted a few other super minor things like this useless try block as requests is a requirement or this None assignment to a str

PR approach

This PR will contain a great deal of changes so to facilitate it what I have in mind is:

  • Create a commit with the new module (empty)
  • Create a commit for each moved function so it will be easy to check that they contain exactly the same content
  • I will do all above skipping pre-commit hooks as they were complaining about something, so next I will create a commit with the pre-commit cleanings
  • Then I will address the extra tests, and the other nits

Sorry for this bible up here 😅

@atroyn
Copy link
Contributor Author

atroyn commented Apr 15, 2024

Hi @nablabits - this seems like the right approach to me. Another approach to producing clean self-contained changes might be to use stacked PRs with something like https://graphite.dev/ (we use this at Chroma), but I am also happy to follow along with your approach for stacking commits.

I agree with default living in __init__ - all it should do however is import the actual default EF from its own file, to make sure this stays decoupled and we're not doing any actual EF stuff in __init__

Regarding complaining from the pre-commit hooks, it's likely that a lot of those complaints are due to type errors. If we can also clean up the type errors in the EFs as part of these changes that would be a huge W.

@nablabits
Copy link
Contributor

Oh wow, thanks for sharing, this graphite thing looks really neat and the learning curve seems not very steep. I'm not quite sure, though, whether this issue is the best place for to use this tool for the first time as I may end up messing up everything 😬 so, I'd say we are better off if I try this tool on some personal project and with a smaller piece of work and then on future contributions 🤞 , I can open those stacked PRs as needed.

Sure thing, each EF will live in their own separate file 👍

atroyn added a commit that referenced this issue Jun 20, 2024
## Description of changes
- It addresses this issue:
#1965

There are a number of things going on in this PR. I've split the changes
in meaningful commits to facilitate the review.
- The first commit creates the empty module and renames the old
`embedding_functions.py` ffc5e91
- Then, there's one commit per embedding function moved so the reviewer
can check side by side that the contents were moved word by word.
- Above was made skipping the linters to avoid noise, so after them
there's a commit that lints the files which should be pretty innocuous .
6ad7598
- However, the linting over the onnx embedding function felt quite
sensible to me, so I decided to put it in a separate commit
8f08d60

Besides, there are a few docstrings and some tests that I'm working on
as discussed
[here](#1965 (comment))
for which I will open a follow up PR to avoid noise here.
	 
@atroyn, can you please take a look?

## Test plan
*How are these changes tested?*
I launched a couple of times the whole test suite finding that they took
a lot of time to complete (in some 1h I had only covered 35%). The
second time I launched them, the computer even died, but not really sure
if it was because of these tests only or because something else
happening at that time.

I didn't know that there were tests for JS or rust, I will run them next
and tick the checkboxes as appropriate. I've put the outcomes in [this
comment](#2034 (comment))

- [x] Tests for this feature pass locally with `pytest` for python,
- [ ] The whole test suite pass locally.
- [ ] `yarn test` for js, 
- [ ] `cargo test` for rust

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

In principle all the docstrings are as they were, and users'
implementations won't be affected. However, as said before, I'm working
on a follow up PR that will add a few more tests and edit docstrings
where appropriate.

---------

Co-authored-by: atroyn <[email protected]>
Anush008 pushed a commit to Anush008/chroma that referenced this issue Jun 27, 2024
- It addresses this issue:
chroma-core#1965

There are a number of things going on in this PR. I've split the changes
in meaningful commits to facilitate the review.
- The first commit creates the empty module and renames the old
`embedding_functions.py` ffc5e91
- Then, there's one commit per embedding function moved so the reviewer
can check side by side that the contents were moved word by word.
- Above was made skipping the linters to avoid noise, so after them
there's a commit that lints the files which should be pretty innocuous .
6ad7598
- However, the linting over the onnx embedding function felt quite
sensible to me, so I decided to put it in a separate commit
8f08d60

Besides, there are a few docstrings and some tests that I'm working on
as discussed
[here](chroma-core#1965 (comment))
for which I will open a follow up PR to avoid noise here.

@atroyn, can you please take a look?

*How are these changes tested?*
I launched a couple of times the whole test suite finding that they took
a lot of time to complete (in some 1h I had only covered 35%). The
second time I launched them, the computer even died, but not really sure
if it was because of these tests only or because something else
happening at that time.

I didn't know that there were tests for JS or rust, I will run them next
and tick the checkboxes as appropriate. I've put the outcomes in [this
comment](chroma-core#2034 (comment))

- [x] Tests for this feature pass locally with `pytest` for python,
- [ ] The whole test suite pass locally.
- [ ] `yarn test` for js,
- [ ] `cargo test` for rust

*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

In principle all the docstrings are as they were, and users'
implementations won't be affected. However, as said before, I'm working
on a follow up PR that will add a few more tests and edit docstrings
where appropriate.

---------

Co-authored-by: atroyn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
by-chroma enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants