Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] 1965 Split up embedding functions #2034

Merged

Conversation

nablabits
Copy link
Contributor

@nablabits nablabits commented Apr 21, 2024

Description of changes

There are a number of things going on in this PR. I've split the changes in meaningful commits to facilitate the review.

  • The first commit creates the empty module and renames the old embedding_functions.py ffc5e91
  • Then, there's one commit per embedding function moved so the reviewer can check side by side that the contents were moved word by word.
  • Above was made skipping the linters to avoid noise, so after them there's a commit that lints the files which should be pretty innocuous . 6ad7598
  • However, the linting over the onnx embedding function felt quite sensible to me, so I decided to put it in a separate commit 8f08d60

Besides, there are a few docstrings and some tests that I'm working on as discussed here for which I will open a follow up PR to avoid noise here.

@atroyn, can you please take a look?

Test plan

How are these changes tested?
I launched a couple of times the whole test suite finding that they took a lot of time to complete (in some 1h I had only covered 35%). The second time I launched them, the computer even died, but not really sure if it was because of these tests only or because something else happening at that time.

I didn't know that there were tests for JS or rust, I will run them next and tick the checkboxes as appropriate. I've put the outcomes in this comment

  • Tests for this feature pass locally with pytest for python,
  • The whole test suite pass locally.
  • yarn test for js,
  • cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

In principle all the docstrings are as they were, and users' implementations won't be affected. However, as said before, I'm working on a follow up PR that will add a few more tests and edit docstrings where appropriate.

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)



class AmazonBedrockEmbeddingFunction(EmbeddingFunction[Documents]):
def __init__(
self,
session: "boto3.Session", # noqa: F821 # Quote for forward reference
session: Any,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why mypy was complaining with the literal boto3 so I decided to put an Any

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have a typecheck here with dynamic boto3 import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I loosely remind this to have been a pain, mypy didn't seem to like removing type hints and the literals either 🤔 Do you have something specific in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types are really nice but both python and ts have a way to drive you into using Any. Maybe we can just try with instanceof.

I also like how the ecosystem is moving away from trying to have it all in the core package. Take for example how Langchain🦜🔗 has a partner lib for Chroma that is a completely separate package with all necessary deps. But that is probably a problem for another PR :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I really like that approach, but agreed, maybe a problem for the future

@@ -34,7 +29,7 @@ def __init__(self, url: str, model_name: str) -> None:
self._model_name = model_name
self._session = requests.Session()

def __call__(self, input: Documents) -> Embeddings:
def __call__(self, input: Union[Documents, str]) -> Embeddings:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input if isinstance(input, list) else [input] is allowing for things that are not lists subclasses.

@@ -56,10 +51,10 @@ def __call__(self, input: Documents) -> Embeddings:
if "data" not in resp:
raise RuntimeError(resp["detail"])

embeddings = resp["data"]
embeddings: List[dict[str, Union[str, List[float]]]] = resp["data"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to the openai one, but here we can cast the expected type as then we are doing e["index"].

"""
Generate the embeddings for the given `input`.

# About ignoring types
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise now that one of the implementations does e.index and the other does e["index"]. I wasn't sure that the latter was allowed, so maybe we can cast the types as done here,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use dynamic imports from the lib?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean by the first, can you be more explicit? Is there anywhere in the codebase where I can take a look?
As for the Dict[str,Any] it throws me this:

Argument 1 to "sorted" has incompatible type "Dict[str, Any]"; expected "Iterable[Dict[str, Union[str, List[float]]]]"

Which points to what I did in jina

Also, removing the type hint ignore, will raise additional annoying issues as it says that Name "embeddings" already defined on line 124 [no-redef] so we will be forced to tweak actual code we don't have tests for. For me it feels a call for potential issues 🤔

@@ -29,7 +31,7 @@ def __init__(self, api_key: str = "", api_url="https://infer.roboflow.com") -> N
api_url (str, optional): The URL of the Roboflow API. Defaults to "https://infer.roboflow.com".
"""
if not api_key:
api_key = os.environ.get("ROBOFLOW_API_KEY")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is incompatible with the str type of the signature. I don't see much value in this conditional logic as given the default, a not api_key would mean the user passing an empty string on purpose which I find quite unlikely, so happy to remove this bit altogether (maybe as a part of the follow up PR)

@nablabits
Copy link
Contributor Author

nablabits commented Apr 21, 2024

I've been running some tests with some disappointing outcomes 😕

  • My local version of node is complaining with yarn.
  • cargo seems not fully compiling on my ubuntu:
Error: Custom { kind: Other, error: "protoc failed: chromadb/proto/chroma.proto: This file contains proto3 optional fields, but --experimental_allow_proto3_optional was not set.\n" }
  • Finally, pytest managed to run in some 2.5h but it threw several failures:
FAILED chromadb/test/property/test_collections_with_database_tenant_overwrite.py::test_collections_with_tenant_database_overwrite - exceptiongroup.BaseExceptionGroup: Hypothesis found 2 distinct failures. (2 sub-exceptions)
FAILED chromadb/test/property/test_collections_with_database_tenant_overwrite.py::test_repeat_failure - chromadb.errors.InvalidUUIDError: Could not parse A00 as a UUID
FAILED chromadb/test/stress/test_many_collections.py::test_many_collections[sqlite_persistent] - OSError: [Errno 24] Too many open files: '/tmp/tmp_g4jh1g1/07f976f6-1d15-47b6-ab12-e1e7ad2def5d/index_metadata.pickle'
FAILED chromadb/test/test_api.py::test_persist_index_loading[local_persist_api] - sqlite3.OperationalError: unable to open database file
FAILED chromadb/test/test_api.py::test_persist_index_loading_embedding_function[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_get_or_create_embedding_function[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist[local_persist_api] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[fastapi] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[fastapi_persistent] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[sqlite] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?
FAILED chromadb/test/test_api.py::test_persist_index_loading_params[sqlite_persistent] - ValueError: Could not connect to tenant default_tenant. Are you sure it exists?

For the first two I prefer not dealing with those things for the time being, unless someone tells me so.
Regarding the second, I will take a look this week but it does not seem particulary related to my changes although after those failures there could be some assertion pointing to them 🤔


Update: 2024-04-24
Today I spent some time checking the python failing tests:

  • Weirdly, the test_api ones passed without issues which make me think of either a one-off failure or some persistence in the pytest session:
    Screenshot from 2024-04-24 09-11-19
  • Regarding test_many_collections I had to stop it when it arrived to 11Gb of RAM 😅 which may explain why the first time my computer died. I'm keen to put some warning in Develop.md if we agree that's a sensible thing to do.
    Screenshot from 2024-04-24 09-21-38
  • In terms of the property/test_collections_* they seem related to this chromadb.errors.InvalidUUIDError: Could not parse A00 as a UUID. One of them takes 36" to run which is a bit of a pain to debug, so if anyone can shed some light it would be greatly appreciated.

Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is on the right path. There is a way to go, however.

  1. Move the embedding_function dir at the top level, and maybe for conciseness and alignment with the rest of the ecosystem, call it embeddings.
  2. Making such a significant change without the requisite testing is inappropriate. It is very difficult to compare each EF to its previous implementation in this way. I also see that you made some decisions to change things around, which makes reasoning about the changes in this PR even harder.
  3. Stacking these changes might have made it a little easier to review.

chromadb/utils/embedding_functions.py Outdated Show resolved Hide resolved
@nablabits
Copy link
Contributor Author

nablabits commented Apr 24, 2024

I think this is on the right path. There is a way to go, however.

1. Move the embedding_function dir at the top level, and maybe for conciseness and alignment with the rest of the ecosystem, call it `embeddings`.

2. Making such a significant change without the requisite testing is inappropriate. It is very difficult to compare each EF to its previous implementation in this way. I also see that you made some decisions to change things around, which makes reasoning about the changes in this PR even harder.

3. Stacking these changes might have made it a little easier to review.

Hi @tazarov, thanks for your feedback, greatly appreciated 🙂

For clarity, I'm going to put my thoughts here instead of replying the other comment you left, hope it makes sense.

First of all, I'm deeply sorry that you found the PR hard to make sense as I made every effort to make it as clear as possible 😞 splitting each EF into its own commit, so in my mind, the reviewer could put side by side two github windows with, say, this commit and easily check that the functions were moved word by word. I appreciate that if you use graphite —or gitkraken like me— you may not even go to github, sorry about that. I was advised that I could use this stacking process, but I dismissed the idea as this PR might not be the best place to test a new tool and stuck to a process I know. I could have open, though, different PRs manually as this is something that I've done in the past, but for this PR I find that a bit overkilling if you don't have a tool like graphite that, appears to me, does the heavy lifting for you.

I'm a bit confused as to why do you think the __init__.py will break existing implementations. In my experience this is what you do when you have to move a file to a module but the alternative you propose will also work. You can check that this approach works as this test file wasn't touched by this PR and it still holds. I'm happy with your approach but maybe you guys want to come to grips as to how do you want the code look as here we agreed the __init__.py approach.

Regarding the tests I totally agree with you, I found this whole thing very poorly tested, e.g. this OpenAIEmbeddingFunction does not have a single test, that's why I was planning to open a separate PR with all the missing tests even when, arguably, they are out of the scope of the issue itself. Still, I personally checked that all the tests in chromadb/test/ef/ are passing. The only ones failing locally for me are these but I didn't bother very much with passing all the test suite locally as I saw that you have a bunch of Github actions doing the full chore for you and for more platforms than my specific Ubuntu. Now I realise that the workflows need an a approval of a maintainer but in principle they should tell us what the failures are. Regarding the rust and js ones, tbh, is not in my short term plans to learn those things, that's why I prefer not to invest time on them if possible as they are ad hocs for me, hope this makes sense.

Sorry for all this lengthy reply 😅, let me know what you think. Meanwhile, I will address these two bits:

  • Make the python tests pass locally (comment)
  • Get working get_builtins() in an elegant way (comment)

@atroyn
Copy link
Contributor

atroyn commented Apr 24, 2024

@tazarov I'll take over review of this one, @nablabits and I have been working on it on #1965 and I've agreed to his approach (as you can see on the discussion on the issue)

@nablabits I'll allocate time to review this in depth in the next day or two, thanks for the changes here!

@atroyn atroyn requested a review from tazarov May 15, 2024 20:28
Copy link
Contributor

@tazarov tazarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A few minor nits, but I think we should merge this as soon as possible and leave some time to add tests/verify and merge the rest of the EFs prior to releasing 0.5.1



class AmazonBedrockEmbeddingFunction(EmbeddingFunction[Documents]):
def __init__(
self,
session: "boto3.Session", # noqa: F821 # Quote for forward reference
session: Any,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have a typecheck here with dynamic boto3 import?

"""
Generate the embeddings for the given `input`.

# About ignoring types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use dynamic imports from the lib?

@nablabits
Copy link
Contributor Author

Hi @atroyn, just a gently reminder that this folk is dangling 🙂. Let me know if you want me to do something else to land it.

@atroyn
Copy link
Contributor

atroyn commented Jun 12, 2024

@nablabits will be landing this week!

Copy link

vercel bot commented Jun 20, 2024

@atroyn is attempting to deploy a commit to the chromacore Team on Vercel.

A member of the Team first needs to authorize it.

@atroyn
Copy link
Contributor

atroyn commented Jun 20, 2024

We are good to merge! I made a few more changes:

  • the top-level embedding_functions module now dynamically imports anything that's a sub-class of EmbeddingFunction from any file in the same directory, and also adds it to the builtins list. This means future additions will be automatic here and won't need to be edited by contributors.

  • merged in the httpx over request changes from main

  • Made sure ChromaLangchainEmbeddingFunction is in the builtins list; it previously wasn't because of the way we use a factory to construct them. Here we just hardcode adding them.

  • Added tests to make sure that the right things were being imported, and all the things which were imported were the right type.

Thank you @nablabits for your hard work and patience!

@atroyn atroyn merged commit a5ea9a5 into chroma-core:main Jun 20, 2024
64 checks passed
atroyn added a commit that referenced this pull request Jun 21, 2024
@atroyn
Copy link
Contributor

atroyn commented Jun 21, 2024

My changes broke on python versions 3.9, 3.10 - reverting and fixing.

atroyn added a commit that referenced this pull request Jun 21, 2024
Reverts #2034

Failing on Python 3.9, 3.10, not on any other versions.
atroyn added a commit that referenced this pull request Jun 21, 2024
## Description of changes
The original attempt to split up the embedding functions failed because
of python 3.9 and 3.10 incompatibilities with `issubtype`.

Original PR here: #2034

Failing tests here:
https://github.com/chroma-core/chroma/actions/runs/9605053108/job/26491923410

The fix is changing `issubtype` to `isinstance`, which has the same
functionality.

## Test plan

Along with CI, tested locally with python 3.9 and 3.10 and confirmed
passing.

## Documentation Changes
N/A
@nablabits
Copy link
Contributor Author

We are good to merge! I made a few more changes:

* the top-level `embedding_functions` module now dynamically imports anything that's a sub-class of `EmbeddingFunction` from any file in the same directory, and also adds it to the builtins list. This means future additions will be automatic here and won't need to be edited by contributors.

* merged in the `httpx` over `request` changes from main

* Made sure `ChromaLangchainEmbeddingFunction` is in the builtins list; it previously wasn't because of the way we use a factory to construct them. Here we just hardcode adding them.

* Added tests to make sure that the right things were being imported, and all the things which were imported were the right type.

Thank you @nablabits for your hard work and patience!

@atroyn

Yay! Finally 🚀
To be honest it has been a great learning experience for me, thank you for the opportunity and the trust, greatly appreciated!!
BTW, that dynamic importing looks really cool —despite of the issues on the IDEs—

Copy link
Contributor

atroyn commented Jun 22, 2024

We will find a way to make it work on IDEs, as a first cut @tazarov will take a look!

Anush008 pushed a commit to Anush008/chroma that referenced this pull request Jun 27, 2024
- It addresses this issue:
chroma-core#1965

There are a number of things going on in this PR. I've split the changes
in meaningful commits to facilitate the review.
- The first commit creates the empty module and renames the old
`embedding_functions.py` ffc5e91
- Then, there's one commit per embedding function moved so the reviewer
can check side by side that the contents were moved word by word.
- Above was made skipping the linters to avoid noise, so after them
there's a commit that lints the files which should be pretty innocuous .
6ad7598
- However, the linting over the onnx embedding function felt quite
sensible to me, so I decided to put it in a separate commit
8f08d60

Besides, there are a few docstrings and some tests that I'm working on
as discussed
[here](chroma-core#1965 (comment))
for which I will open a follow up PR to avoid noise here.

@atroyn, can you please take a look?

*How are these changes tested?*
I launched a couple of times the whole test suite finding that they took
a lot of time to complete (in some 1h I had only covered 35%). The
second time I launched them, the computer even died, but not really sure
if it was because of these tests only or because something else
happening at that time.

I didn't know that there were tests for JS or rust, I will run them next
and tick the checkboxes as appropriate. I've put the outcomes in [this
comment](chroma-core#2034 (comment))

- [x] Tests for this feature pass locally with `pytest` for python,
- [ ] The whole test suite pass locally.
- [ ] `yarn test` for js,
- [ ] `cargo test` for rust

*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

In principle all the docstrings are as they were, and users'
implementations won't be affected. However, as said before, I'm working
on a follow up PR that will add a few more tests and edit docstrings
where appropriate.

---------

Co-authored-by: atroyn <[email protected]>
Anush008 pushed a commit to Anush008/chroma that referenced this pull request Jun 27, 2024
Reverts chroma-core#2034

Failing on Python 3.9, 3.10, not on any other versions.
Anush008 pushed a commit to Anush008/chroma that referenced this pull request Jun 27, 2024
## Description of changes
The original attempt to split up the embedding functions failed because
of python 3.9 and 3.10 incompatibilities with `issubtype`.

Original PR here: chroma-core#2034

Failing tests here:
https://github.com/chroma-core/chroma/actions/runs/9605053108/job/26491923410

The fix is changing `issubtype` to `isinstance`, which has the same
functionality.

## Test plan

Along with CI, tested locally with python 3.9 and 3.10 and confirmed
passing.

## Documentation Changes
N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants