-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] 1965 Split up embedding functions #2034
[ENH] 1965 Split up embedding functions #2034
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
|
||
|
||
class AmazonBedrockEmbeddingFunction(EmbeddingFunction[Documents]): | ||
def __init__( | ||
self, | ||
session: "boto3.Session", # noqa: F821 # Quote for forward reference | ||
session: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me why mypy
was complaining with the literal boto3
so I decided to put an Any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have a typecheck here with dynamic boto3 import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I loosely remind this to have been a pain, mypy
didn't seem to like removing type hints and the literals either 🤔 Do you have something specific in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Types are really nice but both python and ts have a way to drive you into using Any
. Maybe we can just try with instanceof
.
I also like how the ecosystem is moving away from trying to have it all in the core package. Take for example how Langchain🦜🔗 has a partner lib for Chroma that is a completely separate package with all necessary deps. But that is probably a problem for another PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I really like that approach, but agreed, maybe a problem for the future
@@ -34,7 +29,7 @@ def __init__(self, url: str, model_name: str) -> None: | |||
self._model_name = model_name | |||
self._session = requests.Session() | |||
|
|||
def __call__(self, input: Documents) -> Embeddings: | |||
def __call__(self, input: Union[Documents, str]) -> Embeddings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input if isinstance(input, list) else [input]
is allowing for things that are not lists subclasses.
@@ -56,10 +51,10 @@ def __call__(self, input: Documents) -> Embeddings: | |||
if "data" not in resp: | |||
raise RuntimeError(resp["detail"]) | |||
|
|||
embeddings = resp["data"] | |||
embeddings: List[dict[str, Union[str, List[float]]]] = resp["data"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to the openai one, but here we can cast the expected type as then we are doing e["index"]
.
""" | ||
Generate the embeddings for the given `input`. | ||
|
||
# About ignoring types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise now that one of the implementations does e.index
and the other does e["index"]
. I wasn't sure that the latter was allowed, so maybe we can cast the types as done here,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to use dynamic imports from the lib?
- v1 - https://github.com/openai/openai-python/blob/main/src/openai/types/create_embedding_response.py#L20
- v0 we can maybe cast to
Dict[str,Any]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what you mean by the first, can you be more explicit? Is there anywhere in the codebase where I can take a look?
As for the Dict[str,Any]
it throws me this:
Argument 1 to "sorted" has incompatible type "Dict[str, Any]"; expected "Iterable[Dict[str, Union[str, List[float]]]]"
Which points to what I did in jina
Also, removing the type hint ignore, will raise additional annoying issues as it says that Name "embeddings" already defined on line 124 [no-redef]
so we will be forced to tweak actual code we don't have tests for. For me it feels a call for potential issues 🤔
@@ -29,7 +31,7 @@ def __init__(self, api_key: str = "", api_url="https://infer.roboflow.com") -> N | |||
api_url (str, optional): The URL of the Roboflow API. Defaults to "https://infer.roboflow.com". | |||
""" | |||
if not api_key: | |||
api_key = os.environ.get("ROBOFLOW_API_KEY") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is incompatible with the str
type of the signature. I don't see much value in this conditional logic as given the default, a not api_key
would mean the user passing an empty string on purpose which I find quite unlikely, so happy to remove this bit altogether (maybe as a part of the follow up PR)
I've been running some tests with some disappointing outcomes 😕
For the first two I prefer not dealing with those things for the time being, unless someone tells me so. Update: 2024-04-24
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is on the right path. There is a way to go, however.
- Move the embedding_function dir at the top level, and maybe for conciseness and alignment with the rest of the ecosystem, call it
embeddings
. - Making such a significant change without the requisite testing is inappropriate. It is very difficult to compare each EF to its previous implementation in this way. I also see that you made some decisions to change things around, which makes reasoning about the changes in this PR even harder.
- Stacking these changes might have made it a little easier to review.
Hi @tazarov, thanks for your feedback, greatly appreciated 🙂 For clarity, I'm going to put my thoughts here instead of replying the other comment you left, hope it makes sense. First of all, I'm deeply sorry that you found the PR hard to make sense as I made every effort to make it as clear as possible 😞 splitting each EF into its own commit, so in my mind, the reviewer could put side by side two github windows with, say, this commit and easily check that the functions were moved word by word. I appreciate that if you use graphite —or gitkraken like me— you may not even go to github, sorry about that. I was advised that I could use this stacking process, but I dismissed the idea as this PR might not be the best place to test a new tool and stuck to a process I know. I could have open, though, different PRs manually as this is something that I've done in the past, but for this PR I find that a bit overkilling if you don't have a tool like graphite that, appears to me, does the heavy lifting for you. I'm a bit confused as to why do you think the Regarding the tests I totally agree with you, I found this whole thing very poorly tested, e.g. this Sorry for all this lengthy reply 😅, let me know what you think. Meanwhile, I will address these two bits: |
@tazarov I'll take over review of this one, @nablabits and I have been working on it on #1965 and I've agreed to his approach (as you can see on the discussion on the issue) @nablabits I'll allocate time to review this in depth in the next day or two, thanks for the changes here! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. A few minor nits, but I think we should merge this as soon as possible and leave some time to add tests/verify and merge the rest of the EFs prior to releasing 0.5.1
|
||
|
||
class AmazonBedrockEmbeddingFunction(EmbeddingFunction[Documents]): | ||
def __init__( | ||
self, | ||
session: "boto3.Session", # noqa: F821 # Quote for forward reference | ||
session: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have a typecheck here with dynamic boto3 import?
""" | ||
Generate the embeddings for the given `input`. | ||
|
||
# About ignoring types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to use dynamic imports from the lib?
- v1 - https://github.com/openai/openai-python/blob/main/src/openai/types/create_embedding_response.py#L20
- v0 we can maybe cast to
Dict[str,Any]
Hi @atroyn, just a gently reminder that this folk is dangling 🙂. Let me know if you want me to do something else to land it. |
@nablabits will be landing this week! |
@atroyn is attempting to deploy a commit to the chromacore Team on Vercel. A member of the Team first needs to authorize it. |
We are good to merge! I made a few more changes:
Thank you @nablabits for your hard work and patience! |
My changes broke on python versions 3.9, 3.10 - reverting and fixing. |
Reverts #2034 Failing on Python 3.9, 3.10, not on any other versions.
## Description of changes The original attempt to split up the embedding functions failed because of python 3.9 and 3.10 incompatibilities with `issubtype`. Original PR here: #2034 Failing tests here: https://github.com/chroma-core/chroma/actions/runs/9605053108/job/26491923410 The fix is changing `issubtype` to `isinstance`, which has the same functionality. ## Test plan Along with CI, tested locally with python 3.9 and 3.10 and confirmed passing. ## Documentation Changes N/A
Yay! Finally 🚀 |
We will find a way to make it work on IDEs, as a first cut @tazarov will take a look! |
- It addresses this issue: chroma-core#1965 There are a number of things going on in this PR. I've split the changes in meaningful commits to facilitate the review. - The first commit creates the empty module and renames the old `embedding_functions.py` ffc5e91 - Then, there's one commit per embedding function moved so the reviewer can check side by side that the contents were moved word by word. - Above was made skipping the linters to avoid noise, so after them there's a commit that lints the files which should be pretty innocuous . 6ad7598 - However, the linting over the onnx embedding function felt quite sensible to me, so I decided to put it in a separate commit 8f08d60 Besides, there are a few docstrings and some tests that I'm working on as discussed [here](chroma-core#1965 (comment)) for which I will open a follow up PR to avoid noise here. @atroyn, can you please take a look? *How are these changes tested?* I launched a couple of times the whole test suite finding that they took a lot of time to complete (in some 1h I had only covered 35%). The second time I launched them, the computer even died, but not really sure if it was because of these tests only or because something else happening at that time. I didn't know that there were tests for JS or rust, I will run them next and tick the checkboxes as appropriate. I've put the outcomes in [this comment](chroma-core#2034 (comment)) - [x] Tests for this feature pass locally with `pytest` for python, - [ ] The whole test suite pass locally. - [ ] `yarn test` for js, - [ ] `cargo test` for rust *Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?* In principle all the docstrings are as they were, and users' implementations won't be affected. However, as said before, I'm working on a follow up PR that will add a few more tests and edit docstrings where appropriate. --------- Co-authored-by: atroyn <[email protected]>
Reverts chroma-core#2034 Failing on Python 3.9, 3.10, not on any other versions.
## Description of changes The original attempt to split up the embedding functions failed because of python 3.9 and 3.10 incompatibilities with `issubtype`. Original PR here: chroma-core#2034 Failing tests here: https://github.com/chroma-core/chroma/actions/runs/9605053108/job/26491923410 The fix is changing `issubtype` to `isinstance`, which has the same functionality. ## Test plan Along with CI, tested locally with python 3.9 and 3.10 and confirmed passing. ## Documentation Changes N/A
Description of changes
embedding_functions.py
#1965There are a number of things going on in this PR. I've split the changes in meaningful commits to facilitate the review.
embedding_functions.py
ffc5e91Besides, there are a few docstrings and some tests that I'm working on as discussed here for which I will open a follow up PR to avoid noise here.
@atroyn, can you please take a look?
Test plan
How are these changes tested?
I launched a couple of times the whole test suite finding that they took a lot of time to complete (in some 1h I had only covered 35%). The second time I launched them, the computer even died, but not really sure if it was because of these tests only or because something else happening at that time.
I didn't know that there were tests for JS or rust, I will run them next and tick the checkboxes as appropriate. I've put the outcomes in this comment
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
In principle all the docstrings are as they were, and users' implementations won't be affected. However, as said before, I'm working on a follow up PR that will add a few more tests and edit docstrings where appropriate.