Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) #5171
…ift, rst) Signed-off-by: byhsu <[email protected]>
Could we add some unit tests for all of these? :) It may be easier to have a separate PR for each, but I will let you decide.
@dev2049 OK, I will add tests and split the PR :)
Added tests! Please take another look. Thanks! By the way, may I ask why I run into a mypy error locally, but not in the GitHub Actions run?
@eyurtsev Can you review? Thanks!
Left a minor suggestion about the enum. Let's update the exception to a ValueError rather than raising a generic exception.
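A minimal sketch of what this suggestion could look like. The enum members shown, the `get_separators` helper, and the separator lists are hypothetical illustrations, not the PR's actual code:

```python
from enum import Enum
from typing import List


class Language(str, Enum):
    # Hypothetical subset of the languages named in the PR title.
    GO = "go"
    JS = "js"
    RUST = "rust"


def get_separators(language: Language) -> List[str]:
    """Return illustrative separators for a language (hypothetical helper)."""
    if language == Language.GO:
        return ["\nfunc ", "\nvar ", "\n\n", "\n", " ", ""]
    if language == Language.JS:
        return ["\nfunction ", "\nconst ", "\n\n", "\n", " ", ""]
    # RUST is deliberately left unhandled here to show the error path.
    # ValueError signals that the argument's value is the problem,
    # which is more specific than a bare Exception.
    raise ValueError(
        f"Language {language!r} is not supported. "
        f"Please choose from {[member.value for member in Language]}."
    )
```

Callers can then catch `ValueError` specifically instead of a blanket `except Exception`.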
@@ -439,3 +440,261 @@ def __init__(self, **kwargs: Any):
            "",
        ]
        super().__init__(separators=separators, **kwargs)


class Language(str, Enum):
This looks good to me.
Have you considered using a union of literals instead of an enum?
Language = Union[Literal['go'], Literal['js'], ...]
This gives the same static typing guarantees, and a user does not have to import an extra enum.
I think the only downside is that it's harder to list all the options in the set.
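For illustration, a small sketch of the literal-based alternative. The alias name `LanguageName` and both helpers are hypothetical, and only three of the languages are shown. `typing.get_args` offers one way to list the options at runtime, though it is less discoverable than iterating an enum:

```python
from typing import List, Literal, get_args

# Shorthand for Union[Literal["go"], Literal["js"], Literal["rust"]].
# Hypothetical alias name; only a few languages are shown.
LanguageName = Literal["go", "js", "rust"]


def supported_languages() -> List[str]:
    """List the allowed literal values at runtime."""
    return list(get_args(LanguageName))


def check_language(name: str) -> str:
    """Validate a plain string against the literal values."""
    if name not in get_args(LanguageName):
        raise ValueError(
            f"Unsupported language {name!r}; expected one of {supported_languages()}."
        )
    return name
```

With this approach a caller passes a plain string such as `"go"`, and a static type checker flags values outside the literal set.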
Hmm, I would think that showing users the available languages is more important.
I believe VSCode/PyLance would still prompt appropriately for a union of literals, but the enum is more foolproof.
IDE auto-completion will work in either case.
I meant that with enums it may be more convenient for a developer to import the enum and then do something like loop through all the possibilities.
In [1]: from enum import Enum
In [2]: class A(Enum):
   ...:     foo = 'foo'
   ...:     bar = 'bar'
   ...:
In [4]: list(A)
Out[4]: [<A.foo: 'foo'>, <A.bar: 'bar'>]
Anyway, this works for now. I think we could propagate the file name through the document metadata via the source field; the code splitter could then potentially pick the language automatically and split accordingly.
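A rough sketch of that idea, assuming the document metadata is a plain dict with a `source` key holding a file path. The extension table and the `guess_language` helper are hypothetical and not part of this PR:

```python
from pathlib import PurePath
from typing import Dict, Optional

# Hypothetical mapping from file extension to a language name.
EXTENSION_TO_LANGUAGE: Dict[str, str] = {
    ".go": "go",
    ".js": "js",
    ".java": "java",
    ".rb": "ruby",
    ".rs": "rust",
}


def guess_language(metadata: Dict[str, str]) -> Optional[str]:
    """Guess a language from the 'source' field, or return None if unknown."""
    source = metadata.get("source")
    if not source:
        return None
    # Compare extensions case-insensitively, e.g. "Main.GO" maps to "go".
    suffix = PurePath(source).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

A splitter could call something like this first and fall back to a generic text splitter when the result is `None`.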
@ByronHsu I can take a look at the mypy error on Monday. I assume you checked from master?
@dev2049 I've added notebook examples and improved the tests. Could you please review again? Thanks!
As the title says, I added more code splitters.
The implementation is trivial, so I did not add separate tests for each splitter.
Let me know if you have any concerns.
Fixes #5170
Who can review?
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
@eyurtsev @hwchase17