Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: fix CommaSeparatedListOutputParser to handle columns that may contain commas in it #26365

Merged
merged 6 commits into from
Nov 1, 2024

Conversation

jkyamog
Copy link
Contributor

@jkyamog jkyamog commented Sep 12, 2024

  • Description:
    Currently CommaSeparatedListOutputParser can't handle strings that may contain commas within a column. It would parse any commas as the delimiter.
    Ex.
    "foo, foo2", "bar", "baz"

It will create 4 columns: "foo", "foo2", "bar", "baz"

This should be 3 columns:

"foo, foo2", "bar", "baz"

  • Dependencies:
    Added 2 additional imports, but they are built in python packages.

import csv
from io import StringIO

  • Twitter handle: @jkyamog

  • Add tests and docs:

  1. added simple unit test test_multiple_items_with_comma

Copy link

vercel bot commented Sep 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 1, 2024 10:42pm

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. Ɑ: core Related to langchain-core 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 12, 2024
Copy link
Member

@efriis efriis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

@jkyamog
Copy link
Contributor Author

jkyamog commented Sep 17, 2024

Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

Yes I can definitely just make it a flag. I didn't realize it was breaking other use cases. I did run the test on this, and I added 1 additional test. So my understanding it covered the previous cases. Before adding a new flag, I would like to see if it's possible to cover for all identified use cases? Can you give me an example where it gets broken?

For you convenience here are the current tests are:

test_single_item
    text = "foo"
    expected = ["foo"]

test_multiple_items_with_spaces
    text = "foo, bar, baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items
    text = "foo,bar,baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items_with_comma <- new test that I added based on the previous existing one.
    text = '"foo, foo2",bar,baz'
    expected = ["foo, foo2", "bar", "baz"]

Can you tell me what cases where it fails?

text = "FAILING CASE"
expected = ["foo", "bar" .....]

@efriis
Copy link
Member

efriis commented Sep 19, 2024

test_multiple_items_with_comma_existing_behavior
text = '" is a double quote character,another double quote character is ",this cell has no double quote'
expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']

note that the behavior will be different using the new flag (then it will match your test)

@jkyamog
Copy link
Contributor Author

jkyamog commented Sep 23, 2024

test_multiple_items_with_comma_existing_behavior text = '" is a double quote character,another double quote character is ",this cell has no double quote' expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']

note that the behavior will be different using the new flag (then it will match your test)

Sorry I only looked at this right now. Adding the flag on class level CommaSeparatedListOutputParser(parse_quoted_cells=True) it gives a lot of issues of existing test/code can't seem to instantiate it. Not sure, but maybe because it is being serialized and pydantic is not aware of the attribute parse_quoted_cells?

I then tried to see if I can just add the flag on the method level.

    assert parser.parse(text,parse_quoted_cells=True) == expected
    assert add(parser.transform(t for t in text)) == expected

This works on the .parse, but fails on .transform. I am not sure what is the purpose of .transform, I guess that is how its called down the chain? Appreciate some guidance on this, my initial thoughts that this would be simple change and even keeping the original behavior but I might be missing something.

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Nov 1, 2024
@efriis
Copy link
Member

efriis commented Nov 1, 2024

let's count the comma values in double quotes as a bugfix but leave format instructions

@efriis efriis enabled auto-merge (squash) November 1, 2024 22:34
@efriis efriis merged commit 830cad7 into langchain-ai:master Nov 1, 2024
77 checks passed
shjunn pushed a commit to shjunn/langchain that referenced this pull request Nov 4, 2024
…ontain commas in it (langchain-ai#26365)

- **Description:**
Currently CommaSeparatedListOutputParser can't handle strings that may
contain commas within a column. It would parse any commas as the
delimiter.
Ex. 
"foo, foo2", "bar", "baz"

It will create 4 columns: "foo", "foo2", "bar", "baz"

This should be 3 columns:

"foo, foo2", "bar", "baz"

- **Dependencies:**
Added 2 additional imports, but they are built in python packages.

import csv
from io import StringIO

- **Twitter handle:** @jkyamog

- [ ] **Add tests and docs**: 
1. added simple unit test test_multiple_items_with_comma

---------

Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Bagatur <[email protected]>
Co-authored-by: Bagatur <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: core Related to langchain-core size:M This PR changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants