core: fix CommaSeparatedListOutputParser to handle columns that may contain commas in it #26365

jkyamog · 2024-09-12T00:49:52Z

Description:
Currently, CommaSeparatedListOutputParser can't handle values that contain commas within a column. It treats every comma as a delimiter, even inside a quoted value.
Example:
"foo, foo2", "bar", "baz"

This currently produces 4 columns: "foo", "foo2", "bar", "baz"

It should produce 3 columns:

"foo, foo2", "bar", "baz"

Dependencies:
Added 2 imports, both from the Python standard library.

import csv
from io import StringIO

Twitter handle: @jkyamog

Add tests and docs:
Added a simple unit test, test_multiple_items_with_comma.
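
The difference described above can be sketched with the two standard-library imports this PR mentions. This is a minimal illustration, not the parser's actual code: `split_naive` and `split_csv` are hypothetical helper names, and `split_naive` assumes the existing behavior amounts to splitting on every comma and stripping whitespace.

```python
import csv
from io import StringIO
from typing import List


def split_naive(text: str) -> List[str]:
    """Treat every comma as a delimiter (assumed stand-in for the old logic)."""
    return [part.strip() for part in text.split(",")]


def split_csv(text: str) -> List[str]:
    """Let the csv module honor double-quoted cells.

    Hypothetical helper for illustration; not the parser's actual method.
    """
    # skipinitialspace=True drops the space after each delimiter, so a quote
    # that follows ", " is still recognized as opening a quoted cell.
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    return [cell for row in reader for cell in row]


text = '"foo, foo2", "bar", "baz"'
naive_cells = split_naive(text)  # 4 cells, quote characters left in place
csv_cells = split_csv(text)      # 3 cells, quotes consumed by the csv module
```

With the naive split the quote characters also stay attached to the cells, which is why the csv-based approach is attractive when a model wraps values in double quotes.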



efriis

Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

jkyamog · 2024-09-17T23:45:31Z

> Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

Yes, I can definitely make it a flag. I didn't realize it was breaking other use cases. I ran the existing tests against this change and added 1 additional test, so my understanding was that the previous cases were covered. Before adding a new flag, I would like to see whether it's possible to cover all identified use cases. Can you give me an example where it breaks?

For your convenience, here are the current tests:

test_single_item
    text = "foo"
    expected = ["foo"]

test_multiple_items_with_spaces
    text = "foo, bar, baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items
    text = "foo,bar,baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items_with_comma <- new test that I added based on the previous existing one.
    text = '"foo, foo2",bar,baz'
    expected = ["foo, foo2", "bar", "baz"]

Can you tell me in which cases it fails?

text = "FAILING CASE"
expected = ["foo", "bar" .....]

efriis · 2024-09-19T05:13:42Z

test_multiple_items_with_comma_existing_behavior
text = '"this is a double quote character,another double quote character is ",this cell has no double quote'
expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']

note that the behavior will be different using the new flag (then it will match your test)
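
To make the breaking case concrete, here is a small sketch comparing a plain split with csv-based parsing on an input of the same shape as the case above (a stray double quote in the first and second cells). It uses only the standard library and assumes the existing behavior is equivalent to splitting on commas and stripping whitespace; it is not the parser's actual code.

```python
import csv
from io import StringIO

text = (
    '"this is a double quote character,'
    'another double quote character is ",'
    "this cell has no double quote"
)

# Plain split: double quotes are ordinary characters, so 3 cells come back.
split_cells = [part.strip() for part in text.split(",")]

# csv-based: the first quote opens a quoted cell that runs to the next quote,
# so the comma between them is kept as data and only 2 cells come back.
reader = csv.reader(StringIO(text), skipinitialspace=True)
csv_cells = [cell for row in reader for cell in row]
```

The two results differ in both cell count and content, which is the sense in which the csv-based logic changes behavior for inputs that contain unpaired or incidental double quotes.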

jkyamog · 2024-09-23T04:18:24Z

> test_multiple_items_with_comma_existing_behavior
> text = '"this is a double quote character,another double quote character is ",this cell has no double quote'
> expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']
>
> note that the behavior will be different using the new flag (then it will match your test)

Sorry, I only got to look at this now. Adding the flag at the class level, CommaSeparatedListOutputParser(parse_quoted_cells=True), causes a lot of issues: existing tests and code can't seem to instantiate the parser. I'm not sure, but maybe it's because the parser is serialized and pydantic is not aware of the parse_quoted_cells attribute?

I then tried adding the flag at the method level instead.

    assert parser.parse(text,parse_quoted_cells=True) == expected
    assert add(parser.transform(t for t in text)) == expected

This works for .parse but fails for .transform. I am not sure what the purpose of .transform is; I guess that is how the parser is called further down the chain? I would appreciate some guidance on this. My initial thought was that this would be a simple change, even while keeping the original behavior, but I might be missing something.
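
One way to see why a method-level flag reaches .parse but not .transform: if the streaming path calls parse() internally without forwarding extra keyword arguments (an assumption consistent with the failure described above, not a statement about the library's internals), only a flag stored on the instance is visible to both. The sketch below uses a plain dataclass as a hypothetical stand-in; `SketchListParser` and its simplified `transform` are illustrative names, not langchain-core classes.

```python
import csv
from dataclasses import dataclass
from io import StringIO
from typing import Iterable, Iterator, List


@dataclass
class SketchListParser:
    """Hypothetical stand-in, not the langchain-core class."""

    # Instance-level flag (name taken from the review suggestion above).
    parse_quoted_cells: bool = False

    def parse(self, text: str) -> List[str]:
        if self.parse_quoted_cells:
            reader = csv.reader(StringIO(text), skipinitialspace=True)
            return [cell for row in reader for cell in row]
        # Default: keep the split-on-every-comma behavior.
        return [part.strip() for part in text.split(",")]

    def transform(self, chunks: Iterable[str]) -> Iterator[List[str]]:
        # Simplified streaming path: buffer all chunks, then call parse().
        # No keyword arguments are forwarded, so only self.* flags apply.
        yield self.parse("".join(chunks))


text = '"foo, foo2",bar,baz'
flagged = SketchListParser(parse_quoted_cells=True)
default = SketchListParser()

parsed = flagged.parse(text)
streamed = next(flagged.transform(t for t in text))
legacy = default.parse(text)
```

In this sketch `parsed` and `streamed` agree because both paths read the same instance attribute, while `legacy` keeps the old split. In the real parser a class-level flag would also need to be declared as a proper model field so that instantiation and serialization recognize it.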

efriis · 2024-11-01T22:34:24Z

Let's count handling of commas inside double quotes as a bugfix, but leave the format instructions unchanged.

@jkyamog

…ontain commas in it (langchain-ai#26365)

- **Description:** Currently CommaSeparatedListOutputParser can't handle strings that may contain commas within a column. It would parse any commas as the delimiter. Ex. "foo, foo2", "bar", "baz" It will create 4 columns: "foo", "foo2", "bar", "baz" This should be 3 columns: "foo, foo2", "bar", "baz"
- **Dependencies:** Added 2 additional imports, but they are built in python packages. import csv from io import StringIO
- **Twitter handle:** @jkyamog
- [ ] **Add tests and docs**: 1. added simple unit test test_multiple_items_with_comma

---------

Co-authored-by: Erick Friis
Co-authored-by: Bagatur

fix CommaSeparatedListOutputParser to handle columns that may contain…

7eded32

… commas in it, used built 'csv' lib

dosubot bot added the size:S, Ɑ: core, and 🤖:bug labels Sep 12, 2024


efriis reviewed Sep 17, 2024


Merge branch 'master' into master

91b0325

efriis self-assigned this Sep 17, 2024


baskaryan and others added 2 commits October 24, 2024 08:26

Merge branch 'master' into master

380ed0f

fmt

0f16732


efriis added 2 commits November 1, 2024 15:31

Merge branch 'master' into jkyamog/master

c4e7269

x

c3c690e

dosubot bot added the size:M label and removed the size:S label Nov 1, 2024

efriis enabled auto-merge (squash) November 1, 2024 22:34


efriis merged commit 830cad7 into langchain-ai:master Nov 1, 2024
77 checks passed
