-
-
Notifications
You must be signed in to change notification settings - Fork 30.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
csv writer needs more quoting rules #67230
Comments
The csv module currently implements four quoting rules for dialects: QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC and QUOTE_NONE. These rules treat values of None the same as an empty string, i.e. by outputting two consecutive quotes. I propose the addition of two new rules, QUOTE_NOTNULL and QUOTE_STRINGS. The former behaves like QUOTE_ALL while the later behaves like QUOTE_NONNUMERIC, except that in both cases values of None are output as an empty field. Examples follow. Current behavior (which will remain unchanged) >>> csv.register_dialect('quote_all', quoting=csv.QUOTE_ALL)
>>> csv.writer(sys.stdout, dialect='quote_all').writerow(['foo', None, 42])
"foo","","42"
>>> csv.register_dialect('quote_nonnumeric', quoting=csv.QUOTE_NONNUMERIC)
>>> csv.writer(sys.stdout, dialect='quote_nonnumeric').writerow(['foo', None, 42])
"foo","",42 Proposed behavior >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['foo', None, 42])
"foo",,"42"
>>> csv.register_dialect('quote_strings', quoting=csv.QUOTE_STRINGS)
>>> csv.writer(sys.stdout, dialect='quote_strings').writerow(['foo', None, 42])
"foo",,42 |
As an enhancement, this could be added only to 3.5. The proposal sounds reasonable to me. |
David: That's not a problem for me. Sorry I can't provide real patches, but I'm not in a position to compile (much less test) the C implementation of _csv. I've looked at the code online and below are the changes that I think need to be made. My use cases don't require special handing when reading empty fields, so the only changes I've made are to the code for writers. I did verify that the reader code mostly only checks for QUOTE_NOTNULL when parsing. This means that completely empty fields will continue to load as zero-length strings, not None. I won't stand in the way of anyone wanting to "fix" that for these new rules. typedef enum { static StyleDesc quote_styles[] = { switch (dialect->quoting) {
case QUOTE_NONNUMERIC:
quoted = !PyNumber_Check(field);
break;
case QUOTE_ALL:
quoted = 1;
break;
case QUOTE_STRINGS:
quoted = PyString_Check(field);
break;
case QUOTE_NOTNULL:
quoted = field != Py_None;
break;
default:
quoted = 0;
break;
} " csv.QUOTE_MINIMAL means only when required, for example, when a\n" |
Samwyse, are these suggestions just based on ideas of what could be done or have you encountered real-world CSV data exchanges that couldn't be handled by the CSV module? |
It doesn't look like a difficult change, but is it really needed? I guess my reaction is the same as Raymond's. Are there real-world uses where the current set of quoting styles isn't sufficient? |
Used function PyUnicode_Check instead of PyString_Check |
Yes, it's based on a real-world need. I work for a Fortune 500 company and we have an internal tool that exports CSV files using what I've described as the QUOTE_NOTNULL rules. I need to create similar files for re-importation. Right now, I have to post-process the output of my Python program to get it right. I added in the QUOTE_STRINGS rule for completeness. I think these two new rules would be useful for anyone wanting to create sparse CSV files. |
If I understand correctly, your software needs to distinguish between # wrote ["foo", "", 42, None] with quote_all in effect and # wrote ["foo", None, 42, ""] with quote_nonnull in effect so you in effect want to transmit some type information through a CSV file? |
Skip, I don't have any visibility into how the Java program I'm feeding data into works, I'm just trying to replicate the csv files that it exports as accurately as possible. It has several other quirks, but I can replicate all of them using Dialects; this is the only "feature" I can't. The files I'm looking at have quoted strings and numbers, but there aren't any quoted empty strings. I'm using a DictWriter to create similar csv files, where missing keys are treated as values of None, so I'd like those printed without quotes. If we also want to print empty strings without quotes, that wouldn't impact me at all. Besides my selfish needs, this could be useful to anyone wanting to reduce the save of csv files that have lots of empty fields, but wants to quote all non-empty values. This may be an empty set, I don't know. |
Thanks for the update berker.peksag. I'm still not convinced that the csv module should be modified just so one user (sorry samwyse) can match the input format of someone's Java program. It seems a bit like trying to make the csv module type-sensitive. What happens when someone finds a csv file containing timestamps in a format other than the datetime.datetime object will produce by default? Why is None special as an object where bool(obj) is False? I think the better course here is to either:
If someone else thinks this is something which belongs in Python's csv module, feel free to reopen and assign it to yourself. |
The csv module is already type-sensitive (with QUOTE_NONNUMERIC). I agree, that we shouldn't modify the csv module just for one user and one program. If a standard CVS library in Java (or other popular laguages) differentiates between empty string and null value when read from CSV, it would be a serious argument to support this in Python. Quick search don't find this. |
I was thinking adding a more flexible API like:
But that would require too much change in the csv module (or at least its implementation wouldn't be trivial). I agree that subclassing DictWriter is a much better way to achieve this. |
The correct handling of None values belongs to the csv module. There is a use case to migrate a DB2 database to PostgreSQL. I would like to read in these exports with python and put it to PostgreSQL. But with the csv library I can't read it in correctly. The input is converted to: It is pretty easy to write a correct CSV reader with ANTLR but it's terribly slow. And last but not least: if someone writes a list the reading should the identity. Example: import csv
hello_out_lst = ['Hello', None, 'world']
with open('hello.csv', 'w') as ofh:
writer = csv.writer(ofh, delimiter=',')
writer.writerow(hello_out_lst)
with open('hello.csv', 'r') as ifh:
reader = csv.reader(ifh, delimiter=',')
for row in reader:
hello_in_lst = row
is_equal = hello_out_lst == hello_in_lst
print(f'{hello_out_lst} is equal {hello_in_lst} ? {is_equal}') The result is: |
There is a real requirement for csv to handle an empty field vs a empty string """". csv.QUOTE_NOTNULL could be useful. |
I have another use case. I want to import data into PostgreSQL using the COPY FROM command. This command can read a CSV input and it needs to distinguish empty string from null values. Could we reconsider this issue and the proposed solution? |
Okay, I'll reopen this, at least for the discussion of QUOTE_NONNULL. @erdnaxeli please given an example of how PostgreSQL distinguishes between the empty string and None cases. Is it a quoted empty string vs an empty field? If so, modifying @samwyse's original example, is this what you are after? >>> csv.register_dialect('quote_notnull', quoting=csv.QUOTE_NOTNULL)
>>> csv.writer(sys.stdout, dialect='quote_notnull').writerow(['', None, 42])
"",,"42" ? Can you modify the original two patches to restrict to QUOTE_NONNULL? |
Missed tweaking a couple settings. |
Ugh... s/QUOTE_NONNULL/QUOTE_NOTNULL/ Not, Non, None... Perl would treat them all the same, right? <wink> |
The MS Synapse has a CSV support in its COPY command that would benefit from the proposed csv.QUOTE_NOTNULL as it can be used when preparing data for import. As they reference the same RFC the CSV modul should support, both the proposed modifications would extend the RFC functionality with support for NULL/None values in the data. |
If I can add my pinch of salt. Here we colide data and programming world. In data world None is a value of a type. In Python None is a separate type. |
I certainly understand the desire to interoperate with the MS CSV ecosystem, as that has traditionally been the de facto "spec" for what it means to read and write CSV files. Still, I think the topic would be too involved to deal with properly between now and May 1. I propose limiting the change for now to just the writer, fixing any doc bits which suggest there are any changes to the reader, then going through the usual PEP process to investigate making changes to the semantics of the reader. |
Add two quoting styles for csv dialects. They will help to work with certain databases in particular. Automerge-Triggered-By: GH:merwok
Would someone like to work on a PEP regarding type conversion in the reader? |
* main: (24 commits) pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561) pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545) pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634) pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633) pythongh-102856: Initial implementation of PEP 701 (python#102855) pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589) pythongh-83004: Harden msvcrt further (python#103420) pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626) pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314) pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586) pythongh-103583: Always pass multibyte codec structs as const (python#103588) pythongh-103617: Fix compiler warning in _iomodule.c (python#103618) pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600) pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576) pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613) [Doc] Fix a typo in optparse.rst (python#103504) pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531) pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039) pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569) pythongh-67230: update whatsnew note for csv changes (python#103598) ...
I'll try to summarise the state of this issue: The proposed feature was to add more quoting rules to the
@samwyse remarked in #67230 (comment):
I suggest to close this as completed (the csv writer feature was implemented) and to open up a new issue regarding what to do with the csv reader. Footnotes
|
But it does.
If quoting style is QUOTE_NONNUMERIC, it returns non-quoted field as numeric. I expect the same for None. |
As I already wrote, I suggest to summarise the needed reader changes in a new issue. |
That's not really inferring the type in my book. It's effectively being told by the programmer that unquoted numeric strings are to be returned as numbers. If it was inferring the type, your first example would also spit out a float as the first value. |
Right, and I suggest that QUOTE_NOTNULL is told by the programmer that unquoted empty strings are to be returned as None. |
Can you please suggest this csv reader change in a new issue? |
It is perhaps too late to make this change in 3.12, so we can open a new issue. |
Opened #113732. |
…114816) As @GPHemsley pointed out, #29469 omitted `versionadded` notes for the 2 new items.
…RINGS (pythonGH-114816) As @GPHemsley pointed out, pythonGH-29469 omitted `versionadded` notes for the 2 new items. (cherry picked from commit 586057e) Co-authored-by: Skip Montanaro <[email protected]>
…TRINGS (GH-114816) (#114840) As @GPHemsley pointed out, GH-29469 omitted `versionadded` notes for the 2 new items. (cherry picked from commit 586057e) Co-authored-by: Skip Montanaro <[email protected]>
…RINGS (python#114816) As @GPHemsley pointed out, python#29469 omitted `versionadded` notes for the 2 new items.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: