fix: Workaround for Pandas.DataFrame.to_csv bug #28755

Merged
merged 1 commit into from
Jun 13, 2024

Conversation

john-bodley
Member

@john-bodley john-bodley commented May 29, 2024

SUMMARY

Per pandas-dev/pandas#47871, in Python 3.10+ (now the minimum Python version Superset supports) there is a Pandas bug where DataFrame.to_csv unnecessarily requires an escapechar to be specified, even though (per here) special values have already been escaped.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Updated unit tests.

Also tested locally: exporting a CSV file containing escaped special characters succeeds in Python 3.9, yet in Python 3.10 it throws the following error,

_csv.Error: need to escape, but no escapechar set

unless an escape character is specified.
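To illustrate what this error means at the level of Python's csv module (which pandas' to_csv writes through), here is a minimal standard-library sketch. It is not a reproduction of the Pandas bug itself: it assumes QUOTE_NONE purely to force the writer into a state where it has to escape something, and the `write_row` helper and sample value are made up for the example.

```python
import csv
import io


def write_row(value, escapechar=None):
    """Write a single-field row with quoting disabled and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, escapechar=escapechar)
    writer.writerow([value])
    return buffer.getvalue()


# Without an escape character the writer cannot emit the embedded delimiter.
try:
    write_row("a,b")
    error_message = None
except csv.Error as exc:
    error_message = str(exc)

# With an escape character the delimiter is escaped instead of raising.
escaped = write_row("a,b", escapechar="\\")

print(error_message)
print(repr(escaped))
```

The same escapechar idea is what the workaround relies on when calling DataFrame.to_csv.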

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley force-pushed the john-bodley--fix-pandas-dataframe-to-csv branch from 64dec4b to ac5034e Compare May 29, 2024 16:17
@@ -49,10 +49,6 @@ def escape_value(value: str) -> str:
    is_negative_number = negative_number_re.match(value) is not None

    if needs_escaping and not is_negative_number:
        # Escape pipe to be extra safe as this
Member Author

This is now handled via the escapechar parameter of the pandas.DataFrame.to_csv method.
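For readers without the full diff, here is a rough sketch of what a value-escaping helper could look like once the manual pipe replacement is gone. It assumes the single-quote prefix is the step that remains. The two regular expressions are hypothetical stand-ins, since the project's real patterns are not shown in this hunk.

```python
import re

# Hypothetical patterns for illustration only; the real ones may differ.
NEGATIVE_NUMBER_RE = re.compile(r"^-[0-9.]+$")
PROBLEMATIC_START_RE = re.compile(r"^\s*[-@+|=%]")


def escape_value(value: str) -> str:
    """Neutralize values that spreadsheet software could treat as formulas."""
    needs_escaping = PROBLEMATIC_START_RE.match(value) is not None
    is_negative_number = NEGATIVE_NUMBER_RE.match(value) is not None

    if needs_escaping and not is_negative_number:
        # Prefix with a single quote so the cell is read as plain text.
        return "'" + value

    return value


print(escape_value("=func()"))
print(escape_value("-10"))
```

Plain negative numbers are left untouched, which matches the "-10" rows in the test data.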

@john-bodley john-bodley force-pushed the john-bodley--fix-pandas-dataframe-to-csv branch 2 times, most recently from 0bc246b to 61123b1 Compare May 30, 2024 18:45
csv_str = "\n".join([",".join(row) for row in csv_rows])

df = pd.read_csv(io.StringIO(csv_str))
df = pd.DataFrame(
Member Author

I needed to create the Pandas DataFrame from data (as opposed to reading it from a CSV file) in order to trigger the

_csv.Error: need to escape, but no escapechar set

error. This resulted in different output and thus the tests needed to be updated.
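As a small illustration of why the two construction paths can produce different cell values (and hence different expected output), consider the sketch below. The column name and values are made up, and it does not reproduce the error; it only shows that read_csv parses and type-infers its input whereas a DataFrame built from data keeps the strings as given.

```python
import io

import pandas as pd

csv_text = "value\n-10\n20"

# Parsing CSV text: pandas infers a numeric dtype for the column.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Building from data: the strings are kept exactly as supplied.
from_data = pd.DataFrame(data={"value": ["-10", "20"]})

print(from_csv["value"].tolist())
print(from_data["value"].tolist())
```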

df = pd.DataFrame(
    data={
        "value": [
            "a",
Member Author

I'm really not sure why there were previously two columns of data. One should suffice.

["'=func()"],
["-10"],
[r"'=cmd\\|' /C calc'!A0"],
['"\'""""=b"'],
Member Author

@john-bodley john-bodley May 30, 2024

The previous comment,

# pandas seems to be removing the leading ""

is no longer valid when creating the Pandas DataFrame from data (as opposed to a CSV file). The extra " characters are present because " is the quote character, so each embedded " is doubled when the field is quoted.
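The doubling can be seen with the standard-library csv writer alone. In this sketch the input value is an assumption, inferred from the expected row above: a single quote, two double quotes, then =b.

```python
import csv
import io

buffer = io.StringIO()

# Default dialect: fields containing the quote character are wrapped in
# quotes, and each embedded quote character is doubled.
csv.writer(buffer).writerow(["'\"\"=b"])

encoded = buffer.getvalue()
print(encoded)
```

The two embedded quotes become four, and the field gains one enclosing quote on each side.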

["col_a"],
["'=func()"],
["-10"],
[r"'=cmd\\|' /C calc'!A0"],
Member Author

There's now an extra \. This is due to how the Pandas DataFrame is composed, as opposed to the inclusion of an escape character.

@john-bodley john-bodley marked this pull request as ready for review May 30, 2024 18:55
@dosubot dosubot bot added the data:csv Related to import/export of CSVs label May 30, 2024
@john-bodley john-bodley force-pushed the john-bodley--fix-pandas-dataframe-to-csv branch from 61123b1 to 22a28d5 Compare June 12, 2024 16:27
@john-bodley john-bodley force-pushed the john-bodley--fix-pandas-dataframe-to-csv branch from 22a28d5 to ff488fd Compare June 12, 2024 17:41
csv_str = "\n".join([",".join(row) for row in csv_rows])

df = pd.read_csv(io.StringIO(csv_str))
df = pd.DataFrame(
Member Author

Note that these are unit tests rather than integration tests (they have no database dependency), and thus I opted to move the file.

@john-bodley john-bodley merged commit 6b016da into master Jun 13, 2024
33 checks passed
@john-bodley john-bodley deleted the john-bodley--fix-pandas-dataframe-to-csv branch June 13, 2024 15:54
@michael-s-molina michael-s-molina added the v4.0 Label added by the release manager to track PRs to be included in the 4.0 branch label Jun 14, 2024
michael-s-molina pushed a commit that referenced this pull request Jun 14, 2024
@mistercrunch mistercrunch added the 🍒 4.0.2 and 🏷️ bot (A label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels) labels Jul 24, 2024
Labels
  • 🏷️ bot (A label used by `supersetbot` to keep track of which PRs were auto-tagged with release labels)
  • data:csv (Related to import/export of CSVs)
  • size/M
  • v4.0 (Label added by the release manager to track PRs to be included in the 4.0 branch)
  • 🍒 4.0.2
  • 🚢 4.1.0

3 participants