
Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n with github data #32

Open
robfe opened this issue Jan 8, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@robfe

robfe commented Jan 8, 2024

Describe the bug
Certain data causes the error duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n

To Reproduce
Steps to reproduce the behavior:
I was running through the meltano "getting started" tutorial, but I made two changes: I targeted duckdb instead of postgres, and I selected all GitHub data.

When I run 'meltano run tap-github target-duckdb', I get the error mentioned above.

Expected behavior
For all the data from github to be uploaded into duckdb (many of the tables do arrive before this error occurs)

Screenshots
The entire contents of the CSV in question are:

,[],CONTRIBUTOR,"{""label"": ""robfe:master"", ""ref"": ""master"", ""sha"": ""5432aac028f95d6be47b6cc63e5080175d80b9d8"", ""user"": {""login"": ""robfe"", ""id"": 281453, ""node_id"": ""MDQ6VXNlcjI4MTQ1Mw=="", ""avatar_url"": ""https://avatars.githubusercontent.com/u/281453?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/robfe"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 21383686, ""node_id"": ""MDEwOlJlcG9zaXRvcnkyMTM4MzY4Ng=="", ""name"": ""SpecLight"", ""full_name"": ""robfe/SpecLight"", ""html_url"": ""https://github.com/robfe/SpecLight""}}","dependency
",2016-06-14T17:30:58Z,,https://api.github.com/repos/robfe/SpecLight/issues/10/comments,https://api.github.com/repos/robfe/SpecLight/pulls/10/commits,2016-06-13T23:43:26Z,https://github.com/robfe/SpecLight/pull/10.diff,False,"{""label"": ""FinestV:Fix-Broken-System-Web-WebPages-Dependency"", ""ref"": ""Fix-Broken-System-Web-WebPages-Dependency"", ""sha"": ""3cdc44ca834873174b955bc413812dc0e0782703"", ""user"": {""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 61009600, ""node_id"": ""MDEwOlJlcG9zaXRvcnk2MTAwOTYwMA=="", ""name"": ""SpecLight"", ""full_name"": ""FinestV/SpecLight"", ""html_url"": ""https://github.com/FinestV/SpecLight""}}",https://github.com/robfe/SpecLight/pull/10,73660537,[],False,fb263052304eeb4b1810d9e4d8cd4dc36beb0046,2016-06-14T17:30:58Z,,MDExOlB1bGxSZXF1ZXN0NzM2NjA1Mzc=,10,robfe,https://github.com/robfe/SpecLight/pull/10.patch,,,SpecLight,21383686,[],https://api.github.com/repos/robfe/SpecLight/pulls/comments{/number},https://api.github.com/repos/robfe/SpecLight/pulls/10/comments,closed,https://api.github.com/repos/robfe/SpecLight/statuses/3cdc44ca834873174b955bc413812dc0e0782703,Add dll required to fix the broken System.Web.Webpages reference,2018-02-18T09:36:38Z,https://api.github.com/repos/robfe/SpecLight/pulls/10,"{""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}"


The CSV references the GitHub data for robfe/SpecLight#10
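The failure mode can be reproduced in miniature with Python's csv module (a sketch, not target-duckdb's actual writer): a quoted field containing a bare \n, inside a file whose row terminator is \r\n, is exactly the mix that DuckDB's strict newline detection rejects.

```python
import csv
import io

# Write one row whose field contains a bare "\n" (like the PR
# description "dependency\n"), using "\r\n" as the row terminator.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\r\n")
writer.writerow(["10", "dependency\n", "closed"])
data = buf.getvalue()

# The row terminator is "\r\n", but the quoted field still contains
# a lone "\n" -- the combination DuckDB's newline sniffing trips over.
print(data.endswith("\r\n"))       # True: rows end with \r\n
print('"dependency\n"' in data)    # True: bare \n survives inside quotes
```

Note that `csv.writer` quotes the field automatically because it contains a character from the line terminator, so the bare \n is legal CSV; it is only DuckDB's newline-identifier check that objects.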

Your environment
macOS Catalina, oh-my-zsh, with Python 3.11.7


@robfe robfe added the bug Something isn't working label Jan 8, 2024
@jwills
Owner

jwills commented Jan 8, 2024

Hrm, is the repo in question private? Can you share enough of the meltano config for me to try to re-run it locally? (I'm assuming it's trying to pull/load all of the data from that SpecLight project?)

@robfe
Author

robfe commented Jan 8, 2024

Hi @jwills!

I've uploaded the meltano project to https://github.com/robfe/meltano-tutorial-repro, hopefully it behaves the same for you as for me :D

Yes, it's configured to pull all of the data from that speclight project (this was just a learning exercise for me)

But I expect it would repro even if I only pulled the descriptions of the PRs - I think the newline character that GitHub returns in the PR descriptions is triggering this (and that PR has a trailing newline on its description)

@robfe
Author

robfe commented Jan 8, 2024

FWIW it looks like duckdb can correctly load that CSV (see the whitespace char in column04's cell)


@robfe
Author

robfe commented Jan 9, 2024

FYI I've found a workaround: use stream maps on the source to strip out any newline chars:

      stream_maps:
        commits:
          commit: __NULL__
          commit_message: "commit.message.replace('\\r', '<br/>').replace('\\n', '<br/>')"
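In plain Python, that mapping expression behaves as follows. One quirk worth noting: because \r and \n are replaced independently, a Windows-style \r\n pair becomes two <br/> tags.

```python
def strip_newlines(message: str) -> str:
    # Mirrors the stream-map expression above: replace \r and \n
    # independently with an HTML line break.
    return message.replace("\r", "<br/>").replace("\n", "<br/>")

print(strip_newlines("fix bug\nin parser"))    # fix bug<br/>in parser
print(strip_newlines("fix bug\r\nin parser"))  # fix bug<br/><br/>in parser
```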

@jwills
Owner

jwills commented Jan 9, 2024

ah nice-- was just about to run this locally, I'll see if there is anything I can do to make the process a bit smoother here.

@robfe
Author

robfe commented Feb 9, 2024

Hi @jwills have you had any luck?

I can confirm that it doesn't matter what the tap / data source is - if there are newlines present in any of the cells, then this error occurs

However it's not occurring 100% of the time. I had a 267-row postgres table (with text columns that contained lots of newlines) that loaded fine on the first run with meltano, but once I ran an incremental load, a single row crashed the loader

(This is before attempting the above workaround on that particular project, which involves a LOT of tables and columns :| )

@jwills
Owner

jwills commented Feb 9, 2024

ah no I promptly forgot about this issue, sorry!

The workaround stopped working, or you encountered a new use case where it doesn't, or?

@robfe
Author

robfe commented Feb 10, 2024

The workaround works, but you have to apply it on a column-by-column basis. Also, the outcome is that you have to decide what to use instead of a newline character.

My production use case for this loader is to replicate our transactional database into duckdb. Unlike the GitHub example above where I was trying things out, the transactional db has a few hundred text columns. These could hold simple strings, json, markdown, html, comments, descriptions etc…

it would be ideal if I didn’t have to figure out on a case by case basis what a good replacement for a newline character is for each column

it’s still possible of course! But if there’s any way I can help solve this at the root instead of working around the issue each (numerous) time it’s encountered, I’d be really keen to help…

@jwills
Owner

jwills commented Feb 11, 2024

yeah that totally makes sense, thanks; I'm trying to think about how I would correctly escape all of the newlines in the general case though, that seems hard.

Looking at how we're doing this here: https://github.com/jwills/target-duckdb/blob/main/target_duckdb/db_sync.py#L378

...have you tried setting a quotechar for the ingest? I'm wondering how much of that this would solve. I may also need to adjust that COPY statement below to explicitly configure the newline/delimiter/quotechars.
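For illustration only, spelling out those options might look like the sketch below. `build_copy_sql` is a hypothetical helper, not code from db_sync.py, and the option names follow DuckDB's COPY ... FROM syntax:

```python
def build_copy_sql(table: str, path: str,
                   delimiter: str = ",", quotechar: str = '"') -> str:
    # Hypothetical helper: configures delimiter and quote explicitly
    # instead of relying on DuckDB's CSV sniffer to detect them.
    return (
        f"COPY {table} FROM '{path}' "
        f"(FORMAT CSV, DELIMITER '{delimiter}', QUOTE '{quotechar}')"
    )

print(build_copy_sql("pull_requests", "/tmp/batch.csv"))
# COPY pull_requests FROM '/tmp/batch.csv' (FORMAT CSV, DELIMITER ',', QUOTE '"')
```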

@jwills
Owner

jwills commented Feb 11, 2024

Something like this, maybe #34

@robfe
Author

robfe commented Feb 22, 2024

Something like this, maybe #34

I am not 100% sure if this will fix it, and I'm not sure how to test it until you release it (not so great with Python, sorry)

But something that I do think will work is: when you're writing out the CSV file, any time you write a string cell, make sure the string's newlines are all \r\n, not just \n

Currently I achieve the same effect in stream mappings with the expression: str((self or "").replace("\r\n", "\n").replace("\n", "\r\n"))
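As a plain function, that expression is a two-step normalization: first collapse any existing \r\n pairs to \n (so they aren't doubled), then expand every \n back to \r\n. A sketch, with the nice property that applying it twice changes nothing:

```python
def normalize_newlines(value):
    # Same trick as the stream-map expression: collapse \r\n to \n
    # first so existing pairs aren't doubled, then expand to \r\n.
    return str((value or "").replace("\r\n", "\n").replace("\n", "\r\n"))

print(repr(normalize_newlines("a\nb\r\nc")))  # 'a\r\nb\r\nc'
print(repr(normalize_newlines(None)))         # ''
# Idempotent: a second pass is a no-op.
once = normalize_newlines("x\ny")
print(normalize_newlines(once) == once)       # True
```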

@srpwnd
Contributor

srpwnd commented Mar 11, 2024

I'm encountering the same issue. I tried running the #34 version, but nothing changes and I still receive

     cur.execute("COPY {} FROM '{}' WITH (delim '{}', quote '{}')".format(temp_table, temp_file_csv, self.delimiter, self.quotechar)) cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb
duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb

I'm currently trying to get this running with tap-hubspot.

Could setting the new_line parameter for CSVs possibly help?

@srpwnd
Contributor

srpwnd commented Mar 12, 2024

So I've tested my assumption and got it working in #35.

Could you also try it out on your data @robfe?
