
Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n with github data #32

Open
robfe opened this issue Jan 8, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@robfe

robfe commented Jan 8, 2024

Describe the bug
Certain data causes the error duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n

To Reproduce
Steps to reproduce the behavior:
I was running through the meltano "getting started" tutorial, but I made two changes: I targeted duckdb instead of postgres, and I selected all GitHub data.

When I run 'meltano run tap-github target-duckdb', I get the error mentioned above.

Expected behavior
For all the data from github to be uploaded into duckdb (many of the tables do arrive before this error occurs)

Screenshots
The entire contents of the CSV in question are:

,[],CONTRIBUTOR,"{""label"": ""robfe:master"", ""ref"": ""master"", ""sha"": ""5432aac028f95d6be47b6cc63e5080175d80b9d8"", ""user"": {""login"": ""robfe"", ""id"": 281453, ""node_id"": ""MDQ6VXNlcjI4MTQ1Mw=="", ""avatar_url"": ""https://avatars.githubusercontent.com/u/281453?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/robfe"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 21383686, ""node_id"": ""MDEwOlJlcG9zaXRvcnkyMTM4MzY4Ng=="", ""name"": ""SpecLight"", ""full_name"": ""robfe/SpecLight"", ""html_url"": ""https://github.com/robfe/SpecLight""}}","dependency
",2016-06-14T17:30:58Z,,https://api.github.com/repos/robfe/SpecLight/issues/10/comments,https://api.github.com/repos/robfe/SpecLight/pulls/10/commits,2016-06-13T23:43:26Z,https://github.com/robfe/SpecLight/pull/10.diff,False,"{""label"": ""FinestV:Fix-Broken-System-Web-WebPages-Dependency"", ""ref"": ""Fix-Broken-System-Web-WebPages-Dependency"", ""sha"": ""3cdc44ca834873174b955bc413812dc0e0782703"", ""user"": {""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}, ""repo"": {""id"": 61009600, ""node_id"": ""MDEwOlJlcG9zaXRvcnk2MTAwOTYwMA=="", ""name"": ""SpecLight"", ""full_name"": ""FinestV/SpecLight"", ""html_url"": ""https://github.com/FinestV/SpecLight""}}",https://github.com/robfe/SpecLight/pull/10,73660537,[],False,fb263052304eeb4b1810d9e4d8cd4dc36beb0046,2016-06-14T17:30:58Z,,MDExOlB1bGxSZXF1ZXN0NzM2NjA1Mzc=,10,robfe,https://github.com/robfe/SpecLight/pull/10.patch,,,SpecLight,21383686,[],https://api.github.com/repos/robfe/SpecLight/pulls/comments{/number},https://api.github.com/repos/robfe/SpecLight/pulls/10/comments,closed,https://api.github.com/repos/robfe/SpecLight/statuses/3cdc44ca834873174b955bc413812dc0e0782703,Add dll required to fix the broken System.Web.Webpages reference,2018-02-18T09:36:38Z,https://api.github.com/repos/robfe/SpecLight/pulls/10,"{""login"": ""FinestV"", ""id"": 13340489, ""node_id"": ""MDQ6VXNlcjEzMzQwNDg5"", ""avatar_url"": ""https://avatars.githubusercontent.com/u/13340489?v=4"", ""gravatar_id"": """", ""html_url"": ""https://github.com/FinestV"", ""type"": ""User"", ""site_admin"": false}"


The CSV references the GitHub data for robfe/SpecLight#10
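The failure mode can be reproduced in miniature with Python's csv module (a sketch, not target-duckdb's actual writer): a quoted field containing a bare \n, inside a file whose row terminator is \r\n, is exactly the mix that DuckDB's strict newline detection rejects.

```python
import csv
import io

# Write one row whose field contains a bare "\n" (like the PR
# description "dependency\n"), using "\r\n" as the row terminator.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\r\n")
writer.writerow(["10", "dependency\n", "closed"])
data = buf.getvalue()

# The row terminator is "\r\n", but the quoted field still contains
# a lone "\n" -- the combination DuckDB's newline sniffing trips over.
print(data.endswith("\r\n"))       # True: rows end with \r\n
print('"dependency\n"' in data)    # True: bare \n survives inside quotes
```

Note that `csv.writer` quotes the field automatically because it contains a character from the line terminator, so the bare \n is legal CSV; it is only DuckDB's newline-identifier check that objects.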

Your environment
macOS Catalina, oh-my-zsh, with Python 3.11.7


@robfe robfe added the bug Something isn't working label Jan 8, 2024
@jwills
Owner

jwills commented Jan 8, 2024

Hrm, is the repo in question private? Can you share enough of the meltano config for me to try to re-run it locally? (I'm assuming it's trying to pull/load all of the data from that SpecLight project?)

@robfe
Author

robfe commented Jan 8, 2024

Hi @jwills!

I've uploaded the meltano project to https://github.com/robfe/meltano-tutorial-repro, hopefully it behaves the same for you as for me :D

Yes, it's configured to pull all of the data from that speclight project (this was just a learning exercise for me)

But I expect it would repro even if I only pulled the descriptions of the PRs - I think the newline character that GitHub returns in the PR descriptions is triggering this (and that PR has a trailing newline on its description)

@robfe
Author

robfe commented Jan 8, 2024

FWIW it looks like duckdb can correctly load that CSV (see the whitespace char in column04's cell)


@robfe
Author

robfe commented Jan 9, 2024

FYI I've found a workaround: use stream maps on the source to strip out any newline chars:

      stream_maps:
        commits:
          commit: __NULL__
          commit_message: "commit.message.replace('\\r', '<br/>').replace('\\n', '<br/>')"
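In plain Python, that mapping expression behaves as follows. One quirk worth noting: because \r and \n are replaced independently, a Windows-style \r\n pair becomes two <br/> tags.

```python
def strip_newlines(message: str) -> str:
    # Mirrors the stream-map expression above: replace \r and \n
    # independently with an HTML line break.
    return message.replace("\r", "<br/>").replace("\n", "<br/>")

print(strip_newlines("fix bug\nin parser"))    # fix bug<br/>in parser
print(strip_newlines("fix bug\r\nin parser"))  # fix bug<br/><br/>in parser
```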

@jwills
Owner

jwills commented Jan 9, 2024

ah nice-- was just about to run this locally, I'll see if there is anything I can do to make the process a bit smoother here.

@robfe
Author

robfe commented Feb 9, 2024

Hi @jwills have you had any luck?

I can confirm that it doesn't matter what the tap / data source is - if there are newlines present in any of the cells, then this error occurs

However it's not occurring 100% of the time. I had a 267-row postgres table (with text columns that contained lots of newlines) that loaded fine on the first run with meltano, but once I ran an incremental load, a single row crashed the loader

(This is before attempting the above workaround on that particular project, which involves a LOT of tables and columns :| )

@jwills
Owner

jwills commented Feb 9, 2024

ah no I promptly forgot about this issue, sorry!

The workaround stopped working, or you encountered a new use case where it doesn't, or?

@robfe
Author

robfe commented Feb 10, 2024

The workaround works, but you have to apply it on a column-by-column basis. Also, the outcome is that you have to decide what to use instead of a newline character.

My production use case for this loader is to replicate our transactional database into duckdb. Unlike the GitHub example above where I was trying things out, the transactional db has a few hundred text columns. These could hold simple strings, json, markdown, html, comments, descriptions etc…

it would be ideal if I didn’t have to figure out on a case by case basis what a good replacement for a newline character is for each column

it’s still possible of course! But if there’s any way I can help solve this at the root instead of working around the issue each (numerous) time it’s encountered, I’d be really keen to help…

@jwills
Owner

jwills commented Feb 11, 2024

yeah that totally makes sense, thanks; I'm trying to think about how I would correctly escape all of the newlines in the general case though, that seems hard.

Looking at how we're doing this here: https://github.com/jwills/target-duckdb/blob/main/target_duckdb/db_sync.py#L378

...have you tried setting a quotechar for the ingest? I'm wondering how much of that this would solve. I may also need to adjust that COPY statement below to explicitly configure the newline/delimiter/quotechars.
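For illustration only, spelling out those options might look like the sketch below. `build_copy_sql` is a hypothetical helper, not code from db_sync.py, and the option names follow DuckDB's COPY ... FROM syntax:

```python
def build_copy_sql(table: str, path: str,
                   delimiter: str = ",", quotechar: str = '"') -> str:
    # Hypothetical helper: configures delimiter and quote explicitly
    # instead of relying on DuckDB's CSV sniffer to detect them.
    return (
        f"COPY {table} FROM '{path}' "
        f"(FORMAT CSV, DELIMITER '{delimiter}', QUOTE '{quotechar}')"
    )

print(build_copy_sql("pull_requests", "/tmp/batch.csv"))
# COPY pull_requests FROM '/tmp/batch.csv' (FORMAT CSV, DELIMITER ',', QUOTE '"')
```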

@jwills
Owner

jwills commented Feb 11, 2024

Something like this, maybe #34

@robfe
Author

robfe commented Feb 22, 2024

Something like this, maybe #34

I am not 100% sure if this will fix it, and I'm not sure how to test it until you release it (not so great with Python, sorry)

But something that I do think will work is: when you're writing out the CSV file, any time you write a string cell, make sure the string's newlines are all \r\n, not just \n

Currently I achieve the same effect in stream mappings with the expression: str((self or "").replace("\r\n", "\n").replace("\n", "\r\n"))
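As a plain function, that expression is a two-step normalization: first collapse any existing \r\n pairs to \n (so they aren't doubled), then expand every \n back to \r\n. A sketch, with the nice property that applying it twice changes nothing:

```python
def normalize_newlines(value):
    # Same trick as the stream-map expression: collapse \r\n to \n
    # first so existing pairs aren't doubled, then expand to \r\n.
    return str((value or "").replace("\r\n", "\n").replace("\n", "\r\n"))

print(repr(normalize_newlines("a\nb\r\nc")))  # 'a\r\nb\r\nc'
print(repr(normalize_newlines(None)))         # ''
# Idempotent: a second pass is a no-op.
once = normalize_newlines("x\ny")
print(normalize_newlines(once) == once)       # True
```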

@srpwnd
Contributor

srpwnd commented Mar 11, 2024

I'm encountering the same issue. I tried running the #34 version, but nothing changes and I still receive

     cur.execute("COPY {} FROM '{}' WITH (delim '{}', quote '{}')".format(temp_table, temp_file_csv, self.delimiter, self.quotechar)) cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb
duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n cmd_type=elb consumer=True name=target-duckdb producer=False stdio=stderr string_id=target-duckdb

I'm currently trying to get this running with tap-hubspot.

Could setting the new_line parameter for CSVs possibly help?

@srpwnd
Contributor

srpwnd commented Mar 12, 2024

So I've tested my assumption and got it working in #35.

Could you also try it out on your data @robfe?
