Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n with github data #32
Comments
Hrm, is the repo in question private? Can you share enough of the meltano config for me to try to re-run it locally? (I'm assuming it's trying to pull/load all of the data from that SpecLight project?)
Hi @jwills! I've uploaded the meltano project to https://github.com/robfe/meltano-tutorial-repro, hopefully it behaves the same for you as for me :D Yes, it's configured to pull all of the data from that SpecLight project (this was just a learning exercise for me). But I expect it would repro even if I only tried to take the descriptions of the PRs. I think the newline character that GitHub returns in the PR descriptions is triggering this (and that PR has a trailing newline on its description).
FYI I've found a workaround: use stream mapping on the source to strip out any newline chars:
stream_maps:
  commits:
    commit: __NULL__
    commit_message: "commit.message.replace('\\r', '<br/>').replace('\\n', '<br/>')"
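For anyone wondering what that mapping expression does to a value, here is a minimal Python sketch of the same two chained `replace` calls. The function name `strip_newlines` and the sample string are made up for illustration; they are not part of the tap, the target, or the Meltano config.

```python
def strip_newlines(text, replacement="<br/>"):
    """Replace carriage returns and line feeds separately, as the stream map does."""
    return text.replace("\r", replacement).replace("\n", replacement)


# Hypothetical commit message with a CRLF pair and a trailing LF.
sample = "Fix parser\r\n\r\nHandles trailing newline\n"
cleaned = strip_newlines(sample)
print(cleaned)
```

Note that because `\r` and `\n` are replaced independently, each `\r\n` pair turns into two `<br/>` tokens. If that matters for a column, replacing `'\r\n'` first would be one way around it.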
ah nice, was just about to run this locally, I'll see if there is anything I can do to make the process a bit smoother here.
Hi @jwills, have you had any luck? I can confirm that it doesn't matter what the tap / data source is: if there are newlines present in any of the cells, then this error can occur. However, it's not occurring 100% of the time. I had a 267-row postgres table (with text columns that contained lots of newlines) that loaded fine on the first run with meltano, but once I ran an incremental load, a single row crashed the loader. (This is before attempting the above workaround on that particular project, which involves a LOT of tables and columns :| )
ah no I promptly forgot about this issue, sorry! The workaround stopped working, or you encountered a new use case where it doesn't, or?
The workaround works, but you have to apply it on a column-by-column basis. Also, you have to decide what to use instead of a newline character. My production use case for this loader is to replicate our transactional database into duckdb. Unlike the GitHub example above, where I was just trying things out, the transactional db has a few hundred text columns. These could hold simple strings, json, markdown, html, comments, descriptions etc… It would be ideal if I didn’t have to figure out, case by case, what a good replacement for a newline character is for each column. It’s still possible of course! But if there’s any way I can help solve this at the root instead of working around the issue each (numerous) time it’s encountered, I’d be really keen to help…
yeah that totally makes sense, thanks; I'm trying to think about how I would correctly escape all of the newlines in the general case though, that seems hard. Looking at how we're doing this here: https://github.com/jwills/target-duckdb/blob/main/target_duckdb/db_sync.py#L378 ...have you tried setting a quotechar for the ingest? I'm wondering how much of this that would solve. I may also need to adjust that
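To illustrate the quotechar idea in isolation, here is a rough sketch using only Python's standard `csv` module. It is not the code in `db_sync.py`, and it says nothing about how DuckDB's own CSV reader treats the file; it only shows that when every field is wrapped in a quote character, a CSV parser that honours quoting can carry embedded `\r\n` and `\n` through unchanged. The sample rows are invented.

```python
import csv
import io

# Hypothetical rows; the second one embeds both CRLF and a trailing LF.
rows = [
    ["id", "body"],
    ["1", "line one\r\nline two\n"],
    ["2", "plain"],
]

# Write with every field quoted so embedded newlines stay inside a field.
buf = io.StringIO(newline="")
writer = csv.writer(buf, quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerows(rows)
csv_text = buf.getvalue()

# Read it back; newline="" leaves line endings untouched for the parser.
parsed = list(csv.reader(io.StringIO(csv_text, newline=""), quotechar='"'))
print(parsed == rows)
```

Whether the same holds for the target's ingest step depends on the reader on the DuckDB side being told about the same quote character, which is the open question in this thread.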
Something like this, maybe #34
I am not 100% sure if this will fix it, and I'm not sure how to test it until you release (not so great with python, sorry). But something that I do think will work is: When you're writing out the CSV file, any time you write a string cell, make sure the string's newlines are all Currently I get the same effect in stream mappings with the expression:
I'm encountering the same issue. I tried running the #34 version but nothing changes and I still receive
I'm currently trying to get this running with tap-hubspot. Could setting the
Describe the bug
Certain data causes the error
duckdb.duckdb.InvalidInputException: Invalid Input Error: Wrong NewLine Identifier. Expecting \r\n
To Reproduce
Steps to reproduce the behavior:
I was running through the meltano "getting started" tutorial, but I made two changes: I targeted duckdb instead of postgres, and I selected all github data.
When I run 'meltano run tap-github target-duckdb', I get the error mentioned
Expected behavior
For all the data from github to be uploaded into duckdb (many of the tables do arrive before this error occurs)
Screenshots
The entire contents of the CSV in question are:
The csv is referencing the data from github for robfe/SpecLight#10
Your environment
osx catalina, oh my zsh, with python 3.11.7
Additional context
Add any other context about the problem here.