Encoding problem #69

Open
LeonardoSanna opened this issue Dec 10, 2020 · 4 comments


@LeonardoSanna

Hello, I have a pretty large dataset (> 2 TB) split into six files.

I assumed that UTF-8 was the text encoding of the jsonl files. However, there are some characters that are apparently not UTF-8, and this causes R to fail when I specify the encoding.

Not specifying the encoding results in messy full_text output.
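
One way to get a concrete example of the offending characters is to scan a file as raw bytes and report every line that does not decode as UTF-8. The following is a minimal Python sketch, not part of the R workflow described above; the function name `find_invalid_utf8` is hypothetical.

```python
from typing import Iterator, Tuple


def find_invalid_utf8(path: str) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, byte_offset, context) for lines that are not valid UTF-8.

    The file is read in binary mode, so nothing is decoded implicitly and
    a 2 TB input is processed one line at a time.
    """
    with open(path, "rb") as fh:
        for line_no, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first bad byte within the line;
                # keep a few surrounding bytes so the problem can be inspected.
                context = raw[max(0, err.start - 10):err.end + 10]
                yield line_no, err.start, context
```

Running `list(find_invalid_utf8(path))` on one of the six files would show whether the bytes on disk are really invalid, or whether the failure comes from how R interprets them.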

@edsu
Member

edsu commented Dec 10, 2020

Do you have an example?

@LeonardoSanna
Author

LeonardoSanna commented Dec 10, 2020

Update: I found a workaround:

  1. Import the file in R without specifying an encoding
  2. Clean the data
  3. Export to a UTF-8 CSV

The problem was that the function stream_in failed with the error "Error in FUN(X[[i]], ...) : invalid multibyte string, element 1" while streaming the JSON file into a data frame.

Not specifying an encoding while importing solves the issue, though fileEncoding = "UTF-8" must be specified when writing the output file.
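
For comparison, the same import, clean, export steps can be sketched outside R. This Python version uses a different technique from the R workaround: it decodes each line as UTF-8 and substitutes U+FFFD for any invalid byte instead of skipping the encoding declaration. The function name `jsonl_to_utf8_csv` is hypothetical, and `id_str` is assumed to be present alongside `full_text` in each tweet object.

```python
import csv
import json


def jsonl_to_utf8_csv(src, dst, fields=("id_str", "full_text")):
    """Copy selected fields from a JSONL file to a UTF-8 CSV; return the row count."""
    rows = 0
    with open(src, "rb") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(fields)
        for raw in fin:
            if not raw.strip():
                continue  # skip blank lines
            # Invalid bytes become U+FFFD rather than raising an error.
            record = json.loads(raw.decode("utf-8", errors="replace"))
            writer.writerow([record.get(name, "") for name in fields])
            rows += 1
    return rows
```

Any U+FFFD characters in the output mark the places where the input was not valid UTF-8.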

However, there are still some weird characters under "full text", I think because of emojis.

These are Unicode emojis, and I'm OK with that:
RT @ScottAnthonyUSA: <U+26A0><U+FE0F> IT SHOULD BE NOTED that the CDC initially had an embargo placed on CDC testimony. The TRUMP ADMINISTRATION LIFTED…

But what about this? iOS emoji?

RT @StocksUnhinged: $SPY $AAL $APT $EXPE $GOOG $DAL $UAL $BA $LAKE $YUM $CMG $HUM $CI #CDC expected to announce first US case of #Wuhan…
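
If markers such as `<U+26A0><U+FE0F>` end up as literal text in the exported CSV (rather than being only a display artifact of the R console), they can be turned back into the characters they stand for. A minimal Python sketch under that assumption; the function name `restore_codepoints` is hypothetical.

```python
import re

# Matches markers such as <U+26A0> or <U+0001F600>.
_U_MARKER = re.compile(r"<U\+([0-9A-Fa-f]{4,8})>")


def restore_codepoints(text: str) -> str:
    """Replace <U+XXXX> markers with the Unicode characters they name."""
    def _to_char(match):
        cp = int(match.group(1), 16)
        # Leave out-of-range values and lone surrogates untouched.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)
    return _U_MARKER.sub(_to_char, text)
```

Applied to the first tweet above, this turns the leading `<U+26A0><U+FE0F>` back into the warning-sign emoji (U+26A0 followed by the variation selector U+FE0F).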

@edsu
Member

edsu commented Dec 10, 2020

If you can give me a tweet id that will help me test.

@LeonardoSanna
Author

> If you can give me a tweet id that will help me test.

1219771346768596992 is the one with the dollar signs.
