
Fewer Tweets in CSV than hydrated #88

Closed
Tom1316 opened this issue Apr 29, 2021 · 12 comments

Comments

@Tom1316

Tom1316 commented Apr 29, 2021

After hydration, I end up with Tweet Ids Read: 8,145,625/Tweets Hydrated: 7,583,036 which is to be expected. However, when I opened the CSV in R, I only had 2,586,793 observations. Why is there such a large discrepancy? Does this mean over half the Tweets are not being converted?

@edsu
Member

edsu commented Apr 29, 2021

Hmm, that's not good. Are you able to share the CSV privately with me at [email protected] so I can take a look? Also, can you point me at the tweet ids?

@Tom1316
Author

Tom1316 commented Apr 29, 2021

Sure thing; however, the CSV file is 1.5-ish GB, so I'll have to upload it to Google Drive first. Once this is done (in about an hour) I'll send you a link. The Tweet IDs were taken from the following database: https://zenodo.org/record/4726282. I collected every Tweet ID between 7-14 April as .txt files. I then concatenated the individual text files into one text file and ran it through the Hydrator. The JSONL is about 35 GB, which is large compared to the CSV. I'm trying to look at ways to load the JSONL into R, but I'm very new to computational work and am struggling to figure out what to do. Thank you for your help!
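[Editor's note: the concatenation step described above can be sketched in Python. The filename pattern and output name here are hypothetical; any shell `cat` would do the same job, but this version also skips blank lines so the Hydrator sees exactly one ID per line.]

```python
import glob

def concat_id_files(pattern, out_path):
    """Concatenate one-ID-per-line .txt files matching `pattern`
    into a single file, skipping blank lines along the way.
    Returns the number of IDs written."""
    count = 0
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:  # drop blank lines between files
                        out.write(line + "\n")
                        count += 1
    return count

# e.g. concat_id_files("ids-2021-04-*.txt", "all_ids.txt")
```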

@edsu
Member

edsu commented Apr 29, 2021

I can confirm that there are 2,586,794 rows in the CSV you shared. Could you share the tweet id file so I can try it with the Hydrator too? If you like, I can also try hydrating with twarc, which can be a bit more reliable for large datasets.

@Tom1316
Author

Tom1316 commented Apr 30, 2021

OK, Ed, I'll email the file to you shortly. Does this mean the JSON produced by Hydrator is also corrupt, or only the CSV? I'm hoping to use this dataset for my dissertation, so I would appreciate it if you could run it in twarc and send me the CSV. Additionally, do you know why the Tweets were dropped during conversion?

@edsu
Member

edsu commented Apr 30, 2021

Ok, let me know when you send the tweet ids.

I'm not sure if the JSON is corrupt, but it's possible. One way to check would be to run a little program over it and count the lines that have a valid JSON object on them. If you have Python installed and your JSON file is called for example tweets.jsonl you could run a program like this.

import json

# Count how many lines hold a valid JSON object; skip corrupt lines
# instead of crashing on the first one.
valid = 0
invalid = 0
for line in open('tweets.jsonl'):
    try:
        json.loads(line)
        valid += 1
    except json.JSONDecodeError:
        invalid += 1

print(valid, invalid)

@Tom1316
Author

Tom1316 commented Apr 30, 2021 via email

@edsu
Member

edsu commented Apr 30, 2021

I was able to download your hydrated json from Google Drive (thanks!). The good news is that it looks intact, with 7,583,036 valid JSON objects. I guess something must have gone wrong when hydrator tried to write the data. Perhaps it was interrupted? It would probably have taken a fair amount of time.

If you are interested you could convert the JSON to CSV using the json2csv.py utility. Since you are in a pinch with the research I could run this for you and send you the results.
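[Editor's note: the conversion described above can be sketched as follows. This is not the actual json2csv.py utility, which extracts many more columns; the field names assumed here (`id_str`, `created_at`, `full_text`) are common top-level keys in Twitter v1.1 tweet JSON.]

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path,
                 fields=("id_str", "created_at", "full_text")):
    """Minimal sketch: write one CSV row per tweet, keeping only
    the top-level keys named in `fields`."""
    with open(jsonl_path) as infile, open(csv_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(fields)  # header row
        for line in infile:
            tweet = json.loads(line)
            # missing keys become empty cells rather than raising KeyError
            writer.writerow([tweet.get(f, "") for f in fields])
```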

@Tom1316
Author

Tom1316 commented Apr 30, 2021 via email

@edsu
Member

edsu commented Apr 30, 2021

Ok, I will respond with a private email with the link to the CSV. Let me know if you are able to read this with R. It will be a large DataFrame, so depending on your setup/resources it might make sense to subset just the data you need before loading it. The csvcut utility from csvkit might provide a nice way to do that.
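[Editor's note: if csvkit isn't installed, the same column subsetting can be done with Python's standard csv module, which streams the file row by row and so copes with a large CSV. The filenames and column names below are hypothetical.]

```python
import csv

def subset_columns(in_path, out_path, keep):
    """Stream a CSV and write out only the columns named in `keep`,
    so the result is small enough to load comfortably in R."""
    with open(in_path, newline="") as infile, \
         open(out_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=keep)
        writer.writeheader()
        for row in reader:  # one row at a time; never loads the whole file
            writer.writerow({k: row[k] for k in keep})

# e.g. subset_columns("tweets.csv", "subset.csv", ["id", "created_at", "text"])
```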

@Tom1316
Author

Tom1316 commented Apr 30, 2021

It worked perfectly and the dataframe has fully loaded. Thanks for the assist. I'll take your advice and probably look to reduce the dataframe to make it easy to manipulate.

@edsu
Member

edsu commented Apr 30, 2021

Since we already have #56 and #51 covering the problems with knowing how long CSV generation takes, can we close this ticket?

@Tom1316
Author

Tom1316 commented May 1, 2021

Yes, thank you for your help.

@edsu edsu closed this as completed Sep 20, 2021