
Fewer Tweets in CSV than hydrated #88

Closed
Tom1316 opened this issue Apr 29, 2021 · 12 comments

Comments

@Tom1316

Tom1316 commented Apr 29, 2021

After hydration, I end up with Tweet Ids Read: 8,145,625/Tweets Hydrated: 7,583,036 which is to be expected. However, when I opened the CSV in R, I only had 2,586,793 observations. Why is there such a large discrepancy? Does this mean over half the Tweets are not being converted?

@edsu
Member

edsu commented Apr 29, 2021

Hmm, that's not good. Are you able to share the CSV privately with me at [email protected] so I can take a look? Also, can you point me at the tweet ids?

@Tom1316
Author

Tom1316 commented Apr 29, 2021

Sure thing; however, the CSV file is 1.5-ish GB, so I'll have to upload it to Google Drive first. Once this is done (in about an hour) I'll send you a link. The Tweet IDs were taken from the following database: https://zenodo.org/record/4726282. I collected every Tweet ID between 7-14 April as .txt files. I then concatenated the individual text files into one text file and ran it through the Hydrator. The JSONL is about 35 GB, which is large compared to the CSV. I'm trying to look at ways to load the JSONL into R, but I'm very new to computational work and am struggling to figure out what to do. Thank you for your help!
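[Editor's note: the concatenation step described above can be sketched in Python. The filename pattern and output name here are hypothetical; any shell `cat` would do the same job, but this version also skips blank lines so the Hydrator sees exactly one ID per line.]

```python
import glob

def concat_id_files(pattern, out_path):
    """Concatenate one-ID-per-line .txt files matching `pattern`
    into a single file, skipping blank lines along the way.
    Returns the number of IDs written."""
    count = 0
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:  # drop blank lines between files
                        out.write(line + "\n")
                        count += 1
    return count

# e.g. concat_id_files("ids-2021-04-*.txt", "all_ids.txt")
```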

@edsu
Member

edsu commented Apr 29, 2021

I can confirm that there are 2,586,794 rows in the CSV you shared. Could you share the tweet id file so I can try it with the Hydrator too? If you like, I can also try hydrating with twarc, which can be a bit more reliable for large datasets.

@Tom1316
Author

Tom1316 commented Apr 30, 2021

OK, Ed, I'll email the file to you shortly. Does this mean the JSON produced by Hydrator is also corrupt, or only the CSV? I'm hoping to use this dataset for my dissertation, so I would appreciate it if you could run it in twarc and send me the CSV. Additionally, do you know why the Tweets were dropped during conversion?

@edsu
Member

edsu commented Apr 30, 2021

Ok, let me know when you send the tweet ids.

I'm not sure if the JSON is corrupt, but it's possible. One way to check would be to run a little program over it and count the lines that have a valid JSON object on them. If you have Python installed and your JSON file is called for example tweets.jsonl you could run a program like this.

import json

# Count how many lines hold a valid JSON object; skip corrupt lines
# instead of crashing on the first one.
valid = 0
invalid = 0
for line in open('tweets.jsonl'):
    try:
        json.loads(line)
        valid += 1
    except json.JSONDecodeError:
        invalid += 1

print(valid, invalid)

@Tom1316
Author

Tom1316 commented Apr 30, 2021 via email

@edsu
Member

edsu commented Apr 30, 2021

I was able to download your hydrated json from Google Drive (thanks!). The good news is that it looks intact, with 7,583,036 valid JSON objects. I guess something must have gone wrong when hydrator tried to write the data. Perhaps it was interrupted? It would probably have taken a fair amount of time.

If you are interested you could convert the JSON to CSV using the json2csv.py utility. Since you are in a pinch with the research I could run this for you and send you the results.
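[Editor's note: the conversion described above can be sketched as follows. This is not the actual json2csv.py utility, which extracts many more columns; the field names assumed here (`id_str`, `created_at`, `full_text`) are common top-level keys in Twitter v1.1 tweet JSON.]

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path,
                 fields=("id_str", "created_at", "full_text")):
    """Minimal sketch: write one CSV row per tweet, keeping only
    the top-level keys named in `fields`."""
    with open(jsonl_path) as infile, open(csv_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(fields)  # header row
        for line in infile:
            tweet = json.loads(line)
            # missing keys become empty cells rather than raising KeyError
            writer.writerow([tweet.get(f, "") for f in fields])
```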

@Tom1316
Author

Tom1316 commented Apr 30, 2021 via email

@edsu
Member

edsu commented Apr 30, 2021

Ok, I will respond with a private email with the link to the CSV. Let me know if you are able to read this with R. It will be a large DataFrame, so depending on your setup/resources it might make sense to subset just the data you need before loading it. The csvcut utility from csvkit might provide a nice way to do that.
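[Editor's note: if csvkit isn't installed, the same column subsetting can be done with Python's standard csv module, which streams the file row by row and so copes with a large CSV. The filenames and column names below are hypothetical.]

```python
import csv

def subset_columns(in_path, out_path, keep):
    """Stream a CSV and write out only the columns named in `keep`,
    so the result is small enough to load comfortably in R."""
    with open(in_path, newline="") as infile, \
         open(out_path, "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=keep)
        writer.writeheader()
        for row in reader:  # one row at a time; never loads the whole file
            writer.writerow({k: row[k] for k in keep})

# e.g. subset_columns("tweets.csv", "subset.csv", ["id", "created_at", "text"])
```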

@Tom1316
Author

Tom1316 commented Apr 30, 2021

It worked perfectly and the dataframe has fully loaded. Thanks for the assist. I'll take your advice and probably look to reduce the dataframe to make it easy to manipulate.

@edsu
Member

edsu commented Apr 30, 2021

Since we already have #56 and #51 covering the problems with knowing how long CSV generation takes, can we close this ticket?

@Tom1316
Author

Tom1316 commented May 1, 2021

Yes, thank you for your help.

@edsu edsu closed this as completed Sep 20, 2021