Mismatching counts #1
Comments
Do you have a sample of what your dataframe contains? How is it generated in the first place? It's hard to say, or to compare it to the code, otherwise.
Sure, I tried with a random sample using: Here are the results (for twarc, only those that appear with a count of 2):
@luisignaciomenendez I think @igorbrigadir means where
    # Using the generator to create a new dataframe.
    a = pd.DataFrame(list(hash_retrieve(df)),
                     columns=['hashtag', 'id'])
Is
Yes, exactly. I converted it using twarc2 and then loaded it with pandas.
I'm a little bit confused by your code, but I do think you've found a difference in how twarc-hashtags works and what is in the
It looks like twarc-csv includes not only the tweets that were collected but also the tweets that those tweets reference (replies and quotes), the so-called "includes". Personally I would expect to only get hashtags for the tweets that were collected, not the tweets that were referenced. But I guess having an
I wonder if users of twarc-csv understand this behavior when using the data, though ...
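To make the "collected tweets versus includes" distinction concrete, here is a minimal sketch. It assumes a response page shaped like the Twitter API v2 JSON, with collected tweets under `data` and referenced tweets under `includes.tweets`. The function names `tags_in` and `count_tags` and the toy `page` are hypothetical, made up for illustration; this is not the code of twarc-hashtags or twarc-csv.

```python
from collections import Counter


def tags_in(tweet):
    """Return the hashtag strings listed in a tweet's entities, if any."""
    return [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]


def count_tags(page, with_includes=False):
    """Count hashtags in the collected tweets, optionally adding the includes."""
    tweets = list(page.get("data", []))
    if with_includes:
        # Referenced tweets (replies, quotes, retweet originals) live here.
        tweets += page.get("includes", {}).get("tweets", [])
    counts = Counter()
    for tweet in tweets:
        counts.update(tags_in(tweet))
    return counts


# Toy page: one collected tweet and one referenced tweet (assumed layout).
page = {
    "data": [{"id": "10", "entities": {"hashtags": [{"tag": "twarc"}]}}],
    "includes": {
        "tweets": [
            {"id": "9", "entities": {"hashtags": [{"tag": "twarc"}, {"tag": "docnow"}]}}
        ]
    },
}

collected_only = count_tags(page)
with_referenced = count_tags(page, with_includes=True)
```

With the toy page, counting the includes as well raises the count for `twarc` and introduces `docnow`, which is the kind of inflation described above.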
I think I found what the problem is: it's retweets. twarc-csv processes retweets so that they match what you would expect to find, using the full text of the tweet, not what the JSON actually contains. So, for a retweet in the JSON like this:
The retweet is truncated, so only 1 hashtag is counted by twarc-hashtags: While the twarc-csv code will dig into the referenced tweet,
So it will count 2 hashtags. A second source of variation is that twarc-hashtags ignores case, while your code is case sensitive, so
These aren't mistakes or bugs as such, they're just different things that we should be aware of and decide to count one way or another. Personally, I'm inclined to edit twarc-hashtags to count the retweeted hashtags the same as twarc-csv, and keep it ignoring case, the same as the Twitter UI. This does mean adding a bit more code, but I think it's less surprising to users, because if someone were to manually verify a count, the numbers should match.
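The two sources of variation described above can be sketched roughly as follows. This is an illustration only, assuming Twitter API v2 style tweets where a retweet carries a `referenced_tweets` entry of type `retweeted` and hashtags sit in `entities.hashtags[].tag`. The helper names and the `included_by_id` lookup are hypothetical and are not taken from twarc-hashtags or twarc-csv.

```python
from collections import Counter


def hashtags_for(tweet, included_by_id):
    """Return lowercased hashtags, reading the original tweet for retweets."""
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") == "retweeted" and ref.get("id") in included_by_id:
            # The retweet text may be truncated, so use the original tweet.
            tweet = included_by_id[ref["id"]]
            break
    tags = tweet.get("entities", {}).get("hashtags", [])
    # Lowercasing makes "#Archives" and "#archives" count as one hashtag.
    return [t["tag"].lower() for t in tags]


def count_hashtags(tweets, included_by_id):
    """Count hashtags over the collected tweets only."""
    counts = Counter()
    for tweet in tweets:
        counts.update(hashtags_for(tweet, included_by_id))
    return counts


# Toy data: the truncated retweet lists one hashtag, the original lists two.
original = {
    "id": "1",
    "entities": {"hashtags": [{"tag": "Archives"}, {"tag": "OpenData"}]},
}
retweet = {
    "id": "2",
    "referenced_tweets": [{"type": "retweeted", "id": "1"}],
    "entities": {"hashtags": [{"tag": "archives"}]},
}

counts = count_hashtags([retweet], {"1": original})
truncated_only = count_hashtags([retweet], {})
```

When the original tweet is available, the retweet contributes both hashtags; when it is not, only the single hashtag from the truncated retweet is counted. Quotes and replies are deliberately left untouched in this sketch, since whether to count them is still an open question in the thread.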
I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies? That was the source of one discrepancy at least. I thought that twarc-hashtags was counting retweets. If that's not the case it definitely feels like a bug in twarc-hashtags. I'm not sure it makes sense to count hashtags in tweets that are being replied to, quoted, etc. though, unless asked to? I might need to think about this. I guess as a user of a hashtag report I'd want to see counts for tweets that I collected, not tweets related to the tweets I collected, but where one tweet begins and ends is a fuzzy area.
It used to, but by default in the latest version, no. Just the original tweets merged into the retweets. I also agree with not counting them from all referenced tweets like replies. Quotes are different though: the quote tweet itself yes, but the quoted tweet? I'm not sure. Right now it will count the quote itself but not the quoted tweet. Still on the fence here too. I guess making command line switches for this will work. Some of this overlaps with what I was planning with DocNow/twarc-statistics#2 and with DocNow/twarc#562
@igorbrigadir ok, thanks! I'll have to double check. I just got a new computer and am using the latest twarc-csv. I thought I noticed it pulling in hashtags from the included conversation_id after flattening.
I have been experimenting with the plugin on some datasets and there appears to be an inconsistency in the counting. I am not sure whether tweets that contain multiple hashtags are also taken into account.
Here is the code I use (extracting them from the entities metadata):
I get different counts when I apply
twarc2 hashtags sample.jsonl
(I just took a random sample of tweets). I usually get hashtags with higher counts compared to the twarc command.