Miss-matching counts #1

Open
luisignaciomenendez opened this issue Jan 25, 2022 · 9 comments · May be fixed by #2
Labels
bug (Something isn't working), enhancement (New feature or request)

Comments

@luisignaciomenendez

I have been experimenting with the plugin on some datasets and there appears to be an inconsistency in the counting. I am not sure whether tweets that contain multiple hashtags are also taken into account.

Here is the code I use (extracting them from the entities metadata):

import json

import pandas as pd
from twarc.expansions import ensure_flattened


def hash_retrieve(df):
    """
    df : dataframe of tweets
    Description:
        The function takes a df of tweets obtained via twarc and
        returns a generator of [hashtag, tweet id] pairs.

    """

    for line, tweet_id in zip(df['entities.hashtags'], df['id']):
        # Tweets without hashtags have a missing value in this column.
        if pd.isna(line):
            continue
        line = line.strip()
        data = json.loads(line)
        for hashtag in ensure_flattened(data):
            yield [hashtag['tag'], tweet_id]


# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

# Count how often each hashtag occurs (case sensitive).
a.hashtag.value_counts()

I get different counts when I run twarc2 hashtags sample.jsonl (a random sample of tweets). My code usually gives higher counts for a hashtag than the twarc command does.

@igorbrigadir

igorbrigadir commented Jan 25, 2022

Do you have a sample of what your dataframe contains? How is it generated in the first place? It's hard to say or compare it to the code otherwise.

@luisignaciomenendez
Author

luisignaciomenendez commented Jan 25, 2022

Sure, I tried with a random sample using:
twarc2 sample sample.jsonl
(I have also done some extra trials, but this is the most immediate one.) I know this is hardly reproducible, as it uses a live stream of tweets, but I will try to attach or send you the original file that I have.

Here are the results (for twarc, only those that appear with a count=2):

twarc2 hashtags sample.jsonl

[Screenshot 2022-01-25 at 12 55 10]

From my code:
[Screenshot 2022-01-25 at 12 55 37]

sample.jsonl.zip

@edsu
Member

edsu commented Jan 25, 2022

@luisignaciomenendez I think @igorbrigadir means where df comes from in:

# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

Is df loaded from a CSV generated with twarc2 csv?

@luisignaciomenendez
Author

@luisignaciomenendez I think @igorbrigadir means where df comes from in:

# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

Is df loaded from a CSV generated with twarc2 csv?

Yes, exactly. I converted it using twarc2 csv and then loaded it with pandas.

@edsu
Member

edsu commented Jan 25, 2022

I'm a little bit confused by your code but I do think you've found a difference in how twarc-hashtags works and what is in the entities.hashtags column that twarc-csv generates.

It looks like twarc-csv includes not only the tweets that were collected but also the tweets that those tweets reference (replies and quotes), the so-called "includes".

Personally I would expect to only get hashtags for the tweets that were collected, not the tweets that were referenced. But I guess having an --all flag to get all might be appropriate?

I wonder if users of twarc-csv understand this behavior when using the data though ...

@igorbrigadir

I think I found what the problem is: it's retweets. twarc-csv processes retweets so that they match what you would expect to find, using the full text of the tweet, not what the JSON actually contains. So, for a retweet in the JSON like this:

{
  "entities": {
    "hashtags": [
      {
        "start": 107,
        "end": 115,
        "tag": "EndSARS"
      }
    ]
  },
  "id": "1388203310327508995",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1388174000472432650"
    }
  ],
  "text": "RT @abjghost: @imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still wan…"
}

The retweet is truncated, so only 1 hashtag is counted by twarc-hashtags: EndSARS

The twarc-csv code, by contrast, digs into the referenced tweet, 1388174000472432650, which is:

{
  "entities": {
    "urls": [
      {
        "start": 280,
        "end": 303,
        "url": "https://t.co/fDgTVvbQBZ",
        "expanded_url": "https://twitter.com/abjghost/status/1388174000472432650/photo/1",
        "display_url": "pic.twitter.com/fDgTVvbQBZ"
      }
    ],
    "mentions": [
      {
        "start": 0,
        "end": 16,
        "username": "imoleayomichael"
      }
    ],
    "hashtags": [
      {
        "start": 93,
        "end": 101,
        "tag": "EndSARS"
      },
      {
        "start": 224,
        "end": 237,
        "tag": "FreeImoleAyo"
      }
    ]
  },
  "id": "1388174000472432650",
  "in_reply_to_user_id": "927129038933626880",
  "text": "@imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still want to convict him.\n\nImoleayo is a Programmer NOT A CRIMINAL!\n\nPls lend your voice in solidarity to \n#FreeImoleAyo\nIt could be you or me.\nPls tweet, RT, Tag https://t.co/fDgTVvbQBZ"
}

So it will count 2 hashtags.
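
The difference between the two counts can be reproduced with a small sketch. This is not the actual twarc-hashtags or twarc-csv code; the helper names own_hashtags and resolved_hashtags and the tweets_by_id lookup are made up for illustration, and the two tweets are trimmed copies of the JSON above.

```python
from collections import Counter

# Trimmed copies of the two tweets shown above.
retweet = {
    "id": "1388203310327508995",
    "entities": {"hashtags": [{"tag": "EndSARS"}]},
    "referenced_tweets": [{"type": "retweeted", "id": "1388174000472432650"}],
}
original = {
    "id": "1388174000472432650",
    "entities": {"hashtags": [{"tag": "EndSARS"}, {"tag": "FreeImoleAyo"}]},
}

# Hypothetical lookup of the referenced ("included") tweets by id.
tweets_by_id = {original["id"]: original}


def own_hashtags(tweet):
    """Hashtags found in the tweet's own (possibly truncated) entities."""
    return [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]


def resolved_hashtags(tweet, lookup):
    """For a retweet, use the hashtags of the original tweet if it is available."""
    for ref in tweet.get("referenced_tweets", []):
        if ref["type"] == "retweeted" and ref["id"] in lookup:
            return own_hashtags(lookup[ref["id"]])
    return own_hashtags(tweet)


truncated_counts = Counter(own_hashtags(retweet))
resolved_counts = Counter(resolved_hashtags(retweet, tweets_by_id))
print(truncated_counts)
print(resolved_counts)
```

Reading the retweet's own entities yields one hashtag, while resolving the retweet to the original tweet yields two, which is the gap described above.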

A second source of variation is that twarc-hashtags ignores case, while your code is case sensitive, so, for example, EndSARS and endsars are counted separately. Also, ensure_flattened(data) is meant for handling entire responses, not small JSON objects within tweets. The function is robust enough to handle that, so it's OK to keep using it, but here it does nothing to the data, so you can leave it out and write for hashtag in data: instead.
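
As a rough illustration of the case issue, here is a sketch with a made-up list of tags (not data from sample.jsonl). Lowercasing before counting is one simple way to ignore case; it is an assumption here, not necessarily how twarc-hashtags does it.

```python
from collections import Counter

# Made-up sample of hashtag strings that differ only in case.
tags = ["EndSARS", "endsars", "FreeImoleAyo", "ENDSARS"]

# Case sensitive: each spelling is a separate key.
case_sensitive = Counter(tags)

# Case insensitive: normalise to lowercase before counting.
case_insensitive = Counter(tag.lower() for tag in tags)

print(case_sensitive)
print(case_insensitive)
```

With the dataframe from the first comment, the pandas equivalent would be a.hashtag.str.lower().value_counts().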

These aren't mistakes or bugs as such, they're just different things that we should be aware of and decide to count one way or another.

Personally, I'm inclined to edit twarc-hashtags to count the retweeted hashtags the same way as twarc-csv, and keep it ignoring case, the same as the Twitter UI. This does mean adding a bit more code, but I think it's less surprising to users, because if someone were to manually verify a count, the numbers should match.

@igorbrigadir added the bug and enhancement labels Jan 25, 2022
@edsu
Member

edsu commented Jan 25, 2022

I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies? That was the source of one discrepancy at least. I thought that twarc-hashtags was counting retweets. If that's not the case it definitely feels like a bug in twarc-hashtags. I'm not sure it makes sense to count hashtags in tweets that are being replied to, quoted, etc. though, unless asked to? I might need to think about this. I guess as a user of a hashtag report I'd want to see counts for tweets that I collected, not tweets related to the tweets I collected, but where one tweet begins and ends is a fuzzy area.

@igorbrigadir

I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies?

It used to, but by default in the latest version, no. Only the original tweets are merged into the retweets.

I also agree with not counting hashtags from all referenced tweets, like replies. Quotes are different though: the quote tweet itself yes, but the quoted tweet? I'm not sure. Right now it will count the quote itself but not the quoted tweet. I'm still on the fence here too. I guess making command line switches for this will work.

Some of this overlaps with what I was planning with DocNow/twarc-statistics#2 and with DocNow/twarc#562

@edsu
Member

edsu commented Jan 27, 2022

@igorbrigadir ok, thanks! I'll have to double check. I just got a new computer and am using the latest twarc-csv. I thought I noticed it pulling in hashtags from the included conversation_id after flattening.

@igorbrigadir igorbrigadir linked a pull request Feb 5, 2022 that will close this issue