Great to hear, @patrick-dd, thanks for picking this up! I invited you to the D4D organization so you can be assigned the issue (helps us track who's working on what).
Looks like @patrick-dd and I are going to work from the two different ends of the problem and, with luck, meet in the middle :) Just thought I'd add that anyone else who is interested is welcome, since it'll be useful to get different insights into this task.
Construct a word2vec model from tweets for specific groups of people (e.g. far right) and compare it with models trained on the overall Twitterverse (e.g. http://fredericgodin.com/papers/Named%20Entity%20Recognition%20for%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf)
Some things to try:
cluster tweets with t-SNE/k-means/PCA
predict hashtags from tweet vectors
do regression on tweet/hashtag vectors
(Notes from a chat with a colleague of mine who did some NLP research.
The following are some of his recommendations:
word2vec is more likely to give better results than e.g. CountVectorizer
use word2vec with skip-gram training for the tweets themselves
there is probably no need to remove stop words or tokenize tweets (but do remove punctuation)
convert emojis into words, e.g. "happy", to get better context
convert word2vec vectors into polar coordinates
train word2vec for hashtags from tweets using CBOW
His opinion is that gensim is a handy tool, but he also built some extra utils for his work that may be useful: https://github.com/pelodelfuego/word2vec-toolbox )
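To make the preprocessing suggestions above concrete, here is a rough sketch of what "remove punctuation, convert emojis" could look like before feeding tweets to word2vec. Everything named here is hypothetical: the `EMOJI_WORDS` table and the `clean_tweet` helper are made up for illustration, and keeping `#` and `@` so hashtags and mentions survive is my own assumption, not something from the chat. It only splits on whitespace and does no stop-word removal.

```python
import string

# Hypothetical emoji-to-word table (an assumption); extend it for real data.
EMOJI_WORDS = {"😀": "happy", "😢": "sad"}

# Remove ASCII punctuation, but keep '#' and '@' (assumption) so that
# hashtags and mentions stay recognisable as separate tokens.
_PUNCT = string.punctuation.replace("#", "").replace("@", "")
PUNCT_TABLE = str.maketrans("", "", _PUNCT)


def clean_tweet(text):
    """Return a list of whitespace-separated tokens for one tweet."""
    # Replace each known emoji with its word, padded so it becomes its own token.
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")
    # Strip punctuation, then split on whitespace only.
    return text.translate(PUNCT_TABLE).split()


tokens = clean_tweet("So good!!! 😀 #win")
print(tokens)
```

The resulting token lists are the kind of input gensim's word2vec training expects (one list of strings per tweet).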
I have been tinkering a bit with our data using gensim (it seems fairly easy to use, although I haven't actually tried seeing what falls out of it yet).
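On the polar coordinates recommendation: I am not sure exactly which conversion my colleague had in mind, so the following is only one possible reading, the standard hyperspherical conversion (one radius plus n-1 angles for an n-dimensional vector). The `to_polar` name is hypothetical and this uses plain Python rather than anything from gensim or the toolbox linked above.

```python
import math


def to_polar(vec):
    """Convert an n-dimensional vector (n >= 2) to (radius, angles).

    Uses the standard hyperspherical convention: the first n-2 angles
    lie in [0, pi], the last one in (-pi, pi].
    """
    if len(vec) < 2:
        raise ValueError("need at least two components")
    radius = math.sqrt(sum(x * x for x in vec))
    angles = []
    for k in range(len(vec) - 2):
        # Norm of the components that come after position k.
        tail = math.sqrt(sum(x * x for x in vec[k + 1:]))
        angles.append(math.atan2(tail, vec[k]))
    # The last angle keeps the sign information of the final component.
    angles.append(math.atan2(vec[-1], vec[-2]))
    return radius, angles


radius, angles = to_polar([3.0, 4.0])
print(radius, angles)
```

One reason this might be useful is that the angles alone describe a word vector's direction, independent of its length, which is what cosine similarity compares as well.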