Word2Vec models #26

Open
wwymak opened this issue Feb 2, 2017 · 3 comments

Comments

@wwymak
Contributor

wwymak commented Feb 2, 2017

Construct word2vec models from tweets for specific groups of people (e.g. the far right) and compare them with models trained on the overall Twitterverse (e.g. http://fredericgodin.com/papers/Named%20Entity%20Recognition%20for%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf)
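One caveat when comparing two separately trained models: their vector spaces are not aligned, so raw coordinates can't be compared directly. A rough sketch of one alternative is to compare which words end up as nearest neighbours of the same word in each model. Everything below (the function names, the toy 2-d vectors, the choice of Jaccard overlap) is an assumption for illustration, not something from the linked paper or from our data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=2):
    """Return the k words closest to `word` within one embedding space."""
    target = vectors[word]
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[w], target), reverse=True)
    return others[:k]


def neighbour_overlap(model_a, model_b, word, k=2):
    """Jaccard overlap of the top-k neighbours of `word` in two models."""
    a = set(nearest(model_a, word, k))
    b = set(nearest(model_b, word, k))
    return len(a & b) / len(a | b)


# Toy 2-d vectors, invented purely for illustration (not real results).
group_model = {
    "wall": [1.0, 0.0], "border": [0.9, 0.1],
    "art": [0.0, 1.0], "brick": [0.1, 0.9],
}
overall_model = {
    "wall": [0.0, 1.0], "border": [1.0, 0.0],
    "art": [0.1, 0.9], "brick": [0.2, 0.8],
}

print(nearest(group_model, "wall"))    # ['border', 'brick']
print(nearest(overall_model, "wall"))  # ['art', 'brick']
print(neighbour_overlap(group_model, overall_model, "wall"))
```

A low overlap for a word would suggest the group uses it in a different context from the overall population; with real models the same idea works on top of whatever "most similar" lookup the library provides.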

Some things to try:
cluster tweets with k-means, and visualize them with t-SNE/PCA
predict hashtags from tweet vectors
do regression on tweet/hashtag vectors
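All of these need a single vector per tweet. A minimal sketch, assuming a tweet vector is just the average of its word vectors and that hashtag vectors live in the same space as the word vectors (e.g. they come from one model); the helper names and the toy vectors are hypothetical:

```python
import math


def tweet_vector(tokens, word_vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no known words, so no vector
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_hashtag(tokens, word_vectors, hashtag_vectors):
    """Pick the hashtag whose vector is closest to the tweet vector."""
    vec = tweet_vector(tokens, word_vectors)
    if vec is None:
        return None
    return max(hashtag_vectors, key=lambda h: cosine(vec, hashtag_vectors[h]))


# Toy vectors, invented for illustration only.
word_vectors = {
    "vote": [1.0, 0.0], "election": [0.8, 0.2],
    "goal": [0.0, 1.0], "match": [0.2, 0.8],
}
hashtag_vectors = {"#politics": [1.0, 0.1], "#football": [0.1, 1.0]}

print(predict_hashtag(["vote", "election", "today"], word_vectors, hashtag_vectors))
print(predict_hashtag(["goal", "match"], word_vectors, hashtag_vectors))
```

The same tweet vectors could be fed into a clustering or regression routine; nearest-hashtag lookup is only the simplest possible baseline.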

(Notes from a chat with a colleague of mine who has done some NLP research.
The following are some of his recommendations:

word2vec is likely to give better results than e.g. CountVectorizer
use word2vec with skip-gram training for the tweets themselves
there is probably no need to remove stop words or tokenize tweets (but do remove punctuation)
convert emojis into words (e.g. "happy") to get better context
convert word2vec vectors into polar coordinates
train word2vec for hashtags from tweets using CBOW

His opinion is that gensim is a handy tool, but he also built some extra utilities for his own work that may be useful: https://github.com/pelodelfuego/word2vec-toolbox)
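The preprocessing recommendations above could look roughly like this. It is only a sketch: the emoji-to-word mapping is hypothetical, and keeping `#` and `@` (so hashtags and mentions survive) and lowercasing are my own assumptions rather than part of his advice. Stop words are deliberately left in, and the only "tokenizing" is a whitespace split.

```python
import string

# Hypothetical emoji-to-word mapping; a real one would need many more entries.
EMOJI_WORDS = {
    "\U0001F600": "happy",  # grinning face
    "\U0001F622": "sad",    # crying face
}

# Strip ASCII punctuation, but keep '#' and '@' (assumption, see above).
PUNCT = "".join(c for c in string.punctuation if c not in "#@")
PUNCT_TABLE = str.maketrans("", "", PUNCT)


def preprocess(tweet):
    """Replace known emojis with words, drop punctuation, split on whitespace."""
    for emoji, word in EMOJI_WORDS.items():
        tweet = tweet.replace(emoji, f" {word} ")
    tweet = tweet.translate(PUNCT_TABLE)
    return tweet.lower().split()


print(preprocess("Great game tonight!!! \U0001F600 #football"))
# ['great', 'game', 'tonight', 'happy', '#football']
```

Note that stripping punctuation this way also mangles URLs and removes underscores inside hashtags, so links would probably need handling before this step.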

I have been tinkering a bit with our data using gensim (it seems fairly easy to use, although I haven't yet looked at what falls out of it).
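For anyone who wants a starting point, here is a rough sketch of how the skip-gram and CBOW variants could be set up in gensim. The parameter values are illustrative, not tuned, and the helper names are hypothetical. It assumes gensim 4.x, where the dimensionality argument is `vector_size` (older versions call it `size`).

```python
def word2vec_params(architecture):
    """Keyword arguments for gensim's Word2Vec; `sg` selects the architecture."""
    if architecture not in ("skipgram", "cbow"):
        raise ValueError(f"unknown architecture: {architecture!r}")
    return {
        "vector_size": 100,  # called `size` in gensim versions before 4.0
        "window": 5,
        "min_count": 1,      # keep rare words; raise this for real data
        "sg": 1 if architecture == "skipgram" else 0,
        "workers": 1,
    }


def train(sentences, architecture):
    """Train a model on tokenised tweets (a list of lists of strings)."""
    # Imported lazily so the parameter helper works without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **word2vec_params(architecture))


# Usage (requires gensim):
# tweets = [["great", "game", "tonight", "#football"],
#           ["vote", "today", "#election"]]
# model = train(tweets, "skipgram")
# print(model.wv.most_similar("game", topn=3))
```

Following the notes above, the tweets themselves would use `"skipgram"` and a hashtag-focused model would use `"cbow"`.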

@patrick-dd

Starting on this

@hadoopjax
Contributor

Great to hear, @patrick-dd, thanks for picking this up! I invited you to the D4D organization so you can be assigned the issue (it helps us track who's working on what).

@wwymak
Contributor Author

wwymak commented Feb 5, 2017

Looks like @patrick-dd and I are going to work from the two different ends of the problem and, with luck, meet in the middle :) Just thought I'd add that anyone else who is interested is welcome to join, since it'll be useful to get different insights into this task.
