Printing Clusters (Top terms & titles) #12
Comments
@s2hewitt it looks like If you want to get all the titles for a given cluster (assuming the above is true) you can do something like:
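A minimal sketch of that kind of lookup, assuming a DataFrame with `title` and `cluster` columns (the column names follow the `films` dictionary in the question; the toy data and the helper `titles_in_cluster` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the question's `films` dict.
frame = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C", "Post D", "Post E"],
    "cluster": [0, 1, 0, 2, 1],
})


def titles_in_cluster(df, cluster_id):
    """Return every title whose row is labelled with cluster_id."""
    # Boolean mask selects the matching rows; "title" selects the column.
    return df.loc[df["cluster"] == cluster_id, "title"].tolist()


# Alternative: build a {cluster label: [titles]} mapping for all clusters at once.
titles_by_cluster = frame.groupby("cluster")["title"].apply(list).to_dict()

print(titles_in_cluster(frame, 0))
print(titles_by_cluster)
```

Filtering on the `cluster` column avoids depending on how the frame is indexed, so it should work whether or not the cluster labels were also used as the index.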
Let me know if that helps!
Thanks. It'll take me a good while to work this out.
@s2hewitt if you're able to post your notebook and some sample data, I could take a look. If you're not getting out-of-memory errors, then the data size isn't a problem; you might just need some fancy footwork to convert the dataframe into a format that gives you your desired output.
Thanks! Could you follow me on Twitter so I can DM you? This is for my PhD, and while I'm very willing to share solutions with anyone else who may encounter the same issues, I'm a bit protective of my data!
I've followed all the steps down to the final one where you print the top terms per cluster, together with the film titles.
I'm using a slightly different dataset (blog titles and blog post content) but in essence my data is the same as yours, although my data is already in a dataframe, so where you call on 'synopses', I call df.Content. The one step I couldn't do was the one where you grouped the rank by clusters as obviously this doesn't apply to me. I want ten clusters from my data.
Here, you create a dictionary:
films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])
But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:
As this is my first attempt at kMeans (although I've been experimenting with my data for three weeks) I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!
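One plausible explanation, assuming the reindexing was done with something like `df.reindex(clusters)`: `reindex` treats the cluster labels as row labels to look up in the existing index, so with ten clusters (labels 0 to 9) it can only ever return the rows labelled 0 to 9, which are the first ten posts. The sketch below reproduces this on made-up data with three clusters and then attaches the labels by position instead. The `Title` column name and the `clusters` values are hypothetical; only `df.Content` comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the blog data: six posts with the default 0..5 index.
df = pd.DataFrame({
    "Title": [f"Post {i}" for i in range(6)],
    "Content": ["some text"] * 6,
})

# Hypothetical cluster labels, one per post, e.g. from a fitted KMeans model.
clusters = [2, 0, 1, 2, 0, 1]

# Pitfall: reindex looks up existing rows BY LABEL, so labels 0..2 can only
# ever pull back the first three posts, repeated in cluster-label order.
looked_up = df.reindex(clusters)
print(sorted(set(looked_up["Title"])))

# Fix: assign the labels positionally as a new column, then (optionally)
# make that column the index, which keeps every post.
df["cluster"] = clusters
frame = df.set_index("cluster")
print(frame.loc[0, "Title"].tolist())
```

In the tutorial code, `pd.DataFrame(films, index=[clusters], ...)` builds a new frame and sets its index, which is why the same problem does not appear there.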