
Printing Clusters (Top terms & titles) #12

Open
s2hewitt opened this issue Jan 3, 2017 · 5 comments

s2hewitt commented Jan 3, 2017

I've followed all the steps down to the final one, where you print the top terms per cluster together with the film titles.
I'm using a slightly different dataset (blog titles and blog post content), but in essence my data is the same as yours. My data is already in a dataframe, so where you call on 'synopses', I call df.Content. The one step I couldn't do was the one where you grouped rank by cluster, as that obviously doesn't apply to my data. I want ten clusters from my data.

Here, you create a dictionary:

films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])

But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:

[screenshot: only the first ten blog post titles appear]

As this is my first attempt at k-means (although I've been experimenting with my data for three weeks), I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!
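For what it's worth, reindexing by clusters isn't necessary when the data is already in a DataFrame: the labels from a fitted KMeans line up row-for-row with the documents they were fitted on, so they can be attached as an ordinary column. A minimal sketch (the toy data, column names, and model settings here are illustrative, not from the original notebook):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# toy stand-in for the blog-post DataFrame described above
topics = ['cats', 'dogs', 'food']
df = pd.DataFrame({
    'Title': ['post {}'.format(i) for i in range(12)],
    'Content': ['a blog post about {}'.format(topics[i % 3]) for i in range(12)],
})

# vectorise the content column directly -- no separate lists needed
X = TfidfVectorizer().fit_transform(df.Content)

# 10 clusters on the real data; 3 here for the toy sample
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# attach the labels as a plain column instead of reindexing,
# so every row keeps its own title
df['cluster'] = km.labels_
print(df[['Title', 'cluster']].head())
```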

s2hewitt (Author) commented Jan 3, 2017

Actually, I HAVE solved the problem, but now I have another.
I did this:
[screenshot of the fix]
...but when I run the final step, (printing the top terms and titles in each cluster) I get the following error message:
[screenshot of the error message]

I'm at a loss again.... Thanks!

brandomr (Owner) commented Jan 3, 2017

@s2hewitt it looks like Title in your frame object is not an array but a string. In the frame I referenced, Title is an array of film titles associated with the cluster. It looks like you have one row per title with an associated cluster.

If you want to get all the titles for a given cluster (assuming the above is true) you can do something like:

import pandas as pd

data = [
    {'Title': 'film 1', 'cluster': 0},
    {'Title': 'film 2', 'cluster': 0},
    {'Title': 'film 3', 'cluster': 1},
    {'Title': 'film 4', 'cluster': 1},
    {'Title': 'film 5', 'cluster': 1},
    {'Title': 'film 6', 'cluster': 2},
    {'Title': 'film 7', 'cluster': 2},
    {'Title': 'film 8', 'cluster': 2},
]

frame = pd.DataFrame(data)

# get unique list of clusters
clusters = list(set(frame.cluster))

# iterate over list of clusters
for clust in clusters:

    # subset frame based on cluster then grab those titles
    cluster_titles = ', '.join(frame[frame['cluster'] == clust].Title.tolist())

    print('Cluster {0} Titles: {1}\n'.format(clust, cluster_titles))
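The same per-cluster join can also be written with pandas groupby, which avoids the explicit loop over clusters (a sketch on a small toy frame like the one above):

```python
import pandas as pd

frame = pd.DataFrame([
    {'Title': 'film 1', 'cluster': 0},
    {'Title': 'film 2', 'cluster': 0},
    {'Title': 'film 3', 'cluster': 1},
])

# groupby collapses each cluster's titles into one comma-separated string
titles_by_cluster = frame.groupby('cluster')['Title'].apply(', '.join)
for clust, titles in titles_by_cluster.items():
    print('Cluster {0} Titles: {1}\n'.format(clust, titles))
```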

Let me know if that helps!

s2hewitt (Author) commented Jan 3, 2017

Thanks. It'll take me a good while to work this out.
My dataframe has nearly 4,000 blog post titles and their associated content. I'm trying this out on a sample; my final .csv file is much, much bigger.
I think I need to go back and see if I can replicate how you created and converted the dictionary, although I'm guessing that you worked from lists, which isn't practical for the file sizes I'm eventually going to be working with.
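For reference, a DataFrame of a few thousand rows can be built straight from the .csv without going through lists or dictionaries at all. A minimal sketch, where the CSV contents and column names are a hypothetical stand-in for the real file:

```python
import io
import pandas as pd

# hypothetical stand-in for the real blog-post CSV
csv_text = "Title,Content\npost 1,some text\npost 2,more text\n"

# read_csv streams the file; a few thousand rows is well within memory,
# and the resulting frame can be clustered and grouped directly
df = pd.read_csv(io.StringIO(csv_text))
print(len(df), list(df.columns))
```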

brandomr (Owner) commented Jan 4, 2017

@s2hewitt if you're able to post your notebook and some sample data, I could take a look. If you're not getting out-of-memory errors, the data size isn't a problem; you might just need some fancy footwork to convert the dataframe into a format that gives you your desired output.

s2hewitt (Author) commented Jan 4, 2017

Thanks! Could you follow me on Twitter so I can DM you? This is for my PhD, and while I'm very willing to share solutions with anyone else who may encounter the same issues, I'm a bit protective of my data!
