0703 kmeans python
mvanrongen committed Mar 7, 2024
1 parent eb32b33 commit 3c50399
Showing 10 changed files with 88 additions and 4 deletions.
4 changes: 2 additions & 2 deletions _freeze/materials/mva-kmeans/execute-results/html.json

Large diffs are not rendered by default.

88 changes: 86 additions & 2 deletions materials/mva-kmeans.qmd
@@ -157,15 +157,39 @@ The output is a list of vectors, with differing lengths. That's because they con

To do the clustering, we'll be using the `KMeans()` class from scikit-learn, after standardising the two bill measurements with `StandardScaler()`.

```{python}
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
# remove missing values
penguins_scaled_py = penguins_py.dropna()
# select relevant columns
penguins_scaled_py = penguins_scaled_py[['bill_depth_mm', 'bill_length_mm']]
penguins_scaled_py = std_scaler.fit_transform(penguins_scaled_py)
kmeans = KMeans(
    init = 'random',
    n_clusters = 3,
    n_init = 10,
    max_iter = 300,
    random_state = 42
)
kmeans.fit(penguins_scaled_py)
```
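
To make the scaling step concrete, here is a minimal sketch in plain Python of what standardising a single column involves: subtract the mean and divide by the population standard deviation, which is the convention `StandardScaler()` uses. The function name `standardise` and the bill depth values are made up for illustration; this is not scikit-learn's implementation.

```python
from statistics import fmean, pstdev

def standardise(values):
    """Centre a column on 0 and scale it to unit (population) standard deviation."""
    mu = fmean(values)
    sd = pstdev(values)  # population SD (divides by n)
    return [(v - mu) / sd for v in values]

# made-up bill depths in mm, for illustration only
bill_depth_example = [18.7, 17.4, 18.0, 19.3, 20.6]
scaled_example = standardise(bill_depth_example)
```

After scaling, both variables are on a comparable footing, so neither dominates the distance calculations that k-means relies on.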

:::

### Visualise clusters

When we performed the clustering, the centers were calculated. These values give the (x, y) coordinates of the centroids.

::: {.panel-tabset group="language"}
## R

When we performed the clustering, the centers were calculated. These values give the (x, y) coordinates of the centroids.

```{r}
tidy_clust <- tidy(kclust) # get centroid coordinates
@@ -174,6 +198,22 @@ tidy_clust

## Python

```{python}
# extract the cluster centers and convert to a Pandas DataFrame,
# naming the columns after the scaled variables
kclusts_py = pd.DataFrame(kmeans.cluster_centers_,
                          columns = ['bdepth_scaled', 'blength_scaled'])
# and show the coordinates
kclusts_py
```
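
Once the algorithm has converged, each centroid is simply the mean of the points assigned to that cluster. The sketch below illustrates this in plain Python; the helper `centroids` and the example points and labels are hypothetical, not output from the penguin data.

```python
from statistics import fmean

# made-up scaled (depth, length) pairs and cluster labels, for illustration only
points_example = [(-1.0, -0.5), (-0.8, -0.7), (1.2, 0.9), (1.0, 1.1)]
labels_example = [0, 0, 1, 1]

def centroids(points, labels):
    """Return {cluster label: (mean x, mean y)} for 2-D points."""
    out = {}
    for lab in sorted(set(labels)):
        # collect the points that belong to this cluster
        members = [p for p, l in zip(points, labels) if l == lab]
        out[lab] = (fmean(x for x, _ in members),
                    fmean(y for _, y in members))
    return out

centroids_example = centroids(points_example, labels_example)
```

Note that the coordinates are in scaled units, because the clustering was performed on the standardised data.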

:::

:::{.callout-note}
@@ -199,6 +239,50 @@ kclust %>% # take clustering data

## Python

We reformat and rename the scaled data:

```{python}
# convert NumPy arrays to Pandas DataFrame
penguins_scaled_py = pd.DataFrame(penguins_scaled_py)
penguins_scaled_py = penguins_scaled_py.rename(columns={0: 'bdepth_scaled', 1: 'blength_scaled'})
```


and merge this with the original data:

```{python}
# remove missing values (the same rows that were used for the clustering)
penguins_py = penguins_py.dropna()
# add an ID column
penguins_py['id'] = range(1, len(penguins_py) + 1)
# add an ID column to the scaled data
# so we can match the observations
penguins_scaled_py['id'] = range(1, len(penguins_scaled_py) + 1)
# merge the data by ID
penguins_augment_py = pd.merge(penguins_py, penguins_scaled_py, on = 'id')
# add the cluster designation from the model fitted earlier
# (refitting here would treat the 'id' column as a feature)
penguins_augment_py['cluster'] = kmeans.labels_
# and convert it into a categorical variable
penguins_augment_py['cluster'] = penguins_augment_py['cluster'].astype('category')
```
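
The cluster designation of an observation is the index of its nearest centroid. As a rough sketch of that idea, assuming Euclidean distance and using made-up centroid coordinates, the hypothetical helper `nearest_centroid` below assigns a few points by hand:

```python
from math import dist

def nearest_centroid(point, centres):
    """Return the index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: dist(point, centres[i]))

# made-up centroids and points in scaled units, for illustration only
centres_example = [(-0.9, -0.6), (1.1, 1.0)]
new_points_example = [(-1.0, -0.4), (0.9, 1.2), (0.0, 0.0)]

assigned_example = [nearest_centroid(p, centres_example)
                    for p in new_points_example]
```

The labels come back in the same order as the input rows, which is why the observations can be matched to their clusters by position or by an ID column.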

We can then (finally!) plot this:

```{python}
#| results: hide
(ggplot(penguins_augment_py,
        aes(x = 'bill_depth_mm',
            y = 'bill_length_mm',
            colour = 'cluster')) +
    geom_point())
```

:::

## Optimising cluster number
