Data Analysis With Python & K-Means

A concise tutorial on how to get started with K-Means in Python.

Mar 05, 2023

What’s K-Means?

K-means is a widely used clustering algorithm that groups similar data points into a chosen number of clusters, k. The algorithm works by minimizing the sum of the squared distances between each data point and the center of its assigned cluster. In this post, we’ll explore a basic example of how to use k-means in data analysis with Python.
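To make the quantity being minimized concrete, here is a minimal sketch that computes the sum of squared distances by hand and compares it with the value scikit-learn reports as inertia_. The six points are made up for illustration, and the n_init and random_state settings are assumptions added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points forming two loose groups (illustrative data only)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the center assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_sse = ((points - assigned_centers) ** 2).sum()

# The two numbers should agree up to floating-point rounding
print(manual_sse, model.inertia_)
```

The hand-computed value and inertia_ measure the same thing, which is why inertia_ is used as the SSE later in this post.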

Step 1: Importing Libraries

To start with, we need to import the necessary libraries to use k-means in Python. We will use the numpy, pandas, matplotlib, and sklearn libraries.

# Import the necessary libraries for using k-means in Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Step 2: Loading the Data

Next, we need to load the data that we want to analyze. For this example, we will use the iris dataset, which contains measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers.

# Load the iris dataset for analysis
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
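Before clustering, it can help to look at what was loaded. The following is a small sketch, assuming pandas is installed, that wraps the array in a DataFrame using the feature names shipped with the dataset. The variable name df is just an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Wrap the raw array in a DataFrame so the columns carry their feature names
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.shape)       # (rows, columns)
print(df.head())      # first five measurements
print(df.describe())  # per-column summary statistics
```

The iris dataset has 150 rows and 4 columns, so the shape printed first should be (150, 4).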

Step 3: Choosing the Number of Clusters

Before we can use k-means to cluster the data, we need to choose the number of clusters that we want to group the data into. One way to do this is the elbow method: we plot the sum of squared distances for a range of values of k, then select the value of k where the decrease in the sum of squared distances starts to level off.

# Use the elbow method to choose the number of clusters for k-means
sse = {}

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    sse[k] = kmeans.inertia_  # sum of squared distances to the nearest cluster center

# Plot SSE against k on a single figure
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()

In this code, we are looping through different values of k from 1 to 10, fitting a k-means model to the data for each value of k, and calculating the sum of squared distances (SSE) for each model. We then plot the SSE for each value of k.
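Reading the elbow off a plot can be subjective, so one option is to print how much the SSE falls at each step as well. This is a rough sketch rather than a formal selection rule. The n_init and random_state arguments are assumptions added so the numbers are repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data

# Refit for k = 1..10, fixing random_state so the numbers are repeatable
sse = {}
for k in range(1, 11):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

# Relative drop in SSE when going from k-1 to k clusters
for k in range(2, 11):
    drop = (sse[k - 1] - sse[k]) / sse[k - 1]
    print(f"k={k}: SSE={sse[k]:.1f}, drop={drop:.0%}")
```

A large drop followed by much smaller ones marks the point where adding clusters stops paying off.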

Step 4: Running K-Means

Once we have chosen the number of clusters, we can run the k-means algorithm on the data.

# Run k-means with the chosen number of clusters on the iris dataset
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(data)

In this code, we are fitting a k-means model to the data with k=3 clusters, which matches the three iris species in the dataset, and a maximum of 1000 iterations.
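Once the model is fitted, its results can be inspected directly. The sketch below shows one possible way to do that: it prints the cluster centers, cross-tabulates the cluster labels against the known species, and assigns a made-up measurement to a cluster. The new_flower values are hypothetical, and n_init and random_state are assumptions added for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# One row per cluster, one column per feature
print(kmeans.cluster_centers_)

# Cross-tabulate the known species against the assigned cluster labels
species = iris.target_names[iris.target]
table = pd.crosstab(species, kmeans.labels_,
                    rownames=["species"], colnames=["cluster"])
print(table)

# Assign a new, made-up measurement to its nearest cluster center
new_flower = [[5.0, 3.4, 1.5, 0.2]]
print(kmeans.predict(new_flower))
```

Cluster numbers are arbitrary labels, so cluster 0 need not correspond to the first species. The cross-tabulation shows how the two line up.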

Step 5: Visualizing the Clusters

Finally, we can visualize the clusters that the k-means algorithm has generated. One way to do this is by plotting the data points and coloring them based on their assigned cluster.

# Visualize the clusters generated by k-means
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()

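In this code, we are plotting the sepal length on the x-axis and the sepal width on the y-axis, and coloring each point based on its assigned cluster. Note that the model was fitted on all four features, so this plot shows only two of the four dimensions that were used for clustering.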
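The same approach works for the other two features, and the cluster centers can be drawn on top of the points. Here is a minimal sketch, assuming the petal measurements sit in columns 2 and 3 as they do in scikit-learn's iris data. The output file name iris_clusters.png is an arbitrary choice, and n_init and random_state are added for repeatability.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Columns 2 and 3 hold petal length and petal width
ax.scatter(data[:, 2], data[:, 3], c=kmeans.labels_)
# Overlay the three cluster centers as red crosses
ax.scatter(centers[:, 2], centers[:, 3], c="red", marker="x", s=100,
           label="Cluster centers")
ax.set_xlabel("Petal length")
ax.set_ylabel("Petal width")
ax.legend()
fig.savefig("iris_clusters.png")  # or plt.show() in an interactive session
```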

Conclusion

In conclusion, k-means is a powerful clustering algorithm, and knowing how to use it is a useful skill for many kinds of data analysis. This was just one example to lay out the basics in a concise format for those looking to jump in quickly and get exploring.

TwoCents