Data Analysis With Python & K-Means
A concise tutorial on how to get started with K-Means in Python.
What’s K-Means?
K-means is a widely used clustering algorithm in data analysis that groups similar data points into clusters. It works by minimizing the sum of the squared distances between each data point and the center of its assigned cluster. In this post, we’ll walk through a basic example of k-means clustering in Python.
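To make the objective above concrete, here is a minimal sketch that recomputes the sum of squared distances by hand and compares it with the value scikit-learn reports. The six toy points and the `n_init` and `random_state` settings are assumptions made for illustration only; they are not part of the main example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up 2-D points forming two loose groups (illustrative data only).
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])

# n_init and random_state are fixed here only so that reruns give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# For each point, look up the center of the cluster it was assigned to.
assigned_centers = model.cluster_centers_[model.labels_]

# Sum the squared Euclidean distances between points and their assigned centers.
manual_sse = np.sum((points - assigned_centers) ** 2)

print(f"Manual SSE:     {manual_sse:.4f}")
print(f"model.inertia_: {model.inertia_:.4f}")
```

The two printed values should agree up to floating-point rounding, since `inertia_` is scikit-learn's name for this same quantity.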
Step 1: Importing Libraries
To start with, we need to import the libraries required to use k-means in Python. We will use numpy, pandas, matplotlib, and scikit-learn (imported as sklearn).
# Import the necessary libraries for using k-means in Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Step 2: Loading the Data
Next, we need to load the data that we want to analyze. For this example, we will use the iris dataset, which contains measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers.
# Load the iris dataset for analysis
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
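Before clustering, it can help to look at the data. The sketch below wraps the array in a pandas DataFrame, using the column names that `load_iris()` provides in `feature_names`. The variable name `df` is just an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Wrap the raw array in a DataFrame so each column carries its feature name.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.shape)       # (rows, columns): one row per flower, one column per measurement
print(df.head())      # first five rows
print(df.describe())  # per-column summary statistics
```

All four columns are measured in centimeters, so their scales are comparable and the distances k-means computes are not dominated by a single feature.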
Step 3: Choosing the Number of Clusters
Before we can use k-means to cluster the data, we need to choose the number of clusters, k. One way to do this is the elbow method, which plots the sum of squared distances for a range of values of k. We then pick the value of k at which the curve bends, that is, where adding another cluster stops reducing the sum of squared distances by much.
# Use the elbow method to choose the number of clusters for k-means
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    sse[k] = kmeans.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
In this code, we loop over values of k from 1 to 10, fit a k-means model to the data for each one, and record its sum of squared distances (SSE), which scikit-learn exposes as the inertia_ attribute. We then plot the SSE against k.
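Reading the elbow off a plot can be subjective, so it may help to print the numbers as well. The following sketch reports the relative drop in SSE each time k increases by one. The `drops` dictionary and the fixed `n_init` and `random_state` values are assumptions added for repeatability, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data

# Fit one model per k; n_init and random_state are fixed only for repeatability.
sse = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sse[k] = model.inertia_

# Relative reduction in SSE gained by moving from k-1 clusters to k clusters.
drops = {}
for k in range(2, 11):
    drops[k] = (sse[k - 1] - sse[k]) / sse[k - 1]
    print(f"k={k - 1} -> k={k}: SSE {sse[k - 1]:8.2f} -> {sse[k]:8.2f} ({drops[k]:.1%} drop)")
```

A large drop followed by much smaller ones suggests the elbow sits at the last k that still produced a large drop.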
Step 4: Running K-Means
Once we have chosen the number of clusters, we can run the k-means algorithm on the data.
# Run k-means with the chosen number of clusters on the iris dataset
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(data)
In this code, we are fitting a k-means model to the data with k=3 clusters, matching the three iris species in the dataset, and a maximum of 1000 iterations.
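Once the model is fitted, its results are available as attributes. The sketch below prints the cluster centers and cluster sizes, then assigns a new observation to a cluster with `predict`. The `new_flower` measurements are made up for illustration, and `n_init` and `random_state` are set only so that reruns match.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data

# n_init and random_state are illustrative; fixing them makes reruns repeatable.
kmeans = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(data)

# One row per cluster center, one column per measurement.
print(kmeans.cluster_centers_)

# How many flowers landed in each cluster.
cluster_sizes = np.bincount(kmeans.labels_)
print(cluster_sizes)

# Assign a new, made-up flower (values in cm) to its nearest cluster center.
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])
predicted_cluster = kmeans.predict(new_flower)
print(predicted_cluster)
```

Note that the cluster numbers (0, 1, 2) are arbitrary labels and can change between runs unless `random_state` is fixed.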
Step 5: Visualizing the Clusters
Finally, we can visualize the clusters that the k-means algorithm has generated. One way to do this is by plotting the data points and coloring them based on their assigned cluster.
# Visualize the clusters generated by k-means
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()
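In this code, we are plotting the sepal length on the x-axis and the sepal width on the y-axis, and coloring each point based on its assigned cluster. The clustering itself used all four measurements, so this plot is a two-dimensional view of the result.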
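A common addition to this plot is to mark the cluster centers. Here is a minimal sketch that overlays them on the same two sepal columns. The marker styling and the output filename `kmeans_clusters.png` are arbitrary choices, and `n_init` and `random_state` are fixed only for repeatability.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data
kmeans = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0).fit(data)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Columns 0 and 1 hold sepal length and sepal width.
ax.scatter(data[:, 0], data[:, 1], c=kmeans.labels_)
# Mark each cluster center, projected onto the same two columns.
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=100, label="Cluster centers")
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")
ax.legend()

# The filename is arbitrary; call plt.show() instead to view the plot interactively.
fig.savefig("kmeans_clusters.png")
```

Swapping the column indices 0 and 1 for 2 and 3 would plot petal length against petal width instead.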
Conclusion
In conclusion, k-means is a simple and widely used clustering algorithm, and knowing how to apply it is a useful skill in many kinds of data analysis. This was just one example, laid out concisely for those looking to jump in quickly and start exploring.