Clustering Text Documents using K-Means in Scikit Learn

Clustering text documents is a common task in natural language processing (NLP): related documents are grouped together based on their content. The k-means clustering technique is a popular solution to this problem. In this article, we'll demonstrate how to cluster text documents with k-means using Scikit Learn.

K-means clustering algorithm

The k-means algorithm is a popular unsupervised learning algorithm that organizes data points into groups based on similarity. The algorithm operates by iteratively assigning each data point to its nearest cluster centroid and then recalculating the centroids based on the newly formed clusters.
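
To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the idea. It is an illustration only, not Scikit Learn's implementation: the function name kmeans_sketch, the toy points, and the random initialisation from data points are all assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Toy k-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumption;
    # real implementations usually use smarter initialisation).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # A centroid with no points is left where it was.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs (hypothetical data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters, which is why the two blobs may come out as 0/1 or 1/0 depending on the starting centroids.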


Preprocessing describes the steps used to get data ready for machine learning or analysis. It frequently involves cleaning, transforming, and reformatting raw data and vectorizing it into a format suitable for further analysis or modeling.
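
As a rough illustration of the cleaning part, the sketch below lowercases text, strips everything that is not a letter, and collapses repeated whitespace. The helper clean_text and the sample strings are hypothetical and not part of the article's dataset. Note that TfidfVectorizer already lowercases and tokenizes by default, so a step like this may only be needed for noisier input.

```python
import re


def clean_text(text):
    """Lowercase, keep only letters and spaces, and squeeze whitespace."""
    text = text.lower()
    # Replace digits, punctuation, and other symbols with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw documents.
raw_docs = ["  Hello, WORLD!! ", "Scikit-Learn 1.x makes   clustering easy."]
cleaned = [clean_text(doc) for doc in raw_docs]
print(cleaned)
```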


  1. Loading or preparing the dataset [dataset link:]
  2. Preprocessing the text, in case the text is loaded from a file instead of being added to the code manually
  3. Vectorizing the text using TfidfVectorizer
  4. Reducing the dimension using PCA
  5. Clustering the documents
  6. Plotting the clusters using matplotlib
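
The steps above can be sketched end to end on a tiny hand-written corpus. The documents in toy_docs and the variable names are made up for illustration and are not the article's dataset; the parameter values are assumptions chosen for a small example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 1: a hypothetical six-document corpus covering two rough topics.
toy_docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the garden",
    "my pets are two cats and a dog",
    "stock markets fell as investors sold shares",
    "investors watch stock markets closely",
    "shares rallied when markets opened",
]

# Steps 2-3: TfidfVectorizer lowercases, tokenizes, and weights the terms.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Step 5: cluster the TF-IDF vectors into two groups.
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Step 4: project to 2-D with PCA; PCA needs a dense array.
toy_points = PCA(n_components=2).fit_transform(tfidf.toarray())

for doc, label in zip(toy_docs, toy_labels):
    print(label, doc)
```

In this sketch the clustering is run on the full TF-IDF vectors and PCA is used only to obtain 2-D coordinates for plotting; clustering the reduced data instead is also possible and may give different groups.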


import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt




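# Assumes df is a pandas DataFrame that has already been loaded from the
# dataset and has a 'headline' column.
sentence = df.headline

# Vectorize the documents with TF-IDF, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)

# Reduce the dimensionality of the data to 2 components using PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())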



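# Cluster the documents using k-means
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5,
                max_iter=500, random_state=42)
# Fit the model so that kmeans.labels_ is available below
kmeans.fit(vectorized_documents)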




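# Store each document together with its assigned cluster label
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_

# Print a random sample of the results
print(results.sample(5))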




# Plot the clusters in the 2-D PCA space
colors = ['red', 'green']
cluster = ['Not Sarcastic','Sarcastic']
for i in range(num_clusters):
    plt.scatter(reduced_data[kmeans.labels_ == i, 0],
                reduced_data[kmeans.labels_ == i, 1],
                s=10, color=colors[i],
                label=f' {cluster[i]}')
plt.legend()
plt.show()




                                               document  cluster
16263  examine finds majority of u.s. forex has touc...        0
5318   an open and private e-mail to hillary clinton ...        0
12994        it isn't only a muslim ban, it is a lot worse        0
5395   princeton college students confront college preside...        0
24591     why getting married might assist folks drink much less        0
Figure: Text clustering using KMeans

Last Updated : 09 Jun, 2023
