🧠 Amazon Product Review Clustering (Unsupervised Learning) with NLTK Preprocessing
Objective
This project aimed to uncover hidden patterns in Amazon product reviews using unsupervised machine learning. Review texts were converted into numerical vectors with TF-IDF and grouped with KMeans clustering, and the top keywords were extracted for each cluster. These keywords reveal the major themes and topics customers frequently discuss.
Tasks
- Preprocessed text using NLTK: tokenization, stopword removal, etc.
- Converted review text into TF-IDF vectors.
- Applied KMeans Clustering to group similar reviews.
- Extracted and displayed top keywords for each cluster.
- Visualized cluster insights to interpret major review themes.
Skills Learned
- Text pre-processing (tokenization, cleaning)
- Vectorizing text with TF-IDF
- Applying KMeans for unsupervised learning
- Evaluating clusters using inertia and the Elbow method
- Extracting meaningful keywords from clusters
- Interpreting clustering output for real-world product insights
Tools Used
- Python (Google Colab / Jupyter Notebook)
- Libraries: NLTK, Scikit-learn, Pandas, Matplotlib, SciPy, Seaborn, NumPy
- NLP Methods: TfidfVectorizer, KMeans
- Evaluation: Elbow Method (Inertia), Keyword Extraction
- Visualization: Bar charts of top terms per cluster
- AI Rapid Studio: for evaluation of the process and results
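
As a rough sketch of the Elbow method listed under evaluation, the snippet below fits KMeans for several values of k and records the inertia (within-cluster sum of squared distances) for each. The helper name `inertia_by_k`, the sample reviews, and the range of k values are assumptions for illustration, not the project's actual settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def inertia_by_k(X, k_values, random_state=42):
    """Fit KMeans once per k and return a {k: inertia} mapping."""
    inertias = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        km.fit(X)
        inertias[k] = float(km.inertia_)
    return inertias


# Hypothetical sample reviews, not taken from the project's dataset.
sample_reviews = [
    "Helped my hair growth after a few weeks",
    "Great for beard grooming and growth",
    "Left my hair oily and greasy",
    "Greasy residue, far too oily for me",
    "Easy to apply and simple to handle",
    "The applicator is easy to use",
]
X = TfidfVectorizer(stop_words="english").fit_transform(sample_reviews)

inertias = inertia_by_k(X, range(1, 5))
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.3f}")
```

Plotting k against inertia (for example with Matplotlib) and looking for the point where the curve flattens gives the "elbow"; the choice is a visual judgment, so nearby values of k are worth comparing.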
📈 Output
- Preprocessed 701,559 Amazon reviews for meaningful clustering.
- Chose the optimal number of clusters (e.g., k=5) using the Elbow method.
- Grouped reviews into 5 thematic clusters using KMeans.


- Top keywords per cluster, extracted for interpretation:
  - Cluster 0: Hair growth and grooming. Sentiment was neutral to slightly positive.
  - Cluster 1: Oily hair and greasiness. Sentiment was negative.
  - Cluster 2: Product effectiveness and use. Sentiment was mostly neutral.
  - Cluster 3: Usability and handling. Sentiment was neutral.
  - Cluster 4: Strong emotional reactions and brand affinity. Sentiment was highly positive.
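
Per-cluster keywords like these can be shown as bar charts of top terms. The sketch below is one possible way to draw them with Matplotlib; the function name `plot_top_terms` and the input format are assumptions, and the terms and weights are placeholders, not results from this project.

```python
import matplotlib.pyplot as plt


def plot_top_terms(top_terms):
    """Draw one horizontal bar chart per cluster.

    top_terms maps a cluster id to a list of (term, weight) pairs,
    for example the largest entries of each KMeans centroid.
    """
    n = len(top_terms)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3), squeeze=False)
    for ax, (cluster_id, pairs) in zip(axes[0], sorted(top_terms.items())):
        terms = [term for term, _ in pairs]
        weights = [weight for _, weight in pairs]
        ax.barh(terms, weights)
        ax.invert_yaxis()  # put the highest-weighted term at the top
        ax.set_title(f"Cluster {cluster_id}")
        ax.set_xlabel("Centroid TF-IDF weight")
    fig.tight_layout()
    return fig


# Placeholder terms and weights for illustration only.
placeholder_terms = {
    0: [("term_a", 0.30), ("term_b", 0.20), ("term_c", 0.10)],
    1: [("term_d", 0.25), ("term_e", 0.15), ("term_f", 0.05)],
}
fig = plot_top_terms(placeholder_terms)
```

In a notebook the figure renders inline; in a script, `plt.show()` or `fig.savefig(...)` would display or store it.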
