In this short guide, you'll learn how to build a text classification pipeline in Python using scikit-learn to predict whether short texts (here, tweets about airlines) are positive or negative — a real-world use case powering tools like Amazon review analysis, Twitter monitoring, and customer feedback systems.
What is Text Analysis?
Text analysis (or text mining) is the process of extracting meaningful patterns and insights from unstructured text data. It powers everything from Google's spam filters to Netflix's content recommendations. At its core, it turns raw strings into structured, machine-readable features.
Sentiment Analysis vs Text Classification
| | Text Classification | Sentiment Analysis |
|---|---|---|
| Goal | Assign categories (sports, tech, politics) | Detect opinion polarity (positive/negative/neutral) |
| Output | Multi-class label | Sentiment score or label |
| Example | Classifying BBC articles by topic | Detecting negative Yelp reviews |
Sentiment analysis is a subset of text classification. Both rely on the same pipeline — vectorization → model → prediction.
Step 1: Import Necessary Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Why these libraries?
- pandas — loads and manipulates tabular data with ease
- TfidfVectorizer — converts raw text into numerical features using TF-IDF (more on this in Step 3)
- LogisticRegression — fast, interpretable classifier that works well on text data out of the box
- train_test_split — splits data so we can evaluate on unseen examples
- classification_report — gives precision, recall, and F1 score per class
Step 2: Load Dataset
We'll use the Twitter US Airline Sentiment dataset from Kaggle — it contains real tweets directed at airlines like United, Delta, and American, labeled positive, negative, or neutral.
Download via Kaggle CLI:
pip install kaggle
kaggle datasets download -d crowdflower/twitter-airline-sentiment
unzip twitter-airline-sentiment.zip
You'll need a Kaggle account and an API token (~/.kaggle/kaggle.json).
df = pd.read_csv("Tweets.csv")
df = df[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)
print(df["label"].value_counts())
Output:
0 9178 ← negative
1 2363 ← positive
Each row is a tweet. label = 1 means positive, label = 0 means negative — a binary classification problem.
Step 3: Data Preprocessing
Data Cleaning
Raw tweets are noisy — URLs, @mentions, and special characters can hurt model accuracy.
import re
def clean_text(text):
    text = re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text)
    return text.lower().strip()
df["clean_text"] = df["text"].apply(clean_text)
TF-IDF Vectorization
Term Frequency-Inverse Document Frequency (TF-IDF) scores each word by how often it appears in a document vs. across all documents. Common words like "the" get low scores; meaningful words like "delayed" or "excellent" score higher.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
ngram_range=(1, 2) captures both single words ("bad") and bigrams ("really bad"), improving accuracy.
Step 4: Fit the Model for Classification
Split the data and train a Logistic Regression classifier:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Output:
LogisticRegression(max_iter=1000)
Training takes seconds on this dataset. For larger corpora (e.g., Amazon reviews with millions of rows), consider SGDClassifier, which supports online learning via partial_fit.
Step 5: Model Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
Output:

              precision    recall  f1-score   support

    Negative       0.88      0.96      0.92      1840
    Positive       0.77      0.50      0.61       471

    accuracy                           0.87      2311
The model achieves 87% accuracy. Note the lower recall on "Positive" — this is expected due to class imbalance (4x more negative tweets). We'll address this in Step 8.
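To see exactly where those Positive-class misses land, a confusion matrix helps. A standalone sketch on made-up data (in practice, pass y_test and model.predict(X_test) from above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# toy stand-ins for the real test set so the snippet runs on its own
texts = ["awful delay", "great crew", "terrible service",
         "lovely flight", "bad experience", "good experience"]
y = [0, 1, 0, 1, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, y)

# rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y, model.predict(X)))
```

For the airline data, the bottom-left cell (false negatives) is where the low Positive recall shows up: positive tweets the model labels negative.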
Step 6: Define a Function to Predict Class for New Text
def predict_sentiment(text):
    cleaned = clean_text(text)
    vector = vectorizer.transform([cleaned])
    prediction = model.predict(vector)[0]
    return "Positive 😊" if prediction == 1 else "Negative 😞"
print(predict_sentiment("Delta flight was smooth and crew was amazing!"))
print(predict_sentiment("United lost my luggage again, absolutely terrible service."))
Output:
Positive 😊
Negative 😞
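If you want a confidence score alongside the hard label, LogisticRegression exposes predict_proba. A standalone sketch on a tiny made-up training set (the function name is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# tiny made-up training set so this sketch runs standalone
texts = ["amazing smooth flight", "terrible lost luggage",
         "great helpful crew", "awful long delay"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

def predict_with_confidence(text):
    # predict_proba returns [P(negative), P(positive)] for classes [0, 1]
    proba = model.predict_proba(vec.transform([text]))[0]
    label = "Positive" if proba[1] >= 0.5 else "Negative"
    return label, round(float(max(proba)), 2)

print(predict_with_confidence("amazing crew"))
```

Exposing the probability lets downstream code route low-confidence predictions to human review instead of acting on them blindly.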
Step 7: Full Code Example
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
df = pd.read_csv("Tweets.csv")[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)
def clean_text(text):
    return re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text).lower().strip()
df["clean_text"] = df["text"].apply(clean_text)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), target_names=["Negative", "Positive"]))
def predict_sentiment(text):
    vector = vectorizer.transform([clean_text(text)])
    return "Positive" if model.predict(vector)[0] == 1 else "Negative"
print(predict_sentiment("American Airlines staff were so helpful and friendly!"))
Output:

              precision    recall  f1-score   support

    Negative       0.88      0.96      0.92      1840
    Positive       0.77      0.50      0.61       471

    accuracy                           0.87      2311

Positive
Step 8: Further Improvements and Optimization
- Handle class imbalance — pass class_weight='balanced' to LogisticRegression so the model penalizes mistakes on the minority class more heavily.
- Try other classifiers — LinearSVC and MultinomialNB often outperform Logistic Regression on short texts like tweets. Swap them in with minimal code changes.
- Use spaCy for preprocessing — replace the regex cleaner with spaCy lemmatization to normalize words like "running" → "run", boosting feature quality.
- Add a scikit-learn Pipeline — chain TfidfVectorizer and your classifier into a single Pipeline object for cleaner code and easier cross-validation with GridSearchCV.
- Scale up with BERT — for production-grade accuracy, replace TF-IDF + Logistic Regression with a pretrained transformer from HuggingFace transformers; on sentiment tasks this is often worth several points of accuracy.
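The class-imbalance and Pipeline tips combine naturally. A sketch on a made-up toy corpus (the grid values are illustrative, not tuned; in practice fit on df["clean_text"] and df["label"]):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy corpus so the snippet runs standalone
texts = ["great flight", "awful delay", "loved it",
         "worst airline", "so helpful", "never again"] * 3
labels = [1, 0, 1, 0, 1, 0] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# step-name prefixes ("tfidf__", "clf__") address parameters inside the pipeline
grid = GridSearchCV(
    pipe,
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1",
)
grid.fit(texts, labels)
print(grid.best_params_)
```

Because vectorization happens inside the pipeline, each cross-validation fold refits TF-IDF on its own training split, which avoids leaking test vocabulary into training.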