In this short guide, we'll learn how to build a text classification pipeline in Python using scikit-learn to predict whether short texts such as tweets and product reviews are positive or negative — a real-world use case powering tools like Amazon review analysis, Twitter monitoring, and customer feedback systems.

What is Text Analysis?

Text analysis (or text mining) is the process of extracting meaningful patterns and insights from unstructured text data. It powers everything from Google's spam filters to Netflix's content recommendations. At its core, it turns raw strings into structured, machine-readable features.

Sentiment Analysis vs Text Classification

          Text Classification                           Sentiment Analysis
Goal      Assign categories (sports, tech, politics)    Detect opinion polarity (positive/negative/neutral)
Output    Multi-class label                             Sentiment score or label
Example   Classifying BBC articles by topic             Detecting negative Yelp reviews

Sentiment analysis is a subset of text classification. Both rely on the same pipeline — vectorization → model → prediction.

Step 1: Import Necessary Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Why these libraries?

  • pandas — loads and manipulates tabular data with ease
  • TfidfVectorizer — converts raw text into numerical features using TF-IDF (more on this in Step 3)
  • LogisticRegression — fast, interpretable classifier that works well on text data out of the box
  • train_test_split — splits data so we can evaluate on unseen examples
  • classification_report — gives precision, recall, and F1 score per class

Step 2: Load Dataset

We'll use the Twitter US Airline Sentiment dataset from Kaggle — it contains real tweets directed at airlines like United, Delta, and American, labeled positive, negative, or neutral.

Download via Kaggle CLI:

pip install kaggle
kaggle datasets download -d crowdflower/twitter-airline-sentiment
unzip twitter-airline-sentiment.zip

You'll need a Kaggle account and an API token (~/.kaggle/kaggle.json).

df = pd.read_csv("Tweets.csv")
df = df[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)
print(df["label"].value_counts())

Output:

0    9178   ← negative
1    2363   ← positive

Each row is a tweet. label = 1 means positive, label = 0 means negative — a binary classification problem.

Step 3: Data Preprocessing

Data Cleaning

Raw tweets are noisy — URLs, @mentions, and special characters can hurt model accuracy.

import re

def clean_text(text):
    # strip URLs, @mentions, and any non-letter characters, then lowercase
    text = re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text)
    return text.lower().strip()

df["clean_text"] = df["text"].apply(clean_text)

TF-IDF Vectorization

Term Frequency-Inverse Document Frequency (TF-IDF) weights each word by how often it appears within a document, discounted by how many documents it appears in across the whole corpus. Common words like "the" get low scores; discriminative words like "delayed" or "excellent" score higher.
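
To see the weighting in action, here is a tiny corpus invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the flight was delayed", "the crew was excellent", "the flight was excellent"]
toy_vec = TfidfVectorizer()
toy_vec.fit(docs)
# "the" and "was" appear in every document, so their IDF weight is the lowest;
# "delayed" and "crew" appear in only one document each, so they score highest
for word, idf in zip(toy_vec.get_feature_names_out(), toy_vec.idf_):
    print(f"{word}: idf={idf:.2f}")

The same idea, applied to the cleaned tweets: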

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]

ngram_range=(1, 2) captures both single words ("bad") and bigrams ("really bad"), improving accuracy.
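
You can inspect the learned vocabulary to confirm both kinds of features are present:

# features are sorted alphabetically; bigrams show up as two words separated by a space
print(vectorizer.get_feature_names_out()[:20])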

Step 4: Fit the Model for Classification

Split the data and train a Logistic Regression classifier:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Output:

LogisticRegression(max_iter=1000)

Training takes seconds on this dataset. For larger corpora (e.g., Amazon reviews with millions of rows), consider SGDClassifier, which supports incremental (online) learning via partial_fit.
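
A drop-in sketch is shown below (the "log_loss" option assumes a recent scikit-learn release; older versions call it "log"):

from sklearn.linear_model import SGDClassifier

# trains a logistic-regression-style model with stochastic gradient descent;
# partial_fit lets you feed the data in mini-batches when it doesn't fit in memory
sgd = SGDClassifier(loss="log_loss", random_state=42)
sgd.fit(X_train, y_train)
print(sgd.score(X_test, y_test))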

Step 5: Model Evaluation

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))

Output:

              precision    recall  f1-score   support

    Negative       0.88      0.96      0.92      1840
    Positive       0.77      0.50      0.61       471

    accuracy                           0.87      2311

The model achieves 87% accuracy. Note the lower recall on "Positive" — this is expected due to class imbalance (4x more negative tweets). We'll address this in Step 8.

Step 6: Define a Function to Predict Class for New Text

def predict_sentiment(text):
    cleaned = clean_text(text)
    vector = vectorizer.transform([cleaned])
    prediction = model.predict(vector)[0]
    return "Positive 😊" if prediction == 1 else "Negative 😞"

print(predict_sentiment("Delta flight was smooth and crew was amazing!"))
print(predict_sentiment("United lost my luggage again, absolutely terrible service."))

Output:

Positive 😊
Negative 😞

Step 7: Full Code Example

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("Tweets.csv")[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)

def clean_text(text):
    return re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text).lower().strip()

df["clean_text"] = df["text"].apply(clean_text)

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test), target_names=["Negative", "Positive"]))

def predict_sentiment(text):
    vector = vectorizer.transform([clean_text(text)])
    return "Positive" if model.predict(vector)[0] == 1 else "Negative"

print(predict_sentiment("American Airlines staff were so helpful and friendly!"))

Output:

              precision    recall  f1-score   support

    Negative       0.99      0.90      0.95      2037
    Positive       0.57      0.93      0.70       272

    accuracy                           0.91      2309
   macro avg       0.78      0.92      0.82      2309
weighted avg       0.94      0.91      0.92      2309

Positive

Step 8: Further Improvements and Optimization

Handle class imbalance — pass class_weight='balanced' to LogisticRegression so the model penalizes mistakes on the minority class more heavily.
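
The change is a single argument:

# 'balanced' reweights classes inversely to their frequency in the training data
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)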

Try other classifiers — LinearSVC and MultinomialNB often outperform Logistic Regression on short texts like tweets. Swap them in with minimal code changes.
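
For example, either model accepts the same sparse TF-IDF matrix:

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# both are drop-in replacements for LogisticRegression on the TF-IDF features
svc = LinearSVC().fit(X_train, y_train)
nb = MultinomialNB().fit(X_train, y_train)
print("LinearSVC:", svc.score(X_test, y_test))
print("MultinomialNB:", nb.score(X_test, y_test))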

Use spaCy for preprocessing — replace the regex cleaner with spaCy lemmatization to normalize words like "running" → "run", boosting feature quality.
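
A minimal sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

# parser and NER are disabled for speed; only the lemmatizer is needed here
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    doc = nlp(clean_text(text))
    # keep alphabetic tokens, drop stop words, and map each word to its lemma
    return " ".join(tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop)

df["clean_text"] = df["text"].apply(lemmatize)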

Add a scikit-learn Pipeline — chain TfidfVectorizer and your classifier into a single Pipeline object for cleaner code and easier cross-validation with GridSearchCV.
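
A sketch of the combined object, with a small hyperparameter grid chosen here just for illustration:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# the step-name__parameter syntax routes each grid entry to the right pipeline step
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(df["clean_text"], y)
print(search.best_params_, search.best_score_)

Because vectorization happens inside each cross-validation fold, no vocabulary from the held-out fold leaks into training.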

Scale up with BERT — for production-grade accuracy, replace TF-IDF + Logistic Regression with a pretrained transformer from HuggingFace transformers. Fine-tuned transformers typically add several points of accuracy over TF-IDF baselines on sentiment tasks.
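
A quick sketch with the high-level pipeline API (the default English sentiment model is downloaded on first use; fine-tuning on your own labeled tweets is what usually delivers the larger gains):

from transformers import pipeline

# loads a default pretrained sentiment model; no training on the airline tweets here
sentiment = pipeline("sentiment-analysis")
print(sentiment("United lost my luggage again, absolutely terrible service."))
# returns a list of dicts with a 'label' and a confidence 'score'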