In this short guide, you'll learn how to build a text classification pipeline in Python using scikit-learn to predict whether short texts (here, tweets about airlines) are positive or negative — a real-world use case powering tools like Amazon review analysis, Twitter monitoring, and customer feedback systems.
What is Text Analysis?
Text analysis (or text mining) is the process of extracting meaningful patterns and insights from unstructured text data. It powers everything from Google's spam filters to Netflix's content recommendations. At its core, it turns raw strings into structured, machine-readable features.
Sentiment Analysis vs Text Classification
| | Text Classification | Sentiment Analysis |
|---|---|---|
| Goal | Assign categories (sports, tech, politics) | Detect opinion polarity (positive/negative/neutral) |
| Output | Multi-class label | Sentiment score or label |
| Example | Classifying BBC articles by topic | Detecting negative Yelp reviews |
Sentiment analysis is a subset of text classification. Both rely on the same pipeline — vectorization → model → prediction.
Step 1: Import Necessary Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Why these libraries?
- pandas — loads and manipulates tabular data with ease
- TfidfVectorizer — converts raw text into numerical features using TF-IDF (more on this in Step 3)
- LogisticRegression — fast, interpretable classifier that works well on text data out of the box
- train_test_split — splits data so we can evaluate on unseen examples
- classification_report — gives precision, recall, and F1 score per class
Step 2: Load Dataset
We'll use the Twitter US Airline Sentiment dataset from Kaggle — it contains real tweets directed at airlines like United, Delta, and American, labeled positive, negative, or neutral.
Download via Kaggle CLI:
pip install kaggle
kaggle datasets download -d crowdflower/twitter-airline-sentiment
unzip twitter-airline-sentiment.zip
You'll need a Kaggle account and an API token (~/.kaggle/kaggle.json).
df = pd.read_csv("Tweets.csv")
df = df[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)
print(df["label"].value_counts())
Output:
0 9178 ← negative
1 2363 ← positive
Each row is a tweet. label = 1 means positive, label = 0 means negative — a binary classification problem.
Step 3: Data Preprocessing
Data Cleaning
Raw tweets are noisy — URLs, @mentions, and special characters can hurt model accuracy.
import re
def clean_text(text):
    text = re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text)
    return text.lower().strip()
df["clean_text"] = df["text"].apply(clean_text)
TF-IDF Vectorization
Term Frequency-Inverse Document Frequency (TF-IDF) scores each word by how often it appears in a document vs. across all documents. Common words like "the" get low scores; meaningful words like "delayed" or "excellent" score higher.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
ngram_range=(1, 2) captures both single words ("bad") and bigrams ("really bad"), improving accuracy.
Step 4: Fit the Model for Classification
Split the data and train a Logistic Regression classifier:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Output:
LogisticRegression(max_iter=1000)
Training takes seconds on this dataset. For larger corpora (e.g., Amazon reviews with millions of rows), consider SGDClassifier, which supports online learning via partial_fit.
Step 5: Model Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
Output:

              precision    recall  f1-score   support

    Negative       0.88      0.96      0.92      1840
    Positive       0.77      0.50      0.61       471

    accuracy                           0.87      2311
The model achieves 87% accuracy. Note the lower recall on "Positive" — this is expected due to class imbalance (4x more negative tweets). We'll address this in Step 8.
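To see exactly where those Positive-class misses land, a confusion matrix helps. A standalone sketch on made-up data (in practice, pass y_test and model.predict(X_test) from above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# toy stand-ins for the real test set so the snippet runs on its own
texts = ["awful delay", "great crew", "terrible service",
         "lovely flight", "bad experience", "good experience"]
y = [0, 1, 0, 1, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, y)

# rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y, model.predict(X)))
```

For the airline data, the bottom-left cell (false negatives) is where the low Positive recall shows up: positive tweets the model labels negative.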
Step 6: Define a Function to Predict Class for New Text
def predict_sentiment(text):
    cleaned = clean_text(text)
    vector = vectorizer.transform([cleaned])
    prediction = model.predict(vector)[0]
    return "Positive 😊" if prediction == 1 else "Negative 😞"
print(predict_sentiment("Delta flight was smooth and crew was amazing!"))
print(predict_sentiment("United lost my luggage again, absolutely terrible service."))
Output:
Positive 😊
Negative 😞
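If you want a confidence score alongside the hard label, LogisticRegression exposes predict_proba. A standalone sketch on a tiny made-up training set (the function name is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# tiny made-up training set so this sketch runs standalone
texts = ["amazing smooth flight", "terrible lost luggage",
         "great helpful crew", "awful long delay"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

def predict_with_confidence(text):
    # predict_proba returns [P(negative), P(positive)] for classes [0, 1]
    proba = model.predict_proba(vec.transform([text]))[0]
    label = "Positive" if proba[1] >= 0.5 else "Negative"
    return label, round(float(max(proba)), 2)

print(predict_with_confidence("amazing crew"))
```

Exposing the probability lets downstream code route low-confidence predictions to human review instead of acting on them blindly.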
Step 7: Full Code Example
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
df = pd.read_csv("Tweets.csv")[["text", "airline_sentiment"]]
df = df[df["airline_sentiment"] != "neutral"]
df["label"] = (df["airline_sentiment"] == "positive").astype(int)
def clean_text(text):
    return re.sub(r"http\S+|@\w+|[^a-zA-Z\s]", "", text).lower().strip()
df["clean_text"] = df["text"].apply(clean_text)
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), target_names=["Negative", "Positive"]))
def predict_sentiment(text):
    vector = vectorizer.transform([clean_text(text)])
    return "Positive" if model.predict(vector)[0] == 1 else "Negative"
print(predict_sentiment("American Airlines staff were so helpful and friendly!"))
Output:

              precision    recall  f1-score   support

    Negative       0.88      0.96      0.92      1840
    Positive       0.77      0.50      0.61       471

    accuracy                           0.87      2311

Positive
Step 8: Further Improvements and Optimization
- Handle class imbalance — pass class_weight='balanced' to LogisticRegression so the model penalizes mistakes on the minority class more heavily.
- Try other classifiers — LinearSVC and MultinomialNB often outperform Logistic Regression on short texts like tweets. Swap them in with minimal code changes.
- Use spaCy for preprocessing — replace the regex cleaner with spaCy lemmatization to normalize words like "running" → "run", boosting feature quality.
- Add a scikit-learn Pipeline — chain TfidfVectorizer and your classifier into a single Pipeline object for cleaner code and easier cross-validation with GridSearchCV.
- Scale up with BERT — for production-grade accuracy, replace TF-IDF + Logistic Regression with a pretrained transformer from HuggingFace transformers; on sentiment tasks this is often worth several points of accuracy.
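The class-imbalance and Pipeline tips combine naturally. A sketch on a made-up toy corpus (the grid values are illustrative, not tuned; in practice fit on df["clean_text"] and df["label"]):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy corpus so the snippet runs standalone
texts = ["great flight", "awful delay", "loved it",
         "worst airline", "so helpful", "never again"] * 3
labels = [1, 0, 1, 0, 1, 0] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# step-name prefixes ("tfidf__", "clf__") address parameters inside the pipeline
grid = GridSearchCV(
    pipe,
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1",
)
grid.fit(texts, labels)
print(grid.best_params_)
```

Because vectorization happens inside the pipeline, each cross-validation fold refits TF-IDF on its own training split, which avoids leaking test vocabulary into training.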