In this short guide, you'll see how to count n-grams and calculate their frequency in text using Python.
Here you can find the short answer:
(1) Using NLTK ngrams
from nltk import ngrams
from collections import Counter
bigrams = list(ngrams(tokens, 2))  # tokens is a list of words
bigram_freq = Counter(bigrams)
(2) Using scikit-learn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)  # texts is a list of strings
(3) Using pandas for analysis
import pandas as pd
df = pd.DataFrame(bigram_freq.most_common(), columns=['bigram', 'count'])
Now let's go through several useful examples of how to extract and count n-grams from text data.
Suppose you have text like:
"Apple Inc is a technology company. Apple products include iPhone and MacBook."
1: Count bigrams using NLTK
Let's start with the most common use case: counting bigrams (2-word sequences) in text:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
# you may need nltk.download('punkt') before the first call to word_tokenize
text = "Apple Inc is a technology company. Apple products include iPhone and MacBook."
tokens = word_tokenize(text.lower())
bigrams = list(ngrams(tokens, 2))
bigram_freq = Counter(bigrams)
print("Top 5 bigrams:")
for bigram, count in bigram_freq.most_common(5):
    print(f"{bigram}: {count}")
result:
Top 5 bigrams:
('apple', 'inc'): 1
('inc', 'is'): 1
('is', 'a'): 1
('a', 'technology'): 1
('technology', 'company'): 1
This method creates consecutive word pairs from your text. N-grams are fundamental for text analysis, language modeling, and feature extraction in NLP tasks.
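Under the hood, ngrams simply slides a fixed-size window over the token list. If you want to see the mechanics (or skip the NLTK import for this step), here is a minimal pure-Python sketch of the same idea, assuming tokens is already a list of words:
from collections import Counter
def count_ngrams(tokens, n=2):
    # slide a window of size n over the tokens and count each tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))
tokens = "apple inc is a technology company".split()
print(count_ngrams(tokens, 2).most_common(3))
# [(('apple', 'inc'), 1), (('inc', 'is'), 1), (('is', 'a'), 1)]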
What if you want trigrams (3-word sequences)? Simply change the n parameter:
trigrams = list(ngrams(tokens, 3))
trigram_freq = Counter(trigrams)
print("\nTop 3 trigrams:")
for trigram, count in trigram_freq.most_common(3):
    print(f"{trigram}: {count}")
result:
Top 3 trigrams:
('apple', 'inc', 'is'): 1
('inc', 'is', 'a'): 1
('is', 'a', 'technology'): 1
2: Count n-grams with frequency filtering
For real-world text analysis, you typically want to filter n-grams by minimum frequency and remove stopwords:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string
# you may need nltk.download('stopwords') and nltk.download('punkt') on first use
text = """
Apple Inc is the world's largest technology company.
Apple products are known worldwide. The company creates innovative products.
Apple continues to lead the technology industry. The technology company is known for Apple products
"""
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())
# drop stopwords and punctuation before building n-grams
filtered_tokens = [
    word for word in tokens
    if word not in stop_words and word not in string.punctuation
]
bigrams = list(ngrams(filtered_tokens, 2))
bigram_freq = Counter(bigrams)
min_frequency = 2
frequent_bigrams = {k: v for k, v in bigram_freq.items() if v >= min_frequency}
print(f"Bigrams appearing at least {min_frequency} times:")
for bigram, count in sorted(frequent_bigrams.items(), key=lambda x: x[1], reverse=True):
    print(f"{' '.join(bigram)}: {count}")
result:
Bigrams appearing at least 2 times:
apple products: 2
technology company: 2
This approach removes common words and punctuation so the counts focus on meaningful word combinations, which is essential for keyword extraction and phrase mining.
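If you run this kind of analysis often, it's handy to wrap the pipeline in a helper. The sketch below does that; extract_frequent_ngrams is a name introduced here for illustration, not an NLTK function:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string
def extract_frequent_ngrams(text, n=2, min_freq=2):
    # tokenize, drop stopwords/punctuation, keep n-grams seen at least min_freq times
    stop_words = set(stopwords.words('english'))
    tokens = [
        w for w in word_tokenize(text.lower())
        if w not in stop_words and w not in string.punctuation
    ]
    counts = Counter(ngrams(tokens, n))
    return {' '.join(g): c for g, c in counts.items() if c >= min_freq}
print(extract_frequent_ngrams(text))  # using the text from the example above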
3: Create n-gram frequency DataFrame
Convert n-gram counts to a pandas DataFrame for easier analysis and visualization:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
import pandas as pd
reviews = [
    "Great product quality amazing",
    "Product quality is excellent",
    "Amazing quality great product",
    "Excellent product great quality"
]
all_bigrams = []
for review in reviews:
    tokens = word_tokenize(review.lower())
    bigrams = list(ngrams(tokens, 2))
    all_bigrams.extend(bigrams)
bigram_freq = Counter(all_bigrams)
df = pd.DataFrame(bigram_freq.most_common(), columns=['bigram', 'frequency'])
# split each bigram tuple into its own columns for easier filtering
df['word1'] = df['bigram'].apply(lambda x: x[0])
df['word2'] = df['bigram'].apply(lambda x: x[1])
print(df.head(10))
result:
| bigram | frequency | word1 | word2 |
|---|---|---|---|
| ('great', 'product') | 2 | great | product |
| ('product', 'quality') | 2 | product | quality |
| ('quality', 'amazing') | 1 | quality | amazing |
| ('quality', 'is') | 1 | quality | is |
| ('is', 'excellent') | 1 | is | excellent |
| ('amazing', 'quality') | 1 | amazing | quality |
| ('quality', 'great') | 1 | quality | great |
| ('excellent', 'product') | 1 | excellent | product |
| ('product', 'great') | 1 | product | great |
| ('great', 'quality') | 1 | great | quality |
This format makes it easy to sort, filter, and visualize n-gram patterns in your data.
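Once the counts are in a DataFrame, the usual pandas operations apply. For example, using the df built above, you can filter by the leading word or aggregate frequencies per word:
# all bigrams that start with 'product'
print(df[df['word1'] == 'product'])
# total bigram frequency for each leading word
print(df.groupby('word1')['frequency'].sum().sort_values(ascending=False))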
4: Count n-grams using scikit-learn
For machine learning applications, use CountVectorizer to create n-gram feature matrices:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
texts = [
    "Apple iPhone is great",
    "Samsung Galaxy is awesome",
    "iPhone camera quality excellent",
    "Galaxy camera is good"
]
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)
feature_names = vectorizer.get_feature_names_out()
# sum each bigram's count across all documents
bigram_counts = X.toarray().sum(axis=0)
df = pd.DataFrame({
    'bigram': feature_names,
    'frequency': bigram_counts
}).sort_values('frequency', ascending=False)
print(df.head())
result (each of the 12 bigrams in these texts occurs exactly once, so every frequency is 1 and the tie order may vary):
           bigram  frequency
0    apple iphone          1
1       camera is          1
2  camera quality          1
3   galaxy camera          1
4       galaxy is          1
This method is optimized for ML pipelines and integrates seamlessly with scikit-learn classifiers and vectorization workflows.
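Because CountVectorizer is a regular scikit-learn transformer, the n-gram counts can feed straight into a classifier. Here is a minimal sketch, assuming hypothetical brand labels for the four texts above:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
labels = ['apple', 'samsung', 'apple', 'samsung']  # hypothetical labels, one per text
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["iPhone camera is great"]))
Note that ngram_range=(1, 2) keeps unigrams alongside bigrams, which usually gives a classifier more signal than bigrams alone.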
5: Character n-grams for text similarity
Beyond word n-grams, character n-grams are useful for spell checking, language detection, and fuzzy matching:
from nltk import ngrams
from collections import Counter
def get_char_ngrams(text, n=3):
    # lowercase, remove spaces, then count overlapping character n-grams
    text = text.lower().replace(' ', '')
    char_ngrams = [''.join(gram) for gram in ngrams(text, n)]
    return Counter(char_ngrams)
company1 = "Microsoft"
company2 = "Microsft"
ngrams1 = get_char_ngrams(company1, 3)
ngrams2 = get_char_ngrams(company2, 3)
print(f"Character trigrams for '{company1}':")
print(ngrams1.most_common(5))
print(f"\nCharacter trigrams for '{company2}':")
print(ngrams2.most_common(5))
result:
Character trigrams for 'Microsoft':
[('mic', 1), ('icr', 1), ('cro', 1), ('ros', 1), ('oso', 1)]
Character trigrams for 'Microsft':
[('mic', 1), ('icr', 1), ('cro', 1), ('ros', 1), ('osf', 1)]
Character n-grams help detect typos and similar words by comparing overlapping character sequences.
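To turn that overlap into a similarity score, you can compare the two trigram sets directly. Here is a minimal sketch using Jaccard similarity (shared trigrams divided by all distinct trigrams):
def jaccard_similarity(counter1, counter2):
    # ratio of shared n-grams to all distinct n-grams in either string
    set1, set2 = set(counter1), set(counter2)
    return len(set1 & set2) / len(set1 | set2)
print(jaccard_similarity(ngrams1, ngrams2))  # ~0.44: high overlap despite the typo
Scores close to 1 mean near-identical strings, which makes this a simple building block for fuzzy matching.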