In this short guide, you'll see how to count n-grams and calculate their frequency in text using Python.
Here you can find the short answer:
(1) Using NLTK ngrams
from nltk import ngrams
from collections import Counter
bigrams = list(ngrams(tokens, 2))  # tokens is a list of words
bigram_freq = Counter(bigrams)
(2) Using scikit-learn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)  # texts is a list of strings
(3) Using pandas for analysis
import pandas as pd
df = pd.DataFrame(bigram_freq.most_common(), columns=['bigram', 'count'])
Now let's go through several useful examples of how to extract and count n-grams from text data.
Suppose you have text like:
"Apple Inc is a technology company. Apple products include iPhone and MacBook."
1: Count bigrams using NLTK
Let's start with the most common use case: counting bigrams (2-word sequences) in text:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
# you may need nltk.download('punkt') before the first call to word_tokenize
text = "Apple Inc is a technology company. Apple products include iPhone and MacBook."
tokens = word_tokenize(text.lower())
bigrams = list(ngrams(tokens, 2))
bigram_freq = Counter(bigrams)
print("Top 5 bigrams:")
for bigram, count in bigram_freq.most_common(5):
    print(f"{bigram}: {count}")
result:
Top 5 bigrams:
('apple', 'inc'): 1
('inc', 'is'): 1
('is', 'a'): 1
('a', 'technology'): 1
('technology', 'company'): 1
This method creates consecutive word pairs from your text. N-grams are fundamental for text analysis, language modeling, and feature extraction in NLP tasks.
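Under the hood, ngrams simply slides a fixed-size window over the token list. If you want to see the mechanics (or skip the NLTK import for this step), here is a minimal pure-Python sketch of the same idea, assuming tokens is already a list of words:
from collections import Counter
def count_ngrams(tokens, n=2):
    # slide a window of size n over the tokens and count each tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))
tokens = "apple inc is a technology company".split()
print(count_ngrams(tokens, 2).most_common(3))
# [(('apple', 'inc'), 1), (('inc', 'is'), 1), (('is', 'a'), 1)]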
What if you want trigrams (3-word sequences)? Simply change the n parameter:
trigrams = list(ngrams(tokens, 3))
trigram_freq = Counter(trigrams)
print("\nTop 3 trigrams:")
for trigram, count in trigram_freq.most_common(3):
    print(f"{trigram}: {count}")
result:
Top 3 trigrams:
('apple', 'inc', 'is'): 1
('inc', 'is', 'a'): 1
('is', 'a', 'technology'): 1
2: Count n-grams with frequency filtering
For real-world text analysis, you typically want to filter n-grams by minimum frequency and remove stopwords:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string
# you may need nltk.download('stopwords') and nltk.download('punkt') on first use
text = """
Apple Inc is the world's largest technology company.
Apple products are known worldwide. The company creates innovative products.
Apple continues to lead the technology industry. The technology company is known for Apple products
"""
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text.lower())
# drop stopwords and punctuation before building n-grams
filtered_tokens = [
    word for word in tokens
    if word not in stop_words and word not in string.punctuation
]
bigrams = list(ngrams(filtered_tokens, 2))
bigram_freq = Counter(bigrams)
min_frequency = 2
frequent_bigrams = {k: v for k, v in bigram_freq.items() if v >= min_frequency}
print(f"Bigrams appearing at least {min_frequency} times:")
for bigram, count in sorted(frequent_bigrams.items(), key=lambda x: x[1], reverse=True):
    print(f"{' '.join(bigram)}: {count}")
result:
Bigrams appearing at least 2 times:
apple products: 2
technology company: 2
This approach removes common words and punctuation so the counts focus on meaningful word combinations, which is essential for keyword extraction and phrase mining.
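If you run this kind of analysis often, it's handy to wrap the pipeline in a helper. The sketch below does that; extract_frequent_ngrams is a name introduced here for illustration, not an NLTK function:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string
def extract_frequent_ngrams(text, n=2, min_freq=2):
    # tokenize, drop stopwords/punctuation, keep n-grams seen at least min_freq times
    stop_words = set(stopwords.words('english'))
    tokens = [
        w for w in word_tokenize(text.lower())
        if w not in stop_words and w not in string.punctuation
    ]
    counts = Counter(ngrams(tokens, n))
    return {' '.join(g): c for g, c in counts.items() if c >= min_freq}
print(extract_frequent_ngrams(text))  # using the text from the example above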
3: Create n-gram frequency DataFrame
Convert n-gram counts to a pandas DataFrame for easier analysis and visualization:
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
import pandas as pd
reviews = [
    "Great product quality amazing",
    "Product quality is excellent",
    "Amazing quality great product",
    "Excellent product great quality"
]
all_bigrams = []
for review in reviews:
    tokens = word_tokenize(review.lower())
    bigrams = list(ngrams(tokens, 2))
    all_bigrams.extend(bigrams)
bigram_freq = Counter(all_bigrams)
df = pd.DataFrame(bigram_freq.most_common(), columns=['bigram', 'frequency'])
# split each bigram tuple into its own columns for easier filtering
df['word1'] = df['bigram'].apply(lambda x: x[0])
df['word2'] = df['bigram'].apply(lambda x: x[1])
print(df.head(10))
result:
| bigram | frequency | word1 | word2 |
|---|---|---|---|
| ('great', 'product') | 2 | great | product |
| ('product', 'quality') | 2 | product | quality |
| ('quality', 'amazing') | 1 | quality | amazing |
| ('quality', 'is') | 1 | quality | is |
| ('is', 'excellent') | 1 | is | excellent |
| ('amazing', 'quality') | 1 | amazing | quality |
| ('quality', 'great') | 1 | quality | great |
| ('excellent', 'product') | 1 | excellent | product |
| ('product', 'great') | 1 | product | great |
| ('great', 'quality') | 1 | great | quality |
This format makes it easy to sort, filter, and visualize n-gram patterns in your data.
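Once the counts are in a DataFrame, the usual pandas operations apply. For example, using the df built above, you can filter by the leading word or aggregate frequencies per word:
# all bigrams that start with 'product'
print(df[df['word1'] == 'product'])
# total bigram frequency for each leading word
print(df.groupby('word1')['frequency'].sum().sort_values(ascending=False))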
4: Count n-grams using scikit-learn
For machine learning applications, use CountVectorizer to create n-gram feature matrices:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
texts = [
    "Apple iPhone is great",
    "Samsung Galaxy is awesome",
    "iPhone camera quality excellent",
    "Galaxy camera is good"
]
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)
feature_names = vectorizer.get_feature_names_out()
# sum each bigram's count across all documents
bigram_counts = X.toarray().sum(axis=0)
df = pd.DataFrame({
    'bigram': feature_names,
    'frequency': bigram_counts
}).sort_values('frequency', ascending=False)
print(df.head())
result (each of the 12 bigrams in these texts occurs exactly once, so every frequency is 1 and the tie order may vary):
           bigram  frequency
0    apple iphone          1
1       camera is          1
2  camera quality          1
3   galaxy camera          1
4       galaxy is          1
This method is optimized for ML pipelines and integrates seamlessly with scikit-learn classifiers and vectorization workflows.
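Because CountVectorizer is a regular scikit-learn transformer, the n-gram counts can feed straight into a classifier. Here is a minimal sketch, assuming hypothetical brand labels for the four texts above:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
labels = ['apple', 'samsung', 'apple', 'samsung']  # hypothetical labels, one per text
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["iPhone camera is great"]))
Note that ngram_range=(1, 2) keeps unigrams alongside bigrams, which usually gives a classifier more signal than bigrams alone.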
5: Character n-grams for text similarity
Beyond word n-grams, character n-grams are useful for spell checking, language detection, and fuzzy matching:
from nltk import ngrams
from collections import Counter
def get_char_ngrams(text, n=3):
    # lowercase, remove spaces, then count overlapping character n-grams
    text = text.lower().replace(' ', '')
    char_ngrams = [''.join(gram) for gram in ngrams(text, n)]
    return Counter(char_ngrams)
company1 = "Microsoft"
company2 = "Microsft"
ngrams1 = get_char_ngrams(company1, 3)
ngrams2 = get_char_ngrams(company2, 3)
print(f"Character trigrams for '{company1}':")
print(ngrams1.most_common(5))
print(f"\nCharacter trigrams for '{company2}':")
print(ngrams2.most_common(5))
result:
Character trigrams for 'Microsoft':
[('mic', 1), ('icr', 1), ('cro', 1), ('ros', 1), ('oso', 1)]
Character trigrams for 'Microsft':
[('mic', 1), ('icr', 1), ('cro', 1), ('ros', 1), ('osf', 1)]
Character n-grams help detect typos and similar words by comparing overlapping character sequences.
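To turn that overlap into a similarity score, you can compare the two trigram sets directly. Here is a minimal sketch using Jaccard similarity (shared trigrams divided by all distinct trigrams):
def jaccard_similarity(counter1, counter2):
    # ratio of shared n-grams to all distinct n-grams in either string
    set1, set2 = set(counter1), set(counter2)
    return len(set1 & set2) / len(set1 | set2)
print(jaccard_similarity(ngrams1, ngrams2))  # ~0.44: high overlap despite the typo
Scores close to 1 mean near-identical strings, which makes this a simple building block for fuzzy matching.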