Abbreviation detection: abbreviations to full form in Python

In this short tutorial, you'll see how to do abbreviation/acronym detection and matching in Python. We will try to map abbreviations or acronyms to the full form of the names.

We will use 3 different ways for mapping and detection:

  • regex builder
  • using difflib library
  • nlp abbreviation detection

Regex to detect abbreviations/acronyms

To detect abbreviations using regex in Python we can try regex like:

  • r"\b[A-Z]{2,}\b" - capital letters only
  • r"\b[A-Z\.]{2,}\b" - abbreviations plus dots
import re
text = """The Eiffel Tower (E.T.) is a popular tourist
    	attraction in Paris (PR, FR). It was built in 1889."""
re.findall(r"\b[A-Z\.]{2,}\b", text)

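The result of this code is shown below. Note that the trailing dot of E.T. is dropped, because \b cannot match between a dot and a closing parenthesis: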

['E.T', 'PR', 'FR']
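
The pattern above returns 'E.T' without its final dot. If the dot matters, one option is to match dotted and undotted forms separately. This is a minimal sketch, not a definitive rule; the names ABBREV_RE and find_abbreviations are made up for illustration:

```python
import re

# Hypothetical pattern: dotted forms such as "E.T." are tried first,
# then plain runs of two or more capital letters such as "PR".
ABBREV_RE = re.compile(r"\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b")

def find_abbreviations(text):
    """Return every dotted or undotted upper-case abbreviation in text."""
    return ABBREV_RE.findall(text)

text = """The Eiffel Tower (E.T.) is a popular tourist
attraction in Paris (PR, FR). It was built in 1889."""
print(find_abbreviations(text))  # ['E.T.', 'PR', 'FR']
```

The dotted branch has no trailing \b, which is what lets it keep the last dot.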

Regex to map abbreviation

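We can build a simple regex to map abbreviations to full names in Python: r"(|.*\s)".join(abbrev.lower()). The code converts each abbreviation to a regex in which every letter may be followed either by nothing or by any run of characters ending in whitespace: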

  • 'GET' - g(|.*\s)e(|.*\s)t
  • 'ELC' - e(|.*\s)l(|.*\s)c

Full code:

import re

def is_abbrev(abbrev, text):
    # Each letter may be followed by nothing or by any text ending in whitespace
    pattern = r"(|.*\s)".join(abbrev.lower())
    return re.match("^" + pattern, text.lower()) is not None

teams = ['Elche', 'Girona', 'Getafe']
abbreviations = ['GET', 'ELC', 'GIR']

for team in teams:
    for abbr in abbreviations:
        if is_abbrev(abbr, team):
            print(abbr, team)
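
The same idea also covers multi-word names, where the letters of the abbreviation are spread across several words. The sketch below is one possible variation, assuming we want a dictionary of matches per abbreviation; build_pattern and map_abbreviations are hypothetical helper names, and re.escape is added so that a dot in an abbreviation is treated literally:

```python
import re

def build_pattern(abbrev):
    # Each letter may be followed by nothing, or by any text ending in whitespace
    letters = [re.escape(ch) for ch in abbrev.lower()]
    return re.compile("^" + r"(?:|.*\s)".join(letters))

def map_abbreviations(abbreviations, names):
    """Map every abbreviation to the list of names it could stand for."""
    mapping = {}
    for abbr in abbreviations:
        pattern = build_pattern(abbr)
        mapping[abbr] = [name for name in names if pattern.match(name.lower())]
    return mapping

teams = ['Real Madrid', 'Real Sociedad', 'Barcelona']
abbreviations = ['RMA', 'RSO', 'BAR']
print(map_abbreviations(abbreviations, teams))
# {'RMA': ['Real Madrid'], 'RSO': ['Real Sociedad'], 'BAR': ['Barcelona']}
```

Returning a list per abbreviation makes ambiguous cases visible instead of silently keeping the first hit.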

For a real-world example of mapping football acronyms to teams, you can check: Football Prediction in Python: Barcelona vs Real Madrid

nlp abbreviation - fuzzy matching

We can also use fuzzy matching to map abbreviations to full-form names in Python. Below you can find a simple example of the matching:

from fuzzywuzzy import fuzz, process

teams = ['Real Madrid', 'Real Sociedad', 'Barcelona']
abbreviations = ['BAR', 'RMA', 'RSO']

# Build a query from the initials of each team name
queries = [''.join([i[0] for i in j.split()]) for j in teams]

# First loop: every candidate with its score
for query in queries:
    print(query, process.extract(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))

# Second loop: only the best match for each team
for query, company in zip(queries, teams):
    print(company, '-', process.extractOne(query, abbreviations, scorer=fuzz.partial_token_sort_ratio))

Result:

RM [('RMA', 100), ('BAR', 50), ('RSO', 50)]
RS [('RSO', 100), ('BAR', 50), ('RMA', 50)]
B [('BAR', 100), ('RMA', 0), ('RSO', 0)]

The second loop produces good results:

Real Madrid - ('RMA', 100)
Real Sociedad - ('RSO', 100)
Barcelona - ('BAR', 100)

Note: In some cases the code will produce bad results. For example for input:

teams = ['Elche', 'Girona', 'Getafe']
abbreviations = ['GET','ELC','GIR']

we will get ambiguous matches, because each single-word name is reduced to a one-letter query:

E [('GET', 100), ('ELC', 100), ('GIR', 0)]
G [('GET', 100), ('GIR', 100), ('ELC', 0)]
G [('GET', 100), ('GIR', 100), ('ELC', 0)]
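
One way to reduce these ties is to build a longer query for single-word names. The sketch below is only an assumption about what might work for this data, not part of fuzzywuzzy: build_query and its width parameter are hypothetical, and the resulting queries would be passed to the fuzzy matcher in place of the initials.

```python
def build_query(name, width=3):
    """Build a fuzzy-matching query from a team name."""
    words = name.split()
    if len(words) == 1:
        # Hypothetical fallback: use the leading letters of a single-word name
        return name[:width].upper()
    # Multi-word names keep the initials-based query
    return ''.join(word[0] for word in words).upper()

teams = ['Elche', 'Girona', 'Getafe', 'Real Madrid']
print([build_query(team) for team in teams])
# ['ELC', 'GIR', 'GET', 'RM']
```

With three-letter queries, 'Girona' and 'Getafe' no longer collapse to the same query 'G'.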

scispacy - AbbreviationDetector example

In this example scispacy detects abbreviations and acronyms and replaces them in the text with the full form of the entities:

import spacy
from scispacy.abbreviation import AbbreviationDetector  # registers "abbreviation_detector"

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("abbreviation_detector")

text = """The Eiffel Tower (E.T.) is a popular tourist
    attraction in Paris (PR), France (FR). It was built in 1889.
    E.T. is one of the famous monuments in FR."""

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    # Overwrite the first token of each detected abbreviation with its long form
    for abrv in doc._.abbreviations:
        altered_tok[abrv.start] = str(abrv._.long_form)

    return " ".join(altered_tok)

replace_acronyms(text)

As a result, all abbreviations/acronyms are replaced with their full forms:

  • E.T. -> Eiffel Tower
  • PR -> Paris
  • FR -> France

Notice that in the result the acronyms are replaced both inside the parentheses and in the last sentence:

'The Eiffel Tower ( Eiffel Tower ) is a popular tourist
attraction in Paris ( Paris ) , France ( France ) . It was built in 1889 .
Eiffel Tower is one of the famous monuments in France .'
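
To get a feel for how a definition such as "Eiffel Tower (E.T.)" can be paired up, here is a deliberately simplified stand-in that uses only the standard library. It is not scispacy's algorithm: it only accepts abbreviations whose letters are the initials of the words right before the parentheses, and find_definitions is a hypothetical name.

```python
import re

def find_definitions(text):
    """Pair each parenthesised abbreviation with the words just before it."""
    pairs = {}
    # An abbreviation here is 2+ capital letters, optionally dotted, in parentheses
    for match in re.finditer(r"\(((?:[A-Z]\.?){2,})\)", text):
        abbr = match.group(1)
        letters = abbr.replace(".", "")
        # Take as many preceding words as the abbreviation has letters
        words = text[:match.start()].split()[-len(letters):]
        initials = "".join(word[0].upper() for word in words)
        if initials == letters:
            pairs[abbr] = " ".join(words)
    return pairs

text = ("The Eiffel Tower (E.T.) is a popular tourist "
        "attraction in Paris (PR), France (FR).")
print(find_definitions(text))  # {'E.T.': 'Eiffel Tower'}
```

PR and FR are skipped here because they are not built from the initials of the preceding words, which is the kind of case where a dedicated detector helps.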

Difflib

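Python offers one more way to match acronyms/abbreviations to names by similarity matching. We will use the method difflib.get_close_matches('RSO', teams, n=3, cutoff=0.2), which returns up to n of the best matches for a word from a list of candidates, keeping only those with a similarity score of at least cutoff: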

import difflib
teams = ['Real Madrid', 'Real Sociedad', 'Rayo Vallecano']
difflib.get_close_matches('RSO', teams, n=3, cutoff=0.2)

Result:

['Real Sociedad']
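
To see why only one team passes the cutoff of 0.2, we can compute the similarity scores directly with difflib.SequenceMatcher, whose ratio() is the score get_close_matches compares against cutoff. This is a small illustrative sketch; the scores dictionary is introduced only for this example:

```python
from difflib import SequenceMatcher

teams = ['Real Madrid', 'Real Sociedad', 'Rayo Vallecano']

# Score each team against 'RSO' with the same ratio() measure
scores = {team: round(SequenceMatcher(None, team, 'RSO').ratio(), 2) for team in teams}
print(scores)
# {'Real Madrid': 0.14, 'Real Sociedad': 0.25, 'Rayo Vallecano': 0.12}
```

The comparison is case-sensitive, so only the capital 'R' and 'S' of 'Real Sociedad' count as matches, which is just enough to reach 0.25.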

Summary

We've seen three different ways of detecting and mapping abbreviations in Python. Let's have a quick overview of each of them, pointing out the advantages and disadvantages.

Regular expressions offer simplicity and freedom of customization. More general solutions like spaCy, difflib and scispacy offer prebuilt models and functions which can save precious time. They come at the cost of performance and efficiency.

For small to medium datasets we can use scispacy and the other libraries. For big data we need to implement custom solutions.
