Python 3 language detection

In Python there are many modules available for detecting the language of a given text. They differ in accuracy, features and supported language packs. In my personal projects I like to use these two:

  • langdetect - Port of Google's language-detection library to Python.
  • CLD2 - Compact Language Detection, installed through the cld2-cffi package.

Langdetect

This one is a port of Google's language-detection library. It supports more than 50 languages, is easy to install and use, and lets you add new languages. It is licensed under the Apache License.

Langdetect Installation

Installation is straightforward with pip, and there are no additional requirements. Supported Python versions are 2.7 and 3.4+:

pip install langdetect

Langdetect Example

The next example covers two cases:

  • detecting a single language
  • getting probabilities for the best-matching languages

from langdetect import detect
from langdetect import detect_langs

# Single language detection
print(detect("War doesn't show who's right, just who's left."))
print(detect("Ein, zwei, drei, vier"))
print(detect("李红:不,那不是杂志。那是字典"))
print(detect("Доброе утро"))
print(detect("voulez vous manger avec moi"))


# Probabilities for the best-matching languages
print(detect_langs("Otec matka syn."))

result:

en
de
zh-cn
ru
fr
[fi:0.5714263104341767, pl:0.42857086073523615]

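As you can see, even a difficult language like Chinese can be detected with a good level of accuracy. Very short inputs are less reliable, though: the last phrase is Czech, yet it is reported as Finnish or Polish.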
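Because detect_langs returns several candidates with probabilities, it is possible to reject ambiguous results such as the fi/pl split above instead of trusting the top guess. The following is a minimal sketch, not part of langdetect itself: the helper names best_language and detect_confident and the 0.9 threshold are assumptions made for illustration. It also sets DetectorFactory.seed, since langdetect's algorithm is non-deterministic and can give different answers for the same short text between runs.

```python
from typing import List, Optional, Tuple


def best_language(candidates: List[Tuple[str, float]],
                  min_prob: float = 0.9) -> Optional[str]:
    """Return the most probable language code, or None if it is below min_prob."""
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang if prob >= min_prob else None


def detect_confident(text: str, min_prob: float = 0.9) -> Optional[str]:
    """Hypothetical wrapper: detect a language only when langdetect is confident."""
    # Imported lazily so best_language() can be used without langdetect installed.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # A fixed seed makes repeated runs return the same result.
    DetectorFactory.seed = 0
    try:
        results = detect_langs(text)
    except LangDetectException:
        # Raised when no language features can be extracted, e.g. for empty text.
        return None
    # Each result exposes the language code as .lang and its probability as .prob.
    return best_language([(r.lang, r.prob) for r in results], min_prob)
```

With the probabilities shown above, best_language([("fi", 0.57), ("pl", 0.43)]) returns None, because neither candidate reaches the threshold.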

CLD2 (Compact Language Detection)

CLD2 is the other choice. It is licensed under Chromium’s LICENSE and supports about 80 languages. CLD2 is a Naïve Bayesian classifier that uses one of three different token algorithms.
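To make the idea of a Naïve Bayesian classifier more concrete, here is a toy sketch that scores text by character trigram frequencies. It is not CLD2's implementation: the training sentences, the function names (ngrams, train, classify), the choice of trigrams as tokens and the add-one smoothing are all assumptions chosen for illustration, and real detectors are trained on far larger data.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    text = " " + text.lower() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples):
    """Build per-language n-gram counts from {language code: training text}."""
    return {lang: Counter(ngrams(text)) for lang, text in samples.items()}


def classify(text, model):
    """Return the language whose n-gram statistics best explain the text."""
    vocab = set().union(*model.values())
    scores = {}
    for lang, counts in model.items():
        total = sum(counts.values())
        # Sum of log P(ngram | lang) with add-one smoothing and a uniform prior.
        scores[lang] = sum(
            math.log((counts[gram] + 1) / (total + len(vocab)))
            for gram in ngrams(text)
        )
    return max(scores, key=scores.get)


# Tiny illustrative training set (an assumption, far too small for real use).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps",
    "de": "der schnelle braune fuchs springt und der hund und die katze schlafen",
}
model = train(samples)
print(classify("the dog and the fox", model))
print(classify("der hund und die katze", model))
```

The "naive" part is the assumption that each trigram contributes independently, which is what allows the per-trigram log probabilities to be summed.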

CLD2 Installation

Installation is again simple, with no additional requirements:

pip install cld2-cffi

Development versions can be installed with:

pip install --upgrade 'git+https://github.com/GregBowyer/cld2-cffi.git'

More information can be found here: cld2

CLD2 Example

The example below tests Chinese and French text. If the text is not recognized, Unknown is returned.

import cld2

texts = [
    "王明:那是杂志吗",
    "voulez vous manger avec moi",
    "李红:不,那不是杂志。那是字典",
]

for text in texts:
    # detect() returns a reliability flag, the number of text bytes found,
    # and a tuple of Detection results
    isReliable, textBytesFound, details = cld2.detect(text)
    print('  reliable: %s' % (isReliable != 0))
    print('  textBytes: %s' % textBytesFound)
    print('  details: %s' % str(details))

result:

  reliable: True
  textBytes: 24
  details: (Detection(language_name='Chinese', language_code='zh', percent=95, score=1691.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
  reliable: True
  textBytes: 29
  details: (Detection(language_name='FRENCH', language_code='fr', percent=96, score=1426.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
  reliable: False
  textBytes: 41
  details: (Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
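
In practice the three values returned by cld2.detect usually need to be reduced to a single answer. Here is a minimal sketch, assuming the result has the shape printed above. The helper name top_language and the stand-in Detection tuple are hypothetical and exist only so the example runs without cld2 installed:

```python
from collections import namedtuple

# Stand-in with the same fields as the Detection tuples printed above
# (an assumption for illustration; real results come from cld2.detect).
Detection = namedtuple("Detection", "language_name language_code percent score")


def top_language(result):
    """Return the best language code, or None if unreliable or unknown."""
    is_reliable, _text_bytes_found, details = result
    if not is_reliable or not details:
        return None
    best = details[0]  # in the output above the strongest match comes first
    return None if best.language_code == "un" else best.language_code


# Sample values copied from the results above.
unknown = Detection("Unknown", "un", 0, 0.0)
french = (True, 29, (Detection("FRENCH", "fr", 96, 1426.0), unknown, unknown))
unrecognized = (False, 41, (unknown, unknown, unknown))

print(top_language(french))        # fr
print(top_language(unrecognized))  # None
```

With the real library the call would be top_language(cld2.detect(text)).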