In this short article we will take a look at the Python library shorttext (available on PyPI), which facilitates supervised and unsupervised learning for short text categorization.

1. Main features of shorttext and installation

  • text preprocessing
  • pre-trained word-embedding support
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder
  • topic model representation supported for supervised learning using scikit-learn
  • cosine distance and maximum entropy classification
  • neural network classification (including ConvNet, and C-LSTM)
  • metrics of phrase differences, including
    • soft Jaccard score (using Damerau-Levenshtein distance); a from-scratch sketch follows this list
    • Word Mover's distance (WMD)
  • character-level sequence-to-sequence (seq2seq) learning
  • spell correction
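
To make the "soft Jaccard score" feature concrete: it is a Jaccard similarity in which two tokens may partially match according to their edit distance. Below is a minimal from-scratch sketch of the idea (my own code, not shorttext's implementation):

from itertools import product

def damerau_levenshtein(s1, s2):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters."""
    d = {(i, -1): i + 1 for i in range(-1, len(s1) + 1)}
    d.update({(-1, j): j + 1 for j in range(-1, len(s2) + 1)})
    for i, j in product(range(len(s1)), range(len(s2))):
        cost = 0 if s1[i] == s2[j] else 1
        d[(i, j)] = min(d[(i - 1, j)] + 1,         # deletion
                        d[(i, j - 1)] + 1,         # insertion
                        d[(i - 1, j - 1)] + cost)  # substitution
        if i > 0 and j > 0 and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
            d[(i, j)] = min(d[(i, j)], d[(i - 2, j - 2)] + 1)  # transposition
    return d[(len(s1) - 1, len(s2) - 1)]

def soft_jaccard(tokens_a, tokens_b):
    """Jaccard score where each token is credited with its best
    (edit-distance-based) partial match in the other list."""
    inter = sum(max(1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))
                    for b in tokens_b)
                for a in tokens_a)
    union = len(tokens_a) + len(tokens_b) - inter
    return inter / union

print(soft_jaccard(['book', 'seller'], ['blok', 'sellers']))  # ~0.67; plain Jaccard gives 0.0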

Documentation

Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.

See the tutorial for how to use the package, and the FAQ.

Installation

To install it, use pip in a console:

$ pip install shorttext

or, if you want the most recent development version on GitHub, type

$ pip install git+https://github.com/stephenhky/PyShortTextCategorization@master
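
Once installed, you can confirm the package is importable and check the installed version (this uses the standard library's importlib.metadata, not a shorttext API):

from importlib.metadata import version

import shorttext  # fails here if the installation went wrong
print(version("shorttext"))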

Below is a clean, working example, plus a short, practical explanation of what each part does.

It is written so you can copy-paste it into a Jupyter notebook or Python script and run it.

2. Working Example

This example shows how shorttext can:

  • Load built-in NLP datasets
  • Preprocess and tokenize text
  • Build a document-term matrix
  • Analyze word usage across documents
  • Query term statistics per document

2.1 Load built-in training datasets

import shorttext

# Subject keywords dataset (small demo dataset)
trainclassdict = shorttext.data.subjectkeywords()
print("Subject keywords sample:")
for k in list(trainclassdict.keys())[:3]:
    print(k, "->", trainclassdict[k][:3])

# NIH reports dataset (larger dataset)
nih_data = shorttext.data.nihreports()
print("\nNIH reports classes:", list(nih_data.keys())[:5])

What this does:

  • Loads small, built-in labeled datasets for experimenting
  • Useful for quick testing and demos
  • Data is in the format:
    label -> list of short text samples

2.2 Load US presidential inaugural speeches

usprez = shorttext.data.inaugural()

docids = sorted(usprez.keys())

# Join token lists into full text documents
usprez_texts = [' '.join(usprez[docid]) for docid in docids]

print("First document ID:", docids[0])
print("First document text sample:\n", usprez_texts[0][:300])

Output:

First document ID: 1789-Washington
First document text sample:
 Fellow - Citizens of the Senate and of the House of Representatives...

What this does:

  • Loads US presidential inaugural addresses
  • Each document is originally tokenized
  • We join tokens back into full strings

2.3 Preprocess and tokenize text

preprocess = shorttext.utils.standard_text_preprocessor_1()

corpus = [
    preprocess(text).split(' ')
    for text in usprez_texts
]

print("First processed document tokens:")
print(corpus[0][:30])

Output:

First processed document tokens:
['fellow', '', 'citizen', 'senat', 'hous', 'repres', 'among', 'vicissitud', 'incid', 'life', ...]

What this does:

  • Lowercases
  • Removes punctuation
  • Removes English stop words (note the missing "of", "the", and "and" in the output above)
  • Applies stemming
  • Tokenizes into word lists
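
To see these steps in isolation, try the preprocessor on a single sentence. The sample text and the expected output are my own illustration; the exact tokens depend on the stemmer and stop word list the library ships with:

sample = "The Vicissitudes of Life, and other CHANGES!"
print(preprocess(sample))
# expect something like: 'vicissitud life chang'
# (lowercased, punctuation and stop words gone, remaining words stemmed)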

2.4 Build a Document-Term Matrix (DTM)

from shorttext.utils import DocumentTermMatrix

usprez_dtm = DocumentTermMatrix(corpus, docids=docids)

print("DTM built with", len(docids), "documents")

Output:

DTM built with 56 documents

What this does:

  • Builds a document-term matrix
  • Tracks word counts per document
  • Enables frequency-based queries
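
Conceptually, the DTM is just a table of counts per (document, term) pair. A minimal pure-Python equivalent of the queries used in the next step could look like this (a sketch of the idea, not how shorttext stores the matrix internally):

from collections import Counter

# counts[docid][term] -> raw term frequency in that document
counts = {docid: Counter(tokens) for docid, tokens in zip(docids, corpus)}

def doc_frequency(term):
    # number of documents containing the term at least once
    return sum(1 for c in counts.values() if c[term] > 0)

def total_termfreq(term):
    # total occurrences of the term across all documents
    return sum(c[term] for c in counts.values())

print(doc_frequency('peopl'), total_termfreq('justic'))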

2.5 Query word frequencies

# Number of documents containing the term
print("Doc frequency of 'peopl':",
      usprez_dtm.get_doc_frequency('peopl'))

# Total occurrences across all documents
print("Total term freq of 'justic':",
      usprez_dtm.get_total_termfreq('justic'))

# Term frequency in a specific document
print("Term freq of 'chang' in 2009-Obama:",
      usprez_dtm.get_termfreq('2009-Obama', 'chang'))

Output:

Doc frequency of 'peopl': 54
Total term freq of 'justic': 134.0
Term freq of 'chang' in 2009-Obama: 2.0

What this does:

  • get_doc_frequency()
    How many documents contain the word
  • get_total_termfreq()
    Total count across all documents
  • get_termfreq(docid, term)
    Count in a specific document
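
Since these queries are cheap, they are easy to combine. For example, you can rank a few hand-picked stems (chosen here purely for illustration) by how many speeches use them:

# Rank stems by the number of speeches that contain them
stems = ['peopl', 'freedom', 'justic', 'chang', 'war']
for stem in sorted(stems, key=usprez_dtm.get_doc_frequency, reverse=True):
    print(stem,
          usprez_dtm.get_doc_frequency(stem),
          usprez_dtm.get_total_termfreq(stem))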

3. Comparing vocabulary between two documents

This is a small experiment with the library, comparing the vocabulary of different documents:

import shorttext
from shorttext.utils import DocumentTermMatrix

# Load US presidential inaugural speeches
usprez = shorttext.data.inaugural()
docids = sorted(usprez.keys())
# Join token lists into full text documents
usprez_texts = [' '.join(usprez[docid]) for docid in docids]
print("First document ID:", docids[0])
print("First document text sample:\n", usprez_texts[0][:300])

# Preprocess and tokenize text
preprocess = shorttext.utils.standard_text_preprocessor_1()
corpus = [
    preprocess(text).split(' ')
    for text in usprez_texts
]
print("First processed document tokens:")
print(corpus[0][:30])

# Build a Document-Term Matrix (DTM)
usprez_dtm = DocumentTermMatrix(corpus, docids=docids)
print("DTM built with", len(docids), "documents")

# Comparing vocabulary between two presidents
obama_change = usprez_dtm.get_termfreq('2009-Obama', 'chang')
obama_freedom = usprez_dtm.get_termfreq('2009-Obama', 'freedom')
bush_freedom = usprez_dtm.get_termfreq('2005-Bush', 'freedom')
bush_change = usprez_dtm.get_termfreq('2005-Bush', 'chang')
print("Obama - chang:", obama_change)
print("Obama - freedom:", obama_freedom)
print("Bush - chang:", bush_change)
print("Bush - freedom:", bush_freedom)

Output:

First document ID: 1789-Washington
First document text sample:
 Fellow - Citizens of the Senate and of the House of...
First processed document tokens:
['fellow', '', 'citizen', 'senat', 'hous', 'repres'...]
DTM built with 56 documents
Obama - chang: 2.0
Obama - freedom: 3.0
Bush - chang: 0.0
Bush - freedom: 27.0
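
The same comparison generalizes to any pair of documents and any list of stems. A small helper function (my own, not part of shorttext) keeps the queries tidy:

def compare_terms(dtm, docid_a, docid_b, terms):
    """Print side-by-side term frequencies for two documents."""
    for term in terms:
        print(f"{term:>10}  {docid_a}: {dtm.get_termfreq(docid_a, term):4.1f}"
              f"  {docid_b}: {dtm.get_termfreq(docid_b, term):4.1f}")

compare_terms(usprez_dtm, '2009-Obama', '2005-Bush',
              ['chang', 'freedom', 'peopl', 'nation'])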

Demo notebooks

The author of the library has created several demo notebooks. They are useful for seeing how the library can be applied in practice, covering data preprocessing and classification.

Resources