In this short article we will take a look at the Python library shorttext (available on PyPI), which facilitates supervised and unsupervised learning for short text categorization.

1. Main features of shorttext and installation

  • text preprocessing
  • pre-trained word-embedding support
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder
  • topic model representation supported for supervised learning using scikit-learn
  • cosine distance and maximum entropy classification
  • neural network classification (including ConvNet, and C-LSTM)
  • metrics of phrase differences, including
    • soft Jaccard score (using Damerau-Levenshtein distance); a from-scratch sketch follows this list
    • Word Mover's distance (WMD)
  • character-level sequence-to-sequence (seq2seq) learning
  • spell correction
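
To make the "soft Jaccard score" feature concrete: it is a Jaccard similarity in which two tokens may partially match according to their edit distance. Below is a minimal from-scratch sketch of the idea (my own code, not shorttext's implementation):

from itertools import product

def damerau_levenshtein(s1, s2):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters."""
    d = {(i, -1): i + 1 for i in range(-1, len(s1) + 1)}
    d.update({(-1, j): j + 1 for j in range(-1, len(s2) + 1)})
    for i, j in product(range(len(s1)), range(len(s2))):
        cost = 0 if s1[i] == s2[j] else 1
        d[(i, j)] = min(d[(i - 1, j)] + 1,         # deletion
                        d[(i, j - 1)] + 1,         # insertion
                        d[(i - 1, j - 1)] + cost)  # substitution
        if i > 0 and j > 0 and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
            d[(i, j)] = min(d[(i, j)], d[(i - 2, j - 2)] + 1)  # transposition
    return d[(len(s1) - 1, len(s2) - 1)]

def soft_jaccard(tokens_a, tokens_b):
    """Jaccard score where each token is credited with its best
    (edit-distance-based) partial match in the other list."""
    inter = sum(max(1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))
                    for b in tokens_b)
                for a in tokens_a)
    union = len(tokens_a) + len(tokens_b) - inter
    return inter / union

print(soft_jaccard(['book', 'seller'], ['blok', 'sellers']))  # ~0.67; plain Jaccard gives 0.0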

Documentation

Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.

See the tutorial for how to use the package, and the FAQ.

Installation

To install it, use pip in a console:

$ pip install shorttext

or, if you want the most recent development version on GitHub, type

$ pip install git+https://github.com/stephenhky/PyShortTextCategorization@master
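
Once installed, you can confirm the package is importable and check the installed version (this uses the standard library's importlib.metadata, not a shorttext API):

from importlib.metadata import version

import shorttext  # fails here if the installation went wrong
print(version("shorttext"))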

Below is a clean, working example, plus a short, practical explanation of what each part does.

It is written so you can copy-paste it into a Jupyter notebook or Python script and run it.

2. Working Example

This example shows how shorttext can:

  • Load built-in NLP datasets
  • Preprocess and tokenize text
  • Build a document-term matrix
  • Analyze word usage across documents
  • Query term statistics per document

2.1 Load built-in training datasets

import shorttext

# Subject keywords dataset (small demo dataset)
trainclassdict = shorttext.data.subjectkeywords()
print("Subject keywords sample:")
for k in list(trainclassdict.keys())[:3]:
    print(k, "->", trainclassdict[k][:3])

# NIH reports dataset (larger dataset)
nih_data = shorttext.data.nihreports()
print("\nNIH reports classes:", list(nih_data.keys())[:5])

What this does:

  • Loads small, built-in labeled datasets for experimenting
  • Useful for quick testing and demos
  • Data is in the format:
    label -> list of short text samples

2.2 Load US presidential inaugural speeches

usprez = shorttext.data.inaugural()

docids = sorted(usprez.keys())

# Join token lists into full text documents
usprez_texts = [' '.join(usprez[docid]) for docid in docids]

print("First document ID:", docids[0])
print("First document text sample:\n", usprez_texts[0][:300])

Output:

First document ID: 1789-Washington
First document text sample:
 Fellow - Citizens of the Senate and of the House of Representatives...

What this does:

  • Loads US presidential inaugural addresses
  • Each document is originally tokenized
  • We join tokens back into full strings

2.3 Preprocess and tokenize text

preprocess = shorttext.utils.standard_text_preprocessor_1()

corpus = [
    preprocess(text).split(' ')
    for text in usprez_texts
]

print("First processed document tokens:")
print(corpus[0][:30])

Output:

First processed document tokens:
['fellow', '', 'citizen', 'senat', 'hous', 'repres', 'among', 'vicissitud', 'incid', 'life', ...]

What this does:

  • Lowercases
  • Removes punctuation
  • Removes English stop words (note the missing "of", "the", and "and" in the output above)
  • Applies stemming
  • Tokenizes into word lists
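
To see these steps in isolation, try the preprocessor on a single sentence. The sample text and the expected output are my own illustration; the exact tokens depend on the stemmer and stop word list the library ships with:

sample = "The Vicissitudes of Life, and other CHANGES!"
print(preprocess(sample))
# expect something like: 'vicissitud life chang'
# (lowercased, punctuation and stop words gone, remaining words stemmed)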

2.4 Build a Document-Term Matrix (DTM)

from shorttext.utils import DocumentTermMatrix

usprez_dtm = DocumentTermMatrix(corpus, docids=docids)

print("DTM built with", len(docids), "documents")

Output:

DTM built with 56 documents

What this does:

  • Builds a document-term matrix
  • Tracks word counts per document
  • Enables frequency-based queries
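
Conceptually, the DTM is just a table of counts per (document, term) pair. A minimal pure-Python equivalent of the queries used in the next step could look like this (a sketch of the idea, not how shorttext stores the matrix internally):

from collections import Counter

# counts[docid][term] -> raw term frequency in that document
counts = {docid: Counter(tokens) for docid, tokens in zip(docids, corpus)}

def doc_frequency(term):
    # number of documents containing the term at least once
    return sum(1 for c in counts.values() if c[term] > 0)

def total_termfreq(term):
    # total occurrences of the term across all documents
    return sum(c[term] for c in counts.values())

print(doc_frequency('peopl'), total_termfreq('justic'))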

2.5 Query word frequencies

# Number of documents containing the term
print("Doc frequency of 'peopl':",
      usprez_dtm.get_doc_frequency('peopl'))

# Total occurrences across all documents
print("Total term freq of 'justic':",
      usprez_dtm.get_total_termfreq('justic'))

# Term frequency in a specific document
print("Term freq of 'chang' in 2009-Obama:",
      usprez_dtm.get_termfreq('2009-Obama', 'chang'))

Output:

Doc frequency of 'peopl': 54
Total term freq of 'justic': 134.0
Term freq of 'chang' in 2009-Obama: 2.0

What this does:

  • get_doc_frequency()
    How many documents contain the word
  • get_total_termfreq()
    Total count across all documents
  • get_termfreq(docid, term)
    Count in a specific document
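
Since these queries are cheap, they are easy to combine. For example, you can rank a few hand-picked stems (chosen here purely for illustration) by how many speeches use them:

# Rank stems by the number of speeches that contain them
stems = ['peopl', 'freedom', 'justic', 'chang', 'war']
for stem in sorted(stems, key=usprez_dtm.get_doc_frequency, reverse=True):
    print(stem,
          usprez_dtm.get_doc_frequency(stem),
          usprez_dtm.get_total_termfreq(stem))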

3. Comparing vocabulary between two documents

This is a small experiment with the library, comparing the vocabulary of different documents:

import shorttext
from shorttext.utils import DocumentTermMatrix

# Load US presidential inaugural speeches
usprez = shorttext.data.inaugural()
docids = sorted(usprez.keys())
# Join token lists into full text documents
usprez_texts = [' '.join(usprez[docid]) for docid in docids]
print("First document ID:", docids[0])
print("First document text sample:\n", usprez_texts[0][:300])

# Preprocess and tokenize text
preprocess = shorttext.utils.standard_text_preprocessor_1()
corpus = [
    preprocess(text).split(' ')
    for text in usprez_texts
]
print("First processed document tokens:")
print(corpus[0][:30])

# Build a Document-Term Matrix (DTM)
usprez_dtm = DocumentTermMatrix(corpus, docids=docids)
print("DTM built with", len(docids), "documents")

# Comparing vocabulary between two presidents
obama_change = usprez_dtm.get_termfreq('2009-Obama', 'chang')
obama_freedom = usprez_dtm.get_termfreq('2009-Obama', 'freedom')
bush_freedom = usprez_dtm.get_termfreq('2005-Bush', 'freedom')
bush_change = usprez_dtm.get_termfreq('2005-Bush', 'chang')
print("Obama - chang:", obama_change)
print("Obama - freedom:", obama_freedom)
print("Bush - chang:", bush_change)
print("Bush - freedom:", bush_freedom)

Output:

First document ID: 1789-Washington
First document text sample:
 Fellow - Citizens of the Senate and of the House of...
First processed document tokens:
['fellow', '', 'citizen', 'senat', 'hous', 'repres'...]
DTM built with 56 documents
Obama - chang: 2.0
Obama - freedom: 3.0
Bush - chang: 0.0
Bush - freedom: 27.0
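
The same comparison generalizes to any pair of documents and any list of stems. A small helper function (my own, not part of shorttext) keeps the queries tidy:

def compare_terms(dtm, docid_a, docid_b, terms):
    """Print side-by-side term frequencies for two documents."""
    for term in terms:
        print(f"{term:>10}  {docid_a}: {dtm.get_termfreq(docid_a, term):4.1f}"
              f"  {docid_b}: {dtm.get_termfreq(docid_b, term):4.1f}")

compare_terms(usprez_dtm, '2009-Obama', '2005-Bush',
              ['chang', 'freedom', 'peopl', 'nation'])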

Demo notebooks

The author of the library has created several demo notebooks. They are useful for seeing how the library can be applied in practice, covering data preprocessing and classification.

Resources