Text search and analytics engines for NLP and big data

Over the last few years I have worked on many problems involving efficient text search and replacement. Along the way I collected several tools that are very useful for NLP and big data problems. Here is the list, which includes one new library.

ElasticSearch


In short: ElasticSearch is a text/document database with a RESTful interface. It indexes every field in a document, so all fields can be searched with good performance and in various ways.

Elasticsearch is an extremely scalable open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly. ElasticSearch can be downloaded in several formats, such as ZIP and TAR.GZ, from Elasticsearch Downloads. All you need to do is download and extract the package. Running ElasticSearch is easy too: on Windows, for example, run elasticsearch.bat from the bin directory in a command window. This launches ElasticSearch in the foreground, so errors are printed to the console and you can shut it down with CTRL+C.

ElasticSearch features:

  • Scalable Map/Reduce model
  • REST based
  • Self contained
  • Memory and I/O efficient
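To make the "REST based" point concrete, here is a minimal sketch of how a document could be indexed and searched over plain HTTP using only the Python standard library. It assumes a local node on the default port 9200 with no authentication; the index name "articles", the field "title", and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: a local node on the default HTTP port 9200, without authentication.
BASE_URL = "http://localhost:9200"


def index_request(index, doc_id, document):
    """Build a PUT /<index>/_doc/<id> request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a POST /<index>/_search request with a full-text match query."""
    query = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request to the node and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# With a node running, the calls would look like this (hypothetical data):
# send(index_request("articles", 1, {"title": "Text search engines"}))
# send(search_request("articles", "title", "search"))
```

The requests are built separately from sending them, so the URLs and JSON bodies can be inspected without a running server.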

SphinxSearch


Sphinx is an open source full text search server, designed with performance, relevance (search quality), and integration simplicity in mind. Sphinx lets you either ..

Sphinx is an open source search project that runs full-text searches over large data sets very efficiently. Another plus is the variety of data sources it works with: RDBMS, text files, HTML pages, mailboxes, and so on.

Some key features of Sphinx are:

  • Indexing
  • Searching performance
  • Querying tools
  • Result post-processing
  • Scalability (terabytes, thousands of queries per second)
  • Easy integration with SQL and XML data sources

Apache Lucene Core


Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Apache Lucene is another open source engine which is completely free.

Features

  • Powerful and simple API
  • High-Performance Indexing
  • Scalability
  • over 150GB/hour on modern hardware
  • RAM efficient
  • incremental indexing
  • batch indexing

FlashText

FlashText github

I found this Python library recently, when regular expressions were giving me trouble and surprised me in several cases. FlashText is a Python library designed for searching and replacing words in a text document. It is based on the Aho-Corasick algorithm and a trie dictionary, which makes it extremely fast for large-scale text replacement. For this kind of task it has a huge advantage over traditional regex replacement and can be used in its place.

Features:

  • easy to use and install
  • efficient regex alternative
  • extreme speed of replacement
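To show why a trie helps here, below is a simplified pure-Python sketch of trie-based keyword replacement. It is an illustration of the idea only, not FlashText's actual code or API; the class name KeywordReplacer and its methods are hypothetical. All keywords are stored in one character trie, and the text is scanned once, replacing the longest keyword that starts and ends on a word boundary.

```python
class KeywordReplacer:
    """Replace whole-word keywords using a character trie (illustrative sketch)."""

    _END = object()  # marker key: a keyword ends at this trie node

    def __init__(self):
        self._trie = {}

    def add_keyword(self, keyword, replacement):
        """Insert a keyword (case-insensitive) with its replacement text."""
        node = self._trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[self._END] = replacement

    @staticmethod
    def _is_word_char(char):
        return char.isalnum() or char == "_"

    def replace(self, text):
        """Return text with every whole-word keyword replaced."""
        out = []
        i, n = 0, len(text)
        while i < n:
            # A keyword may only start at a word boundary.
            at_word_start = i == 0 or not self._is_word_char(text[i - 1])
            match_end, replacement = None, None
            if at_word_start:
                node, j = self._trie, i
                while j < n and text[j].lower() in node:
                    node = node[text[j].lower()]
                    j += 1
                    # Remember the longest keyword that ends at a word boundary.
                    if self._END in node and (j == n or not self._is_word_char(text[j])):
                        match_end, replacement = j, node[self._END]
            if match_end is None:
                out.append(text[i])  # no keyword here, copy the character
                i += 1
            else:
                out.append(replacement)
                i = match_end  # skip past the matched keyword
        return "".join(out)


replacer = KeywordReplacer()
replacer.add_keyword("New York", "NYC")
replacer.add_keyword("big data", "Big Data")
print(replacer.replace("I love new york and big data."))
```

Because every keyword shares the same trie, adding more keywords does not add more passes over the text, whereas a regex with a very long alternation tends to slow down as the list grows.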