The Lucene-based search engine Elasticsearch is fast and adaptable, so that it can cope with demanding configurations, including large text corpora. I use it daily on tweets and have started to release the scripts I use to do so. In this post, I give concrete tips for the indexation of text and for linguistic analysis.

Mapping

You do not have to define a type for the indexed fields, since the database can guess it for you; however, using a mapping speeds up the process and gives you more control. The official documentation is extensive, and it is sometimes difficult to get a general idea of how to parametrize indexation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Options that are best specified before indexation include similarity scoring as well as the storage of term frequencies and positions.
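As a sketch of what this can look like in the mapping (assuming a pre-5.0 string field; the field name and the chosen values are only examples):

{
  "text": {
    "type": "string",
    "similarity": "BM25",
    "index_options": "positions",
    "term_vector": "with_positions_offsets"
  }
}

Here, similarity selects the scoring model, index_options determines how much positional information is written to the index, and term_vector makes per-document term statistics retrievable.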

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analyzers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

{
  "text": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "german"
  }
}
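The mapping can then be applied when the index is created. Here is a minimal sketch using elasticsearch-py, assuming an older (2.x-style) cluster and hypothetical index and type names:

from elasticsearch import Elasticsearch

# connects to localhost:9200 by default
es = Elasticsearch()

mapping = {
    "mappings": {
        "tweet": {  # hypothetical document type
            "properties": {
                "text": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "german"
                }
            }
        }
    }
}

es.indices.create(index="tweets", body=mapping)  # hypothetical index name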

Beyond that, it is possible to write custom analysis settings, which is a notable strength of Elasticsearch: it integrates a linguistic processing toolchain directly into the indexation phase, in the form of chains of character filters, tokenizers and token filters.
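As an illustration, here is a sketch of custom index settings combining built-in components; the names my_german_analyzer, german_stop and german_stemmer are mine, while the filter types (stop, stemmer) and the german_normalization filter are built-in:

{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "my_german_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_stop", "german_normalization", "german_stemmer"]
        }
      }
    }
  }
}

The filters are applied in order, so lowercasing happens before stop word removal and stemming.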

Notable built-in options include stemmers, stop word lists and lowercasing or ASCII-folding filters. It is also possible to index strings as they are with the not_analyzed option.
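A common pattern is to keep a raw, unanalyzed copy of a field next to the analyzed one, using a multi-field; the sub-field name “raw” is just a convention:

{
  "text": {
    "type": "string",
    "analyzer": "german",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Queries can then target text for full-text search and text.raw for exact matches.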

Useful tools

Manipulation of JSON files is much faster with UltraJSON, a JSON encoder and decoder written in C with Python bindings.
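It can serve as a drop-in replacement for the standard json module; a minimal sketch, assuming a hypothetical file with one tweet per line:

import ujson

with open("tweets.jsonl") as inputfile:  # hypothetical file name
    for line in inputfile:
        tweet = ujson.loads(line)  # parse one JSON document per line
        print(tweet.get("text"))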

For indexing with Python I use the “official” module: https://github.com/elastic/elasticsearch-py
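Its bulk helpers make indexation of large collections considerably faster than indexing document by document. A sketch of the idea, again with hypothetical index and type names:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def generate_actions(tweets):
    # wrap each document in the action format expected by the bulk helper
    for tweet in tweets:
        yield {
            "_index": "tweets",  # hypothetical index name
            "_type": "tweet",    # hypothetical document type
            "_source": tweet,
        }

tweets = [{"text": "Guten Morgen!"}, {"text": "Schönes Wetter heute."}]
bulk(es, generate_actions(tweets))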

Further information

Further examples to help with indexation:

Further references to help with the mapping: