The Lucene-based search engine Elasticsearch is fast and adaptable, so that it can cope with demanding configurations, including large text corpora. I use it daily on tweets and have started to release the scripts I use to do so. In this post, I give concrete tips for the indexation of text and for linguistic analysis.

Mapping

You do not have to define a type for the indexed fields, since the database can guess it for you; however, using a mapping speeds up the process and gives you more control. The official documentation is extensive, and it is sometimes difficult to get a general idea of how to parametrize indexation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Options that are best specified before indexation include similarity scoring as well as the storage of term frequencies and positions.
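As a sketch of what this can look like in the mapping (assuming a pre-5.0 string field; the field name and the chosen values are only examples):

{
  "text": {
    "type": "string",
    "similarity": "BM25",
    "index_options": "positions",
    "term_vector": "with_positions_offsets"
  }
}

Here, similarity selects the scoring model, index_options determines how much positional information is written to the index, and term_vector makes per-document term statistics retrievable.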

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analyzers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

{
  "text": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "german"
  }
}
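The mapping can then be applied when the index is created. Here is a minimal sketch using elasticsearch-py, assuming an older (2.x-style) cluster and hypothetical index and type names:

from elasticsearch import Elasticsearch

# connects to localhost:9200 by default
es = Elasticsearch()

mapping = {
    "mappings": {
        "tweet": {  # hypothetical document type
            "properties": {
                "text": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "german"
                }
            }
        }
    }
}

es.indices.create(index="tweets", body=mapping)  # hypothetical index name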

Beyond that, it is possible to write custom analysis settings, which is a notable strength of Elasticsearch: it integrates a linguistic processing toolchain directly into the indexation phase, in the form of chains of character filters, tokenizers and token filters.
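As an illustration, here is a sketch of custom index settings combining built-in components; the names my_german_analyzer, german_stop and german_stemmer are mine, while the filter types (stop, stemmer) and the german_normalization filter are built-in:

{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "my_german_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_stop", "german_normalization", "german_stemmer"]
        }
      }
    }
  }
}

The filters are applied in order, so lowercasing happens before stop word removal and stemming.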

Notable built-in options include stemmers, stop word lists and lowercasing or ASCII-folding filters. It is also possible to index strings as they are with the not_analyzed option.
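A common pattern is to keep a raw, unanalyzed copy of a field next to the analyzed one, using a multi-field; the sub-field name “raw” is just a convention:

{
  "text": {
    "type": "string",
    "analyzer": "german",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Queries can then target text for full-text search and text.raw for exact matches.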

Useful tools

Manipulation of JSON files is much faster with UltraJSON, a JSON encoder and decoder written in C with Python bindings.
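It can serve as a drop-in replacement for the standard json module; a minimal sketch, assuming a hypothetical file with one tweet per line:

import ujson

with open("tweets.jsonl") as inputfile:  # hypothetical file name
    for line in inputfile:
        tweet = ujson.loads(line)  # parse one JSON document per line
        print(tweet.get("text"))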

For indexing with Python I use the “official” module: https://github.com/elastic/elasticsearch-py
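Its bulk helpers make indexation of large collections considerably faster than indexing document by document. A sketch of the idea, again with hypothetical index and type names:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def generate_actions(tweets):
    # wrap each document in the action format expected by the bulk helper
    for tweet in tweets:
        yield {
            "_index": "tweets",  # hypothetical index name
            "_type": "tweet",    # hypothetical document type
            "_source": tweet,
        }

tweets = [{"text": "Guten Morgen!"}, {"text": "Schönes Wetter heute."}]
bulk(es, generate_actions(tweets))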

Further information

Further examples to help with indexation:

Further references to help with the mapping: