//Karthik Srinivasan

Product Engineer, CTO & a Beer Enthusiast
Experiments, thoughts and scripts documented for posterity.

Elasticsearch - Cautionary and Useful Tips

Sep, 2014

Update/Delete Gotcha:

In Elasticsearch, an update to a document is essentially a delete followed by a reinsert. A delete, in turn, only marks the document as deleted; the document is not actually removed. This is a problem when you have heavy update/delete traffic, because the marked documents are not purged right away and keep taking up disk space. The following screenshot is an example where the number of searchable documents in the index is not the same as the actual total number of documents in the index.



To reclaim disk space, you have to optimize the index:
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
More information at: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html
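
To decide whether an optimize is worth running, it helps to know how much of the index is dead weight. The sketch below is a small hypothetical helper (deleted_pct is my own name, not an Elasticsearch command) that takes the live and deleted document counts, for example the docs.count and docs.deleted columns printed by `curl 'http://localhost:9200/_cat/indices?v'`, and prints the percentage of documents that are only marked as deleted.

```shell
# Hypothetical helper, not part of Elasticsearch.
# Usage: deleted_pct <live_doc_count> <deleted_doc_count>
# Prints the integer percentage of documents in the index that are
# marked as deleted but not yet purged.
deleted_pct() {
    live=$1
    deleted=$2
    total=$((live + deleted))
    # Avoid dividing by zero on an empty index.
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    echo $((deleted * 100 / total))
}

# Example: 1,000,000 live docs and 250,000 deleted docs
# deleted_pct 1000000 250000    # prints 20
```

What counts as "too many" deleted documents is a judgment call; the number only tells you how much space an expunge could give back.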

Memory Limitation - Max ES Heap size:

By default, Elasticsearch allocates 1GB of heap to its process. This is OK for development purposes, but in production you should generally give half of the server's memory to Elasticsearch. To set the heap size:
export ES_HEAP_SIZE=10g
The more memory given to Elasticsearch the better, as more data is held in memory for faster searches/seeks, but there are a few gotchas to be aware of. More information at: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html
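
As a rough sketch of the "half of the server memory" rule, the hypothetical helper below (suggest_heap_mb is my own name) halves the total RAM and caps the result at 31GB. The cap is an assumption on my part, based on the heap sizing guide linked above, which advises against crossing roughly 32GB of heap; adjust it to whatever your JVM and workload call for.

```shell
# Hypothetical helper, not part of Elasticsearch.
# Usage: suggest_heap_mb <total_ram_in_mb>
# Prints a heap size in MB: half of total RAM, capped at 31GB.
suggest_heap_mb() {
    total_mb=$1
    cap_mb=$((31 * 1024))        # assumption: stay below ~32GB of heap
    half_mb=$((total_mb / 2))
    if [ "$half_mb" -gt "$cap_mb" ]; then
        echo "$cap_mb"
    else
        echo "$half_mb"
    fi
}

# Example: on a 20GB (20480MB) server
# export ES_HEAP_SIZE="$(suggest_heap_mb 20480)m"    # sets ES_HEAP_SIZE=10240m
```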

Index name - Alias

It's always advisable to give the index an alias and have the application use the alias name instead of the actual index name. This is useful because we can switch indices without affecting the calling application. For example, we can create a brand new index with new mappings, then remove the alias from the old index and assign it to the new index. This way, re-indexing becomes a zero-downtime operation.

curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "kindex_v1", "alias" : "assets" } },
        { "add" : { "index" : "kindex_v2", "alias" : "assets" } }
    ]
}'
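
If you swap aliases often, the request body can be generated instead of hand-edited. The sketch below is a hypothetical helper (alias_swap_payload is my own name) that prints the same remove/add actions as the request above for any old index, new index and alias. The index names in the example are lowercase because Elasticsearch rejects index names containing uppercase characters.

```shell
# Hypothetical helper, not part of Elasticsearch.
# Usage: alias_swap_payload <old_index> <new_index> <alias>
# Prints the JSON body for the _aliases endpoint that moves <alias>
# from <old_index> to <new_index> in a single request.
alias_swap_payload() {
    printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
        "$1" "$3" "$2" "$3"
}

# Example (needs a running node, so it is left commented out):
# curl -XPOST 'http://localhost:9200/_aliases' \
#     -d "$(alias_swap_payload kindex_v1 kindex_v2 assets)"
```

Both actions travel in one request, so the application never sees a moment where the alias points at nothing.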
                

Logging - Debug/Info/Error:

By default, the Elasticsearch log level is set to DEBUG in the logging.yml file. This is probably not a good choice, as ES tends to log everything, which takes up a lot of disk space. I learned this the hard way: I copied data from one index to another for reindexing purposes, Elasticsearch logged every single payload, and the log ended up almost the size of the index itself! It's best to set the log level to WARN instead of DEBUG or INFO.
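
A minimal sketch of making that change from a script, assuming your logging.yml has a top-level `es.logger.level:` line (that is my recollection of the stock file layout, so check your own copy first). The helper name set_es_log_level and the path in the example are hypothetical. Any per-logger overrides further down the file, such as an `action:` entry, are left untouched and may need editing by hand.

```shell
# Hypothetical helper, not part of Elasticsearch.
# Usage: set_es_log_level <path_to_logging.yml> <LEVEL>
# Rewrites the top-level "es.logger.level:" line to the given level.
# Assumes that line exists; all other lines are copied unchanged.
set_es_log_level() {
    file=$1
    level=$2
    tmp="${file}.tmp"
    # Write to a temp file and move it into place, which avoids
    # relying on the non-portable "sed -i" flag.
    sed "s/^es\.logger\.level:.*/es.logger.level: ${level}/" "$file" > "$tmp" \
        && mv "$tmp" "$file"
}

# Example (path is an assumption, adjust to your install):
# set_es_log_level /etc/elasticsearch/logging.yml WARN
```

Elasticsearch has to be restarted to pick up changes made to the file.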

Document Versioning:

For every insert or document update, Elasticsearch either auto-assigns a version number or expects the user to provide one. This is useful for concurrency control. There are 4 types of versioning mechanism. These four version options are not very well documented, but can be understood by reading the ES source code at: Source

//external_gte example
curl -XPOST "http://localhost:9200/designs/shirt/1?version=4&version_type=external_gte" -d'
{
    "name": "elasticsearch",
    "votes": 1
}'
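
To make the concurrency check concrete, here is a small sketch of the comparison rules as I understand them from the Elasticsearch documentation and source: internal requires the request version to match the stored one, external requires a strictly greater version, external_gte accepts greater or equal, and force always accepts. The function version_accepted is my own illustration, not something Elasticsearch ships, and the rules are my reading rather than a specification.

```shell
# Illustration only, not part of Elasticsearch.
# Usage: version_accepted <version_type> <stored_version> <request_version>
# Exit status 0 means the write would be accepted, 1 means a version
# conflict, 2 means the version type is not recognised.
version_accepted() {
    case "$1" in
        internal)     [ "$3" -eq "$2" ] ;;   # must match the stored version
        external)     [ "$3" -gt "$2" ] ;;   # must be strictly newer
        external_gte) [ "$3" -ge "$2" ] ;;   # newer or equal is fine
        force)        true ;;                # no check at all
        *)            echo "unknown version type: $1" >&2; return 2 ;;
    esac
}

# Example: re-sending version 4 when version 4 is already stored
# version_accepted external_gte 4 4 && echo accepted    # prints accepted
# version_accepted external 4 4 || echo conflict        # prints conflict
```

This is why external_gte is handy for replaying writes: sending the same version twice does not fail the way it does with plain external.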