Tips & Tricks with ElasticSearch

I’ve been using Lucene’s child, the powerful ElasticSearch, on and off for a while now!

A NoSQL store, mainly used for its historization and analysis capabilities. I’ve learnt a few things along the way, and I need to write them down, or I won’t remember any of it in a couple of months without touching the tech.

How does ElasticSearch work, in a few words? And what is it for?

It is basically a collection of files. Everything is dynamic, and stored into categories called "indexes". An index can contain multiple documents, and a document contains attributes and collections of attributes. In fact, the hierarchy is quite simple; let’s take a quick example: I am the owner of 3 shops, and I want to keep a history of every event related to the commercial activity. I have an ElasticSearch instance running, and I have the following indexes:

  • shop1
  • shop2
  • shop3

In each index I’m storing different documents, differentiated by their types:

  • purchase (the details of each purchase: promo or not, date, price, article, etc.)
  • product_delivery (registering every product delivered by providers)
  • inventory (monthly inventory results stored here)
  • daily (storing opening/closing dates, etc.)

And, for example, a purchase document could contain these attributes: date, quantity, article ref code, article common name, unit price, promotion, name of the vendor, and so on.

One day, the owner wants to add a fidelity program and creates a fidelity card that lets customers benefit from a percentage discount based on their overall purchase amount.

No problem: in our "purchase" document type, we now simply add a "fidelity" field and a "fidelity_percentage" field, and it’s done; all our new purchases are now tracked with the new data.
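Since the mapping is dynamic by default, no schema migration is needed; the first document carrying the new fields is enough. A minimal sketch (the attribute values here are made up for the example):

curl -XPOST 'http://localhost:9200/shop1/purchase' -d '
{
    "date" : "2016-05-12",
    "article_ref" : "003816724",
    "article_name" : "Sneakers Dummy Cherokee",
    "unit_price" : 59.90,
    "seller" : "kimchy",
    "fidelity" : "yes",
    "fidelity_percentage" : 10
}'

ES sees the two new fields for the first time, maps them on the fly, and every later document can carry them too.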

The owner wants to open 3 new shops? We add an ElasticSearch node to the cluster, because let’s say one node was no longer sufficient. It’s very simple to adapt the current setup to our future needs; that’s why it’s so flexible and so powerful. But there is another reason.

The data analysis. People often use tools like Kibana to pull the best out of ElasticSearch. With a huge dataset stored, you may want to exploit the data to produce statistics like: the top 10 best-selling articles each month, the vendor who made the biggest sales, the peak periods, the articles lost in the sold/stock difference, … and many other things where your only limits are the data and your imagination. It is by aggregating multiple data sources that we create information (out of raw data blocks). We could imagine that by combining your shop data with collected weather data, we could determine, over a long period, in which weather you sell the most, or have the most customers visiting. You would then be able to statistically estimate future activity by fetching data from a weather forecast API. This is only an application example, not necessarily very relevant, but it’s up to you to imagine smart combinations.
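To give an idea of what such an aggregation looks like, here is a minimal sketch for the top 10 best-selling articles, using a terms aggregation on the article_name field (the aggregation name top_articles is just an assumption for the example, and this works best on a not_analyzed field, a point we will come back to at the end):

curl -XPOST 'http://localhost:9200/shop*/purchase/_search?pretty=true' -d '
{
    "size" : 0,
    "aggs" : {
        "top_articles" : {
            "terms" : { "field" : "article_name", "size" : 10 }
        }
    }
}'

The size of 0 at the top level simply tells ES to skip returning individual documents, since we only care about the aggregated buckets.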

One of the few drawbacks of such a system is that it’s not well suited to frequent update/delete operations. If you want to bulk-update or bulk-delete documents, it will be quite a pain to target what you want and run all your queries. That’s why ElasticSearch (let’s call it ES for convenience) often goes in duo with a relational database (MySQL, MariaDB, PostgreSQL, Oracle, …), which lets us store things that are frequently updated rather than appended (login/user data, configuration, and so on).

The second drawback I see is that it doesn’t natively come with a nice GUI to browse data (apart from Kibana), so you are forced to install a plugin (like _head) to easily and quickly access the data you want to check…
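For the record, installing such a site plugin is a one-liner; something like this, depending on your ES version (shown here in the 2.x style):

bin/plugin install mobz/elasticsearch-head

Once installed, the interface is typically served at http://myelasticsearch:9200/_plugin/head/.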

The basics

The funny thing about it is that you can request any document in ElasticSearch via a REST API: you basically send a JSON-formatted document, and you receive another one as a response. Nice and lean.
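For instance, simply querying the root of the instance already answers with a small JSON document describing the node:

curl -XGET 'http://localhost:9200/?pretty=true'

The response contains, among other things, the node name, the cluster name and the ES version.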

Talking to ES

So here we are: in ElasticSearch, by default, you can access the running instance through two open ports: 9200 and 9300.

The 9200 port is used for your REST requests, and the 9300 port is for cluster communication. If multiple nodes are running, you can access the REST API on ports between 9200 and 9300, and the nodes communicate with each other using ports from 9300 to 9400.
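Both ranges can be changed in elasticsearch.yml; a sketch with the default values made explicit:

# elasticsearch.yml (excerpt)
http.port: 9200            # REST API
transport.tcp.port: 9300   # node-to-node communication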

Indexes

First, let’s check the indexes that were created:

GET http://myelasticsearch:9200/_cat/indices?v

Index information is displayed: you can check their size and the number of documents they contain, along with their status.
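The response is a plain-text table; with our shop example it could look something like this (the figures are invented for illustration):

health status index  pri rep docs.count docs.deleted store.size pri.store.size
green  open   shop1    5   1     182312            0     96.2mb         48.1mb
green  open   shop2    5   1      97054            0     51.4mb         25.7mb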

Beauty is subjective

In every request, you can add the GET parameter ?pretty=true in order to get a nicely formatted JSON result (with indentation, spacing and the hierarchy respected).

Wanted: documents

Let’s talk a bit about search queries in ElasticSearch. The main advantage of ES is that it is quite good at indexing and searching documents, so you are right in the perfect use case for such a NoSQL engine.

You have to know that the HTTP verbs matter if you want to make good use of the full power of ES. You can use either the GET or the POST verb to fetch documents; what matters are the parameters passed (in the body or in the URL).

Let’s begin with a basic example to get you started: we will try to retrieve all the documents contained in one index:

GET http://localhost:9200/shop1/purchase/_search?q=*

Note that you can also query multiple indexes at the same time: just put a wildcard ‘*’ at the end of the name of your indexes. Let’s say, for instance, that you have multiple shops and use one index per shop (shop1, shop2, shop3, …). In this case, when you want to search globally, you can send a request to this URL:

GET http://localhost:9200/shop*/purchase/_search?q=*

By doing one of the two queries above, we receive a list of documents limited to 10 results by default. If you want to increase this limit, you can add the size parameter to the query URL:

GET http://localhost:9200/shop1/purchase/_search?q=*&size=1000

Now it will return a maximum of 1000 documents from the shop1 index.

We also want nicely formatted JSON, so we just add the pretty parameter set to true:

GET http://localhost:9200/shop1/purchase/_search?q=*&size=1000&pretty=true

OK, cool! Now we want to search documents based on a particular filter; let’s say we want all the purchases with a total cost over 50 € (or whatever the currency):

GET http://localhost:9200/shop1/purchase/_search?q=Cost:>50&size=1000&pretty=true

You can see that we use the colon ‘:’ to build our operators:

  • For equality comparison, we use ‘:’
  • For higher than, we use ‘:>’
  • For lower than, we use ‘:<’
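A few examples with the fields from our purchase documents:

GET http://localhost:9200/shop1/purchase/_search?q=seller:kimchy&pretty=true
GET http://localhost:9200/shop1/purchase/_search?q=Cost:>50&pretty=true
GET http://localhost:9200/shop1/purchase/_search?q=Cost:<20&pretty=true

The first one returns every purchase made by a given seller, the second the purchases over 50 €, and the third the purchases under 20 €.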

I’ll also show you how to add a date filter to your query, to get, for instance, the purchases under 20 € between 01/05/2016 and 01/06/2016, using the Lucene range syntax field:[from TO to]:

GET http://localhost:9200/shop1/purchase/_search?q=Cost:<20 AND saleDate:[2016-05-01 TO 2016-06-01]&size=1000&pretty=true

(In a real request, remember to URL-encode the spaces and brackets.)

OK; via the REST API, you can also submit a detailed query by sending a POST request, with the body containing your query. Let’s check out an example:

POST http://localhost:9200/shop1/purchase/_search

JSON body:

"query" : {
      "term" : { "seller" : "kimchy" }
}

You can also add a bunch of other key/value pairs to these queries, in order to retrieve the content you want with much more precision:

{
    "query" : {
        "bool" : {
            "must" : [
                { "term" : { "article_ref" : "003816724" } },
                { "term" : { "discount" : "yes" } }
            ]
        }
    },
    "from" : 0,
    "size" : 25,
    "sort" : [ { "saleDate" : { "order" : "desc" } } ]
}

Here we go a little deeper into querying. This query is typical of a pagination system; I’ll explain what it does and why.

When you have a lot of results, the best thing to do is to paginate them, because browsing lists of 25 elements is more convenient than displaying a 4867-line array.

In the query above, we apply a simple boolean condition to each element, to get all the sold articles referenced as 003816724 that were discounted. If a document matches, we retrieve it; if not, we don’t touch it.

  • must: contains the array of conditions that must all be fulfilled
  • from: the offset of the first result to return; 0 gives the first page, and for page n you would use (n - 1) × size (see the sketch just after this list)
  • size: the number of elements to be retrieved per page
  • sort: we sort by the sale date, ordered from newest to oldest
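For example, still with 25 results per page, fetching page 3 is just a matter of skipping the first 50 hits (a sketch using a match_all query, since here we want every document):

{
    "query" : { "match_all" : {} },
    "from" : 50,
    "size" : 25,
    "sort" : [ { "saleDate" : { "order" : "desc" } } ]
}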

You can also use wildcard queries to get the items that contain your expression. Say I want all the articles sold from a particular series: Cherokee! Let’s say a brand "Dummy" made 3 articles with this title: Sneakers Dummy Cherokee, Shirt Dummy Cherokee and Slimfit jean Dummy Cherokee.

I want to get all the articles of this series sold in a shop:

POST http://localhost:9200/shop1/purchase/_search

{
    "query" : {
        "wildcard" : {
            "article_name" : "*Cherokee*"
        }
    }
}

Add mappings for ElasticSearch accuracy when saving documents

When you add data into your ElasticSearch cluster, it tries to map the data to the correct value type as accurately as it can. If it sees a value like { "float" : 0.20 }, it will understand that this is a floating-point value (or a double, as you wish!) and assign the correct type when storing it.

But if you send { "float" : "0.20" }, it will map it to a string value, because of the quotes. This is where overriding the default mapping comes in: you can specify, for each index and each document type, a mapping structure that will be applied each time you add a document.
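By the way, you can check what ES inferred at any time by asking for the current mapping of a type:

GET http://localhost:9200/shop1/_mapping/purchase?pretty=true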

After creating your index, you just have to apply a mapping using the REST API (the simplest way), like this:

PUT http://localhost:9200/shop1/_mapping/purchase

{
    "purchase" : {
        "properties" : {
            "article_ref" : {
                "type" : "long"
            },
            "article_name" : {
                "type" : "string",
                "index" : "not_analyzed"
            },
            "price" : {
                "type" : "double"
            },
            "cost" : {
                "type" : "double"
            },
            "seller" : {
                "type" : "string",
                "index" : "not_analyzed"
            },
            "discount" : {
                "type" : "string"
            },
            "saleDate" : {
                "type" : "date"
            }
        }
    }
}

Analyzed or not analyzed? That is the question!

A little point about field processing in ElasticSearch. By default, when you insert a document containing string values into an index, the string fields will be analyzed by the ElasticSearch engine. You have to specify in your mapping structure that the fields you want kept untouched are "not_analyzed".

If you don’t do that, be prepared for a headache when searching on such a field: the stored value will not be compared as-is with your query; special characters are stripped and the string is converted to lower case.

So, as in the previous example, you just have to add "index" : "not_analyzed" to your field definition, and your string will no longer be re-processed during queries.
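To make the difference concrete: with the default analyzed behaviour, a value like "Sneakers Dummy Cherokee" is indexed as the lower-cased terms sneakers, dummy and cherokee, so an exact term query on the original value finds nothing; once article_name is mapped as not_analyzed, the very same query matches. A sketch:

POST http://localhost:9200/shop1/purchase/_search

{
    "query" : {
        "term" : { "article_name" : "Sneakers Dummy Cherokee" }
    }
}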

 

To be continued :)