28 April, 2020
Have you ever heard about Elasticsearch? It’s a really useful tool if you have to go through a massive amount of data and pick out only the information you need. And we’re talking about a situation where control/command+F isn’t an option. In this Elasticsearch tutorial, I’m going to show you how to get started with this search tool, how Elasticsearch works and how to use it properly.
What is Elasticsearch?
Elasticsearch is an open-source solution that makes full-text search super easy to handle. Under the hood, we have a NoSQL database which stores data in the form of JSON documents. Every entry added to this database is transformed by an indexing engine, which is what lets you search through an enormous amount of data.
Elasticsearch is very often called schema-less, but that’s only half true. You don’t have to specify the schema, because Elasticsearch can infer it for you. But there is a significant drawback.
The inferred field types can be wrong, and then we won’t be able to perform operations that are only available for specific types. The better option is to define the schema yourself. You’ll be sure that the engine indexes your data correctly, and you’ll be able to handle the data during a search according to your needs.
It is also a scalable and distributed solution. You can run it on a hundred nodes and the load will be spread evenly among them. Thanks to that, Elasticsearch is highly available and easily handles node outages. Of course, you don’t need a lot of machines to realise its full potential. You can easily run Elasticsearch on your local computer and (with a reasonable amount of data) it will be really fast.
How you access your data is also interesting. Elasticsearch gives you a very coherent REST API for working with the database. You can also use a dedicated client, e.g. the official Elasticsearch client for the Node.js environment.
If you prefer to search your data through a pretty UI (I totally understand, design is important!), you can use Kibana, which is part of the ELK stack. It provides a lot of additional tools to explore and visualise your data.
What does Elasticsearch consist of?
In Elasticsearch we deal with large volumes of data. To keep it organised, the data is held in one or more indices. Each index is a collection of data with similar characteristics. The index name is an identifier used to perform operations like searching, indexing, updating or inserting. Data in an index is spread across shards, which are searched independently. When you have a lot of data, you can spread it across multiple shards and then search through all of them in parallel. This solution makes searching Sonic-the-Hedgehog fast.
Notice that a document used to be addressed by both an index name and a type field. The type field was created to store several types of data in the same index. Unfortunately, the type field has been marked as deprecated since version 7.0. From version 8.0, APIs that accept types will be removed, because types proved to cause more problems than they solved.
That’s why in the next part of this article we will use a typeless approach to create and manage indices.
But let’s return to our indices. We’ve already said that they are kept in shards, but what exactly does that mean? A shard is a fully functional and independent index that can be hosted on any node in the cluster. In reality, it’s just one piece of an index that has been divided into smaller parts. We create shards to distribute data across more nodes and achieve high availability. What is more, each shard can have a replica on a different node, so your data is distributed (a replica shard is never allocated on the same node as the original shard). This gives you a guarantee that if one of the nodes stops working, your data will still be accessible on another node. Finally, distributing data between nodes reduces the load on each node.
Great, but what exactly is a node?
A node is a single server which is part of a cluster. A cluster is a collection of nodes which lets us search for data across all of them.
How does Elasticsearch work?
To run Elasticsearch locally, a Docker image is my go-to solution. In this article, I will also use a second tool, Kibana, which will help me browse the data. So let’s start by creating a simple docker-compose file, which will run these two services for us. To run it, we only need to execute the command docker-compose up -d.
After a short while, you should have Elasticsearch and Kibana running. So let’s go to http://localhost:5601/, where you should see the Kibana UI. I won’t go into details about all the options. For now, you only need Dev Tools. It is a tool like Postman: thanks to it, we can make requests to Elasticsearch. So let’s click on the third option from the bottom of the side menu. You should see two panes. On the left, you will enter your requests. On the right, you will see the response to the request you made.
Creating an index
To store data, we first need to create an index for all the documents. We will use a PUT request through Dev Tools. Let’s assume that we have a huge list of quotes which we need to store and search by phrase or author name. Let’s execute the command below to make a little space for our quotes.
To check if our index exists, we can list all the indices.
The last position in the result should be your new index called “quotes”.
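The two requests described above are short enough to sketch. The snippet below is only an illustration: console_request is a hypothetical helper of mine (not part of any Elasticsearch client) that renders a request in the form Dev Tools accepts, and I am assuming the typeless 7.x API, where an index is created with PUT and listed through the _cat API.

```python
import json


def console_request(method, path, body=None):
    """Render 'METHOD /path' plus an optional pretty-printed JSON body.

    Hypothetical helper: it only formats text that can be pasted into
    Kibana Dev Tools, it does not talk to Elasticsearch.
    """
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Create an index called "quotes" with default settings.
create_index = console_request("PUT", "/quotes")

# List all indices; "?v" asks the _cat API to include column headers.
list_indices = console_request("GET", "/_cat/indices?v")

print(create_index)
print(list_indices)
```

Pasting the two printed lines into the left pane of Dev Tools and running them one by one should be enough to follow along.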
Creating shards and mappings for an index
We have created an index, but wait a moment! We’ve been talking about shards before, so now let’s learn how to create them. The number of shards can only be defined at one moment: when we create an index, so we need to provide this value up front. The default number of shards is one since version 7.0 (it was five in earlier versions). In addition to defining shards, we can provide a number of replicas (the default is one). The following command shows you how to do it.
If you previously created an index, this request should return the following error: resource_already_exists_exception. Once you have created an index, with or without an explicitly defined number of shards, you won’t be able to modify that number. You can only update the value of number_of_replicas with the following request.
So before you create an index, you should consider how big it will be and choose the appropriate number of shards for it. Otherwise, to change this value later, you will be forced to reindex all the data. So to be able to create a quotes index with the appropriate number of shards and replicas, you need to remove the current one and create it again with the aforementioned PUT request.
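Here is a minimal sketch of the three requests this section talks about: removing the old index, creating it again with explicit settings, and changing the number of replicas afterwards. The values 3 and 2 are arbitrary numbers I picked for illustration, and console_request is a hypothetical formatting helper, not an Elasticsearch API.

```python
import json


def console_request(method, path, body=None):
    """Hypothetical helper: format a request for Kibana Dev Tools."""
    lines = [f"{method} {path}"]
    if body is not None:
        lines.append(json.dumps(body, indent=2))
    return "\n".join(lines)


# Shards are fixed at creation time; replicas can be changed later.
# 3 shards and 2 replicas are illustrative values only.
index_settings = {
    "settings": {
        "index": {"number_of_shards": 3, "number_of_replicas": 2}
    }
}

delete_index = console_request("DELETE", "/quotes")
create_index = console_request("PUT", "/quotes", index_settings)

# Updating the replica count on an existing index goes through _settings.
update_replicas = console_request(
    "PUT", "/quotes/_settings", {"index": {"number_of_replicas": 1}}
)

for request in (delete_index, create_index, update_replicas):
    print(request, end="\n\n")
```

On a single local node, replicas cannot be placed on another node, so the replica count matters more once a cluster has at least two nodes.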
When you have the index correctly configured, you can define mappings. Assume that you need three fields in it: author, content and year. In Elasticsearch there are a lot of types that define how a field will be handled by the search engine. For our purposes, we will set author and content as text. It means that we will use a full-text search on these fields. The last field (year) will be of integer type. You can read more about field types in the Elasticsearch documentation.
You also need to remember that once you’ve created a mapping, the type of an existing field cannot be changed. You’ll only be able to add new properties, because changing an existing type could invalidate data that is already indexed. If you need to change the mapping of a field, create a new index with the correct mapping and reindex your data into that index.
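The mapping described above could look roughly like this. It is a sketch that assumes the typeless _mapping endpoint from version 7.x and the three field names used in this article; the variable names are mine.

```python
import json

# Field types as described in the text: two full-text fields and one integer.
mapping = {
    "properties": {
        "author": {"type": "text"},
        "content": {"type": "text"},
        "year": {"type": "integer"},
    }
}

# Adding a mapping to an index that already exists goes through _mapping.
mapping_request = "PUT /quotes/_mapping\n" + json.dumps(mapping, indent=2)

print(mapping_request)
```

The same properties object can also be sent under a mappings key when the index is created, which saves one request.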
I’ve already mentioned reindexing twice, so I think this is a good moment to say a little more about it. We’ve created the quotes index. We’ve kept data about the author as a text type, but then we decided that keyword would be a much better type for keeping authors. Unfortunately, the index isn’t empty, because we have already put some data there. The way to change the type of this field is reindexing. Firstly, we need to create a new index with the correct mappings, so let’s look at the example below (before you ask: I omitted the shards and replicas configuration there):
After that, we need to use reindex to copy data to the new index.
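Both steps can be sketched as follows. The name quotes_v2 is a hypothetical name for the new index (the article doesn’t fix one), and the rest assumes the standard _reindex API with a source and a dest index.

```python
import json

# Step 1: a new index where "author" is a keyword instead of text.
# "quotes_v2" is a hypothetical index name chosen for this sketch.
new_index_body = {
    "mappings": {
        "properties": {
            "author": {"type": "keyword"},
            "content": {"type": "text"},
            "year": {"type": "integer"},
        }
    }
}
create_new_index = "PUT /quotes_v2\n" + json.dumps(new_index_body, indent=2)

# Step 2: copy every document from the old index into the new one.
reindex_body = {
    "source": {"index": "quotes"},
    "dest": {"index": "quotes_v2"},
}
reindex = "POST /_reindex\n" + json.dumps(reindex_body, indent=2)

print(create_new_index)
print()
print(reindex)
```

After the copy finishes, searches have to point at the new index name, since the old index still holds the old mapping.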
Okay, but let’s come back to our quotes index. We have a correctly configured index, but an index without data is useless. So we need to add some. We need to make a POST request to the name of the index with the /_doc suffix, like below.
Here’s a response after a successful request.
In a request like the one above, we get an autogenerated identifier. It means that Elasticsearch sets an ID for the document we insert. Of course, you can specify the ID yourself. Just make a request to the endpoint POST /quotes/_doc/<id> with your ID.
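As a sketch, here are both variants of the indexing request. The document is placeholder data that I made up for illustration (it is not one of the quotes used in this article), and the ID 1 is arbitrary.

```python
import json

# Placeholder document, invented for this sketch.
document = {
    "author": "Example Author",
    "content": "Success is built one small step at a time.",
    "year": 1899,
}
body = json.dumps(document, indent=2)

# Variant 1: let Elasticsearch generate the document ID.
index_with_auto_id = "POST /quotes/_doc\n" + body

# Variant 2: choose the ID yourself (1 is an arbitrary example).
index_with_own_id = "POST /quotes/_doc/1\n" + body

print(index_with_auto_id)
print()
print(index_with_own_id)
```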
To update existing data, we need to make a simple PUT request to the same document. For example, we can add a new field to the document, like the country we added below. Keep in mind that a PUT to /quotes/_doc/<id> replaces the whole document, so the request body has to contain the existing fields as well.
The result field in the response will tell you that the update was successful.
What is interesting, each time a document is updated, its _version counter is incremented.
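A small sketch of adding the country field, under the assumption that the document lives under the hypothetical ID 1 and holds placeholder data. It shows the full replacement with PUT and, as an alternative I’m adding here, the _update endpoint, which merges a partial document into the stored one.

```python
import json

# Placeholder document assumed to be stored under the hypothetical ID 1.
document = {
    "author": "Example Author",
    "content": "Success is built one small step at a time.",
    "year": 1899,
}

# PUT replaces the stored document, so we send the old fields plus the new one.
updated = {**document, "country": "Poland"}
full_replace = "PUT /quotes/_doc/1\n" + json.dumps(updated, indent=2)

# Alternative: _update only needs the fields that change.
partial = {"doc": {"country": "Poland"}}
partial_update = "POST /quotes/_update/1\n" + json.dumps(partial, indent=2)

print(full_replace)
print()
print(partial_update)
```

Either way, the _version value in the response should be one higher than before.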
To create a query we can use a full Query Domain Specific Language (Query DSL) based on JSON. It provides two types of query clauses:
- Leaf query clauses – in this type we use queries like match to find a particular value in a particular field.
- Compound query clauses – they wrap leaf query clauses to combine multiple queries.
At first, let’s try to display all the results from our quotes index. Use a match_all query for it. As we know, a query is JSON, so we need to start with a curly bracket, then define a query object and declare the query inside, like below.
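A minimal sketch of that request, assuming the standard _search endpoint on the quotes index:

```python
import json

# The outer object is the request body; "query" holds the actual query.
match_all_body = {"query": {"match_all": {}}}

search_request = "GET /quotes/_search\n" + json.dumps(match_all_body, indent=2)

print(search_request)
```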
Here is my response. I’ve added some new quotes, to make it more interesting. I will also use this data to show you a different type of query.
Can you see anything interesting in this response? Take a look at the _shards property: we can check how many shards have been scanned for our data, and how many of them were successful, skipped or failed. In the hits object, we can see the number of results and how accurate that count is. After that, there’s max_score, the highest score among the results. In this query it is equal to 1.0, but in other queries the value can be different. We will talk about scores in a later section. The last part of our response is an array of results.
So let’s assume that we want to show only results containing the word “success”. Instead of the match_all query, which we used in the previous example, this time we will use a must clause, which is part of a bool query. In my case, this query returns only two results.
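Such a query could look like the sketch below. I am assuming that the must clause wraps a match query on the content field, which is how the later examples in this article are described.

```python
import json

# A compound bool query whose must clause holds one leaf match query.
success_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "success"}}
            ]
        }
    }
}

search_request = "GET /quotes/_search\n" + json.dumps(success_body, indent=2)

print(search_request)
```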
So maybe something more complicated? Let’s display only the authors of quotes from 1900 or earlier that include the word “success”. This time we will get a much bigger query.
Fear not, we’ll divide this query into baby steps:
- The query parameter indicates query context.
- The match clauses are used in query context, which means that they are used to score how well each document matches the query.
- In the range query, we specify that we are looking for documents from 1900 or earlier.
- In the match query, we indicate that the result should include the word “success”.
- In the _source property, we define which fields the query should return. Here, we can pass one field as a string, or an array of fields.
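The pieces listed above could be assembled like this. It is my reconstruction under assumptions, not necessarily the exact query used in this article: I put both clauses under must and use lte for “1900 or earlier”.

```python
import json

compound_body = {
    # Return only the author field of each matching document.
    "_source": ["author"],
    "query": {
        "bool": {
            "must": [
                # The quote has to contain the word "success".
                {"match": {"content": "success"}},
                # The year has to be 1900 or earlier.
                {"range": {"year": {"lte": 1900}}},
            ]
        }
    },
}

search_request = "GET /quotes/_search\n" + json.dumps(compound_body, indent=2)

print(search_request)
```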
And we get the following results:
Very often, when we work on a website that displays a huge amount of data we need a pagination mechanism. In Elasticsearch it is pretty simple. For this purpose, we can use two properties:
- from – specifies the offset of the first result to return, i.e. how many matching records should be skipped,
- size – defines how many results should be returned.
Thanks to this, we are able to pick only a specific part of the records and return them to the user.
There’s one teeny-tiny problem with pagination when we have a list with more than 10 000 results: from + size cannot be bigger than index.max_result_window, which by default is set to 10 000. In this situation, we have two choices: use the scroll API, which can be very slow for a real-time user, or use search_after, which is better, but we need to know the last result from the previous page. You can read more about it on the official website.
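To make the arithmetic concrete, here is a sketch of both approaches. The function names are mine, and quote_id is a hypothetical unique keyword field that I’m assuming as a tiebreaker for sorting; it is not part of the mapping used in this article.

```python
import json

MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_query(page, size):
    """Build a from/size search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    if offset + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"from": offset, "size": size, "query": {"match_all": {}}}


def next_page_query(last_sort_values, size):
    """Build a search_after body from the sort values of the previous page's last hit."""
    return {
        "size": size,
        "query": {"match_all": {}},
        # "quote_id" is a hypothetical unique field used to break ties.
        "sort": [{"year": "asc"}, {"quote_id": "asc"}],
        "search_after": list(last_sort_values),
    }


third_page = page_query(page=3, size=20)
print("GET /quotes/_search")
print(json.dumps(third_page, indent=2))
```

With search_after, the values passed in come from the sort array of the last hit on the previous page, and the sort definition has to stay the same between pages.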
What’s the score?
In the query results array, we can notice that each result has its own _score property. It is a relevance score, which measures how well each document matches a query. The higher this value is, the more relevant the document. Each query type can calculate this score differently, and the calculation also depends on whether the query clause runs in a query or a filter context.
In the next query, we will look for a phrase: “not to be a success”. To remind you, the content column is the “text” type, so Elasticsearch will perform a full-text search on it. For each record, it will scan the field we are looking for to determine how closely it matches the phrase we want to find.
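A sketch of such a query, assuming a plain match query on the content field. A match query scores documents that contain only some of the words too, which fits the results discussed next.

```python
import json

# Full-text search for the whole phrase in the "content" field.
phrase_body = {
    "query": {
        "match": {"content": "not to be a success"}
    }
}

search_request = "GET /quotes/_search\n" + json.dumps(phrase_body, indent=2)

print(search_request)
```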
Here’s the response:
As you can see, Elasticsearch gives each result a different score. The first result has the highest score because the content property is the most similar to the searched phrase. What’s more, the second result also has the word “success” (that was in our searched phrase) but Elasticsearch gave it a lower score.
The higher the score, the better the match.
It’s related to the algorithm that calculates how textually similar each document is to the query.
Above, we have an example of how a query context works. To sum up, it answers the question of how well a document matches the query clause. In a filter context, we get a yes/no answer to the question “Does this document match the query clause?” (that’s why we don’t get a meaningful score). To find quotes by Albert Einstein from between the years 1800 and 1900, we can simply use the filter presented below.
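A filter like that could be sketched as follows. The exact clauses are my assumption: since author is a text field in our mapping, I use a match query with the and operator so that both words of the name have to be present, plus a range on year.

```python
import json

filter_body = {
    "query": {
        "bool": {
            # Clauses under "filter" only answer yes/no and are not scored.
            "filter": [
                {"match": {"author": {"query": "Albert Einstein", "operator": "and"}}},
                {"range": {"year": {"gte": 1800, "lte": 1900}}},
            ]
        }
    }
}

search_request = "GET /quotes/_search\n" + json.dumps(filter_body, indent=2)

print(search_request)
```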
The score in the response will be equal to 0.0.
What’s more, we can easily combine the query and filter contexts. The benefit of using the filter context is that filters can be cached in the “node query cache”, which visibly improves performance.
I hope you enjoyed this Elasticsearch tutorial! Elasticsearch is truly a very powerful solution for saving and analysing data. Thanks to its scalability, it can handle really heavy loads of data. You can use it as a read database in your system to provide a very reliable mechanism to retrieve data.
In this tutorial, I’ve shown you only a small part of the features this tool has. For more information, I encourage you to read its documentation, which provides a lot of useful knowledge. Query DSL offers a lot of tools for building very specialised queries that will help you find the right data. I highly recommend that you try Elasticsearch and explore its other features.
I hope this little Elasticsearch tutorial will be your doorway to the exciting world of Elasticsearch. Good luck!