This is a continuation of our previous posts about how we built a Google-like search engine for our news-article database. Before this solution there were also some failed attempts, such as using the wrong type of database – if you would like to learn from our mistakes, check out our previous post. We gave up on the MySQL solution because queries took too long to complete, and found a much better alternative in Elasticsearch.
Introduction to Elasticsearch
Elasticsearch is a distributed search engine, which is why it is very scalable. If your database gets too big, you can simply run it on more than one server; Elasticsearch takes care of everything else in the background. You won't have to worry about anything more than assigning a master server. It even indexes every field by default, without any setup required. The product was developed with the mantra that "everything should have a good default", which makes it really easy for developers to get started. It also has official client libraries for many different languages, like Node.js, Python, PHP, C# …
As an example, consider an Elasticsearch query that returns the first 10 articles containing the phrase "foo bar" and tagged with tag id 15.
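A query of that shape could be built like this in Python. This is only a sketch: the index name ("articles") and the field names ("content", "tags.id") are assumptions, and in your cluster they would have to match your own mapping.

```python
# A sketch of the query described above. The field names "content" and
# "tags.id" are hypothetical -- replace them with the fields in your mapping.
query = {
    "from": 0,    # start at the first result
    "size": 10,   # return 10 hits
    "query": {
        "bool": {
            "must": [
                # full-text phrase match on the article body
                {"match_phrase": {"content": "foo bar"}},
                # exact match on the tag id
                {"term": {"tags.id": 15}},
            ]
        }
    },
}

print(query["query"]["bool"]["must"])
```

With the official Python client this body would then be sent to the cluster with a search call against the (assumed) "articles" index.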
Distributed search and aggregation
Think of a query that returns more than a few million results when you type it into Google, and then try to get to the 100th or 1000th page of results.
You won't be able to reach even the 100th page, because Google won't let you – and there is a good reason for that. Google search is so fast because it is a distributed search engine, and distributed systems have certain limitations.
The advantage of distributed systems is distributed processing power: you can increase the speed of the search engine just by adding new servers to the cluster and splitting the documents across them. Each server gets an equal portion of the documents to process (for example, with 1 million documents and a thousand servers, each server only has to process one thousand documents instead of 1 million). This way you can easily double or triple the performance of the whole cluster.
But distributed systems also have a disadvantage when performing aggregation operations, because communication between servers is relatively slow (examples of aggregation operations are associative operations over the whole set of documents: sum, product, max, min, avg …). Getting the top 10 results for a given query is also an aggregation. On a distributed system, a search for the top 10 results looks like this:
first the master server would receive the query
then it would forward the query to all its child servers
all the child servers would calculate their local top 10 results by ordering all the articles in their local database according to the query
all the child servers would send their local top 10 results to the master server
the master server would collect these results and calculate the global top 10 results by performing a merge sort on the collection of all these local top 10 results
finally the master server would send the global top 10 results to the client
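The merge step at the master server can be sketched in a few lines of Python. This is a simplified, single-process illustration – the shard results, scores and article ids below are made up:

```python
import heapq

# Hypothetical local top-N lists from three child servers, each already
# sorted by descending relevance score: (score, article_id) pairs.
shard_results = [
    [(0.98, "a1"), (0.95, "a4"), (0.80, "a7")],
    [(0.97, "b2"), (0.60, "b5")],
    [(0.99, "c3"), (0.90, "c6"), (0.85, "c8")],
]

# The master merges the pre-sorted local lists (a merge sort over the
# shard results) and keeps only the global top 10.
merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
global_top10 = list(merged)[:10]

print([article for _, article in global_top10])
```

Because each local list arrives pre-sorted, the master never has to re-sort everything – it only merges, which is what keeps the top-10 case fast.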
This aggregation is very fast for the top 10 results, but it gets slower when we want to see the 100th or 1000th page of results, because each child server has to calculate its first 1000 pages of results, and the master server then has to aggregate all these partial results and again calculate the 1000th page. If we use a tree structure for our server cluster, the time complexity of this problem is:
O(log_d(S)) – to distribute the query
O(N_L * log(i)) – to calculate the local results (where N_L = N_G/S)
O(log_d(S) * (i * log(d) + i * d)) – to aggregate the results (i * d is the time complexity of the communication between a parent server and its child servers)
S – number of servers
d – number of child servers of each parent server
N_G – global number of documents
N_L – local number of documents on each server, N_L = N_G/S
i – page index (we assume that i < N_L)
If we assume that the infrastructure parameter d (the number of child servers per parent server) is a constant, the time complexity becomes O(log(S) + N_G/S * log(i) + log(S) * i). If we further assume that the global number of documents and the server count are fixed, this reduces to O(log(i) + i) = O(i). This means that the query time grows linearly with the page index. The main reason it does not grow logarithmically is the communication between servers and the resulting limitations on aggregation operations; these limitations are usually the main constraints of distributed systems. That's why Google doesn't let you see the 100th or 1000th page of results.
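A quick back-of-the-envelope calculation (plain Python, with made-up cluster numbers) shows where the linear growth comes from: to serve page i, every child server must produce its local top i * page_size hits, because any of them could end up on the requested page, and the master has to merge all of them:

```python
def docs_merged_at_master(page_index, page_size=10, servers=1000):
    """Documents the master must merge to serve the given page.

    Hypothetical cluster: 'servers' child servers, 'page_size' hits per page.
    Each child returns its local top (page_index * page_size) hits.
    """
    per_server = page_index * page_size
    return per_server * servers

print(docs_merged_at_master(1))     # page 1
print(docs_merged_at_master(1000))  # page 1000
```

For page 1 the master merges 10,000 documents; for page 1000 it merges 10,000,000 – a thousand times more work for a page a thousand positions deeper, i.e. linear in the page index.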
Elasticsearch is a very fast solution that returns query results in under a second even with millions of documents. We encourage you to try it out for yourself and build your own search engine with it – see Elasticsearch: Getting started. Before installing, you should check out their support matrix.
Search engine 102 – part II · Signapps · 2017-02-02