Search engine 102 – part II

This is a continuation of our previous posts about how we built a Google-like search engine for our news article database. The solution was preceded by some failed attempts, such as using the wrong type of database – if you would like to learn from our mistakes, check our previous post. We gave up on the MySQL solution because queries took too long to complete and found a much better fit in Elasticsearch.

Introduction to Elasticsearch

Elasticsearch is a distributed search engine, which is what makes it so scalable. If your database gets too big, you can simply run it on more than one server and Elasticsearch will take care of everything else in the background; you won’t have to worry about anything more than assigning a master server. It even indexes every field by default, with no setup required. The product was developed around the mantra that “everything should have a good default”, which makes it really easy for developers to get started. There are also official client libraries for many languages, such as Node.js, Python, PHP, C# …
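
For example, connecting to a locally running Elasticsearch instance with the official Python client could look roughly like this (a minimal sketch; the URL and single-node setup are assumptions, so adjust them to your environment):

# Minimal connection sketch using the official Python client (pip install elasticsearch).
# The URL below assumes a default, local single-node installation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Prints basic cluster information if the connection works.
print(es.info())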

Score calculation

Elasticsearch also has built-in (no setup required) score calculation for each article, and the results of a query are ordered by this score. The score of an article tells us how well that article matched the given query. If you want to know more about how scoring works in Elasticsearch, we suggest the following articles: “How scoring works in Elasticsearch?”, “What is relevance?” and “Theory behind relevance scoring”.

Inserting an article into the Elasticsearch database is as easy as:


PUT my_database_name/article/1
{
    "text" : "Text of the article/document.",
    "tag_ids" : "45 2 3 89 9 ...",
    "date" : "2009-11-15T14:12:12",
}

text – The text can be preprocessed, lemmatized or stemmed with tools like nltk, or you can use Elasticsearch’s built-in stemming functionality.

tag_ids – You can create a string of tag ids concatenated with spaces. This will allow you to filter the articles by tags.

date – This can be the date when the article was published. Instead of using the score when ordering the results of a query, you can order by any field that is comparable (has a defined order).
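
If you prefer the official Python client over raw REST calls, the same insert could look roughly like this (a sketch assuming a 7.x-style client; the index name and field values mirror the example above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index (insert) one article document with id 1, mirroring the PUT request above.
es.index(
    index="my_database_name",
    id=1,
    body={
        "text": "Text of the article/document.",
        "tag_ids": "45 2 3 89 9",
        "date": "2009-11-15T14:12:12",
    },
)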

Querying the Elasticsearch database is as easy as:


GET my_database_name/article/_search
{
    "from": 0,
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "text": "foo bar"
                    }
                },
                {
                    "match": {
                        "tag_ids": "15"
                    }
                }
            ]
        }
    }
}

This Elasticsearch query will return the first 10 articles (ordered by score) whose text matches “foo bar” and that are tagged with id 15. Note that match is a full-text query, not an exact phrase match; for an exact phrase you would use match_phrase instead.
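
The same query can be issued through the Python client roughly like this (again a hedged sketch; the index and field names follow the examples above, and hits come back ordered by score):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the first page (10 hits) of articles matching "foo bar" and tagged with id 15.
response = es.search(
    index="my_database_name",
    body={
        "from": 0,
        "size": 10,
        "query": {
            "bool": {
                "must": [
                    {"match": {"text": "foo bar"}},
                    {"match": {"tag_ids": "15"}},
                ]
            }
        },
    },
)

# Each hit carries the relevance score Elasticsearch calculated for it.
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])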

Distributed search and aggregation

Think of a query that returns more than a few million results when you type it into Google, and then try to get to the 100th or 1000th page of results.

We think you won’t be able to reach even the 100th page, because Google won’t let you, and there is a good reason for it. Google search is so fast because it is a distributed search engine, and distributed systems come with certain limitations.

The advantage of distributed systems is distributed processing power: you can increase the speed of the search engine simply by adding new servers to the cluster and splitting the documents across them. Each server gets an equal share of the documents to process (for example, with 1 million documents and a thousand servers, each server only has to process one thousand documents instead of 1 million). This way you can easily double or triple the performance of the whole cluster.

But distributed systems also have a disadvantage when performing aggregation operations, because communication between servers is relatively slow (aggregation operations are, for example, associative operations over the whole set of documents: sum, product, max, min, avg …). Getting the top 10 results for a given query is also an aggregation. Performing a search for the top 10 results on a distributed system looks like this (a simplified sketch of the final merge step follows the list):

  • first the master server would receive the query
  • then it would forward the query to all its child servers
  • all the child servers would calculate their local top 10 results by ordering all the articles in their local database according to the query
  • all the child servers would send their local top 10 results to the master server
  • the master server would collect these results and compute the global top 10 by performing a merge sort on all of the local top 10 lists
  • finally the master server would send the global top 10 results to the client
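
To make the merge step concrete, here is a small, purely illustrative Python sketch (not Elasticsearch code) of how a master could compute the global top 10 from the locally sorted top 10 lists it receives from its children; the scores and ids are made up:

import heapq

# Each child server returns its local top 10 as (score, article_id) pairs,
# already sorted by descending score (truncated here for brevity).
local_top10_lists = [
    [(9.2, "a17"), (8.1, "a42"), (7.7, "a03")],  # child 1
    [(9.9, "b05"), (6.4, "b11"), (5.0, "b29")],  # child 2
    [(8.8, "c91"), (8.0, "c12"), (4.2, "c77")],  # child 3
]

# heapq.merge lazily merge-sorts the already sorted lists; the master
# only needs the first 10 entries of the merged stream.
global_top10 = list(heapq.merge(*local_top10_lists,
                                key=lambda hit: hit[0],
                                reverse=True))[:10]

for score, article_id in global_top10:
    print(score, article_id)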


This aggregation is very fast for the top 10 results, but it gets slower when we want to see the 100th or 1000th page, because each child server has to calculate all 1000 pages of its local results and the master server then has to aggregate all of them and again compute the 1000th page globally. If we use a tree structure for our server cluster, the time complexity of this problem is:

  • O(log_d(S)) – to distribute the query
  • O(N_L * log(i)) – to calculate the local results (where N_L = N_G/S)
  • O(log_d(S) * (i * log(d) + i * d)) – to aggregate the results (i * d is the time complexity of the communication between a parent and its child servers)

where:

  • S – number of servers
  • d – number of child servers for each parent server
  • N_G – global number of documents
  • N_L – local number of documents on each server, N_L = N_G/S
  • i – page index (we assume that i < N_L)

If we assume that the infrastructure parameter d (the number of child servers per parent server) is a constant, the time complexity becomes O(log(S) + N_G/S * log(i) + log(S) * i). If we further assume that the global number of documents and the server count are fixed, this reduces to O(log(i) + i) = O(i). In other words, query time grows linearly with the page index, and the main reason it does not grow logarithmically is the communication between servers and the limitations of aggregation operations. These limitations are usually the main constraints of distributed systems, and that is why Google doesn’t let you see the 100th or 1000th page of results.
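
To get a feel for this, here is a tiny, purely illustrative calculation (all constants are made up) that plugs concrete values into the terms above and shows how the cost grows roughly linearly with the page index i:

import math

S = 1000         # number of servers
d = 10           # child servers per parent, so the tree depth is log_d(S) = 3
N_G = 10**6      # global number of documents
N_L = N_G // S   # local number of documents per server

def query_cost(i):
    """Rough operation count for fetching page i (illustrative only)."""
    depth = math.log(S, d)
    distribute = depth                                # O(log_d(S))
    local = N_L * math.log(max(i, 2))                 # O(N_L * log(i))
    aggregate = depth * (i * math.log(d) + i * d)     # O(log_d(S) * (i * log(d) + i * d))
    return distribute + local + aggregate

for page in (1, 10, 100, 1000):
    print(page, round(query_cost(page)))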

Conclusion

Elasticsearch is a very fast solution that returns query results in under a second even with millions of documents. We encourage you to try it out and build your own search engine with it – see Elasticsearch: Getting started. Before installing, you should also check out their support matrix.
