Search engine 102 – part I

For one of our projects, our task was to build a fast search engine for a database of text articles, where each article is tagged with zero or more tags (tag examples: politics, sports, finance, business, …). In the previous post, titled “Making your own search engine 101”, we discussed the theory behind it; in this one we want to share what we learned while actually implementing the search engine.

Scoring the articles

The search engine should be able to find the most relevant articles for a given query. The general idea behind scoring articles so they can be sorted by relevance is best explained with an example: suppose we search with the three-word query “foo bar baz”. We are looking for the following order of results:

[Figure: example article scores]

Failed attempts

The first technology we tried was MySQL. We attempted to build a good proximity index that would allow fast queries. A proximity index normally only stores the position of each word within an article. We modified it slightly so that we could write SQL queries that return the correct results in as few round trips as possible, with as little post-processing as possible. We decided to store each pair of words with no more than two words in between (represented by the constant MAX_GAP = 2). The index consists of 7 columns. For the text “aaa bbb aaa foo ccc bbb bar ddd aaa” and MAX_GAP = 2, we would index it like this:

[Table: proximity index rows for the example text]
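The indexing scheme described above can be sketched in a few lines of Python. Note that this is an illustrative sketch, not the original implementation: keyword ids are assigned here in order of first appearance, and gap is taken to mean the number of words between the pair.

```python
MAX_GAP = 2  # store pairs with at most this many words in between

def build_index(article_id, text):
    """Build proximity-index rows: one row per pair of words that have
    no more than MAX_GAP words between them."""
    words = text.split()
    # Assign keyword ids by first appearance (an assumption for illustration).
    ids = {}
    for w in words:
        ids.setdefault(w, len(ids) + 1)
    rows = []
    for i, w in enumerate(words):
        # Pair each word with up to MAX_GAP + 1 following words.
        for j in range(i + 1, min(i + MAX_GAP + 2, len(words))):
            rows.append({
                "article_id": article_id,
                "keyword_id": ids[w], "position": i + 1,
                "neighbour_keyword_id": ids[words[j]],
                "neighbour_position": j + 1,
                "gap": j - i - 1,  # words between the pair
            })
    return rows

rows = build_index(1, "aaa bbb aaa foo ccc bbb bar ddd aaa")
print(len(rows))  # 21 rows for this 9-word article with MAX_GAP = 2
```

Together with an auto-increment row id, each row carries the 7 columns of the index.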

If we now queried this index with the two-word query “foo bar”, the SQL query would look like this:

SELECT DISTINCT article_id FROM proximity_index
WHERE ((keyword_id=3 AND neighbour_keyword_id=5)
OR (keyword_id=5 AND neighbour_keyword_id=3))
AND gap <= 1;

This query would match the row with id = 12, and the result would be article_id = 1. The result is the list of ids of articles that contain both “foo” and “bar” with no more than 1 word of gap between them.
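The whole round trip can be tried with an in-memory SQLite database. This is a sketch under one assumption: gap here counts the words between the pair, and with that convention the pair (foo, bar) in the example text has gap = 2, so the sketch bounds the query by MAX_GAP rather than 1.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE proximity_index (
    id INTEGER PRIMARY KEY, article_id INT, keyword_id INT, position INT,
    neighbour_keyword_id INT, neighbour_position INT, gap INT)""")

MAX_GAP = 2
words = "aaa bbb aaa foo ccc bbb bar ddd aaa".split()
ids = {}
for w in words:
    ids.setdefault(w, len(ids) + 1)  # aaa=1, bbb=2, foo=3, ccc=4, bar=5, ddd=6
for i, w in enumerate(words):
    for j in range(i + 1, min(i + MAX_GAP + 2, len(words))):
        con.execute(
            "INSERT INTO proximity_index (article_id, keyword_id, position,"
            " neighbour_keyword_id, neighbour_position, gap)"
            " VALUES (1, ?, ?, ?, ?, ?)",
            (ids[w], i + 1, ids[words[j]], j + 1, j - i - 1))

# Find articles where "foo" (id 3) and "bar" (id 5) appear close together.
cur = con.execute("""SELECT DISTINCT article_id FROM proximity_index
    WHERE ((keyword_id=3 AND neighbour_keyword_id=5)
        OR (keyword_id=5 AND neighbour_keyword_id=3))
      AND gap <= ?""", (MAX_GAP,))
print(cur.fetchall())  # [(1,)]
```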

The query gets more complex, and also requires some post-processing, for queries with more than two words. But we won’t go into the details of that, because this whole approach turned out to be a failed attempt anyway.

The problem with this solution was that the index grew too big to be stored in RAM: it would have millions of rows for just thousands of articles. The exact number of rows used for each article is max(0, WC − (MAX_GAP + 1)) * (MAX_GAP + 1) + T_(min(WC, MAX_GAP + 1) − 1), where WC is the number of words in the article and T_n is the n-th triangular number, T_n = n*(n+1)/2. In terms of Big O notation this is O(WC * MAX_GAP) rows per article. With that many rows the index no longer fit in memory, MySQL had to query the files on disk, and the queries ran slow.
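The row-count formula can be sanity-checked against a brute-force count, using the same WC and MAX_GAP symbols as above:

```python
def rows_formula(wc, max_gap):
    # max(0, WC - (MAX_GAP + 1)) * (MAX_GAP + 1) + T_(min(WC, MAX_GAP + 1) - 1)
    tri = lambda n: n * (n + 1) // 2  # n-th triangular number
    return (max(0, wc - (max_gap + 1)) * (max_gap + 1)
            + tri(min(wc, max_gap + 1) - 1))

def rows_brute(wc, max_gap):
    # Each word is paired with at most MAX_GAP + 1 following words.
    return sum(min(max_gap + 1, wc - 1 - i) for i in range(wc))

# The formula agrees with the brute-force count for a range of inputs.
for wc in range(50):
    for max_gap in range(6):
        assert rows_formula(wc, max_gap) == rows_brute(wc, max_gap)

print(rows_formula(9, 2))  # 21 rows for the 9-word example article
```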

We gave up on this MySQL solution because queries took too long to complete. We were already considering writing a custom search engine to replace the MySQL database when we found an existing open-source solution that did exactly what we wanted: Elasticsearch. It turned out to be by far the fastest solution we tried. Find out more about it in our next post, Search engine 102 – part II.
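For proximity search of the kind described above, Elasticsearch’s query DSL offers a `match_phrase` query with a `slop` parameter that controls how far apart the terms may be. A query roughly equivalent to the two-word example might look like this (the index name `articles` and field name `body` are made up for illustration):

```json
GET /articles/_search
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "foo bar",
        "slop": 1
      }
    }
  }
}
```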
