Search engine 102 – part II

This is a continuation of our previous posts about how we built a Google-like search engine for our database of news articles. Before arriving at this solution we also made some failed attempts, like using the wrong type of database – if you would like to learn from our mistakes, check out our previous post. We gave up on the MySQL solution because queries took too long to complete and found a much better solution in Elasticsearch.

Introduction to Elasticsearch

Elasticsearch is a distributed search engine, which is what makes it so scalable. If your database gets too big, you can simply run it on more than one server and Elasticsearch will take care of everything else in the background. You won’t have to worry about anything more than assigning a master server. It even indexes every field by default, without any setup required. The product was developed under the mantra that “everything should have a good default”, which makes it really easy for developers to get started. It also has official client libraries for many different languages, like Node.js, Python, PHP, C# …

Score calculation

Elasticsearch also has a built-in (no setup required) calculation of a score for each article, by which the results of a query are ordered. The score of an article tells us how well the article matches the given query. If you want to know more about how scoring works in Elasticsearch, we suggest the following articles: “How scoring works in Elasticsearch?”, “What is relevance?” and “Theory behind relevance scoring”.

Inserting an article into the Elasticsearch database is as easy as:


PUT my_database_name/article/1
{
    "text" : "Text of the article/document.",
    "tag_ids" : "45 2 3 89 9 ...",
    "date" : "2009-11-15T14:12:12"
}

text – The text can be preprocessed, lemmatized or stemmed with tools like NLTK, or you can even use Elasticsearch’s built-in stemming functionality.

tag_ids – You can create a string of tag ids concatenated with spaces. This will allow you to filter the articles by tags.

date – This can be the date when the article was published. Instead of using the score when ordering the results of a query you can use any of the fields that are comparable (have a defined order).
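
If you would rather do this from code, the same document can be inserted through one of the official clients. Below is a minimal sketch using the Python client; it assumes a recent version of the elasticsearch-py package and a single-node cluster on localhost (older client versions used a body parameter instead of document):

from elasticsearch import Elasticsearch

# Connect to a local single-node cluster (adjust the URL for your setup).
es = Elasticsearch("http://localhost:9200")

# Index an article; Elasticsearch creates the index and the field
# mappings on the fly thanks to its defaults.
es.index(
    index="my_database_name",
    id=1,
    document={
        "text": "Text of the article/document.",
        "tag_ids": "45 2 3 89 9",
        "date": "2009-11-15T14:12:12",
    },
)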

Querying the Elasticsearch database is as easy as:


GET my_database_name/article/_search
{
    "from": 0,
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "text": "foo bar"
                    }
                },
                {
                    "match": {
                        "tag_ids": "15"
                    }
                }
            ]
        }
    }
}

This Elasticsearch query will return the first 10 articles that match the text “foo bar” and have a tag with id 15. (Note that by default a match query matches the individual words; to require the exact phrase “foo bar” you would use a match_phrase query instead.)
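
The same query can be sent through the Python client as well. Again just a sketch, reusing the es connection from the insertion example above:

# Ask for the first page (10 hits) of articles matching the query.
response = es.search(
    index="my_database_name",
    from_=0,
    size=10,
    query={
        "bool": {
            "must": [
                {"match": {"text": "foo bar"}},
                {"match": {"tag_ids": "15"}},
            ]
        }
    },
)

# Hits come back ordered by score, highest first.
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["date"])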

Distributed search and aggregation

Think of a query that will return more than a few million results when you type it into Google and then try to get to the 100-th or 1000-th page of results.

We think that you won’t be able to reach even the 100-th page because Google won’t let you and there is a good reason for it. The reason why Google search is so fast is that it is a distributed search engine and distributed systems have certain limitations.

The advantage of distributed systems is their distributed processing power: you can increase the speed of the search engine just by adding new servers to the cluster and splitting the documents across them. Each server gets an equal share of the documents to process (for example: with 1 million documents and a thousand servers, each server only has to process one thousand documents instead of 1 million). This way you can easily double or triple the performance of the whole cluster.

But distributed systems also have a disadvantage when performing aggregation operations, because the communication between servers is relatively slow (examples of aggregation operations are associative operations over the whole set of documents, like sum, product, max, min, avg …). Getting the top 10 results for a given query is also an aggregation. Performing a search for the top 10 results on a distributed system would look like this:

  • first the master server would receive the query
  • then it would forward the query to all its child servers
  • all the child servers would calculate their local top 10 results by ordering all the articles in their local database according to the query
  • all the child servers would send their local top 10 results to the master server
  • the master server would collect these results and calculate the global top 10 results by performing a merge sort on the collection of all these local top 10 results
  • finally the master server would send the global top 10 results to the client
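
The sketch below mimics this scatter-gather flow on a single machine, with plain Python lists of (score, article_id) pairs standing in for the child servers’ local databases. It is only a toy illustration of the idea, not how Elasticsearch implements it:

import heapq

def local_top_k(shard, k):
    # Each child server orders its local articles and keeps the best k.
    return heapq.nlargest(k, shard)

def global_top_k(shards, k):
    # The master server merges the local top k lists into the global top k.
    local_results = [local_top_k(shard, k) for shard in shards]
    return heapq.nlargest(k, (hit for result in local_results for hit in result))

shards = [
    [(0.91, "a1"), (0.40, "a2"), (0.87, "a3")],
    [(0.95, "b1"), (0.12, "b2")],
    [(0.60, "c1"), (0.93, "c2"), (0.33, "c3")],
]
print(global_top_k(shards, 3))  # [(0.95, 'b1'), (0.93, 'c2'), (0.91, 'a1')]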

This aggregation is very fast for the top 10 results, but it gets slower when we want to see the 100-th or the 1000-th page of results, because each child server would have to calculate all 1000 pages of results locally and the master server would then have to aggregate all of these results and again calculate the 1000-th page. The time complexity of this problem, if we use a tree structure for our server cluster, is:

  • O(log_d(S)) – to distribute the query
  • O(N_L * log(i)) – to calculate the local results (where N_L = N_G / S)
  • O(log_d(S) * (i * log(d) + i * d)) – to aggregate the results (i * d is the time complexity of the communication between a parent and its child servers)

  • S – number of servers
  • d – number of child servers of each parent server
  • N_G – global number of documents
  • N_L – local number of documents on each server, N_L = N_G / S
  • i – page index (we assume that i < N_L)

If we assume that the infrastructure parameter d (the number of child servers of each parent server) is a constant, we get a time complexity of O(log(S) + (N_G / S) * log(i) + log(S) * i). If we further assume that the global number of documents and the server count are fixed, this reduces to O(log(i) + i) = O(i). This means that the querying time grows linearly with the page index. The main reason it does not grow logarithmically is the communication between servers and the limits it puts on aggregation operations; these limits are usually the main constraints of distributed systems. That’s why Google doesn’t want you to see the 100-th or 1000-th page of the results.

Conclusion

Elasticsearch is a very fast solution that returns query results in under a second even with millions of documents. We encourage you to try it out and make your own search engine with it – see Elasticsearch: Getting started. Before installing, you should also check out their support matrix.

Search engine 102 – part I

For one of our projects we had the task of building a fast search engine to search through a database of text articles, where each article was tagged with zero, one or more tags (tag examples: politics, sports, finances, business, …). In the previous post, titled “Making your own search engine 101”, we discussed the theory behind it, but in this one we want to share what we learned through the actual implementation of the search engine.

Scoring the articles

The search engine should be able to find the most relevant articles for a given query. The general idea of how to score articles so they can be sorted by relevancy is best explained with an example. Suppose we search the articles with a three-word query: “foo bar baz”. We are looking for the following order of results:

[Image: example scoring and ordering of the results for the query “foo bar baz”]

Failed attempts

The first technology we tried to solve this problem with was MySQL. We tried to create a good proximity index that would allow fast queries. A proximity index basically stores the positions of each word within an article. We had to modify it a little so that we could write an SQL query that would return the correct results in the lowest possible number of queries and with as little post-processing as possible. We decided to store each pair of words that had no more than two words in-between (represented by a variable MAX_GAP = 2). The index consists of 7 columns. If we had the text “aaa bbb aaa foo ccc bbb bar ddd aaa” and MAX_GAP = 2, we would index it like this:

[Image: the resulting proximity_index table]
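
Since the original table image is no longer available, here is a small Python sketch of how the rows for this index could be generated. The column names follow the SQL query below; the real index had 7 columns (our simplified rows leave out things like the row id and the word positions), so treat this as an illustration, not the exact original schema:

MAX_GAP = 2

def proximity_rows(article_id, text, keyword_ids):
    # Store one row for every pair of words with at most MAX_GAP
    # words in-between them.
    words = text.split()
    rows = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + MAX_GAP + 2, len(words))):
            rows.append({
                "article_id": article_id,
                "keyword_id": keyword_ids[words[i]],
                "neighbour_keyword_id": keyword_ids[words[j]],
                "gap": j - i - 1,
            })
    return rows

keyword_ids = {"aaa": 1, "bbb": 2, "foo": 3, "ccc": 4, "bar": 5, "ddd": 6}
rows = proximity_rows(1, "aaa bbb aaa foo ccc bbb bar ddd aaa", keyword_ids)
print(len(rows))  # 21 rows, matching the row-count formula given below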

If we now queried this index with the two-word query “foo bar”, the SQL query would look like this:

SELECT DISTINCT article_id FROM proximity_index
WHERE ((keyword_id=3 AND neighbour_keyword_id=5)
OR (keyword_id=5 AND neighbour_keyword_id=3))
AND gap <= 1;

This query would match the row with id = 12, and the result would be article_id = 1. In general, the result is the list of ids of the articles that contain the words “foo” and “bar” with no more than 1 word in-between them.

The query gets more complex and also requires some post-processing if we want to search with a query of more than two words. But we won’t go into the details of this solution because, after all, it was a failed attempt.

The problem with this solution was that the index grew too big to be stored in RAM. It would have millions of rows for just thousands of articles. The exact number of rows used for each article is max(0, WC − (MAX_GAP + 1)) * (MAX_GAP + 1) + T_{min(WC, MAX_GAP + 1) − 1}, where WC is the number of words in the article and T_n is the n-th triangular number (T_n = n * (n + 1) / 2). In terms of Big O notation this is O(WC * MAX_GAP) rows per article. Because of all these rows, the queries ran slow, as MySQL had to read the index from files on disk.

We gave up on this MySQL solution because queries took too long to complete. We were already thinking about writing a custom search engine to replace the MySQL database. That’s when we found a much better solution that was already doing exactly what we wanted to do. It is called Elasticsearch and it is open source. It turned out to be the best and fastest solution. Find out more about it in our next post, Search engine 102 – part II.

How does it work?

Have you ever surfed the internet, seen a website or an app that you liked and asked yourself – “How did they do it?” In this post we will try to make the process of creating a website as easy as possible to understand, even if you do not know anything about programming. As you can see below, the process consists of 9 phases. That is how we work. But it starts with you and your idea.

[Image: infographic of the 9 phases]

Idea

Do you have an idea for a web solution, but no clue how to realise it? We are the ones to talk to. The more you tell us about your idea the better we can understand it and bring it to life. We will ask you many questions to ensure that we are all on the same page.

Brainstorming

Brainstorming means putting our heads together to come up with the best way to realise your idea. We will also think about other useful features your website could have and come up with additional suggestions.

Specification

We will write down what we agreed on and create a detailed specification. This will help us with development, as we will know exactly what we need to do. It will also ensure that we can plan our development correctly, select the right platforms and deliver the website on time.

Wireframe

Creating a wireframe is like creating a blueprint for a house. Wireframes help everyone understand where the elements will be positioned on the website. This also makes it easier for you to influence the final outcome.

Design

Once you agree with the wireframe, we need to make it come alive. We add images, fonts and colours to the wireframe to create the final look of the website. Of course we follow the latest design trends and make sure it reflects your corporate identity. Or not. As you like it.

Development

It is now time for coding. Our developers will use the most suitable platform and add functionalities to your website such as contact forms, image galleries, animations and more.

Testing

To ensure that the website runs well on all devices and different screen sizes, we will test your website on as many variations as possible. We will also test all the functionalities to ensure a great user experience.

Launch

Time to launch the website and make it public. You can provide us with the details of your current hosting service, or we can help you arrange a hosting plan that fits you best.

Maintenance

If you want your site to keep working continuously and smoothly, you should not forget that this does not just happen automatically. Especially with more complex websites, maintenance is necessary. We will sign a maintenance agreement to provide the updates and fixes needed.

Next time you see a website you like, you will know what the process for creating one looks like and maybe you will get some additional ideas for your own. If so, you know where to find us 😉

Starting your career in Web Development & Design

Doing things right, especially at the beginning of your career, saves you a lot of time and trouble. Web development and design are among the most in-demand professions at the moment, and the most sought-after individuals are those with a wide range of education and technical experience. As expectations keep getting higher, you have to make sure your knowledge does not fall behind. These are the skills that will help you build and improve your career:

1. Basic programming skills: JavaScript, CSS, HTML

There are some webpage-building services that let you create a new website without any programming skills. But be aware that these are just tools that help you create a website. When you face a real problem, programming knowledge will be much appreciated.

2. User experience knowledge

You only have one chance to attract a potential customer. Since nowadays most businesses have their own website, user experience is what differentiates you from the rest. You have only a few seconds to attract people’s attention; they decide very quickly whether to stay on your website or click the back button. That is why web designers are focusing more on the user experience and a bit less on the design. In the customer’s decision-making process it is more important that they find the right information quickly and easily than what your website looks like.

3. Copy-writing skills

The fact is that every extra skill you have is a great advantage. Customers are looking for holistic services. Writing is probably not one of your preferences, but improving it will give you extra value as a web designer or developer. Learning about content writing will boost the ideas you can use while designing or developing a new website.

4. Customer service skills

Take good care of your customers. As a company or a freelancer you have to build trust. It is not just about a great web solution, but about an unforgettable experience which improves your credibility and, consequently, customer loyalty. With an end-to-end focused customer experience you will differentiate your brand and stand out from the competition. Empathy is the key!

5. Stay up to date

It is pretty obvious that it’s always good to be up to date, especially as a web designer, where trends change rapidly. To stay informed about events, keep up with work-related tools and gain a great source of inspiration, follow the right people – the niche experts who constantly create and change the design landscape. When you study the greats, you acquire their skills.

Sounds pretty obvious? Believe it or not, many focus only on one or two of these points and that is far from being enough…

Making your own search engine 101

Have you ever been given the task of making your own search engine, or are you interested in knowing how one is made? On one of our projects we had the task of building a fast search engine that would search through a database of articles and gather only those articles that correspond to the given search query. The search query in this case was either a list of words or a list of phrases.

When creating our search engine we stumbled upon three interesting problems:

  • speed of the search engine, which was the most crucial problem
  • stemming and lemmatization
  • phrase search

Speed of the search engine

The most naive method that comes to mind when thinking about text search engines is a full text search. That means we scan each text character by character to check if it contains the word we are looking for. This method is very slow. A little less naive way would be to use one of the good string search algorithms, such as Knuth–Morris–Pratt, Boyer–Moore or Rabin–Karp.

A better solution

One of the most common methods used in machine learning is Tf–Idf (short for term frequency–inverse document frequency). It basically means that for each word we store a list of the articles/texts that contain it. For instance, as we can see in the picture below, the word cat is contained in 4 articles – articles nr. 5, 8, 10 and 12. Article nr. 5 contains the word cat 2 times, article nr. 8 contains it only once, etc. If you search for the word cat, article nr. 10 will be your first search result, as it contains the word cat 8 times. The same goes for the word dog.

[Image: inverted index for the words “cat” and “dog”]

It is like pre-calculating all the search results for each possible word. That means we have to do some more work when inserting new texts, but it pays off when we are searching through them. For the details about this method you can click here or watch the second video in the tutorial series down below.
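
To make the idea concrete, here is a minimal Python sketch of such an index. It only covers the term-frequency part (no inverse document frequency), and the article ids and counts come from the cat example above (the count for article nr. 12 isn’t given in the text, so it is left out):

from collections import Counter, defaultdict

# For every word, store how many times it appears in each article.
index = defaultdict(Counter)

def add_article(article_id, text):
    for word in text.lower().split():
        index[word][article_id] += 1

def search(word):
    # Articles containing the word, best match (highest count) first.
    return index[word].most_common()

add_article(5, "cat cat")
add_article(8, "cat")
add_article(10, " ".join(["cat"] * 8))
print(search("cat"))  # [(10, 8), (5, 2), (8, 1)]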

Word stemming and lemmatization

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, these two processes are not the same. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. You might want to read more about it here.

NLTK (Natural Language Toolkit) is a great Python library that does a very good job at word stemming and lemmatization. You can download it from here.
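
Here is a quick sketch of both tools in NLTK; the lemmatizer also needs the WordNet data, which you download once with nltk.download("wordnet"):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # studi (crude suffix chopping)
print(lemmatizer.lemmatize("studies"))          # study (dictionary form)
print(lemmatizer.lemmatize("better", pos="a"))  # good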

Phrase search

A phrase is a combination of words which has a different meaning when the words are used together in the right order than when they are used separately.

A nice example of a phrase that could cause problems for someone creating a text search engine is Chicago Bulls. Even Google was not perfect at the beginning. Nowadays Google knows that Chicago Bulls is a basketball team, but in its early days Google interpreted Chicago Bulls as two separate words: Chicago (the city in the United States of America) and Bulls (the farm animal). Instead of returning results about the basketball team, it returned results about cities and bulls.

»Chicago Bulls« in the old times

[Image: search results mixing the city of Chicago and the farm animal]

»Chicago Bulls« nowadays

[Image: Google search results for the Chicago Bulls basketball team]

As already mentioned, rather than doing a full text search through all the texts, we should use the Tf–Idf index. The problem we encounter with this method is that we lose the information about the positions of words in the document, which means we can’t search for phrases like Chicago Bulls. We can select all the texts that contain both words, Chicago and Bulls, but we cannot guarantee that we will only get the texts that talk about the Chicago Bulls basketball team. That means we have to improve the Tf–Idf method if we want to be able to search for phrases.

One way to improve the Tf–Idf index is to turn it into a positional or proximity index. That means we embed the position information into our inverted index: instead of storing just the list of texts that contain a given word, we also store a list of (text_id, position) pairs for that word. For the details about this method you can watch the fourth video in the tutorial series down below.
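
Here is a minimal Python sketch of this idea; the helper names and the two toy articles are made up for illustration:

from collections import defaultdict

# For every word, store (article_id, position) pairs.
index = defaultdict(list)

def add_article(article_id, text):
    for position, word in enumerate(text.lower().split()):
        index[word].append((article_id, position))

def phrase_search(phrase):
    words = phrase.lower().split()
    candidates = set(index[words[0]])
    # Keep only the starting positions where every following word of
    # the phrase appears at the next position in the same article.
    for offset, word in enumerate(words[1:], start=1):
        candidates &= {(a, p - offset) for a, p in index[word]}
    return sorted({article_id for article_id, _ in candidates})

add_article(1, "the Chicago Bulls won again")
add_article(2, "bulls escaped a farm near Chicago")
print(phrase_search("Chicago Bulls"))  # [1]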

YouTube video series about inverted indexing

If you are interested in the details and algorithms behind a search engine this is a very good tutorial series we found for you by Victor Lavrenko.

5 important tips to help you start out with web development

Starting out with development on a new platform is always a bit problematic. There are new programming paradigms and even technologies you haven’t seen or used before.

In this post we’ll describe five tips which will help you start out with web development and build the perfect web application. The tips are listed in no particular order, with the exception of the first one: read the documentation, it will massively help you out!

1. Read the documentation

If you’re just starting out, you’re probably following tutorials and copying examples from Stack Overflow or other sister sites. While this is a quick way to get you started, many tutorials and examples either contain errors, or are simply too general or outdated.

As soon as you install a new package or try out a new module, we highly recommend that you take your time to read through its documentation (or at least some detailed examples in the package’s repository). This way you may find a better solution for your problem.

Remember, documentation is your friend!

2. Plan ahead

When starting a new project, we’re always eager to get a basic prototype working as soon as possible. While this might be fine for very simple projects, you can quickly get burned with larger ones.

Instead of quickly writing code, it’s better to take a deep breath and think ahead. What is the goal of your project? Which frameworks and plugins do you plan to use? Are there any other alternatives that might be better suited for this project? Are there any major drawbacks?

Remember to plan ahead. There could be many hidden pitfalls and issues you haven’t thought about that could pop up late in the development cycle. Try to write down your thoughts. A flowchart could help as well.

3. Optimize your database

If you’re a fairly new web developer, you’ve probably only set up a database with simple migrations.

While this should be fine for smaller websites and applications, you could run into some performance issues when introducing your project to a wider audience or larger amounts of data. We recommend reading more about indices, procedures, views and query optimization.

But that doesn’t mean that you need to delve into the depths of SQL! For every programming language or web framework we’ve come across, there are many powerful migration tools, which allow you to configure the database in detail.

4. Check for possible errors

When processing data with various plugins, there are always ways to catch errors … For every functionality you write, make sure to check for possible errors and properly handle them.

Some environments, like Node.js, take this very seriously: by convention, every response or callback has a possible error as its first parameter. In other cases you can, if supported, simply catch the error event. This way you will never forget to handle an error.
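
To give a concrete illustration in Python (the language we used elsewhere in these posts), explicit error handling around incoming data can look like this; the payload format and field name are made up:

import json

def parse_payload(raw):
    # Handle malformed input instead of letting it crash the request.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as err:
        return {"error": "invalid JSON: %s" % err}
    if "article_id" not in payload:
        return {"error": "missing field: article_id"}
    return payload

print(parse_payload('{"article_id": 7}'))  # {'article_id': 7}
print(parse_payload("not json"))           # {'error': 'invalid JSON: ...'}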

5. Do not trust the user input!

Whenever you expect to get some input from the user, you must never, NEVER, trust the data. There are no excuses not to validate the input.

Omitting validation can lead to various errors, from incorrectly formatted data, which can cause many issues on the front-end and back-end, to more serious problems like SQL injections, which could expose your database structure, insert rows or even delete your whole database.

The most fool-proof solution is to perform input validation on both the front-end and the back-end. This way you can immediately let the user know that they have entered incorrect data, and then double-check it on the back-end to prevent any malicious behaviour.
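
As a back-end sketch of this idea in Python (sqlite3 is used only for brevity, and the articles table and tag_id field are made up for illustration), validate first and keep user input out of the SQL string with a parameterized query:

import sqlite3

def find_articles_by_tag(conn, tag_id_raw):
    # Validate: reject anything that is not an integer id.
    try:
        tag_id = int(tag_id_raw)
    except (TypeError, ValueError):
        raise ValueError("tag id must be an integer")
    # The ? placeholder keeps user input out of the SQL string, so
    # input like "15; DROP TABLE articles" can never be executed.
    cursor = conn.execute(
        "SELECT id, title FROM articles WHERE tag_id = ?", (tag_id,)
    )
    return cursor.fetchall()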

Wrap up

If you follow these tips, your development process will be a lot smoother. Maybe you’ll even find a kick-ass new feature, which allows you to do something you thought was impossible 😉