
Search engine optimization (How search engines work)

How do search engines work?

Are you someone who runs an online business, blog, or website and wants to grow it? Knowing how search engines work is a vital first step in learning SEO. Understand how search engines work, then take the necessary steps to grow your business.

By the end of this part on how search engines work, you will be able to:

  • Describe major search engines
  • Explain how search engines work
  • Define crawling, storing, processing and indexing, and ranking
  • Explain sitemaps and robots.txt

Major search engines and how search engines work

We all know what search engines are but which are the major search engines?

It doesn’t need to be said that the number 1 search engine in the world is Google.

Based in Mountain View, California, it has the largest reach, the most searches, and support for the most languages.

After Google, market share gets a bit tricky: technically YouTube is a search engine, and it is owned by Google.

It gets more searches and activity than the rest of the worldwide search engines combined.

Other search engines are Bing, which is owned by Microsoft, and Yahoo.

Google burst onto the scene after Yahoo and, because of its PageRank algorithm, its results were better and faster than any other search engine on the market.

This contributed to Google's quick rise to popularity. PageRank attributed quality to a page by analyzing the complex structure of links to that document or page in order to determine its influence.

Other search engines have implemented similar ranking technology, but each has its own flavor, mixing links, on-page relevance, and other factors to determine its rankings.
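To make the original idea behind PageRank a bit more concrete, here is a minimal sketch of the classic PageRank iteration in Python. The damping factor, iteration count, toy link graph, and function name are assumptions made for illustration only; Google's actual implementation is far more sophisticated.

```python
# Minimal PageRank sketch (illustrative only, not Google's implementation).
# Each page's rank is spread across the pages it links to, then damped.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:  # a dangling page shares its rank with every page
                share = rank[page] / len(pages)
                for target in pages:
                    new_rank[target] += damping * share
                continue
            share = rank[page] / len(outlinks)
            for target in outlinks:  # every target must also be a key in links
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Toy link graph: three pages linking to one another.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```

Each page starts with an equal share of rank and then repeatedly passes its rank along its outbound links, which is why pages with many influential inbound links end up with higher scores.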

Minor search engines

There are many search engines worldwide.

Some of the minor search engines you may want to be aware of are:

Ask, which has a long history, even longer than Google's, and AOL Search.

DuckDuckGo which right now is one of the fastest-growing search engines.

It does not keep your search history or use it for advertising, countering many of the privacy concerns people have about Google.

Wolfram Alpha is a specialized search engine for mathematical and scientific computations.

If you have never used it, I recommend checking it out sometime, it is a great resource.

Other minor search engines are WebCrawler, Search, and Excite.

Of course, they do not have anywhere near the market presence of the top three.

Now, internationally there are many search engines that have been developed to serve a specific country, language, or region.

International Search Engines.

Three of the most important to know are Yandex, which holds about 50% of the Russian market; Baidu, which holds more than 60% of the Chinese market; and Naver, which dominates South Korea with 74% of the market share.

Now, let's look specifically at Yandex.

International search engine: Yandex

More than just a search engine, Yandex also provides a browser, email, news, maps, paid ads, and translation services.

Yandex serves not only the Russian market but also Belarus, Kazakhstan, Ukraine, and Turkey.

It’s estimated there are about 55.2 million users in Russia.

Comparing Yandex and Google

  • Algorithm: Yandex uses MatrixNet; Google uses PageRank
  • Geo-targeting: Yandex favors metro areas; Google favors national results
  • SEO emphasis: Yandex weighs on-page factors; Google weighs inbound links
  • User behavior: a major factor for Yandex; a minor factor for Google
  • Inbound links: an increasing factor for Yandex; a major factor for Google
  • Spidering: Yandex relies on submitted sitemaps; Google spiders aggressively
  • Meta-keyword influence: yes, minor for Yandex; none for Google

One of the things that has enabled Yandex to compete head-to-head with Google and increase market share is the political backing of the Russian government.

In May 2017, the Russian anti-monopoly service ruled that Google's default Android operating system was too restrictive and not in the best interest of the consumer.

They forced Google to develop a widget and a new Android operating system that made it easy for users to change their default search engine.

From that action, Yandex went from a 29% to a 44% market share.

International search engine: Baidu

Baidu is the number one search engine in China, and it is the fifth most popular site in the world.

Baidu has over 665 million monthly active users, and over 148 million daily app users.

It has over 76% of China's search market and a 44.5% share of China's mobile search market.

There are a few noticeable differences between Google and Baidu.

First, your choice of images on the page is important, as Baidu presents results with thumbnails.

Now, similar to Google, Baidu also offers rich snippets but with additional website content and information.

Beyond this Baidu also enables brands to develop widgets that can offer interactive searches, additional content, and exploration of that brand’s website, all within the search results.

Now, because of the massive number of websites and users in China, Baidu improves the relevancy and experience for its searchers by offering a domain credibility system.

Website owners have to apply, pay an application fee, and be verified in order to have this certification.

For best results in developing a website to rank in Baidu, the website should be written in Mandarin and hosted in China.

No other language is supported by Baidu.

The algorithm of Baidu is very comparable to Google's.

Typically, when Google implements a major algorithm change or update, Baidu follows with a similar update one or two years later.

Baidu is heavily biased toward Chinese websites and the Mandarin language.

Foreign sites not hosted in China will not rank consistently or reliably.

Because of this, backlinks from other Chinese websites count for more in the algorithm than backlinks from foreign websites.

Baidu also credits websites that have been around longer.

There is more weight given to a domain’s age and ownership, making it a little difficult for newer sites to compete.

Also, Baidu was handling mobile demand much earlier than Google.

The Chinese mobile market hit much faster and earlier than western markets.

Even now, Baidu averages 70 million daily active mobile search users and 77% of their revenue comes from mobile.

Like other search engines, they offer a Webmaster tool to see stats on usage and ranking data, all with a heavy focus on mobile.

International search engine: Naver

Naver was built around the Korean language and culture meaning that US-based sites should develop a version of their websites catering to this culture rather than just translating their text.

Naver also draws heavily from user-generated content in their knowledge base as well as from the social community, which is reflected in the search results.

The distinctive difference in Naver is the search results page.

Because the search results are limited to South Korea and the Korean language, the number of available websites to pull from is limited.

So, Naver puts more into the display of the content and the different types of content.

The results page is broken into three primary areas of results.

First, the Q&A knowledge base which comes from user-generated content.

Second, a database search from trusted research, sources, and specific databases.

Finally, the web search which are the organic search results.

Each of these areas has rich snippet content and also presents a thumbnail from the page.

Based on this, Naver is highly content-focused, emphasizing quality content and eliminating low-quality content.

Naver uses user behavior and interaction from the search results to measure the quality of the page.

This quality is also found in the treatment of the Korean language.

Any sites using poor translation or non-native Korean will not perform well.

International search engine: Qwant

Finally, a newer international search engine has jumped into prominence recently in order to challenge Google. France and Google have had a very antagonistic relationship over the years.

In 2018, the French military and many government offices officially moved away from Google.

The default search engine in the French military and many areas of the government is now Qwant, a French and German owned search engine that does not track or record your search history.

Qwant, being a newer search engine, is focused on newer technology to develop its algorithm. Qwant favours websites built with HTML5 and standards-based programming.

There is also a large weight on social signals, much more than Google.

Also, while Google is focusing on broadening the meaning of keywords in context, Qwant has gone back to looking at exact expressions in keywords and key phrases rather than deriving meaning from context or synonyms.

Mechanism: How search engines work

Now let’s look at how search engines work.

The first thing to understand about search engines is that when you perform a search, the results that you see are not live results.

They are results from the search engines’ database of websites.

That’s right! The search engines make copies of all the websites they find and the algorithm is run against the database, not the live internet.

What this means is that, as an optimizer, you need to ensure that the search engine is first able to find your website, then download it, assess it properly, and finally present it accurately in the search results.

In order for search engines and SEOs to work together, SEOs need to have an understanding of how this process of search engine indexing works.

So, what we are going to do here is simplify the process into four categories:

  • Crawling
  • Storing
  • Processing and indexing
  • Ranking

When you understand these, you have a more complete understanding of the search engine process and how search engines work.

Crawling: Search engine (SEO for Google)

What is a crawler?

At the root of every search engine is software.

In this phase, we are just going to talk about one type of software.

It’s something we call crawlers or bots or robots or spiders.

Now, this is a software that does one thing but it does that thing very very well.

The crawler or bot goes to a specific website and downloads all of its content back onto the servers of the search engine.

It then follows every single link on that document.

It goes to the corresponding page, downloads all of that content, and then repeats the process by following all of those links, going to those pages, and downloading that content.

In theory, by doing this, it crawls the entire internet.

Now, sit back and think about how big the internet is and how quickly it's growing; still, that's the basic idea of what a crawler or spider does.

It actually has a whole lot of traps that it has to avoid.

One of these traps is called an infinite loop.

What will happen is, a crawler goes onto something like an online calendar application and clicks to the next year.

Then it clicks the link for the next year, and the next, on and on, and it gets stuck because those links go on forever.

It would keep counting up into infinity.

So, the crawler itself needs to know this and be smart enough not to get stuck in the infinite loops that are everywhere throughout the internet.
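To give a rough feel for what a crawler does, here is a minimal sketch in Python using only the standard library. The starting URL, the page limit, and the same-domain restriction are assumptions made for illustration; a real search engine crawler is vastly more sophisticated and, among other things, respects robots.txt, which is covered below.

```python
# Minimal crawler sketch (illustrative; real crawlers are far more sophisticated).
# A visited set and a page limit keep it out of infinite loops such as endless calendars.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=20):
    domain = urlparse(start_url).netloc
    queue, visited, store = deque([start_url]), set(), {}
    while queue and len(store) < max_pages:      # page limit guards against infinite loops
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        store[url] = html                         # "storing": keep a copy of the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:                 # follow every link on the document
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return store

# Example usage with a hypothetical site:
# pages = crawl("https://example.com/")
```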

So, what does this have to do with an SEO, exactly?

Well, SEOs do have some influence on this.

It's important for us to understand this and to make sure that crawlers do not go to protected files, like something that should be password protected, for example.

If you have user information stored on your website, then you want it protected and not open to the search engines.

So, luckily for us, most of the time when search engines come to our website, they will identify themselves.

They will use a user agent like Googlebot or Bingbot, so that we know this is not a normal user, it's a search engine.
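As a small illustration, a site could do a rough first-pass check of the user-agent header to spot likely crawler visits. The bot names below are just common examples I am assuming for this sketch, and because the header is easy to fake (as the next paragraphs note), serious verification should also confirm the requesting IP, for example with a reverse DNS lookup.

```python
# Rough user-agent check (a header alone is easy to fake; real verification
# should also confirm the requesting IP, e.g. via reverse DNS).

KNOWN_CRAWLERS = ("googlebot", "bingbot", "yandexbot", "baiduspider", "duckduckbot")

def looks_like_search_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(bot in ua for bot in KNOWN_CRAWLERS)

print(looks_like_search_bot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(looks_like_search_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))         # False
```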

Now unfortunately, this has been abused and there are many bots online.

Most of them don’t belong to search engines.

They belong to organizations that are looking for exploits in the management of your website.

They are testing password combinations and more in order to gain access to your website.

You can see the search engines' activity through the webmaster tools that are offered by most search engines.

In these tools, you can see how often Googlebot, Google's crawler, requests pages on your website, how much content is downloaded, the speed of the downloads, and any problems that arise from the spidering, such as pages not being available, slow download speeds, or website errors.

Robots.txt: Search engine crawling (SEO for Google)

A robots.txt file provides search engines with the necessary information to properly crawl and index a website.

It is a protocol for instructing search engines to exclude certain content while indexing.

Now, this is important.

The robots.txt file is not intended for security; secure files, or anything that you do not want published online, must be behind password protection.

The robots.txt protocol is typically used to block access to development sites or to non-search content like scripting files or redundant or duplicate information.

Remember, this is only a protocol.

Not all bots will follow the protocol.

So, do not use this for any type of security at all.

Now, the robots.txt file can be tricky, and I have seen companies all over the world forget about this little file; when they forget about it, they end up blocking the search engines' access.

That will cause their site to disappear from the results, and they won't remember why, all because someone made a mistake with this little file.

Now, if you want search engines to spider your entire website, the format is this:

User-agent: *

Disallow:

You will see that nothing follows the Disallow directive.

That means that any user agent is able to access anything on the website.

Now, suppose you want to allow access except for a directory that has duplicate files, such as printer-friendly documents that are the same as the web pages.

You don't want two versions of the same page in the search engines.

So you add that directory to your Disallow directive like this:

User-agent: *

Disallow: /printerfriendly/

This disallows the directory of printer-friendly documents, so it won't be crawled or included in the search results.

Now if you have a site in development and you don’t want it published yet, you can disallow the entire website like this:

User-agent: *

Disallow: /

Now, when you add that forward slash, it disallows the entire website from being indexed at the root level.

Nothing is allowed to be spidered or accessed by search engines.

This is where a lot of companies run into trouble, as they forget to remove or change this rule once the site goes live.

So, once they go live, if they forget about it, the new website doesn’t get indexed.

Therefore, if you use this as a method of disallowing your development site from being indexed, make sure you add it to your list of go-live steps.

Because so much can go wrong with an improperly formatted robots.txt file, Google's webmaster tools provide a robots.txt testing tool.

Simply use the tester to ensure that your formatting is correct and that you are allowing access to the parts of the site that you want crawled.
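If you prefer to check the rules programmatically, Python's standard-library robots.txt parser can tell you whether a given user agent is allowed to fetch a URL. The site and paths in this sketch are hypothetical.

```python
# Check robots.txt rules programmatically with Python's standard library.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")   # hypothetical site
robots.read()                                       # downloads and parses the file

# Would a generic crawler ("*") be allowed to fetch these paths?
print(robots.can_fetch("*", "https://example.com/about/"))
print(robots.can_fetch("*", "https://example.com/printerfriendly/report.html"))
```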

Sitemaps

What is a sitemap?

Now, a sitemap is a representational model of a website's content structure.

There are two types of sitemaps:

One is primarily for users and the other is for search engines.

Now, for users it lists the major hierarchy of the website and important pages and is designed for easy visual navigation.

Now the second sitemap for search engines is an XML sitemap.

XML is the language used and it is mainly for search engines, not so much for people.

It lists all of the pages of the website in a single document providing a map for search engine spiders to find every page.

Now, you can generate the XML sitemap through WordPress with many different plugins.

It's essentially a long list of every page, formatted for search engines, and it includes information about the images, the files, and the date that each page was last updated.
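As a simple illustration of what that document looks like, here is a sketch that writes a minimal XML sitemap in the standard sitemaps.org format; the URLs and dates are placeholders, and real sitemaps are usually generated by your CMS or a plugin.

```python
# Minimal XML sitemap generator (sitemaps.org format); URLs and dates are placeholders.
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """pages is a list of (url, last_modified_date) tuples."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, lastmod in pages:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <lastmod>{lastmod}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

print(build_sitemap([("https://example.com/", "2023-01-15"),
                     ("https://example.com/blog/seo-basics/", "2023-02-01")]))
```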

While it makes the job easier for search engines, it’s not always critical.

Search engines, especially Google, prefer to find your website through natural crawling and indexing.

The natural indexing enables the context of the website to become clear through linking, citations, and surrounding pages.

These sitemaps are most helpful for websites with thousands of pages and content that needs to be managed closely.

Storing

The next phase is what we call storing.

So, after the crawlers have done their work, the data they have downloaded needs to be stored on the search engines' servers.

Now, these servers are located all over the world.

The general idea here is that it's stored on the servers and they have made a copy of your website.

You can access this copy.

It’s called the cache.

It is important for SEOs to understand that search engines are not working on your live site; they are only working on the copy they have on their own servers.

If you have problems with search engines not downloading your content properly, viewing the cache will help you see what the search engine sees, though there will be small differences, such as personalization.

Search engines crawl from a fixed set of locations.

So, the copy is not going to be personalized or regionalized to every single location.

Processing and Indexing

The third phase is by far the most complex, and it is one of the keys to understanding how search engines work.

It’s the processing and indexing phase.

This itself happens in many different phases, and it happens in many different data centers around the world.

Quite frankly, this is where all the magic happens.

This is where the billion-dollar algorithms go to work and try to extract relevancy signals that can be used to figure out the most relevant and trustworthy information on the internet.

The important thing to understand at this point is that this process is mostly hidden from marketers and the general public.

It’s hidden in the vault of the search engines and their computer scientists.

Now, in order for the search engine to determine relevance and trustworthiness, the pages stored in the index are subjected to numerous analyses. Hundreds of factors are analyzed.

The major factors that we know are analyzed for each page and website are:

  • Outbound links: Links to other documents
  • Inbound links: Links from other documents
  • Content on the page and the website
  • Structure of the content
  • The structure of the programming or website
  • The date that page was last updated
  • Other trust factors
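Before looking at the results themselves, here is a toy illustration of one small, heavily simplified piece of indexing in Python: an inverted index maps each word to the pages containing it, so a query can be answered by set intersection rather than rescanning every document. This is a classroom-level sketch, not how any real search engine's index or the factors above are actually implemented.

```python
# Toy inverted index (classroom-level sketch, not a real engine's index).
# It maps each word to the set of pages containing it, so a query can be
# answered by set intersection instead of rescanning every document.

from collections import defaultdict

def build_index(pages):
    """pages maps a URL to its plain-text content."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages containing every word of the query."""
    results = None
    for word in query.lower().split():
        matches = index.get(word, set())
        results = matches if results is None else results & matches
    return results or set()

pages = {"https://example.com/a": "how search engines crawl the web",
         "https://example.com/b": "how to bake bread"}
index = build_index(pages)
print(search(index, "how crawl"))   # {'https://example.com/a'}
```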

We see the final results of the processing as the algorithm is applied within hundredths of a second to produce relevant results, not only matching the search but also attempting to figure out what is meant by the search.

Also, based on what you are searching for, the results will change.

Local searches will present local results, business listings, and maps.

Commercial searches will show products from different retailers.

Entertainment searches will provide local theatre times, events and ticket information.

Searching for how to do something will show videos.

For example, a search for camera reviews provides multiple reviews, additional questions, products for sale, and sources of information.

The processing of the search results is not just about showing relevant results but also about surfacing other content that you should find helpful.

Ranking

The last phase is the one you are most familiar with and it is the ranking.

So, at this point, a user has typed something into a search engine.

They get the results back, and the search engine needs to rank all the possible results it has from the internet down to what is most relevant for the phrase that was typed in.

Now, this is a very complex thing that happens incredibly fast.

So, when this happens, the query gets sent to the search engine servers, the information is already indexed, and the ranking algorithms have taken effect.

With additional signals, this happens incredibly fast.

But things like time, current events, personalization, and your past search history are all taken into account in milliseconds, and the results are shown so that you can find the closest Krispy Kreme donuts.

So, in order to be a successful SEO, it takes three parts.

First, developing a website that search engines can find, download and process.

This is the architecture of your website.

Second, developing links that reflect the popularity of your site with other websites and sources.

Finally, developing content that people want to read and share.

This provides relevance to the searches people make.

We have seen how the ranking mechanism works, but it is essential to understand the factors that a search engine generally looks for when ranking a website.

As an example, Google’s PageRank is a method of measuring the importance of pages.

This is gained primarily by measuring the influence of links from other websites.

Now, this is not simply based on the number of links but on the influence of those links, such as the popularity and influence of the website linking to your site, and then the sites linking to that site, and so on.

Other factors are the words: how well they match the search query by meaning or synonym; the content: how well it matches other similar queries; and the context, such as the location of the searcher and their search history.

There are also the types of results, such as images, videos, recipes, maps, or local businesses, and finally user behavior, also called searcher signals.

This measures the searcher's clicks from the results page to the websites, the time they spend on those pages, and whether they come back to the results and click on other listings.

Key takeaways

  • Google, Yahoo, and Bing are the three major search engines. Apart from these, Yandex, Naver, and Baidu are some other international search engines.
  • Search engines use spiders, or bots, to find and download pages and documents into their index of websites. The results that you see are not live results.
  • The search engine process includes crawling, storing, processing and indexing, and ranking.
  • A sitemap is a representational, hierarchical model of a website’s content structure.
