Search engines consist of two parts: the index (or database) and the crawler.
The index is a list of all the web pages a search engine allows you to search through. Often, the index also stores a saved copy of each page's information and text to speed up searches and improve results; this saved copy is called a cache. It is a low-fidelity snapshot of the page as it appeared when the crawler found it.
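The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine's design: the URLs and page texts are made up, and the index is a simple in-memory inverted index (word to URLs) paired with a cache dict holding each page's saved text.

```python
from collections import defaultdict

class SearchIndex:
    def __init__(self):
        self.cache = {}                   # URL -> saved copy of the page text
        self.inverted = defaultdict(set)  # word -> set of URLs containing it

    def add_page(self, url, text):
        self.cache[url] = text            # keep a snapshot of the page
        for word in text.lower().split():
            self.inverted[word].add(url)

    def search(self, word):
        # Matches come from the index; the cache supplies the saved text.
        return [(url, self.cache[url]) for url in self.inverted[word.lower()]]

index = SearchIndex()
index.add_page("http://example.com/a", "Search engines use crawlers")
index.add_page("http://example.com/b", "Crawlers build the index")
results = index.search("crawlers")
```

Serving results from the cache is why a search engine can show you a page's text even when the live site is slow or down.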
The crawler is arguably the most important part of a search engine. It is a small web-based program that reads details from a web page, such as its text and meta information, and adds them to the search engine's index. The crawler then follows any links it finds on the page to reach further pages to index, and the cycle continues until it reaches a page it has already indexed. Crawlers don't run like regular programs, where you click a button, the program does its work, and then stops; instead, they run through web pages continuously. To keep the cached pages from becoming stale, search engines periodically restart their crawlers with a blank index so the cache can be refreshed. But because crawling consumes so many computing resources, pages may sit in the cache for extended periods, sometimes two years or more.
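The crawl-follow-stop cycle described above can be sketched as a small traversal. To keep the example self-contained and runnable, an in-memory dict (a hypothetical three-page "web" with a link cycle) stands in for real HTTP fetches; a real crawler would download each URL instead of looking it up.

```python
# Hypothetical mini-web: URL -> (page text, outgoing links).
# Note the cycle: page b links back to the home page.
WEB = {
    "http://example.com/":  ("home page text", ["http://example.com/a"]),
    "http://example.com/a": ("page a text",    ["http://example.com/b"]),
    "http://example.com/b": ("page b text",    ["http://example.com/"]),
}

def crawl(start):
    index = {}          # URL -> cached page text
    frontier = [start]  # links waiting to be visited
    while frontier:
        url = frontier.pop()
        if url in index:        # already indexed: stop following this path
            continue
        text, links = WEB[url]  # a real crawler would fetch the URL here
        index[url] = text       # add the page's text to the index
        frontier.extend(links)  # follow every link found on the page
    return index

pages = crawl("http://example.com/")
```

The `if url in index` check is what keeps the crawler from looping forever on the link cycle, matching the stopping condition described above.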
SOURCE:
http://www.helium.com/items/257378-how-web-search-engines-work