How Search Engines Work: General Search Strategies
Dr. Dania Bilal, IS 587, SIS, Fall 2007

Fun Quiz
- Take the search engine quiz located at /library/quizzes/search_engine_quiz/blsearchenginequiz.htm
- Record the number of incorrect answers.
- Share the results of the quiz with a classmate.

How Search Engines Work
- They collect information from selected web sites.
- They employ special software robots, called spiders, to crawl web pages.
- Spiders build lists of the words found on web sites; while a spider is building its lists, it is web crawling.
- Spiders store the lists in the engine's database.
- The engine's indexing software builds an index of words.
- Information is matched against the query input and retrieved (the processing algorithm). A minimal crawl-and-index sketch follows the slides.

How Spiders and Crawlers Work
- They begin with popular and heavily used web servers.
- Starting from a popular site, they collect the words on its pages and follow every link found within the site.
- Spiders thereby travel across pages and the most widely used portions of the Web.

How Spiders and Crawlers Work (continued)
- A dedicated server of URLs is built by the search engine company (e.g., Google) so that spiders can collect information quickly.
- More than one spider is used to crawl web pages at a time.
- Google uses 3-4 spiders and collects over 100 pages per second (see the parallel-crawl sketch after the slides).

How Spiders and Crawlers Work (continued)
- When no dedicated URL server is used, the search engine company relies on an ISP for the domain names (translated into addresses) to use for crawling the web. This leads to:
  - delay in gathering information,
  - delay in updating information, and
  - lack of control over URL addresses.

Google Spider and How It Works
- The spider looks at the HTML, XML, or other coding used to build a web page and collects information from the meta tags.
- It indexes words within the actual text of a page.
- It records where the words were found (URL, title, headings, etc.), as in the page-indexing sketch after the slides.
- It disregards initial articles.
- It disregards pages that should not be crawled or indexed.

Google Spider and How It Works (continued)
- It uses the Robots Exclusion Protocol to decide which pages to disregard (see the robots.txt sketch after the slides).
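
The "How Search Engines Work" slide outlines a crawl, index, and match pipeline. The following is a minimal Python sketch of that pipeline; the breadth-first frontier, the tokenizer, and the `max_pages` cutoff are illustrative assumptions rather than the design of any real engine, and the seed URLs are whatever the caller passes in.

```python
# Minimal crawl -> index -> match sketch (illustrative only, not a real engine).
import re
import urllib.request
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextAndLinkExtractor(HTMLParser):
    """Collects the words and outgoing links found on one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def crawl_and_index(seed_urls, max_pages=10):
    """Visit pages breadth-first and build an inverted index: word -> set of URLs."""
    index = defaultdict(set)
    frontier = list(seed_urls)          # the queue of URLs the spider works through
    seen = set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                    # skip pages that cannot be fetched
        page = TextAndLinkExtractor()
        page.feed(html)
        for word in page.words:
            index[word].add(url)        # the word lists stored in the engine's database
        frontier.extend(urljoin(url, link) for link in page.links)  # follow every link
    return index


def search(index, query):
    """Match the query input against the index: pages containing every query word."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits) if hits else set()
```

For example, `search(crawl_and_index(["https://example.com/"]), "search engines")` would return the crawled pages containing both terms; a production engine would add ranking, politeness delays, and deduplication on top of this skeleton.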
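
The slides note that more than one spider crawls at a time (this 2007 material puts Google at 3-4 spiders collecting over 100 pages per second). One way to picture that is a small pool of concurrent workers; the sketch below uses Python threads, and `fetch_page` is a placeholder for the real download-and-parse step.

```python
# Several "spiders" fetching pages at once; a rough illustration, not Google's design.
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url):
    """Download one page; a real spider would also parse it and queue its links."""
    try:
        return url, urllib.request.urlopen(url, timeout=5).read()
    except OSError:
        return url, None                # unreachable pages are simply skipped


def crawl_in_parallel(urls, num_spiders=4):
    """Fetch many URLs concurrently with a small pool of worker 'spiders'."""
    with ThreadPoolExecutor(max_workers=num_spiders) as pool:
        return dict(pool.map(fetch_page, urls))
```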
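
The "Google Spider and How It Works" slide says the spider reads meta tags, records where each word was found (URL, title, headings, and so on), and disregards initial articles. A small per-page indexing sketch along those lines follows; the field names, the three-word article list, and the position counter are assumptions made for illustration, not Google's actual scheme.

```python
# Per-page indexing sketch: meta tags, word locations, and skipped initial articles.
import re
from collections import defaultdict
from html.parser import HTMLParser

INITIAL_ARTICLES = {"a", "an", "the"}        # "disregards initial articles"


class PageIndexer(HTMLParser):
    """Records each word together with the part of the page it came from."""

    def __init__(self):
        super().__init__()
        self.meta = {}                        # meta-tag name -> content
        self.postings = defaultdict(list)     # word -> [(field, position), ...]
        self._field = "body"
        self._pos = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]   # collect meta-tag info
        elif tag in ("title", "h1", "h2", "h3"):
            self._field = tag                 # remember where the following words appear

    def handle_endtag(self, tag):
        if tag in ("title", "h1", "h2", "h3"):
            self._field = "body"

    def handle_data(self, data):
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self._pos += 1
            if word in INITIAL_ARTICLES:
                continue                      # initial articles are not indexed
            self.postings[word].append((self._field, self._pos))
```

Feeding a page's HTML to `PageIndexer().feed(html)` records, for every remaining word, the field and position it appeared in, which is the kind of location information the slide describes.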
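
The last slide names the Robots Exclusion Protocol (robots.txt) as the mechanism for disregarding pages that should not be crawled. Python's standard library ships a parser for it; in the sketch below the user-agent string is a made-up example, and error handling for a site with no reachable robots.txt is omitted.

```python
# Checking robots.txt before fetching a page, via the standard-library parser.
from urllib import robotparser
from urllib.parse import urljoin, urlparse


def allowed_to_crawl(url, user_agent="ExampleSpider"):
    """Return True if the site's robots.txt permits this user agent to fetch the URL."""
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                                 # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)


# A polite spider calls this before indexing:
# if allowed_to_crawl("https://example.com/some/page.html"):
#     ...fetch and index the page...
```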
