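Graduation Project
A Clustering Algorithm for XML Document Retrieval Results
Abstract
The retrieval results returned by existing search engines, although ranked by relevance, still contain many documents that are unrelated to the user's query. To improve retrieval efficiency, the results need to be clustered. Extensible Markup Language (XML) is a format and standard for information representation and data exchange. It is self-describing and extensible, and in recent years it has been widely used in data exchange, Web services, content management, Web integration, and related fields.
This thesis analyzes in depth the state of research, in China and abroad, on clustering Web retrieval results and on clustering XML documents. Taking both XML technology and document clustering into account, it adopts a new modeling method for result documents (fragments): tag paths and element features represent the structural semantics of an XML document, keywords in the text represent its content, and each XML document (fragment) is therefore represented by three vectors: tag path, element feature, and text content. Similarity is computed with the traditional cosine measure, clusters are initialized according to the min-max principle, and the classic k-means algorithm is improved accordingly. Experiments show that the clustering quality is good and that the results are reasonably stable.
Keywords: XML; document retrieval; modeling; k-means clustering algorithm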
A Clustering Algorithm for XML Document Retrieval Results
Abstract:
The retrieval results returned by current search engines, although ranked by relevance, still contain many documents that are unrelated to the user's query. To improve retrieval efficiency, the results need to be clustered. Extensible Markup Language (XML) is a format and standard for information representation and data exchange, with features such as self-description and extensibility. In recent years it has been widely applied in data exchange, Web services, content management, Web integration, and related fields.
This thesis analyzes in depth the current state of research on clustering Web retrieval results and XML documents. Taking both XML technology and document clustering into account, it adopts a new method for modeling retrieved documents or fragments. Tag paths and element features express the structural semantics of an XML document, and keywords in the text represent its content. More precisely, every XML document or fragment is represented by three vectors: tag path, element feature, and text content. Similarity is computed with the traditional cosine measure, and clusters are initialized according to the min-max principle. On this basis, the classic k-means algorithm is improved. Experiments show that the clustering quality is good and that the results are reasonably stable.
Keywords:
XML; document retrieval; mode
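The approach summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three vectors are combined or how the first seed is chosen, so the code below rests on assumptions: each vector is a sparse dictionary, the overall similarity is an equally weighted sum of the three cosine values, the first document is the first seed, and "min-max" is read as choosing the document whose maximum similarity to the seeds already chosen is smallest. All function names are hypothetical and are not taken from the thesis.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]             # sparse vector: feature -> weight
Doc = Tuple[Vector, Vector, Vector]   # (tag paths, element features, text terms)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity(x: Doc, y: Doc, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three cosines (equal weights are an assumption)."""
    return sum(w * cosine(u, v) for w, u, v in zip(weights, x, y))


def init_centers(docs: List[Doc], k: int) -> List[int]:
    """Min-max seeding: start from document 0 (an assumption), then keep adding
    the document whose maximum similarity to the chosen seeds is smallest."""
    chosen = [0]
    while len(chosen) < k:
        candidates = [i for i in range(len(docs)) if i not in chosen]
        chosen.append(min(
            candidates,
            key=lambda i: max(similarity(docs[i], docs[c]) for c in chosen),
        ))
    return chosen


def mean_doc(members: List[Doc]) -> Doc:
    """Centroid of a cluster: the mean of each of the three vectors."""
    parts = []
    for p in range(3):
        acc: Vector = {}
        for d in members:
            for f, w in d[p].items():
                acc[f] = acc.get(f, 0.0) + w / len(members)
        parts.append(acc)
    return (parts[0], parts[1], parts[2])


def kmeans(docs: List[Doc], k: int, max_iter: int = 20) -> List[int]:
    """k-means with cosine-based similarity and min-max seeding."""
    centers = [docs[i] for i in init_centers(docs, k)]
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # Assign each document to its most similar center.
        new_labels = [
            max(range(k), key=lambda j: similarity(d, centers[j])) for d in docs
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each non-empty cluster's centroid.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centers[j] = mean_doc(members)
    return labels
```

Seeding with mutually dissimilar documents is one common way to make k-means less sensitive to its starting point, which is consistent with the stability the abstract reports. The weights in `similarity` could be tuned to favor structure or content.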