2007--Semantic hashing.pdf




International Journal of Approximate Reasoning 50 (2009) 969–978
journal homepage: www.elsevier.com/locate/ijar

Semantic hashing

Ruslan Salakhutdinov *, Geoffrey Hinton

Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario, Canada M5S 3G4

* Corresponding author. E-mail addresses: rsalakhu@ (R. Saa

Article info

Article history:
Received 11 January 2008
Received in revised form 15 November 2008
Accepted 19 November 2008
Available online 10 December 2008

Keywords:
Information retrieval
Graphical models
Unsupervised learning

0888-613X/$ - see front matter © 2008 Elsevier Inc
doi:10.1016/j.ijar.2008.11.006

Abstract

We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs "semantic hashing": documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality-sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set.

© 2008 Elsevier Inc. All rights reserved.

1. Introduction

One of the most popular and widely used algorithms for retrieving documents that are similar to a query document is TF-IDF [19,18] which measures the similarity between documents