Introduction to Machine Learning, Lecture 8: Deep Belief Nets

About 18,900 characters, about 52 pages. Posted 2020-07-13 (Hubei).

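The slides below describe learning with tied weights as equivalent to training a Restricted Boltzmann Machine (RBM) with contrastive divergence, then freezing that layer and training a second RBM on the first hidden layer's activities. As a rough illustration only, here is a minimal pure-Python sketch of that greedy procedure. The names `RBM`, `cd1_update` and `train_stack`, the learning rate, the weight initialisation, and the choice of a single Gibbs step (CD-1) are all assumptions made for this sketch and are not taken from the lecture.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw one binary state per unit from its Bernoulli probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


class RBM:
    """Tiny binary RBM (hypothetical helper); W[i][j] links visible i to hidden j."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        # Bottom-up pass: p(h_j = 1 | v), i.e. W transpose applied to v, then a sigmoid.
        return [sigmoid(self.b_h[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        # Top-down pass: p(v_i = 1 | h).
        return [sigmoid(self.b_v[i] + sum(self.W[i][j] * h[j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step: data statistics minus one-step reconstruction statistics."""
        p_h0 = self.hidden_probs(v0)
        h0 = sample(p_h0, self.rng)
        v1 = sample(self.visible_probs(h0), self.rng)
        p_h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(p_h0)):
                # Assumption: hidden probabilities (not samples) are used in the statistics.
                self.W[i][j] += lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
        for i in range(len(v0)):
            self.b_v[i] += lr * (v0[i] - v1[i])
        for j in range(len(p_h0)):
            self.b_h[j] += lr * (p_h0[j] - p_h1[j])


def train_stack(data, layer_sizes, epochs=50, seed=0):
    """Greedy layer-wise training: each RBM learns from the layer below it."""
    rng = random.Random(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v)
        rbms.append(rbm)  # this layer's weights are not changed again
        # Hidden samples over the whole training set stand in for the
        # aggregated posterior that the next RBM treats as its data.
        layer_input = [sample(rbm.hidden_probs(v), rng) for v in layer_input]
    return rbms
```

A call such as `train_stack([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]] * 5, [3, 2])` would train a 4-3 RBM on the toy data and then a 3-2 RBM on samples of the first RBM's hidden units. The sketch omits mini-batches, momentum and weight decay, which practical implementations usually include.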
Inference in a directed net with replicated weights

[Figure: directed net with layers v0, h0, v1, h1, v2, h2, etc.]

The variables in h0 are conditionally independent given v0. Inference is trivial. We just multiply v0 by W transpose.

The model above h0 implements a complementary prior. Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.

Inference in the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

[Figure: directed net with layers v0, h0, v1, h1, v2, h2, etc.]

The learning rule for a sigmoid belief net is: [equation not preserved in the source]

With replicated weights this becomes: [equation not preserved in the source]

Learning a deep directed network

[Figure: directed net with layers v0, h0, v1, h1, v2, h2, etc., with the pair v0, h0 labelled]

First learn with all the weights tied. This is exactly equivalent to learning an RBM.

Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.

[Figure: directed net with layers v0, h0, v1, h1, v2, h2, etc., with the pair v1, h0 labelled]

Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together). This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

What happens when the weights in higher layers become different from the weights in the first layer?

The higher layers no longer implement a complementary prior. So performing inference using the frozen weights in the first layer is no longer correct. Using this incorrect inference procedure gives a variational lower bound on the log probability of the data. We lose by the slackness of the bound.

The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer. This improves the network’s model of the data. Hinton, Osindero and Teh (2006) prove that this