A simpler approach would be to use [`Turing`](https://github.com/TuringLang/Turing.jl), where a `Turing.Model` can be automatically converted into a `LogDensityProblem` and a corresponding `bijector` is automatically generated.
Since most VI algorithms assume that the posterior is unconstrained, we will apply a change-of-variable to our model to make it unconstrained.
This amounts to wrapping it into a `LogDensityProblem` that applies the transformation and the corresponding Jacobian adjustment.
```julia
struct TransformedLogDensityProblem{Prob,Trans}
    # Field names are assumed here: `prob` holds the original problem and
    # `transform` the map to unconstrained space (the full definition is
    # abbreviated in this excerpt).
    prob::Prob
    transform::Trans
end
```
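Only the struct header survives in this excerpt; the following is a minimal sketch of how the log-density with the Jacobian adjustment could be implemented, assuming `transform` maps the original (constrained) space to the unconstrained one:

```julia
using Bijectors, LogDensityProblems

function LogDensityProblems.logdensity(tprob::TransformedLogDensityProblem, θ)
    (; prob, transform) = tprob
    # Map the unconstrained input back to the original space and pick up the
    # log-absolute-determinant of the Jacobian of the inverse transformation.
    z, logabsdetjac = Bijectors.with_logabsdet_jacobian(Bijectors.inverse(transform), θ)
    return LogDensityProblems.logdensity(prob, z) + logabsdetjac
end
```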
For the variational family, we will consider a `FullRankGaussian` approximation:
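A minimal sketch of constructing such an approximation, assuming a problem dimension `d` and that the `FullRankGaussian` constructor takes a location vector and a lower-triangular scale matrix (as in recent versions of `AdvancedVI`):

```julia
using AdvancedVI
using LinearAlgebra

# Hypothetical dimension of the (unconstrained) problem; in the full example
# this would be obtained from the model defined above.
d = 10

# Full-rank Gaussian approximation initialized at the standard normal:
# zero location vector and identity lower-triangular scale.
q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(I, d, d)))
```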
This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
Specifically, it uses the *reparameterization gradient* estimator.
As a result, this algorithm is best suited to problems where both the target log-density and the sampling process of the variational family are differentiable.
(See the [methodology section](@ref klminrepgraddescent_method) for more details.)
This algorithm is also commonly referred to as automatic differentiation variational inference, black-box variational inference with the reparameterization gradient, and stochastic gradient variational inference.
```@docs
KLMinRepGradDescent
```
## [Methodology](@id klminrepgraddescent_method)

`KLMinRepGradDescent` minimizes the exclusive Kullback-Leibler divergence $$\mathrm{KL}\left(q_{\lambda}, \pi\right)$$ over $$q_{\lambda} \in \mathcal{Q}$$, where $$\mathcal{Q}$$ is some family of distributions, often called the variational family, by running stochastic gradient descent in the (Euclidean) space of parameters.
That is, for each $$q_{\lambda} \in \mathcal{Q}$$, we assume there is a corresponding vector of parameters $$\lambda \in \Lambda$$, where the parameter space is Euclidean, that is, $$\Lambda \subseteq \mathbb{R}^p$$.
Since we usually only have access to the unnormalized density of the target distribution $$\pi$$, we don't have direct access to the KL divergence.
Instead, the ELBO maximization strategy maximizes a surrogate objective, the *evidence lower bound* (ELBO; [^JGJS1999]).
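Concretely, writing $$\mathbb{H}$$ for the differential entropy, the ELBO takes the familiar form

```math
\mathrm{ELBO}\left(\lambda\right)
\triangleq
\mathbb{E}_{z \sim q_{\lambda}}\left[ \log \pi\left(z\right) \right]
+ \mathbb{H}\left(q_{\lambda}\right),
```

which differs from the negative exclusive KL divergence only by a constant (the log normalization constant of $$\pi$$), so maximizing the ELBO is equivalent to minimizing the KL divergence.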
Algorithmically, `KLMinRepGradDescent` iterates a stochastic gradient ascent step on the ELBO.
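A sketch of this update, up to the choice of optimizer and with $$\gamma_t$$ standing in for a step size (a symbol introduced here for illustration), is

```math
\lambda_{t+1} = \mathrm{operator}\left( \lambda_t + \gamma_t\, \widehat{\nabla \mathrm{ELBO}}\left(q_{\lambda_t}\right) \right),
```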
where $\widehat{\nabla \mathrm{ELBO}}(q_{\lambda})$ is the reparameterization gradient estimate[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient and $$\mathrm{operator}$$ is an optional operator (*e.g.* projections, identity mapping).
The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
For the variational family $$\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda\}$$, suppose the process of sampling from $$q_{\lambda}$$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that
```math
z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi.
```
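The gradient of the ELBO is then estimated by differentiating a Monte Carlo estimate of the ELBO through $$T_{\lambda}$$. A sketch of such an $$M$$-sample estimator (the exact estimator used by the package, in particular its treatment of the entropy term, may differ) is

```math
\widehat{\mathrm{ELBO}}\left(\lambda\right)
= \frac{1}{M} \sum_{m=1}^{M} \log \pi\left( T_{\lambda}\left(\epsilon_m\right) \right)
+ \mathbb{H}\left(q_{\lambda}\right),
```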
where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.
## [Handling Constraints with `Bijectors`](@id bijectors)
As mentioned in the docstring, `KLMinRepGradDescent` assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
However, in general, it is most convenient to use variational families that have the whole Euclidean space $$\mathbb{R}^d$$ as their support.
This is the case for the [location-scale distributions](@ref locscale) provided by `AdvancedVI`.
For target distributions whose support is not the full $$\mathbb{R}^d$$, we can apply some transformation $$b$$ to $$q_{\lambda}$$ to match its support such that
```math
z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} b^{-1}\left(\eta\right);\quad \eta \sim q_{\lambda},
```
where $$b$$ is often called a *bijector*, since it is typically chosen to be a bijective transformation.
This idea is known as automatic differentiation VI (ADVI)[^KTRGB2017] and has subsequently been improved upon by TensorFlow Probability[^DLTBV2017].
In Julia, [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl)[^FXTYG2020] provides a comprehensive collection of bijections.
[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14), 1-45.
[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). TensorFlow Distributions. arXiv.
[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
One caveat of ADVI is that applying the bijection requires a corresponding Jacobian adjustment.
That is, the objective is now
```math
\mathrm{ADVI}\left(\lambda\right)
\triangleq
\mathbb{E}_{\eta \sim q_{\lambda}}\left[
    \log \pi\left( b^{-1}\left( \eta \right) \right)
    + \log \lvert J_{b^{-1}}\left(\eta\right) \rvert
\right]
+ \mathbb{H}\left(q_{\lambda}\right)
```
This is automatically handled by `AdvancedVI` through `TransformedDistribution` provided by `Bijectors.jl`.
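For concreteness, here is a minimal sketch of constructing such a transformed approximation; the constrained target and the Gaussian approximation below are placeholders standing in for the model and variational family used elsewhere in the documentation:

```julia
using Bijectors
using Distributions
using LinearAlgebra

# Hypothetical target with constrained support (0, ∞)²; stands in for the
# posterior of interest.
target = product_distribution(fill(LogNormal(), 2))

# `b` maps the support of `target` to ℝ²; its inverse maps ℝ² back.
b    = Bijectors.bijector(target)
binv = Bijectors.inverse(b)

# Unconstrained Gaussian approximation over ℝ².
q = MvNormal(zeros(2), Diagonal(ones(2)))

# Samples from `q_transformed` are pushed through `binv`, and its log-density
# automatically includes the Jacobian adjustment.
q_transformed = Bijectors.transformed(q, binv)
```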
By passing `q_transformed` to `optimize`, the Jacobian adjustment for the bijector `b` is automatically applied.
(See the [Basic Example](@ref basic) for a fully working example.)
## [Entropy Gradient Estimators](@id entropygrad)
Depending on the variational family, this might be computationally inefficient or numerically unstable.
For example, if ``q_{\lambda}`` is a Gaussian with a full-rank covariance, a back-substitution must be performed at every step, making the per-iteration complexity ``\mathcal{O}(d^3)`` and reducing numerical stability.
```@setup repgradelbo
using Bijectors
using FillArrays
using LinearAlgebra
using LogDensityProblems
using Optimisers
using ADTypes, ForwardDiff, ReverseDiff
using AdvancedVI

struct Dist{D}
    dist::D
end
```
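The setup block is truncated at this point; it presumably continues by implementing the `LogDensityProblems` interface for the wrapper above. A minimal sketch, assuming `Dist` simply delegates to the wrapped `Distributions.jl` object:

```julia
using Distributions
using LogDensityProblems

# Delegate the log-density, dimension, and capability queries to the wrapped
# distribution (a hypothetical completion of the truncated setup block).
LogDensityProblems.logdensity(model::Dist, θ) = logpdf(model.dist, θ)
LogDensityProblems.dimension(model::Dist) = length(model.dist)
function LogDensityProblems.capabilities(::Type{<:Dist})
    return LogDensityProblems.LogDensityOrder{0}()
end
```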