
Commit 4d8d95e

update docs to the new recommended use of Bijectors
1 parent f23eb7a commit 4d8d95e

5 files changed: +179 -213 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -86,7 +86,7 @@ end;
  A simpler approach would be to use [`Turing`](https://github.com/TuringLang/Turing.jl), where a `Turing.Model` can automatically be converted into a `LogDensityProblem` and a corresponding `bijector` is automatically generated.

  Since most VI algorithms assume that the posterior is unconstrained, we will apply a change-of-variable to our model to make it unconstrained.
- This amounts to wrapping it into a `LogDensityProblem` that applies the transformation and apply a Jacobian adjustment.
+ This amounts to wrapping it into a `LogDensityProblem` that applies the transformation and the corresponding Jacobian adjustment.

  ```julia
  struct TransformedLogDensityProblem{Prob,Trans}
@@ -178,7 +178,7 @@ For the variational family, we will consider a `FullRankGaussian` approximation:
  using LinearAlgebra

  d = LogDensityProblems.dimension(prob_trans_ad)
- q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.37*I, d, d)))
+ q = FullRankGaussian(zeros(d), LowerTriangular(Matrix{Float64}(0.6*I, d, d)))
  q = MeanFieldGaussian(zeros(d), Diagonal(ones(d)));
  ```

@@ -202,7 +202,7 @@ This, however, is not the original constrained posterior that we wanted to appro
  Therefore, we finally need to apply a change-of-variable to `q_opt` to make it approximate our original problem.

  ```julia
- q_trans = Bijectors.TransformedDistribution(q, binv)
+ q_trans = Bijectors.TransformedDistribution(q_opt, binv)
  ```

  For more examples and details, please refer to the documentation.
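Since the README hunk above cuts off right after the `struct TransformedLogDensityProblem{Prob,Trans}` line, here is a minimal sketch of what such a wrapper could look like. It is an illustration rather than the exact code in the README: the field names, the assumption that the transformation preserves dimension, and the use of `with_logabsdet_jacobian` from Bijectors.jl are assumptions made for this sketch.

```julia
using LogDensityProblems, Bijectors

# Sketch of a LogDensityProblem wrapper that evaluates the target on its
# constrained support while exposing an unconstrained parameterization.
# `transform` maps unconstrained values back to the constrained space,
# i.e. it plays the role of the inverse bijector `binv`.
struct TransformedLogDensityProblem{Prob,Trans}
    prob::Prob
    transform::Trans
end

function LogDensityProblems.logdensity(tp::TransformedLogDensityProblem, η)
    # Map η to the constrained space and pick up the log-absolute-determinant
    # of the Jacobian of the transformation (the change-of-variable adjustment).
    z, logabsdetjac = Bijectors.with_logabsdet_jacobian(tp.transform, η)
    return LogDensityProblems.logdensity(tp.prob, z) + logabsdetjac
end

# Assumes the transformation does not change the dimensionality.
function LogDensityProblems.dimension(tp::TransformedLogDensityProblem)
    return LogDensityProblems.dimension(tp.prob)
end

function LogDensityProblems.capabilities(::Type{<:TransformedLogDensityProblem})
    return LogDensityProblems.LogDensityOrder{0}()
end
```

With a bijector `b = Bijectors.bijector(dist)` and `binv = inverse(b)`, one would wrap the constrained problem as `TransformedLogDensityProblem(prob, binv)` and run the optimization in the unconstrained space (names here are illustrative).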

docs/src/klminrepgraddescent.md

Lines changed: 25 additions & 105 deletions
@@ -1,9 +1,7 @@
  # [`KLMinRepGradDescent`](@id klminrepgraddescent)

- ## Description
-
- This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
- Specifically, it uses the the *reparameterization gradient estimator*.
+ This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
+ Specifically, it uses the *reparameterization gradient* estimator.
  As a result, this algorithm is best applicable when the target log-density is differentiable and the sampling process of the variational family is differentiable.
  (See the [methodology section](@ref klminrepgraddescent_method) for more details.)
  This algorithm is also commonly referred to as automatic differentiation variational inference, black-box variational inference with the reparameterization gradient, and stochastic gradient variational inference.
@@ -13,30 +11,15 @@ This algorithm is also commonly referred to as automatic differentiation variati
  KLMinRepGradDescent
  ```

- The associated objective value can be estimated through the following:
-
- ```@docs; canonical=false
- estimate_objective(
-     ::Random.AbstractRNG,
-     ::Union{<:KLMinRepGradDescent,<:KLMinRepGradProxDescent,<:KLMinScoreGradDescent},
-     ::Any,
-     ::Any;
-     ::Int,
-     ::AbstractEntropyEstimator,
- )
- ```
-
  ## [Methodology](@id klminrepgraddescent_method)

- This algorithm aims to solve the problem
+ This algorithm aims to solve the problem

  ```math
  \mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right)
  ```

  where $\mathcal{Q}$ is some family of distributions, often called the variational family, by running stochastic gradient descent in the (Euclidean) space of parameters.
- That is, for all $$q_{\lambda} \in \mathcal{Q}$$, we assume $$q_{\lambda}$$ there is a corresponding vector of parameters $$\lambda \in \Lambda$$, where the space of parameters is Euclidean such that $$\Lambda \subset \mathbb{R}^p$$.
-
  Since we usually only have access to the unnormalized densities of the target distribution $\pi$, we don't have direct access to the KL divergence.
  Instead, the ELBO maximization strategy maximizes a surrogate objective, the *evidence lower bound* (ELBO; [^JGJS1999])

@@ -57,7 +40,7 @@ Algorithmically, `KLMinRepGradDescent` iterates the step
  where $\widehat{\nabla \mathrm{ELBO}}(q_{\lambda})$ is the reparameterization gradient estimate[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient and $$\mathrm{operator}$$ is an optional operator (*e.g.* projections, identity mapping).

  The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
- For the variational family $$\mathcal{Q}$$, suppose the process of sampling from $$q_{\lambda} \in \mathcal{Q}$$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that
+ For the variational family $\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda\}$, suppose the process of sampling from $q_{\lambda}$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that

  ```math
  z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
@@ -80,50 +63,6 @@ where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
  [^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
  [^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
  [^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In *International Conference on Learning Representations*.
- ## [Handling Constraints with `Bijectors`](@id bijectors)
-
- As mentioned in the docstring, `KLMinRepGradDescent` assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
- However, in general, it is most convenient to use variational families that have the whole Euclidean space $$\mathbb{R}^d$$ as their support.
- This is the case for the [location-scale distributions](@ref locscale) provided by `AdvancedVI`.
- For target distributions which the support is not the full $$\mathbb{R}^d$$, we can apply some transformation $$b$$ to $$q_{\lambda}$$ to match its support such that
-
- ```math
- z \sim q_{b,\lambda} \qquad\Leftrightarrow\qquad
- z \stackrel{d}{=} b^{-1}\left(\eta\right);\quad \eta \sim q_{\lambda},
- ```
-
- where $$b$$ is often called a *bijector*, since it is often chosen among bijective transformations.
- This idea is known as automatic differentiation VI[^KTRGB2017] and has subsequently been improved by Tensorflow Probability[^DLTBV2017].
- In Julia, [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl)[^FXTYG2020] provides a comprehensive collection of bijections.
-
- [^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14), 1-45.
- [^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). Tensorflow distributions. arXiv.
- [^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
- One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
- That is, the objective is now
- ```math
- \mathrm{ADVI}\left(\lambda\right)
- \triangleq
- \mathbb{E}_{\eta \sim q_{\lambda}}\left[
-     \log \pi\left( b^{-1}\left( \eta \right) \right)
-     + \log \lvert J_{b^{-1}}\left(\eta\right) \rvert
- \right]
- + \mathbb{H}\left(q_{\lambda}\right)
- ```
-
- This is automatically handled by `AdvancedVI` through `TransformedDistribution` provided by `Bijectors.jl`.
- See the following example:
-
- ```julia
- using Bijectors
- q = MeanFieldGaussian(μ, L)
- b = Bijectors.bijector(dist)
- binv = inverse(b)
- q_transformed = Bijectors.TransformedDistribution(q, binv)
- ```
-
- By passing `q_transformed` to `optimize`, the Jacobian adjustment for the bijector `b` is automatically applied.
- (See the [Basic Example](@ref basic) for a fully working example.)

  ## [Entropy Gradient Estimators](@id entropygrad)

@@ -157,7 +96,6 @@ Depending on the variational family, this might be computationally inefficient o
  For example, if ``q_{\lambda}`` is a Gaussian with a full-rank covariance, a back-substitution must be performed at every step, making the per-iteration complexity ``\mathcal{O}(d^3)`` and reducing numerical stability.

  ```@setup repgradelbo
- using Bijectors
  using FillArrays
  using LinearAlgebra
  using LogDensityProblems
@@ -168,51 +106,38 @@ using Optimisers
  using ADTypes, ForwardDiff, ReverseDiff
  using AdvancedVI

- struct NormalLogNormal{MX,SX,MY,SY}
-     μ_x::MX
-     σ_x::SX
-     μ_y::MY
-     Σ_y::SY
+ struct Dist{D}
+     dist::D
  end

- function LogDensityProblems.logdensity(model::NormalLogNormal, θ)
-     (; μ_x, σ_x, μ_y, Σ_y) = model
-     logpdf(LogNormal(μ_x, σ_x), θ[1]) + logpdf(MvNormal(μ_y, Σ_y), θ[2:end])
+ function LogDensityProblems.logdensity(model::Dist, θ)
+     return logpdf(model.dist, θ)
  end

- function LogDensityProblems.logdensity_and_gradient(model::NormalLogNormal, θ)
+ function LogDensityProblems.logdensity_and_gradient(model::Dist, θ)
      return (
          LogDensityProblems.logdensity(model, θ),
          ForwardDiff.gradient(Base.Fix1(LogDensityProblems.logdensity, model), θ),
      )
  end

- function LogDensityProblems.dimension(model::NormalLogNormal)
-     length(model.μ_y) + 1
+ function LogDensityProblems.dimension(model::Dist)
+     return length(model.dist)
  end

- function LogDensityProblems.capabilities(::Type{<:NormalLogNormal})
+ function LogDensityProblems.capabilities(::Type{<:Dist})
      LogDensityProblems.LogDensityOrder{1}()
  end

  n_dims = 10
- μ_x = 2.0
- σ_x = 0.3
- μ_y = Fill(2.0, n_dims)
- σ_y = Fill(1.0, n_dims)
- model = NormalLogNormal(μ_x, σ_x, μ_y, Diagonal(σ_y.^2));
+ μ = Fill(2.0, n_dims)
+ σ = Fill(1.0, n_dims)
+ model = Dist(MvNormal(μ, Diagonal(σ.^2)));

  d = LogDensityProblems.dimension(model);
- μ = zeros(d);
- L = Diagonal(ones(d));
- q0 = AdvancedVI.MeanFieldGaussian(μ, L)
-
- function Bijectors.bijector(model::NormalLogNormal)
-     (; μ_x, σ_x, μ_y, Σ_y) = model
-     Bijectors.Stacked(
-         Bijectors.bijector.([LogNormal(μ_x, σ_x), MvNormal(μ_y, Σ_y)]),
-         [1:1, 2:1+length(μ_y)])
- end
+ μ0 = zeros(d);
+ L0 = Diagonal(ones(d));
+ q0 = AdvancedVI.MeanFieldGaussian(μ0, L0)
  ```

  In this example, the true posterior is contained within the variational family.
@@ -223,10 +148,6 @@ Recall that the original ADVI objective with a closed-form entropy (CFE) is give

  ```@example repgradelbo
  n_montecarlo = 16;
- b = Bijectors.bijector(model);
- binv = inverse(b)
-
- q0_trans = Bijectors.TransformedDistribution(q0, binv)

  cfe = KLMinRepGradDescent(
      AutoReverseDiff();
@@ -250,20 +171,19 @@ nothing
  ```

  ```@setup repgradelbo
- max_iter = 3*10^3
+ max_iter = 10^3

  function callback(; params, restructure, kwargs...)
-     q = restructure(params).dist
-     dist2 = sum(abs2, q.location - vcat([μ_x], μ_y))
-     + sum(abs2, diag(q.scale) - vcat(σ_x, σ_y))
+     q = restructure(params)
+     dist2 = sum(abs2, q.location - μ) + sum(abs2, diag(q.scale) - σ)
      (dist = sqrt(dist2),)
  end

  _, info_cfe, _ = AdvancedVI.optimize(
      cfe,
      max_iter,
      model,
-     q0_trans;
+     q0;
      show_progress = false,
      callback = callback,
  );
@@ -272,7 +192,7 @@ _, info_stl, _ = AdvancedVI.optimize(
      stl,
      max_iter,
      model,
-     q0_trans;
+     q0;
      show_progress = false,
      callback = callback,
  );
@@ -281,7 +201,7 @@ _, info_stl, _ = AdvancedVI.optimize(
      stl,
      max_iter,
      model,
-     q0_trans;
+     q0;
      show_progress = false,
      callback = callback,
  );
@@ -364,7 +284,7 @@ _, info_qmc, _ = AdvancedVI.optimize(
      KLMinRepGradDescent(AutoReverseDiff(); n_samples=n_montecarlo, optimizer=Adam(1e-2), operator=ClipScale()),
      max_iter,
      model,
-     q0_trans;
+     q0;
      show_progress = false,
      callback = callback,
  );
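The methodology portion of this file describes the reparameterization gradient estimator that `KLMinRepGradDescent` is built on. As a standalone illustration, and deliberately not using the AdvancedVI API, the following sketch estimates the gradient of the closed-form-entropy ELBO for a mean-field Gaussian; `logπ`, `elbo_repgrad`, and the parameterization λ = (μ, log σ) are assumptions made for this example.

```julia
using Statistics
using ForwardDiff

# Reparameterization gradient of the ELBO for q_λ = N(μ, Diagonal(exp.(ℓ).^2)),
# with λ = (μ, ℓ) and base distribution φ = N(0, I). `logπ` is any differentiable
# (possibly unnormalized) target log-density.
function elbo_repgrad(logπ, μ, ℓ, n_samples)
    d = length(μ)
    ϵs = [randn(d) for _ in 1:n_samples]  # Monte Carlo samples from the base distribution φ

    function elbo(λ)
        m, l = λ[1:d], λ[(d + 1):end]
        σ = exp.(l)
        # Reparameterization: z = T_λ(ϵ) = m + σ .* ϵ
        energy = mean(logπ(m .+ σ .* ϵ) for ϵ in ϵs)
        # Closed-form entropy of a diagonal Gaussian
        entropy = sum(l) + d / 2 * (1 + log(2π))
        return energy + entropy
    end

    # Differentiating through T_λ yields the reparameterization gradient estimate
    return ForwardDiff.gradient(elbo, vcat(μ, ℓ))
end

# Example: one gradient estimate against a standard normal target in 2 dimensions
elbo_repgrad(z -> -sum(abs2, z) / 2, zeros(2), zeros(2), 16)
```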
