Results

Energy-Based Models

Energy-based models (EBMs) are generative models that learn a density $p_\phi$ parametrized by an energy function $f_\phi$, so that $p_\phi(x) = \bar p_\phi(x) / Z_\phi$, with $\bar p_\phi(x)=\exp(f_\phi(x))$. Typically, EBMs are trained via maximum likelihood estimation (MLE) by estimating the gradient of the maximum likelihood loss, since direct computation of the loss is not feasible due to the intractability of $Z_\phi$. However, when a sampler with a tractable density $q_\theta$ is given, a tight lower bound on the MLE loss can be computed:

$$
\mathcal{L}_\text{ELBO}(\phi, \theta) = \mathbb{E}_{x \sim q_\theta}\left[ \log \bar p_\phi(x) \right] - \mathbb{E}_{x \sim p_\text{data}}\left[ \log \bar p_\phi(x) \right] + \mathcal{H}(q_\theta).
$$

The EBM is then trained by alternating between maximizing $\mathcal{L}_\text{ELBO}(\phi, \theta)$ with respect to $\theta$ to further tighten the lower bound, and minimizing it with respect to $\phi$.
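To see why the bound holds and when it is tight, the following sketch checks it on a toy discrete space where the partition function can be summed exactly. The four-state space, the energy values, and all function names are illustrative assumptions, not part of the method; the MLE loss is taken here to be the negative log-likelihood, $\log Z_\phi - \mathbb{E}_{x \sim p_\text{data}}[f_\phi(x)]$.

```python
import math

def log_partition(f):
    """Exact log Z = log sum_x exp(f(x)) over a small discrete space."""
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))

def entropy(q):
    """Shannon entropy H(q) of a discrete distribution."""
    return -sum(p * math.log(p) for p in q if p > 0)

def elbo_loss(f, q, p_data):
    """L_ELBO = E_q[f] - E_data[f] + H(q), with log p_bar = f."""
    e_q = sum(qi * fi for qi, fi in zip(q, f))
    e_data = sum(pi * fi for pi, fi in zip(p_data, f))
    return e_q - e_data + entropy(q)

def mle_loss(f, p_data):
    """Negative log-likelihood: log Z - E_data[f]."""
    return log_partition(f) - sum(pi * fi for pi, fi in zip(p_data, f))

# Hypothetical toy problem with four states.
f = [0.5, -1.0, 2.0, 0.0]
p_data = [0.1, 0.2, 0.6, 0.1]

uniform_q = [0.25] * 4                       # a poor sampler
log_z = log_partition(f)
model_q = [math.exp(v - log_z) for v in f]   # sampler equal to p_phi

print(mle_loss(f, p_data))
print(elbo_loss(f, uniform_q, p_data))  # strictly below the MLE loss
print(elbo_loss(f, model_q, p_data))    # matches the MLE loss
```

With the uniform sampler the bound is loose; when $q_\theta = p_\phi$ the gap, which equals $\mathrm{KL}(q_\theta \,\|\, p_\phi)$, vanishes. This is why maximizing over $\theta$ tightens the bound before each update of $\phi$.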

In Figure 1, we show the Fréchet Inception Distance (FID) averaged over 5 random seeds for CIFAR-10 image generation. We compare against P-SVGD (Messaoud et al., 2024) and Glow, a normalizing flow (Kingma & Dhariwal, 2018).
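For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the feature means and covariances of the real and generated sets (notation introduced here for illustration), the standard formula is:

$$
\text{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right).
$$

Lower values indicate that the generated features are statistically closer to the real ones.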

Figure 1: FID on CIFAR-10; bold marks changes between configurations.

Removing either the trace-of-Hessian term or the step-size bound causes the training to diverge, as the violet and gray curves show. Figure 2 shows that adding the trace-of-Hessian term produces smoother energy landscapes that are much easier to sample from. The step-size bound plays a different but equally important role, as it ensures that the entropy estimation remains valid.

Figure 2: Smoothness of the learnt EBM as measured by $\|\nabla \log \bar p_\phi\|$ throughout training iterations. The configurations with the best FID (Figure 1) exhibit smoother landscapes.
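As a rough illustration of the smoothness measure, the sketch below estimates the gradient norm of a log-density with central finite differences. The two toy landscapes and the helper names are hypothetical; in practice the gradient of a neural energy function would come from automatic differentiation rather than finite differences.

```python
import math

def grad_norm(f, x, h=1e-5):
    """Central finite-difference estimate of the Euclidean norm of grad f at x."""
    squared = 0.0
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        g_i = (f(up) - f(down)) / (2 * h)
        squared += g_i * g_i
    return math.sqrt(squared)

def smooth_log_density(x):
    """Hypothetical smooth landscape: a quadratic bowl."""
    return -0.5 * sum(v * v for v in x)

def rough_log_density(x):
    """Same bowl with high-frequency ripples added."""
    return smooth_log_density(x) + 0.5 * sum(math.sin(20 * v) for v in x)

point = [0.3, -0.7]
print(grad_norm(smooth_log_density, point))
print(grad_norm(rough_log_density, point))  # much larger at the same point
```

A larger gradient norm at comparable points signals a rougher landscape, which is the sense in which lower curves in Figure 2 correspond to smoother models.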

Replacing the median heuristic bandwidth, $\sigma_\text{med}$, with a learnable bandwidth greatly improves stability and leads to significantly better FID scores compared to P-SVGD (green versus orange). As shown in Figure 3a, the median heuristic bandwidth is on average more than an order of magnitude larger than the learned value $\sigma_{\theta_2}$. This overly large bandwidth causes particles to become spuriously correlated, hurting sample quality.
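The effect of bandwidth on particle coupling can be illustrated with a small sketch. It assumes one common variant of the median heuristic (bandwidth equal to the median pairwise distance; other variants rescale by a factor depending on the number of particles) and an RBF kernel; the particle positions are made up for illustration.

```python
import math
from statistics import median

def pairwise_distances(particles):
    """Euclidean distances between all distinct pairs of particles."""
    return [math.dist(a, b)
            for i, a in enumerate(particles)
            for b in particles[i + 1:]]

def median_bandwidth(particles):
    """Median heuristic (one common variant): the median pairwise distance."""
    return median(pairwise_distances(particles))

def rbf(a, b, sigma):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return math.exp(-math.dist(a, b) ** 2 / (2 * sigma ** 2))

# Hypothetical 2-D particles.
particles = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sigma_med = median_bandwidth(particles)

far_pair = (particles[0], particles[3])
k_med = rbf(*far_pair, sigma_med)          # noticeable coupling
k_small = rbf(*far_pair, sigma_med / 10)   # essentially decoupled
print(sigma_med, k_med, k_small)
```

With the median bandwidth, even the most distant pair keeps a non-negligible kernel value, so each particle's update is influenced by far-away particles. A bandwidth ten times smaller drives that value to nearly zero.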

Learning the step size as well (red curve) allows the method to converge to the target distribution much faster. Figure 3b illustrates that the learned step size $\epsilon_{\theta_3}$ is much larger than the fixed step size $\epsilon$.
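To make the roles of the bandwidth and step size concrete, here is a minimal sketch of the standard SVGD particle update in one dimension with an RBF kernel. It is not the MET-SVGD variant: the bandwidth `sigma` and step size `eps` are fixed constants here, and the standard normal target and starting positions are assumptions chosen for illustration.

```python
import math

def svgd_step(xs, score, sigma, eps):
    """One standard SVGD update for 1-D particles with an RBF kernel."""
    n = len(xs)
    new_xs = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-(xj - xi) ** 2 / (2 * sigma ** 2))
            drive = k * score(xj)                  # pulls x_i toward high density
            repulse = k * (xi - xj) / sigma ** 2   # dk/dx_j, pushes particles apart
            phi += drive + repulse
        new_xs.append(xi + eps * phi / n)
    return new_xs

def gaussian_score(x):
    """Score of a standard normal target: d/dx log p(x) = -x."""
    return -x

def mean(xs):
    return sum(xs) / len(xs)

particles = [3.0, 3.25, 3.5, 3.75, 4.0]   # start far from the target mode
after_one = svgd_step(particles, gaussian_score, sigma=1.0, eps=0.1)

final = particles
for _ in range(200):
    final = svgd_step(final, gaussian_score, sigma=1.0, eps=0.1)

print(mean(particles), mean(after_one), mean(final))
```

The step size scales how far every particle moves per iteration, so a larger value covers the distance to the target in fewer updates, provided the update stays stable.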

(a) Kernel bandwidth across training iterations.

(b) Step size across training iterations.

Figure 3: Learnt kernel bandwidth and step size across training iterations. The trends exhibited by the learnt kernel bandwidth and step size in MET-SVGD configurations are significantly different from $\sigma_\text{med}$ and constant $\epsilon$.

Using an adaptive number of update steps $L_c$ further stabilizes training, as shown by the brown curve.

In contrast to the other MET-SVGD optimizations, experiments that incorporate Metropolis-Hastings (MH) diverge, as indicated by the pink curve. This instability is not inherent to MET-SVGD itself, but instead comes from the interaction between MH rejection dynamics and a loss function, $\mathcal{L}_\text{ELBO}$, that is not lower-bounded. The stability of the training objective depends critically on the quality of the samples used to estimate the expectation with respect to the sampler. When the energy landscape is highly complex, MH acceptance rates can become very low (Figure 4), preventing particles from moving toward high-density regions of the target distribution. As a result, sample quality degrades and training eventually diverges.
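The link between landscape shape and rejection rate can be illustrated with a generic random-walk Metropolis sampler. This is a sketch under stated assumptions: how MH is combined with the SVGD updates is not described here, and the two 1-D log-densities, the proposal scale, and the function names are hypothetical.

```python
import math
import random

def mh_rejection_rate(log_p_bar, x0, step_std, n_steps, seed=0):
    """Random-walk Metropolis on a 1-D unnormalized log-density.

    Returns the fraction of proposals that were rejected.
    """
    rng = random.Random(seed)
    x, rejected = x0, 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_std)
        log_ratio = log_p_bar(proposal) - log_p_bar(x)
        # Accept with probability min(1, p_bar(proposal) / p_bar(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        else:
            rejected += 1
    return rejected / n_steps

def broad(x):
    """Hypothetical broad, gentle landscape."""
    return -0.5 * x * x

def sharp(x):
    """Hypothetical narrow, steep landscape."""
    return -50.0 * x * x

rate_broad = mh_rejection_rate(broad, 0.0, step_std=1.0, n_steps=5000)
rate_sharp = mh_rejection_rate(sharp, 0.0, step_std=1.0, n_steps=5000)
print(rate_broad, rate_sharp)
```

With the same proposal scale, the steep landscape rejects far more moves, so the chain stays where it is for most iterations. This mirrors the behaviour described above, where low acceptance keeps particles from reaching high-density regions.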

Figure 4: MH rejection rate across training iterations.

Glow-NF exhibits a similar failure mode, struggling to produce good samples early in training and ultimately diverging as well.

Maximum Entropy Reinforcement Learning

MaxEnt RL learns a stochastic policy over actions given a state by maximizing the sum of expected rewards and entropies:

$$
\pi^*_\theta = \underset{\pi_\theta}{\mathrm{argmax}} \sum_t \mathbb{E}_{(s_t, a_t)} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi_\theta(\cdot | s_t)) \right]
$$
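The objective can be sketched as a single-trajectory estimate of the entropy-augmented return. The diagonal Gaussian policy is used here only because its entropy has a closed form; the rewards, standard deviations, temperature `alpha`, and function names are hypothetical.

```python
import math

def diag_gaussian_entropy(stds):
    """Entropy of a diagonal Gaussian: sum_i 0.5 * log(2 * pi * e * std_i^2)."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in stds)

def maxent_return(rewards, entropies, alpha):
    """Single-trajectory estimate of sum_t [ r_t + alpha * H(pi(. | s_t)) ]."""
    return sum(r + alpha * h for r, h in zip(rewards, entropies))

# Hypothetical three-step trajectory with a 2-D action space.
rewards = [1.0, 0.5, 2.0]
policy_stds = [[1.0, 1.0], [0.5, 0.5], [0.1, 0.1]]
entropies = [diag_gaussian_entropy(s) for s in policy_stds]

print(maxent_return(rewards, entropies, alpha=0.0))  # plain return
print(maxent_return(rewards, entropies, alpha=0.2))  # entropy-augmented return
```

Setting `alpha` to zero recovers the usual sum of rewards, while a positive `alpha` rewards the policy for staying stochastic in states where its action distribution is wide.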

We model the sampler and estimate the entropy using the MET-SVGD framework and compare against SAC (Haarnoja et al., 2018), which models the policy as a diagonal Gaussian; SAC-NF, which models the policy as an auto-regressive normalizing flow (Papamakarios et al., 2017); and P-SVGD (Messaoud et al., 2024). In Figure 5, we report the average return on 10 rollouts every 1000 steps on Walker2d-v2 averaged over 5 seeds.

(a) Return IQM across training iterations.

(b) Return IQM at the end of training.

Figure 5: Return IQM on Walker2d-v2.
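IQM is usually read as the interquartile mean, that is, the mean of the middle 50% of scores. A minimal sketch follows, assuming the convention of trimming a truncated 25% of values from each end; how returns are pooled across seeds and rollouts is not stated here, and the sample returns are invented.

```python
def iqm(values):
    """Interquartile mean: average after dropping the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of values trimmed from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

# Hypothetical returns with one failed run and one outlier.
returns = [0.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1000.0]
print(sum(returns) / len(returns))  # plain mean, dominated by the outlier
print(iqm(returns))                 # robust to both extremes
```

Compared with the plain mean, the IQM is less sensitive to a single diverged or unusually lucky run, which matters when only a handful of seeds are available.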

Learning the kernel bandwidth (orange) and adding the trace-of-Hessian correction (green) both lead to substantial improvements over the P-SVGD baseline. In contrast, removing the bound on the step size (pale green) causes the method to diverge, which is expected.

Learning the step size without the trace-of-Hessian term (peach) performs worse than all baselines. In this case, particles tend to drift into non-smooth regions of the landscape (Figure 6).

Figure 6: Smoothness as measured by $\|\nabla \log \bar p_\phi\|$ throughout training iterations.

The best results come from incorporating MH with an $\epsilon$-greedy schedule (purple): applying MH with high probability later in training and low probability early on helps balance exploration in the beginning with exploitation as learning progresses. These improvements become even more pronounced when we increase the number of particles from 10 to 64 (brown).
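One way such a schedule could look is sketched below. The text only states that MH is applied with low probability early and high probability late, so the linear ramp, the endpoint probabilities, and the function names are all assumptions.

```python
import random

def mh_probability(step, total_steps, p_start=0.05, p_end=0.95):
    """Linearly ramp the probability of applying MH from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

def use_mh(step, total_steps, rng):
    """Bernoulli draw deciding whether this iteration applies the MH correction."""
    return rng.random() < mh_probability(step, total_steps)

rng = random.Random(0)
total = 10_000
early = sum(use_mh(s, total, rng) for s in range(0, 1_000))
late = sum(use_mh(s, total, rng) for s in range(9_000, 10_000))
print(early, late)  # MH is applied far more often near the end of training
```

Early iterations mostly skip the accept/reject step and let particles move freely, while late iterations apply it on most updates.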

References
  1. Messaoud, S., Mokeddem, B., Xue, Z., Pang, L., An, B., Chen, H., & Chawla, S. (2024). S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic. ICLR.
  2. Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. NeurIPS.
  3. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML.
  4. Papamakarios, G., Pavlakou, T., & Murray, I. (2017). Masked autoregressive flow for density estimation. NeurIPS.