
Getting started with TensorFlow Probability from R


With the abundance of great libraries for statistical computing in R, why would you be interested in TensorFlow Probability (TFP, for short)? Well – let's take a look at a list of its components:

  • Distributions and bijectors (bijectors are reversible, composable maps)
  • Probabilistic modeling (Edward2 and probabilistic network layers)
  • Probabilistic inference (via MCMC or variational inference)

Now imagine all of these working seamlessly with the TensorFlow framework – core, Keras, contributed modules – and also, running distributed and on GPU. The field of possible applications is vast – and far too diverse to cover as a whole in an introductory blog post.

Instead, our aim here is to provide a first introduction to TFP, focusing on direct applicability to and interoperability with deep learning.
We'll quickly show how to get started with one of the basic building blocks: distributions. Then, we'll build a variational autoencoder similar to that in Representation learning with MMD-VAE. This time though, we'll use TFP to sample from the prior and approximate posterior distributions.

We'll regard this post as a "proof of concept" for using TFP with Keras – from R – and plan to follow up with more elaborate examples from the area of semi-supervised representation learning.

To install TFP together with TensorFlow, simply append tensorflow-probability to the default list of extra packages:

library(tensorflow)
install_tensorflow(
  extra_packages = c("keras", "tensorflow-hub", "tensorflow-probability"),
  version = "1.12"
)

Now to use TFP, all we need to do is import it and create some useful handles.
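In code (the same two lines we will run once more below, after switching to eager execution):

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions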

And here we go, sampling from a standard normal distribution.

n <- tfd$Normal(loc = 0, scale = 1)
n$sample(6L)
tf.Tensor(
"Normal_1/sample/Reshape:0", shape=(6,), dtype=float32
)

Now that's nice, but it's 2019, and we don't want to have to create a session to evaluate those tensors anymore. In the variational autoencoder example below, we are going to see how TFP and TF eager execution are the perfect match, so why not start using them now.

To use eager execution, we have to execute the following lines in a fresh (R) session:
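Under TensorFlow 1.x, which this post targets, a sketch of those lines could look as follows (use_implementation() comes from the keras package, tfe_enable_eager_execution() from the tensorflow package of that era):

library(keras)
use_implementation("tensorflow")

library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")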

… and import TFP, same as above.

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

Now let's quickly take a look at TFP distributions.

Using distributions

Here's that standard normal again.

n <- tfd$Normal(loc = 0, scale = 1)

Things commonly done with a distribution include sampling:

# just as in low-level tensorflow, we need to append L to indicate integer arguments
n$sample(6L)
tf.Tensor(
[-0.34403768 -0.14122334 -1.3832929   1.618252    1.364448   -1.1299014 ],
shape=(6,),
dtype=float32
)

As well as getting the log probability. Here we do this simultaneously for three values.
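The call producing the output below might look like this; the evaluation points -1, 0 and 1 are inferred from the log densities shown:

n$log_prob(c(-1, 0, 1))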

tf.Tensor(
[-1.4189385 -0.9189385 -1.4189385], shape=(3,), dtype=float32
)

We can do the same things with various other distributions, e.g., the Bernoulli:

b <- tfd$Bernoulli(0.9)
b$pattern(10L)
tf.Tensor(
[1 1 1 0 1 1 0 1 0 1], shape=(10,), dtype=int32
)
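The log probabilities shown next could come from a call like the following; the queried draws 0, 1, 0, 0 are inferred from the values shown (the unnamed 0.9 above is treated as a logit, which is consistent with these numbers):

b$log_prob(c(0, 1, 0, 0))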
tf.Tensor(
[-1.2411538 -0.3411539 -1.2411538 -1.2411538], shape=(4,), dtype=float32
)

Note that in the last chunk, we're asking for the log probabilities of four independent draws.

Batch shapes and event shapes

In TFP, we can do the following.

ns <- tfd$Normal(
  loc = c(1, 10, -200),
  scale = c(0.1, 0.1, 1)
)
ns
tfp.distributions.Normal(
"Normal/", batch_shape=(3,), event_shape=(), dtype=float32
)

Contrary to what it might look like, this is not a multivariate normal. As indicated by batch_shape=(3,), this is a "batch" of independent univariate distributions. The fact that these are univariate is seen in event_shape=(): each of them lives in a one-dimensional event space.
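For example, drawing two samples should yield a tensor of shape (2, 3): the sample dimension first, then the batch dimension.

ns$sample(2L)  # tf.Tensor of shape (2, 3)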

If instead we create a single, two-dimensional multivariate normal:

n <- tfd$MultivariateNormalDiag(loc = c(0, 10), scale_diag = c(1, 4))
n
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(), event_shape=(2,), dtype=float32
)

we see batch_shape=(), event_shape=(2,), as expected.

Of course, we can combine both, creating batches of multivariate distributions:

nd_batch <- tfd$MultivariateNormalFullCovariance(
  loc = list(c(0., 0.), c(1., 1.), c(2., 2.)),
  covariance_matrix = list(
    matrix(c(1, .1, .1, 1), ncol = 2),
    matrix(c(1, .3, .3, 1), ncol = 2),
    matrix(c(1, .5, .5, 1), ncol = 2))
)

This example defines a batch of three two-dimensional multivariate normal distributions.
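Printing nd_batch should confirm this, showing a batch_shape of (3,) combined with an event_shape of (2,), along the lines of:

nd_batch
tfp.distributions.MultivariateNormalFullCovariance(
"MultivariateNormalFullCovariance/", batch_shape=(3,), event_shape=(2,), dtype=float32
)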

Converting between batch shapes and event shapes

Strange as it may sound, situations arise where we want to transform distribution shapes between these kinds – in fact, we'll see such a case very soon.

tfd$Independent is used to convert dimensions in batch_shape to dimensions in event_shape.

Here is a batch of three independent Bernoulli distributions.

bs <- tfd$Bernoulli(probs=c(.3,.5,.7))
bs
tfp.distributions.Bernoulli(
"Bernoulli/", batch_shape=(3,), event_shape=(), dtype=int32
)

We can convert this to a virtual "three-dimensional" Bernoulli like this:

b <- tfd$Independent(bs, reinterpreted_batch_ndims = 1L)
b
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(), event_shape=(3,), dtype=int32
)

Here reinterpreted_batch_ndims tells TFP how many of the batch dimensions are being used for the event space, starting to count from the right of the shape list.
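To see the difference, compare log probabilities before and after the conversion (a small sketch; the single value returned by the Independent version is the sum of the three individual ones):

bs$log_prob(c(1, 1, 1))  # three values, one per distribution: shape (3,)
b$log_prob(c(1, 1, 1))   # one joint log probability: shape ()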

With this basic understanding of TFP distributions, we're ready to see them used in a VAE.

We'll take the (not so) deep convolutional architecture from Representation learning with MMD-VAE and use distributions for sampling and computing probabilities. Optionally, our new VAE will be able to learn the prior distribution.

Concretely, the following exposition will consist of three parts.
First, we present common code applicable to both a VAE with a static prior, and one that learns the parameters of the prior distribution.
Then, we have the training loop for the first (static-prior) VAE. Finally, we discuss the training loop and additional model involved in the second (prior-learning) VAE.

Presenting both versions one after the other leads to some code duplication, but avoids scattering confusing if-else branches throughout the code.

The second VAE is available as part of the Keras examples, so you don't have to copy out code snippets. The code there also contains additional functionality not discussed and replicated here, such as for saving model weights.

So, let's start with the common part.

At the risk of repeating ourselves, here again are the preparatory steps (including a few additional library loads).
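A sketch of those steps, assuming we keep the eager-execution setup from above and add the packages used further below (tfdatasets for the input pipeline, glue for logging); the value of num_epochs is an assumption, so pick whatever works for you:

library(keras)
use_implementation("tensorflow")
library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")

library(tfdatasets)
library(glue)

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

num_epochs <- 20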

Dataset

For a change from MNIST and Fashion-MNIST, we'll use the brand-new Kuzushiji-MNIST (Clanuwat et al. 2018).

From: Deep Learning for Classical Japanese Literature (Clanuwat et al. 2018)
np <- import("numpy")

kuzushiji <- np$load("kmnist-train-imgs.npz")
kuzushiji <- kuzushiji$get("arr_0")
 
train_images <- kuzushiji %>%
  k_expand_dims() %>%
  k_cast(dtype = "float32")

train_images <- train_images %>% `/`(255)

As in that other post, we stream the data via tfdatasets:

buffer_size <- 60000
batch_size <- 256
batches_per_epoch <- buffer_size / batch_size

train_dataset <- tensor_slices_dataset(train_images) %>%
  dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)

Now let's see what changes in the encoder and decoder models.

Encoder

The encoder differs from what we had without TFP in that it doesn't return the approximate posterior means and variances directly as tensors. Instead, it returns a batch of multivariate normal distributions:

# you may want to change this depending on the dataset
latent_dim <- 2

encoder_model <- function(name = NULL) {

  keras_model_custom(name = name, function(self) {
  
    self$conv1 <-
      layer_conv_2d(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$conv2 <-
      layer_conv_2d(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$flatten <- layer_flatten()
    self$dense <- layer_dense(units = 2 * latent_dim)
    
    function (x, mask = NULL) {
      x <- x %>%
        self$conv1() %>%
        self$conv2() %>%
        self$flatten() %>%
        self$dense()
        
      tfd$MultivariateNormalDiag(
        loc = x[, 1:latent_dim],
        scale_diag = tf$nn$softplus(x[, (latent_dim + 1):(2 * latent_dim)] + 1e-5)
      )
    }
  })
}

Let's try this out.

encoder <- encoder_model()

iter <- make_iterator_one_shot(train_dataset)
x <-  iterator_get_next(iter)

approx_posterior <- encoder(x)
approx_posterior
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(256,), event_shape=(2,), dtype=float32
)
approx_posterior_sample <- approx_posterior$sample()
approx_posterior_sample
tf.Tensor(
[[ 5.77791929e-01 -1.64988488e-02]
 [ 7.93901443e-01 -1.00042784e+00]
 [-1.56279251e-01 -4.06365871e-01]
 ...
 ...
 [-6.47531569e-01  2.10889503e-02]], shape=(256, 2), dtype=float32)

We don't know about you, but we still enjoy the ease of inspecting values with eager execution – a lot.

Now, on to the decoder, which also returns a distribution instead of a tensor.

Decoder

In the decoder, we see why transformations between batch shape and event shape are useful.
The output of self$deconv3 is four-dimensional. What we need is an on-off probability for every pixel.
Previously, this was accomplished by feeding the tensor into a dense layer and applying a sigmoid activation.
Here, we use tfd$Independent to effectively transform the tensor into a probability distribution over three-dimensional images (width, height, channel(s)).

decoder_model <- function(name = NULL) {
  
  keras_model_custom(name = name, function(self) {
    
    self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
    self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
    self$deconv1 <-
      layer_conv_2d_transpose(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        padding = "identical",
        activation = "relu"
      )
    self$deconv2 <-
      layer_conv_2d_transpose(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        padding = "identical",
        activation = "relu"
      )
    self$deconv3 <-
      layer_conv_2d_transpose(
        filters = 1,
        kernel_size = 3,
        strides = 1,
        padding = "identical"
      )
    
    function (x, mask = NULL) {
      x <- x %>%
        self$dense() %>%
        self$reshape() %>%
        self$deconv1() %>%
        self$deconv2() %>%
        self$deconv3()
      
      tfd$Independent(tfd$Bernoulli(logits = x),
                      reinterpreted_batch_ndims = 3L)
      
    }
  })
}

Let's try this out too.

decoder <- decoder_model()
decoder_likelihood <- decoder(approx_posterior_sample)
decoder_likelihood
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(256,), event_shape=(28, 28, 1), dtype=int32
)

This distribution will be used to generate the "reconstructions," as well as to determine the log likelihood of the original samples.
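As a sketch of both uses (shapes assume the batch of 256 images from above):

# log likelihood of the original batch under the decoder distribution: shape (256,)
decoder_likelihood$log_prob(x)

# one way to obtain "reconstructions": the per-pixel Bernoulli means, shape (256, 28, 28, 1)
decoder_likelihood$mean()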

KL loss and optimizer

Both VAEs discussed below will need an optimizer …

optimizer <- tf$train$AdamOptimizer(1e-4)

… and both will delegate to compute_kl_loss to compute the KL part of the loss.

This helper function simply subtracts the log likelihood of the samples under the prior from their log likelihood under the approximate posterior.

compute_kl_loss <- function(
  latent_prior,
  approx_posterior,
  approx_posterior_sample) {
  
  kl_div <- approx_posterior$log_prob(approx_posterior_sample) -
    latent_prior$log_prob(approx_posterior_sample)
  avg_kl_div <- tf$reduce_mean(kl_div)
  avg_kl_div
}

Now that we've looked at the common parts, let's first discuss how to train a VAE with a static prior.

In this VAE, we use TFP to create the usual isotropic Gaussian prior.
We then directly sample from this distribution in the training loop.

latent_prior <- tfd$MultivariateNormalDiag(
  loc  = tf$zeros(list(latent_dim)),
  scale_identity_multiplier = 1
)

And here is the complete training loop. We'll point out the important TFP-related steps below.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <-  iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
 
  })
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% spherical(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% spherical(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% spherical(4)} whole"
    ),
    "n"
  )
}

Above, playing around with the encoder and the decoder, we've already seen how

approx_posterior <- encoder(x)

gives us a distribution we can sample from. We use it to obtain samples from the approximate posterior:

approx_posterior_sample <- approx_posterior$sample()

We take these samples and feed them to the decoder, which gives us on-off likelihoods for the image pixels.

decoder_likelihood <- decoder(approx_posterior_sample)

Now the loss consists of the usual ELBO components: reconstruction loss and KL divergence.
The reconstruction loss we obtain directly from TFP, using the learned decoder distribution to assess the likelihood of the original input.

nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)

The KL loss we get from compute_kl_loss, the helper function we saw above:

kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

We add both and arrive at the overall VAE loss:

loss <- kl_loss + avg_nll

Apart from these changes due to using TFP, the training process is just normal backprop, the way it looks using eager execution.

Now let's see how, instead of using the standard isotropic Gaussian, we could learn a mixture of Gaussians.
The choice of the number of distributions here is pretty arbitrary. Just as with latent_dim, you may want to experiment and find out what works best on your dataset.

mixture_components <- 16

learnable_prior_model <- function(name = NULL, latent_dim, mixture_components) {
  
  keras_model_custom(name = name, function(self) {
    
    self$loc <-
      tf$get_variable(
        name = "loc",
        shape = list(mixture_components, latent_dim),
        dtype = tf$float32
      )
    self$raw_scale_diag <- tf$get_variable(
      name = "raw_scale_diag",
      shape = c(mixture_components, latent_dim),
      dtype = tf$float32
    )
    self$mixture_logits <-
      tf$get_variable(
        name = "mixture_logits",
        shape = c(mixture_components),
        dtype = tf$float32
      )
      
    function (x, mask = NULL) {
        tfd$MixtureSameFamily(
          components_distribution = tfd$MultivariateNormalDiag(
            loc = self$loc,
            scale_diag = tf$nn$softplus(self$raw_scale_diag)
          ),
          mixture_distribution = tfd$Categorical(logits = self$mixture_logits)
        )
      }
    })
  }

In TFP terminology, components_distribution is the underlying distribution type, and mixture_distribution holds the probabilities that individual components are chosen.

Note how self$loc, self$raw_scale_diag and self$mixture_logits are TensorFlow Variables and thus, persistent and updatable by backprop.

Now we create the model.

latent_prior_model <- learnable_prior_model(
  latent_dim = latent_dim,
  mixture_components = mixture_components
)

How do we obtain a latent prior distribution we can sample from? A bit unusually, this model will be called without an input:

latent_prior <- latent_prior_model(NULL)
latent_prior
tfp.distributions.MixtureSameFamily(
"MixtureSameFamily/", batch_shape=(), event_shape=(2,), dtype=float32
)
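As a quick check, sampling from this prior should give latent vectors of dimension latent_dim:

latent_prior$sample(3L)  # tf.Tensor of shape (3, 2)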

Here now is the complete training loop. Note how we have a third model to backprop through.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <-  iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      latent_prior <- latent_prior_model(NULL)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    prior_gradients <-
      tape$gradient(loss, latent_prior_model$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      prior_gradients, latent_prior_model$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    
  })
  
  # note: checkpoint / checkpoint_prefix are set up in the complete example referred to above
  checkpoint$save(file_prefix = checkpoint_prefix)
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% spherical(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% spherical(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% spherical(4)} whole"
    ),
    "n"
  )
}  

And that's it! For us, both VAEs yielded comparable results, and we didn't see great differences when experimenting with latent dimensionality and the number of mixture components. But again, we wouldn't want to generalize to other datasets, architectures, and so on.

Speaking of results, how do they look? Here we see letters generated after 40 epochs of training. On the left are random letters; on the right, the usual VAE grid display of latent space.

Hopefully, we've succeeded in showing that TensorFlow Probability, eager execution, and Keras make for an attractive combination! If you relate the total amount of code required to the complexity of the task, as well as the depth of the concepts involved, this should appear as a pretty concise implementation.

In the near future, we plan to follow up with more involved applications of TensorFlow Probability, mostly from the area of representation learning. Stay tuned!

Clanuwat, Tarin, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. "Deep Learning for Classical Japanese Literature." December 3, 2018. https://arxiv.org/abs/cs.CV/1812.01718.
