Bayesian Statistics model comparison

rstats
modeling
bayesian
Published

March 22, 2023

Overview

In this post I’ll go through some differences between Bayesian statistical packages in R. Bayesian statistics involves probabilities. This means that the probability of an event to occur is considered in the modeling procedure, and is mainly used in for making inferences, and can be used for an analysis of the speculation of the root cause of a phenomenon under the term of causal inference.

In more details, when Bayesian statistics is performed, the response variable is tested against (causal) predictors with the application of suited prior distributions, and the use of the likelihood function, to finally produce a posterior distribution which should be as much as possible close to the real future outcome of the response variable distribution.

The prior distribution is the starting point; it is the probability distribution on which the future outcome is linked to, such as the probability to have a Girl given the probability to have had a Boy.

\[P( \text{ Girl } | \text{ Boy })\]

The probability to have had a Boy is the prior, while the conditional probability to have a Girl is the posterior.

Briefly, here is a comparison between different R packages that use Bayesian inference for the calculation of the model probability distribution of the posterior.

The Stan model engine, for model replication and prediction is used in conjunction with the Montecarlo simulation technique for the best model solution. The Stan model engine is applied in the following packages:

  • brms
  • rstanarm
  • rethinking
  • MCMCglmm

All of these packages adapt and adjust different model options for a modeling procedure which is the result of the best combination of efficiency to increase productivity and effectiveness, to identify and remove unnecessary steps, automate repetitive tasks, and utilize the most suitable software tools.

This is the original source code that I have updated: https://www.jstatsoft.org/article/view/v080i01

A wide range of distributions and link functions are supported, allowing users to fit - among others - linear, robust linear, binomial, Poisson, survival, ordinal, zero-inflated, hurdle, and even non-linear models all in a multilevel context. (The Brms package)

Loading required packages

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: Rcpp
Loading 'brms' package (version 2.20.4). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').

Attaching package: 'brms'

The following object is masked from 'package:stats':

    ar
This is rstanarm version 2.26.1
- See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
- Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
- For execution on a local, multicore CPU with excess RAM we recommend calling
  options(mc.cores = parallel::detectCores())

Attaching package: 'rstanarm'

The following objects are masked from 'package:brms':

    dirichlet, exponential, get_y, lasso, ngrps
library("rethinking") 
Loading required package: rstan
Loading required package: StanHeaders

rstan version 2.32.3 (Stan version 2.26.1)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)


Attaching package: 'rstan'

The following object is masked from 'package:tidyr':

    extract

Loading required package: cmdstanr
This is cmdstanr version 0.5.3
- CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
- CmdStan path: /Users/macintoshhd/.cmdstan/cmdstan-2.31.0
- CmdStan version: 2.31.0
Loading required package: parallel
rethinking (Version 2.31)

Attaching package: 'rethinking'

The following object is masked from 'package:rstan':

    stan

The following objects are masked from 'package:rstanarm':

    logit, se

The following objects are masked from 'package:brms':

    LOO, stancode, WAIC

The following object is masked from 'package:purrr':

    map

The following object is masked from 'package:stats':

    rstudent
library("MCMCglmm")  
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loading required package: coda

Attaching package: 'coda'

The following object is masked from 'package:rstan':

    traceplot

Loading required package: ape

Attaching package: 'ape'

The following object is masked from 'package:dplyr':

    where


Attaching package: 'MCMCglmm'

The following object is masked from 'package:brms':

    me

Helper function to better compute the effective sample size

eff_size <- function(x) {
  if (is(x, "brmsfit")) {
    samples <- as.data.frame(x$fit)
  } else if (is(x, "stanreg")) {
    samples <- as.data.frame(x$stanfit)
  } else if (is(x, "ulam")) {
    samples <- as.data.frame(x@stanfit)
  } else if (is(x, "stanfit")) {
    samples <- as.data.frame(x)
  } else if (is(x, "MCMCglmm")) {
    samples <- cbind(x$Sol, x$VCV)
  } else {
    stop("invalid input")
  }
  # call an internal function of rstan
  floor(apply(samples, MARGIN = 2, FUN = rstan:::ess_rfun))
}

Compare efficiency between packages

# only used for Stan packages
iter <- 6000  
warmup <- 1000
chains <- 1
adapt_delta <- 0.8

# only used for MCMCglmm
nitt <- 35000  
burnin <- 10000
thin <- 5
# leads to 5000 posterior samples

Dyestuff

brms

prior_dye_brms <- c(set_prior("normal(0, 2000)", class = "Intercept"),
                    set_prior("cauchy(0, 50)", class = "sd"),
                    set_prior("cauchy(0, 50)", class = "sigma"))

dye_brms <- brm(Yield ~ 1 + (1 | Batch), 
                data = lme4::Dyestuff, 
                prior = prior_dye_brms, 
                chains = 0)
Compiling Stan program...
Trying to compile a simple C file
Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.2/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include '/Library/Frameworks/R.framework/Versions/4.2/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
In file included from <built-in>:1:
In file included from /Library/Frameworks/R.framework/Versions/4.2/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
In file included from /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/Dense:1:
In file included from /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/Core:88:
/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
namespace Eigen {
^
/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
namespace Eigen {
               ^
               ;
In file included from <built-in>:1:
In file included from /Library/Frameworks/R.framework/Versions/4.2/Resources/library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
In file included from /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/Dense:1:
/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
#include <complex>
         ^~~~~~~~~
3 errors generated.
make: *** [foo.o] Error 1
Start sampling
the number of chains is less than 1; sampling not done
time_dye_brms <- system.time(capture.output(
  dye_brms <- update(dye_brms, 
                     iter = iter, 
                     warmup = warmup, 
                     chains = chains,
                     control = list(adapt_delta = adapt_delta))
))
Start sampling
# summary(dye_brms)
eff_dye_brms <- min(eff_size(dye_brms)) / time_dye_brms[[1]]

rstanarm

time_dye_rstanarm <- system.time(capture.output(
  dye_rstanarm <- stan_glmer(Yield ~ 1 + (1 | Batch), data = lme4::Dyestuff,
                             prior_intercept = normal(0, 2000),
                             iter = iter, warmup = warmup, chains = chains,
                             adapt_delta = adapt_delta)
))
Warning: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
# summary(dye_rstanarm)
eff_dye_rstanarm <- min(eff_size(dye_rstanarm)) / time_dye_rstanarm[[1]]

rethinking

d <-  lme4::Dyestuff

dat <- list(
  Yield = d$Yield,
  Batch = d$Batch
)

dye_flist <- alist(
  Yield ~ dnorm(eta, sigma),
  eta <- a + a_Batch[Batch],
  a ~ dnorm(0,2000),
  a_Batch[Batch] ~ dnorm(0, sd_Batch),
  sigma ~ dcauchy(0, 50),
  sd_Batch ~ dcauchy(0, 50))

dye_rethinking <- ulam(dye_flist, 
                       data = dat, 
                       chains=1,
                       cores = 4,
                       sample = TRUE)

time_dye_rethinking <- system.time(capture.output(
  dye_rethinking <- update(dye_rethinking,
       iter = iter, 
       warmup = warmup, 
       chains = chains,
       control = list(adapt_delta = adapt_delta))
))


# summary(dye_rethinking)
eff_dye_rethinking <- min(eff_size(dye_rethinking)) / time_dye_rethinking[[1]]

MCMCglmm

time_dye_MCMCglmm <- system.time(capture.output(
  dye_MCMCglmm <- MCMCglmm(Yield ~ 1, 
                           random = ~ Batch, data = lme4::Dyestuff, 
                           thin = thin, nitt = nitt, burnin = burnin)
))
# summary(dye_MCMCglmm)
eff_dye_MCMCglmm <- min(eff_size(dye_MCMCglmm)) / time_dye_MCMCglmm[[1]]

Efficiency

print(c(brms = eff_dye_brms, 
        rstanarm = eff_dye_rstanarm, 
        rethinking = eff_dye_rethinking, 
        MCMCglmm = eff_dye_MCMCglmm))
      brms   rstanarm rethinking   MCMCglmm 
 545.82485  151.78571 4120.30075   98.28571 
brms rstanarm rethinking MCMCglmm
559.55398 202.97177 3660.71429 34.11514

(to be continued…)

Back to top