BCM – Bayesian analysis of Computational Models using samplers


BCM is a C++ software package for the Bayesian analysis of computational models. It provides efficient implementations of eleven sampling algorithms for generating posterior samples and calculating marginal likelihoods. Additional tools are included which facilitate the process of specifying models and visualizing the sampling output. This toolkit can be used for analyzing the uncertainty in the parameters and the predictions of computational models using Bayesian statistics.

The toolkit is further described in the manuscript: BCM: toolkit for Bayesian analysis of Computational Models using samplers.

If you have any questions, comments or suggestions, please feel free to contact us at b.thijssen@nki.nl.

Download

Version 1.0.4 [download .tar.gz or .zip] — this is the release used in the BCM publication.

Version 2.0.1 [download .tar.gz or .zip] — new release of 2017-04-20; adds support for SBML, NetCDF and some ODE functionality using CVode, as well as many other changes.

Prerequisites

For Linux users: these prerequisites are usually available through your distribution’s package manager.

For Mac users: CMake and Boost can be most easily installed using Homebrew.

BCM has the following prerequisites:

  • CMake – version 3.2 or newer
  • Boost – tested with 1.54.0 and 1.59.0
    • Windows users may want to use the precompiled binaries at http://sourceforge.net/projects/boost/files/boost-binaries/. Unfortunately, CMake (version 3.3.2) does not understand the directory structure of these precompiled binaries. If you use them, you will need to create a “/stage/lib” subdirectory in the Boost folder and copy the library files there, create a symbolic link, or adjust CMake to find the libraries.
  • C++ compiler with C++11 support (e.g. GCC 4.8 or newer / Visual C++ 2010 or newer)
  • For loading the sampler output into R, the CRAN package XML is required. The example plots additionally require the packages gplots and RColorBrewer.
  • Optional:
    • NetCDF-4 – tested with version 4.4.1
    • libSBML – tested with version 5.11.4

Installation instructions

Windows / Visual studio

  • Create a BOOST_ROOT environment variable which points to the root directory of the Boost library
  • Create a BCM_ROOT environment variable which points to the root directory of BCM. This is only necessary if you wish to use the model parsing tool.
  • Create a subdirectory “build” in the bcm folder
  • Run CMake from the build directory, either using cmake .. in a command prompt, or using the CMake GUI tool.
  • Build the bcm solution
  • The binaries will be in %BCM_ROOT%/bin/Release

Linux & Mac OS X

  • Create a BCM_ROOT environment variable which points to the root directory of BCM. This is only necessary if you wish to use the model parsing tool.
  • If Boost is not installed into global include/lib folders, create a BOOST_ROOT environment variable which points to the root directory of the Boost library
  • Create a subdirectory “build” in the bcm folder
  • Run CMake from the build directory, either using cmake .. in a terminal, or using the CMake GUI tool.
  • Build bcm using make
  • The binaries will be in $BCM_ROOT/bin


Running BCM

There are three main ways of using BCM:

  1. Using a model description file and the model parsing tool
  2. Using a dynamic library for likelihood evaluation and an XML prior description
  3. Calling the samplers directly from C++ code

1. Using a model description file and the model parsing tool

An example for how to use the model parsing tool is located in the src/examples/GaussianShellsUsingMdlParse directory.

  • Parse & compile the model description file
    • Windows: %BCM_ROOT%\scripts\make_model.bat shells10D.txt
    • Linux/Mac: $BCM_ROOT/scripts/make_model.sh shells10D.txt
  • Run an inference with a specified sampling algorithm
    • Windows: %BCM_ROOT%\bin\Release\mdlinf shells10D -c config_smc.txt
    • Linux/Mac: $BCM_ROOT/bin/mdlinf shells10D -c config_smc.txt
  • The inference output will be stored in a subdirectory of the model folder, which is specified in the configuration file
  • The plots.r file shows how to load the sampling output into R and how to generate some basic plots from this output.

2. Using a dynamic library for likelihood evaluation and an XML prior description

An example for a basic dynamic library is provided in the src/examples/LikelihoodDynamicLibrary directory.

An inference always consists of two files: a DLL/.so file which contains the likelihood function, and an XML file describing the parameters to be inferred and their prior distributions.

In the DLL/so, there are two functions to be implemented:

  • EvaluateLikelihood: this function receives the variable values and should return the log-likelihood
  • GetOutput: this function can be used to store simulation output along with the sampled variable values

Both the DLL and XML file should have the same base file name, e.g. my_model.dll/so and my_model.xml. Furthermore, both files need to be stored in a folder with that same base name, my_model. The configuration of the sampling algorithm to be used can be specified in a configuration file. This gives the following directory structure:

some_path\
  config.txt
  my_model\my_model.dll  [or .so on Linux/Mac]
  my_model\my_model.xml

You can then run the inference as follows:

  • Windows: %BCM_ROOT%\bin\Release\mdlinf my_model -c config.txt
  • Linux/Mac: $BCM_ROOT/bin/mdlinf my_model -c config.txt

If you wish to run BCM using multiple threads, the two functions in the DLL need to be re-entrant. A thread index is supplied which can be used to store local data; the total number of threads that will be used is supplied to the DLL at startup as well. The easiest way to ensure thread safety may be to allocate several model instances as shown in the example.

3. Calling the samplers directly from C++ code

There is a C++ example in the src/examples/GaussianShells directory which shows how the sampling toolkit can be used from inside a C++ application.


CPU & Memory usage

The CPU & memory usage of BCM depends on which algorithm is selected for the inference.

The CPU usage is controlled through the --numthreads (-j) option. The default setting is to use the number of hardware threads available on the current machine. For the algorithms based on parallel tempering, the number of threads that can be used is limited to the number of parallel chains, so in that case there is no benefit in increasing the number of threads beyond the number of parallel chains.

The memory usage is proportional to the number of variables that are being sampled. For the algorithms based on parallel tempering, only summary statistics and a buffer of 100 samples are stored in memory; this class of algorithms therefore usually has the lowest memory usage. For the algorithms based on nested sampling, the current population of samples is stored in memory, so the memory usage is proportional to number_of_variables * population_size. For the algorithms based on sequential Monte Carlo, both the current and the previous population of samples are stored in memory, so the memory usage is proportional to 2 * number_of_variables * population_size.


Reference

mdlinf

Basic usage:

mdlinf model_name -c config_file

There should be a subdirectory with the model’s name, e.g. “my_model”. This subdirectory should contain a dynamic library for the likelihood named “my_model.dll/so”, and an XML file for the prior named “my_model.xml”. Please see the examples for how the XML prior file should be formatted.

All options can be supplied either on the command line or in a configuration file. The configuration file follows the Boost Program Options format. For example, there is an option “--sampler.type”. This option can be specified as such on the command line, for example “--sampler.type=smc”. Alternatively, it can be specified in a configuration file; in this case there should be a section “[sampler]” in the configuration file, and in this section the option can be specified as “type=smc”. It is usually easiest to specify most options in a configuration file. Some example configuration files are given in the examples folder. Note that options given on the command line override the configuration file.
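Putting this together, a minimal config.txt for an SMC run might look as follows (the values are illustrative; the option names are taken from the option list in this section):

```
# config.txt -- illustrative SMC configuration

[output]
folder=output
posterior_format=tsv

[sampler]
type=smc
population_size=1000
temperature_schedule=adaptive
smc_strategy=mcmc_global
```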

The most important options are the following:

  • --sampler.type=[mh/pt/smc/nested]  This selects which class of sampling algorithms to use: Metropolis-Hastings, Parallel Tempering, Sequential Monte Carlo or Nested sampling, respectively.
  • For parallel tempering:
    • --sampler.burnin_period=x
    • --sampler.sampling_period=x
    • --sampler.use_every_nth=x
    • --sampler.numchains=x
    • --sampler.proposal_type=[scalar/vector/autoblock]
  • For sequential Monte Carlo:
    • --sampler.population_size=x
    • --sampler.temperature_schedule=[fixed_linear/fixed_loglinear/adaptive]
    • --sampler.smc_strategy=[mcmc_global/mcmc_global_covariance/mcmc_local/kde_local]
  • For nested sampling:
    • --sampler.population_size=x
    • --sampler.nested_num_posterior_samples=x
    • --sampler.nested_sampling_strategy=[prior/ellipsoid/multinest/mcmc]
    • --sampler.termination_criterion=[fixed/ntimesh/remaining_evidence]

Below is the full list of options:

Generic options:
  --help                       produce help message
  -j [ --numthreads ] arg (=0) number of threads to use; set to 0 for using all available hardware concurrency

Command-line only options:
  -c [ --config_file ] arg (=config.txt) Configuration file

Configuration:
  --model arg (=model)                                     Name of the model, determines the filename: 
                                                           library will be [x].dll/.so, parameters will
                                                           be [x].xml
  --output.folder arg (=output)                            Folder in which output files will be 
                                                           generated, subfolder of the model directory.
  --output.posterior_format arg (=bin)                     Format of the posterior samples, can be 
                                                           either "bin" or "tsv".
  --sampler.type arg                                       Sampler type, can be "mh", "pt", "smc" or 
                                                           "nested".
  --sampler.burnin_period arg (=1000)                      [MH/PT] Number of samples for burnin-period,
                                                           these are discarded.
  --sampler.sampling_period arg (=5000)                    [MH/PT] Number of samples to produce, after 
                                                           the burnin period.
  --sampler.use_every_nth arg (=1)                         [MH/PT] Subsampling, a value of 10 means 
                                                           that every 10th sample is used and the 9 
                                                           samples in between are discarded.
  --sampler.num_mcmc_moves_per_iteration arg (=1)          [MH/PT] Number of mcmc moves to make per 
                                                           iteration, mainly useful for decreasing the 
                                                           number of context switches when using 
                                                           multiple threads.
  --sampler.proposal_scaling arg (=1)                      [MH] Scaling factor for the proposal 
                                                           distribution.
  --sampler.numchains arg (=8)                             [PT] Number of chains
  --sampler.min_temperature arg (=0)                       [PT] Temperature of the lowest chain. Note 
                                                           that the evidence calculation is not correct
                                                           if this is not set to 0.
  --sampler.adapt_proposal arg (=2500)                     [PT] Number of samples after which the 
                                                           proposal variance should be adapted, 0 for 
                                                           no adaptation.
  --sampler.exchange_probability arg (=0.5)                [PT] Probability of choosing an exchange 
                                                           step instead of a MH-move.
  --sampler.temperatures arg                               [PT] Filename specifying a file containing 
                                                           the temperatures to be used for each chain, 
                                                           if empty use the default temperature 
                                                           schedule.
  --sampler.initial_temperature_power arg (=5)             [PT] Specifies the rate of the power-law 
                                                           used for the initial temperature schedule, 
                                                           used if min_temperature=0.
  --sampler.adapt_temperatures_times arg (=0)              [PT] Number of iterations of temperature 
                                                           adaptation steps.
  --sampler.adapt_temperatures_period arg (=1000)          [PT] Number of samples to produce during 
                                                           each temperature adaptation step.
  --sampler.proposal_type arg (=autoblock)                 [PT] Proposal type, can be either: scalar, 
                                                           vector, autoblock
  --sampler.population_size arg (=1000)                    [SMC/Nested] Number of particles.
  --sampler.start_temperature arg (=0)                     [SMC] Starting temperature, between 0 and 1.
  --sampler.end_temperature arg (=1)                       [SMC] End temperature, between 0 and 1, 
                                                           typically 1.
  --sampler.temperature_schedule arg (=adaptive)           [SMC] Schedule for the temperatures, can be 
                                                           "fixed_linear", "fixed_loglinear" or 
                                                           "adaptive".
  --sampler.temperature_schedule_adaptive_ratio arg (=1.0) [SMC] Ratio specifying the target of the 
                                                           next population's ESS, [0-1). Only 
                                                           meaningful if temperature_schedule=adaptive
  --sampler.temperature_schedule_fixed_number arg (=1000)  [SMC] The number of iterations in a fixed 
                                                           temperature schedule. Only meaningful if 
                                                           temperature_schedule=fixed_linear or 
                                                           fixed_loglinear.
  --sampler.resampling_threshold arg (=0.5)                [SMC] Threshold for resampling; the 
                                                           population will be resampled if ESS < 
                                                           threshold * population_size.
  --sampler.smc_strategy arg (=mcmc_global)                [SMC] Strategy for sampling the next 
                                                           population, can be "mcmc_global", 
                                                           "mcmc_global_covariance", "mcmc_local", 
                                                           "kde_local".
  --sampler.smc_mcmc_steps arg (=1)                        [SMC] Number of MCMC steps per population 
                                                           iteration. Only meaningful if 
                                                           smc_strategy=mcmc_global or mcmc_local.
  --sampler.smc_mcmc_recalculate arg (=20)                 [SMC] After how many iterations to 
                                                           recalculate the proposal distribution. Only 
                                                           meaningful if smc_strategy=mcmc_global or 
                                                           mcmc_local.
  --sampler.smc_mcmc_local_num_nearby arg (=2)             [SMC] Number of nearby particles to 
                                                           calculate local proposal as a multiple of 
                                                           the dimensionality. Only meaningful if 
                                                           smc_strategy=mcmc_local.
  --sampler.smc_mcmc_max_correlation arg (=1)              [SMC] Run MCMC until the correlation of all 
                                                           marginal parameters between successive steps
                                                           is less than or equal to this threshold, in 
                                                           chunks of smc_mcmc_steps.
  --sampler.sample_weights arg (=1)                        [Nested] Specify whether to calculate the 
                                                           weights by sampling ("true") or by 
                                                           expectation value ("false"); sampling the 
                                                           weights allows an estimation of the 
                                                           uncertainty in the evidence approximation.
  --sampler.termination_criterion arg (=remaining_evidence) [Nested] Termination criterion, can be 
                                                           "fixed", "ntimesh" or "remaining_evidence".
  --sampler.termination_num_steps arg (=1000)              [Nested] Number of steps. Only meaningful if
                                                           termination_criterion=fixed.
  --sampler.termination_remaining_evidence arg (=0.01)     [Nested] Stop sampling when the current 
                                                           population can at most contribute this much 
                                                           to the evidence term. Only meaningful if 
                                                           termination_criterion=remaining_evidence.
  --sampler.termination_ntimesh_factor arg (=2.0)          [Nested] Stop sampling when the number of 
                                                           steps exceeds this factor * population_size 
                                                           * H. Only used if 
                                                           termination_criterion=ntimesh.
  --sampler.nested_sampling_strategy arg (=multinest)      [Nested] Strategy to use for proposing new 
                                                           samples, can be "prior", "ellipsoid",  
                                                           "multinest", or "mcmc".
  --sampler.nested_num_posterior_samples arg (=1000)       [Nested] Number of posterior samples to be 
                                                           generated, can be different from 
                                                           population_size
  --sampler.rngseed arg (=0)                               Random number generator seed, set to 0 to 
                                                           use an arbitrary seed based on the computer 
                                                           time.