Here we provide a brief tutorial of the BayesChange package. The package contains two main functions: one that performs change points detection on univariate and multivariate time series, and one that performs clustering of time series and survival functions with common change points. Below we briefly show how to use both.
The function detect_cp provides a method for detecting change points on univariate and multivariate time series. It is based on the work of Martínez and Mena (2014) and of Corradin, Danese, and Ongaro (2022).
Depending on the structure of the data, detect_cp performs change points detection on univariate or multivariate time series. For example, we can create a vector of 100 observations where the first 50 are sampled from a normal distribution with mean 0 and variance 0.1, and the remaining 50 from a normal distribution with mean 0 but variance 0.25.
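A minimal sketch of this simulation (the seed and the name data_uni are our choices, matching the call below; note that rnorm() takes the standard deviation as its third argument, so we pass the square roots of the stated variances):
set.seed(42)
data_uni <- as.numeric(c(rnorm(50, 0, sqrt(0.10)), rnorm(50, 0, sqrt(0.25))))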
Now we can run the function detect_cp. As arguments we need to specify the number of iterations, the number of burn-in steps and a list with the autoregressive coefficient phi for the likelihood of the data, the parameters a, b and c for the priors and the probability q of performing a split at each step.
out <- detect_cp(data = data_uni,
                 n_iterations = 1000, n_burnin = 100,
                 params = list(q = 0.25, phi = 0.1, a = 1, b = 1, c = 0.1))
#> Completed: 100/1000 - in 0.011 sec
#> Completed: 200/1000 - in 0.02 sec
#> Completed: 300/1000 - in 0.029 sec
#> Completed: 400/1000 - in 0.039 sec
#> Completed: 500/1000 - in 0.048 sec
#> Completed: 600/1000 - in 0.058 sec
#> Completed: 700/1000 - in 0.067 sec
#> Completed: 800/1000 - in 0.076 sec
#> Completed: 900/1000 - in 0.086 sec
#> Completed: 1000/1000 - in 0.095 sec
With the methods print and summary we can get information about the algorithm.
print(out)
#> DetectCpObj object
#> Type: change points detection on univariate time series
summary(out)
#> DetectCpObj object
#> Detecting change points on an univariate time series:
#> Number of burn-in iterations: 100
#> Number of MCMC iterations: 900
#> Computational time: 0.1 seconds
In order to get a point estimate of the change points we can use the method posterior_estimate, which uses the salso method by Dahl, Johnson, and Müller (2022) to get the final latent order and then detect the change points.
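For example, on the univariate fit above (using the Binder loss, as in the later calls of this tutorial):
posterior_estimate(out, loss = "binder")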
The package also provides a method for plotting the change points.
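For instance, mirroring the plot call used in the multivariate example below:
plot(out, loss = "binder")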
If we instead define a matrix of data, detect_cp automatically performs multivariate change points detection.
# three 100-step series, each with a mean shift at time 50
data_multi <- matrix(NA, nrow = 3, ncol = 100)
data_multi[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_multi[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_multi[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
Arguments k_0, nu_0, phi_0, m_0, par_theta_c, par_theta_d and prior_var_gamma correspond to the parameters of the prior distributions for the multivariate likelihood.
out <- detect_cp(data = data_multi, n_iterations = 1000, n_burnin = 100,
                 params = list(q = 0.25, k_0 = 0.25, nu_0 = 4, phi_0 = diag(1,3,3),
                               m_0 = rep(0,3), par_theta_c = 2, par_theta_d = 0.2,
                               prior_var_gamma = 0.1))
#> Completed: 100/1000 - in 0.013 sec
#> Completed: 200/1000 - in 0.027 sec
#> Completed: 300/1000 - in 0.041 sec
#> Completed: 400/1000 - in 0.054 sec
#> Completed: 500/1000 - in 0.068 sec
#> Completed: 600/1000 - in 0.082 sec
#> Completed: 700/1000 - in 0.095 sec
#> Completed: 800/1000 - in 0.109 sec
#> Completed: 900/1000 - in 0.124 sec
#> Completed: 1000/1000 - in 0.138 sec
table(posterior_estimate(out, loss = "binder"))
#> Warning in salso::salso(mcmc_chain, loss = "binder", maxNClusters =
#> maxNClusters, : The number of clusters equals the default maximum possible
#> number of clusters.
#>
#> 1 2
#> 50 50
plot(out, loss = "binder")
#> Warning in salso::salso(mcmc_chain, loss = "binder", maxNClusters =
#> maxNClusters, : The number of clusters equals the default maximum possible
#> number of clusters.
BayesChange contains another function, clust_cp, that clusters univariate and multivariate time series, as well as survival functions, with common change points. Details about these methods can be found in Corradin et al. (2024).
In clust_cp the argument kernel must be specified: if data are time series, then kernel = "ts" must be set. The algorithm then automatically detects whether the data are univariate or multivariate.
If the time series are univariate, data must be a matrix where each row is a time series.
# five series: three with a change point at time 50, two at time 25
data_mat <- matrix(NA, nrow = 5, ncol = 100)
data_mat[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_mat[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_mat[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_mat[4,] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_mat[5,] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
Arguments that need to be specified in clust_cp are the number of iterations n_iterations, the number of elements B for the estimation of the normalisation constant, the number of split-and-merge steps L performed when a new partition is proposed, and a list with the parameters of the algorithm, the likelihood and the priors.
out <- clust_cp(data = data_mat, n_iterations = 1000, n_burnin = 100,
                kernel = "ts",
                params = list(B = 1000, L = 1, gamma = 0.5))
#> Normalization constant - completed: 100/1000 - in 0.007 sec
#> Normalization constant - completed: 200/1000 - in 0.015 sec
#> Normalization constant - completed: 300/1000 - in 0.024 sec
#> Normalization constant - completed: 400/1000 - in 0.031 sec
#> Normalization constant - completed: 500/1000 - in 0.039 sec
#> Normalization constant - completed: 600/1000 - in 0.047 sec
#> Normalization constant - completed: 700/1000 - in 0.054 sec
#> Normalization constant - completed: 800/1000 - in 0.062 sec
#> Normalization constant - completed: 900/1000 - in 0.07 sec
#> Normalization constant - completed: 1000/1000 - in 0.077 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 100/1000 - in 0.072 sec
#> Completed: 200/1000 - in 0.143 sec
#> Completed: 300/1000 - in 0.213 sec
#> Completed: 400/1000 - in 0.287 sec
#> Completed: 500/1000 - in 0.359 sec
#> Completed: 600/1000 - in 0.43 sec
#> Completed: 700/1000 - in 0.501 sec
#> Completed: 800/1000 - in 0.572 sec
#> Completed: 900/1000 - in 0.644 sec
#> Completed: 1000/1000 - in 0.715 sec
posterior_estimate(out, loss = "binder")
#> [1] 1 2 3 4 4
The plot method for clustering univariate time series represents the data colored according to the assigned cluster.
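For example (a sketch; plot dispatches on the object returned by clust_cp, here with default settings):
plot(out)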
If the time series are multivariate, data must be an array where each element is a multivariate time series represented by a matrix. Each row of the matrix is a component of the time series.
# five 3-component series: the first three change at time 50, the last two at time 25
data_array <- array(data = NA, dim = c(3,100,5))
data_array[1,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[1,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[1,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[2,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[3,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[1,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[2,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[3,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[1,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[2,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[3,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
out <- clust_cp(data = data_array, n_iterations = 1000, n_burnin = 100,
                kernel = "ts",
                params = list(B = 1000, L = 1, gamma = 0.1, k_0 = 0.25, nu_0 = 5,
                              phi_0 = diag(0.1,3,3), m_0 = rep(0,3)))
#> Normalization constant - completed: 100/1000 - in 0.008 sec
#> Normalization constant - completed: 200/1000 - in 0.016 sec
#> Normalization constant - completed: 300/1000 - in 0.024 sec
#> Normalization constant - completed: 400/1000 - in 0.032 sec
#> Normalization constant - completed: 500/1000 - in 0.04 sec
#> Normalization constant - completed: 600/1000 - in 0.048 sec
#> Normalization constant - completed: 700/1000 - in 0.056 sec
#> Normalization constant - completed: 800/1000 - in 0.064 sec
#> Normalization constant - completed: 900/1000 - in 0.072 sec
#> Normalization constant - completed: 1000/1000 - in 0.08 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 100/1000 - in 0.082 sec
#> Completed: 200/1000 - in 0.167 sec
#> Completed: 300/1000 - in 0.248 sec
#> Completed: 400/1000 - in 0.33 sec
#> Completed: 500/1000 - in 0.42 sec
#> Completed: 600/1000 - in 0.552 sec
#> Completed: 700/1000 - in 0.668 sec
#> Completed: 800/1000 - in 0.762 sec
#> Completed: 900/1000 - in 0.847 sec
#> Completed: 1000/1000 - in 0.921 sec
posterior_estimate(out, loss = "binder")
#> [1] 1 2 2 3 3
Finally, if we set kernel = "epi", clust_cp clusters survival functions with common change points. Details can again be found in Corradin et al. (2024).
Data must be a matrix where each row contains the number of infected individuals at each time. The package includes the function sim_epi_data, which simulates infection times.
# five epidemics: three with a change in the infection rate at time 25, two at time 10
data_mat <- matrix(NA, nrow = 5, ncol = 50)
betas <- list(c(rep(0.45, 25), rep(0.14, 25)),
              c(rep(0.55, 25), rep(0.11, 25)),
              c(rep(0.50, 25), rep(0.12, 25)),
              c(rep(0.52, 10), rep(0.15, 40)),
              c(rep(0.53, 10), rep(0.13, 40)))
inf_times <- list()
for(i in 1:5){
  inf_times[[i]] <- sim_epi_data(S0 = 10000, I0 = 10, max_time = 50,
                                 beta_vec = betas[[i]], gamma_0 = 1/8)
  # count the simulated infection times falling in each unit time interval
  counts <- table(floor(inf_times[[i]]))
  vec <- rep(0, 50)
  names(vec) <- as.character(1:50)
  for(j in 1:50){
    if(as.character(j) %in% names(counts)){
      vec[j] <- counts[which(names(counts) == j)]
    }
  }
  data_mat[i,] <- vec
}
In clust_cp we need to specify, besides the usual parameters, the number of Monte Carlo replications M for the approximation of the integrated likelihood and the recovery rate gamma.
out <- clust_cp(data = data_mat, n_iterations = 100, n_burnin = 10,
                kernel = "epi",
                params = list(M = 100, B = 1000, L = 1, q = 0.1, gamma = 1/8))
#> Normalization constant - completed: 10/100 - in 0.062 sec
#> Normalization constant - completed: 20/100 - in 0.123 sec
#> Normalization constant - completed: 30/100 - in 0.185 sec
#> Normalization constant - completed: 40/100 - in 0.247 sec
#> Normalization constant - completed: 50/100 - in 0.308 sec
#> Normalization constant - completed: 60/100 - in 0.37 sec
#> Normalization constant - completed: 70/100 - in 0.431 sec
#> Normalization constant - completed: 80/100 - in 0.492 sec
#> Normalization constant - completed: 90/100 - in 0.553 sec
#> Normalization constant - completed: 100/100 - in 0.614 sec
#>
#> ------ MAIN LOOP ------
#>
#> Completed: 10/100 - in 0.459 sec
#> Completed: 20/100 - in 0.941 sec
#> Completed: 30/100 - in 1.409 sec
#> Completed: 40/100 - in 1.883 sec
#> Completed: 50/100 - in 2.356 sec
#> Completed: 60/100 - in 2.86 sec
#> Completed: 70/100 - in 3.408 sec
#> Completed: 80/100 - in 3.936 sec
#> Completed: 90/100 - in 4.518 sec
#> Completed: 100/100 - in 5.064 sec
posterior_estimate(out, loss = "binder")
#> [1] 1 1 2 1 1