Unveiling Causal Relationships in Time Series Data

Stavros Stavroglou, Athanasios Pantelous, Hui Wang

This vignette demonstrates advanced techniques for examining causal relationships between time series using the patterncausality package. We will focus on three key aspects:

  1. Cross-validation methods: To rigorously assess the robustness of our findings, ensuring they are not mere artifacts of the data.
  2. Parameter optimization: To fine-tune our analysis for the most accurate and reliable results.
  3. Visualization of causality relationships: To provide clear and intuitive insights into the causal connections between time series.

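Through cross-validation, we aim to understand how the estimated causality components (positive, negative, and dark) behave as the sample size changes and whether they settle on stable values.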

Cross-Validation: Ensuring the Reliability of Causal Inference

To demonstrate the application of cross-validation, we will begin by importing a climate dataset from the patterncausality package.

library(patterncausality)
data(climate_indices)

Now, let’s apply cross-validation to evaluate the robustness of pattern causality. We will use the Pacific North American (PNA) and North Atlantic Oscillation (NAO) climate indices as our example time series.

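set.seed(123)  # make the random subsampling reproducible
X <- climate_indices$PNA
Y <- climate_indices$NAO
result <- pcCrossValidation(
  X = X,
  Y = Y,
  numberset = seq(100, 500, by = 10),  # sample sizes to evaluate
  E = 3,                               # embedding dimension
  tau = 2,                             # time delay
  metric = "euclidean",                # distance metric
  h = 1,                               # prediction horizon
  weighted = FALSE
)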
print(result$results)
#> , , positive
#> 
#>         value
#> 100 0.4444444
#> 110 0.3548387
#> 120 0.1851852
#> 130 0.3157895
#> 140 0.3157895
#> 150 0.4444444
#> 160 0.3571429
#> 170 0.3750000
#> 180 0.3469388
#> 190 0.2291667
#> 200 0.3478261
#> 210 0.3653846
#> 220 0.3000000
#> 230 0.3939394
#> 240 0.3230769
#> 250 0.2881356
#> 260 0.3166667
#> 270 0.3055556
#> 280 0.2753623
#> 290 0.3625000
#> 300 0.3382353
#> 310 0.3068182
#> 320 0.3690476
#> 330 0.2375000
#> 340 0.2727273
#> 350 0.2608696
#> 360 0.3409091
#> 370 0.3414634
#> 380 0.2826087
#> 390 0.3522727
#> 400 0.2980769
#> 410 0.3548387
#> 420 0.3238095
#> 430 0.2803738
#> 440 0.2844037
#> 450 0.3083333
#> 460 0.2905983
#> 470 0.2941176
#> 480 0.3120000
#> 490 0.2892562
#> 500 0.3030303
#> 
#> , , negative
#> 
#>          value
#> 100 0.16666667
#> 110 0.06451613
#> 120 0.29629630
#> 130 0.07894737
#> 140 0.18421053
#> 150 0.25925926
#> 160 0.21428571
#> 170 0.12500000
#> 180 0.14285714
#> 190 0.35416667
#> 200 0.17391304
#> 210 0.30769231
#> 220 0.32000000
#> 230 0.18181818
#> 240 0.10769231
#> 250 0.30508475
#> 260 0.13333333
#> 270 0.29166667
#> 280 0.31884058
#> 290 0.11250000
#> 300 0.11764706
#> 310 0.17045455
#> 320 0.13095238
#> 330 0.25000000
#> 340 0.23863636
#> 350 0.29347826
#> 360 0.17045455
#> 370 0.14634146
#> 380 0.23913043
#> 390 0.13636364
#> 400 0.25000000
#> 410 0.15053763
#> 420 0.16190476
#> 430 0.24299065
#> 440 0.21100917
#> 450 0.25000000
#> 460 0.22222222
#> 470 0.24369748
#> 480 0.22400000
#> 490 0.23140496
#> 500 0.22727273
#> 
#> , , dark
#> 
#>         value
#> 100 0.3888889
#> 110 0.5806452
#> 120 0.5185185
#> 130 0.6052632
#> 140 0.5000000
#> 150 0.2962963
#> 160 0.4285714
#> 170 0.5000000
#> 180 0.5102041
#> 190 0.4166667
#> 200 0.4782609
#> 210 0.3269231
#> 220 0.3800000
#> 230 0.4242424
#> 240 0.5692308
#> 250 0.4067797
#> 260 0.5500000
#> 270 0.4027778
#> 280 0.4057971
#> 290 0.5250000
#> 300 0.5441176
#> 310 0.5227273
#> 320 0.5000000
#> 330 0.5125000
#> 340 0.4886364
#> 350 0.4456522
#> 360 0.4886364
#> 370 0.5121951
#> 380 0.4782609
#> 390 0.5113636
#> 400 0.4519231
#> 410 0.4946237
#> 420 0.5142857
#> 430 0.4766355
#> 440 0.5045872
#> 450 0.4416667
#> 460 0.4871795
#> 470 0.4621849
#> 480 0.4640000
#> 490 0.4793388
#> 500 0.4696970

To better visualize the results, we will use the plot function to generate a line chart.

plot(result)

As the plot shows, the estimated causality components fluctuate considerably at small sample sizes and settle into a narrower band as the sample size increases. This suggests that the estimates reflect a consistent underlying pattern rather than sampling noise.
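The stabilization can also be checked numerically. The following base R sketch copies the positive-causality values for the six smallest and six largest sample sizes from the printed output above and compares their spread. The names early, late, and spread are introduced here for illustration only and are not part of the package.

```r
# Positive-causality values copied from the printed output above.
early <- c("100" = 0.4444444, "110" = 0.3548387, "120" = 0.1851852,
           "130" = 0.3157895, "140" = 0.3157895, "150" = 0.4444444)
late  <- c("450" = 0.3083333, "460" = 0.2905983, "470" = 0.2941176,
           "480" = 0.3120000, "490" = 0.2892562, "500" = 0.3030303)

# Spread (maximum minus minimum) of the estimates in each group.
spread <- c(early = diff(range(early)), late = diff(range(late)))
print(round(spread, 4))
```

For these values the spread drops from about 0.26 at the smallest sample sizes to about 0.02 at the largest, which matches the visual impression from the plot.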

So far, we have used cross-validation to assess the reliability of pattern causality estimates and a plot to inspect the results.

Cross-Validation: Convergence of Pattern Causality

Now, let’s examine the cross-validation process when the random parameter is set to FALSE. This approach uses a systematic sampling method rather than random sampling.

set.seed(123)
X <- climate_indices$PNA
Y <- climate_indices$NAO
result_non_random <- pcCrossValidation(
  X = X,
  Y = Y,
  numberset = seq(100, 500, by = 100),
  E = 3,
  tau = 2,
  metric = "euclidean",
  h = 1,
  weighted = FALSE,
  random = FALSE
)
print(result_non_random$results)
#> , , positive
#> 
#>         value
#> 100 0.2941176
#> 200 0.2400000
#> 300 0.2972973
#> 400 0.2692308
#> 500 0.3000000
#> 
#> , , negative
#> 
#>         value
#> 100 0.1764706
#> 200 0.3200000
#> 300 0.3108108
#> 400 0.2596154
#> 500 0.2307692
#> 
#> , , dark
#> 
#>         value
#> 100 0.5294118
#> 200 0.4400000
#> 300 0.3918919
#> 400 0.4711538
#> 500 0.4692308

We can also visualize the results of the non-random cross-validation:

plot(result_non_random)

By comparing the results of the random and non-random cross-validation, you can gain a deeper understanding of how different sampling methods affect the stability and reliability of the causality analysis.
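One simple way to make this comparison concrete is to line up the two sets of estimates at the sample sizes they share. The sketch below is a base R illustration, not a package feature: it copies the values for sample sizes 100 to 500 from the two printed outputs above and computes the absolute gap for each component. The names random_cv, systematic_cv, and abs_gap are hypothetical.

```r
# Values copied from the two printed outputs above, at the shared sample sizes.
sizes <- c("100", "200", "300", "400", "500")
random_cv <- cbind(
  positive = c(0.4444444, 0.3478261, 0.3382353, 0.2980769, 0.3030303),
  negative = c(0.16666667, 0.17391304, 0.11764706, 0.25000000, 0.22727273),
  dark     = c(0.3888889, 0.4782609, 0.5441176, 0.4519231, 0.4696970)
)
systematic_cv <- cbind(
  positive = c(0.2941176, 0.2400000, 0.2972973, 0.2692308, 0.3000000),
  negative = c(0.1764706, 0.3200000, 0.3108108, 0.2596154, 0.2307692),
  dark     = c(0.5294118, 0.4400000, 0.3918919, 0.4711538, 0.4692308)
)
rownames(random_cv) <- rownames(systematic_cv) <- sizes

# Absolute gap between the two sampling schemes for each component.
abs_gap <- abs(random_cv - systematic_cv)
print(round(abs_gap, 3))
```

For these particular runs, every gap at 400 and 500 observations is below 0.03, while some gaps at 100 to 300 observations exceed 0.1. Because the random results come from a single seeded draw, this is an indication rather than a general guarantee.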

Cross-Validation with Bootstrap Analysis

To obtain more robust results and understand the uncertainty in our causality measures, we can use bootstrap sampling in our cross-validation analysis. This approach repeatedly samples the data with replacement and provides statistical summaries of the causality measures.

set.seed(123)
X <- climate_indices$PNA
Y <- climate_indices$NAO
result_boot <- pcCrossValidation(
  X = X,
  Y = Y,
  numberset = seq(100, 500, by = 100),
  E = 3,
  tau = 2,
  metric = "euclidean",
  h = 1,
  weighted = FALSE,
  random = TRUE,
  bootstrap = 10  # Perform 10 bootstrap iterations
)

The bootstrap analysis provides several summary statistics for each sample size:

  - Mean: the average causality measure across bootstrap samples.
  - 5% and 95% quantiles: the lower and upper bounds of a 90% percentile interval for the causality measure.
  - Median: a measure of central tendency that is robust to outliers.
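To show how summaries of this kind are typically computed, here is a minimal base R sketch. It does not reproduce the package's internal implementation: the synthetic series x and the function statistic are assumptions made purely for illustration, and the pattern causality computation is replaced by a trivial stand-in. Plain resampling with replacement also ignores temporal ordering, so the sketch only illustrates the resample-and-summarize step.

```r
set.seed(123)

# Synthetic series standing in for real data (assumption for illustration).
x <- rnorm(200)

# Hypothetical statistic of interest; the package computes causality
# measures instead, which are not reproduced here.
statistic <- function(v) mean(v > 0)

# Draw bootstrap samples with replacement and recompute the statistic.
n_boot <- 10
estimates <- replicate(n_boot, statistic(sample(x, replace = TRUE)))

# Summaries in the same order as the bootstrap output: mean, 5%, 95%, median.
boot_summary <- c(
  mean = mean(estimates),
  quantile(estimates, probs = c(0.05, 0.95)),
  median = median(estimates)
)
print(boot_summary)
```

The resulting vector carries the names mean, 5%, 95%, and median, which are the same column labels that appear in the bootstrap results.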

Let’s examine the results:

print(result_boot$results)
#> , , positive
#> 
#>          mean         5%       95%    median
#> 100 0.2935145 0.00000000 0.5130952 0.3484848
#> 200 0.2128219 0.02416667 0.3593833 0.2586976
#> 300 0.3054764 0.17454545 0.4886905 0.2940141
#> 400 0.2814055 0.20677419 0.3564555 0.2854029
#> 500 0.2921291 0.18026316 0.3955296 0.2940621
#> 
#> , , negative
#> 
#>          mean        5%       95%    median
#> 100 0.3364599 0.1065934 0.6964286 0.3030303
#> 200 0.3573748 0.2100710 0.5719253 0.3361582
#> 300 0.2927158 0.1591206 0.4129583 0.2900433
#> 400 0.3209360 0.2532708 0.3866656 0.3158903
#> 500 0.3021787 0.1848489 0.4481203 0.2892157
#> 
#> , , dark
#> 
#>          mean        5%       95%    median
#> 100 0.3700256 0.1733083 0.5119565 0.4160839
#> 200 0.4298033 0.3531579 0.5409119 0.4099116
#> 300 0.4018078 0.3214286 0.4705645 0.4075221
#> 400 0.3976585 0.3348185 0.4485013 0.3989247
#> 500 0.4056922 0.3365340 0.4638937 0.4129274

We can visualize the bootstrap results using the plot function, which now shows confidence intervals:

plot(result_boot, separate = TRUE)

The shaded area in the plot represents the range between the 5th and 95th percentiles of the bootstrap samples, providing a measure of uncertainty in our causality estimates. The solid line shows the median value, which is more robust to outliers than the mean.
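The width of the shaded band can also be read directly from the printed results. As a small base R sketch, the code below copies the 5% and 95% quantiles of the positive component from the output above and subtracts them. The names q05, q95, and band_width are introduced here for illustration.

```r
# 5% and 95% quantiles of the positive component, copied from the output above.
q05 <- c("100" = 0.00000000, "200" = 0.02416667, "300" = 0.17454545,
         "400" = 0.20677419, "500" = 0.18026316)
q95 <- c("100" = 0.5130952, "200" = 0.3593833, "300" = 0.4886905,
         "400" = 0.3564555, "500" = 0.3955296)

# Width of the shaded band at each sample size.
band_width <- q95 - q05
print(round(band_width, 3))
```

The band is widest at 100 observations (about 0.51) and considerably narrower at 400 and 500 observations (about 0.15 and 0.22). With only 10 bootstrap iterations these quantiles are rough, so a larger bootstrap value would likely give smoother bands.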