Retrieving Longitudinal and Repeating Structures

Will Beasley

Biomedical & Behavior Methodology Core, OUHSC Pediatrics

Stephan Kaduke

CHOP

Background

This vignette pertains to reading REDCap records from a project that (a) has longitudinal events or (b) has a repeating measure. The first section conceptually discusses how REDCap stores complex structures. The remaining sections describe how to best retrieve complex structures with the REDCapTidyieR and REDCapR packages.

If your REDCap project is longitudinal or contains repeating measures, a single call to the API (or a single export through the browser) will return a dataset that is not readily analyzed. Instead, the dataset will resemble Table 5. This isn’t because of a software bug, but because you haven’t told the software how you would like the data structured. There isn’t a good way to jam this multidimensional space into a rectangle of points. Our advice for querying REDCap is the same as querying any database system: request separate datasets that have a natural “grain” and assemble them as to fit your analyses.

Illustration of How Data Points are Structured

Possible Table Structures

Suppose you have two patients (i.e., “1” and “2”) with three intake variables (i.e., height, weight, and bmi). If you record this on a piece of paper, it would probably look like Table 1. The table’s grain is “patient”, because each row represents a distinct patient. Understanding the grain of each structure below will help you understand how the structures are re-expressions of the same set of observations.

Table 1: patient grain

pt
id
height weight bmi
1 1.0 11.0 111.0
2 2.0 22.0 222.0

This patient-grain structure is how the data points are most comfortably inputted by humans into REDCap, and it is the default when exported through the browser and API. However it is stored differently by REDCap’s internal database.

REDCap’s flexibility is a driver of its success. Once a research team learns REDCap, it can reuse the knowledge to capture anything from leukemia to lunch orders. But to achieve this flexibility in the world of REDCap and EMRs, data are stored along the observation grain. In computer science, this is commonly called an EAV structure (which stands for entity-attribute-value). The patient’s ID is the entity, the variable type is the attribute, and the observed point is the value. It can also be thought of as a “key-value store” nested within a patient (where “key” is a synonym of “attribute”). Notice that the two wider rows have morphed into six skinnier rows –one row per observation. If you are a curious database administrator, peek at the the structure and rows of the redcap_data table. It is the most important table in the database.

Table 2: observation grain for intake instrument

REDCap and EMR databases store observations in their underlying table. This table is a simplification of the redcap_data table, which is the heart of the REDCap’s internal database.

pt
id
key value
1 height 1.0
1 weight 11.0
1 bmi 111.0
2 height 2.0
2 weight 22.0
2 bmi 222.0

If the investigation gains a longitudinal or repeating component, it becomes necessary to include the dimension of time. Suppose the protocol specifies five time points; the blood pressure instrument is captured at times 1, 2, & 3 while the laboratory instrument is captured at times 1 & 2. If you record this stage on paper, it will likely resemble Tables 3a & 3b: one for vitals and one for labs.

Table 3a: patient-time grain for blood_pressure instrument

pt
id
time sbp dbp
1 1 1.1 11.1
1 2 1.2 11.2
1 3 1.3 11.3
2 1 2.1 22.1
2 2 2.2 22.2
2 3 2.3 22.3

Table 3b: patient-time grain for laboratory instrument

pt
id
time lab dose
1 1 aa1 1.1 mg
1 2 aa2 1.2 mg
2 1 bb1 2.1 mg
2 2 bb2 2.2 mg

When these measurements are added to REDCap’s observation table, it resembles Table 4. Two new columns are required to uniquely distinguish the instrument and its ordinal position. Notice the first six rows are copied from Table 2; they have empty values for the repeating structure.

Table 4: observation grain for intake, blood_pressure, and laboratory instruments

pt
id
repeat
instrument
repeat
instance
key value
1 height 1.0
1 weight 11.0
1 bmi 111.0
2 height 2.0
2 weight 22.0
2 bmi 222.0
1 blood_pressure 1 sbp 1.1
1 blood_pressure 1 dbp 11.1
1 blood_pressure 2 sbp 1.2
1 blood_pressure 2 dbp 11.2
1 blood_pressure 3 sbp 1.3
1 blood_pressure 3 dbp 11.3
1 laboratory 1 lab aa1
1 laboratory 1 conc 1.1 ppm
1 laboratory 2 lab aa2
1 laboratory 2 conc 1.2 ppm
2 blood_pressure 1 sbp 2.1
2 blood_pressure 1 dbp 22.1
2 blood_pressure 2 sbp 2.2
2 blood_pressure 2 dbp 22.2
2 blood_pressure 3 sbp 2.3
2 blood_pressure 3 dbp 22.3
2 laboratory 1 lab bb1
2 laboratory 1 conc 2.1 ppm
2 laboratory 2 lab bb2
2 laboratory 2 conc 2.2 ppm

As mentioned above, there isn’t a universally good way to coerce Tables 1, 3a, and 3b into a single rectangle because the rows represent different things. Or from REDCap’s perspective, there’s not a good transformation of redcap_data (i.e., Table 4) that is appropriate for most statistical programs.

When forced to combine the different entities, the best option is probably Table 5. We call this a “block dataset”, borrowing from linear algebra’s block matrix term. You can see the mishmash of tables masquerading as a unified dataset. The rows lack the conceptual coherency of Tables 1, 3a, & 3b.

Table 5: mishmashed grain

pt
id
repeat
instrument
repeat
instance
height weight bmi sbp dbp lab conc
1 1.0 11.0 111.0
1 blood_pressure 1 1.1 11.1
1 blood_pressure 2 1.2 11.2
1 blood_pressure 3 1.3 11.3
1 laboratory 1 aa1 1.1 ppm
1 laboratory 2 aa2 1.2 ppm
2 2.0 22.0 222.0
2 blood_pressure 1 2.1 22.1
2 blood_pressure 2 2.2 22.2
2 blood_pressure 3 2.3 22.3
2 laboratory 1 bb1 2.1 ppm
2 laboratory 2 bb2 2.2 ppm

A block dataset is not inherently bad. After all, Table 5 can be unambiguously transformed to and from Table 4.

Table 5’s primary limitation is that a block dataset is not understood by analysis software used in conventional medical research. At best, the dataset always will require additional preparation. At worst, the analyst will model the rows inappropriately, which will produce misleading conclusions.

Table 5’s secondary limitation is inefficiency. The empty cells aren’t computationally free. Every cell must be queried from the database and concatenated in REDCap’s web server in order to return Table 5 in the plain-text csv, json, or xml format. In our simple example, more than half of the block dataset’s cells are wasted. The emptiness frequently exceeds 90% in real-world REDCap projects (because they tend to have many more variables and repeating instances). The emptiness always exceeds 99.9% in real-world EMRs.

For this reason, REDCap and EMR design their observation table to resemble the computational structure of a sparse matrix. (The only important difference is that REDCap’s unspecified cells are interpreted as null/empty, while a sparse matrix’s unspecified cells are interpreted as zero.)

In the case of a sparse matrix, substantial memory requirement reductions can be realized by storing only the non-zero entries. Depending on the number and distribution of the non-zero entries, different data structures can be used and yield huge savings in memory when compared to the basic approach. The trade-off is that accessing the individual elements becomes more complex and additional structures are needed to be able to recover the original matrix unambiguously. (source: Wikipedia: Sparse matrix - storage)

Terminology

observation

The term “observation” in the world of medical databases has a different and more granular meaning than it does in the tidyverse literature. In REDCap and medical databases, an observation is typically a single point (such as a heart rate or systolic blood pressure) with contextual variables (such as the the associated date, unit, visit ID, and patient ID); see Tables 2 and 4 above. In the tidyverse publications, an observation is roughly equivalent to a REDCap instrument (which is a collection of associated values); see Tables 1, 3a, and 3b.

(We use the medical terminology in this vignette. We’d love to hear if someone has another term that’s unambiguous.)

Concept REDCap & Medical World Tidyverse Literature
A single measured point observation value
A collection of associated points instrument observation

Retrieving from REDCap

Many new REDCap users will submit a single API call and unintentionally obtain something like Table 5; they then try to extract something resembling Tables 1, 3a, & 3b. Although this can be successful, we strongly discourage it. The code is difficult to maintain and is not portable to REDCap projects with different instruments. (The code is really slow and ugly too.)

Our advice is to start before Table 5 is assembled –retrieve the information in a better way. Like other database systems, request the three tables separately from the server and then combine them on your desktop to fit your analyses if necessary.

Two approaches are appropriate for most scenarios:

  1. multiple calls to REDCapR’s redcap_read(), or
  2. a single call to REDCapTidieR’s redcap_read_tidy().

The code in the vignette requires the magrittr package for the %>% (alternatively you can use |> if you’re using R 4.0.2 or later).

The vignette uses these credentials to retrieve the practice/fake dataset. This is not appropriate for datasets containing PHI or other sensitive information. Please see Part 2 - Retrieve Protected Token of the Typical REDCap Workflow for a Data Analyst vignette for secure approaches.

# Support pipes
library(magrittr)

# Retrieve token
path_credential <- system.file("misc/dev-2.credentials", package = "REDCapR")
credential  <- REDCapR::retrieve_credential_local(
  path_credential = path_credential,
  project_id      = 62
)

One REDCapR Call for Each Table

The tidy datasets represented in Tables 1, 3a, and 3b can be obtained by calling REDCapR three times –one call per table. Using the forms parameter, pass “intake” to get Table 1, “blood_pressure” to get Table 3a, and “laboratory” to get Table 3b.

Although it is not required, we recommend specifying a readr::cols() object to ensure the desired variable data types.

Retrieve patient-level table (corresponding to Table 1)

col_types_intake <-
  readr::cols_only(
    record_id                 = readr::col_integer(),
    height                    = readr::col_double(),
    weight                    = readr::col_double(),
    bmi                       = readr::col_double()
  )

ds_intake <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri, # From the previous code snippet.
    token       = credential$token,
    forms       = c("intake"),
    col_types   = col_types_intake,
    verbose     = FALSE,
  )$data

ds_intake

Retrieve patient-time-level tables (corresponding to Tables 3a & 3b)

col_types_blood_pressure <-
  readr::cols(
    record_id                 = readr::col_integer(),
    redcap_repeat_instrument  = readr::col_character(),
    redcap_repeat_instance    = readr::col_integer(),
    sbp                       = readr::col_double(),
    dbp                       = readr::col_double(),
    blood_pressure_complete   = readr::col_integer()
  )

ds_blood_pressure <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    forms       = c("blood_pressure"),
    col_types   = col_types_blood_pressure,
    verbose     = FALSE
  )$data

ds_blood_pressure %>%
  tidyr::drop_na(redcap_repeat_instrument)

col_types_laboratory  <-
  readr::cols(
    record_id                 = readr::col_integer(),
    redcap_repeat_instrument  = readr::col_character(),
    redcap_repeat_instance    = readr::col_integer(),
    lab                       = readr::col_character(),
    conc                      = readr::col_character(),
    laboratory_complete       = readr::col_integer()
  )

ds_laboratory  <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    forms       = c("laboratory"),
    col_types   = col_types_laboratory,
    verbose     = FALSE
  )$data

ds_laboratory %>%
  tidyr::drop_na(redcap_repeat_instrument)

Retrieve block tables (corresponding to Table 5)

If for some reason you need the block dataset through the API, one call will retrieve it.

ds_block <-
  REDCapR::redcap_read(
    redcap_uri  = credential$redcap_uri,
    token       = credential$token,
    col_types   = readr::cols(.default = readr::col_character()),
    verbose     = FALSE,
  )$data

ds_block

One REDCapTidieR Call for All Tables

REDCapTidieR’s initial motivation is to facilitate longitudinal analyses and promote tidy data hygiene.

{Stephan Kadauke & Richard Hanna, please represent your package as you wish. Tell me if I’ve positioned it differently than you would have.}

Choosing between the Approaches

When retrieving data from REDCap, we recommend calling REDCapTidieR in many scenarios, such as:

  • you are new to managing or analyzing data with R, or
  • your analyses will require most of the dataset’s rows or columns, or
  • you’d benefit from some of the auxiliary information in REDCapTidieR’s supertibble, such as the instrument’s structure.

However we recommend calling REDCapR in other scenarios. It could be worth calling REDCapR multiple times if:

  • you are performing some operation other than retrieving/reading data from REDCap,
  • you are comfortable with managing and analyzing data with R, or
  • your analyses require only a fraction of the data (such as (a) you need only the first event, or (b) the analyses don’t involve most of the instruments), or
  • you want to specify the variables’ data types with readr::cols().

If in doubt, start with REDCapTidieR. Escalate to REDCapR if your download time is too long and might be decreased by reducing the information retrieved from the server and transported across the network.

And of course many scenarios are solved best with a combination of both packages, such as (a) REDCapR populates the initial demographics in REDCap, (b) research staff enter measures collected from patients over time, (c) REDCapTidieR retrieves the complete longitudinal dataset, (d) dplyr joins the tibbles, and finally (e) lme4 tests hypotheses involving patient trajectories over time.

Escalating to REDCapR

Even if you think you’ll need REDCapR’s low-level control, consider starting with REDCapTidieR anyway. …particularly if you are unsure how to specify the grain of each table. The structure of REDCapTidieR’s tables easily compatible with conventional analyses. If you need the performance of REDCapR but are unsure how the tables should look, simply execute something like REDCapTidieR::redcap_read_tidy(url, project_token) and study its output. Then try to mimic it exactly with REDCapR::redcap_read() calls.

Finally, cull unwanted cells using the parameters of REDCapR::redcap_read(). These data points will not even leave the REDCap instance, which will improve performance. Some possible strategies include passing arguments to

  • forms: this will retrieve only the specified instruments/forms. Beginners should start here. It is easy to conceptualize and usually has a big increase in speed for just a little development work.
  • events: this will retrieve only the desired events within a longitudinal project. For instance if the analyses involve only the “intake” and “follow up #1” events, leave follow ups #2, #3, and #4 on the server.
  • fields: this is more granular than forms. It can be combined with calls to forms, such as passing "bmi" to fields and c("blood_pressure", "laboratory") to forms.
  • records: in the example scenario, this will pluck individual patients and their associated events and repeating instances. To be useful in research, it’s usually combined with other REDCapR calls. See the Read a subset of records, conditioned on the values in some variables section of REDCapR’s Basic REDCapR Operations vignette.
  • filter_logic: this will leverage the observation values to limit the rows returned. Because the underlying table does not index the obs values, this will be less computationally efficient than the options above.
  • datetime_range_begin & datetime_range_end: this will return only the records that have been created or modified during the specified window.

Note that the efficiency gain from moving from the block dataset to REDCapTidieR is different than the gain from moving from REDCapTidieR to REDCapR. When moving to from Table 5 to a REDCapTidieR Supertibble, you are eliminating empty cells that will never contain worthwhile data. When moving from a REDCapTidieR Supertibble call to a collection of REDCapR calls, you are eliminating cells that contain data, but may not be relevant to your analysis (such as a patient’s name or the time a lab specimen was collected). {This paragraph needs work.}

Advanced

Longitudinal

{TODO: Generalize the argument to longitudinal events}

pt repeat
instrument
repeat
instance
height weight bmi sbp dbp lab conc
1 1.0 11.0 111.0
1 blood_pressure 1 1.1 11.1
1 blood_pressure 2 1.2 11.2
1 blood_pressure 3 1.3 11.3
1 laboratory 3 aa3 1.3 ppm
1 laboratory 4 aa4 1.4 ppm
2 2.0 22.0 222.0
2 blood_pressure 1 2.1 22.1
2 blood_pressure 2 2.2 22.2
2 blood_pressure 3 2.3 22.3
2 laboratory 3 bb3 2.3 ppm
2 laboratory 4 bb4 2.4 ppm

Caching Results to Improve Performance

If escalating to REDCapR didn’t decrease the duration enough, consider the strategy of calling REDCap only once per day (with either package) and saving the results to a secured disk. This will be efficient when the analyses require a large dataset, but not a real-time dataset.

In many retrospective projects, multiple analyses during the day can reuse the same dataset retrieved the night before (with a function like readr::write_rds()). This has helped our complicated investigations when multiple statisticians frantically tackle overlapping aspects before a funder’s deadline.

Essentially you are transferring security responsibilities from REDCap to the file server. To balance its advantages, the approach’s drawbacks include:

  • extra effort is required to create, secure, and maintain a networked drive that is accessible to only authorized analysts. This will take 15-60 minutes, depending on your institution.
  • extra effort is required to create an automated program to retrieve and cache the dataset every night. See [CDS] for a head start.
  • REDCap’s logs will not support a meaningful audit because no human is accessing the API. User access audits are now the responsibility of the file system.

Caching Results for other Languages

Packages for other programming languages have been developed that access REDCap’s API, such PyCap and PhpCap. Please see a more complete list at https://redcap-tools.github.io/projects/. The caching strategy described above may also benefit your investigation if:

  • it uses a language like SAS that does not support packages. We strongly discourage deploying loose scripts that call the API. When something is fixed in the original script, it must be distributed by the developer to the ~50 client institutions, then it must by copied to the ~100 projects within each institution. In contrast, when a developer updates a package distributed through a repository like CRAN, the client needs to run only base::update.packages() once per machine to update all its packages –not just REDCap packages. Data scientists already do this regularly.

  • it uses a language like Julia that lacks a stable or updated package. The REDCap developers are always releasing useful features (e.g., repeated measures) that improve the quality and efficiency of your research. Leverage them.

  • your investigation incorporates multiple programming languages and you would like the analysis team to use a consistent dataset. In this scenario, we recommend the code save the same R dataset in multiple file formats, such as a csv file and a feather file.

Session Information

For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.

Environment