This vignette pertains to reading REDCap records from a project that (a) has longitudinal events or (b) has a repeating measure. The first section conceptually discusses how REDCap stores complex structures. The remaining sections describe how to best retrieve complex structures with the REDCapTidyieR and REDCapR packages.
redcap_read()
.If your REDCap project is longitudinal or contains repeating measures, a single call to the API (or a single export through the browser) will return a dataset that is not readily analyzed. Instead, the dataset will resemble Table 5. This isn’t because of a software bug, but because you haven’t told the software how you would like the data structured. There isn’t a good way to jam this multidimensional space into a rectangle of points. Our advice for querying REDCap is the same as querying any database system: request separate datasets that have a natural “grain” and assemble them as to fit your analyses.
Suppose you have two patients (i.e., “1” and “2”) with three
intake variables (i.e., height
,
weight
, and bmi
). If you record this on a
piece of paper, it would probably look like Table 1. The table’s
grain is “patient”, because each row represents a distinct
patient. Understanding the grain of each structure below will help you
understand how the structures are re-expressions of the same set of
observations.
pt id |
height | weight | bmi |
---|---|---|---|
1 | 1.0 | 11.0 | 111.0 |
2 | 2.0 | 22.0 | 222.0 |
This patient-grain structure is how the data points are most comfortably inputted by humans into REDCap, and it is the default when exported through the browser and API. However it is stored differently by REDCap’s internal database.
REDCap’s flexibility is a driver of its success. Once a research team
learns REDCap, it can reuse the knowledge to capture anything from
leukemia to lunch orders. But to achieve this flexibility in the world
of REDCap and EMRs, data are stored along the observation grain. In
computer science, this is commonly called an EAV structure (which stands
for entity-attribute-value). The patient’s ID is the entity, the
variable type is the attribute, and the observed point is the value. It
can also be thought of as a “key-value store” nested within a patient
(where “key” is a synonym of “attribute”). Notice that the two wider
rows have morphed into six skinnier rows –one row per observation. If
you are a curious database administrator, peek at the the structure and
rows of the redcap_data
table. It is the most important
table in the database.
intake
instrumentREDCap and EMR databases store observations in their underlying
table. This table is a simplification of the redcap_data
table, which is the heart of the REDCap’s internal database.
pt id |
key | value |
---|---|---|
1 | height | 1.0 |
1 | weight | 11.0 |
1 | bmi | 111.0 |
2 | height | 2.0 |
2 | weight | 22.0 |
2 | bmi | 222.0 |
If the investigation gains a longitudinal or repeating component, it becomes necessary to include the dimension of time. Suppose the protocol specifies five time points; the blood pressure instrument is captured at times 1, 2, & 3 while the laboratory instrument is captured at times 1 & 2. If you record this stage on paper, it will likely resemble Tables 3a & 3b: one for vitals and one for labs.
blood_pressure
instrumentpt id |
time | sbp | dbp |
---|---|---|---|
1 | 1 | 1.1 | 11.1 |
1 | 2 | 1.2 | 11.2 |
1 | 3 | 1.3 | 11.3 |
2 | 1 | 2.1 | 22.1 |
2 | 2 | 2.2 | 22.2 |
2 | 3 | 2.3 | 22.3 |
laboratory
instrumentpt id |
time | lab | dose |
---|---|---|---|
1 | 1 | aa1 | 1.1 mg |
1 | 2 | aa2 | 1.2 mg |
2 | 1 | bb1 | 2.1 mg |
2 | 2 | bb2 | 2.2 mg |
When these measurements are added to REDCap’s observation table, it resembles Table 4. Two new columns are required to uniquely distinguish the instrument and its ordinal position. Notice the first six rows are copied from Table 2; they have empty values for the repeating structure.
intake
,
blood_pressure
, and laboratory
instrumentspt id |
repeat instrument |
repeat instance |
key | value |
---|---|---|---|---|
1 | – | – | height | 1.0 |
1 | – | – | weight | 11.0 |
1 | – | – | bmi | 111.0 |
2 | – | – | height | 2.0 |
2 | – | – | weight | 22.0 |
2 | – | – | bmi | 222.0 |
1 | blood_pressure | 1 | sbp | 1.1 |
1 | blood_pressure | 1 | dbp | 11.1 |
1 | blood_pressure | 2 | sbp | 1.2 |
1 | blood_pressure | 2 | dbp | 11.2 |
1 | blood_pressure | 3 | sbp | 1.3 |
1 | blood_pressure | 3 | dbp | 11.3 |
1 | laboratory | 1 | lab | aa1 |
1 | laboratory | 1 | conc | 1.1 ppm |
1 | laboratory | 2 | lab | aa2 |
1 | laboratory | 2 | conc | 1.2 ppm |
2 | blood_pressure | 1 | sbp | 2.1 |
2 | blood_pressure | 1 | dbp | 22.1 |
2 | blood_pressure | 2 | sbp | 2.2 |
2 | blood_pressure | 2 | dbp | 22.2 |
2 | blood_pressure | 3 | sbp | 2.3 |
2 | blood_pressure | 3 | dbp | 22.3 |
2 | laboratory | 1 | lab | bb1 |
2 | laboratory | 1 | conc | 2.1 ppm |
2 | laboratory | 2 | lab | bb2 |
2 | laboratory | 2 | conc | 2.2 ppm |
As mentioned above, there isn’t a universally good way to coerce
Tables 1, 3a, and 3b into a single rectangle because the rows represent
different things. Or from REDCap’s perspective, there’s not a good
transformation of redcap_data
(i.e., Table 4) that
is appropriate for most statistical programs.
When forced to combine the different entities, the best option is probably Table 5. We call this a “block dataset”, borrowing from linear algebra’s block matrix term. You can see the mishmash of tables masquerading as a unified dataset. The rows lack the conceptual coherency of Tables 1, 3a, & 3b.
pt id |
repeat instrument |
repeat instance |
height | weight | bmi | sbp | dbp | lab | conc |
---|---|---|---|---|---|---|---|---|---|
1 | – | – | 1.0 | 11.0 | 111.0 | – | – | – | – |
1 | blood_pressure | 1 | – | – | – | 1.1 | 11.1 | – | – |
1 | blood_pressure | 2 | – | – | – | 1.2 | 11.2 | – | – |
1 | blood_pressure | 3 | – | – | – | 1.3 | 11.3 | – | – |
1 | laboratory | 1 | – | – | – | – | – | aa1 | 1.1 ppm |
1 | laboratory | 2 | – | – | – | – | – | aa2 | 1.2 ppm |
2 | – | – | 2.0 | 22.0 | 222.0 | – | – | – | – |
2 | blood_pressure | 1 | – | – | – | 2.1 | 22.1 | – | – |
2 | blood_pressure | 2 | – | – | – | 2.2 | 22.2 | – | – |
2 | blood_pressure | 3 | – | – | – | 2.3 | 22.3 | – | – |
2 | laboratory | 1 | – | – | – | – | – | bb1 | 2.1 ppm |
2 | laboratory | 2 | – | – | – | – | – | bb2 | 2.2 ppm |
A block dataset is not inherently bad. After all, Table 5 can be unambiguously transformed to and from Table 4.
Table 5’s primary limitation is that a block dataset is not understood by analysis software used in conventional medical research. At best, the dataset always will require additional preparation. At worst, the analyst will model the rows inappropriately, which will produce misleading conclusions.
Table 5’s secondary limitation is inefficiency. The empty cells aren’t computationally free. Every cell must be queried from the database and concatenated in REDCap’s web server in order to return Table 5 in the plain-text csv, json, or xml format. In our simple example, more than half of the block dataset’s cells are wasted. The emptiness frequently exceeds 90% in real-world REDCap projects (because they tend to have many more variables and repeating instances). The emptiness always exceeds 99.9% in real-world EMRs.
For this reason, REDCap and EMR design their observation table to resemble the computational structure of a sparse matrix. (The only important difference is that REDCap’s unspecified cells are interpreted as null/empty, while a sparse matrix’s unspecified cells are interpreted as zero.)
In the case of a sparse matrix, substantial memory requirement reductions can be realized by storing only the non-zero entries. Depending on the number and distribution of the non-zero entries, different data structures can be used and yield huge savings in memory when compared to the basic approach. The trade-off is that accessing the individual elements becomes more complex and additional structures are needed to be able to recover the original matrix unambiguously. (source: Wikipedia: Sparse matrix - storage)
The term “observation” in the world of medical databases has a different and more granular meaning than it does in the tidyverse literature. In REDCap and medical databases, an observation is typically a single point (such as a heart rate or systolic blood pressure) with contextual variables (such as the the associated date, unit, visit ID, and patient ID); see Tables 2 and 4 above. In the tidyverse publications, an observation is roughly equivalent to a REDCap instrument (which is a collection of associated values); see Tables 1, 3a, and 3b.
(We use the medical terminology in this vignette. We’d love to hear if someone has another term that’s unambiguous.)
Concept | REDCap & Medical World | Tidyverse Literature |
---|---|---|
A single measured point | observation | value |
A collection of associated points | instrument | observation |
Many new REDCap users will submit a single API call and unintentionally obtain something like Table 5; they then try to extract something resembling Tables 1, 3a, & 3b. Although this can be successful, we strongly discourage it. The code is difficult to maintain and is not portable to REDCap projects with different instruments. (The code is really slow and ugly too.)
Our advice is to start before Table 5 is assembled –retrieve the information in a better way. Like other database systems, request the three tables separately from the server and then combine them on your desktop to fit your analyses if necessary.
Two approaches are appropriate for most scenarios:
redcap_read()
,
orredcap_read_tidy()
.The code in the vignette requires the magrittr package for the
%>%
(alternatively you can use |>
if
you’re using R 4.0.2 or later).
The vignette uses these credentials to retrieve the practice/fake dataset. This is not appropriate for datasets containing PHI or other sensitive information. Please see Part 2 - Retrieve Protected Token of the Typical REDCap Workflow for a Data Analyst vignette for secure approaches.
# Support pipes
library(magrittr)
# Retrieve token
path_credential <- system.file("misc/dev-2.credentials", package = "REDCapR")
credential <- REDCapR::retrieve_credential_local(
path_credential = path_credential,
project_id = 62
)
The tidy datasets represented in Tables 1, 3a, and 3b can be obtained
by calling REDCapR three times –one call per table. Using the
forms
parameter, pass “intake” to get Table 1,
“blood_pressure” to get Table 3a, and “laboratory” to get Table 3b.
Although it is not required, we recommend specifying a readr::cols()
object to ensure the desired variable data types.
col_types_intake <-
readr::cols_only(
record_id = readr::col_integer(),
height = readr::col_double(),
weight = readr::col_double(),
bmi = readr::col_double()
)
ds_intake <-
REDCapR::redcap_read(
redcap_uri = credential$redcap_uri, # From the previous code snippet.
token = credential$token,
forms = c("intake"),
col_types = col_types_intake,
verbose = FALSE,
)$data
ds_intake
col_types_blood_pressure <-
readr::cols(
record_id = readr::col_integer(),
redcap_repeat_instrument = readr::col_character(),
redcap_repeat_instance = readr::col_integer(),
sbp = readr::col_double(),
dbp = readr::col_double(),
blood_pressure_complete = readr::col_integer()
)
ds_blood_pressure <-
REDCapR::redcap_read(
redcap_uri = credential$redcap_uri,
token = credential$token,
forms = c("blood_pressure"),
col_types = col_types_blood_pressure,
verbose = FALSE
)$data
ds_blood_pressure %>%
tidyr::drop_na(redcap_repeat_instrument)
col_types_laboratory <-
readr::cols(
record_id = readr::col_integer(),
redcap_repeat_instrument = readr::col_character(),
redcap_repeat_instance = readr::col_integer(),
lab = readr::col_character(),
conc = readr::col_character(),
laboratory_complete = readr::col_integer()
)
ds_laboratory <-
REDCapR::redcap_read(
redcap_uri = credential$redcap_uri,
token = credential$token,
forms = c("laboratory"),
col_types = col_types_laboratory,
verbose = FALSE
)$data
ds_laboratory %>%
tidyr::drop_na(redcap_repeat_instrument)
If for some reason you need the block dataset through the API, one call will retrieve it.
REDCapTidieR’s initial motivation is to facilitate longitudinal analyses and promote tidy data hygiene.
{Stephan Kadauke & Richard Hanna, please represent your package as you wish. Tell me if I’ve positioned it differently than you would have.}
When retrieving data from REDCap, we recommend calling REDCapTidieR in many scenarios, such as:
However we recommend calling REDCapR in other scenarios. It could be worth calling REDCapR multiple times if:
readr::cols()
.If in doubt, start with REDCapTidieR. Escalate to REDCapR if your download time is too long and might be decreased by reducing the information retrieved from the server and transported across the network.
And of course many scenarios are solved best with a combination of both packages, such as (a) REDCapR populates the initial demographics in REDCap, (b) research staff enter measures collected from patients over time, (c) REDCapTidieR retrieves the complete longitudinal dataset, (d) dplyr joins the tibbles, and finally (e) lme4 tests hypotheses involving patient trajectories over time.
Even if you think you’ll need REDCapR’s low-level control, consider
starting with REDCapTidieR anyway. …particularly if you are unsure how
to specify the grain of each table. The structure of REDCapTidieR’s
tables easily compatible with conventional analyses. If you need the
performance of REDCapR but are unsure how the tables should look, simply
execute something like
REDCapTidieR::redcap_read_tidy(url, project_token)
and
study its output. Then try to mimic it exactly with
REDCapR::redcap_read()
calls.
Finally, cull unwanted cells using the parameters of
REDCapR::redcap_read()
. These data points will not even
leave the REDCap instance, which will improve performance. Some possible
strategies include passing arguments
to
forms
: this will retrieve only the specified
instruments/forms. Beginners should start here. It is easy to
conceptualize and usually has a big increase in speed for just a little
development work.events
: this will retrieve only the desired events
within a longitudinal project. For instance if the analyses involve only
the “intake” and “follow up #1” events, leave follow ups #2, #3, and #4
on the server.fields
: this is more granular than forms
.
It can be combined with calls to forms
, such as passing
"bmi"
to fields and
c("blood_pressure", "laboratory")
to forms.records
: in the example scenario, this will pluck
individual patients and their associated events and repeating instances.
To be useful in research, it’s usually combined with other REDCapR
calls. See the Read
a subset of records, conditioned on the values in some variables
section of REDCapR’s Basic
REDCapR Operations vignette.filter_logic
: this will leverage the observation values
to limit the rows returned. Because the underlying table does not index
the obs values, this will be less computationally efficient than the
options above.datetime_range_begin
&
datetime_range_end
: this will return only the records that
have been created or modified during the specified window.Note that the efficiency gain from moving from the block dataset to REDCapTidieR is different than the gain from moving from REDCapTidieR to REDCapR. When moving to from Table 5 to a REDCapTidieR Supertibble, you are eliminating empty cells that will never contain worthwhile data. When moving from a REDCapTidieR Supertibble call to a collection of REDCapR calls, you are eliminating cells that contain data, but may not be relevant to your analysis (such as a patient’s name or the time a lab specimen was collected). {This paragraph needs work.}
{TODO: Generalize the argument to longitudinal events}
pt |
repeat instrument |
repeat instance |
height |
weight |
bmi |
sbp |
dbp |
lab |
conc |
---|---|---|---|---|---|---|---|---|---|
1 | – | – | 1.0 | 11.0 | 111.0 | – | – | – | – |
1 | blood_pressure | 1 | – | – | – | 1.1 | 11.1 | – | – |
1 | blood_pressure | 2 | – | – | – | 1.2 | 11.2 | – | – |
1 | blood_pressure | 3 | – | – | – | 1.3 | 11.3 | – | – |
1 | laboratory | 3 | – | – | – | – | – | aa3 | 1.3 ppm |
1 | laboratory | 4 | – | – | – | – | – | aa4 | 1.4 ppm |
2 | – | – | 2.0 | 22.0 | 222.0 | – | – | – | – |
2 | blood_pressure | 1 | – | – | – | 2.1 | 22.1 | – | – |
2 | blood_pressure | 2 | – | – | – | 2.2 | 22.2 | – | – |
2 | blood_pressure | 3 | – | – | – | 2.3 | 22.3 | – | – |
2 | laboratory | 3 | – | – | – | – | – | bb3 | 2.3 ppm |
2 | laboratory | 4 | – | – | – | – | – | bb4 | 2.4 ppm |
If escalating to REDCapR didn’t decrease the duration enough, consider the strategy of calling REDCap only once per day (with either package) and saving the results to a secured disk. This will be efficient when the analyses require a large dataset, but not a real-time dataset.
In many retrospective projects, multiple analyses during the day can
reuse the same dataset retrieved the night before (with a function like
readr::write_rds()
).
This has helped our complicated investigations when multiple
statisticians frantically tackle overlapping aspects before a funder’s
deadline.
Essentially you are transferring security responsibilities from REDCap to the file server. To balance its advantages, the approach’s drawbacks include:
Packages for other programming languages have been developed that access REDCap’s API, such PyCap and PhpCap. Please see a more complete list at https://redcap-tools.github.io/projects/. The caching strategy described above may also benefit your investigation if:
it uses a language like SAS that does not support packages. We
strongly discourage deploying loose scripts that call the API. When
something is fixed in the original script, it must be distributed by the
developer to the ~50 client institutions, then it must by copied to the
~100 projects within each institution. In contrast, when a developer
updates a package distributed through a repository like CRAN, the client
needs to run only base::update.packages()
once per machine to update all its packages –not just REDCap packages.
Data scientists already do this regularly.
it uses a language like Julia that lacks a stable or updated package. The REDCap developers are always releasing useful features (e.g., repeated measures) that improve the quality and efficiency of your research. Leverage them.
your investigation incorporates multiple programming languages and you would like the analysis team to use a consistent dataset. In this scenario, we recommend the code save the same R dataset in multiple file formats, such as a csv file and a feather file.
For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.