The challenges of implementing hybrid baselines for the interpretation of longitudinal behavioral data from individuals
npj | digital medicine
Comment
Published in partnership with Seoul National University Bundang Hospital
https://doi.org/10.1038/s41746-026-02668-5
Sandra Anna Just, Enrico Tedeschi, Einar Holsbø, Karl Øyvind Mikalsen,
Lars Ailo Bongo, Philipp Homan & Brita Elvevåg
Establishing whether observed behavioral
differences reflect meaningful change in an
individual necessitates baselines specific to the
individual and task. Automated hybrid solutions
combine adaptive baselines with fixed thresholds.
Applying this approach to behavioral science
harbors challenges: the pronounced gap between
observable measurement and underlying construct
means ground truth is typically unavailable. A
stepwise framework is proposed to determine and
evaluate the validity of baselines for longitudinal
behavioral measurements.
as baselines conflates variability between individuals with variability within
an individual and mischaracterizes intra-individual differences and putative
change3. Moreover, static group-level norms ignore temporal variability,
which is essential for distinguishing routine fluctuations from unusual
patterns in a time series5. Reliable interpretation of longitudinal behavioral
data, therefore, requires baselines derived from an individual’s prior measurements rather than group averages6. This principle applies broadly across
behavioral research, including clinical trials. For example, although psychiatric assessment is often criterion-based (i.e., is a symptom present or
not?), changes in symptom severity must still be interpreted relative to an
individual’s baseline psychopathology. To illustrate, a patient who
experiences auditory hallucinations multiple times per day has a
fundamentally different baseline than one who hears voices only a few times
per month. A change to experiencing hallucinations once a week would
represent improvement for the former but worsening for the latter. Individual baselines are therefore essential for accurate interpretation of
symptom trajectories in clinical research.
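The hallucination example can be made concrete with a minimal sketch. The weekly counts, the function names `change_from_baseline` and `interpret`, and the choice of a simple mean of prior measurements as the baseline are all illustrative assumptions, not a method specified in this Comment.

```python
from statistics import mean


def change_from_baseline(prior_counts, new_count):
    """Difference between a new observation and the mean of the person's prior observations."""
    return new_count - mean(prior_counts)


def interpret(delta):
    """For a symptom count, fewer events than the person's baseline is read as improvement."""
    if delta < 0:
        return "improvement"
    if delta > 0:
        return "worsening"
    return "no change"


# Hypothetical weekly hallucination counts (assumed values, for illustration only).
frequent_prior = [21, 24, 20, 22]  # roughly three episodes per day
rare_prior = [1, 0, 1, 0]          # roughly two episodes per month

new_count = 1  # both patients now report one episode per week

frequent_result = interpret(change_from_baseline(frequent_prior, new_count))
rare_result = interpret(change_from_baseline(rare_prior, new_count))

print(frequent_result)
print(rare_result)
```

The same observed value is labeled differently for the two patients because each is compared against that person's own history. A real baseline would likely also need to account for measurement noise and temporal dependence, which this sketch leaves out.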
The necessity for data-driven individual baselines in behavioral science
Group-level norms are not suitable for the interpretation of longitudinal data from individuals
Behavioral science domains such as psychology have overwhelmingly been
built around group-level inference, aiming to understand inter-individual
differences. The field lacks a comparable tradition of treating intra-individual
change as a primary object of interpretation. This gap is significant because
determining how a person changes over time requires different conceptualizations and statistical approaches than determining how individuals
differ from one another at a given time1,2. Group-level norms help quantify
how a single, cross-sectional measurement compares to a reference population. However, these norms are not appropriate when evaluating intra-individual change in longitudinal behavioral measurements3. For example,
when an individual’s memory performance is assessed repeatedly over time,
group-level norms are useful to indicate how task performance compares to a
group average at each single measurement. They are not useful for understanding how the individual’s measurements relate to each other over time,
how the measurements are augmented – and potentially transformed – by
practice or learning effects, or when they represent a change.
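The distinction between the two reference frames can be sketched numerically. In the example below, the group norm (`GROUP_MEAN`, `GROUP_SD`), the scores, and the helper names `group_z` and `individual_z` are hypothetical and chosen only for illustration. The sketch compares a new score against a group norm and against the same person's prior scores.

```python
from statistics import mean, stdev

# Assumed group norm for a hypothetical memory task (illustrative values).
GROUP_MEAN = 100.0
GROUP_SD = 15.0


def group_z(score):
    """Standardize a score against the group norm (cross-sectional position)."""
    return (score - GROUP_MEAN) / GROUP_SD


def individual_z(prior_scores, score):
    """Standardize a score against the person's own prior scores (within-person position)."""
    return (score - mean(prior_scores)) / stdev(prior_scores)


# Hypothetical person whose scores sit well above the group average.
prior_scores = [130, 128, 131, 129, 130]
new_score = 115

z_group = group_z(new_score)
z_individual = individual_z(prior_scores, new_score)

print(f"relative to group norm:   z = {z_group:.2f}")
print(f"relative to own history:  z = {z_individual:.2f}")
```

Against the group norm the new score still looks above average, whereas against the person's own history it is a large drop. This simple standardization does not model practice or learning effects, so it should be read as a sketch of the contrast between reference frames rather than as a recommended analysis.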
Intra-individual measurements taken over time need to be interpreted
relative to a baseline. In behavioral science, group averages are unsuitable as
individual baselines. Treating a group average as an individual baseline
assumes ergodicity, whereby individual-level statistical properties equal
those observed at the group level. This assumption likely does not hold for
most behavioral processes4. As shown in Fig. 1 (a), the use of group averages
npj Digital Medicine | (2026)9:331
Longitudinal behavioral data can now be collected at an unprecedented density and frequency – through remote, digitized, or automated assessments7–10. These measurements produce time-series
datasets, whose value lies in the trajectories and dependencies they
reveal rather than single measurements. Moreover, as one task alone is
rarely used to infer human behavior, measurements are collected
across different assays. Such repeated and combined observations
contain rich information about intra-individual patterns and dependencies, but also present methodological challenges that cannot be
addressed using group-level norms. The challenge is that the sheer
volume and density of repeated measurements transform the very
nature of the data being collected. There are two reasons for this. First,
the experimental process becomes more familiar to participants
over time: they learn what is expected of them and may perform better. Indeed, most cognitive and behavioral tasks are associated
with familiarity, learning, and practice effects. Second, as the weeks,
months, and years of assessment go by, participants will inevitably be
required to engage in behavioral tasks whose stimulus material may
overlap with that of tasks they have previously completed. The observed
effects will be the result of a combination of effects (e.g., multiplicative
and task transfer effects). Thus, as data collection continues, the
datasets may become messier and more complex, making it harder to
discern what is attributable specifically to the task construct
rather than to other variables.
Fig. 1 | Hybrid baselines for the interpretation of longitudinal behavioral data.
The four panels show hypothetical line charts of longitudinal measurements. (a–c)
show the trajectory of raw scores from two individuals for the same behavioral
measure, while (d) shows the trajectory of raw scores from one person in two
different measures. (a) Group-average baselines ignore different individual averages
and trajectories. Scores from person A (purple) and person B (brown) are plotted
alongside a group baseline that represents the group’s average (yellow) with a defined
uncertainty range (shaded light yellow area). Person A shows an individual average
of scores far above the group baseline. Based on that baseline, the sudden decline in
their last measurement is missed as it is still considered ‘above average’. In comparison, scores from person B decline more gradually but are flagged as they lie below
the group-average baseline. (b) Individual baselines support interpretation of
longitudinal data. Plotted scores are identical to (a), but now each person has an
individual baseline (baseline 1 and 2). While both baselines start off at the group
average, they continuously adapt to the person’s measurements. In contrast to (a),
person A’s last measurement is now flagged as it falls below their adapted individual-average baseline (...truncated)