The evidence base is thinner than it looks
Keystroke dynamics have emerged as one of the most promising modalities for passive cognitive assessment. Recent studies report impressive numbers:
But look closer.
Li et al. (2025) computed their AUC on the same 72-person sample used to fit a stepwise logistic regression. Kim et al. (2024) used 99 participants without cross-validation; a correction notice followed. The closest published study to what a modern keystroke-cognition instrument would actually capture (Meulemans, Van Waes, and Leijten 2022) was cross-sectional with n = 30.
The modality is real. The results are suggestive. But the field is drawing broad conclusions from samples that would not survive a single replication attempt.
The studies are on the wrong population
Keystroke timing can only reflect cognitive processes when typing is automatic. Below roughly 40 words per minute, variation in flight time is dominated by motor search and planning, not cognitive state. The cognitive signal is buried under motor variance that has nothing to do with neurodegeneration.
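The threshold above can be expressed as a simple session gate. The sketch below is a minimal illustration, not the method of any cited study: it assumes the conventional typing-test definition of a word as five characters, and the names `words_per_minute` and `meets_automaticity_threshold` are hypothetical.

```rust
/// Conventional typing-test definition: one "word" is five characters.
/// This convention is an assumption of the sketch, not taken from the text.
const CHARS_PER_WORD: f64 = 5.0;

/// Approximate automaticity threshold quoted in the text (about 40 WPM).
const AUTOMATICITY_WPM: f64 = 40.0;

/// Gross typing speed for one session, in words per minute.
/// Returns None when the session has zero duration.
fn words_per_minute(chars_typed: usize, duration_ms: u64) -> Option<f64> {
    if duration_ms == 0 {
        return None;
    }
    let minutes = duration_ms as f64 / 60_000.0;
    Some(chars_typed as f64 / CHARS_PER_WORD / minutes)
}

/// True when the session is fast enough that timing variation plausibly
/// reflects cognition rather than key search.
fn meets_automaticity_threshold(chars_typed: usize, duration_ms: u64) -> bool {
    words_per_minute(chars_typed, duration_ms).map_or(false, |wpm| wpm >= AUTOMATICITY_WPM)
}
```

Under these assumptions, 300 characters typed in one minute is 60 WPM and passes the gate, while 150 characters in one minute is 30 WPM and does not.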
Pinet et al. (2022) documented two qualitatively different processing architectures in typists. In novices, each keystroke requires conscious key location, finger selection, and movement execution. In experts, the motor layer is automated and the keystroke signal becomes transparent to cognitive processes: word retrieval, syntactic planning, coherence monitoring.
The population currently at risk for cognitive decline (ages 60-80, born 1944-1964) largely did not grow up typing.
Schematic model of generational typing proficiency. Proficiency labels are editorial estimates informed by cohort-level data (Cramer-Petersen et al. 2022, Dhakal et al. 2018, Pew Research 2019), not individually validated assessments. The 2010+ estimate is based on early reports of declining keyboard proficiency in touchscreen-native cohorts, not validated studies.
The people currently old enough to be at risk for cognitive decline are the wrong people to study this way. Motor inexperience masquerades as cognitive slowing. Fluent elderly typists are unrepresentative. And typing proficiency correlates with cognitive reserve, confounding the very construct the instrument is supposed to isolate.
The window is opening. And closing.
The demographic confound is self-resolving. By 2045, the 60-80 age range will consist almost entirely of lifelong fluent typists. The motor noise floor drops with each decade. The automaticity threshold will be met by default.
But this projection has a load-bearing assumption: that those cohorts are still producing unassisted text character-by-character at volume when they arrive.
The cohort arriving with typing fluency is the same cohort arriving with maximum AI-mediation exposure.
Autocomplete, predictive text, and AI-assisted drafting are not future threats. They are present features of how most people under 40 already write. The demographic shift that would have resolved the motor confound is being partially foreclosed by a technology shift operating on a faster timeline.
The cohort born 1985-1995 is the first generation that acquired typing fluency in adolescence and has spent its entire adult life to date producing unassisted text at volume. Its members will enter the cognitive decline screening window (ages 55-75) between 2040 and 2070. If personal baselines are not started on this cohort within the next 10-15 years, the opportunity is significantly degraded. No subsequent cohort will have spent decades producing unmediated text before AI assistance became the default input mode.
AI mediation doesn't add noise. It replaces the construct.
The distinction is between measurement error and construct invalidation.
Noise
The construct being measured is intact. The instrument captures it with some degree of error. More data helps. You can model and correct it.
Construct replacement
The measurement now corresponds to a different construct entirely. The surface form looks identical. The generating process is different. More data makes it worse.
When a person retrieves a word from memory and types it character by character, the timing data reflects lexical retrieval, orthographic encoding, and motor execution. When the same word appears because the person accepted a predictive text suggestion, the surface output is identical. The cognitive process is different: visual scanning, suggestion evaluation, acceptance execution.
This is empirically documented:
The contamination is invisible at the point of collection. No dataset records whether a word was typed or accepted from a suggestion. The distinguishing information exists only in process-level data that standard behavioral datasets do not collect.
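One way to picture the missing process-level information is an input record that carries its own provenance. The sketch below is illustrative only: the `Provenance` enum, the `InputEvent` struct, and both functions are hypothetical names, and it assumes the capture layer can tell a key press from an accepted suggestion or a paste.

```rust
/// How a span of characters entered the text. Hypothetical categories.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance {
    /// Produced one key press at a time.
    Typed,
    /// Inserted by accepting a predictive-text suggestion.
    SuggestionAccepted,
    /// Inserted from the clipboard.
    Pasted,
}

/// One insertion into the document, with the number of characters it added.
#[derive(Debug, Clone, Copy)]
struct InputEvent {
    chars: usize,
    provenance: Provenance,
}

/// Fraction of characters that did not come from character-by-character typing.
/// An empty session is reported as 0.0.
fn mediated_fraction(events: &[InputEvent]) -> f64 {
    let total: usize = events.iter().map(|e| e.chars).sum();
    if total == 0 {
        return 0.0;
    }
    let mediated: usize = events
        .iter()
        .filter(|e| e.provenance != Provenance::Typed)
        .map(|e| e.chars)
        .sum();
    mediated as f64 / total as f64
}

/// True only when every character in the session was typed directly.
fn is_unmediated(events: &[InputEvent]) -> bool {
    events.iter().all(|e| e.provenance == Provenance::Typed)
}
```

A dataset that stores only the final text cannot compute either function, which is the point of the paragraph above: the provenance field has to be recorded at collection time or it is gone.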
This makes the contamination boundary a research contribution, not an engineering decision. An instrument that enforces unmediated input, attests that enforcement cryptographically on every session record, and archives the attestation specification by version and commit hash, produces a clean cognitive corpus at exactly the moment when clean corpora are disappearing from the world. The value of that corpus increases as AI mediation becomes more pervasive, because the baseline it preserves becomes harder to replicate.
This has happened before
The clock-drawing test has been a standard neuropsychological screening tool for decades. Draw a clock face showing a specified time. It assesses visuospatial ability, executive function, and semantic memory in a single brief task.
Vishnevsky, Fisher, and Specktor (2024) demonstrated that Gen Z adults underperform on the clock-drawing test not because of cognitive impairment but because of reduced familiarity with analog clock faces.
The assessment embedded a hidden cultural competency that is no longer universal. Keyboards will eventually be displaced the same way. The instruments built on them need to be designed so the cognitive constructs being measured can migrate to new input modalities, not be permanently coupled to keystroke timing.
From "compared to healthy people" to "compared to yourself"
Every published keystroke-cognition study asks: does this person look impaired compared to a population norm? That is the only question available from a one-time lab visit.
Once you have someone typing fluently every day for years, you can ask a different question:
Does this person look different from themselves?
This shift is not novel. Ecological Momentary Assessment (Shiffman, Stone, and Hufford 2008), Experience Sampling Methods (Csikszentmihalyi and Larson 1987), and idiographic approaches (Molenaar 2004) all argue that group-level averages obscure the individual-level dynamics that matter clinically. Molenaar showed formally that most psychological processes are non-ergodic, so the structure of variation between individuals does not generally match the structure of variation within an individual: what is true of a population average is not necessarily true of any individual in that population. N-of-1 trial designs are now accepted by the FDA for rare disease contexts.
What these traditions lack is an instrument that can sustain daily ecological data collection over years in healthy populations. EMA studies typically last days to weeks. ESM relies on self-report prompted by random signals. Neither captures behavioral process data at the temporal resolution needed for cognitive signal extraction. The paradigm exists. The instrument does not.
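The difference between the two questions can be sketched numerically. The example below is a minimal illustration under stated assumptions: it uses a plain z-score against the person's own history, which is one simple choice among many and is not claimed to be the instrument's method, and the function names are hypothetical.

```rust
/// Arithmetic mean, or None for an empty slice.
fn mean(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        None
    } else {
        Some(xs.iter().sum::<f64>() / xs.len() as f64)
    }
}

/// Sample standard deviation (n - 1 denominator), or None for fewer than two values.
fn sample_std(xs: &[f64]) -> Option<f64> {
    if xs.len() < 2 {
        return None;
    }
    let m = mean(xs)?;
    let ss: f64 = xs.iter().map(|x| (x - m).powi(2)).sum();
    Some((ss / (xs.len() - 1) as f64).sqrt())
}

/// Standard score of `value` against a reference mean and spread.
fn z_score(value: f64, ref_mean: f64, ref_std: f64) -> Option<f64> {
    if ref_std > 0.0 {
        Some((value - ref_mean) / ref_std)
    } else {
        None
    }
}

/// "Does this person look different from themselves?"
/// Scores today's value against the person's own prior sessions.
fn self_referenced_z(history: &[f64], today: f64) -> Option<f64> {
    z_score(today, mean(history)?, sample_std(history)?)
}
```

With a hypothetical personal history of mean inter-key intervals `[100, 110, 90, 105, 95]` ms, a session at 120 ms scores about 2.53 standard deviations above the person's own baseline. Against a hypothetical population norm of mean 150 ms and standard deviation 40 ms, the same session scores -0.75 and looks unremarkable.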
The instrument gap
The current research landscape is fragmented into two silos that do not talk to each other.
Keystroke dynamics researchers
Capture timing data but ignore the content of what is typed. Can tell you flight time is elevated but not whether the writer was struggling to retrieve a specific word or planning a complex sentence.
Computational linguistics researchers
Analyze transcribed text but have no access to the temporal process that produced it. Can tell you vocabulary diversity is declining but not whether the decline reflects retrieval difficulty or avoidance.
The combination is not additive. It is multiplicative. A decline in lexical diversity concurrent with a decline in production fluency means something clinically different from a decline in diversity with stable fluency. The first suggests retrieval difficulty under cognitive load. The second suggests vocabulary contraction independent of production. You need both signal channels to distinguish them.
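The two-channel distinction can be sketched as a small decision table. This is an illustration only: type-token ratio stands in for lexical diversity as one simple, well-known measure, the threshold is a hypothetical parameter, and the `FluencyOnly` case is added for completeness and is not discussed in the text.

```rust
use std::collections::HashSet;

/// Type-token ratio: distinct words divided by total words, case-insensitive.
/// Returns None for text with no words.
fn type_token_ratio(text: &str) -> Option<f64> {
    let tokens: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if tokens.is_empty() {
        return None;
    }
    let types: HashSet<&str> = tokens.iter().map(|s| s.as_str()).collect();
    Some(types.len() as f64 / tokens.len() as f64)
}

/// Joint reading of the content channel and the process channel.
#[derive(Debug, PartialEq)]
enum Pattern {
    Stable,
    /// Diversity and fluency both declined: retrieval difficulty under load.
    RetrievalDifficulty,
    /// Diversity declined while fluency held: vocabulary contraction.
    VocabularyContraction,
    /// Fluency declined while diversity held (hypothetical extra case).
    FluencyOnly,
}

/// Classify changes relative to baseline. A channel counts as declined
/// when its change is at or below `-threshold`.
fn classify(diversity_change: f64, fluency_change: f64, threshold: f64) -> Pattern {
    let diversity_down = diversity_change <= -threshold;
    let fluency_down = fluency_change <= -threshold;
    match (diversity_down, fluency_down) {
        (true, true) => Pattern::RetrievalDifficulty,
        (true, false) => Pattern::VocabularyContraction,
        (false, true) => Pattern::FluencyOnly,
        (false, false) => Pattern::Stable,
    }
}
```

Either channel alone collapses two of these rows into one, which is why the combination carries more information than the sum of its parts.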
No existing instrument at validated scale combines process capture with content analysis, longitudinal self-referential baselines with same-day calibration, and unmediated input with modality-aware construct definitions. What each of these requires, and why satisfying five of six is not enough, is the subject of the remaining sections.
Why writing is a uniquely rich cognitive channel
Why writing? Why not speech, gait analysis, eye tracking, or any other behavioral modality?
Writing is the only common daily behavior that simultaneously requires lexical retrieval, syntactic planning, coherence monitoring, semantic integration, and fine motor execution. No other modality loads this many cognitive systems at once during a single naturalistic act.
Speech
Rich in prosody and lexical content but lacks the motor encoding channel. Articulatory planning is automated differently from keystroke execution. No character-level timing data. Harder to capture in private, ecologically valid contexts without introducing observer effects.
Gait / Motor
Strong evidence for motor biomarkers of neurodegeneration (Buchman and Bennett 2011). But gait captures motor and postural control without linguistic content. It cannot distinguish cognitive slowing from physical deconditioning without a second signal channel.
Eye tracking
Captures attentional allocation and reading fluency but not production. Measures how someone processes existing text, not how they generate new text. The generative act is where retrieval difficulty, planning breakdown, and coherence failure become visible.
Typed writing
Combines fine motor timing (keystroke dynamics), lexical production (word choice, vocabulary), syntactic structure (sentence complexity), semantic coherence (topic management), and temporal process data (pause architecture, revision patterns). All captured simultaneously from a single naturalistic act.
Pinet et al. (2022) demonstrated that in fluent typists, the motor layer becomes transparent to cognitive processes: keystroke timing variations reflect lexical retrieval difficulty, syntactic planning load, and coherence monitoring, not finger search. The typing is automatic. What remains in the signal is cognition. And unlike every other measurement modality, the act itself is independently valuable: people already write daily, in private, with no expectation of feedback. The measurement modality is also the retention mechanism.
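The timing features named above can be derived from raw key-down and key-up timestamps. The sketch below uses the common definitions (hold time is release minus press of the same key, flight time is the next press minus the previous release). The `KeyPress` struct and function names are hypothetical, and the pause threshold is left as a parameter because conventions vary.

```rust
/// One key press with its press and release timestamps in milliseconds.
#[derive(Debug, Clone, Copy)]
struct KeyPress {
    down_ms: f64,
    up_ms: f64,
}

/// Hold (dwell) time: how long each key stayed down.
fn hold_times(presses: &[KeyPress]) -> Vec<f64> {
    presses.iter().map(|k| k.up_ms - k.down_ms).collect()
}

/// Flight time: gap between releasing one key and pressing the next.
/// Can be negative when the next key goes down before the previous one is released.
fn flight_times(presses: &[KeyPress]) -> Vec<f64> {
    presses.windows(2).map(|w| w[1].down_ms - w[0].up_ms).collect()
}

/// Inter-key interval (IKI): press-to-press time between consecutive keys.
fn inter_key_intervals(presses: &[KeyPress]) -> Vec<f64> {
    presses.windows(2).map(|w| w[1].down_ms - w[0].down_ms).collect()
}

/// Number of intervals at or above a pause threshold.
fn count_pauses(ikis: &[f64], threshold_ms: f64) -> usize {
    ikis.iter().filter(|&&iki| iki >= threshold_ms).count()
}
```

For three hypothetical presses at (0, 80), (150, 220), and (2300, 2390) ms, the hold times are 80, 70, and 90 ms, the flight times are 70 and 2080 ms, and a 2000 ms threshold counts one pause.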
Existing instruments and what they measure
Several approaches to longitudinal behavioral and cognitive measurement exist. Each captures some of the signal channels an instrument would need. None combines them. The gaps are structural, not accidental.
AI-assisted journaling
Showing the participant what the system has learned creates a reflexive loop: the participant writes about the patterns they were shown, which the system detects as new patterns. The instrument alters the construct it measures.
Self-report mood tracking
Captures what the participant reports feeling, not how they are processing. Self-report is subject to recall bias, social desirability effects, and alexithymia. Gamification mechanics (streaks, scores) create engagement patterns that confound the signal.
Physiological wearables
Measures autonomic readiness, not cognitive process. Transparent scoring creates optimization behavior: participants modify their behavior to improve the score rather than the underlying state. The instrument changes what it measures (Goodhart's law).
Clinical cognitive batteries
A 45-minute assessment every 12 months cannot detect gradual trajectory shifts. Practice effects accumulate with repeated administration. These instruments diagnose after clinical threshold. They were not designed for early detection of pre-clinical drift.
Digital therapeutics
Woebot had the strongest clinical evidence of any digital mental health tool and ceased operations in June 2025. Scripted content libraries depreciate. They do not accumulate participant-specific longitudinal records.
Ephemeral writing tools
Closest to addressing the observer-effect problem. Entries fade or vanish, removing the feedback loop. But without data retention, no longitudinal analysis is possible. They solve the reactivity problem by abandoning the longitudinal one.
Alice
Architecturally satisfies the six requirements. Validation requires the longitudinal data the instrument is designed to accumulate. The gap between implemented and validated is the work that remains.
The structural pattern: every approach that accumulates participant data surfaces it back to the participant. Every approach that avoids the observer effect discards the data. No existing instrument combines longitudinal accumulation with participant-blind measurement. The combination is architecturally difficult because the standard model for justifying sustained participant engagement is showing them what you've learned. An instrument that learns but does not show must find a different mechanism for retention.
The observer effect is not a design preference. It is a validity requirement.
If a participant knows what is being measured, the measurement is no longer valid.
The Hawthorne effect (Adair 1984; McCambridge, Witton, and Elbourne 2014) is well-established: participants who know they are being observed modify their behavior. In longitudinal behavioral measurement, this is not a minor confound. It is fatal to the construct. If a participant knows their writing speed is being measured, they write differently. If they know lexical diversity is being tracked, they reach for unusual words. The signal no longer reflects natural cognitive process. It reflects performance.
Surfacing computed signals creates a second, deeper problem: optimization. Goodhart's law applies directly. Once the participant sees a metric, they optimize for the metric rather than the state the metric was designed to detect. Etkin (2016, Journal of Consumer Research) demonstrated across six experiments that personal quantification increases behavioral output but reduces enjoyment and intrinsic motivation. A participant who sees a decline in their processing speed metric becomes anxious, and the anxiety alters the very signals being measured. The instrument creates the condition it was designed to detect.
The clinical extreme is documented. Rosman et al. (2021, NIH-funded) reported a patient with atrial fibrillation who performed 916 smartwatch ECGs in one year after the device began surfacing cardiac data. Ambiguous readings ("inconclusive") produced the same behavioral response as actual arrhythmia detections. The patient developed illness anxiety disorder, made 12 unnecessary clinic and emergency department visits, and required cognitive behavioral therapy for remission. The authors note anecdotal reports from institutions nationwide, suggesting the case represents "the tip of an iceberg."
The problem compounds in clinical populations. Self-censorship in journaling is most acute in populations with PTSD, trauma histories, and anxiety disorders. The people who would produce the most clinically valuable writing data are the most likely to alter their writing when they know it is being analyzed. An instrument that surfaces its analysis guarantees that the participants who need it most will either censor themselves or stop using it.
This is not paternalism. It is the same logic that prevents a blood pressure cuff from showing readings during an ambulatory monitoring protocol. The measurement must not alter the measured state. The participant's relationship to the instrument must be with the surface (the question, the writing practice) while the measurement operates underneath, invisible.
Requirements for instrument validity
The preceding analysis constrains the design. A valid instrument for longitudinal cognitive measurement through naturalistic writing must satisfy all of the following simultaneously. Satisfying five of six produces a flawed instrument, not a slightly less good one.
Why build this as a product people choose to use, rather than a research tool distributed through clinics? Because longitudinal studies that rely on institutional recruitment cannot sustain daily participation over years. Retention in traditional EMA studies drops sharply after days to weeks. An instrument that produces valid longitudinal data must be something people want to use for its own sake. Ecological validity and sustained retention are not product decisions. They are methodological requirements that happen to produce something that looks like a product.
The instrument can validate itself
External-criterion validity requires correlating extracted features with clinical outcomes. That requires longitudinal data that does not yet exist. But there is a second validity question that is answerable now: are the measurements informationally sufficient?
Reconstruction validity (Guzzardo 2026c) tests this through adversarial synthesis. Build the strongest possible statistical reconstruction of a writing session from the instrument's own measurements. Text from the person's vocabulary and transition probabilities. Timing from their motor fingerprint. Revision from their deletion profile. Feed the synthetic session back through the same signal pipeline. Compare the extracted signals to those from real sessions, dimension by dimension.
Where the reconstruction matches reality, the instrument captures that dimension. Where it diverges, the gap is diagnostic.
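The dimension-by-dimension comparison can be sketched as follows. This is an assumption-laden illustration, not the residual definition from Guzzardo 2026c: it takes the residual to be the absolute difference between the real and synthetic signal, scaled by a per-dimension spread, and every name and value here is hypothetical.

```rust
/// One signal dimension measured on a real session and on its reconstruction.
#[derive(Debug, Clone, Copy)]
struct Dimension {
    name: &'static str,
    real: f64,
    synthetic: f64,
    /// Typical spread of this signal across real sessions, used for scaling.
    scale: f64,
}

/// Scaled gap between the real and reconstructed value.
/// Returns None when the scale is not positive.
fn residual(d: &Dimension) -> Option<f64> {
    if d.scale > 0.0 {
        Some((d.real - d.synthetic).abs() / d.scale)
    } else {
        None
    }
}

/// The dimension where the reconstruction diverges most from the real session.
fn largest_gap(dims: &[Dimension]) -> Option<(&'static str, f64)> {
    dims.iter()
        .filter_map(|d| residual(d).map(|r| (d.name, r)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

A small residual says the statistical reconstruction already reproduces that dimension. A large one marks where the real session contains structure the reconstruction could not generate.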
A single reconstruction invites a follow-up: maybe the generator is just weak. To close that objection, the instrument runs five adversary variants on every session. Each one adds exactly one statistical improvement to the ghost. If the improvement closes the residual, that component of the gap was statistical, not cognitive. Whatever remains after the strongest ghost is the irreducible floor.
Reconstruction fidelity by signal family (schematic). The motor residual dominates across all five adversary variants. Better timing (AR(1) serial dependence, Gaussian copula hold-flight coupling) and better text (variable-order PPM) each close their targeted gaps without collapsing the motor floor. Full results in Guzzardo 2026c.
PPM text generation (adaptive context depth up to order 5) closes the semantic gap further than the baseline Markov chain. Content structure is statistically compressible. The text axis and the timing axis are independent in the measurement.
Preserving IKI serial dependence (AR(1)), coupling hold and flight times (Gaussian copula), improving text generation (PPM), and combining all three: the motor floor holds across every variant. Distributional equivalence is not behavioral equivalence. Motor sequences in genuine composition are coupled to cognitive state. This is where the mind shows in the measurements.
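The claim that distributional equivalence is not behavioral equivalence can be shown with a toy case. The sketch below computes lag-1 autocorrelation, the quantity an AR(1) model preserves, for two hypothetical interval sequences that contain exactly the same values in different orders. The data and names are illustrative and are not drawn from the instrument.

```rust
/// Hypothetical inter-key intervals (ms) that drift slowly up and back down.
const DRIFTING: [f64; 10] = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 140.0, 130.0, 120.0, 110.0];

/// The same ten values in an alternating order.
const REORDERED: [f64; 10] = [100.0, 150.0, 110.0, 140.0, 120.0, 130.0, 130.0, 120.0, 140.0, 110.0];

/// Lag-1 autocorrelation: how strongly each value predicts the next one.
/// Returns None for fewer than two values or a constant sequence.
fn lag1_autocorrelation(xs: &[f64]) -> Option<f64> {
    if xs.len() < 2 {
        return None;
    }
    let mean = xs.iter().sum::<f64>() / xs.len() as f64;
    let denom: f64 = xs.iter().map(|x| (x - mean).powi(2)).sum();
    if denom == 0.0 {
        return None;
    }
    let num: f64 = xs.windows(2).map(|w| (w[0] - mean) * (w[1] - mean)).sum();
    Some(num / denom)
}

/// True when two sequences hold the same values regardless of order,
/// so that their marginal distributions are identical.
fn same_distribution(a: &[f64], b: &[f64]) -> bool {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    a == b
}
```

Here `same_distribution(&DRIFTING, &REORDERED)` is true, yet the lag-1 autocorrelation is roughly 0.61 for the drifting sequence and roughly -0.72 for the reordered one. Any summary that looks only at the distribution of intervals cannot tell the two apart.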
The motor residual is larger on journal questions than on calibration questions. When the question demands more cognitive engagement, the person-reconstruction gap widens. A purely biomechanical residual would not vary with question type. It does.
Every reconstruction residual is a reproducible artifact. The PRNG seed, motor profile snapshot, and corpus integrity hash are stored alongside each residual. Any future build of the instrument can regenerate the identical ghost and verify the stored residual to bit identity. This is not a design aspiration; it is a verified property, demonstrated on production data across all six signal families (dynamical, motor, process, semantic, cross-session, and behavioral state).
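The reproducibility property can be illustrated in miniature. The sketch below assumes nothing about the instrument's actual generator: it uses SplitMix64, a small well-known PRNG, as a stand-in, and `synth_intervals` and `bit_identical` are hypothetical names. The point is only that a stored seed plus deterministic arithmetic regenerates the same floating-point values bit for bit.

```rust
/// SplitMix64, used here as a stand-in deterministic generator.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Regenerate a synthetic interval sequence (ms) from a stored seed.
fn synth_intervals(seed: u64, n: usize, min_ms: f64, max_ms: f64) -> Vec<f64> {
    let mut rng = SplitMix64::new(seed);
    (0..n)
        .map(|_| min_ms + rng.next_f64() * (max_ms - min_ms))
        .collect()
}

/// Compare two sequences by raw bit pattern rather than approximate equality.
fn bit_identical(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

Calling `synth_intervals` twice with the same seed yields sequences for which `bit_identical` returns true, while a different seed does not. Comparing with `to_bits` rather than a tolerance is what makes the check a bit-identity check.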
The semantic measurement pipeline runs against a self-hosted, archivable embedding model (Qwen3-Embedding-0.6B, Apache 2.0, weights archived by SHA-256 hash, FP32 CPU inference verified bit-reproducible). The vector geometry of every embedding in the corpus can be reproduced from the archived weights and the documented inference environment. See methods specification.
This framework also provides a direct empirical response to Condrey (2026a), who proved that keystroke timing alone cannot distinguish composition from transcription. Five reconstructions are timing-calibrated and meaning-absent. The instrument's full signal set distinguishes all five from real sessions. It captures the content-process binding that Condrey's result says timing-only instruments cannot detect. The multi-adversary system turns that from a single data point into a surface. The surface's shape is the validity evidence.
The methodological commitments described here are not standard practice in writing-process measurement research. Archivable model weights with SHA-256 provenance. Versioned inference environments with verified bit-reproducibility. Cryptographic contamination attestation on every session record. A signal engine whose output is identical across builds, verified by CI on every code change. Alice treats these as preconditions. The cohort artifact argument depends on them being met today: instruments built without long-horizon reproducibility cannot establish the baselines that future validation studies will need.
Theoretical extensions
The argument on this page addresses one modality (typed writing) and one measurement context (longitudinal cognitive assessment). Two broader frameworks extend the argument beyond this domain.
The construct replacement problem is not specific to typing. AI mediation is simultaneously altering the cognitive processes underlying speech (voice assistants restructure request formulation), spatial navigation (GPS replaces cognitive map formation), and decision-making (recommendation engines replace evaluative reasoning). Each modality has its own research silo documenting effects. Guzzardo (2026d) argues from information theory that the loss of process-level cognitive data is mathematically irreversible: the artifact is a lossy compression of the process that produced it, and lossy compression is one-way. No future technology can recover what was discarded. Four design constraints emerge consistently across modalities: unassisted input, process-level capture, longitudinal intra-individual baselines, and attachment to intrinsically motivated practices. The six instrument requirements on this page are a domain-specific instantiation of those four general constraints.
Open questions and limitations
Alice implements this instrument design and has verified its architectural commitments: bit-reproducible signal computation, cryptographically attested contamination boundary, archived embedding methodology with pinned weights and deterministic inference, and a fully operational measurement pipeline computing more than 100 signals across six families (dynamical, motor, process, semantic, cross-session, behavioral state) with 41-dimension reconstruction residuals organized by theoretical family. The instrument has been accumulating data since April 2026. The empirical questions that remain are about longitudinal validation, generalization beyond n=1, sustained engagement without gamification, and modality migration. The following problems remain open.
The baselines need to be accumulating now, so that when the validation studies become possible, the longitudinal records already exist. The gap between "implemented" and "validated" is the work that remains.
Epistemic status: This page presents a working thesis drawn from published literature and two preprints by the author. The cited empirical findings are real. The synthesis and instrument design are proposed, not validated.
Papers and tools
The research program behind Alice produces versioned papers, open-source tools, and empirical results. Each paper addresses a different facet of the same problem. The tool makes the methodology available to other instruments.
A framework for validating behavioral measurement instruments via adversarial reconstruction. Any instrument that extracts features from temporal behavioral streams can define its signal pipeline, and the crate builds the strongest statistical reconstruction it can, then reports where the reconstruction fails. The residual surface is the validity evidence.
cargo add reconstruction-validity

use reconstruction_validity::{Session, FiveVariant};

let session = Session::from_keystroke_stream(&events)?;
let residuals = session.run_adversaries(FiveVariant::default())?;
for r in &residuals {
    println!("{}: motor={:.4}", r.variant, r.motor_residual);
}

baseline: motor=0.4721
conditional: motor=0.4318
copula_motor: motor=0.4156
ppm_text: motor=0.4689
full_adversary: motor=0.3947

Five adversary variants, each adding one statistical improvement. The motor floor that survives the full adversary is the irreducible gap between statistical reconstruction and genuine composition. Implements the methodology from Guzzardo 2026c.