Alice -- Research

The evidence base is thinner than it looks

Keystroke dynamics have emerged as one of the most promising modalities for passive cognitive assessment. Recent studies report impressive numbers:

0.918 AUC for Alzheimer's screening via writing-process biomarkers (Li et al. 2025)

0.997 AUC for MCI discrimination from smartphone keystroke timing (Kim et al. 2024)

But look closer.

<500 total participants across all published keystroke-cognition studies targeting neurodegeneration.

0 studies with held-out test sets or external validation cohorts among the highest-reported results.

Li et al. (2025) computed their AUC on the same 72-person sample used to fit a stepwise logistic regression. Kim et al. (2024) used 99 participants without cross-validation; a correction notice followed. The closest published study to what a modern keystroke-cognition instrument would actually capture (Meulemans, Van Waes, and Leijten 2022) was cross-sectional with n = 30.
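Why in-sample AUC is untrustworthy can be shown with pure noise. The sketch below is an illustration, not a reanalysis of either study. It selects the best-looking of 200 uninformative features on a 72-person sample, then scores that choice on fresh participants. The feature count, seed, and function names are assumptions made for the example, and simple feature selection stands in for stepwise regression.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def noise_cohort(rng, n, n_features):
    """Balanced labels plus features that carry no information about them."""
    labels = [i % 2 for i in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]
    return features, labels

def oriented(a):
    """AUC after allowing the sign of the feature to be flipped."""
    return max(a, 1 - a)

rng = random.Random(0)   # arbitrary seed
N_FEATURES = 200         # hypothetical number of candidate features

# "Model selection": keep the feature (and sign) that looks best in-sample.
train_x, train_y = noise_cohort(rng, 72, N_FEATURES)
train_aucs = [auc(f, train_y) for f in train_x]
best = max(range(N_FEATURES), key=lambda j: oriented(train_aucs[j]))
sign = 1 if train_aucs[best] >= 0.5 else -1
in_sample = oriented(train_aucs[best])

# Every feature is pure noise, so a fresh draw stands in for the selected
# feature measured on new participants.
test_x, test_y = noise_cohort(rng, 2000, 1)
held_out = auc([sign * v for v in test_x[0]], test_y)

print(f"in-sample AUC: {in_sample:.3f}  held-out AUC: {held_out:.3f}")
```

The in-sample figure lands well above chance even though no feature carries signal, while the held-out figure stays near 0.5. This is the gap that a held-out test set or external cohort exists to expose.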

The modality is real. The results are suggestive. But the field is drawing broad conclusions from results that might not survive a single replication attempt.

The studies are on the wrong population

Keystroke timing can only reflect cognitive processes when typing is automatic. Below roughly 40 words per minute, variation in flight time is dominated by motor search and planning, not cognitive state. The cognitive signal is buried under motor variance that has nothing to do with neurodegeneration.
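A minimal sketch of how an automaticity gate might be applied, assuming key-press timestamps in milliseconds and the common convention of five characters per word. The function names and the use of gross session speed are assumptions. A real gate would more likely measure speed within uninterrupted bursts or on a copy task, since pauses for thought lower gross speed.

```python
FLUENCY_THRESHOLD_WPM = 40.0  # the approximate threshold named in the text

def words_per_minute(press_times_ms):
    """Gross typing speed, counting five characters as one word."""
    if len(press_times_ms) < 2:
        return 0.0
    minutes = (press_times_ms[-1] - press_times_ms[0]) / 60000.0
    if minutes <= 0:
        return 0.0
    return (len(press_times_ms) / 5.0) / minutes

def is_fluent(press_times_ms, threshold=FLUENCY_THRESHOLD_WPM):
    """True when the session clears the (hypothetical) automaticity gate."""
    return words_per_minute(press_times_ms) >= threshold

# Usage: one key every 200 ms clears the gate; one key every 600 ms does not.
fast_session = [i * 200 for i in range(300)]
slow_session = [i * 600 for i in range(100)]
print(is_fluent(fast_session), is_fluent(slow_session))
```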

Pinet et al. (2022) documented two qualitatively different processing architectures in typists. In novices, each keystroke requires conscious key location, finger selection, and movement execution. In experts, the motor layer is automated and the keystroke signal becomes transparent to cognitive processes: word retrieval, syntactic planning, coherence monitoring.

The population currently at risk for cognitive decline (ages 60-80, born 1944-1964) largely did not grow up typing.

Birth cohort   First regular computer use   Expected motor noise
1940-1955      Age 35-55 (1990s)            High
1955-1965      Age 25-40 (1990s)            Moderate
1965-1980      Age 15-30 (1990s)            Variable
1980-1995      Childhood                    Low
1995-2010      Birth (keyboard native)      Minimal
2010+          Touchscreen-first            Rising

Schematic model of generational typing proficiency. Proficiency labels are editorial estimates informed by cohort-level data (Cramer-Petersen et al. 2022, Dhakal et al. 2018, Pew Research 2019), not individually validated assessments. The 2010+ estimate is based on early reports of declining keyboard proficiency in touchscreen-native cohorts, not validated studies.

The people currently old enough to be at risk for cognitive decline are the wrong people to study this way. Motor inexperience masquerades as cognitive slowing. Fluent elderly typists are unrepresentative. And typing proficiency correlates with cognitive reserve, confounding the very construct the instrument is supposed to isolate.

The window is opening. And closing.

The demographic confound is self-resolving. By 2045, the 60-80 age range will consist almost entirely of lifelong fluent typists. The motor noise floor drops with each decade. The automaticity threshold will be met by default.

But this projection has a load-bearing assumption: that those cohorts are still producing unassisted text character-by-character at volume when they arrive.

The cohort arriving with typing fluency is the same cohort arriving with maximum AI-mediation exposure.

Autocomplete, predictive text, and AI-assisted drafting are not future threats. They are present features of how most people under 40 already write. The demographic shift that would have resolved the motor confound is being partially foreclosed by a technology shift operating on a faster timeline.

The cohort born 1985-1995 is the first generation that acquired typing fluency in adolescence, has spent their entire adult lives producing unassisted text at volume, and will enter the cognitive decline screening window (ages 55-75) between 2040 and 2070. If personal baselines are not started on this cohort within the next 10-15 years, the opportunity is significantly degraded. No subsequent cohort will have spent decades producing unmediated text before AI assistance became the default input mode.

AI mediation doesn't add noise. It replaces the construct.

The distinction is between measurement error and construct invalidation.

Noise

The construct being measured is intact. The instrument captures it with some degree of error. More data helps. You can model and correct it.

A bathroom scale that reads two pounds heavy. The construct (weight) is unchanged.

Construct replacement

The measurement now corresponds to a different construct entirely. The surface form looks identical. The generating process is different. More data makes it worse.

A bathroom scale on a trampoline during an earthquake. The readings are precise. They are precisely wrong.

When a person retrieves a word from memory and types it character by character, the timing data reflects lexical retrieval, orthographic encoding, and motor execution. When the same word appears because the person accepted a predictive text suggestion, the surface output is identical. The cognitive process is different: visual scanning, suggestion evaluation, acceptance execution.

This is empirically documented:

Predictive text shifts production from generation to selection, reducing lexical variety independently of vocabulary size (Arnold, Chauncey, and Gajos 2020).

Autocomplete shifts pause distributions from lexical-retrieval patterns to suggestion-evaluation patterns (Banovic et al. 2019).

Phrase-level suggestions alter content, not just speed. Writers restructure their message to accommodate what the system offers (Buschek, Zurn, and Eiber 2021).

AI-assisted writing enhances individual quality while reducing collective diversity. Different writers converge toward similar outputs (Doshi and Hauser 2024).

After seven days of AI-assisted idea generation, creativity dropped and content homogeneity continued climbing even after the AI was removed. The "creative scar" persists beyond the mediation (Zhou and Liu 2025).

The contamination is invisible at the point of collection. No dataset records whether a word was typed or accepted from a suggestion. The distinguishing information exists only in process-level data that standard behavioral datasets do not collect.

This makes the contamination boundary a research contribution, not an engineering decision. An instrument that enforces unmediated input, attests that enforcement cryptographically on every session record, and archives the attestation specification by version and commit hash, produces a clean cognitive corpus at exactly the moment when clean corpora are disappearing from the world. The value of that corpus increases as AI mediation becomes more pervasive, because the baseline it preserves becomes harder to replicate.
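One way such a per-session attestation could be built is sketched below with Python's standard library. This is not Alice's actual scheme. The field names, the HMAC-SHA256 construction, and the canonical JSON encoding are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

def canonical_bytes(record):
    """Stable serialization so the same record always hashes the same way."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def attest(session, boundary_version, commit_hash, audited_paths, key):
    """Bind a session record to the boundary spec and code version that produced it."""
    attestation = {
        "boundary_version": boundary_version,   # hypothetical field names throughout
        "commit_hash": commit_hash,
        "audited_paths": sorted(audited_paths),
        "session_digest": hashlib.sha256(canonical_bytes(session)).hexdigest(),
    }
    tag = hmac.new(key, canonical_bytes(attestation), hashlib.sha256).hexdigest()
    return {**attestation, "tag": tag}

def verify(session, attestation, key):
    """Reject the record if either the session or the attested metadata changed."""
    claimed = {k: v for k, v in attestation.items() if k != "tag"}
    digest = hashlib.sha256(canonical_bytes(session)).hexdigest()
    if claimed.get("session_digest") != digest:
        return False
    expected = hmac.new(key, canonical_bytes(claimed), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, attestation.get("tag", ""))

# Usage with made-up values.
key = b"example-signing-key"
session = {"text": "hello", "iki_ms": [120, 95, 140, 110]}
att = attest(session, "v1", "0123abc", ["editor/input"], key)
print(verify(session, att, key))
```

An HMAC only proves integrity to parties holding the key. An instrument that needs third-party verifiability would likely use public-key signatures instead.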

This has happened before

The clock-drawing test has been a standard neuropsychological screening tool for decades. Draw a clock face showing a specified time. It assesses visuospatial ability, executive function, and semantic memory in a single brief task.

Vishnevsky, Fisher, and Specktor (2024) demonstrated that Gen Z adults underperform on the clock-drawing test not because of cognitive impairment but because of reduced familiarity with analog clock faces.

The assessment embedded a hidden cultural competency that is no longer universal. Keyboards will eventually be displaced the same way. The instruments built on them need to be designed so the cognitive constructs being measured can migrate to new input modalities, not be permanently coupled to keystroke timing.

From "compared to healthy people" to "compared to yourself"

Every published keystroke-cognition study asks: does this person look impaired compared to a population norm? That is the only question available from a one-time lab visit.

Once you have someone typing fluently every day for years, you can ask a different question:

Does this person look different from themselves?

Cross-sectional to longitudinal: Thousands of data points per person over years, not a single snapshot.

Population-normed to self-referential: Comparison to your own history, not a healthy average. Eliminates confounds of education, occupation, and baseline ability.

Task-constrained to ecologically valid: Data from naturalistic daily writing, not artificial lab tasks.

Diagnosis to early detection: "Your trajectory shifted six months ago" rather than "you score below the cutoff today."

This shift is not novel. Ecological Momentary Assessment (Shiffman, Stone, and Hufford 2008), Experience Sampling Methods (Csikszentmihalyi and Larson 1987), and idiographic approaches (Molenaar 2004) all argue that group-level averages obscure the individual-level dynamics that matter clinically. Molenaar demonstrated formally that psychological processes are generally non-ergodic: the structure of variation between individuals need not match the structure of variation within any one individual, so what is true of a population average is not necessarily true of any individual in that population. N-of-1 trial designs are now accepted by the FDA for rare disease contexts.
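The gap between population-level and person-level structure can be made concrete with a toy example. The data below are invented: five hypothetical typists measured over five days. Across people, faster typists make fewer errors. Within each person, faster days bring more errors.

```python
def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Invented data: habitual speed (WPM) per person, and day-to-day offsets.
skill_levels = [40, 50, 60, 70, 80]
daily_offsets = [-4, -2, 0, 2, 4]

people = []
for skill in skill_levels:
    speeds = [skill + d for d in daily_offsets]
    # Skilled typists err less overall, but rushing on a given day adds errors.
    errors = [10 - 0.1 * skill + 0.5 * d for d in daily_offsets]
    people.append((speeds, errors))

between = pearson([mean(s) for s, _ in people], [mean(e) for _, e in people])
within = [pearson(s, e) for s, e in people]

print(f"between-person r = {between:+.2f}")
print("within-person r  =", [f"{r:+.2f}" for r in within])
```

The between-person correlation is negative and every within-person correlation is positive. A population norm built from the first would point in the wrong direction for every individual in it.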

What these traditions lack is an instrument that can sustain daily ecological data collection over years in healthy populations. EMA studies typically last days to weeks. ESM relies on self-report prompted by random signals. Neither captures behavioral process data at the temporal resolution needed for cognitive signal extraction. The paradigm exists. The instrument does not.

The instrument gap

The current research landscape is fragmented into two silos that do not talk to each other.

Keystroke dynamics researchers

Capture timing data but ignore the content of what is typed. Can tell you flight time is elevated but not whether the writer was struggling to retrieve a specific word or planning a complex sentence.

Computational linguistics researchers

Analyze transcribed text but have no access to the temporal process that produced it. Can tell you vocabulary diversity is declining but not whether the decline reflects retrieval difficulty or avoidance.

The combination is not additive. It is multiplicative. A decline in lexical diversity concurrent with a decline in production fluency means something clinically different from a decline in diversity with stable fluency. The first suggests retrieval difficulty under cognitive load. The second suggests vocabulary contraction independent of production. You need both signal channels to distinguish them.
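The two-channel reading above can be written as a small decision rule. This is a sketch under stated assumptions: both inputs are z-scores against the person's own baseline, the cutoff of -1.5 is arbitrary, and the labels are candidate interpretations rather than validated clinical categories.

```python
def interpret(diversity_z, fluency_z, threshold=-1.5):
    """Combine a content signal and a process signal into one candidate reading.

    Both arguments are hypothetical within-person z-scores; more negative
    means a larger decline relative to that person's own history.
    """
    diversity_down = diversity_z <= threshold
    fluency_down = fluency_z <= threshold
    if diversity_down and fluency_down:
        return "retrieval difficulty under cognitive load"
    if diversity_down:
        return "vocabulary contraction independent of production"
    if fluency_down:
        return "fluency decline only"
    return "within baseline"

# Usage: the same diversity drop reads differently depending on fluency.
print(interpret(-2.0, -2.1))
print(interpret(-2.0, 0.2))
```

Either channel alone would return the same answer for both calls. The distinction only exists when both are available.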

No existing instrument at validated scale combines process capture with content analysis, longitudinal self-referential baselines with same-day calibration, and unmediated input with modality-aware construct definitions. What each of these requires, and why satisfying five of six is not enough, is the subject of the remaining sections.

Why writing is a uniquely rich cognitive channel

Why writing? Why not speech, gait analysis, eye tracking, or any other behavioral modality?

Writing is the only common daily behavior that simultaneously requires lexical retrieval, syntactic planning, coherence monitoring, semantic integration, and fine motor execution. No other modality loads this many cognitive systems at once during a single naturalistic act.

Speech

Rich in prosody and lexical content but lacks the motor encoding channel. Articulatory planning is automated differently from keystroke execution. No character-level timing data. Harder to capture in private, ecologically valid contexts without introducing observer effects.

Gait / Motor

Strong evidence for motor biomarkers of neurodegeneration (Buchman and Bennett 2011). But gait captures motor and postural control without linguistic content. It cannot distinguish cognitive slowing from physical deconditioning without a second signal channel.

Eye tracking

Captures attentional allocation and reading fluency but not production. Measures how someone processes existing text, not how they generate new text. The generative act is where retrieval difficulty, planning breakdown, and coherence failure become visible.

Typed writing

Combines fine motor timing (keystroke dynamics), lexical production (word choice, vocabulary), syntactic structure (sentence complexity), semantic coherence (topic management), and temporal process data (pause architecture, revision patterns). All captured simultaneously from a single naturalistic act.

Pinet et al. (2022) demonstrated that in fluent typists, the motor layer becomes transparent to cognitive processes: keystroke timing variations reflect lexical retrieval difficulty, syntactic planning load, and coherence monitoring, not finger search. The typing is automatic. What remains in the signal is cognition. And unlike every other measurement modality, the act itself is independently valuable: people already write daily, in private, with no expectation of feedback. The measurement modality is also the retention mechanism.

Existing instruments and what they measure

Several approaches to longitudinal behavioral and cognitive measurement exist. Each captures some of the signal channels an instrument would need. None combines them. The gaps are structural, not accidental.

AI-assisted journaling

Rosebud, Reflection, Life Note, Mindsera

Strengths: Learns from user history; content analysis

Gaps: Surfaces analysis back to participant; no keystroke-level process capture; feedback loop contaminates behavioral signal

Showing the participant what the system has learned creates a reflexive loop: the participant writes about the patterns they were shown, which the system detects as new patterns. The instrument alters the construct it measures.

Self-report mood tracking

Daylio, Bearable, How We Feel

Strengths: Longitudinal architecture

Gaps: Self-reported, not behavioral; no process-level data; gamification introduces measurement artifacts

Captures what the participant reports feeling, not how they are processing. Self-report is subject to recall bias, social desirability effects, and alexithymia. Gamification mechanics (streaks, scores) create engagement patterns that confound the signal.

Physiological wearables

WHOOP, Oura, Apple Watch, Exist.io

Strengths: Dense longitudinal data; personal baselines

Gaps: Cardiovascular, not cognitive; surfaces all data to participant; no writing-process signals

Measures autonomic readiness, not cognitive process. Transparent scoring creates optimization behavior: participants modify their behavior to improve the score rather than the underlying state. The instrument changes what it measures (Goodhart's law).

Clinical cognitive batteries

MoCA, MMSE, clock-drawing, computerized batteries

Strengths: Validated constructs; clinical utility

Gaps: Cross-sectional only; task-constrained, not ecological; cannot sustain daily administration

A 45-minute assessment every 12 months cannot detect gradual trajectory shifts. Practice effects accumulate with repeated administration. These instruments diagnose after clinical threshold. They were not designed for early detection of pre-clinical drift.

Digital therapeutics

Woebot (discontinued 2025), Wysa, Headspace, Calm

Strengths: Daily engagement architecture

Gaps: Journaling is a bolt-on, not the primary modality; scripted interactions, not generative; no behavioral signal computation

Woebot had the strongest clinical evidence of any digital mental health tool and ceased operations in June 2025. Scripted content libraries depreciate. They do not accumulate participant-specific longitudinal records.

Ephemeral writing tools

Drift, Halka, Presently

Strengths: No data surfaced to participant

Gaps: No learning from accumulated history; no signal computation; no data retention

Closest to addressing the observer-effect problem. Entries fade or vanish, removing the feedback loop. But without data retention, no longitudinal analysis is possible. They solve the reactivity problem by abandoning the longitudinal one.

Alice

Strengths: Process-level keystroke capture (IKI, hold time, flight time, burst architecture); dual-channel measurement (behavioral signals + semantic signals, jointly analyzed); longitudinal intra-individual baselines (running distribution, topic-matched z-scoring, minimum-n gating); participant-blind measurement (no signals, traits, or computed metrics surfaced); contamination boundary attestation (versioned spec, git commit hash, audited code paths per session); archived reproducible methodology (bit-reproducible signal engine, archived embedding weights, methods provenance journal)

Gaps: Longitudinal validation against clinical outcomes; corpus depth beyond n=1; cross-modality migration (keyboard to voice, gesture, or neural input)

Architecturally satisfies the six requirements. Validation requires the longitudinal data the instrument is designed to accumulate. The gap between implemented and validated is the work that remains.
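For readers unfamiliar with the process-level features named above, a minimal sketch of how they are commonly derived from raw key events follows. The event structure and the 2-second burst boundary are assumptions for illustration, not Alice's definitions.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    key: str
    press_ms: float
    release_ms: float

def timing_features(events, burst_pause_ms=2000.0):
    """Derive hold times, inter-key intervals, flight times, and burst lengths."""
    # Hold time: how long each key stays down.
    holds = [e.release_ms - e.press_ms for e in events]
    # IKI: press-to-press interval between consecutive keys.
    ikis = [b.press_ms - a.press_ms for a, b in zip(events, events[1:])]
    # Flight time: release of one key to press of the next
    # (negative when the next key goes down before the previous one is released).
    flights = [b.press_ms - a.release_ms for a, b in zip(events, events[1:])]
    # Bursts: runs of keystrokes separated by pauses longer than the boundary.
    bursts = []
    current = 1 if events else 0
    for gap in ikis:
        if gap > burst_pause_ms:
            bursts.append(current)
            current = 1
        else:
            current += 1
    if current:
        bursts.append(current)
    return {"hold": holds, "iki": ikis, "flight": flights, "burst_lengths": bursts}

# Usage: two keys, a long pause, then two more keys.
events = [
    KeyEvent("h", 0, 80),
    KeyEvent("i", 150, 220),
    KeyEvent(" ", 3000, 3070),
    KeyEvent("y", 3120, 3200),
]
features = timing_features(events)
print(features)
```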

The structural pattern: every approach that accumulates participant data surfaces it back to the participant. Every approach that avoids the observer effect discards the data. No existing instrument combines longitudinal accumulation with participant-blind measurement. The combination is architecturally difficult because the standard model for justifying sustained participant engagement is showing them what you've learned. An instrument that learns but does not show must find a different mechanism for retention.

The observer effect is not a design preference. It is a validity requirement.

If a participant knows what is being measured, the measurement is no longer valid.

The Hawthorne effect (Adair 1984; McCambridge, Witton, and Elbourne 2014) is well-established: participants who know they are being observed modify their behavior. In longitudinal behavioral measurement, this is not a minor confound. It is fatal to the construct. If a participant knows their writing speed is being measured, they write differently. If they know lexical diversity is being tracked, they reach for unusual words. The signal no longer reflects natural cognitive process. It reflects performance.

Surfacing computed signals creates a second, deeper problem: optimization. Goodhart's law applies directly. Once the participant sees a metric, they optimize for the metric rather than the state the metric was designed to detect. Etkin (2016, Journal of Consumer Research) demonstrated across six experiments that personal quantification increases behavioral output but reduces enjoyment and intrinsic motivation. A participant who sees a decline in their processing speed metric becomes anxious, and the anxiety alters the very signals being measured. The instrument creates the condition it was designed to detect.

The clinical extreme is documented. Rosman et al. (2021, NIH-funded) reported a patient with atrial fibrillation who performed 916 smartwatch ECGs in one year after the device began surfacing cardiac data. Ambiguous readings ("inconclusive") produced the same behavioral response as actual arrhythmia detections. The patient developed illness anxiety disorder, made 12 unnecessary clinic and emergency department visits, and required cognitive behavioral therapy for remission. The authors note anecdotal reports from institutions nationwide, suggesting the case represents "the tip of an iceberg."

The problem compounds in clinical populations. Self-censorship in journaling is most acute in populations with PTSD, trauma histories, and anxiety disorders. The people who would produce the most clinically valuable writing data are the most likely to alter their writing when they know it is being analyzed. An instrument that surfaces its analysis guarantees that the participants who need it most will either censor themselves or stop using it.

This is not paternalism. It is the same logic that prevents a blood pressure cuff from showing readings during an ambulatory monitoring protocol. The measurement must not alter the measured state. The participant's relationship to the instrument must be with the surface (the question, the writing practice) while the measurement operates underneath, invisible.

Requirements for instrument validity

The preceding analysis constrains the design. A valid instrument for longitudinal cognitive measurement through naturalistic writing must satisfy all of the following simultaneously. Satisfying five of six produces a flawed instrument, not a slightly less good one.

Why build this as a product people choose to use, rather than a research tool distributed through clinics? Because longitudinal studies that rely on institutional recruitment cannot sustain daily participation over years. Retention in traditional EMA studies drops sharply after days to weeks. An instrument that produces valid longitudinal data must be something people want to use for its own sake. Ecological validity and sustained retention are not product decisions. They are methodological requirements that happen to produce something that looks like a product.

Ecological validity The writing task must be intrinsically meaningful, not a lab exercise. Participants must want to do it for its own sake, daily, for years. A task perceived as "testing" triggers performance behavior and contaminates the signal.

Unmediated input No autocomplete, predictive text, or AI assistance during the writing task. Any mediation replaces the cognitive construct being measured (Section 4). The input environment must guarantee that every character reflects a discrete motor and cognitive act. Alice implements this as Contamination Boundary v1: each session record carries a cryptographic attestation of the audited code paths active at session-write time and the git commit hash of the running codebase. The boundary is a versioned specification. Any future writing surface that could mediate input requires an explicit boundary version bump and fresh audit before sessions from that path are admitted to the corpus.

Dual-channel capture Both keystroke-level temporal data (process) and submitted text (content) must be captured and analyzed jointly. Neither channel alone can distinguish retrieval difficulty from vocabulary contraction, or cognitive slowing from motor fatigue (Section 7).

Intra-individual baselines Deviation must be measured against the participant's own history, not a population norm. Calibration sessions provide within-person same-day controls that account for transient state variation (fatigue, illness, mood). Alice's semantic baseline infrastructure is operational across fourteen semantic signals (idea density, lexical sophistication, epistemic stance, integrative complexity, deep cohesion, referential cohesion, emotional valence arc, compression ratio, discourse coherence global, discourse coherence local, global/local coherence ratio, coherence decay slope, and NRC emotion densities), with running distribution stores, topic-matched z-scoring via embedding similarity, and minimum-n gating. Longitudinal trajectory analysis is deferred for data depth. See methods provenance.

Participant-blind measurement Computed signals must never be surfaced to the participant. Awareness of measurement alters the measured behavior (Hawthorne). Awareness of specific metrics triggers optimization (Goodhart). The instrument must learn from accumulated data without creating a feedback loop.

Retention without gamification Sustained daily engagement over years cannot rely on streaks, scores, or dashboards, as these are measurement artifacts. The instrument must generate sufficient intrinsic value through the practice itself. Improving question quality from accumulated response history is one mechanism.
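The intra-individual baseline requirement can be sketched as a running distribution with minimum-n gating. The class below uses Welford's online algorithm. The class name, the default gate of 30 sessions, and the choice to return None below the gate are assumptions, and topic matching is omitted for brevity.

```python
import math

class RunningBaseline:
    """Per-person, per-signal running mean and variance (Welford's algorithm)."""

    def __init__(self, min_n=30):
        self.min_n = min_n   # hypothetical gate: no z-score until enough history
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0       # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new session value into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def z_score(self, x):
        """Deviation from the person's own history, or None if gated."""
        if self.n < max(self.min_n, 2):
            return None
        sd = math.sqrt(self._m2 / (self.n - 1))
        if sd == 0:
            return None
        return (x - self.mean) / sd

# Usage: the gate stays closed until five sessions have accumulated.
baseline = RunningBaseline(min_n=5)
for value in [10.0, 12.0, 14.0, 16.0]:
    baseline.update(value)
gated = baseline.z_score(20.0)
baseline.update(18.0)
ungated = baseline.z_score(20.0)
print(gated, ungated)
```

Gating matters because a z-score computed from a handful of sessions has an unstable denominator. Returning nothing is safer than returning a number that looks like a deviation.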

The instrument can validate itself

External-criterion validity requires correlating extracted features with clinical outcomes. That requires longitudinal data that does not yet exist. But there is a second validity question that is answerable now: are the measurements informationally sufficient?

Reconstruction validity (Guzzardo 2026c) tests this through adversarial synthesis. Build the strongest possible statistical reconstruction of a writing session from the instrument's own measurements. Text from the person's vocabulary and transition probabilities. Timing from their motor fingerprint. Revision from their deletion profile. Feed the synthetic session back through the same signal pipeline. Compare the extracted signals to those from real sessions, dimension by dimension.

Where the reconstruction matches reality, the instrument captures that dimension. Where it diverges, the gap is diagnostic.

A single reconstruction invites a follow-up: maybe the generator is just weak. To close that objection, the instrument runs five adversary variants on every session. Each one adds exactly one statistical improvement to the ghost (the synthetic reconstruction of the session). If the improvement closes the residual, that component of the gap was statistical, not cognitive. Whatever remains after the strongest ghost is the irreducible floor.
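The logic of adding one improvement and checking whether the residual closes can be illustrated on a single signal. The sketch below is not the published pipeline. It compares an independent-draw ghost with an AR(1) ghost on lag-1 autocorrelation of log inter-key intervals. The "real" session is itself simulated, and all names and parameters are assumptions.

```python
import math
import random

def lag1_autocorr(xs):
    m = sum(xs) / len(xs)
    den = sum((x - m) ** 2 for x in xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    return num / den if den else 0.0

def fit_profile(ikis_ms):
    """Summarize a session: log-space mean, spread, and serial dependence."""
    logs = [math.log(x) for x in ikis_ms]
    mean = sum(logs) / len(logs)
    sd = (sum((v - mean) ** 2 for v in logs) / len(logs)) ** 0.5
    return {"mean": mean, "sd": sd, "phi": lag1_autocorr(logs), "n": len(logs)}

def ghost(profile, seed, serial_dependence):
    """Synthetic IKIs matching the profile; AR(1) coupling is the one added improvement."""
    rng = random.Random(seed)
    phi = profile["phi"] if serial_dependence else 0.0
    innov_sd = profile["sd"] * math.sqrt(max(0.0, 1.0 - phi * phi))
    dev = rng.gauss(0, profile["sd"])
    out = []
    for _ in range(profile["n"]):
        out.append(math.exp(profile["mean"] + dev))
        dev = phi * dev + rng.gauss(0, innov_sd)
    return out

def residual(real, synthetic):
    """Gap between real and ghost on one signal (lag-1 autocorrelation of log IKI)."""
    signal = lambda xs: lag1_autocorr([math.log(x) for x in xs])
    return abs(signal(real) - signal(synthetic))

# Simulated stand-in for a real session: serially dependent IKIs around 150 ms.
sim = random.Random(7)
dev, real_session = 0.0, []
for _ in range(2000):
    dev = 0.6 * dev + sim.gauss(0, 0.3)
    real_session.append(math.exp(math.log(150.0) + dev))

profile = fit_profile(real_session)
r_baseline = residual(real_session, ghost(profile, seed=1, serial_dependence=False))
r_ar1 = residual(real_session, ghost(profile, seed=1, serial_dependence=True))
print(f"baseline ghost residual: {r_baseline:.3f}  AR(1) ghost residual: {r_ar1:.3f}")
```

Here the AR(1) variant closes the gap because the simulated session contains nothing but statistical structure. The claim in the text is that on genuine composition a motor residual remains after every such improvement.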

Reconstruction fidelity by signal family (schematic). The motor residual dominates across all five adversary variants. Better timing (AR(1) serial dependence, Gaussian copula hold-flight coupling) and better text (variable-order PPM) each close their targeted gaps without collapsing the motor floor. Full results in Guzzardo 2026c.

Content residual Small and converging

PPM text generation (adaptive context depth up to order 5) closes the semantic gap further than the baseline Markov chain. Content structure is statistically compressible. The text axis and the timing axis are independent in the measurement.

Motor residual Five adversary strategies. None collapse the floor.

Preserving IKI serial dependence (AR(1)), coupling hold and flight times (Gaussian copula), improving text generation (PPM), and combining all three: the motor floor holds across every variant. Distributional equivalence is not behavioral equivalence. Motor sequences in genuine composition are coupled to cognitive state. This is where the mind shows in the measurements.

Cognitive load signature The falsification test: does the residual vary with cognitive demand?

The motor residual is larger on journal questions than on calibration questions. When the question demands more cognitive engagement, the person-reconstruction gap widens. A purely biomechanical residual would not vary with question type. It does.

Every reconstruction residual is a reproducible artifact. The PRNG seed, motor profile snapshot, and corpus integrity hash are stored alongside each residual. Any future build of the instrument can regenerate the identical ghost and verify the stored residual to bit identity. This is not a design aspiration; it is a verified property, demonstrated on production data across all six signal families (dynamical, motor, process, semantic, cross-session, and behavioral state).
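A minimal sketch of what verification to bit identity could look like, assuming the ghost is a deterministic function of a stored seed. The generator, record fields, and the single mean-IKI residual are hypothetical stand-ins. The motor profile snapshot and corpus integrity hash mentioned above are omitted for brevity.

```python
import hashlib
import random
import struct

def generate_ghost(seed, n):
    """Hypothetical stand-in generator: n seeded pseudo-random IKIs in ms."""
    rng = random.Random(seed)
    return [80.0 + 200.0 * rng.random() for _ in range(n)]

def bits(values):
    """Exact IEEE-754 little-endian bytes, so comparison is bit-level, not approximate."""
    return struct.pack(f"<{len(values)}d", *values)

def make_record(seed, n, real_mean_iki):
    """Store what is needed to regenerate and check the residual later."""
    ghost = generate_ghost(seed, n)
    residual = real_mean_iki - sum(ghost) / n
    return {
        "seed": seed,
        "n": n,
        "ghost_sha256": hashlib.sha256(bits(ghost)).hexdigest(),
        "residual_hex": bits([residual]).hex(),
    }

def verify_record(record, real_mean_iki):
    """Regenerate the ghost from the seed and compare hash and residual bit for bit."""
    ghost = generate_ghost(record["seed"], record["n"])
    residual = real_mean_iki - sum(ghost) / record["n"]
    return (
        hashlib.sha256(bits(ghost)).hexdigest() == record["ghost_sha256"]
        and bits([residual]).hex() == record["residual_hex"]
    )

# Usage with made-up values.
record = make_record(seed=42, n=500, real_mean_iki=175.0)
print(verify_record(record, 175.0))
```

Seeded generation is only reproducible when the generator and its numerical environment are pinned as well, which is why the seed alone is not sufficient provenance.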

The semantic measurement pipeline runs against a self-hosted, archivable embedding model (Qwen3-Embedding-0.6B, Apache 2.0, weights archived by SHA-256 hash, FP32 CPU inference verified bit-reproducible). The vector geometry of every embedding in the corpus can be reproduced from the archived weights and the documented inference environment. See methods specification.
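Checking archived weights against a recorded SHA-256 hash takes only a few lines with the standard library. The function names and the chunked-read approach are illustrative assumptions. The file path and expected hash would come from the archive's own records.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def weights_match(path, expected_hex):
    """True only if the file on disk is byte-identical to the archived weights."""
    return sha256_file(path) == expected_hex.strip().lower()
```

Running this check before every embedding job would ensure that a silently updated model file cannot change the vector geometry of the corpus.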

This framework also provides a direct empirical response to Condrey (2026a), who proved that keystroke timing alone cannot distinguish composition from transcription. All five reconstructions are timing-calibrated and meaning-absent. The instrument's full signal set distinguishes all five from real sessions. It captures the content-process binding that Condrey's result says timing-only instruments cannot detect. The multi-adversary system turns that from a single data point into a surface. The surface's shape is the validity evidence.

The methodological commitments described here are not standard practice in writing-process measurement research. Archivable model weights with SHA-256 provenance. Versioned inference environments with verified bit-reproducibility. Cryptographic contamination attestation on every session record. A signal engine whose output is identical across builds, verified by CI on every code change. Alice treats these as preconditions. The cohort artifact argument depends on them being met today: instruments built without long-horizon reproducibility cannot establish the baselines that future validation studies will need.

Theoretical extensions

The argument on this page addresses one modality (typed writing) and one measurement context (longitudinal cognitive assessment). Two broader frameworks extend the argument beyond this domain.

The construct replacement problem is not specific to typing. AI mediation is simultaneously altering the cognitive processes underlying speech (voice assistants restructure request formulation), spatial navigation (GPS replaces cognitive map formation), and decision-making (recommendation engines replace evaluative reasoning). Each modality has its own research silo documenting effects. Guzzardo (2026d) argues from information theory that the loss of process-level cognitive data is mathematically irreversible: the artifact is a lossy compression of the process that produced it, and lossy compression is one-way. No future technology can recover what was discarded. Four design constraints emerge consistently across modalities: unassisted input, process-level capture, longitudinal intra-individual baselines, and attachment to intrinsically motivated practices. The six instrument requirements on this page are a domain-specific instantiation of those four general constraints.

Open questions and limitations

Alice implements this instrument design and has verified its architectural commitments: bit-reproducible signal computation, cryptographically attested contamination boundary, archived embedding methodology with pinned weights and deterministic inference, and a fully operational measurement pipeline computing more than 100 signals across six families (dynamical, motor, process, semantic, cross-session, behavioral state) with 41-dimension reconstruction residuals organized by theoretical family. The instrument has been accumulating data since April 2026. The empirical questions that remain are about longitudinal validation, generalization beyond n=1, sustained engagement without gamification, and modality migration. The following problems remain open.

Validation No longitudinal keystroke-cognition study with validated outcomes exists at any sample size. External-criterion validation requires longitudinal data that does not yet exist. A complementary approach, reconstruction validity (Guzzardo 2026c), tests whether the instrument's measurements are informationally sufficient by reconstructing the behavior they were extracted from. The reconstruction residual characterizes what the instrument captures and what it does not, without requiring an external criterion. This is computable from n=1 today. External validation must still come from longitudinal outcome data.

Signal-to-construct mapping The mapping between keystroke timing features and cognitive constructs (lexical retrieval, executive function, processing speed) is hypothesized from the psycholinguistic literature, not validated in a longitudinal context. These mappings are candidates, not established biomarkers.

Retention Whether question quality from accumulated history is sufficient to sustain daily engagement over years without gamification is unknown. n=1 cannot answer this. The mechanism is theoretically grounded (personalization increases perceived value) but empirically untested at scale.

Modality migration If keyboard input is displaced by voice, gesture, or neural interfaces, the keystroke-specific signals become invalid. The cognitive constructs must be defined independently of the input modality so measurement can migrate. This is a design goal, not a solved problem.

Ethics of silent measurement Participant-blind measurement requires informed consent about the existence of measurement without disclosure of specific metrics. The ethical framework for this exists in ambulatory monitoring research but has not been applied to keystroke-level cognitive assessment. The boundaries need explicit articulation.

The baselines need to be accumulating now, so that when the validation studies become possible, the longitudinal records already exist. The gap between "implemented" and "validated" is the work that remains.

Epistemic status: This page presents a working thesis drawn from published literature and two preprints by the author. The cited empirical findings are real. The synthesis and instrument design are proposed, not validated.

Anthony Guzzardo Software engineer. Background in behavioral data systems and measurement infrastructure. Builder of Alice. The arguments on this page stand on published literature. The instrument is the author's attempt to act on them before the window described above closes.

Papers and tools

The research program behind Alice produces versioned papers, open-source tools, and empirical results. Each paper addresses a different facet of the same problem. The tool makes the methodology available to other instruments.

Fri Apr 24 · v1

Calibration Delta Methodology: Within-Person Provocation Analysis

A within-person matched-pair design compares behavioral and linguistic signals between reflective journal sessions and neutral calibration sessions. Initial screening of 68 signals across four...

Fri Apr 24 · v1

Calibration Delta Replication Plan

Fri Apr 24 · v1

Calibration Design Recommendation

Fri Apr 24 · v2

Signal Partitioning by Reconstruction Fidelity in a Longitudinal Keystroke Instrument

A reconstruction adversary framework partitions behavioral signals by their dependence on distributional versus structural properties of writing. Synthetic writing sessions generated from a person's...

Thu Apr 23 · v3

A Closing Window: The Demographic Confound in Keystroke-Based Cognitive Biomarkers and the AI-Mediation Threat to the Paradigm That Would Replace It

Keystroke dynamics have emerged as a promising modality for passive cognitive assessment, but the keystroke-cognition studies targeting neurodegeneration identified in this review have drawn their...

Tue Apr 21 · v5

Reconstruction Validity: Self-Validation of Process-Level Behavioral Instruments via Adversarial Synthesis

Every behavioral measurement instrument implicitly claims that its extracted features preserve meaningful information about the person who produced the behavior. The standard test of this claim is...

Mon Apr 20 · v2

Construct Replacement: When AI-Mediated Input Invalidates Behavioral Measurement

The behavioral and cognitive sciences depend on observing what humans do. The validity of any behavioral measurement rests on the assumption that observed behavior reflects the cognitive process the...

Mon Apr 20 · v1

Irreversible Loss: An Information-Theoretic Argument for Process-Level Cognitive Preservation

Every historical figure whose artifacts survive is a person whose cognitive process data is permanently lost. The loss is not a failure of archiving. It is a consequence of information theory: the...

reconstruction-validity

Rust · MIT license · GitHub

A framework for validating behavioral measurement instruments via adversarial reconstruction. Any instrument that extracts features from temporal behavioral streams can define its signal pipeline, and the crate builds the strongest statistical reconstruction it can, then reports where the reconstruction fails. The residual surface is the validity evidence.

Install

cargo add reconstruction-validity

Usage

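use reconstruction_validity::{Session, FiveVariant};

// `events` is the caller's recorded keystroke event stream.
// The `?` operators assume an enclosing function that returns a Result.
let session = Session::from_keystroke_stream(&events)?;

// Run the five adversary variants and collect their residuals.
let residuals = session.run_adversaries(FiveVariant::default())?;

// Print the motor residual reported for each variant.
for r in &residuals {
    println!("{}: motor={:.4}", r.variant, r.motor_residual);
}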

Output

baseline:        motor=0.4721
conditional:     motor=0.4318
copula_motor:    motor=0.4156
ppm_text:        motor=0.4689
full_adversary:  motor=0.3947

Five adversary variants, each adding one statistical improvement. The motor floor that survives the full adversary is the irreducible gap between statistical reconstruction and genuine composition. Implements the methodology from Guzzardo 2026c.
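To read the sample output above: the motor residual falls from 0.4721 at baseline to 0.3947 under the full adversary, an absolute drop of 0.0774, or roughly 16% of the baseline value. The helper below is a hypothetical convenience for that arithmetic and is not part of the crate's API.

```rust
/// Absolute and relative reduction in a residual when moving from a
/// baseline reconstruction to a stronger one.
/// Returns (absolute_drop, drop_as_fraction_of_baseline).
fn residual_reduction(baseline: f64, stronger: f64) -> (f64, f64) {
    let absolute = baseline - stronger;
    (absolute, absolute / baseline)
}
```

Called as `residual_reduction(0.4721, 0.3947)`, this gives the 0.0774 drop quoted above. The remaining 0.3947 is the floor that the closing paragraph describes.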