Longitudinal keystroke-cognition data from ecological daily writing.
Alice is a running measurement instrument that captures dual-channel process data (keystroke timing + linguistic content) from daily naturalistic writing, with intra-individual baselines, same-day calibration, and participant-blind measurement. No existing instrument produces this dataset. We are looking for research partners to help validate it.
What the instrument captures
Key-down and key-up timestamps at millisecond precision. Full temporal microstructure: inter-key intervals, hold durations, flight times, pause architecture, deletion sequences, burst segmentation. Captured from unmediated character-by-character input. No autocomplete, no predictive text. Null model: a five-variant reconstruction adversary that generates synthetic keystroke streams from the participant's statistical profile and measures the residual.
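To make the timing terms concrete, here is a minimal sketch of how hold durations, inter-key intervals, flight times, and pause-delimited bursts can be derived from key-down and key-up timestamps. It is not Alice's Rust implementation. The `KeyEvent` type, the function names, and the 2000 ms pause threshold are assumptions made for illustration, and the definitions are the conventional ones from keystroke dynamics (down-to-down for inter-key interval, up-to-next-down for flight time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyEvent:
    """One keypress (hypothetical record layout, timestamps in ms)."""
    key: str
    down_ms: float
    up_ms: float


def timing_features(events):
    """Return hold, inter-key, and flight times for a keystroke stream."""
    events = sorted(events, key=lambda e: e.down_ms)
    pairs = list(zip(events, events[1:]))
    return {
        # Hold duration: how long each key stays pressed.
        "hold": [e.up_ms - e.down_ms for e in events],
        # Inter-key interval: key-down to the next key-down.
        "iki": [b.down_ms - a.down_ms for a, b in pairs],
        # Flight time: key-up to the next key-down.
        # Negative when the next key goes down before release (rollover).
        "flight": [b.down_ms - a.up_ms for a, b in pairs],
    }


def segment_bursts(events, pause_threshold_ms=2000.0):
    """Split a stream into bursts wherever the inter-key interval
    exceeds the pause threshold (threshold value is an assumption)."""
    events = sorted(events, key=lambda e: e.down_ms)
    bursts, current = [], []
    for ev in events:
        if current and ev.down_ms - current[-1].down_ms > pause_threshold_ms:
            bursts.append(current)
            current = []
        current.append(ev)
    if current:
        bursts.append(current)
    return bursts


stream = [
    KeyEvent("h", 0.0, 80.0),
    KeyEvent("i", 150.0, 220.0),
    KeyEvent(" ", 3000.0, 3070.0),
]
features = timing_features(stream)
bursts = segment_bursts(stream)
```

In this toy stream the 2850 ms gap before the space exceeds the threshold, so the stream splits into two bursts.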
Final submitted text. Deterministic linguistic analysis: idea density, lexical sophistication, epistemic stance, integrative complexity, cohesion, emotional valence, text compression ratio. NRC emotion word densities. MATTR vocabulary diversity. Null model: longitudinal self-referencing baselines, topic-matched against prior sessions via archived embedding model (Qwen3-Embedding-0.6B, SHA-256 identified weights, FP32 CPU inference, bit-reproducible).
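Two of the content measures above are simple enough to sketch. MATTR (moving-average type-token ratio) averages the type-token ratio over every fixed-length window of tokens, which makes it less sensitive to text length than a plain type-token ratio. The compression ratio below is one plausible reading of "text compression ratio" (compressed bytes over raw bytes, using zlib). The tokenizer, the 50-token window, and the choice of zlib are assumptions, not the pipeline's documented settings.

```python
import re
import zlib


def tokenize(text):
    """Lowercase word tokenizer (a simplification for illustration)."""
    return re.findall(r"[a-z']+", text.lower())


def mattr(tokens, window=50):
    """Moving-average type-token ratio over sliding windows.

    Falls back to the plain type-token ratio when the text is
    no longer than one window.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def compression_ratio(text):
    """Compressed size divided by raw size; lower means more redundant."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


sample = "The cat sat on the mat and the cat sat on the hat."
diversity = mattr(tokenize(sample), window=5)
redundancy = compression_ratio(sample)
```

Both functions are deterministic: the same text always yields the same values, which is the property the content channel depends on.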
The two channels are analyzed jointly. A decline in lexical diversity with stable production fluency means vocabulary contraction. The same decline with slowed production means retrieval difficulty. You need both to distinguish them. Each channel has its own null model and its own reproducibility guarantee, under a matched standard of mathematical rigor.
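The joint reading described above can be sketched as a small decision rule over deviations from a participant's own baseline. This is an illustration of the logic only: the z-score formulation, the -1.5 threshold, and the function names are assumptions, and the labels are candidate interpretations rather than validated findings.

```python
from statistics import mean, stdev


def z_against_baseline(value, baseline):
    """Deviation of today's value from the participant's own history,
    in units of that history's standard deviation."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline sessions")
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance")
    return (value - mean(baseline)) / spread


def interpret(diversity_z, fluency_z, threshold=-1.5):
    """Combine the two channels (threshold is a hypothetical cutoff).

    diversity_z: lexical diversity relative to baseline (content channel)
    fluency_z:   production fluency relative to baseline (process channel);
                 lower values mean slower production
    """
    diversity_down = diversity_z <= threshold
    fluency_down = fluency_z <= threshold
    if diversity_down and not fluency_down:
        return "vocabulary contraction"
    if diversity_down and fluency_down:
        return "retrieval difficulty"
    return "no diversity decline"


diversity_history = [0.70, 0.72, 0.71, 0.69, 0.73]
today_z = z_against_baseline(0.65, diversity_history)
reading = interpret(today_z, fluency_z=0.1)
```

With either channel alone, the first two outcomes would be indistinguishable, since both show the same drop in diversity.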
Signal pipeline
Six signal families computed per session. Native Rust engine for numerical estimation. All methods are cited, with validation status documented on the methodology page.
Current state
What partnership looks like
Anonymized longitudinal process and content data under a data sharing agreement. Full signal pipeline output. Raw keystroke streams available for custom feature extraction.
If your research requires specific features not in the current pipeline, we can implement them. The architecture is designed for extensibility. Every signal family, numerical and linguistic, runs in a single Rust engine; no LLM is invoked anywhere in the signal pipeline.
The question schedule, calibration protocol, and session structure can be adapted for specific research populations or research questions.
We are interested in co-authoring with domain experts. The instrument produces the data. You bring the clinical or theoretical framework. The validation comes from the collaboration.
The gap between "running" and "validated" is the work that remains. If your research could use longitudinal keystroke-cognition data that no other instrument produces, let's talk.