Learning Next Action Predictors from
Human-Computer Interaction

Omar Shaikh1, Valentin Teutschbein2*, Kanishk Gandhi1*, Yikun Chi3, Nick Haber1, Thomas Robinson1, Nilam Ram1, Byron Reeves1, Sherry Yang1, Michael Bernstein1, Diyi Yang1

1Stanford University   2Hasso Plattner Institute   3New York University

What Will You Do Next?

Language models today are hopelessly restricted to seeing us through a narrow keyhole. They see our prompts, and they construct memories to make sense of them. But they know nothing of what brought us to them in the first place. Truly context-aware AIs should understand us deeply: what problems we're solving, what constraints we face, and how we act in the world.

What if models could instead predict what we'll do next on our computers to help proactively? We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, keystrokes, clicks), predict their next action. For example, if a researcher reads paper reviews, a good predictor should reason over this and over the user's past habits to predict they will check experiments on Weights & Biases, then message coauthors on Slack to divide up revisions.
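The NAP task above can be sketched as a simple data structure: a history of multimodal observations paired with the action the user actually took next. This is a minimal illustration, not the paper's actual schema; the field names (`screenshot_path`, `keystrokes`, `clicks`) are assumptions for exposition.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One step of a user's multimodal computer interaction."""
    screenshot_path: str  # captured screen frame
    keystrokes: str       # keys typed during this step
    clicks: list[tuple[int, int]] = field(default_factory=list)  # (x, y) clicks

@dataclass
class NAPExample:
    """A next-action-prediction instance: history in, next action out."""
    history: list[Observation]  # the user's interaction sequence so far
    next_action: str            # ground truth: what the user actually did next

def make_example(history: list[Observation], observed_action: str) -> NAPExample:
    # The label comes for free from temporal structure: wait and see
    # what the user does next, then record it as the target.
    return NAPExample(history=history, next_action=observed_action)
```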

Long-Context Next Action Predictors draw from the entirety of a user's multimodal context, retrieving over an unbounded history, to predict what they will do next.

We make progress on two fronts. NAPSack passively collects and labels interaction data at scale. LongNAP combines parametric and in-context learning, trained end-to-end to retrieve relevant past reasoning and predict future actions. The temporal structure of behavior provides a natural training signal: just wait and see what the user actually does.

Reason, Retrieve, and Predict with LongNAP

Rather than relying solely on parametric or in-context learning, we train models that learn to retrieve relevant past reasoning and observations into context. We instantiate this in LongNAP, a two-phase model trained end-to-end via policy gradients.

In the first phase, LongNAP reasons to retrieve: the model generates a chain-of-thought, then uses it to query a memory of past observations via BM25. In the second phase, LongNAP reasons to predict: integrating retrieved traces to refine its reasoning and predict concrete future actions. Traces that lead to good predictions are saved back into memory, so the library improves over time.
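The two phases can be sketched as follows. The `model.reason` and `model.predict` calls are hypothetical interfaces standing in for the LLM's two generation passes, and `bm25_scores` is a minimal from-scratch BM25 rather than a production index; both are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Minimal BM25 over tokenized documents (stand-in for a real index)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def longnap_step(model, memory, observations, top_k=3):
    """Two-phase sketch: reason to retrieve, then reason to predict."""
    # Phase 1: generate a chain-of-thought and use it as the retrieval query.
    thought = model.reason(observations)
    scores = bm25_scores(thought.split(), [m.split() for m in memory])
    retrieved = [m for _, m in sorted(zip(scores, memory), reverse=True)[:top_k]]
    # Phase 2: refine reasoning over retrieved traces and predict the action.
    prediction, trace = model.predict(observations, retrieved)
    return prediction, trace  # traces that score well get saved back to memory
```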

To score predictions, we use a temporal reward: since we can just wait and see what the user actually does, an LLM judge measures similarity between predicted and actual future actions. This lets us optimize both stages end-to-end through GRPO.
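As a sketch of the reward machinery: the judge is abstracted as any callable returning a similarity in [0, 1], and the group-relative advantage normalization below follows the standard GRPO recipe (reward minus group mean, divided by group standard deviation), not a detail taken from this paper.

```python
def temporal_reward(predicted: str, actual: str, judge) -> float:
    """Score a prediction against what the user actually did next.

    `judge` is a stand-in for the LLM judge; any callable mapping a
    (predicted, actual) pair to a similarity in [0, 1] works here."""
    return judge(predicted, actual)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each sampled
    prediction's reward by the group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```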

LongNAP reaches 17.1% pass@1, rising to 36.3% pass@20, significantly outperforming supervised finetuning (by 79%) and prompted baselines (by 39%).


LongNAP two-phase pipeline: reason-to-retrieve followed by reason-to-predict, optimized end-to-end with GRPO.

Learning Online with PowerNAP

Instead of storing data and training offline with multiple epochs, we can convert the entire pipeline to run online. In PowerNAP, NAPSack and LongNAP operate asynchronously: NAPSack continuously tracks and labels user actions, enqueueing them for training, while LongNAP consumes labeled actions from the queue and trains on them in a single pass. Crucially, memory is never reset and reasoning traces accumulate, allowing the model to continually build a better representation of the user over time. Explore the embedding space of 7,852 action sequences below.
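The asynchronous producer/consumer structure can be sketched with a simple queue. The function names and the shape of `train_step` are illustrative assumptions; the point is the single-pass consumption and the memory that is appended to but never reset.

```python
import queue

def napsack_producer(q: "queue.Queue", stream):
    """NAPSack side: watch the interaction stream, label each action with
    what actually happened next, and enqueue it for training."""
    for history, observed_next in stream:
        q.put((history, observed_next))
    q.put(None)  # sentinel: stream ended

def powernap_consumer(q: "queue.Queue", train_step, memory: list):
    """LongNAP side: consume labeled actions in a single pass.
    Memory persists across steps; reasoning traces accumulate."""
    while True:
        item = q.get()
        if item is None:
            break
        history, target = item
        trace = train_step(history, target)  # one gradient step, no epochs
        memory.append(trace)                 # memory is never reset
```

In the full system the two sides would run concurrently (e.g. in separate threads or processes); the queue decouples NAPSack's collection rate from LongNAP's training rate.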


Cite This Work

@misc{shaikh2026learningactionpredictorshumancomputer,
  title={Learning Next Action Predictors from Human-Computer Interaction},
  author={Omar Shaikh and Valentin Teutschbein and Kanishk Gandhi and Yikun Chi and Nick Haber and Thomas Robinson and Nilam Ram and Byron Reeves and Sherry Yang and Michael S. Bernstein and Diyi Yang},
  year={2026},
  eprint={2603.05923},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.05923},
}