LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Stanford University

Key Contributions

- 43k+ human-labeled pairwise story comparisons in the training set
- 2.5k held-out benchmark comparisons for evaluation
- Superior performance vs. existing zero-shot evaluators

LitBench is a benchmark and dataset for reliable automated evaluation of creative writing, built from tens of thousands of human-labeled pairwise story comparisons. We fine-tune LLMs via preference distillation, and the resulting Bradley–Terry and generative reward models trained on LitBench markedly outperform existing zero-shot language-model evaluators.
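
To make the training objective concrete, below is a minimal sketch of the standard Bradley–Terry pairwise loss used to train reward models on comparisons like these. All names here are illustrative assumptions, not LitBench's actual training code.

```python
import torch
import torch.nn.functional as F

# Bradley-Terry pairwise loss (sketch, not LitBench's implementation):
# given scalar reward-model scores for the human-preferred ("chosen") and
# dispreferred ("rejected") story in each comparison, the model posits
# P(chosen beats rejected) = sigmoid(score_chosen - score_rejected),
# and training minimizes the negative log-likelihood of the human labels.
def bradley_terry_loss(score_chosen: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with dummy scores for a batch of three comparisons.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen > rejected
```

In practice the scalar scores would come from an LLM with a value head applied to each (prompt, story) pair; the loss drives the model to score the human-preferred story higher.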


Coming Soon

We're actively developing additional features that will be released in the coming weeks and months:

- Human vs. Machine Detection: an interactive arena for distinguishing between human- and AI-written stories
- Model vs. Model Battles: head-to-head comparisons between different AI models
- Real-time Leaderboards: live rankings and performance metrics