LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Stanford University

Key Contributions

- 43k+ human-labeled pairwise story comparisons in the training set
- 2.5k held-out benchmark comparisons for evaluation
- Superior performance vs. existing zero-shot evaluators

LitBench is a benchmark and dataset for reliable automated evaluation of creative writing, built from tens of thousands of human-labeled pairwise story comparisons. We fine-tune LLMs via preference distillation, and the resulting Bradley–Terry and generative reward models trained on LitBench markedly outperform existing zero-shot language-model evaluators.
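
To make the training objective concrete, below is a minimal sketch of the standard Bradley–Terry pairwise loss used to train reward models on comparisons like these. All names here are illustrative assumptions, not LitBench's actual training code.

```python
import torch
import torch.nn.functional as F

# Bradley-Terry pairwise loss (sketch, not LitBench's implementation):
# given scalar reward-model scores for the human-preferred ("chosen") and
# dispreferred ("rejected") story in each comparison, the model posits
# P(chosen beats rejected) = sigmoid(score_chosen - score_rejected),
# and training minimizes the negative log-likelihood of the human labels.
def bradley_terry_loss(score_chosen: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with dummy scores for a batch of three comparisons.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen > rejected
```

In practice the scalar scores would come from an LLM with a value head applied to each (prompt, story) pair; the loss drives the model to score the human-preferred story higher.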


Coming Soon

We're actively developing additional features that will be released in the coming weeks and months:

- Human vs. Machine Detection: an interactive arena for distinguishing between human- and AI-written stories
- Model vs. Model Battles: head-to-head comparisons between different AI models
- Real-time Leaderboards: live rankings and performance metrics