LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Key Contributions
Human-labeled pairwise story comparisons for reward-model training
A held-out set of comparisons for benchmark evaluation
Trained reward models that outperform existing zero-shot LLM evaluators
LitBench is a benchmark and dataset aimed at improving automated creative writing evaluation, featuring thousands of pairwise, human-labeled story comparisons. We apply preference distillation techniques to fine-tune LLMs, and the resultant Bradley–Terry and generative reward models trained on LitBench markedly surpass existing zero-shot language model evaluators.
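To make the Bradley–Terry reward-modeling idea concrete, the sketch below shows the standard pairwise preference loss used to train such models: the model assigns a scalar reward to each story in a human-labeled pair, and the loss pushes the preferred story's reward above the rejected one's. This is a minimal illustrative example in PyTorch, assuming scalar per-story rewards; the function name and toy values are our own and do not reflect the released LitBench training code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry objective: maximize the log-probability
    that the human-preferred story receives the higher scalar reward."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards a reward model might assign to the preferred
# and non-preferred story in each labeled comparison (illustrative values).
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(bradley_terry_loss(chosen, rejected))
```

In practice the rewards would come from an LLM backbone with a scalar head scoring each story, and the loss above would be minimized over the human-labeled comparison pairs.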
Coming Soon
We're actively developing additional features that will be released in the coming weeks and months:
Human vs Machine Detection
Interactive arena for distinguishing between human and AI-written stories
Model vs Model Battles
Head-to-head comparisons between different AI models
Real-time Leaderboards
Live rankings and performance metrics
