You can now run OpenAI evals in ChainForge

With the click of a button

Ian Arawjo
3 min read · Jun 16, 2023

Evaluation is hard. At ChainForge, we’ve been working on making it easier, but it’s still hard to know what to evaluate: the breadth of human knowledge is too complex, too expansive, for any one organization, let alone any one person, to test it all. And even when you know what to evaluate, you might struggle with knowing how. That’s why OpenAI pushes evals: a command-line benchmarking tool that lets users propose and add new evaluations for OpenAI models.
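(For context on what that involves: each eval in OpenAI’s repository is defined by a YAML entry in a registry plus a JSONL file of samples, where each sample pairs a chat-style input with an ‘ideal’ answer to check the model’s output against. Here is a rough Python sketch of the shape of one such sample; the contents are made up, not the real ‘tetris’ data.)

```python
import json

# A made-up sample in the shape used by OpenAI evals' samples.jsonl files:
# a chat-style "input" plus an "ideal" answer that the model's completion
# is checked against. (Illustrative only; not the actual 'tetris' samples.)
sample = {
    "input": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Rotate this Tetris piece 90 degrees clockwise: ..."},
    ],
    "ideal": "...",  # expected answer, elided here
}

# Each line of a samples.jsonl file is one such JSON object.
print(json.dumps(sample))
```

Hand-editing files like these is the friction the visual flows below are meant to remove.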

Today, ChainForge can load 188 OpenAI evals directly as visual evaluation flows. This makes it significantly easier to explore what LLM evaluations others have proposed, and to tweak, critique, and build upon them. For instance, here’s the ‘tetris’ eval from OpenAI, measuring how well a model can rotate Tetris blocks, loaded into ChainForge:

The flow for measuring accuracy of Tetris block rotations, with an accuracy plot.

All I had to do to see results was click run. (Turns out, ChatGPT is not so great at Tetris rotations!) Beyond just running an eval, however, I can now plot the results, inspect the input prompts, and extend the evaluation at will. No more messing around with editing JSONL files or YAML configs with obscure parameters: with ChainForge, you can immediately test OpenAI evaluations on new prompts, new models, or at different model settings. Want to compare GPT3.5’s performance on Tetris to GPT4? Click ‘Add model’, and re-run:

GPT4 outperforms GPT3.5 at Tetris block rotations

Want to try different system messages and see if you can improve the results? Add another model, change settings, run. It’s that easy:

An alternative system message to the one used in OpenAI’s ‘tetris’ eval, entered in the added model’s settings screen.
This increased accuracy by ~4%. Or did it? We might change the temperature or request more responses per prompt to verify.
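(One quick way to check whether a gap like that is meaningful is to treat each accuracy as a proportion of passed samples and ask whether the difference could plausibly be noise. Here is a rough sketch with hypothetical numbers; the sample counts and accuracies below are made up, not the actual ‘tetris’ results.)

```python
from math import sqrt

def two_proportion_z(acc_a: float, acc_b: float, n_a: int, n_b: int) -> float:
    """Normal-approximation z statistic for the difference between two accuracies."""
    p_pool = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (acc_b - acc_a) / se

# Hypothetical: a 4-point gain (0.50 -> 0.54) measured on 100 prompts per run.
z = two_proportion_z(acc_a=0.50, acc_b=0.54, n_a=100, n_b=100)
print(f"z = {z:.2f}")  # ~0.57, well below 1.96, so not significant at the 5% level
```

On a hundred or so samples, a ~4% gap is well within noise; requesting more responses per prompt, or running more samples, is what would make the comparison trustworthy.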

Currently, ChainForge supports a large subset of evals of class ‘includes’, ‘match’, and ‘fuzzy match’, with common system messages and single-turn prompts. In the near future, we’ll add broader support, including evals graded by LLM evaluators using chain-of-thought prompting, so you can explore OpenAI’s model-graded benchmarks, too.
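(For the curious, here is roughly what those three classes check. This is my own simplified sketch in Python, not the actual openai/evals implementation, which handles extra details such as multiple accepted answers.)

```python
import re

def normalize(s: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for fuzzy matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", "", s.lower())).strip()

def includes(completion: str, ideal: str) -> bool:
    # 'includes': pass if the ideal answer appears anywhere in the model's output
    return ideal in completion

def match(completion: str, ideal: str) -> bool:
    # 'match': pass if the output starts with (or equals) the ideal answer
    return completion.strip().startswith(ideal.strip())

def fuzzy_match(completion: str, ideal: str) -> bool:
    # 'fuzzy match': pass if either normalized string contains the other
    c, i = normalize(completion), normalize(ideal)
    return i in c or c in i
```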

Let us know what you think and, as always, if there’s a feature you’d like to see or if you run into any problems, open a Discussion or raise an Issue on our GitHub.

Best,

~Ian Arawjo

