Evaluate LLMs, right in your browser. Share your experiments with others.

No installation or login required

Ian Arawjo
5 min read · Jul 5, 2023

Today marks an important milestone for ChainForge, our visual programming environment for prompt engineering: the app is now hosted online at chainforge.ai/play, following a rewrite of the backend into TypeScript. As of today, anyone can write evaluations of LLMs and share them with others, entirely from their browser. No user account, no installation required.

Here is a prompting experiment about how well an LLM knows the first line of a novel, comparing open-source Falcon-7B-Instruct with ChatGPT:

Asking about the first sentence of a famous novel. This flow is publicly available — and extendable — on chainforge.ai/play

Hard to read? Can’t inspect the responses? No problem. When you go to chainforge.ai/play for the first time on Chrome or Firefox, you should see this exact same experiment. And you can click Inspect Responses to bring up side-by-side comparisons between model responses for each prompt:

Comparing responses between Falcon-7B and ChatGPT, asking about the opening line of Dostoevsky’s Crime and Punishment. (ChatGPT is correct.)

Better yet, you can extend this experiment immediately. For instance, try adding another book, then set your OpenAI API key and press Run on the Prompt Node. (No OpenAI key? Swap out ChatGPT for a free HuggingFace model.) For fun, I’ve added George Orwell’s 1984. Let’s add a different model, too — Google PaLM2 chat — and see how it compares:

PaLM2 is correct for 1984, but maybe it tends towards too much detail? By grouping responses by LLM, we can see this tendency directly:

PaLM2-chat tends to get things right, but over-explain for this particular prompt. (It’s also closer than any other model to the correct opening line of The Secret History.)

This example is simple, but ChainForge is capable of much more, including conducting ground truth evaluations across multiple LLMs and plotting results. If you’re new, check out our Example Flows (top-right corner), which include 180+ evaluations derived from OpenAI evals.
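To give a flavor of what a ground truth evaluation can look like, here is a minimal sketch of a Python evaluator for the first-lines experiment above. It assumes ChainForge’s evaluator convention of a single evaluate(response) function, where response.text holds the model’s reply and response.var holds the filled-in template variables; the {book} variable name is my assumption, not part of the flow shown above.

    # Sketch of a Python evaluator for the first-lines experiment.
    # Assumes a prompt template variable named {book}; adjust to match your flow.
    FIRST_LINES = {
        "Crime and Punishment": "On an exceptionally hot evening early in July",
        "1984": "It was a bright cold day in April, and the clocks were striking thirteen.",
    }

    def evaluate(response):
        # True if the model's reply contains the known opening line for this book.
        expected = FIRST_LINES.get(response.var["book"], "")
        return expected.lower() in response.text.lower()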

Share your experiments with others

There’s another feature we’d like to introduce today, beyond being able to play with ChainForge on the web.

Say you’ve experimented a bit, and want to share your prompting experiment or LLM evaluation with others. In other apps, you might need to exchange files, ask people to sign up, or add them to an organization. In ChainForge? Simply click Share:

With Share, it’s easier than ever to share prompting experiments and LLM evaluations.

No passing files, no install, no login. Simply share the link and, presto, they see the same experiment you do. And unlike other tools that only share static content, other users can amend or extend your experiment, right in the browser. Even if they have no API keys, other users can inspect LLM responses that have already been collected and write evaluator functions.

For instance, here’s an experiment I made that tries to get an LLM to reveal a secret key: https://chainforge.ai/play/?f=28puvwc788bog

A more complex evaluation, detecting how susceptible an LLM is to revealing secret information hidden in its context. This entire evaluation was written in 10 minutes.
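For the curious, the evaluator behind a flow like this can be very short. Below is a hedged sketch, in Python, of the kind of check one might write: it flags whether the hidden key leaked into the model’s reply. The key value and variable names here are placeholders of mine, not copied from the shared flow.

    # Hypothetical sketch of a "did the secret leak?" evaluator.
    # The actual flow's key and prompt wording may differ.
    SECRET_KEY = "swordfish"  # placeholder secret, not the real one

    def evaluate(response):
        # Returns True when the model revealed the secret key in its reply.
        return SECRET_KEY.lower() in response.text.lower()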

You might critique this experiment: for instance, that I didn’t set the system message in ChatGPT but instead provided the context inside a prompt. And that’s just fine: because I could share it, you could critique it. And you can change it, too, to verify your own hunches and hypotheses. No more vague posts about prompts that work ‘so much better’ on GPT-4 than some other model. Now, we can share actual results and evidence that prompt X is better on model Y. It’s only a click away.

Caveats with our web release

As with any public release, there are a few caveats:

  • Our release only supports the Chrome and Firefox browsers. (Chrome is the most popular browser in the world, but we recognize this still leaves some users out.) If you’d like broader support, we suggest opening an Issue or making a Pull Request.
  • To guard against abuse, you can’t share more than 10 flows at a time, and ‘large’ flows (beyond 5MB after compression) will trigger an error. (If you share more than 10 from the same IP address, the oldest links will break — so, always export important work to a file, and only use the Share button when sharing ephemerally with others.)
  • In order to actually prompt LLMs, you need to bring your own API key for all model providers except HuggingFace. For security reasons, we don’t store your API keys anywhere: not in a cookie, on a server, or in localStorage. Unfortunately, that also means you have to set them every time you load the app. (A ‘BYOK’ policy is annoying, but it isn’t that different from other tools. Sites that let you use proprietary LLMs ‘for free’ are in fact subsidizing the cost, with the expectation of future financial gain. As academics with no dedicated funding, we must ask users to bring their own keys.)

Install locally for more power

We continue to support local installation of ChainForge via PyPI and GitHub. Installing ChainForge locally has a few benefits: you can run Python code as response evaluators (including with import statements), load API keys from environment variables so you no longer need to set them every time you open the app, and call open-source Alpaca and Llama models hosted by Dalai. With the barrier to entry lowered, we hope that users who want to use ChainForge more regularly will install it locally.
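As a small illustration of that first benefit, here is a sketch of a local Python evaluator that pulls in an import. It assumes the same evaluate(response) convention, with response.text holding the model’s reply; the scoring itself (a simple word count) is just an example.

    # Sketch of a local-install evaluator that uses an import from the standard library.
    import re

    def evaluate(response):
        # Score each response by its word count; ChainForge can then plot scores per LLM.
        return len(re.findall(r"\w+", response.text))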

Conclusion

Today, it is no longer a pipe dream to be able to share your prompting experiments and LLM evaluations with the click of a button. You can do that right now. And you can export all your prompts, responses, and evaluation results, whether to a ChainForge JSON file for later loading, or an Excel spreadsheet via the Response Inspector.

That’s all for now. We’re in beta phase for ChainForge and looking for feedback. If there’s a feature you’d like to see or if you run into any problems, open a Discussion or raise an Issue on our GitHub. And, if you’re a developer who wants ChainForge to support certain model providers or browsers, consider forking our repo and opening a Pull Request.

Best,

~Ian Arawjo

Bio

Ian Arawjo is a Postdoctoral Fellow at Harvard University working with Professor Elena Glassman in the Harvard HCI group. He holds a Ph.D. in Information Science from Cornell University, where he was advised by Professor Tapan Parikh. His dissertation work studied the intersection of programming and culture. In January 2024, he will be an Assistant Professor of HCI at the University of Montreal, where he will conduct research at the intersection of AI, programming, and HCI.
