LLM Wrapper Papers are Hurting HCI Research

Naming the problem and deciding what to do about it

Ian Arawjo
Jun 6, 2024
An “LLM wrapper,” according to DALL·E 3.

Since the advent of the LLM era, we’ve seen a troubling phenomenon as reviewers at HCI venues: submissions that are, pejoratively speaking, “LLM wrappers,” i.e., LLMs applied to X problem. These “LLM wrapper” papers frequently cite little HCI literature or engage with it only at a shallow level, engage with X problem only at a surface level, and involve UIs or studies that amount to little more than editing the system prompt of ChatGPT. Their claim to novelty, and thus to publication, is always the same: “while there is related work, no one has yet applied LLMs to X problem.”

Here’s an example. Recently a student forwarded me a public call for students to work on NLP projects at a university lab. There were dozens of titles that boiled down to “we apply LLMs to X problem.” Sure, maybe one or two of these projects are fine, but all of them? A dozen projects for one professor, each applying LLMs to a different problem, in domains from mental health to writing? I have trouble believing that the people working on these papers are engaging deeply and sincerely with every problem or its target users: their goal is instead to churn out a paper.

This flood of papers to HCI conferences puts further strain on an already encumbered review process. Reviewers and 1ACs spend time reading LLM wrapper papers, and, lo and behold, their reviews boil down to the same thing: “little engagement with HCI literature, shallow system and user study, questionable design decisions.” Serving on the PC of UIST this year, and having reviewed many such CHI 2024 submissions, I speak from experience: about half of my reviews have concerned systems that might be called LLM wrappers. (The problem has gotten so bad that I even created a review template to handle this particular issue.) That’s time and energy that could’ve been spent on other things.

I suspect a good portion of these submissions come from authors with ML or NLP backgrounds resubmitting work rejected from NeurIPS or EMNLP. Given the problems with those conferences and their review load, we are seeing “submission spillover” into HCI conferences. Some of them might be students with these backgrounds, submitting a paper the way one pulls a slot machine lever, hoping to get lucky.

The problem doesn’t end with a strained review process, however. If published or put up as a pre-print, an LLM wrapper paper can take up an obnoxious amount of space. If new researchers come along with ample expertise on X problem and carefully apply LLMs to it (coming to this usage from actual user need), situating their design decisions in past literature and need-finding studies, then these new scholars might be asked to justify their system against the background noise of the LLM wrappers that inevitably “got there first.” “How is your work novel when Y paper already applied LLMs to this problem?” I can already hear an inexperienced reviewer asking. These wrapper papers are taking up territory and filling it with junk that newer authors need to wade through to justify their work.

To be honest, I think the authors submitting this work bear the brunt of the blame. But not all of it. Some of the blame also lies with the incentive structures that reward this behavior and allow it to flourish. That’s us, too.

What Gets You Out of LLM Wrapper Jail?

Even if you agree with me that there’s a problem, you may persist in wondering where we draw the line. What counts as an LLM wrapper paper and what doesn’t? If we run a great evaluation, are you saying that our contribution doesn’t matter if all we did was change a system prompt?

While it is somewhat subjective, I think there are a few determining factors that, taken together, push a paper outside of “LLM wrapper” territory. For HCI, these are:

  • Authentic engagement with past literature, both within the field and outside of it (e.g., a paper on supporting mental health needs to engage both the mental health literature and prior HCI work on mental health)
  • A solid justification for applying LLMs (over “non-AI” approaches)
  • Effort spent on design and/or architecture iteration
  • Justification of the contribution and/or novelty beyond an implicit “we applied an LLM to this problem; no one’s done that before”
  • A careful user study that goes beyond a basic usability test with subjective measures (Likert scores) or a contrived ablation study

These are just some elements that can get you out of LLM wrapper jail. We need to establish and communicate standards like these for LLM-wrapper-style papers, presenting them to submitting authors as guidelines and warnings. For instance, for upcoming CHI or UIST conferences, we might consider warnings like:

  • “If you don’t engage with past HCI literature in HCI conferences and journals, you will be subject to a quick-reject.”
  • “If your primary contribution is to change the system prompt of an LLM and then run a study on it, you may be subject to a quick-reject.”

These aren’t perfect guidelines, and it’d be best if we worked on them together. But we need to name the problem first in order to work toward solutions that make our collective lives easier. Otherwise, we will continue to be inundated with a flood of LLM wrapper papers from authors who can’t be bothered to engage with past HCI literature, who don’t honestly engage with the problem, and whose only goal is to get a publication and move on. And that stinks.

~Ian Arawjo
