What AI engineers can learn from qualitative research methods in HCI
Meet inductive coding and grounded theory, the new bread-and-butter of LLMOps
At first glance, qualitative research and the highly technical, benchmark-oriented world of AI/ML seem to have nothing to do with one another. But in reality, software developers building on AI models could learn a lot from qualitative research methods.
Hamel Husain, a consultant in AI/ML engineering, has recently been holding open office hours on LLMOps. What caught my eye was this advice, which he gave to many developers building on LLMs:
“[L]ook at your logs/traces — start with 30 or so. Start categorizing the errors and issues you see. Keep looking at logs and traces until you feel like you aren’t learning anything new. In the end, you will know where your biggest issues are. You prioritize those! You will also get a sense of what is most important to measure (and how).”
Taking a pile of qualitative data, categorizing and clustering it, and looking at data until you “aren’t learning anything new”?
For those of us with qualitative research, HCI or UX backgrounds, this advice isn’t new. In fact, it precisely describes the process of qualitative coding, the bread and butter of qualitative research. In the language of qualitative methods, what Hamel is suggesting is inductive coding until one reaches saturation:
- “Inductive coding is a ground-up approach where a researcher derives codes from the data. Researchers typically don’t start with preconceived notions of what the codes ought to be, allowing the theory or narrative to emerge from the raw data.” (source)
- “Theoretical saturation is achieved when no additional themes or insights emerge from the data collection, and all conceptual categories have been explored, identified, and completed.” (source)
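To make this concrete for LLMOps, here is a minimal sketch of what “code until saturation” could look like over a pile of logged outputs. Everything in it is an assumption for illustration: the traces are plain strings, the ask_reviewer() helper stands in for a human assigning open codes, and saturation is approximated crudely as “no new codes in the last ten traces.”

```python
# A rough sketch of inductive coding with a saturation check.
# ask_reviewer() is a hypothetical stand-in for a human reading a trace
# and naming the themes/issues they see; nothing here is a real library API.

def ask_reviewer(trace: str) -> set[str]:
    """A human reads the trace and returns the open codes they assign,
    e.g. {"misformatted", "ignored_instruction"}."""
    raise NotImplementedError  # replace with your own review workflow

def code_until_saturation(traces: list[str], window: int = 10) -> dict[str, set[str]]:
    """Code traces one at a time; stop once `window` consecutive traces
    yield no code we have not already seen (a crude saturation heuristic)."""
    codebook: dict[str, set[str]] = {}  # trace -> codes assigned to it
    seen_codes: set[str] = set()
    since_new_code = 0
    for trace in traces:
        codes = ask_reviewer(trace)
        codebook[trace] = codes
        if codes - seen_codes:      # did any genuinely new theme appear?
            seen_codes |= codes
            since_new_code = 0
        else:
            since_new_code += 1
        if since_new_code >= window:
            break                   # nothing new is emerging; stop here
    return codebook
```

Start with 30 or so traces, as Hamel suggests, and let the codebook, rather than a predefined rubric, tell you where the biggest issues are.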
The additional prioritization of issues is emblematic of heuristic evaluation in UX design, where members of a development team identify UX problems independently, merge the issues they found, and then collectively decide on priorities for fixing each one.
Effectively, these connections reveal two things:
- Engineers of AI-integrated systems, especially those doing prompt engineering, can benefit from learning qualitative methods like inductive coding.
- Engineering AI-integrated systems is also a process of UX design, and can therefore learn from existing methods in UX design (i.e., engineering prompts/chains == engineering good UX).
In the rest of this post, I overview some basics of qualitative research methods in HCI and UX as an introduction for AI engineers.
Ground your LLMOps with some Grounded Theory
We learned from our “Who Validates the Validators?” paper that people need to observe AI outputs to form criteria. That is: in virtually all real-world scenarios, it is impossible to fully form evaluative criteria before observing model outputs. The reason why this is true is analogous to the reason case law exists: it’s impossible to interpret the law without seeing some real-world “outputs” (crimes) first, and applying one’s criteria to these unique events. Over time, our evaluations become stronger, but they also develop in reaction to the outputs we observe.
Effectively, like law, our criteria and prompts are theories about model outputs. They have been honed and battle-tested by empirical data (viewing outputs); just as physicists run empirical experiments and refine their theories, we refine our prompts and architectures by observing outputs.
Grounded theory is a qualitative method whereby theories about the world are grounded in natural language data. It adopts an interpretivist standpoint, acknowledging that the observer is implicated in the data collected—that there is no such thing as “raw” data. Grounded theory advocates an iterative process of data collection, reflection, and inductive coding. Roughly, this process is as follows:
1. Collect a pile of rich qualitative data. Your data collection is an interpretive process: what data you choose to collect, and how you will collect it, is up to you. In LLMOps terms, these are your test examples, models, and prompting/approach. In regular qualitative research, your data collection method includes choices like who you talk to, where you go, how you solicit feedback, etc.
2. “Code” the data: look for themes/patterns in the data, and give each “piece of data” a “code” (name of the topic/theme). Inductive coding is an interpretive process which benefits from having an overarching goal. For instance, if you’re asking for JSON and the model does not reply with JSON, the “code” we give this incident might be “misformatted,” and the overarching goal of our coding is to identify LLM outputs we perceive as “bad.” You might give these codes in a spreadsheet column or in a dedicated output feedback section of an LLMOps tool.
3. Reflect on the codes/insights you’ve found, and write down your reflections in what’s called an “in-process memo.” Are there gaps in your methods? How can you change your process to account for these gaps? For LLMOps, gaps are things like missing edge cases and your test dataset not representing real-world inputs well. Share these reflections with members of your team.
4. Revise your data collection approach using insights from (3), and go back to (1). For instance, you might expand your test set, or change your prompts/approach. Skip to Step (5) if you’ve reached “saturation”: observing new data does not yield any new codes/insights in (2) and (3).
5. Conduct “focused coding”: further standardize your codes and cluster them into higher-order clusters (codes of similar type). The analogous situation for LLMOps is to group codes into categories and prioritize what codes matter (similar to heuristic evaluation in UX research). An extra step here would be to formalize your codes into binary classifiers (code- or LLM-based), such that you can detect now-known failure modes automatically when seeing new outputs; see the sketch after this list.
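As a concrete illustration of step 5, here is a small sketch of promoting a stabilized code into an automated, code-based check, reusing the “misformatted” example from step 2. The cluster names and helper functions are hypothetical, meant only to show the shape of the idea.

```python
# Focused coding: group low-level codes into higher-order clusters,
# then formalize the ones that can be checked mechanically.
import json

def is_misformatted(output: str) -> bool:
    """Fires when an output that was supposed to be JSON fails to parse."""
    try:
        json.loads(output)
        return False
    except json.JSONDecodeError:
        return True

# Hypothetical clusters produced by focused coding.
code_clusters = {
    "formatting": ["misformatted", "truncated_output"],
    "content": ["hallucinated_field", "wrong_language"],
}

# Codes that have been formalized into binary, code-based classifiers.
automated_checks = {"misformatted": is_misformatted}

def flag_output(output: str) -> list[str]:
    """Return every formalized code that fires on a new output."""
    return [code for code, check in automated_checks.items() if check(output)]
```

Codes that resist a mechanical check (say, “condescending tone”) could instead become LLM-based classifiers, with the usual caveat that those judges need their own validation.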
Note that you need to repeat the process above if you change the model or architecture you’re using. Drawing our analogy further, each different LLM is like a different country where you don’t know the culture in advance — you know the fundamentals are there, like eating and housing, but the form and content and rules of the culture are slightly (sometimes drastically) different. So, while some of your findings from one LLM might transfer to another one, others may not (transferability in qual. research). For instance, the principle of Chain-of-Thought (CoT) transfers from OpenAI to Anthropic models, but the preferred format of prompt internals may not (Claude prefers XML).
Learn from Heuristic Evaluations in UX research
Hamel’s comment also touches upon heuristic evaluations in UX research. Heuristic evaluation is a technique that was invented in the 1990s as a “discount usability test”. In one recorded instance, performing a heuristic evaluation in a company saved about $500,000 and cost $10,500.
In a heuristic evaluation, a developer team tests a user interface (UI) and collectively creates a spreadsheet. They:
- Test the user interface independently, making annotations in a spreadsheet of what they tried and the issue(s) that emerged
- Come together to merge identified issues/pain points into a shared sheet
- Collectively decide on the priority of each issue (from minor to catastrophic)
- Finally, fix the interface, addressing issues by priority
- Rinse and repeat the process until satisfied
LLMOps developers could learn from this process: independently annotate LLM outputs for issues in a spreadsheet, merge and standardize these issue categories as a team, then prioritize what to fix. If we swap “user interface” with “prompt,” we can imagine an interface that lets a team collectively perform a heuristic evaluation on a prompt, decide upon the priorities of fixing each bad output, and then feed this information into an optimizer that both generates assertions detecting bad behavior (LLM- or code-based) and suggests improved prompts.
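For instance, the merge-and-prioritize step might look something like the sketch below. It assumes each reviewer kept a CSV with output_id, issue, and severity columns (1 = minor, 4 = catastrophic); the file names, columns, and the ranking rule are illustrative choices rather than a prescribed method.

```python
# Merge independently-kept annotation sheets and rank issue categories.
# File names and column names are assumptions for illustration.
import csv
from collections import Counter, defaultdict

def load_annotations(paths: list[str]) -> list[dict]:
    """Read every reviewer's CSV into one flat list of annotation rows."""
    rows: list[dict] = []
    for path in paths:
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

def prioritize(rows: list[dict]) -> list[tuple[str, int, int]]:
    """Rank merged issue categories by worst reported severity, then frequency."""
    worst_severity: dict[str, int] = defaultdict(int)
    counts: Counter = Counter()
    for row in rows:
        issue = row["issue"].strip().lower()  # crude standardization of labels
        counts[issue] += 1
        worst_severity[issue] = max(worst_severity[issue], int(row["severity"]))
    return sorted(
        ((issue, worst_severity[issue], counts[issue]) for issue in counts),
        key=lambda item: (item[1], item[2]),
        reverse=True,
    )

# Usage: each teammate annotates independently, then the sheets are merged.
for issue, severity, n in prioritize(load_annotations(["alice.csv", "bob.csv"])):
    print(f"{issue}: worst severity {severity}, reported {n} times")
```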
(Admittedly, in a true heuristic evaluation, there is a fixed set of codes — a “heuristic” or rubric — that the team starts with. In LLMOps, these codes would emerge from the inductive coding process above, rather than being pre-determined.)
The processes above suggest that LLMOps support software should explicitly aid this kind of work. Tools like ChainForge (for prompt engineering and LLM auditing) and DocETL (for document processing pipelines) help, but could perhaps do more to explicitly support iterative inductive coding. LLMOps traceability and observability platforms like LangSmith could also better support inductive coding and heuristic evaluations of LLM outputs by learning from the rich literature in HCI and usability studies.
Why don’t developers embrace qualitative methods?
I’ve heard developers express disdain for the iterative processes above. Upon hearing them, they feel like they are cast adrift, running on quicksand, and instead want me to give them a technical solution with a “100% success rate” for their nuanced problem.
Their confusion seems to stem from their training in science and engineering, which instilled in them a positivist mindset towards solving problems. Positivism is a belief that one looks for “objective truth” by forming hypotheses and testing them with quantitative evidence or mathematical abstractions. The positivist believes: for a claim X, it is possible to prove X true/false.
Grounded theorists instead acknowledge how the social world is constructed and interpreted: there is no objective reality except for our perceptions. LLM outputs may be good, or they may be bad, but that depends on who we are and how we judge good and bad, not on one single “right” answer. Moreover, it also depends on how we collect the data — on the choices we make, such as who to interview or what prompt to use.
I believe LLMOps is fundamentally an interpretivist, not a positivist, endeavor. The research, including my own, bears this out, finding that the same evaluative criteria can be interpreted differently and that which model one chooses is highly subjective. That is LLMOps’s key difference from classical software engineering: whereas in SE you look to get 100% correct on a test set with well-defined criteria for success, in LLMOps we frequently must settle for 98% with guardrails and 1% of totally unexpected outputs never seen on the test set.
Quantitative methods — reaching “99% acceptance rate” on your tests, for instance — are still very valuable, but they follow after this inductive coding process of identifying issues and refining your evaluative criteria.
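A minimal sketch of that ordering, assuming the checks are whatever output-level predicates you distilled from your codes in the first place:

```python
# The number comes last: acceptance rate is only meaningful relative to
# checks that were themselves distilled from inductive codes.
from typing import Callable

def acceptance_rate(outputs: list[str], checks: list[Callable[[str], bool]]) -> float:
    """Fraction of outputs on which no formalized failure check fires."""
    if not outputs:
        return 0.0
    accepted = sum(1 for o in outputs if not any(check(o) for check in checks))
    return accepted / len(outputs)
```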
“How do I eliminate errors in LLM outputs?” they ask. “You can’t entirely eliminate them without eliminating the unique value LLMs provide,” I say. “But we need to legally guarantee there’s no errors,” they reply. I weigh how to let them down: I have heard this cry many times before. “100% elimination of errors is not realistic, but you can add guardrails and work with your lawyers to come up with the right disclaimers.” The call cuts off. They go elsewhere and find a snake oil solution claiming a 100% success rate. Months later, they return. “It didn’t work,” they say. I hand them a copy of the Google PAIR Guidebook, The UX Book, and Constructing Grounded Theory, and make a fast exit.
Conclusion
As a professor, I talk with a range of people, from students to developers to managers in companies, who are looking for solutions to their LLMOps problems. Although many problems are hard, a good number of solutions — particularly the methods they could apply to reach a solution — seem obvious to me. Over time, I’ve come to theorize that the reason why is less about my familiarity with LLMs and more about my familiarity with the iterative, interpretive processes found in qualitative research and UX design. It’s apparent that learning qualitative research analysis, in particular, would just help folks do their jobs.
So, it’s time to learn some qualitative HCI research methods to improve your LLMOps pipeline. I promise you or your company won’t be disappointed by the results.
~Ian Arawjo
Bio
Ian Arawjo is an Assistant Professor of HCI at the Université de Montréal, an Associate Member of the Mila-Quebec Institute for AI Research, the Co-leader of the Montréal HCI group, and the creator of the ChainForge visual toolkit for prompt engineering and LLM auditing. Previously, he served as a Postdoctoral Fellow at Harvard University working with Professor Elena Glassman in the Harvard HCI group. He conducts research at the intersection of AI, programming, and HCI. He holds a Ph.D. in Information Science from Cornell University, advised by Professor Tapan Parikh, where his dissertation studied the intersection of programming and culture.