No, ChatGPT does not have seasonal affective disorder

A Perilous Tale of Why We Check Assumptions Before Running Statistical Tests

Ian Arawjo
3 min read · Dec 12, 2023

Yesterday, a claim that ChatGPT has “seasonal affective disorder” made the rounds on X, went viral, and made headlines, including at Ars Technica:

According to the original post, when ChatGPT has “December” (very specifically, a date in the form 2023–12–07) in its system message, versus “May”, its responses get significantly shorter. The claim was verified by an unpaired t-test and (supposedly) reproduced.

The problem? The test used to verify this claim assumes that the data is normally distributed.

Turns out, it isn’t — not by a long shot:

Running a Shapiro-Wilk test of normality shows non-normal distributions for token lengths, for both May and Dec response groups (at p<0.05).
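For anyone who wants to replicate this kind of check outside ChainForge, here is a minimal sketch in Python using SciPy. The two samples below are simulated placeholders (lognormal draws, purely illustrative), not the actual response data; in the real analysis they would be the measured token counts of the “May” and “December” responses.

```python
# A minimal sketch of the normality check, not the exact analysis in the post.
# The samples are simulated (right-skewed lognormal draws) purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
may_token_counts = rng.lognormal(mean=5.5, sigma=0.4, size=240)   # placeholder data
dec_token_counts = rng.lognormal(mean=5.45, sigma=0.4, size=240)  # placeholder data

# Shapiro-Wilk tests the null hypothesis that a sample comes from a normal distribution.
for label, sample in [("May", may_token_counts), ("Dec", dec_token_counts)]:
    W, p = stats.shapiro(sample)
    flag = "(non-normal at p<0.05)" if p < 0.05 else ""
    print(f"{label}: W={W:.3f}, p={p:.4g} {flag}")
```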

Whatever OpenAI is doing, neither token nor character lengths of LLM responses appear to be normally distributed at higher N (here N=240). When we run a Mann-Whitney U test instead — which is what we’re supposed to do in non-normal situations — we don’t reach significance:
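A corresponding sketch of the non-parametric comparison, again on simulated placeholder samples rather than the real responses, with the unpaired t-test included only for contrast:

```python
# A self-contained sketch of the comparison step; the samples are the same kind of
# simulated placeholders as above, standing in for the measured response lengths.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
may_token_counts = rng.lognormal(mean=5.5, sigma=0.4, size=240)   # placeholder data
dec_token_counts = rng.lognormal(mean=5.45, sigma=0.4, size=240)  # placeholder data

# Mann-Whitney U compares the rank distributions of the two groups and does not
# assume normality, unlike the unpaired t-test used in the original post.
U, p = stats.mannwhitneyu(may_token_counts, dec_token_counts, alternative="two-sided")
print(f"Mann-Whitney U = {U:.1f}, p = {p:.4f}")

# For contrast: the unpaired (Welch) t-test, which does assume roughly normal data.
t, p_t = stats.ttest_ind(may_token_counts, dec_token_counts, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p_t:.4f}")
```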

In fact, just eyeballing the length of responses at N=240, it doesn’t look like there’s a difference in mean, either:

The ChainForge flow that verifies this, using the original poster’s system messages and prompt, is available here. (You can go to chainforge.ai/play and import it to view, no install necessary.)

Conclusion: No, ChatGPT does not have seasonal affective disorder.

Instead, this is a perilous tale of why we need to verify assumptions before we run statistical tests. Don’t assume data from LLM responses are normally distributed, especially if you’re getting them from black-boxed, opaque providers like OpenAI.

Cheers,

~Ian Arawjo

[Update Dec 13th: I’ve increased the N to 400, closer to the original post’s sample size, just to verify that the N=240 result above wasn’t by chance. Same result. The data is even less normal for token count distributions. Neither the Mann-Whitney U test nor the t-test (which we’re not supposed to run here, but I do for completeness’ sake) gives a significant result:

At N=400, still no significance and high non-normality to the token counts.

If we run it on the length of responses, it’s even worse: not only are the MWU and t-test insignificant, but the mean length for December is *greater* than for May! I rest my case that this result isn’t reproducible for the exact values in the original post. Whether different GPT model versions result in different response lengths is a different matter — which, by the way, you can use ChainForge to easily verify.]

Bio

Ian Arawjo is a Postdoctoral Fellow at Harvard University working with Professor Elena Glassman in the Harvard HCI group. He is the creator of ChainForge, the first open-source visual programming toolkit for prompt engineering. In January 2024, he will be an Assistant Professor of HCI at the University of Montreal, where he will conduct research at the intersection of AI, programming, and HCI. He holds a Ph.D. in Information Science from Cornell University, where he was advised by Professor Tapan Parikh. His dissertation work studied the intersection of programming and culture.
