How GPT-4 mastered the entire bar exam, and why that matters

Researchers tested GPT-4 on all three portions of the bar exam, and its final score would put it in the 90th percentile of human test-takers—and above the average for actual test-takers of the most recently administered bar exam.


For more than a century, law students have run themselves ragged trying to pass the written bar exam. Last week, artificial intelligence did it without a struggle.

On March 14, when the much-anticipated news broke that OpenAI had released GPT-4, its most powerful AI model to date, with it came the news that GPT-4 had passed the US Uniform Bar Exam (UBE) with scores that would place it in the 90th percentile of test-takers.

If this news sounds familiar, it’s probably because GPT-4’s predecessor model, GPT-3.5, made headlines for taking the bar exam back in January 2023. What many of those stories glossed over was that GPT-3.5 didn’t even attempt portions of the bar exam, and on the portions it did take, it didn’t do particularly well.

The story with GPT-4 is very different.

Why Does GPT Passing the Bar Matter?

While the two rounds of running the bar exam through GPT technology might sound similar, they differ starkly in several ways.

In a nutshell, GPT-3.5 (more specifically the underlying model text-davinci-003, a sibling of the model that powers ChatGPT) was tested only on the multiple-choice portion of the bar exam, the Multistate Bar Examination (MBE). It answered just shy of 50% of the questions correctly, which would put it at the 10th percentile of human test-takers, and it received a passing score in only one subject area.

While much of the coverage centered on that one passing score, the paper reporting those findings focused squarely on the trajectory, stating, “the trend in improvement for recent GPT models strongly suggests that an LLM [large language model] will pass the MBE component of the Bar Exam in the near future.”

Just a few months later, GPT-4 got nearly 76% of the MBE questions correct. What does that mean? “Large language models can meet the standard applied to human lawyers in nearly all jurisdictions in the United States by tackling complex tasks requiring deep legal knowledge, reading comprehension, and writing ability,” the research paper concluded.

For even more context on how fast this technology is advancing: an earlier model, GPT-2, couldn’t even complete the test well enough to generate a reportable score.

While grappling with the multiple-choice section is no small feat, those who have not had to endure the rite of passage that is sitting for the bar exam might not realize that it’s only one part of the test. There are also two written portions—one consisting of two long-form essays and one consisting of six short-form written questions—which present a significantly tougher challenge for AI models.

GPT-4, however, met that challenge with flying colors. The researchers tested GPT-4 on all three portions of the bar exam, and its final score would put it in the 90th percentile of human test-takers—and above the average for actual test-takers of the most recently administered bar exam. Also like actual test-takers, it performed better on some subjects than others (read more below).

“It’s the first AI that can pass the bar exam,” said Pablo Arredondo of Casetext, who co-authored the most recent paper and whose company has built an AI legal assistant called CoCounsel on GPT-4 technology, during a Legaltech News webinar. “I think it’s fair to say that we are now in a new age of the practice of law, one where computers have, essentially, literacy. … These large language models are now capable of reading text, interpreting it, classifying it, analyzing it, and doing all the sorts of other things that are so key to the practices of law.”

Daniel Katz, a professor at Chicago-Kent College of Law, co-authored both the GPT-3.5 and GPT-4 bar exam studies, and emphasized the importance of GPT-4 mastering not just “regular language” but so-called legalese with high fidelity.

“This is a prism, I think, to look at the broader picture,” Katz told Legaltech News. “I don’t even really care about the bar exam, per se. I care about the nature of the capability increase that’s gone on here. … This crystallizes what is happening for people in a way [that says], here’s some tasks that lawyers do, and it does it marginally better.”

Both Katz and Arredondo agreed that GPT-4’s performance is unrivaled by anything we’ve seen before in terms of the language and reasoning capabilities of AI. For Katz, that means law firms and corporate legal departments will have to think strategically about how they want to employ these technologies going forward.

During the webinar, Arredondo also stressed why these AI advances are so significant and cannot be ignored. “It is a profound mistake to think that what just happened now, especially with GPT-4, with this next generation, is going to be similar to the rather disappointing things that we’ve seen happen with earlier models,” he said. “This is not a robot lawyer. It is a tool that can make practicing law much, much better, can make you just be able to focus on the things that got you into law to begin with.”

A Deeper Dive into GPT-4’s Bar Exam Performance

The UBE consists of three parts: the multiple-choice MBE, plus two written components, the open-ended Multistate Essay Exam (MEE) and the Multistate Performance Test (MPT).
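For context on how those three parts combine into a single score: the National Conference of Bar Examiners weights them 50% (MBE), 30% (MEE) and 20% (MPT), with the total reported on a 400-point scale. The short sketch below illustrates that standard weighting, assuming each component score is expressed on a 0-200 scaled basis; the function and the component scores in the example are hypothetical, not figures from the study.

```python
# Minimal sketch of the standard UBE weighting: MBE 50%, MEE 30%, MPT 20%.
# Assumes each component score is on a 0-200 scaled basis; the total is
# reported on the UBE's 0-400 scale.
def ube_total(mbe: float, mee: float, mpt: float) -> float:
    weighted_average = 0.5 * mbe + 0.3 * mee + 0.2 * mpt  # still on 0-200
    return 2 * weighted_average  # rescale to the 0-400 UBE scale

# Hypothetical component scores, for illustration only:
print(ube_total(mbe=158, mee=142, mpt=139))  # -> 298.8
```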

The first study, GPT Takes the Bar Exam, authored by Katz and Michael Bommarito, also from Stanford’s Center for Legal Informatics, only tested GPT-3.5 on the MBE. In the second study, GPT-4 Passes the Bar Exam, which added Arredondo and Shang Gao, also of Casetext, as co-authors, the group tested GPT-4’s acumen on the entire UBE—the MBE, plus the written MEE and MPT components. 

Katz described the first study as “a trailer for the movie” that is the second study. 

As outlined in the follow-up paper, GPT-4 not only passed the MBE and both written portions of the UBE, it did so while outscoring the average human bar exam taker.

Also like the average human bar exam taker, GPT-4 performed better in some subjects than in others, with Civil Procedure its lowest-scoring subject.

“I think that’s really the story, this kind of growth in a relatively condensed amount of time, and serious progress on the quality of language,” Katz said. “It’s one thing to be able to have to work with general language, but legal language, legalese, as we know … that’s a whole other animal.”

Katz further broke down GPT-4’s impressive performance on the written portions of the exam. Previous GPT models have been known to “hallucinate”—that is, generate completely fabricated, yet convincing, responses to questions. Katz said he didn’t see much of that in the GPT-4 study.

Instead, while GPT-4 got some answers wrong, its mistakes were akin to the wrong answers human test-takers give: answers where “it is not totally insane to invoke it, but it is not correct in this problem,” he explained. “It doesn’t get [the point] right, that [answer] misunderstands this idea. [But] these are all things I could see a student easily missing, by the way.”

While the reasons behind GPT-4’s far superior performance on the bar exam are deeply technical, and OpenAI has not yet made all of them public, Katz summarized them as a combination of bigger models, more reinforcement learning and a lot of work to address source hallucinations.

To be sure, hallucinations still happen with GPT-4 and users need to be on the lookout for them. But OpenAI has said it has made, and continues to make, significant efforts to reduce them.

At the end of the day, “like other technology, if used incorrectly, there are pitfalls and there are dangers,” Arredondo concluded during the webinar. But legal professionals have an obligation to “responsibly and reliably deploy these models, because if they’re used correctly, you’re talking about unprecedented value for attorneys and for their clients,” he said.