Evaluating Legal Research Generative AI Tools (Work In Progress)

The hype around rapidly advancing generative AI (Gen AI) products for legal research has reached another peak. Lexis recently released its Gen AI product and announced more developments, including Lexis Snapshot and the integration of its AI tool with Lexis Create. Thomson Reuters just announced its AI-assisted research tool for Westlaw Precision and a Microsoft Word plugin called Copilot. (How many Copilot AI tools can you count?) Bloomberg has not announced a legal Gen AI product yet, but perhaps it’s only a matter of time. Alongside these prominent players, countless other databases and startups are releasing their own ChatGPT-like products. The question now is: How do we evaluate these tools?

I didn’t have a systematic approach when I started experimenting; I was more focused on understanding how the tools worked. Because I hadn’t used many Gen AI chatbots, I wanted to play around with different types of prompts to get a better sense of how to formulate my queries. This differed from how I conduct regular Boolean searches, where I rely heavily on filters and experiment with different search terms.

I soon realized I needed a more consistent method of evaluating these tools. Beyond noting ease of use or dismissing a program whose initial response to a query had no merit, I had no way to compare the tools’ effectiveness directly. Consequently, I’m drafting a rubric to assess these products based on their accuracy and usability, though finalizing it is taking longer than I expected.
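
To make this more concrete, here is a minimal sketch of how such a rubric might be structured, assuming a simple weighted-scoring approach. The criteria, weights, and example scores are placeholders I made up for illustration, not a finished rubric:

```python
from dataclasses import dataclass, field

# Placeholder criteria and weights; illustrative only, not a finished rubric.
CRITERIA = {
    "accuracy": 0.4,          # Is the answer legally correct for the jurisdiction?
    "citation_quality": 0.3,  # Are the cited authorities real, relevant, and good law?
    "completeness": 0.2,      # Does it surface the key cases and statutes a researcher needs?
    "usability": 0.1,         # Is the tool easy to use and the output easy to verify?
}

@dataclass
class Evaluation:
    tool: str
    prompt: str
    scores: dict = field(default_factory=dict)  # criterion -> score on a 0-5 scale

    def weighted_total(self) -> float:
        """Combine per-criterion scores (0-5) into a single weighted score."""
        return sum(CRITERIA[c] * s for c, s in self.scores.items())

# Example with made-up numbers:
e = Evaluation(
    tool="Hypothetical Gen AI research tool",
    prompt="What are the elements of negligence per se in this state?",
    scores={"accuracy": 4, "citation_quality": 3, "completeness": 3, "usability": 5},
)
print(round(e.weighted_total(), 2))  # 3.6
```

Here are some of the things I’m considering as I draft it.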

First, users will approach these chatbots differently depending on their proficiency and their legal research needs. For example, students will likely use Gen AI tools to help draft memos for their writing classes, and they will approach their research questions with less expertise than practitioners, faculty, or other researchers. Will a chatbot respond differently depending on how the question is phrased (e.g., in terms of art versus everyday language)? Can one tool meet the needs of all of these users?
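
One way to test this would be to run the same issue through a tool twice, once phrased with terms of art and once in everyday language, and score both answers with the same rubric. A quick sketch, with an invented example question:

```python
# Paired phrasings of the same (invented) research issue, to check whether
# terms of art versus everyday language change the quality of the answer.
prompt_pairs = [
    {
        "issue": "premises liability: duty owed to entrants",
        "term_of_art": "What duty of care does a landowner owe to a licensee versus an invitee?",
        "everyday": "If a visitor gets hurt on someone's property, is the owner responsible?",
    },
]

for pair in prompt_pairs:
    for style in ("term_of_art", "everyday"):
        # Submit pair[style] to each tool, then score both answers with the same rubric.
        print(f"{pair['issue']} ({style}): {pair[style]}")
```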

Next, in trying to come up with prompts for the rubric, I realized I would be better off testing the products on research topics and questions I already know well, so that I can quickly tell whether an answer is correct. Luckily, I’ve been meeting with 1L students about researching their open memo assignment, so I’m very familiar with the applicable cases and statutes. But what if a tool’s accuracy depends on the jurisdiction and subject matter? To assess the tools fully, I need multiple types of questions across subject areas and across federal and state jurisdictions. This is starting to feel like drafting hypotheticals and grading rubrics for a legal research class.
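
For now, I’m imagining something like a small coverage matrix: each test question tagged by jurisdiction and subject matter and paired with the authorities I already know a correct answer should cite. The entries below are invented placeholders, not my actual question set:

```python
# An invented coverage matrix: each test question records its jurisdiction,
# subject area, and the authorities a correct answer would be expected to cite.
test_questions = [
    {
        "jurisdiction": "Federal",
        "subject": "Civil procedure",
        "question": "When may a federal court exercise supplemental jurisdiction?",
        "expected_authorities": ["28 U.S.C. § 1367"],
    },
    {
        "jurisdiction": "State (fill in)",
        "subject": "Torts",
        "question": "What are the elements of negligent infliction of emotional distress?",
        "expected_authorities": ["leading state supreme court case (fill in)"],
    },
]

def coverage(questions):
    """List the jurisdiction/subject combinations the question set covers."""
    return sorted({(q["jurisdiction"], q["subject"]) for q in questions})

for combo in coverage(test_questions):
    print(combo)
```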

So, I’m still drafting a rubric that works for my needs. I’m sure others will evaluate products using their own criteria. You can find a working draft of a rubric assessing AI tools from SAIL: Sensemaking AI Learning, a newsletter that “explores trends in human and artificial cognition, focusing on how sensemaking and learning are impacted by AI.” Please share any other rubrics or tips on how you’re evaluating these tools; I’m open to suggestions!

1 Response to Evaluating Legal Research Generative AI Tools (Work In Progress)

  1. rebeccafordon says:

    Have you seen the LegalBench paper, describing a collaboratively built benchmark for measuring legal reasoning? https://hazyresearch.stanford.edu/legalbench/. I wonder if librarians could develop something similar for legal research tasks and build a set of tasks (or types of tasks) that we could all use for benchmarking. I’d love to work together on something like this, if you’d be interested!
