Differences in AI Models for Legal Advice Seekers:
Comparing the public LLM models, and then rating their answers
Check out the experiment in progress here, or keep reading, or both!
Ever since Bing revealed that they seem to be going full send on using generative AI in their search platform, and Google seems to be following after, I’ve been thinking about how legal information / advice seekers will use it. Not whether they will use it to get legal information / advice, since the most-used triage system for people seeking legal help is Google. After all, there have already been instances of lawyers using ChatGPT to get themselves in trouble (it just happened again btw), and non-lawyers using it to write things “like a lawyer.”
Now we have at least 5 generative AI models that people can use, with varying degrees of availability:
GPT-3.5 / OpenAI
GPT-4 / OpenAI
Bard / Google
Claude / Anthropic
LLaMA2 / Meta
So one thing I’ve been wondering is this:
If people are using these models to get legal advice, what kind of advice are they actually getting? And are they any good?
I feel like I need to point this out here: Each model is different - these aren’t different brands of calculators that all answer 2+2 in exactly the same way. Given the same question, the different models give similar but different answers that include different pieces of information. The same model, asked the same question at different points in time, will also give similar but different answers. So, literally, YMMV.
An aside: I think the horse has left the barn on the “are generative AI models giving legal advice” question. The line between legal advice & information is completely hazy and means next to nothing to these models.
So I devised an experiment:
What if we used the data set of r/legaladvice questions used to train the Spot issue-spotting model, and fed those into each model and recorded the results?
Some benefits to using this data set:
It’s already public / CC licensed;
The questions there are very wordy and descriptive, and written in a narrative format. This gives the AI models a lot to chew on, and the models react really well to having a lot of context. For example, the input (there’s a quick sketch of actually sending it to a model right after the example):
“Got Rear-ended today, what other steps should I take?. So I was waiting at a stoplight today and got low-speed rear-ended. It wasn't bad, but did give me a fright. My car was still perfectly drivable, even with the dings and scratches. I took pictures of my car, her car, her insurance and license plate and submitted a claim online. I didn't call the police on her because I am certain it wasn't bad enough to cause injury, and accidents are accidents, but she started being a little cagey after all of this and kept repeating that she thought some of the damage was there before, which it wasn't. I don't expect any issues and I'm not trying to get some payout, but was there anything else I could do besides calling the police to prevent me from being screwed over?”
Works way better than
“How to make property damage liability claim after automobile accident”
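To make that concrete, here’s a rough sketch of what sending one of these questions to a model looks like. It assumes the openai Python package (the ChatCompletion-era API) and an API key in the environment; the ask() helper, the model name, and the truncated prompt are just placeholders for illustration, not the exact script I’m running.

```python
# A rough sketch, assuming the openai Python package (ChatCompletion-era API)
# and an API key in the environment. Helper name and truncation are mine.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

NARRATIVE_PROMPT = (
    "Got Rear-ended today, what other steps should I take?. So I was waiting "
    "at a stoplight today and got low-speed rear-ended. ..."  # rest of the post above
)
TERSE_PROMPT = "How to make property damage liability claim after automobile accident"

def ask(model: str, question: str) -> str:
    """Send one question to one chat model and return the text of its answer."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,  # keep the output as repeatable as the API allows
    )
    return response.choices[0].message.content

# The narrative version gives the model far more context to work with.
print(ask("gpt-3.5-turbo", NARRATIVE_PROMPT))
print(ask("gpt-3.5-turbo", TERSE_PROMPT))
```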
Using Airtable and some mild scripting, I’ve created a system for storing the questions and responses:
Next, because Airtable is great for storing data but not for viewing it, I used Softr to create a web front-end for the table. You can view that here.
Where possible I’ve set the temperature as low as it will go, so that the output is as close to deterministic as possible.
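If it helps to see what the “mild scripting” might look like, here’s a minimal sketch of the storage side, assuming the pyairtable package. The base ID, table name, and field names are hypothetical, and ask() and NARRATIVE_PROMPT are the pieces from the sketch above.

```python
# A minimal sketch of writing responses back to Airtable, assuming pyairtable.
# The base ID, table name, and field names below are hypothetical.
import os
from pyairtable import Table

table = Table(
    os.environ["AIRTABLE_API_KEY"],
    "appXXXXXXXXXXXXXX",   # hypothetical base ID
    "Responses",           # hypothetical table name
)

def record_response(question: str, model: str, answer: str) -> None:
    """Store one question / model / answer triple as a new Airtable record."""
    table.create({"Question": question, "Model": model, "Response": answer})

# Run one question through a couple of models and save what comes back.
# ask() and NARRATIVE_PROMPT come from the earlier sketch.
for model in ["gpt-3.5-turbo", "gpt-4"]:
    record_response(NARRATIVE_PROMPT, model, ask(model, NARRATIVE_PROMPT))
```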
Here’s a sample of some of the different responses:
Right now I’m working through the list as I have time, but it’s very time-consuming.
Rating the responses:
I’m also thinking about what system to use to rate the responses, and what metrics should apply. So far I’ve come up with this list (with a sketch of how the ratings might be recorded after it):
Understanding of the question
Does the model’s answer seem to reflect an understanding of the question being asked, or is it missing something obvious? I think this can be rated on a 0-5 scale.
Correctness
How correct is the answer? This is a hard one because many of the questions are in different jurisdictions and about different areas of law. This is different from the “did it hallucinate” question, because an answer can be correct by reciting the correct legal principle, but still fail in terms of hallucinations by inventing a supporting case that doesn’t exist. I think this can be rated on a 0-5 scale.
Lack of hallucinations
Does the answer MSU (make shit up)? Turning the temperature down as far as possible seems to really help with this issue. I think this can be measured on a binary Yes / No.
Helpfulness:
Is the answer actually helpful? It seems that some of the models have breakpoints where they refuse to answer. An example: on this question, both the OpenAI models canned up, probably because the question includes the word suicide, while the other models still provided an answer. On this question, the OpenAI models provided answers but Bard canned up, maybe because the question mentioned abuse. I think this can be rated on a 0-5 scale.
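For what it’s worth, here’s a sketch of one way those metrics could be recorded per response. The field names and the naive composite score are my own choices for illustration, not a settled rubric.

```python
# A sketch of recording the proposed metrics per response; the field names
# and the simple composite score are hypothetical choices, not a fixed rubric.
from dataclasses import dataclass

@dataclass
class ResponseRating:
    understanding: int   # 0-5: does the answer reflect the question actually asked?
    correctness: int     # 0-5: is the legal substance right?
    hallucinated: bool   # did it MSU (invent cases, statutes, etc.)?
    helpfulness: int     # 0-5: would this actually help the person asking?

    def composite(self) -> float:
        """Average the 0-5 scales, then zero out any answer that hallucinates."""
        base = (self.understanding + self.correctness + self.helpfulness) / 3
        return 0.0 if self.hallucinated else base

rating = ResponseRating(understanding=4, correctness=3, hallucinated=False, helpfulness=4)
print(rating.composite())   # roughly 3.67
```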
If you have any suggestions on how to rate the answers please let me know.
Why do this?
I think this kind of experiment is important for two reasons:
Like I’ve said before, people are going to be using things like Bing, Bard, and ChatGPT for getting legal advice. While I’m just some dude, and the tech gods out in California have no idea I exist, I’d like to think maybe they’ll care at some point how their products are responding to these questions. Getting legal advice is definitely a foreseeable use case, and people sure as hell can’t afford to get advice from lawyers. I’m not saying that these models shouldn’t give legal advice - quite the opposite - I’m saying that the legal advice they give should be helpful, accurate, and include further reading and organizations the person asking the question can contact. I’m hoping that the tech gods will understand this, and maybe little old me tinkering around with this will help.
There’s a great deal of interest in legal aid in using these models for things like triage, and at this point they’re very easy to integrate and deploy. I think any responsible legal aid organization should compare not just the cost of the different models, but how they actually behave in the real world. I want to see legal aid orgs and other groups go into adopting this technology with their eyes open.
I should add that if you’re interested in this kind of thing and want to collaborate on this, please let me know!