2 Comments

The response rating is very thorough, but it reads like a lot of work, which would be difficult to scale unless you have a substantial team. I'm not sure if this helps, but I would have written an ideal answer to each question (one that scores well on all the factors you were looking at), then used NLP (n-gram overlap and the like) to measure how close each model's answer is to the ideal one and give it an overall rating.
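
To make the n-gram idea concrete, here is a rough sketch in Python. It assumes plain whitespace tokenization and bigram overlap scored as an F1 value; the function names `ngrams` and `overlap_f1` are made up for illustration, and a real setup would probably use an established metric such as BLEU or ROUGE instead.

```python
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_f1(candidate, reference, n=2):
    """Score a model answer against an ideal answer by n-gram overlap (0 to 1)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand or not ref:
        return 0.0  # one of the texts is shorter than n words
    shared = sum((cand & ref).values())  # n-grams present in both, with multiplicity
    if shared == 0:
        return 0.0
    precision = shared / sum(cand.values())  # share of the answer found in the ideal
    recall = shared / sum(ref.values())      # share of the ideal covered by the answer
    return 2 * precision * recall / (precision + recall)


# Hypothetical ideal answer and model answers, purely for illustration.
ideal = "the capital of france is paris"
answers = {
    "model_a": "the capital of france is paris",
    "model_b": "paris is a large city in europe",
}
scores = {name: overlap_f1(text, ideal) for name, text in answers.items()}
```

A score like this only measures surface similarity: an answer that paraphrases the ideal one well will score low, and an answer that copies most of it but gets one key fact wrong will still score high.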

author

That's a really good idea, almost like grading an essay against an ideal answer. I'm not sure it would catch hallucinations, but I think it would work for the other stuff.
