Verify Responses over Majority Voting
Diverse Verifier On Reasoning Step (DiVeRSe)1 is a prompting technique which provides two main improvements
- Diverse Prompts : They generate multiple variations of the same prompt by varying the examples used in each prompt
- Verification : They use a finetuned
Deberta-V3-Large
to determine the quality of a generated response. Instead of using majority voting, they use their model to score each generated response from 0 to 1. They then aggregate these scores for each unique answer to determine the best generated solution.
In the paper itself, they also train a step-wise verifier that is able to score each individual reasoning step. This enables much more fine-grained predictions but is challenging to obtain training data for.
We can implement this in instructor
. However, instead of using a deberta-v3-large
model, we'll be using gpt-4o to score its own outputs and generate a quality score.
import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel
from typing import Literal
from textwrap import dedent
import asyncio
from collections import defaultdict
client = instructor.from_openai(AsyncOpenAI())
class Response(BaseModel):
chain_of_thought: str
answer: int
class Grading(BaseModel):
grade: Literal["Poor", "Average", "Good", "Excellent"]
def get_score(self):
mapping = {
"Poor": 0.25,
"Average": 0.5,
"Good": 0.75,
"Excellent": 1,
}
return mapping[self.grade]
async def generate_response(query: str, examples: list[str]):
formatted_examples = "\n".join(examples)
return await client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": dedent(
f"""
You are a world class AI that excels at answering
user queries in a succint and accurate manner.
<query>
{query}
</query>
<examples>
{formatted_examples}
</examples>
"""
),
}
],
response_model=Response,
)
async def score_response(query: str, response: Response) -> tuple[Response, Grading]:
return (
response,
await client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": dedent(
f"""
You are a world class AI that excels at grading
responses to a user query in a succint and clear
manner.
<query>
{query}
</query>
<response>
{response}
</response>
"""
),
}
],
response_model=Grading,
),
)
async def generate_response_batch(
query: str, examples: list[str], n_examples_per_batch: int
):
batches: list[list[str]] = []
for i in range(0, len(examples), n_examples_per_batch):
batches.append(examples[i : i + n_examples_per_batch])
coros = [generate_response(query, example_batch) for example_batch in batches]
return await asyncio.gather(*coros)
async def score_responses(
query: str, responses: list[Response]
) -> list[tuple[Response, Grading]]:
coros = [score_response(query, response) for response in responses]
return await asyncio.gather(*coros)
if __name__ == "__main__":
examples = [
"""
Q: James decides to run 3 sprints 3 times a week.
He runs 60 meters each sprint. How many total
meters does he run a week?
A: James decides to run 3 sprints 3 times a week.
He runs 60 meters each sprint. So he runs 60 meters
x 3 sprints x 3 times a week. That is 60 meters x 9.
The answer is 540.
""",
"""
Q: Brandon's iPhone is four times as old as Ben's
iPhone. Ben's iPhone is two times older than Suzy's
iPhone. If Suzy's iPhone is 1 year old, how old is
Brandon's iPhone?
A: Brandon's iPhone is 4 times as old as Ben's
iPhone. Ben's iPhone is 2 times older than Suzy's
iPhone. So Brandon's iPhone is 4 x 2 = 8 times older
than Suzy's iPhone. Suzy's iPhone is 1 year old. So
Brandon's iPhone is 8 x 1 = 8 years old. The answer
is 8.
""",
"""
Q: Jean has 30 lollipops. Jean eats 2 of the
lollipops. With the remaining lollipops, Jean wants
to package 2 lollipops in one bag. How many bags can
Jean fill?
A: Jean started with 30 lollipops. She ate 2 of
them. So she has 28 lollipops left. She wants to
package 2 lollipops in one bag. So she can package
28 / 2 = 14 bags. The answer is 14.
""",
"""
Q: Weng earns $12 an hour for babysitting.
Yesterday, she just did 50 minutes of babysitting.
How much did she earn?
A: Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 =
$<<0.2*50=10>>10. The answer is 10
""",
]
query = """Betty is saving money for a new wallet which
costs $100. Betty has only half of the money she needs.
Her parents decided to give her $15 for that purpose,
and her grandparents twice as much as her parents. How
much more money does Betty need to buy the wallet?"""
generated_responses = asyncio.run(generate_response_batch(query, examples, 1))
scored_responses = asyncio.run(score_responses(query, generated_responses))
scores: dict[int, float] = defaultdict(int)
for response, grade in scored_responses:
scores[response.answer] += grade.get_score()
print(scores)
#> defaultdict(<class 'int'>, {5: 3.5})
answer = max(scores, key=scores.get)
print(answer)
#> 5
References¶
1: Making Language Models Better Reasoners with Step-Aware Verifier