Example: Answering Questions with Validated Citations

For the full code example, check out examples/citation_fuzzy_match.py

Overview

This example shows how to use Instructor with validators to add citations to generated answers and to reduce hallucinations. Every statement made by the LLM must be backed by a direct quote from the provided context, and each quote is checked to confirm that it actually appears in that context.
Two Python classes, Fact and QuestionAnswer, encapsulate an individual fact and the entire answer, respectively.

Data Structures

The Fact Class

The Fact class encapsulates a single statement or fact. It contains two fields:

  • fact: A string representing the body of the fact or statement.
  • substring_quote: A list of strings. Each string is a direct quote from the context that supports the fact.

Validation Method: validate_sources

This method validates the sources (substring_quote) against the context. It uses a regular expression search to find the span of each quote in the given context. A quote with no match is removed from the list, and each matched quote is replaced with the exact text found at that span.

import re
from typing import List

from pydantic import BaseModel, Field, ValidationInfo, model_validator


class Fact(BaseModel):
    fact: str = Field(...)
    substring_quote: List[str] = Field(...)

    @model_validator(mode="after")
    def validate_sources(self, info: ValidationInfo) -> "Fact":
        # The source text is supplied through pydantic's validation context.
        text_chunks = info.context.get("text_chunk", None)
        spans = list(self.get_spans(text_chunks))
        # Keep only quotes that were found, using the exact text from the context.
        self.substring_quote = [text_chunks[span[0] : span[1]] for span in spans]
        return self

    def get_spans(self, context):
        # Yield the (start, end) span of every match for every quote.
        for quote in self.substring_quote:
            yield from self._get_span(quote, context)

    def _get_span(self, quote, context):
        # re.escape makes the quote match literally rather than as a pattern.
        for match in re.finditer(re.escape(quote), context):
            yield match.span()
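
The span lookup can also be tried on its own, without pydantic or an LLM call. The following is a minimal sketch using only the standard library; find_quote_spans is a hypothetical helper name introduced here for illustration, and the context string is a shortened sample rather than the full text used later in this example.

```python
import re
from typing import Iterator, List, Tuple


def find_quote_spans(quotes: List[str], context: str) -> Iterator[Tuple[int, int]]:
    """Yield the (start, end) span of every exact occurrence of each quote."""
    for quote in quotes:
        # re.escape makes the quote match literally, even if it contains
        # regex metacharacters such as "." or "(".
        for match in re.finditer(re.escape(quote), context):
            yield match.span()


# Shortened sample context (an assumption for this sketch).
context = "I went to an arts high school but in university I studied physics."
# The second quote differs from the context by one missing space.
quotes = ["arts high school", "arts highschool"]

spans = list(find_quote_spans(quotes, context))
# Rebuild the quote list from the context itself, as validate_sources does.
kept = [context[start:end] for start, end in spans]

print(spans)
print(kept)
```

Only the first quote survives, because the match is exact: a quote that differs from the context by a single character is dropped. Note also that a quote occurring several times in the context yields one span per occurrence, so it appears several times in the rebuilt list.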

The QuestionAnswer Class

This class encapsulates the question and its corresponding answer. It contains two fields:

  • question: The question asked.
  • answer: A list of Fact objects that make up the answer.

Validation Method: validate_sources

This method checks that each Fact object in the answer list has at least one valid source. If a Fact object has no valid sources, it is removed from the answer list.

class QuestionAnswer(BaseModel):
    question: str = Field(...)
    answer: List[Fact] = Field(...)

    @model_validator(mode="after")
    def validate_sources(self) -> "QuestionAnswer":
        self.answer = [fact for fact in self.answer if len(fact.substring_quote) > 0]
        return self
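
The filtering step is independent of pydantic and can be sketched with plain dictionaries. This is an illustration only: drop_unsupported_facts is a hypothetical helper name, and the sample facts are made up to show one supported and one unsupported statement.

```python
from typing import Dict, List


def drop_unsupported_facts(facts: List[Dict]) -> List[Dict]:
    """Return only the facts that still have at least one supporting quote."""
    return [fact for fact in facts if len(fact["substring_quote"]) > 0]


# Hypothetical facts as they might look after quote validation has run.
facts = [
    {
        "fact": "The author went to an arts high school.",
        "substring_quote": ["arts high school"],
    },
    {
        # Every quote for this fact was removed, so nothing supports it.
        "fact": "The author won a national award.",
        "substring_quote": [],
    },
]

supported = drop_unsupported_facts(facts)
print([fact["fact"] for fact in supported])
```

Because the Fact validator runs before the QuestionAnswer validator, a fact whose quotes all failed to match arrives here with an empty list and is removed from the answer.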

Function to Ask AI a Question

The ask_ai Function

This function takes a string question and a string context and returns a QuestionAnswer object. It uses the OpenAI API to fetch the answer and then validates the sources using the defined classes.

To understand how the validation context works in pydantic, check out pydantic's docs.

from openai import OpenAI
import instructor

# Patch the OpenAI client with Instructor; this enables the
# response_model and validation_context keyword arguments.
client = instructor.from_openai(OpenAI())


def ask_ai(question: str, context: str) -> QuestionAnswer:
    return client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        temperature=0,
        response_model=QuestionAnswer,
        messages=[
            {
                "role": "system",
                "content": "You are a world class algorithm to answer questions with correct and exact citations.",
            },
            {"role": "user", "content": f"{context}"},
            {"role": "user", "content": f"Question: {question}"},
        ],
        validation_context={"text_chunk": context},
    )

Example

Here's an example of using these classes and functions to ask a question and validate the answer.

question = "What did the author do during college?"
context = """
My name is Jason Liu, and I grew up in Toronto Canada but I was born in China.
I went to an arts high school but in university I studied Computational Mathematics and physics.
As part of coop I worked at many companies including Stitchfix, Facebook.
I also started the Data Science club at the University of Waterloo and I was the president of the club for 2 years.
"""

The output would be a QuestionAnswer object containing validated facts and their sources. For a question such as "where did he go to school?", it would look like this, with every quote appearing verbatim in the context:

{
    "question": "where did he go to school?",
    "answer": [
        {
            "fact": "Jason Liu went to an arts high school.",
            "substring_quote": ["arts high school"],
        },
        {
            "fact": "Jason Liu studied Computational Mathematics and physics in university.",
            "substring_quote": ["university"],
        },
    ],
}

This ensures that every piece of information in the answer has been validated against the context.