Unit 5: Getting Confidence in LLM Answers
A throughline of this course has been:
An LLM is like a wild horse: powerful but unpredictable. The main challenge when integrating LLMs into our software is to tame that power.
- In unit 3 we learned how to guarantee that an agent doesn't perform dangerous actions.
- In unit 4 we learned how to guarantee correctness properties of AI-generated code (when using AI as a coding assistant).
One kind of LLM danger/mistake we haven't yet discussed is the problem of hallucinations in their text output. Although with natural-language text it's rarely possible to provide the same hard guarantees as for code, in this lecture we'll see how close we can get.
Q: Give some examples of LLM hallucinations that can have real-life consequences. Can you think of ways those hallucinations could be prevented?
Outline
- A chatbot that couldn't: why we need grounding
- RAG via semantic search: retrieving the right piece of the source of truth
- Adding symbolic reasoning: when retrieval alone isn't enough
Running Example: ChatPHD
This is a real-world example. For those of you who don't know, I'm the CSE department GradCom Chair, so I get a lot of questions from PhD students about how the CSE PhD program works. Wouldn't it be nice if I could just build a chatbot that answers those questions for me?
First try: a vanilla LLM
Don't LLMs already know all this stuff? I can just make a simple chatbot web UI and hook it up to the chat completions API, right? (We'll use a CLI in the lecture notes for simplicity, but imagine this as a web UI.)
from openai import OpenAI

MODEL = "gpt-4o-mini"
SYSTEM_PROMPT = (
    "You are ChatPHD, a helpful assistant that answers questions about "
    "the PhD program in Computer Science and Engineering at UC San Diego."
)
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
while True:
    user_input = input("you> ")
    messages.append({"role": "user", "content": user_input})
    # Send the whole conversation so far; the API itself is stateless.
    response = client.chat.completions.create(model=MODEL, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"chatphd> {reply}")
Let's try it:
chatphd> Hi! I'm ChatPHD. Ask me anything about the UCSD CSE PhD program.
you> Are there some quals I need to pass before my defense?
chatphd> Yes, the PhD program in Computer Science and Engineering at UC San Diego
typically requires students to pass a qualifying examination. This exam is designed
to assess your knowledge and understanding of fundamental concepts in computer
science and engineering. It usually covers core topics such as algorithms, systems,
and theory, among others.
The format and timing of the qualifying exam can vary, so it's important to check
the specific guidelines provided by the department. Generally, students take the
qualifying exam after completing required coursework and before advancing to
candidacy. It's advisable to consult your faculty advisor or the program's
official documentation for the most accurate and detailed information regarding
the qualifying exam.
Hm, this is not great. If you're a CSE PhD student, you probably know that there is no exam that "assesses your knowledge and understanding of fundamental concepts in computer science and engineering." The model doesn't actually know anything specific about our program — it's making up plausible-sounding generalities based on what PhD programs typically look like.
Luckily, we have the source of truth: the CSE PhD Student handbook that GradCom worked very hard on.
Q: How can we use the handbook to improve the chatbot's answers? What are the pros and cons of different approaches?
Strawman solution: prepend the whole handbook text to the prompt.
- Pros: very easy to implement!
- Cons: the whole handbook is ~22k tokens; this is wasteful (cost-wise) and the model might get confused about what to focus on. It also doesn't scale to larger documents (imagine if our "handbook" were a 500-page legal code).
Instead we want to enable the model to search the handbook for relevant information, and only put the relevant pieces into the prompt. This pattern is called retrieval-augmented generation (RAG).
Two ways the LLM could search the handbook:
- Keyword search. LLM comes up with keywords or a regex, we search the handbook text for exact matches and return the surrounding sections. But some information is hard to find with keyword search. For example, if the question is "Who is allowed to be on my committee?", searching for "committee" yields 25 matches — and most of them are about the Graduate Committee, the DEI Committee, etc., not about thesis committees.
- Semantic search. LLM (or rather, an embedding model) maps the question to a vector that captures its meaning, we do the same for all the sections of the handbook, and we return the sections whose vectors are closest to the question's vector.
Semantic search is the dominant approach to RAG, so this is what we'll try next.
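To make the keyword-search failure mode concrete, here is a minimal sketch over a toy stand-in for the handbook. The section titles and bodies are invented for illustration, and `keyword_search` is a hypothetical helper, not part of the ChatPHD demo.

```python
import re

# Toy stand-in for the handbook: section title -> body (all text here is made up).
SECTIONS = {
    "Thesis Committee": "Your thesis committee must include your advisor as chair.",
    "Graduate Committee": "The Graduate Committee (GradCom) oversees the PhD program.",
    "DEI Committee": "The DEI Committee organizes department-wide events.",
    "Final Defense": "You present your dissertation to your doctoral committee.",
    "Filing Fee": "Students may pay a filing fee in their last quarter.",
}


def keyword_search(pattern: str, sections: dict[str, str]) -> list[str]:
    """Return titles of sections whose title or body matches the regex (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [title for title, body in sections.items()
            if rx.search(title) or rx.search(body)]


hits = keyword_search(r"committee", SECTIONS)      # broad keyword: many matches
narrow = keyword_search(r"thesis committee", SECTIONS)  # exact phrase: few matches
print(hits)
print(narrow)
```

The broad keyword matches four of the five sections, most of them irrelevant. The exact phrase finds only one, and misses the defense section because it says "doctoral committee" instead. Exact matching forces this trade-off between noise and missed paraphrases.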
Outline
- A chatbot that couldn't: why we need grounding [done]
- RAG via semantic search: retrieving the right piece of the source of truth
- Adding symbolic reasoning: when retrieval alone isn't enough
RAG via Semantic Search
Background: Embeddings
An embedding is a function that maps a piece of text to a fixed-length vector of numbers. The same kind of LLM that we've been using to generate text can also be used to embed it (in fact, it's a simpler operation — we just stop before the part that turns the internal representation back into words).
                     ┌────────────────────┐
"Roses are red" ────▶│  embedding model   │────▶ [ 0.21, -0.04, 0.87, ..., 0.12 ]
                     └────────────────────┘               (1536 numbers)
By itself, a vector of 1536 numbers isn't very interesting. What makes embeddings useful is a particular property the embedding model is trained to have:
Texts with similar meaning get mapped to vectors that are close together; texts with different meaning get mapped to vectors that are far apart.
We can measure how close two vectors are using cosine similarity — the cosine of the angle between them, which ranges from -1 (opposite) to 1 (identical direction):
"Roses are red" ────▶ [ 0.21, -0.04, 0.87, ... ] ──┐
├──▶ cosine = 0.84 (close)
"Violets are blue" ────▶ [ 0.19, -0.02, 0.85, ... ] ──┘
"Roses are red" ────▶ [ 0.21, -0.04, 0.87, ... ] ──┐
├──▶ cosine = 0.15 (far)
"How do I file taxes?" ────▶ [-0.81, 0.55, -0.10, ... ] ──┘
In NumPy, cosine similarity is one line:
import numpy as np

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
And if we pre-normalize every vector to unit length (divide by its norm once, when we first compute it), then cosine similarity is just the dot product. We'll use this trick below.
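As a quick check of the pre-normalization trick, here is a small worked example. It is a pure-Python stand-in for the NumPy version (so it runs without dependencies); the helper names `dot`, `norm`, and `unit` and the two 2-d vectors are made up for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine_sim(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def unit(u):
    """Scale u to length 1 (the 'pre-normalize once' step)."""
    n = norm(u)
    return [a / n for a in u]


u = [3.0, 4.0]
v = [4.0, 3.0]
full = cosine_sim(u, v)            # 24 / (5 * 5) = 0.96
shortcut = dot(unit(u), unit(v))   # same value, up to floating-point rounding
print(full, shortcut)
```

The division by the norms happens either way. Pre-normalizing just moves it to indexing time, so each query costs only dot products.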
Q: How can we use cosine similarity to search a document?
Semantic search with embeddings
The plan: embed every section of the handbook once, save those vectors to disk, and then at query time embed just the user's question and find the closest sections.
Offline (build the index once):
                                                          embedding index
                                                       ┌────────────────────┐
                      ┌──────────────┐                 │ chunk 1 ──▶ v1     │
                      │ chunk 1      │                 │                    │
handbook.md ─────────▶│ chunk 2      │ ───▶ embed ───▶ │ chunk 2 ──▶ v2     │
                      │ chunk N      │                 │ ...                │
                      └──────────────┘                 │ chunk N ──▶ vN     │
                                                       └────────────────────┘
Online (one search per query):
                                               ┌────────────┐ cosine
                                               │ v1         │  sim
"are there quals?" ──▶ embed ──▶ u ─────▶ ┌────┤ ...        │────────▶ top-k chunks
                                          │    │ vN         │                │
                                          │    └────────────┘                ▼
                                          │                             LLM context
                                      embedding
                                        index
The main design decision we have to make here is how to chunk the document.
Q: What would happen if we made each chunk really large (say, 1000 sentences)? Really small (a single sentence)?
The trade-off:
- A few large chunks → back to the same problem we had with the whole document: too much irrelevant context per retrieved chunk.
- Many tiny chunks → each chunk lacks the surrounding context the model needs to actually understand it.
There's also a more subtle issue: we'd like each chunk to be about one topic, so that its embedding actually represents something coherent. If a chunk mixes sentences about completely different things, its vector ends up somewhere in between, close to nothing in particular, and semantic search will struggle to retrieve it.
Lucky for us, our handbook is already organized into self-contained sections (one per topic), so we can just split along section boundaries (every # or ## heading).
Building the handbook embedding index
Here's the offline indexing code, in full:
import json, os, re
import numpy as np
from openai import OpenAI

EMBED_MODEL = "text-embedding-3-small"

def chunk_handbook(text: str) -> list[str]:
    """Split the handbook on every h1/h2 heading. Each chunk = heading + body."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A new h1/h2 heading closes the chunk collected so far.
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

client = OpenAI()
with open("handbook.md") as f:
    chunks: list[str] = chunk_handbook(f.read())
print(f"Chunked handbook into {len(chunks)} sections.")  # 64 for our handbook

# OpenAI's embeddings API takes a list of strings and returns a list of vectors,
# one per input string, all in a single API call.
response = client.embeddings.create(model=EMBED_MODEL, input=chunks)
embeddings: np.ndarray = np.array([d.embedding for d in response.data])

# Pre-normalize so cosine similarity reduces to a dot product later.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

os.makedirs("index", exist_ok=True)  # np.save does not create missing directories
np.save("index/embeddings.npy", embeddings)
with open("index/chunks.json", "w") as f:
    json.dump(chunks, f)
Note that we just dump the embeddings into a NumPy array and save it to disk as one big file. In production, you'd typically store them in a vector database — a specialized database for storing vectors and finding nearest neighbors efficiently. Popular options include FAISS (in-process), pgvector (Postgres extension), and hosted services like Chroma, Pinecone, or Weaviate. These do the same thing we're doing here, just with smarter indexing so they don't have to compare the query against every vector — important when you have millions of chunks. For our small handbook (~60 sections), a NumPy dot product against the whole matrix is fast enough that the extra machinery isn't worth it.
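Before spending API calls on embeddings, it can be worth sanity-checking the chunker on a toy document. The sketch below repeats the chunk_handbook logic so that it runs on its own; the three-section markdown string is made up and is not the real handbook.

```python
import re


def chunk_handbook(text: str) -> list[str]:
    """Same splitting logic as the indexing code: new chunk at every h1/h2 heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


# A made-up document with one h1, two h2 and one h3 heading.
toy = """# Milestones
Overview of program milestones.
## The Research Exam
Take it by your seventh quarter.
### Format
An oral presentation.
## The Thesis Proposal
Also known as the University Qualifying Exam.
"""

toy_chunks = chunk_handbook(toy)
for c in toy_chunks:
    # Print each chunk's heading and its number of lines.
    print(repr(c.splitlines()[0]), len(c.splitlines()), "lines")
```

Note that the h3 heading does not start a new chunk: it stays inside its parent h2 section, which keeps subsections together with the context they need.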
Hooking it up to the chatbot
Q: Now that we have the embedding index, how do we actually give this information to the chatbot?
The simplest and most flexible way: just make semantic search a tool that the LLM can call. (Yes, technically our chatbot is now an agent — recall Unit 3.)
The tool itself is short:
from agents import function_tool  # assumed: the OpenAI Agents SDK from Unit 3

embeddings = np.load("index/embeddings.npy")
with open("index/chunks.json") as f:
    chunks = json.load(f)

@function_tool
def search_handbook(query: str, k: int) -> list[str]:
    """Search the UCSD CSE PhD Student Handbook. Returns the top-k most relevant sections."""
    response = client.embeddings.create(model=EMBED_MODEL, input=[query])
    q = np.array(response.data[0].embedding)
    q /= np.linalg.norm(q)
    scores = embeddings @ q  # cosine sim (vectors are pre-normalized)
    top = np.argsort(-scores)[:k]  # indices of the k highest scores, best first
    return [chunks[i] for i in top]
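To see the ranking step in isolation, without any API calls, here is a minimal pure-Python sketch of the same top-k logic. The chunk titles and 3-d vectors are invented (real embeddings have 1536 dimensions), and `unit` and `top_k` are hypothetical helper names.

```python
import math

# Toy 3-d "embeddings", invented for illustration.
toy_chunks = ["Thesis proposal", "Final defense", "Parking permits"]
toy_vectors = [[0.9, 0.4, 0.1], [0.7, 0.7, 0.1], [0.0, 0.1, 1.0]]


def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


# Pre-normalized rows, playing the role of embeddings.npy.
index = [unit(v) for v in toy_vectors]


def top_k(query_vec, k):
    """Rank chunks by dot product with the normalized query; return the best k."""
    q = unit(query_vec)
    scores = [sum(a * b for a, b in zip(row, q)) for row in index]
    order = sorted(range(len(index)), key=lambda i: -scores[i])
    return [toy_chunks[i] for i in order[:k]]


results = top_k([1.0, 0.5, 0.0], k=2)
print(results)
```

This compares the query against every row, which is exactly the brute-force scan that a vector database would replace with smarter indexing.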
And we wire it up to an agent the same way we did in Unit 3:
INSTRUCTIONS = (
"You are ChatPHD, a helpful assistant that answers questions about "
"the PhD program in Computer Science and Engineering at UC San Diego.\n\n"
"You have access to the search_handbook tool, which searches the official "
"UCSD CSE PhD Student Handbook. You MUST use this tool to answer any "
"substantive question about the program — never answer from your own "
"knowledge. If the search results do not contain the answer, say so "
"plainly rather than guessing."
)
agent = Agent(
    name="ChatPHD",
    instructions=INSTRUCTIONS,
    model=MODEL,
    tools=[search_handbook],
)
# ... plus the usual conversation loop from Unit 3 ...
Notice the system prompt is doing some heavy lifting: it tells the model to always call the tool and to refuse to guess if the answer isn't there. This is our first attempt at grounding the model in the source of truth.
Let's take it for a spin:
chatphd> Hi! I'm ChatPHD. Ask me anything about the UCSD CSE PhD program.
you> Are there some quals I need to pass before my defense?
chatphd> Yes, before your defense, you need to pass the **Thesis Proposal**,
also known as the **University Qualifying Exam**. This is a presentation to
your thesis committee where you present your work to date and describe your
plans for completing your Ph.D.
Here are some important details:
- You must complete the thesis proposal by the end of your 4th year
(12th quarter of enrollment, not including summers).
- After passing the thesis proposal, you advance to candidacy for your Ph.D.
- The committee must include your advisor as a chair, at least four members
from UC San Diego faculty, and meet other specific criteria.
The defense presentation itself does not have strict requirements, but you'll
need to prepare a comprehensive presentation summarizing your thesis work.
Hooray! This is the right answer.
The logs show that the model made the following tool call (and got these results):
[search_handbook(query='qualifications before defense', k=5)]
- ## *The Final Defense*
- ## *Support eligibility*
- ## *Filing Fee To Complete Your Degree*
- ## *The Thesis Proposal (University Qualifying Exam)*
- # **Resources For Students with Disabilities**
Two things are worth noticing here:
- The user said "quals before my defense", the model searched for "qualifications before defense", and the relevant section actually talks about "University Qualifying Exam". This is a great example of the power of semantic search: you don't need to guess the exact phrasing.
- The top-k results include some clearly irrelevant sections (Support eligibility, Resources For Students with Disabilities). The model handled this gracefully by simply ignoring them. This is part of why we ask for several results rather than just one — semantic similarity is fuzzy, and giving the model a small menu to pick from is more robust than betting everything on the single closest match.
Q: Do we have a guarantee that the model's answer matches the handbook? If not, what could we do to get one?
Verifying answers with quotes
Not really. Our system prompt is a plea to the model — there is nothing in our pipeline that checks whether what comes out actually matches the handbook. The model could still mix in plausible-sounding details from its training data or mis-remember a number from a retrieved chunk.
One particularly simple and surprisingly effective way to get a real guarantee:
- Ask the model to produce verbatim quotes from the handbook supporting each claim
- Check (with code) that those quotes are actually present in the handbook
- Show the quotes to the user, marked with a ✓ if they check out and an ✗ if not
Our modest but real guarantee is:
Every quote marked with a ✓ corresponds to a verbatim span of the handbook.
(It is still up to the user to decide whether the quoted text actually supports the claim in the model's answer, but this is arguably an easy task.)
To implement this, we need two things: structured output from the model (text + list of quotes), and a verifier that checks each quote against the source text.
For the structured output, we use a Pydantic schema (recall Unit 1):
from pydantic import BaseModel

class Answer(BaseModel):
    text: str          # the natural-language answer, with [1], [2], ... markers
    quotes: list[str]  # one verbatim quote per marker, in order
…and we extend the system prompt to require this format, and to be strict about verbatim:
INSTRUCTIONS = (
"You are ChatPHD ...\n\n"
"Your output is an Answer with `text` and `quotes`. Every substantive "
"claim in `text` must be supported by a verbatim quote from the handbook, "
"referenced by a marker like [1], [2], etc. Quotes must be a short contiguous "
"span copied EXACTLY from the handbook."
)
For the verifier, the most naive version would just be substring search:
def find_quote(quote: str) -> bool:
    return quote in handbook_text
This basically works, but it's too strict in practice. LLMs love to lightly normalize what they copy: they collapse runs of whitespace, straighten smart quotes (“foo” → "foo"), NFKC-normalize stray unicode, occasionally change capitalization. We don't want to flag a quote as fabricated just because the model changed a “ to a ". So we normalize both sides before checking:
def normalize(s: str) -> str:
    """Lowercase, collapse whitespace, straighten smart quotes, NFKC."""
    ...  # implementation left as an exercise to Claude :)

NORM_HANDBOOK = normalize(handbook_text)

def find_quote(quote: str) -> bool:
    return normalize(quote) in NORM_HANDBOOK
This is still quite strict — anything beyond cosmetic differences (a paraphrase, a dropped word, a wrong number) will be flagged — but that's exactly what we want. The point of the verifier is to force the model to commit to a specific span of source text.
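For reference, here is one possible way to fill in normalize, together with a small renderer for the ✓/✗ marks. This is a sketch under assumptions, not the demo's actual code: the `handbook_text` below is a tiny made-up stand-in, and `_QUOTES` and `render_quotes` are hypothetical names.

```python
import re
import unicodedata

# Tiny made-up stand-in for the real handbook text.
handbook_text = (
    "The department expects students to complete the research exam by the\n"
    "beginning of your seventh enrolled quarter in the Ph.D. program.\n"
    "Students may also prepare a \u201ccreative survey\u201d for the exam."
)

# Map curly quotes to straight ASCII quotes (NFKC alone leaves them unchanged).
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(s: str) -> str:
    """Lowercase, collapse whitespace, straighten smart quotes, NFKC."""
    s = unicodedata.normalize("NFKC", s)
    s = s.translate(_QUOTES)
    s = re.sub(r"\s+", " ", s).strip()
    return s.lower()


NORM_HANDBOOK = normalize(handbook_text)


def find_quote(quote: str) -> bool:
    return normalize(quote) in NORM_HANDBOOK


def render_quotes(quotes: list[str]) -> str:
    """One line per quote: its [n] marker, a check mark or cross, then the quote."""
    lines = []
    for n, q in enumerate(quotes, start=1):
        ok = find_quote(q)
        mark = "✓" if ok else "✗"
        suffix = "" if ok else " (not found in handbook)"
        lines.append(f'[{n}] {mark} "{q}"{suffix}')
    return "\n".join(lines)


rendered = render_quotes([
    'prepare a "creative survey" for THE exam',   # cosmetic differences only
    "beginning of your sixth enrolled quarter",   # wrong number
])
print(rendered)
```

One caveat of this sketch: an empty quote is trivially a substring of anything, so a real verifier would probably also want to enforce a minimum quote length.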
Finally, when we render the bot's answer we mark each quote with a ✓ if found in the handbook and an ✗ if not. Let's try it on a question where the model is more tempted to make things up:
chatphd> Hi! I'm ChatPHD. Ask me anything about the UCSD CSE PhD program.
you> What's the exam I need to do in my second year?
chatphd> In your second year, you are required to complete the Research Exam, which is
expected to be done by the beginning of your seventh enrolled quarter. This typically
means that you should finish the exam by the end of the spring quarter of your second
year [1]. The exam is an important opportunity for you to receive feedback from faculty
other than your advisor about your work and communication skills. It includes the oral
presentation of a written document, which can either be a draft research paper or a
"creative survey" prepared by you on a defined research area [2].
[1] ✓ "The department expects students to complete the research exam by the beginning of
your seventh enrolled quarter in the Ph.D. program (not counting summer quarters).
Generally, this means you should finish it by the end of the spring quarter of your
second year."
[2] ✗ "The research exam includes the oral presentation of a written document. The
document can be either 1. A published or nearly publication-ready draft research paper...
or 2. A 'creative survey' prepared by the student for the purpose of the exam."
(not found in handbook)
…but wait. Look more carefully at the flagged quote [2]. If you check the handbook, the content the model is claiming is actually right: the research exam really does involve a written document that can be either a draft paper or a creative survey. What the model did was summarize a longer passage, using ... to elide the parts it considered irrelevant. Our verifier doesn't understand ..., so it (correctly!) reports that the exact string doesn't appear verbatim — but the underlying claim is sound.
So this is a false positive: a real limitation of the simple substring check.
Q: How could we fix the verifier (or the quote format) to accept elided quotes like this, without giving up the guarantee that every claim is sourceable?
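One possible answer, as a sketch: split the quote at each ellipsis and require every fragment to appear verbatim, in order, with a cap on how much source text each gap may skip. Everything here is an assumption made for illustration: the `NORM_HANDBOOK` text is a made-up stand-in (including the words hidden behind the ellipsis), `MAX_GAP` is an arbitrary choice, and the quote is assumed to have already been normalized.

```python
import re

# Made-up, already-normalized stand-in for the handbook passage.
NORM_HANDBOOK = (
    "the research exam includes the oral presentation of a written document. "
    "the document can be either 1. a published or nearly publication-ready draft "
    "research paper that the student has written, or 2. a 'creative survey' "
    "prepared by the student for the purpose of the exam."
)

MAX_GAP = 200  # assumed cap on the number of characters one ellipsis may skip


def _match_from(fragments: list[str], start: int, limit: int) -> bool:
    """Can the fragments be found in order, the first one starting in [start, limit]?"""
    if not fragments:
        return True
    head, rest = fragments[0], fragments[1:]
    pos = NORM_HANDBOOK.find(head, start)
    while pos != -1 and pos <= limit:
        end = pos + len(head)
        # The next fragment must start within MAX_GAP characters of this one's end.
        if _match_from(rest, end, end + MAX_GAP):
            return True
        pos = NORM_HANDBOOK.find(head, pos + 1)  # backtrack: try a later occurrence
    return False


def find_elided_quote(quote: str) -> bool:
    """Accept a quote with '...' gaps if every fragment is verbatim and in order."""
    fragments = [f.strip() for f in re.split(r"\.\.\.|\u2026", quote)]
    fragments = [f for f in fragments if f]
    return bool(fragments) and _match_from(fragments, 0, len(NORM_HANDBOOK))


good = ("the document can be either 1. a published or nearly publication-ready "
        "draft research paper... or 2. a 'creative survey' prepared by the student")
print(find_elided_quote(good))
print(find_elided_quote("the research exam... is optional"))
```

The guarantee gets slightly weaker: a ✓ now means "every fragment is a verbatim span, in order, within a bounded window" rather than "the whole quote is one verbatim span". The gap cap matters, because without it a model could stitch together unrelated words from opposite ends of the handbook.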
Still, the general lesson holds. We've moved from "the model said something, I hope it's right" to "if a quote has a checkmark and what the model said agrees with the quote, then I know it's right". This is the recurring theme of the last three units in a new costume:
Real guarantees come from verifying the model's output, not from asking it nicely.
Outline
- A chatbot that couldn't: why we need grounding [done]
- RAG via semantic search: retrieving the right piece of the source of truth [done]
- Adding symbolic reasoning: when retrieval alone isn't enough
Adding Symbolic Reasoning
The quote-checking strategy from the previous section works for lookup questions — facts that live verbatim somewhere in the source. But the handbook also encodes rules, and many of the questions students actually ask are about whether some specific situation satisfies the rules. These need reasoning, not just retrieval, and LLMs are surprisingly unreliable at it even when the rules are right in front of them.
For example, the handbook spells out the requirements for a valid thesis committee:
- Your advisor(s) as the committee (co-)chair(s).
- Minimum of 4 members with UC San Diego faculty appointments
- At least 1 member must have a primary appointment in a different department than your advisor's primary department
- At least 2 members must be from the student's home department or program
- At least 1 member must be tenured or emeritus
Let's see what happens when a student asks ChatPHD to check a proposed committee against these rules:
chatphd> Hi! I'm ChatPHD. Ask me anything about the UCSD CSE PhD program.
you> I want to have the following people on my committee: Zhijian Liu (my advisor),
assistant professor at HDSI; Lilly Weng, assistant professor at HDSI;
Mikhail Belkin, full professor at HDSI;
Philip Guo, full professor in CogSci.
My advisor agrees and they are all willing to serve.
Just wanted to check if this is allowed by UCSD rules.
chatphd> Your proposed committee composition is compliant with UCSD regulations.
According to the handbook, the committee must include your advisor(s) as the committee
(co-)chair(s), a minimum of 4 members with UC San Diego faculty appointments,
at least 1 member with a primary appointment in a different department than your
advisor's department, and at least 2 members from your home department.
Additionally, at least 1 member must be tenured or emeritus.
Given that you have your advisor, Zhijian Liu, and the other members are from HDSI
and CogSci, your committee meets these requirements.[1]
[1] ✓ "The committee should include: * Your advisor(s) as the committee (co-)chair(s).
* Minimum of 4 members with UC San Diego faculty appointments
* At least 1 member must have a primary appointment in a different department than your advisor's primary department
* At least 2 members must be from the student's home department or program
* At least 1 member must be tenured or emeritus."
This is wrong. The proposed committee has zero members from CSE (the student's home department), so it violates the fourth rule. The model retrieved the right rules — you can see them right there in the supporting quote — but failed to apply them. Reasoning, not retrieval, is what broke down.
This is a known weakness of LLMs: even with the rules sitting in front of them, they are unreliable at conjunctively checking several conditions over a small set of typed records. Counting and set intersection are exactly the kinds of operations that programs are great at and language models are notoriously bad at.
Q: Assuming we really care about the model being able to answer questions about committee composition, what could we do to get more guarantees that the answer is right?
The general fix is to give the model access to symbolic reasoning: a separate, deterministic component that performs the rule-checking, which the model is required to consult instead of doing the check itself.
How to set this up depends on the problem:
- The fancy way: constraint solvers. Encode the rules in a declarative language like Datalog or SMT (e.g. Z3), and have the LLM translate the user's situation into facts for the solver to check. This pays off most when you also want the solver to find an answer ("give me a valid committee from these 20 faculty"), which a check-only Python predicate can't do. There is also a striking result in the SatLM paper: LLMs are dramatically more reliable at translating a constraint-satisfaction problem into a declarative SMT formula than at translating it into imperative Python — let alone at solving it themselves. For serious reasoning workloads, this is the route worth investing in.
- The simple way: a hard-coded Python check. Write the constraints as a Python function with a typed input. The LLM's job shrinks to extracting the facts (the committee members and their properties) from the conversation and calling the function.
For this lecture we'll go with the simple route: our committee rules are just five conjuncts, so a Python predicate is more than enough.
First, we model a committee member as a typed record using Pydantic, so that the LLM has to extract exactly the fields the check needs (and nothing else):
from pydantic import BaseModel

HOME_DEPT = "CSE"

class CommitteeMember(BaseModel):
    name: str
    is_advisor: bool                    # True for the student's advisor(s)
    has_ucsd_faculty_appointment: bool
    primary_department: str | None      # e.g. "CSE", "ECE"; None if not UCSD faculty
    is_tenured_or_emeritus: bool
Then the check itself is just the five rules, one per constraint:
@function_tool
def check_committee(members: list[CommitteeMember]) -> dict:
    """Check whether a proposed thesis committee satisfies CSE department rules."""
    violations: list[str] = []
    satisfied: list[str] = []
    advisors = [m for m in members if m.is_advisor]
    ucsd_faculty = [m for m in members if m.has_ucsd_faculty_appointment]
    advisor_depts = {a.primary_department for a in advisors}

    # Rule 1: advisor(s) as (co-)chair(s)
    if not advisors:
        violations.append("Must have at least one advisor as committee (co-)chair.")
    else:
        satisfied.append(f"Advisor(s) serving as chair: {[a.name for a in advisors]}.")

    # Rule 2: >= 4 UCSD faculty members
    if len(ucsd_faculty) < 4:
        violations.append(f"Must have >= 4 UCSD faculty; found {len(ucsd_faculty)}.")
    ...

    # Rule 4: >= 2 members from the home department
    home = [m for m in ucsd_faculty if m.primary_department == HOME_DEPT]
    if len(home) < 2:
        violations.append(f"Must have >= 2 members from {HOME_DEPT}; found {len(home)}.")
    ...

    return {"ok": not violations, "satisfied": satisfied, "violations": violations}
(Rules 3 and 5 are elided for brevity; see the demo repo for the full version.)
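If you'd like to experiment without the demo repo, here is a self-contained sketch of all five rules. It is not the repo's version: it uses a plain dataclass instead of Pydantic and skips the tool decorator. `Member` and `check_committee_sketch` are hypothetical names, and rules 3 and 5 are my own reading of the handbook list above.

```python
from dataclasses import dataclass
from typing import Optional

HOME_DEPT = "CSE"


@dataclass
class Member:  # plain-dataclass stand-in for the Pydantic CommitteeMember
    name: str
    is_advisor: bool
    has_ucsd_faculty_appointment: bool
    primary_department: Optional[str]
    is_tenured_or_emeritus: bool


def check_committee_sketch(members: list[Member]) -> dict:
    """Evaluate all five rules; report the labels of the ones that fail."""
    advisors = [m for m in members if m.is_advisor]
    faculty = [m for m in members if m.has_ucsd_faculty_appointment]
    advisor_depts = {a.primary_department for a in advisors}
    rules = {
        "1: advisor as (co-)chair": bool(advisors),
        "2: >= 4 UCSD faculty": len(faculty) >= 4,
        # Assumption: "different department" = not the primary department of any advisor.
        "3: >= 1 outside member": any(
            m.primary_department not in advisor_depts for m in faculty),
        "4: >= 2 from home department": sum(
            m.primary_department == HOME_DEPT for m in faculty) >= 2,
        # Assumption: any listed member may satisfy the tenure rule.
        "5: >= 1 tenured or emeritus": any(m.is_tenured_or_emeritus for m in members),
    }
    violations = [label for label, ok in rules.items() if not ok]
    return {"ok": not violations, "violations": violations}


# The committee from the transcript above.
proposed = [
    Member("Zhijian Liu", True, True, "HDSI", False),
    Member("Lilly Weng", False, True, "HDSI", False),
    Member("Mikhail Belkin", False, True, "HDSI", True),
    Member("Philip Guo", False, True, "CogSci", True),
]
result = check_committee_sketch(proposed)
print(result)
```

On the proposed committee only the home-department rule fails, which is the violation the vanilla model missed. Keeping the rules in one dictionary of label-to-boolean pairs also makes it harder to forget a rule when reporting results.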
Finally, we extend the system prompt to tell the LLM about the new tool — and crucially, to use the tool rather than reason about committee rules itself:
INSTRUCTIONS = (
...
"When the user asks whether a proposed thesis committee is valid, use the "
"check_committee tool to verify it against the department rules — do not "
"reason about the rules yourself. The student's home department is CSE. "
"Ask the user for any missing facts about each member (UCSD faculty? "
"primary department? tenured/emeritus? advisor?) before calling the tool."
)
To get us closer to an actual guarantee, we can also log the tool calls and their outputs, so that the user can see exactly how the facts were interpreted and what the tool returned.
Let's try it:
chatphd> Hi! I'm ChatPHD. Ask me anything about the UCSD CSE PhD program.
you> I want to have the following people on my committee: Zhijian Liu (my advisor), assistant professor at HDSI; Lilly Weng, assistant professor at HDSI; Mikhail Belkin, full professor at HDSI; Philip Guo, full professor in CogSci. My advisor agrees and they are all willing to serve. Just wanted to check if this is allowed by UCSD rules.
chatphd> [check_committee] members:
- {name: "Zhijian Liu", is_advisor: True, has_ucsd_faculty_appointment: True, primary_department: "HDSI", is_tenured_or_emeritus: False}
- {name: "Lilly Weng", is_advisor: False, has_ucsd_faculty_appointment: True, primary_department: "HDSI", is_tenured_or_emeritus: False}
- {name: "Mikhail Belkin", is_advisor: False, has_ucsd_faculty_appointment: True, primary_department: "HDSI", is_tenured_or_emeritus: True}
- {name: "Philip Guo", is_advisor: False, has_ucsd_faculty_appointment: True, primary_department: "CogSci", is_tenured_or_emeritus: True}
[check_committee -> ok=False]
✓ Advisor(s) serving as (co-)chair: Zhijian Liu.
✓ Has 4 UCSD faculty members (>= 4 required).
✓ Outside-department member(s): Philip Guo.
✓ Tenured/emeritus member(s): Mikhail Belkin, Philip Guo.
✗ Must have at least 2 UCSD faculty members from the home department (CSE); found 0.
chatphd> Your proposed thesis committee does not satisfy the requirements set by UCSD. Specifically, the rules indicate that at least two members must be from the student's home department (CSE), but your committee only includes members from HDSI and CogSci. This results in a violation of the requirement that states: "At least 2 members must be from the student's home department or program" [1].
You do have a valid composition otherwise:
- Your advisor, Zhijian Liu, meets the requirement of being one of the committee chairs.
- You have four UCSD faculty members in total, which meets the requirement for a minimum of four members.
- You also have two members that are tenured or emeritus: Mikhail Belkin and Philip Guo.
[1] ✓ At least 2 members must be from the student's home department or program.
The verdict is now correct, and we can see exactly why: the tool reports which constraints are satisfied and which are violated, and the model is required to relay that back to the user. Notice that the LLM still does real work — it parses "assistant professor at HDSI" out of the user's free-text description into the typed CommitteeMember records, and it phrases the final answer in natural language. But the check is no longer the LLM's job, and that's what gives us the guarantee. The extracted facts are also visible right there in the tool-call log, so a careful user can inspect them and catch the LLM if it misread the question (say, marked someone as the advisor who wasn't, or got a department wrong) — the one part of the pipeline where hallucination is still possible.
We pay for this guarantee in flexibility. The chatbot author has to anticipate which kinds of reasoning are worth supporting and write a tool for each — every new question shape is a new engineering task. For a small set of high-stakes questions, that's a price worth paying; for an open-ended assistant, you'd want a more general mechanism (this is where the constraint-solver route from above starts to earn its keep). As always, there's no free lunch — but there is a dial, and you get to choose where on it your application wants to sit.