Unit 4: Getting Confidence in (Agentic) Code

As programmers and software engineers, we talk a lot about code being “correct” or “right” or “working”. We ship code, in products or programming assignments, when we feel it's “done” (or when we've reached a deadline), and it invariably has “bugs” – that is, it is not correct in the strict sense.

At the same time, we have a lot of confidence that our code does what we intended, for the most part. After all, we wrote it, and we thought carefully about what it was supposed to do as we developed each helper function, algorithmic loop, and API call.

As systems get larger, it is harder and harder to have this confidence. Web browsers, operating systems, IDEs, and more are vast, complex codebases with millions of lines of code, decades of history, and many authors. Having confidence in these comes from a few main sources:

  • Trust, in the social sense. A trusted author wrote it, a trusted person did a code review, and so on, and we trust the community's process and collective judgment. Another kind of trust comes from longevity: gcc has earned a certain level of trust because it doesn't get overhauled every release and has been a known quantity in production for decades.
  • Verification, in the software sense. A program may be very complex, but we observe that a simpler artifact, like test cases or a type-based specification, passes when run against the implementation. We can justifiably have confidence in the code up to what we see from the verifier and our read of the spec.

In reality, most large systems do have critical bugs. It's a bummer! However, the software engineering and verification community has made a lot of progress over the past few decades in bringing tool-based confidence to systems. Type systems like Rust's aim to eliminate entire classes of correctness issues related to memory management. Techniques like property-based testing and fuzzing have found many critical bugs before (and after) release. We're striving towards robustness and confidence in our code.

With agents capable of writing orders of magnitude more code than humans in the same amount of time, the calculus of trust and confidence is undergoing a significant shift. In Unit 2 we talked about slowing Claude down to a pace we could review. In part, we were forcing ourselves back into the social, code-review kind of trust. I (Joe) had a lot of domain knowledge about web programming that I could bring to bear on the actual code that was being generated, and acted as a trusted reviewer.

But what about important systems where confidence is critical yet no one person can reasonably review all the code? Clearly agents are capable of building large systems. But can they build large systems while giving us confidence in them? People are surely trying: the Claude C Compiler got widespread attention for trying to reproduce a gcc-like system in an automated way; a rewrite of a JavaScript engine from Zig to Rust was a conversation-starter just this week; Claude documents a “Ralph Wiggum loop” where agents work for a long time, iterating on their own artifacts, until “done”.

We don't have the traditional signals of social trust for this code! In 2020, the existence of a million-line codebase implied something about human effort and attention that translated into some “banked” trust. That is not true of million-line codebases generated in days by agents. In those cases, confidence in the built system rests on the quality of the verifier. That is: what properties need to be true of the system for us to have confidence in it, and what verifiers can we write to check that those properties hold? For example, if the Claude C compiler passes gcc's test suite, we can have some confidence in it.

Q: What might give us more confidence?

In this unit, we are going to explore the interplay between agentic code generation and verifiers. Two common “verifier”/“property” pairs:

  • Unit tests verify the property of input-output correctness (on a finite set of examples, for a single run)
  • Type systems (that are sound) verify the property that a variable will always hold a value of a particular type

The actual space of verifiers is vast. There are static verifiers like type systems, symbolic execution, or static analysis; there are dynamic verifiers like predicate checks, Valgrind, ASan, or humble assert statements. And there are test-based and input-generation harnesses that drive these, like fuzzers, oracles, or handwritten inputs.

Q: Define each of the terms in the preceding paragraph, using a competent model or a web search for the ones you can't identify from your own learning.

Problem Selection

To try this, I wanted a problem that would:

  • Be difficult enough that frontier models would be unlikely to satisfyingly implement it in one shot
  • Have many possible “correct” answers
  • Showcase a variety of verification methodology
  • Have understandable behavior, but be genuinely out of scope for me to code review: the key property I want is that looking at the code wouldn't necessarily help me much, but I still know what success means.

The following problem is what we'll explore:

Make a standard textbook allocator (e.g. something like malloc) support concurrency.

  • I tried a few ways, and agents made meaningful errors
  • There are different existing solutions for parallel allocators, which may or may not be in the model's training corpus, and may or may not line up exactly with how my starting-point malloc works
  • Improving performance can require quite complicated implementations
  • The C ecosystem has useful tools for doing this kind of verification
  • I know how to implement allocators, and I know what properties they should have. However, I have never designed or written a concurrent data structure by hand or taken a parallel programming course. Most of what I (now) know about them I learned while preparing this document.

Along the way, we will need to clearly specify allocator behavior, write tests (and have an agent help us generate them), write benchmarks and measure performance, and learn about new tools for more thorough vetting of generated code.

What an Allocator Does

malloc (and the corresponding free) are usually taught as the default way to allocate on the heap in C and systems programs:

  • void* malloc(size_t size) takes a number of bytes and returns either:
    • A pointer to freshly-allocated memory of (at least) size bytes, or
    • NULL if not enough space was available or another error occurred
  • void free(void* ptr) takes a pointer that must have been returned by malloc and not yet freed and does some work to make the space available on the heap for future calls to malloc. After a call to free, it is assumed that the freed pointer is no longer used.
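That contract in action, in a short self-contained use of malloc and free (the function name here is just illustrative):

```c
#include <stdlib.h>

// Allocate space for ten ints on the heap, use it, and give it back.
int sum_of_squares(void) {
    int *xs = malloc(10 * sizeof *xs);
    if (xs == NULL)
        return -1;            // malloc signals failure by returning NULL
    int total = 0;
    for (int i = 0; i < 10; i++) {
        xs[i] = i * i;
        total += xs[i];
    }
    free(xs);                 // xs must not be dereferenced after this
    return total;
}
```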

In practice, there are many implementations of malloc – the default on macOS is different from the default on Linux, and so on. There are also many different allocators in general: CPython has a particular pymalloc allocator that has a specific way of laying out memory for Python objects, for example (most programming languages do). These allocators typically:

  • describe a heap layout and some rules for how data is stored
  • work in terms of low-level representations like pointers and bit-by-bit tag values
  • support allocation and freeing of pointers
  • may support other operations as well, like reporting free space available, growing or shrinking the heap, and so on

Our Allocator for This Study

We are going to take as our starting point a particular allocator shape that comes from a course here at UCSD. We'll name the functions the way they do to make sure we think of them differently than the built-in malloc and free that come with our C distribution:

vmalloc

void* vmalloc(size_t size);

The size_t size parameter specifies the number of bytes the caller wants. This has the same meaning as the usual malloc we have been using: the returned pointer is to the start of size bytes of memory that is now exclusively available to the caller.

vmfree

void vmfree(void* ptr);

The vmfree function expects a 16-byte aligned address obtained by a previous call to vmalloc. For allocated blocks, vmfree updates the block metadata to indicate that it is free, and coalesces with adjacent (previous and next) blocks if they are also free.

Layout

This implementation comes with a specific heap layout, specifying how the metadata should be stored on the heap.

Initially, the heap is one large "free block" (here I show it as just 128 bytes long; real heaps are of course much larger).

free 128

On a first call to void* ptr = vmalloc(24), the resulting heap would look like this:

alloc 32 * free 96

The marked location would be returned and stored in ptr. After another two allocations vmalloc(8) and vmalloc(16), we'd have:

alloc 32 alloc 16 * alloc 32 * free 48

Then, if we free the first block with vmfree(ptr) (the one we stored above), we'd have:

free 32 alloc 16 alloc 32 free 48

Another allocation could now use that first block again. Another call to free, on the second block (the 16-sized one), would cause the first two blocks to merge:

free 48 alloc 32 free 48

A few things to say about these diagrams:

  1. The regions marked alloc are the client's to use. This is where all the actual use of the allocated data happens in e.g. a C program that is allocating a 24-byte Node struct or similar: the client has the expectation of exclusive access to the payload area.

  2. The regions marked free are unallocated, and it's expected that clients of the library do not read or write those areas while they are free. The implementation of the allocator is allowed to use that space for anything. Indeed it does, for footers.

  3. Footers are added to free regions to help with coalescing. When we freed the 16-sized block, it was useful to know the size of the previous block for navigating the heap via pointer arithmetic!

  4. Headers are the front 8 bytes of each block. They store 3 things:

    • The size
    • Whether the current block is busy or free (least significant bit)
    • Whether the previous block is busy or free (second least significant bit)

    All of that information is used to manage the shape of the heap across allocations and frees.
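A sketch of how such a header could be decoded in C. The mask names and exact bit positions are my assumption, consistent with the layout just described (sizes are multiples of 16, so the low bits are free for flags):

```c
#include <stdint.h>

// One plausible encoding: bit 0 = busy, bit 1 = prev-busy, and the size
// (always a multiple of 16) in the remaining high bits. Names are mine.
#define VM_BUSY_MASK      0x1u
#define VM_PREV_BUSY_MASK 0x2u
#define VM_SIZE_MASK      (~0xFu)

static inline uint32_t blk_size(uint32_t size_status)      { return size_status & VM_SIZE_MASK; }
static inline int      blk_busy(uint32_t size_status)      { return (size_status & VM_BUSY_MASK) != 0; }
static inline int      blk_prev_busy(uint32_t size_status) { return (size_status & VM_PREV_BUSY_MASK) != 0; }
```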

With this information, there are straightforward algorithms for:

  • Starting at the beginning of the heap, loop until a free block of at least a particular size is found (e.g. find a place to put a new vmalloc)
  • Starting at an arbitrary block, check if the next and previous blocks are free (e.g. coalesce/merge two free blocks on vmfree)
  • Starting at a big enough free block, split it into an allocated block and a remaining free block

Q: Describe the algorithm for finding a free block of a particular size, starting with just a pointer to the beginning of the heap. What information is needed at each block? What C operation(s) would help you get that information? What parts of memory are read?

Answer:

Rough pseudocode is:

block_t *find_fit(size_t needed) {
  block_t *b = first_block(heap);            // skip the 8-byte leading pad
  while (block_size(b) != 0) {                // 0 marks the end-of-heap sentinel
    if (is_free(b) && block_size(b) >= needed) return b;
    b = (block_t *)((char *)b + block_size(b));
  }
  return NULL;                                // no fit
}

At each block we only read the header (size + busy bit), then advance by that size — a single uint32_t load and a pointer cast/addition per block. Free blocks' payloads are never touched.

Q: Describe the algorithm for vmfree, starting with just a pointer to the allocated block to be freed. What C operation(s) would help you get that information? What parts of memory are read and written?

Answer:

void vmfree(void *ptr) {
  block_t *b = block_from_payload(ptr);       // ptr - HEADER_SIZE
  if (prev_is_free(b))                         // reads b's prev-bit
    b = merge_with_prev(b);                    //   uses prev block's footer for its size
  block_t *next = next_block(b);
  if (block_size(next) != 0 && is_free(next))  // sentinel size = 0, so this guard stops at end
    merge_with_next(b, next);
  mark_free(b);                                // writes header, footer, and next block's prev-bit
}

Reads: b's header (own size + prev-bit), the previous block's footer (only if prev is free), the next block's header. Writes: b's header and footer, and the next block's prev-bit so it stays accurate.
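The third algorithm, splitting, doesn't get its own question above, so here is a hedged, self-contained sketch of it on a simulated heap. All names are hypothetical, and the header is simplified to a single uint32_t at the start of each block:

```c
#include <stdint.h>

// Simulated header: bit 0 = busy, bit 1 = prev-busy, high bits = block
// size (a multiple of 16). Not the course's actual code.
static uint32_t mk_hdr(uint32_t size, int busy, int prev_busy) {
    return size | (busy ? 1u : 0u) | (prev_busy ? 2u : 0u);
}

static uint32_t hdr_size(uint32_t h) { return h & ~0xFu; }

// Split the free block at byte offset `off` into an allocated block of
// `need` bytes plus a remaining free block, if the remainder is big enough.
static void split_block(unsigned char *heap, uint32_t off, uint32_t need) {
    uint32_t *h = (uint32_t *)(heap + off);
    uint32_t total = hdr_size(*h);
    if (total - need >= 16) {                 // room for a free remainder
        uint32_t *rest = (uint32_t *)(heap + off + need);
        *rest = mk_hdr(total - need, 0, 1);   // remainder: free, prev busy
        *h = mk_hdr(need, 1, *h & 2u);        // keep the old prev-busy bit
    } else {
        *h |= 1u;                             // too small to split: take it all
    }
}

// Demo: splitting a 128-byte free heap for a 32-byte block should leave
// a 96-byte free remainder, as in the diagrams above.
static uint32_t demo_split(void) {
    static _Alignas(16) unsigned char heap[128];
    *(uint32_t *)heap = mk_hdr(128, 0, 1);    // one big free block
    split_block(heap, 0, 32);
    return hdr_size(*(uint32_t *)(heap + 32));
}
```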

Details and Heap Properties

I omitted two small details in the last section.

  1. The heap has an 8-byte padding at the beginning, to make sure all user-facing pointers are 16-byte aligned (which has performance implications)
  2. There is always an 8-byte all-0 word at the end of the heap as a way to tell when the heap ends.

So the actual final picture would look like:

free 48 alloc 32 free 32

Usually this won't matter for our examples, but it is technically required of the heap shape!

The implementation specifies some other requirements in addition to the two above:

  • There must never be two adjacent free blocks: they should always be eagerly coalesced
  • All blocks must accurately report their size and busy (allocated) information for this block and the previous block
  • All free blocks must accurately report their size in a footer
  • All user-facing pointers (payload starts) must be at 16-byte addresses
  • Block sizes are given 4 bytes (uint32_t), which implies a several-gigabyte maximum block size (probably bigger than ever needed, but a true limit)

Practice

Q: What will the heap look like after the following sequence (assume a 256-byte heap)

a = vmalloc(24)
b = vmalloc(56)
vmfree(a)
c = vmalloc(42)
vmfree(b)

Answer:

free 96 alloc 64 free 80

Walkthrough:

  • a = vmalloc(24) carves out a 32-byte block (header 8 + payload rounded up to a 16-multiple), leaving 208 bytes free.
  • b = vmalloc(56) carves out a 64-byte block, leaving 144 bytes free.
  • vmfree(a) marks a's slot free. No coalesce: b is to the right and busy, and the heap-start padding is to the left.
  • c = vmalloc(42) rounds up to a 64-byte block. The 32-byte free block is too small, so we allocate out of the 144-byte free block, leaving 80 free.
  • vmfree(b) is the interesting one: the previous block (a's old slot, 32 bytes) is free, so the two coalesce into a 96-byte free block. The next block is c, still busy, so no forward coalesce.
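The rounding rule used throughout this walkthrough, written out as a helper (the function name is mine, not the course's):

```c
#include <stddef.h>

// A request of `size` bytes needs size + 8 bytes for its header, rounded
// up to the next multiple of 16, with 16 as the minimum block size.
static size_t request_to_block_size(size_t size) {
    size_t total = size + 8;                     // payload + 8-byte header
    size_t rounded = (total + 15) & ~(size_t)15;
    return rounded < 16 ? 16 : rounded;
}
```

This gives 24 → 32, 56 → 64, and 42 → 64, matching the block sizes in the walkthrough.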

Q: What will the heap look like after the following sequence (assume a 128-byte heap)? What will be stored in a?

a = vmalloc(130)

Answer: The heap should be unchanged and NULL returned. A 130-byte request needs a 144-byte block (130 plus the 8-byte header, rounded up to a multiple of 16), which cannot fit in the 112 usable bytes of a 128-byte heap once the 8-byte starting pad and 8-byte end sentinel are accounted for.

Why Concurrency is Hard

The vmalloc and vmfree allocator assumes that the allocator and the client are running on a single thread. Things can go very wrong if two threads race on the individual lines of its implementation. For example, if we consider find_fit and vmfree from above:

block_t *find_fit(size_t needed) {
  block_t *b = first_block(heap);
  while (block_size(b) != 0) {
    if (is_free(b) && block_size(b) >= needed) return b; // RETURNED
    b = (block_t *)((char *)b + block_size(b));
  }
  return NULL;
}

void vmfree(void *ptr) {
  block_t *b = block_from_payload(ptr);
  if (prev_is_free(b)) b = merge_with_prev(b);    // PREVMERGE
  block_t *next = next_block(b);
  if (block_size(next) != 0 && is_free(next))
    merge_with_next(b, next);
  mark_free(b);
}

Consider:

Thread A is running vmfree on the allocated block in the middle of two free blocks:

free 32 alloc 32 * free 32

The marked block is the one Thread A is about to free.

Thread B is about to allocate with vmalloc, which will use find_fit to select which free block to use for the new allocation.

Consider the following sequence of events:

  1. The merge with the previous block (marked // PREVMERGE in the code above) happens
  2. find_fit sees the new block created by merging the first two, and an allocation happens there, using the space of the newly-freed region (marked // RETURNED)
  3. merge_with_next happens, and incorrectly tries to merge the newly-allocated block with the next block, and/or mark_free goes on to mark the newly-allocated block free

State 1 — Thread A runs merge_with_prev, then pauses before merge_with_next. The first two blocks have folded into a 64-byte free region; the trailing free 32 is untouched:

free 64 free 32

This is a fleeting invariant violation: two adjacent free blocks. Thread A's local b now points to the start of this merged region, which is also the start of the heap.

State 2 — Thread B runs find_fit(56). It sees the free 64 first and carves a 64-byte allocated block out of it:

alloc 64 * free 32

The marker is Thread B's freshly-returned pointer. Thread B is now free to write into its payload (bytes 8–64 of the heap).

State 3 — Thread A resumes. Its local b still points to byte 0 of the heap. It reads block_size(b) — which is now 64, because Thread B just wrote an alloc-64 header there — and lands next_block(b) at the start of the trailing free 32. is_free(next) is true, so it merges, writing a new free-96 header back at b's address (clobbering Thread B's header), and mark_free(b) confirms it. The allocator's metadata now claims:

free 96

…but Thread B is still actively writing to bytes 8–64 of that "free" region, believing it owns them. A future vmalloc will happily hand the same bytes to another caller. Silent double-allocation, and the heap structure passes a casual inspection — there's exactly one block, sized correctly, header and footer agree.

To make the heap a fundamentally shared resource across threads, we would need to take much more care!

Now, I mentioned that I am not a concurrent/parallel programmer. I've never taken a class in it, I nod along in research talks about it, etc. I do know some basics:

  • There are locks, a cross-thread resource: a thread trying to "acquire" a lock held by another thread pauses until that thread "releases" it.
  • There are atomic operations which are guaranteed to complete without interruption by another thread. Examples are compare-and-swap ("if this memory location still holds value X, replace it with Y, otherwise tell me what's there now"), fetch-and-add (read a counter and increment it in one step), and atomic load/store of a single machine word (so no thread can ever observe a half-written value).
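Those two primitives look like this in C11's <stdatomic.h> (an illustrative sketch, not allocator code; all names here are made up):

```c
#include <stdatomic.h>

static _Atomic unsigned long counter;

// fetch-and-add: read the counter and increment it in one atomic step;
// returns the value the counter held before the add.
unsigned long bump(void) {
    return atomic_fetch_add(&counter, 1);
}

// compare-and-swap: succeeds only if *slot still holds `expected`; on
// failure, `expected` is overwritten with the value actually observed.
int try_claim(_Atomic int *slot, int expected, int desired) {
    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```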

So I'm imagining setups where one thread may need to "acquire" a lock in order to finish an allocation and then "release" it when done, so that other threads could process a free. Or maybe there are super clever ways to use compare-and-swap to only allocate correctly-freed things in some way. I'm not sure – I can only imagine vague shapes of solutions, not tell you what to try.
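The crudest version of the lock idea is one global lock around every allocator operation. A runnable sketch, with libc's malloc/free standing in for vmalloc/vmfree to keep it self-contained (the `_locked` wrapper names are hypothetical):

```c
#include <pthread.h>
#include <stdlib.h>

// One global lock serializing every allocator operation.
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *vmalloc_locked(size_t size) {
    pthread_mutex_lock(&heap_lock);
    void *p = malloc(size);            // would be vmalloc(size)
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void vmfree_locked(void *ptr) {
    pthread_mutex_lock(&heap_lock);
    free(ptr);                         // would be vmfree(ptr)
    pthread_mutex_unlock(&heap_lock);
}
```

This is safe but gives up exactly the parallelism we're after: every thread queues behind heap_lock, so it's best thought of as the baseline any cleverer design has to beat.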

Why Concurrent Allocators are Interesting and Relevant

I picked this example in part because CPython is dealing with this challenge right now. Python has become a primary language for implementing and training ML models over the last decade. It is known that parallelism via threads can speed up these workflows dramatically.

However, CPython (the main implementation) also has a very inefficient concurrency story: it has a Global Interpreter Lock that locks and releases around many operations that work with heap-allocated data. This prevents meaningful parallelism on lots of potential workloads.

Huge amounts of industry and academic effort have gone into making this work, and new versions of Python shipping in the next few years will provide much more configuration around the GIL:

  • PEP 703 ("Making the Global Interpreter Lock Optional in CPython," Sam Gross): adds a build-time --disable-gil option so CPython can run without the GIL, leaning on per-object locking and biased reference counting to keep thread safety. Shipped as the experimental "free-threaded" build in Python 3.13.
  • PEP 779 ("Criteria for supported status for free-threaded Python," Wouters/Page/Gross): sets the concrete bar — ≤15% single-threaded slowdown, ≤20% memory overhead, API stability, and internal documentation — for promoting the free-threaded build from experimental to officially supported in Python 3.14.

This was and is a very difficult design challenge! A lot of the code underlying CPython and related libraries fundamentally relies on how CPython's object layout and allocation strategies work. Getting a path forward that is fully- or even mostly-backwards-compatible requires careful design, testing, and engineering.

From CSE29 Homework to Concurrent Allocator

So now we have the setup. I am armed with:

  • A fair amount of knowledge about vmalloc and allocators in general
  • A pretty specific and detailed specification for how the heap is supposed to work
  • Vague passing knowledge related to concurrency, and no real notion of how to write these operations in C

How much of a superpower does agentic programming give me? Can I use agentic programming to:

  • write a concurrent allocator
  • constrained by the specification of vmalloc
  • with meaningful performance improvements on concurrent workloads
  • that I trust?

Doing my Homework

It's useful to start with a baseline – the algorithms I described as “straightforward” above are actually someone's systems programming homework. Of course, Claude & friends are excellent at traditional systems programming homework problems.

To get a useful starting point, let's not write our own allocator, but let's allow Claude to one-shot the CSE29 project. We likely could succeed at getting C code with "Yo Claude, here's the assignment: <url>, gimme impl pls". However, we should be above such recreations at this point in the quarter. I'm going to be a little more explicit in what I provide and think ahead a little bit.

Decent First-round Testing, and a Heap Predicate

In particular, I know that writing tests is going to be useful going forward. So first I write some handwritten (well, Claude-written, Joe-reviewed) tests of the functionality. Since I have Claude to help, I am willing to put in the “effort” to do a mix of pointwise tests (e.g. vmfree(NULL) is a no-op) and data-driven/workload-driven tests in loops.

There is one function I'm going to include to make these tests a little more robust. I'll include a function vmcheck that looks at the heap and tests for all the properties above:

int vmcheck(void) {
    if (!heapstart) return 7;

    char *base = (char *)heapstart - 8;
    char *end_addr = base + heap_total_size;
    struct block_header *cur = (struct block_header *)heapstart;
    int prev_was_busy = 1;        // pretend pre-heap is busy
    int prev_was_free = 0;

    while (1) {
        if ((char *)cur >= end_addr) return 5;

        uint32_t ss = cur->size_status;
        uint32_t sz = blk_size(ss);
        int busy = blk_busy(ss);
        int prev_busy_bit = blk_prev_busy(ss);

        // End sentinel: size==0, busy==1.
        if (sz == 0) {
            if (!busy) return 7;
            break;
        }

        // Size sanity: multiple of ALIGN, at least MIN_BLOCK_SIZE,
        // doesn't extend past the heap.
        if (sz < MIN_BLOCK_SIZE || sz % ALIGN != 0) return 1;
        if ((char *)cur + sz > end_addr) return 5;

        // prev_busy bit must agree with what we walked.
        if (prev_busy_bit != prev_was_busy) return 4;

        // Free blocks: header and footer must agree, no two adjacent free.
        if (!busy) {
            struct block_header *footer =
                (struct block_header *)((char *)cur + sz - sizeof(struct block_header));
            if (footer->size_status != ss) return 2;
            if (prev_was_free) return 3;
        }

        prev_was_busy = busy;
        prev_was_free = !busy;
        cur = (struct block_header *)((char *)cur + sz);
    }
    return 0;
}

Each return N corresponds to one of the heap properties from earlier: code 1 (bad block size), 2 (header/footer mismatch), 3 (adjacent free blocks), 4 (prev_busy bit wrong), 5 (walk fell off the heap), 7 (no heap or sentinel broken). Code 0 means all properties hold.

Q: Come up with 3 different heap shapes that would each trigger a distinct error code from vmcheck.

Then we can use that along with our tests to get really broad assertions about the behavior:

static void check(const char *where) {
    int rc = vmcheck();
    if (rc != 0) {
        fprintf(stderr, "[seq] vmcheck failed at %s with code %d\n", where, rc);
        abort();
    }
}

int main(void) {
    vminit(HEAP_BYTES);
    check("init");

    // Basic alloc, write payload, free.
    void *a = vmalloc(64);
    assert(a != NULL);
    assert(((uintptr_t)a) % 16 == 0);
    memset(a, 0xAB, 64);
    check("after-alloc-1");
    vmfree(a);
    check("after-free-1");

    // Allocate many small blocks, free in reverse — exercises coalesce-with-next.
    enum { N = 32 };
    void *p[N];
    for (int i = 0; i < N; i++) {
        p[i] = vmalloc(48);
        check("after-alloc-many");
    }
    for (int i = N - 1; i >= 0; i--) {
        vmfree(p[i]);
        check("after-free-rev");
    }

    // ... interleaved free pattern (coalesce-with-prev), best-fit probe,
    //     vmfree(NULL) no-op, oversized alloc returns NULL, vmcheck between
    //     every step.

    vmdestroy();
    printf("[seq] OK\n");
}

The pattern is op → vmcheck → op → vmcheck: every state change gets immediately re-validated against the full property list, and the test aborts at the first deviation with the return code telling us which property failed.

Catching What vmcheck Misses: ASan and UBSan

vmcheck validates the heap's data structure, but there's a class of bugs it cannot see no matter how careful we are about the property list: undefined behavior inside the allocator itself, or memory misuse outside the heap region we manage.

Two sanitizers, built into clang (which I'm using) help with C-level issues:

AddressSanitizer (ASan) monitors memory accesses against the standard C model, and gives better error messages and captures some things that could be silent failures:

  • Stack-allocated data overflows inside the test driver or inside the agent's vmalloc/vmfree helpers — stack-allocated arrays are shadow-tracked at byte granularity.
  • Reads or writes off the entire heap region — a loop that runs past the end-of-heap sentinel into unmapped memory may segfault anyway, but ASan turns that into a clean diagnostic naming the line that did it.
  • Use-after-free at the libc layer — e.g., a write to the heap region after vmdestroy munmaps it, or any stray malloc/free calls the agent adds that get mismanaged.
  • Leaks of the heap region or other libc allocations at process exit.

ASan actually doesn't catch that much yet, because it doesn't know about our heap representation (that's what vmcheck is for). But, it is a standard tool that, if nothing else, gives dramatically better diagnostics/stack traces on segfaults and similar errors, which will help us and agents debug.

UndefinedBehaviorSanitizer (UBSan) is the other half, and it has better coverage of the kinds of bugs we actually expect inside the allocator. It instruments the C semantics directly: signed integer overflow, out-of-range shifts, misaligned pointer dereferences, division by zero. These are language-level UB and so are independent of our allocator design — UBSan finds them wherever they happen, including deep in the agent's size arithmetic for splitting blocks or its pointer-cast gymnastics for boundary-tag access.

They're easy to use: we simply compile with -fsanitize=address,undefined as an option to clang.

Meta-Prompting

I experimented with another technique throughout this exercise. I used an outer agent to structure my work and write tests, and had it dispatch inner agents with prompts that I co-authored with the outer agent in order to do implementation. My worry (which is somewhat present in the literature) was “reward hacking”, where the agent might (in auto mode) start editing tests or other verification steps in conjunction with coding and actually undermine the correctness guarantees I cared about.

   outer agent (orchestrator)
   ┌────────────────────────────────────┐
   │   1. assemble prompt               │
   │   2. dispatch sub-agent            │
   │   3. independently rerun verifiers │
   │   4. git commit if good            │
   └───────┬──────────────────┬─────────┘
           │ dispatch         ▲ report
           ▼                  │
   ┌──────────────────────────┴────────┐
   │          sub-agent                │
   │  • reads spec / scaffold          │
   │  • writes vmalloc.c, vmfree.c     │
   │  • iterates until                 │
   │    `make verify` exits 0          │
   └───────────────────────────────────┘

This also helped me keep a record of what I'd asked and easily backtrack to earlier steps: I could have the coding agent report back what happened, and decide what to do with a git commit or other archive of that step in my orchestration agent.

For the first round, here's the prompt I sent to the sub-agent:

Working directory: /tmp/malloc-pipeline/run-D93EFB13/r0
(use absolute paths in all commands; do not rely on `cd` persistence
across subprocesses).

Goal: Read `spec/index.md`, `spec/guide.md`, and `spec/starter-code.md`
(in `$DIR/spec/`). Implement `vmalloc.c` and `vmfree.c` against the
contract in `vmlib.h`. The implementation is sequential; no concurrency
considerations yet. Best-fit policy. 16-byte alignment. Coalesce on free.
`make verify` (which runs `test_seq` + `test_seq_asan`) must exit 0.

Files you may edit: `vmalloc.c`, `vmfree.c`, and the `Makefile` only if
strictly necessary.

Files you must NOT modify: `vminit.c`, `utils.c`, `vm.h`, `vmlib.h`,
`test_seq.c`
The public API contract is in `vmlib.h`; the internal block-layout
types are in `vm.h`. Honor both.

Active verifiers (all must pass for success):
- `make verify-r0` — runs `test_seq` and `test_seq_asan` (sequential
  correctness, including AddressSanitizer build)

Stop conditions:
- Success: `make verify-r0` exits 0.
- Failure (budget exhausted): 100 tool calls used OR 10 minutes
  wall-clock elapsed.
- Off-rubric finding: if you discover something the prompt doesn't
  anticipate, write a short `OFF-RUBRIC.md` in the working dir and stop.

Budget: 100 tool calls, 10 minutes wall-clock.

Report back (under 200 words):
1. The implementation approach (one or two sentences).
2. Whether `make verify-r0` exits 0 on your final state.
3. Anything tricky or surprising.
For reference, here's what the working directory contained:

  • spec/ was a copy of the assignment writeup
  • vmlib.h declared the public API (vmalloc, vmfree, vmcheck, vmlib_live_bytes) — the agent had to honor this contract verbatim
  • vm.h held the internal types: the block_header struct, the BUSY_MASK / PREV_BUSY_MASK bit layout, ALIGN and MIN_BLOCK_SIZE
  • vminit.c and utils.c handled heap setup (the “starter code” for the PA), vmcheck (which we wrote), and vmlib_live_bytes — these are explicitly off-limits to the agent
  • test_seq.c was the handwritten test driver shown above
  • vmalloc.c and vmfree.c were empty stubs — the agent's only job
  • Makefile already had verify-r0: test_seq test_seq_asan wired up

Who Needs an Autograder?

The agent actually did make a mistake in one run:

# Inner-agent self-report — run 061EC93E, R0 (sequential baseline)

Metadata: 20 tool calls, 145 s wall, exit code 0 on `make verify-r0`.

---

Both verifiers pass.

**Report:**

1. **Approach:** Implemented `vmalloc` as a single-pass best-fit scanner that
   walks the implicit block list from `heapstart` until the end sentinel
   (size==0), tracks the smallest free block with `size >= needed`, splits if
   remainder >= `MIN_BLOCK_SIZE` (16), and updates header bits + the next
   block's `PREV_BUSY` bit. `vmfree` flips the busy bit, then optionally
   coalesces forward (peek next header) and backward (read previous block's
   footer to find its header), writing a unified header + footer and clearing
   the following block's `PREV_BUSY` bit.

2. **`make verify-r0` exits 0:** yes — `test_seq` and `test_seq_asan` both
   report `[seq] OK`.

3. **Tricky bit:** `sizeof(struct block_header)` is **4** (one `uint32_t`),
   but the spec / `vminit.c` layout treats the header as logically **8 bytes**
   so payloads land on 16-byte boundaries. The first attempt returned
   `best + sizeof(struct block_header)` (offset 12 mod 16) and tripped the
   alignment assertion. Fix: hard-code `+ 8` for payload offset and
   `hdr = ptr - 8` in `vmfree`, while still using `sizeof(struct block_header)`
   for footer placement (matches `utils.c`/`vminit.c`).

Files written: `/tmp/malloc-pipeline-noprompt/run-061EC93E/r0/vmalloc.c`,
`/tmp/malloc-pipeline-noprompt/run-061EC93E/r0/vmfree.c`. Makefile was
not modified.

We can see what it's getting at if we look at a heap diagram and the vm.h spec:

0        8                        32       40
+--------+------------------------+--------+- - - - - - - - - - - -
| header | * payload (24 bytes)   | header |        (free)
+--------+------------------------+--------+- - - - - - - - - - - -
|<------- alloc 32 block -------->|<------- free 96 block - - - - -
// size_status encoding:
//   bit 0: BUSY (1) or FREE (0) for *this* block
//   bit 1: PREV_BUSY (1) or PREV_FREE (0)
//   bits 4..: size (multiple of 16; size includes the 8-byte header)

struct block_header {
    uint32_t size_status;
};

The asterisk marks the correct payload pointer the caller should get back. The block is 32 bytes; the header occupies the first 8 of them, so the payload — and the returned pointer — should sit at offset 8. The agent wrote code that was returning offset 4 instead, because it used sizeof(struct block_header) to calculate it.

The full test_seq.c in use for this run, for reference
// R0 sequential smoke test: drives vmalloc/vmfree through a fixed sequence
// of operations, asserts vmcheck() == 0 throughout, and exercises split +
// coalesce paths. Reports "[seq] OK" on success, aborts on failure.

#include "vmlib.h"

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HEAP_BYTES (1u << 16)  // 64 KiB

static void check(const char *where) {
    int rc = vmcheck();
    if (rc != 0) {
        fprintf(stderr, "[seq] vmcheck failed at %s with code %d\n", where, rc);
        abort();
    }
}

int main(void) {
    vminit(HEAP_BYTES);
    check("init");

    // Allocate, write payload, free — basic.
    void *a = vmalloc(64);
    assert(a != NULL);
    assert(((uintptr_t)a) % 16 == 0);
    memset(a, 0xAB, 64);
    check("after-alloc-1");
    vmfree(a);
    check("after-free-1");

    // Allocate many small blocks, free in reverse order to exercise coalescing.
    enum { N = 32 };
    void *p[N];
    for (int i = 0; i < N; i++) {
        p[i] = vmalloc(48);
        assert(p[i] != NULL);
        memset(p[i], (unsigned char)(i + 1), 48);
        check("after-alloc-many");
    }
    // Verify payloads weren't trampled.
    for (int i = 0; i < N; i++) {
        unsigned char *b = (unsigned char *)p[i];
        for (int j = 0; j < 48; j++) {
            assert(b[j] == (unsigned char)(i + 1));
        }
    }
    // Free reverse — exercises coalesce-with-next.
    for (int i = N - 1; i >= 0; i--) {
        vmfree(p[i]);
        check("after-free-rev");
    }

    // Allocate and free in interleaved pattern — exercises coalesce-with-prev.
    void *q[N];
    for (int i = 0; i < N; i++) q[i] = vmalloc(64);
    for (int i = 0; i < N; i += 2) vmfree(q[i]);
    check("after-half-free-even");
    for (int i = 1; i < N; i += 2) vmfree(q[i]);
    check("after-half-free-odd");

    // Best-fit policy check: with two free holes of different sizes available,
    // a request that fits both should pick the smaller.
    void *big = vmalloc(512);
    void *mid = vmalloc(64);
    void *sml = vmalloc(32);
    vmfree(big);   // hole_big
    vmfree(sml);   // hole_small
    void *target = vmalloc(16);  // should land in hole_small (smallest fit)
    assert(target != NULL);
    // Pointer of `target` should be closer to `sml` than to `big`. We can't
    // assert that exactly without knowing layout, but we can assert vmcheck.
    check("after-best-fit");
    vmfree(target);
    vmfree(mid);
    check("after-cleanup");

    // NULL free is a no-op.
    vmfree(NULL);
    check("after-null-free");

    // Allocation that doesn't fit returns NULL.
    void *toobig = vmalloc(HEAP_BYTES * 2);
    assert(toobig == NULL);
    check("after-toobig");

    vmdestroy();
    printf("[seq] OK\n");
    return 0;
}
The full vmcheck in use for this run, for reference
int vmcheck(void) {
    if (!heapstart) return 7;

    char *base = (char *)heapstart - 8;
    char *end_addr = base + heap_total_size;
    struct block_header *cur = (struct block_header *)heapstart;
    int prev_was_busy = 1;        // pretend pre-heap is busy
    int prev_was_free = 0;

    while (1) {
        if ((char *)cur >= end_addr) return 5;

        uint32_t ss = cur->size_status;
        uint32_t sz = blk_size(ss);
        int busy = blk_busy(ss);
        int prev_busy_bit = blk_prev_busy(ss);

        // End sentinel: size==0, busy==1.
        if (sz == 0) {
            if (!busy) return 7;
            break;
        }

        // Size sanity: must be a multiple of ALIGN, at least MIN_BLOCK_SIZE,
        // and not extend past the heap.
        if (sz < MIN_BLOCK_SIZE || sz % ALIGN != 0) return 1;
        if ((char *)cur + sz > end_addr) return 5;

        // prev_busy bit must agree with what we walked.
        if (prev_busy_bit != prev_was_busy) return 4;

        // Free blocks: header and footer must agree, no two adjacent free.
        if (!busy) {
            struct block_header *footer =
                (struct block_header *)((char *)cur + sz - sizeof(struct block_header));
            if (footer->size_status != ss) return 2;
            if (prev_was_free) return 3;
        }

        prev_was_busy = busy;
        prev_was_free = !busy;
        cur = (struct block_header *)((char *)cur + sz);
    }
    return 0;
}

Q: Which kind of test would catch this bug: a handwritten assertion, vmcheck's properties of the heap, or ASan/UBSan?

One thing you'll notice: I'm not looking much at the implementation the agent came up with. I want to think in terms of my tests and verifiers, because once the agent starts writing concurrent code I'm going to be out of luck with code review. If I'm going to have confidence, and understand the agent's mistakes, on my own terms, I have to commit to understanding my verifiers.

To summarize, our verifiers at this point are:

  • vmcheck — a predicate over the heap's data structure
  • test_seq — a handwritten sequential workload
  • AddressSanitizer (ASan) and UndefinedBehaviorSanitizer (UBSan) — the same test_seq workload compiled with -fsanitize=address,undefined. Catches C-level bugs if the agent ever makes them.

In each step going forward we'll be thinking about both verifier design and what we ask the agent to do, and about how the two interact.

Concurrency: First Steps

We're ready to add concurrency. Our philosophy tells us to think about what we are going to verify before we ask the agent to do this. Right now, all of our tests are single-threaded: none of them spawn a second thread to check what happens in the concurrent case. We need at least some tests that start new threads and allocate/free across them.

Basic Concurrency Tests

Note that all the work we did before is still important! vmcheck is going to matter in this world, because we want to know that even after a complex interleaving of threaded allocations and frees, the heap is in a well-formed state.

With Claude, I came up with the following plan. I think it's OK:

  • 8 worker threads, each performing 20,000 random vmalloc/vmfree operations
  • Each thread holds up to 16 live allocations at a time, with sizes drawn uniformly between 16 and 1024 bytes
  • When a thread allocates, it writes a thread-specific byte pattern into the payload
  • Before each vmfree, the thread verifies the pattern is still intact. If any byte has changed, somebody else wrote into this block — abort.
  • After all threads join, vmcheck runs on the final heap, catching any structural corruption.

The main thing this is designed to catch is cases where the same, or overlapping, addresses are returned from vmalloc to two different threads. In the race example we showed above, we'd be likely to catch the issue because a later allocation would re-use the same free block and write a different pattern into it.

The full test_conc.c if you're interested
// R1+ concurrent stress test: N worker threads share the singleton heap,
// each doing M random alloc/free ops with payload pattern verification to
// detect overlapping allocations.

#include "vmlib.h"

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HEAP_BYTES        (8u << 20)
#define N_THREADS         8
#define OPS_PER_THREAD    20000
#define MAX_LIVE_PER_THR  16
#define MIN_ALLOC         16
#define MAX_ALLOC         1024

typedef struct {
    int    tid;
    atomic_int *go;
} worker_arg_t;

typedef struct {
    void  *p;
    size_t size;
    uint64_t seed;
} held_t;

static void pattern_write(void *p, size_t size, uint64_t seed) {
    uint8_t *b = (uint8_t *)p;
    for (size_t i = 0; i < size; i++) {
        b[i] = (uint8_t)((seed ^ (i * 0x9Eu)) & 0xFF);
    }
}
static void pattern_verify(int tid, void *p, size_t size, uint64_t seed) {
    uint8_t *b = (uint8_t *)p;
    for (size_t i = 0; i < size; i++) {
        uint8_t exp = (uint8_t)((seed ^ (i * 0x9Eu)) & 0xFF);
        if (b[i] != exp) {
            fprintf(stderr,
                "[t%d] PATTERN STOMP at %p offset %zu (size=%zu): expected %02x got %02x\n",
                tid, p, i, size, exp, b[i]);
            abort();
        }
    }
}

static void *worker(void *arg) {
    worker_arg_t *w = (worker_arg_t *)arg;
    held_t held[MAX_LIVE_PER_THR];
    int n_held = 0;
    unsigned seed = (unsigned)(w->tid * 1234567u + 89u);

    while (!atomic_load(w->go)) sched_yield();

    for (int i = 0; i < OPS_PER_THREAD; i++) {
        int do_alloc = (n_held < MAX_LIVE_PER_THR) && ((rand_r(&seed) & 3) != 0);
        if (do_alloc) {
            size_t sz = MIN_ALLOC + (rand_r(&seed) % (MAX_ALLOC - MIN_ALLOC + 1));
            void *p = vmalloc(sz);
            if (p) {
                uint64_t pseed = ((uint64_t)w->tid << 32) | (uint64_t)rand_r(&seed);
                pattern_write(p, sz, pseed);
                held[n_held++] = (held_t){ .p = p, .size = sz, .seed = pseed };
            }
        } else if (n_held > 0) {
            int idx = rand_r(&seed) % n_held;
            held_t h = held[idx];
            held[idx] = held[--n_held];
            pattern_verify(w->tid, h.p, h.size, h.seed);
            vmfree(h.p);
        }
    }
    while (n_held > 0) {
        held_t h = held[--n_held];
        pattern_verify(w->tid, h.p, h.size, h.seed);
        vmfree(h.p);
    }
    return NULL;
}

int main(void) {
    vminit(HEAP_BYTES);

    atomic_int go;
    atomic_store(&go, 0);

    pthread_t   tids[N_THREADS];
    worker_arg_t args[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
        args[i] = (worker_arg_t){ .tid = i, .go = &go };
        pthread_create(&tids[i], NULL, worker, &args[i]);
    }
    atomic_store(&go, 1);
    for (int i = 0; i < N_THREADS; i++) pthread_join(tids[i], NULL);

    int rc = vmcheck();
    if (rc != 0) {
        fprintf(stderr, "[conc] vmcheck failed after stress with code %d\n", rc);
        abort();
    }
    vmdestroy();
    printf("[conc] OK\n");
    return 0;
}

Now, this seems pretty cool! And I don't know a lot about concurrency. But I do know that I am supposed to be very, very suspicious of confidence from running a few tests. One of the main reasons concurrency bugs are hard is that there is a combinatorial explosion of orderings that can happen at runtime. Despite being randomized and having a “big” number of inputs like 20000 * 8, this probably just scratches the surface of the real state space.

Concurrency Testing + Tooling

Of course, there are tools to help us here. TSan (ThreadSanitizer) is another in the ASan/UBSan family of tools. It is designed to detect data races: two threads accessing the same memory location, at least one of them writing, with no lock or other synchronizing primitive ordering the accesses.

Just like ASan and UBSan will flag bad behavior even if the tests happen to still pass, TSan will flag a data race even if the test succeeds. This is good – in the earlier race example the tests only catch it if they get the right interleaving after the problem has happened. If no thread actually manages to reallocate that region and write to it, the heap still looks perfectly valid to vmcheck. So this is TSan's added benefit – it will note that a data race occurred on the header in the race example, with no further observations needed.

It of course doesn't check everything: you can still make bad logical decisions in how you use locks and synchronization and have bugs. Also, like all dynamic tools, you need a test case that actually executes the conflicting accesses for TSan to report anything; it's not a type-checker and can't find the race at compile time.

We can “turn on” TSan just by using -fsanitize=thread with clang, so it's simple to add.

With this as our first try at a verifier, we ship this off.

Asking Nicely for a Concurrent Verifier

I did the test dev above in my orchestration/test agent. Next up is to follow our process of spinning up a sub-agent to do the coding work, constrained to focus on just the one task. Below is the focused part of the prompt for this work; the rest, about what not to edit, what to report back, and so on, is the same as above:

Goal: The allocator must now be safe to call from multiple threads
concurrently on the shared singleton heap. Edit `vmalloc.c` and
`vmfree.c` to add the necessary synchronization. `make verify` must
exit 0 (runs test_seq, test_seq_asan, test_conc).

Active verifiers (all must pass for success):
- `make verify-r0` — runs `test_seq` and `test_seq_asan` (sequential
  correctness, including AddressSanitizer + UBSan build).
- `make verify-r1` — runs `test_conc` (multi-thread stress test with
  pattern-stomp detection + final `vmcheck`).
- `make verify` runs both.

The agent worked for about 3 minutes and came up with a solution! Here was its report:

### Report

**Synchronization design:** A single global `pthread_mutex_t vm_heap_lock`
(statically initialized via `PTHREAD_MUTEX_INITIALIZER`) is defined in
`vmalloc.c` and `extern`-declared in `vmfree.c`. Both `vmalloc` and `vmfree`
take the lock once on entry (after argument null checks) and release it on
every return path, serializing all heap-walk/split/coalesce operations.

**`make verify` status:** Exits 0. Per-target exit codes observed:
`verify-r0=0`, `verify-r1=0`, `verify=0`. `test_seq`, `test_seq_asan`
(ASan+UBSan), and `test_conc` (16-thread stress) all print `OK`.

**Notes:** Nothing surprising. A coarse global lock is sufficient here because
(a) the allocator does a full best-fit heap walk and split/coalesce that
touches arbitrary headers, and (b) `vm.h` explicitly says the helpers are not
themselves locked and leaves synchronization strategy to the implementer.
Cheap correctness now; finer-grained locking would be a later-round
optimization. Lock is acquired after the cheap `size==0` / `heapstart==NULL`
early-outs and after `need` is computed, so contention is bounded to actual
heap work.

Relevant files:
- `/tmp/malloc-pipeline-noprompt/run-1B8CABDD/r1/vmalloc.c`
- `/tmp/malloc-pipeline-noprompt/run-1B8CABDD/r1/vmfree.c`

I'll hide future reports and just show you the headlines, but it's useful to see one in more detail. First of all, my vibe here is that Claude really likes to sound smart. Here are the bits I think are really relevant:

  • A coarse global lock is sufficient: This probably means that at every operation on the heap, it acquires a lock (which is shared globally across them all), and releases it when done. That means no actual concurrency; we've reinvented Python's Global Interpreter Lock, more or less.
  • Cheap correctness now; finer-grained locking would be a later-round optimization: This is interesting wording. Because of how I was working with the orchestration agent and having it help generate prompts, this notion of “rounds” (what the orchestrator called the sub-agents) leaked through. So the sub-agents would write reports that clearly showed rationale based on the knowledge of potential future rounds. In any event, this is a defensible engineering decision!

Indeed, if we look at the code it generated, the lock is global and the pattern is clear:

In vmalloc.c, the mutex is declared at file scope and statically initialized:

pthread_mutex_t vm_heap_lock = PTHREAD_MUTEX_INITIALIZER;

vmalloc itself takes the lock once it's done with the cheap early-outs and the size arithmetic, then releases it on every return path:

void *vmalloc(size_t size) {
    // ... check NULL conditions, compute header + payload, rounded to 16 ...

    pthread_mutex_lock(&vm_heap_lock);

    // ... best-fit walk over the heap, split if remainder is big enough ...
    // ... elided checks for failed allocation and returning NULL ...
    // ... commit header bits, payload offset, next-block prev_busy ...

    void *ret = (void *)((char *)best + HDR_BYTES);
    pthread_mutex_unlock(&vm_heap_lock);
    return ret;
}

vmfree.c just declares the same lock as extern and follows the same acquire-on-entry / release-on-every-return shape:

void vmfree(void *ptr) {
    // ... check NULL conditions ...

    pthread_mutex_lock(&vm_heap_lock);

    // ... coalesce with next/prev free neighbors, write merged header+footer,
    //     fix the following block's prev_busy bit ...

    pthread_mutex_unlock(&vm_heap_lock);
}

One mutex, taken by both entry points, released before every return. Pretty easy to argue for the lack of data races.

Of course, the whole reason for Python's current push towards more fine-grained concurrency is that a single global lock is slow, and this wouldn't get us much actual concurrency! So the next part of the story is where things get interesting: we're going to ask Claude to write fast concurrent code, try to verify it as best we can, and see how far we can push it.

Concurrency: Asking for Speed

Coming soon!