Wednesday, March 4, 2026

"Let's Work on the Next Task": Claude Code, GitHub, and the Most Diligent Project Manager I've Ever Had

In my previous post, I described how working with AI agents felt like managing an infinitely large, infinitely diligent team. I wrote about pairing Claude with GitHub, giving it context files and task lists, and watching it come back with actual deliverables.

After that post, I got a lot of questions from people asking how to actually set this up. Even from people I assumed were already using this kind of workflow. Turns out it was far less common knowledge than I thought. (I guess I am spending too much time reading social media.)

So this post is a step-by-step guide for those who still use AI tools in "chat" form and want to try a first "agentic AI" setup. The goal here is not to get the AI to be a software engineer, but to make the AI your project manager and your team of research assistants.

We will set up a GitHub repository, configure Claude Code on the Web, and build a workflow where AI plans (or does) the work and you do the reviewing.

One caveat: while you do not need to know how to code, familiarity with software development practices will help. Not the programming itself, but the process: how developers organize projects, track changes, review each other's work. This post will walk you through those practices.

First, though, let me explain why this setup is so powerful.


The real trick: The repo is the context

Here is the problem with using AI through a regular chat interface. Every time you start a new conversation, you are starting from zero. You paste in your document, re-explain what the project is about, remind the AI where you left off, describe what needs to happen next. It is like hiring a brilliant contractor who gets amnesia every morning.

GitHub solves this. When Claude Code connects to your repository, it does not just see your files. It sees everything: the project structure, the notes about what the project is, the task list, the record of what has already been done, the decisions you have made along the way... All of it, sitting right there in the repo, ready to be read.

This means your prompt for most interactions becomes absurdly simple:

"Let's work on the next most important task."

That is all. Claude reads your CLAUDE.md to understand the project. It reads your TASKS.md to figure out what needs doing. It looks at the existing files to understand the current state. And then it gets to work. No pasting. No re-explaining. No "as I mentioned in our previous conversation..." The repository is the conversation. It is the memory. It is the context.

Reading about CLAUDE.md and TASKS.md and worried that this is some black magic? Nah, these are just regular text files, written in plain English. We will describe them next.


Wait, what is Claude Code on the Web?

First, some context. Claude Code started as a command-line tool. You would install it on your computer, open a terminal, and type commands. Powerful, but intimidating if you are not a developer.

Then Anthropic launched Claude Code on the Web. Now you can do the same thing directly from your browser. You connect a GitHub repository, give Claude a task, and it clones your repo, writes code (or documents, or reports, or whatever you need), and pushes the changes to a branch. You review the changes, approve them, and merge. All from a web interface. No installation.

Claude Code on the Web operates inside a real computing environment, a "sandbox". It can read your files, create new ones, run scripts, and push changes to GitHub. It tends to write small programs to perform tasks rather than just replying in plain text. It does work. Real work. The kind you would normally delegate to a research assistant or a junior colleague.


The 10-minute setup: GitHub + Claude Code

OK, let us build this from scratch. I will assume you have zero GitHub experience.

Step 1: Create a GitHub account and a repository.

Go to github.com and sign up. Then create a new repository: click the green "New" button, give it a name (something like my-research-project or quarterly-report), make sure to set it to Private (not Public, unless you want the whole internet reading your drafts), and check "Add a README file." That last part matters. Write a short description of your project in the README. Even a couple of sentences is fine. This initializes the repo so that Claude Code can actually work with it. (An empty, uninitialized repo will cause problems.)

Step 2: Connect your repo to Claude Code.

Go to claude.ai and open Claude Code (it is in the left sidebar, or you can go directly to claude.ai/code). Start a new session and connect your GitHub repository. You can paste your repo URL directly or use the built-in GitHub integration to browse your repositories. Claude will ask you to authenticate with GitHub the first time (a one-time OAuth flow) and to install the Claude GitHub App on the repo (this is what allows Claude to write to it). Select the repo you just created.

Now Claude Code can see your files, and more importantly, it can change them.

At this point, you can upload any files you already have about the project to the repo, or you can defer that and move on to the next step.

Step 3: Let Claude set up your project.

This is where it gets interesting. CLAUDE.md is a special file that Claude reads at the start of every session. It is the project's "master plan": what the project is about, how it is organized, what conventions to follow. But you do not need to know what it should look like. Just describe your project in plain language:

"This repo contains the data and analysis from our AI-powered oral examination system, which I wrote up as a blog post. I want to turn this into a research paper for submission to Communications of the ACM. The data and some initial analysis scripts are already in the repo. Set up the project structure for a CACM submission and create a CLAUDE.md file."

Claude will read through the existing files, figure out what is there, organize everything into a sensible structure, and create a CLAUDE.md that might look something like this:

# Project: AI-Powered Oral Examinations at Scale

## Overview
Research paper for Communications of the ACM describing our system
for conducting and grading oral examinations using conversational AI
agents and a multi-LLM grading approach.

## Submission Details
- **Journal**: Communications of the ACM
- **Format**: ACM `acmart` document class, `acmsmall` style
- **Page limit**: 12,000 words including references
- **Style**: Author-year citations (natbib)

## Structure
- `/paper/` - LaTeX source files and ACM style files
- `/data/` - Exam transcripts, grading data, survey responses
- `/analysis/` - Python scripts for statistical analysis
- `/figures/` - Generated plots (PDF format, generated from scripts)
- `/blog/` - Original blog post and supporting materials

## Conventions
- All figures must be generated from scripts in `/analysis/`,
  never created manually
- Use BibTeX for references (`references.bib`)
- Data files are never edited directly; all transformations
  happen through scripts in `/analysis/`
- Student data must be anonymized in all outputs

## Current Status
See TASKS.md for the current task list and priorities.

Notice: you did not write any of this. You described your project, and Claude produced the project master plan. You review it, maybe tweak a couple of things. Done.

Step 4: Create your TASKS.md file.

This is your project's to-do list. But unlike a regular to-do list, it serves double duty: it tells Claude what needs to be done and keeps a record of what has been completed. Ask Claude to create it:

"Create a TASKS.md file with the following initial tasks..."

Here is what one might look like:

# Tasks

## In Progress
- [ ] E1. Expand blog analysis into formal experimental evaluation
- [ ] E2. Inter-rater reliability analysis (human vs. LLM council grades)

## To Do
- [ ] E3. Create Figure 1 (grade distribution across grading methods)
- [ ] R1. Write Related Work section (AI in assessment, LLM-as-judge)
- [ ] D2. Analyze anti-cheating detection rates
- [ ] Z3. Check word count against CACM 12,000-word limit

## Done
- [x] Z1. Set up project structure from blog post materials
- [x] D1. Anonymize student data
- [x] I1. Write Introduction draft

Now here is the magic. You can point Claude at a specific task and say: "Work on the next task in TASKS.md." Claude reads the file, picks the next item, does the work, updates the task status, and creates a pull request with its changes. If you are not familiar with pull requests, more on those in a moment.


Pull requests: Redlined documents for coders (and not only coders)

Now the part that is unfamiliar to people who are not software engineers. The "pull request".

If you have ever received a redlined document from a lawyer, or reviewed tracked changes in a Word file, you already understand pull requests. The concept is that simple: someone proposes changes, you review them before they get incorporated into the main document.

In GitHub, it works like this:

  1. Claude does its work on a separate branch (a parallel copy of your project).
  2. When it is done, it creates a pull request (PR), which says: "Here are the changes I made. Want to incorporate them?"
  3. You see a clean diff view showing exactly what was added, removed, or modified. Green lines are additions. Red lines are deletions.
  4. You review. You can approve, request modifications, or reject.
  5. If you approve, you click "Merge" and the changes become part of the main project.
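For the curious, the branch-and-merge dance above can be sketched with plain git commands. This is illustrative only: Claude Code runs the equivalent for you, and the branch and file names here are made up.

```shell
# Local simulation of the PR flow (hypothetical branch/file names).
mkdir pr-demo && cd pr-demo
git init -q -b main
git config user.email "you@example.com" && git config user.name "You"
git commit -q --allow-empty -m "initial commit"

git checkout -q -b claude/task-e3          # 1. work happens on a branch
echo "print('grade distribution figure')" > make_figure.py
git add make_figure.py
git commit -q -m "E3: add grade distribution figure script"

git diff --stat main..claude/task-e3       # 3. the diff you review in the PR
git checkout -q main
git merge -q claude/task-e3                # 5. "Merge" incorporates the changes
```

Steps 2 and 4 (opening the PR and reviewing it) happen in the GitHub web interface; the redlines you see under "Files changed" are exactly this `git diff`.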

This is the standard process used by every software team in the world. And it works for any kind of knowledge work that relies on text. Research papers. Reports. Course materials. Business proposals. Anything that lives in files. Ideally, you want the files to be text files and not binary ones: LaTeX and Markdown, good; PowerPoint files, not so much. In the future we may have better tooling for reviewing changes in Office files and other formats, but for now the process works best for text-based files.

Fair warning: the GitHub interface will look busy the first time you open a pull request. Do not panic. Just look for the "Files changed" tab to see the redlines, and the big green "Merge pull request" button when you are ready to accept.

The critical point: you never edit the files directly. You describe what you want, Claude proposes changes, and you review and approve. You are the manager. Claude is the diligent employee who comes back with deliverables for you to inspect. And the audit trail is far better than "Track Changes" in Word ever was.


A real example: From CSV to submission-ready in two hours

Let me show you how this plays out in practice with a real example from last month.

I was working on a paper that had a case study section (say, Section 8) where we discussed results from a partner's dataset, but we only had the final business conclusions, not a full experimental analysis. The rest of the paper (say, Section 7) had a proper, thorough analysis on a different dataset: figures, tables, bootstrap confidence intervals, the works. By comparison, the case study in Section 8 was the weak sibling, and reviewers had flagged that. We had since received a detailed dataset from our partners, but it required work. My TASKS.md had this sitting in it:

## Backlog
- [ ] F5. AML dataset analysis
- [ ] G1. Complete §8 rewrite with AML dataset

I uploaded the CSV to the repo and told Claude:

"Here is the AML dataset. Replicate the analysis from Section 7 but now for Section 8. Use the existing details from Section 8 as the background and framing, conduct the full experimental analysis, and generate a new Section 8."

Claude read Section 7 to understand the methodology. It read the existing Section 8 to understand the framing and context. It wrote Python scripts to process the AML data, generated four figures and three tables with bootstrap confidence intervals, wrote the new section text with all quantities pulled from the analysis scripts, and submitted a pull request with everything.

Less than an hour. I spent another hour reviewing the PR, checking the code, leaving comments ("clarify this axis label," "move this paragraph before the table", "I do not think the conclusions follow from the results"), and merging.

Two hours total. For a PhD student, this would have been a few days of work, easily. And here is the part that matters: every single number in that section was generated through a Python script. Every figure had a script that produced it. Reproducibility was built in from the start, not bolted on after the fact. The pull request showed me exactly what was added: the scripts, the outputs, the LaTeX changes. I could trace every claim back to the code that produced it.
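To make concrete what such a script looks like, here is a minimal sketch of a percentile-bootstrap confidence interval, the kind of computation behind those tables. The data values and function names are hypothetical, not from the actual paper.

```python
# Illustrative sketch: percentile bootstrap 95% CI for a mean,
# so every number in the text traces back to code.
# (Hypothetical data and names, not from the actual analysis.)
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.62, 0.71, 0.68, 0.74, 0.66, 0.70, 0.69, 0.73]
low, high = bootstrap_ci(scores)
print(f"mean = {sum(scores)/len(scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The point is not the statistics; it is that the number printed at the end is the number that lands in the LaTeX table, with a script you can re-run and a seed you can check.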

Needless to say, I remain fully accountable for any bugs or errors. At the end of the day, I have reviewed the scripts, the results, and the text. What I can say is that even if there are errors, they are not "hallucinations" where the LLM filled in random numbers or references. The figures are Python-generated from the raw data, and the same is true for the tables and the numbers in the text. The errors can come from bugs or other oversights. But we should stop calling all AI errors "hallucinations". At this point, they are not the errors of a "bullshitter in chief" (a title aptly earned by early LLMs); they are the same types of errors a junior colleague might make when carefully executing a well-defined task: misreading a specification, applying a method slightly outside its intended scope, or missing an edge case that a more seasoned eye would have caught.


Beyond software: Why this works for all knowledge work

I want to be explicit about something: this is not just for code. GitHub repositories can hold any kind of file. Markdown documents, LaTeX papers, CSV data files, images, PDFs. The pull request workflow works for anything.

Writing a consulting report? Put the markdown draft in /report/, the supporting analysis in /data/, the charts in /figures/. Claude generates the analysis, creates the figures, and drafts sections of the report, all as reviewable pull requests.

Same idea for course materials (I use this with my exit tickets workflow), business plans, grant proposals. You define the project structure, you maintain a task list, and you let the agent do the work while you review proposals. Standard software engineering practice, applied to everything.


Leveling up: More files for better project management

Once you get comfortable with CLAUDE.md and TASKS.md, you can add more structure. The files I have found most universally useful are these three:

  • SCHEDULE.md — Deadlines and milestones. "The submission deadline is March 15" becomes a constraint that shapes which tasks get prioritized first.
  • DECISIONS.md — Key choices and their rationale. "We decided to use three LLMs in the grading council instead of five because the marginal improvement was negligible." Prevents you and Claude from relitigating settled questions two weeks later.
  • STYLEGUIDE.md — Your writing preferences. "Never use em-dashes," "Never use fluffy adjectives," "Avoid claims not supported by data or citations." Good trick: give Claude a few pieces of your favorite writing and ask it to generate a style guide that mimics your voice. Then drop it in the repo.

Beyond these, there are files worth adding for specific situations:

  • CHANGELOG.md — Human-readable log of what changed each session. Especially useful when preparing a response to reviewers.
  • BLOCKERS.md — Things waiting on someone external. Makes it easy to send a collaborator a list of "here is what I need from you."
  • FEEDBACK.md — Running log of all feedback received, formal and informal, with status: pending, accepted, or rejected with rationale.
  • SOURCES.md — Annotated bibliography: what each source is useful for, how reliable it is, which sections cite it.
  • GLOSSARY.md — Keeps terminology consistent across a long document. Claude consults it and adds new terms as they come up.
  • DEPENDENCIES.md — Maps how artifacts depend on each other. Lets Claude flag when an upstream change invalidates something downstream.

You do not need all of these on day one. Start with CLAUDE.md and TASKS.md. Add CHANGELOG.md when editing a paper that came back with revisions. Add the rest as your project grows and you find yourself needing them.

To be fair, this is a bit of a hack. We are simulating standard project management tools using plain markdown files. Scanning text files for task lists and decisions is not exactly elegant. And I have serious doubts that this can scale for projects involving hundreds of people. But it works for now, with tools that exist today, for the projects that I am working on.

In the future, agents will have proper interfaces: structured databases, purpose-built PM tools designed for agents to read and write directly, not markdown files they have to parse every session. We are in the duct-tape-and-baling-wire phase. It is fine. The duct tape holds.


The awkward part (and why it is worth it)

If you are not a software engineer, this workflow feels strange at first. You are used to opening a document and typing. Now you are writing instructions, waiting for an AI to propose changes, and clicking "Merge" on a pull request. It is indirect. It feels like you are adding a middleman.

But here is what happens after a week: you realize the middleman can do 80% of the work. And the 20% you are doing (reviewing, giving feedback, making decisions) is the work that you would have done with any apprentice. But you are not fixing typos, you are not formatting tables, you are not wrestling with matplotlib's axis labels. You are reading the output and deciding if it is good and trustworthy enough.


Coming next

This post covered the basics: one repo, one project, Claude Code on the Web doing the work. The whole secret is that now the chatbots can write down what they have done, and look up the notes next time you start working together. And it is ridiculously powerful.

But this is just the beginning.

In upcoming posts, I will describe my "master repo, satellite repos" setup, where I maintain a central task management repository that coordinates work across multiple projects with different collaborators. Think of it as the command center. I will also walk through my MCP (Model Context Protocol) configuration for integrating Gmail and Google Calendar directly into Claude Code, so the agent can check my schedule, draft emails, and coordinate meetings as part of its workflow.

Beyond that: deploying resources on Google Cloud, spinning up virtual machines for heavy computation, and the "council of LLMs" approach where Claude, Gemini, and GPT deliberate together on evaluation tasks (something I have been using for grading oral exams and am now extending to research).

At some point (in the not so distant future, probably by the end of March or so) Claude will be scheduling my meetings, answering my emails, and assigning me tasks from my own task list. I am not entirely sure who is managing whom anymore.

Sunday, February 15, 2026

Listening to My Students at Scale: Exit Tickets, NotebookLM, and the Tightest Feedback Loop I've Ever Built

It started at a teaching workshop last semester: Craig Kapp and Rob Egan presented a seminar at the NYU Center for Teaching and Learning called "Real-Time Insights: Leveraging AI for Responsive Teaching in Large Classrooms." They (re-)introduced a deceptively simple concept: the exit ticket. The idea is that at the end of every class session, you ask students three quick questions, each with a different shape metaphor:

  • 🔵 Circle: What is still circling in your mind? (What are you confused about?)
  • 🟥 Square: What "squared" with your understanding? (What clicked today?)
  • 🔺 Triangle: What are three key takeaways from today's session?

Then you take these answers and use LLMs to process them quickly, so you have feedback before the next session.

Getting structured feedback from students after every single session? Not at the end of the semester when it's too late to change anything, but right now, while you can still do something about it? I immediately wanted to try it.

Below I describe the details of the approach presented by Craig and Rob, and my own adjustments to the recipe. Hope you will find it useful.


The setup: Making it required (and why that matters)

It starts by setting up the exit ticket surveys as auto-graded quizzes on Brightspace (NYU's LMS). The auto-grading part is a nice little trick: one of the questions is simply "Select True in this question to get your points." Students complete the survey, they get their credit. No manual processing of ~50 submissions on my end.

We do tell students upfront: write something substantive. Don't game the system. We reserve the right to deduct points if someone slacks through the exit tickets all semester. And here's the nice irony: since we're already running AI-powered analysis on the responses, identifying freeriders who type "asdf" every week is trivial. The same pipeline that processes the feedback also flags the people not providing any.
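As an illustration, flagging the "asdf" crowd can be as simple as the toy heuristic below. This is a sketch, not the actual pipeline, and the thresholds and names are made up.

```python
# Toy heuristic (not the actual NYU pipeline) for flagging
# low-effort exit-ticket responses before running the analysis.
def is_low_effort(response: str, min_words: int = 3) -> bool:
    """Flag blank, keyboard-mash, or too-short answers."""
    text = response.strip().lower()
    if len(text.split()) < min_words:
        return True
    # Keyboard-mash heuristic: real sentences contain vowels
    if not any(c in "aeiou" for c in text):
        return True
    return False

responses = {
    "s1": "Still confused about concept drift vs. data drift.",
    "s2": "asdf",
    "s3": "n/a",
}
flagged = [sid for sid, r in responses.items() if is_low_effort(r)]
print(flagged)  # → ['s2', 's3']
```

In practice you would let the LLM judge substance rather than rely on word counts, but even a filter this crude catches most of the freeriders.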

The critical design decision: make it part of the grade, not optional. Optional feedback gets ~30% response rates and self-selected complainers. Required feedback gets everyone. And because this is formative feedback (not evaluative), students have every reason to be honest and detailed. They're not rating me. They're telling me what they need.

Compare this to the end-of-semester evaluation. Students fill it out in December, the professor reads it in January (maybe), and any changes happen next year for a completely different group of students. The feedback loop is so long that it barely qualifies as a loop. Exit tickets close that loop within days. Sometimes hours.


From exit ticket to next session: the processing pipeline

So now I have all this feedback. ~50 students, after every session, telling me what confused them, what clicked, and what they're taking away. The question becomes: how do you actually process all of that quickly enough to act on it?

NYU IT built an official path for this, which Rob demonstrated in the seminar. You export the exit ticket responses into the Brightspace Insights Portal (which Rob's team manages) and run AI-powered analysis using a prompt like this:

You are an expert Instructional Designer and Data Scientist assisting
a professor with the course "AI/ML Product Management" at NYU Stern
School of Business (undergraduate).

Your goal is to analyze student feedback survey data to improve course
delivery. The survey questions and student answers are provided below.
Please perform the following two steps:

### Step 1: Thematic Analysis
Analyze the responses to identify key themes. Do not just look for
keywords; look for semantic similarities and underlying sentiment. For
each theme, provide:
1. **Theme Name**: A concise title.
2. **Prevalence**: The approximate number of students who mentioned this.
3. **Explanation**: A brief summary of the sentiment or issue.
4. **Evidence**: A direct, representative quote from the data.

### Step 2: Actionable Pedagogy (Bloom's Taxonomy)
For each theme identified above, propose a short course activity.
* If the theme represents a **knowledge gap/pain point**, propose a
  remedial activity.
* If the theme represents a **strength/interest**, propose an activity
  to deepen understanding.
* **Constraint**: The activity must be supported by Bloom's Taxonomy.
  Explicitly state which level of Bloom's Taxonomy the activity targets
  (e.g., Application, Analysis, Evaluation).

**Format**:
Start the suggestion section for each theme with the label: "PRACTICE IDEA".

I attach the survey data.

It's a well-designed prompt. Thematic coding, prevalence counts, representative quotes, remedial activities aligned with Bloom's Taxonomy. The output is genuinely useful.

But I prefer to do something slightly different. I use the same prompt from the Insights Portal, but I run it inside NotebookLM with just the student feedback as input. For those unfamiliar: NotebookLM is Google's AI-powered research assistant. You upload your own documents, and it generates analysis, summaries, explainer videos, and podcast-style audio overviews grounded entirely in your uploaded sources. NYU provides institutional access through Google Workspace, so the data never trains any AI models, which matters when you're working with student feedback.

Why NotebookLM over the Insights Portal? Because the exit ticket analysis is just the starting point. What I really need is to prepare the follow-up material. Once NotebookLM identifies the themes and suggests activities, I take those suggestions and combine them with my lecture slides, readings, and case studies (which are already loaded in the same notebook). Then I ask it to generate explainers, videos, infographics, and targeted activities that address the confusion, all grounded in my actual course content.

The Insights Portal gives me a diagnosis. NotebookLM gives me the diagnosis and helps me build the treatment.

My workflow after every class:

  1. Students complete the exit ticket on Brightspace (takes them 2-3 minutes)
  2. I export the responses and upload them into a NotebookLM notebook, together with the materials for that session
  3. NotebookLM identifies the themes: what's confusing people, what clicked, what they found most valuable
  4. Based on those themes, I generate explainer materials, short videos, and targeted activities for the next session

(As an example, here is the NotebookLM that we use for the Zillow Offers case, which we use to discuss leading and lagging metrics, model and output monitoring, concept drift, adverse selection and other product-management-related topics. Note: this notebook contains only course materials for preparing the case discussion, not student feedback data.)

One small but annoying wrinkle: NotebookLM's default slide output has that unmistakable "AI-generated" aesthetic. You know the one. (Yes, they are visually gorgeous compared to my own slides, but after a while it starts feeling a bit like slop.) So I started uploading the NYU brand style guide as an additional source in my notebooks, and prompting NotebookLM to follow it when generating visual materials. The results are noticeably closer to proper NYU-branded slides. Not perfect, but much better than the generic AI look. I'm still waiting for NotebookLM to support custom templates or branding natively, but that's a different story.

The per-session overhead is maybe 15-20 minutes.


Why this actually works

The circle/square/triangle structure does something clever: it gives students permission to be confused. "What is still circling in your mind?" is a much less intimidating question than "What don't you understand?" And the three-takeaways question forces them to reflect, even briefly, which helps consolidate their learning.

But the real reason students engage is that they see the results. When I open the next class by saying "Several of you mentioned you were confused about X, so let's spend 15 minutes on this before we move on," students learn that their feedback actually matters. It creates a virtuous cycle: they write thoughtful responses because they know I'll respond, and I can respond because NotebookLM makes processing all the responses feasible. Without the AI assist, no professor has time to synthesize free-text responses from 50 students after every class and create targeted follow-up materials. Definitely not after every single session. The economics just don't work.

With NotebookLM doing the heavy lifting? The economics suddenly work beautifully.

The exit ticket has been around for decades. Craig and Rob simply showed how to supercharge it with AI. The hard part was never getting students to talk. It was finding the time to listen. Once students realize someone is actually listening, they start saying things worth hearing. That's the loop. That's the whole trick.

Wednesday, February 11, 2026

Everybody Is a CEO Now (And What Exactly Am I Doing Here?)

It's hard to pinpoint the exact moment when something fundamentally shifts. There's no day when you wake up and say, "Today, everything is different." It's more like boiling a frog. Except in this case, the frog is me, and the water feels amazing.

Over the last few weeks, a confluence of AI developments crossed an invisible threshold. None of them is dramatic on their own. All of them, together, are profoundly changing how I work, how I teach, and honestly, how I think about what comes next.


Claude stopped being a chatty know-it-all

Let me start with the most concrete thing. Around December, Claude became... different. Not in some flashy, press-release way. It just started being right. Consistently, reliably right. The suggestions were spot on. The reasoning was good. The writing did not feel like fluffy AI slop. The output needed minimal editing.

I know, I know—"AI is getting better" isn't exactly breaking news. People have been saying this for years. But there's a qualitative difference between "impressive compared to what we had before, but I still need to direct and edit this very carefully" and "I now trust this thing with real work." We crossed that line.

Here's the moment it hit me. Yesterday, I had a brainstorming session with a student. We shared documents, exchanged ideas, sketched out some research directions. Normal academic stuff. Afterwards, I dumped my messy meeting notes into Claude and asked it to organize them.

What came back was not just a cleaned-up document with better formatting.

It was a research program.

Legitimate research questions, well thought out, properly scoped, organized into a coherent agenda with clear methodological approaches. I sat there staring at my screen. I did not feel like a professor advising a student and making some progress. It felt, in reality, like we were two grad students who had been goofing around with half-baked ideas, and then our wise, respected senior professor walked into the room, sat down, and said: "OK, here's how research is actually done. Here's how you think about this. Here's how you organize your work."

Not a helpful assistant anymore. Claude was setting the agenda this time around. It was the senior colleague. It was the advisor.


The Agent That Puts PhD Students to Shame

And then there's the agent setup, which is where things get truly surreal.

When you pair Claude with GitHub for memory, an AGENTS.md file for context, and a TODO.md for task tracking, something clicks. The AI labs have been saying for a while that their agents were reaching "PhD student level." I've supervised PhD students for 20 years. I love them. Truly. But let me be blunt: I have never worked with a PhD student this organized and this diligent.

None of them has ever created a table mapping every data-driven claim in the LaTeX source to the specific code and data files that support it. None of them has had a full pipeline for the data analysis and the figures in a Makefile, ready to regenerate everything if necessary. None of them has had a reproducibility package ready before we even sent out the first manuscript.

The only downside? I will not be able to have drinks with this PhD student in the future and feel happy seeing them be so much more successful than I am.

A paper is about to go out. I started writing in earnest on Saturday. It took a total of four days of work to get to a submittable manuscript. The experimental analysis, the writing, the polishing. Four days. This would have taken four weeks minimum with a human collaborator, and that's being generous. And the quality isn't "good enough for a draft." It's "ready for submission with minor tweaks."

I find myself glued to my screen all day, but I am not doing busy work. I write down what needs to be done, and it happens behind the scenes. I get back the next iteration in an hour, I look at it, I give feedback, we cross items off the TODO.md, and we move forward. This is real work being done. Not just coding: paper writing, report preparation. Coding practices leak into other types of work, and things are moving. My real work is getting done, not just my academic software prototypes.

It's like having an infinite pool of employees, each one eager, competent, and ready to come back with actual deliverables. Not drafts that need to be rewritten. Not outlines that need to be fleshed out. Deliverables.


Teaching as Curation: The NotebookLM Story

Let me tell you about another shift that's been happening in parallel, this one in our classroom.

We teach an AI Product Management course at Stern, and starting in November, something strange happened to how we prepare. We stopped creating content. We started curating it.

Here's our workflow now: After every class session, we collect student feedback. What clicked, what didn't, what questions came up, what topics generated the most energy. We dump all of this (the feedback forms, our own notes, relevant articles, the previous session's materials) into NotebookLM.

And then we ask it to help us design the next session.

NotebookLM digests the student feedback, identifies the gaps, suggests educational activities, and creates new explainer material that directly addresses what students found confusing or wanted to explore further. It connects themes across sessions that we might not have noticed. It proposes case studies that are relevant to the questions students actually asked, not the ones we assumed they'd ask.

The result? The course is absurdly adaptive. Every session builds on what students actually need, not on a syllabus we wrote in August. We're not creating lectures from scratch anymore. We're curating a learning experience, with AI as our editorial partner. The student feedback loop, which used to inform maybe the next semester's version of the course, now informs the next class.

We feel like careful curators, because we're still the ones making the final calls. For now. For how long? No idea. Perhaps in Summer even the curation will be something the AI does better than us.

Education is changing. Bloom's two sigma problem, the finding that one-on-one tutoring outperforms classroom instruction by two standard deviations, is solvable. Now. What is our role? No clue. Perhaps the future of education does not need professors. But the future of education is bright. Looking back, we will not believe how bad we were. Almost like going from writing with a marker on transparencies to having an interactive demo of the concept. That transition took 30 years. Let's see where we will be in 30 months.


So... Everybody's a CEO Now?

Here's where I start to feel a little dizzy. The marginal cost of competence is hitting zero.

If I can supervise an AI agent the way I'd supervise a research team (giving it direction, reviewing output, iterating on results) and if this scales to writing papers, analyzing data, building prototypes, designing courses... then what am I? I'm a manager. A director. A CEO of a one-person company with an arbitrarily large AI workforce.

But here's the question: What happens when everyone can do this?

When every professor can produce research at 10x the speed. When every consultant can deliver analyses that used to require a team of five. When every entrepreneur can build and ship products without hiring engineers. When every student can produce work indistinguishable from an expert's.

Do we still need employees? Is it even feasible for everyone to operate like a one-person business? And if so, who are the customers? If everyone is a CEO, who is buying?

I don't have answers. The words people have been saying for the last few years, "AI will change everything," "this is the new industrial revolution," "knowledge work will be transformed," those words haven't changed.

But the feeling has.

It used to feel like a prediction. The prediction is here. You will feel it soon, if you have not felt it already. It will be a mix of awe and fear. Impostor syndrome to the fullest. What exactly am I adding here?

I'd love to tell you that the human role is now "taste, judgment, direction-setting" and that AI just handles the execution. That's the comforting version. But I just told you that Claude set the research agenda, not me. So even that may not hold for long.


Bye now

And for now, if you'll excuse me, I need to go review the deliverables my AI team just submitted. Four papers in the queue, a course redesign in progress, and a blog post that, unlike this one, I didn't write myself.

OK fine, I didn't write this one myself either.

(Kidding. Mostly.)

Monday, December 29, 2025

Fighting Fire with Fire: Scalable Personalized Oral Exams with an ElevenLabs Voice AI Agent

It all started with cold calling.

In our new "AI/ML Product Management" class (co-taught with Konstantinos Rizakos), the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing" good.

And let's be clear: We have zero problems with students using AI for their work. (Banning AI in an AI course? That would be... special.) We actively encourage it. But here's the distinction that matters: using AI to enhance your thinking versus outsourcing your thinking entirely and learning nothing at the end. One of these is education. The other is expensive credential theater.

So we started cold calling students randomly during class.

The result was... illuminating. Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all. This gap was too consistent to blame on nerves or bad luck. If you cannot defend your own work live, then the written artifact is not measuring what you think it is measuring.

Brian Jabarian has been doing interesting work on this problem, having shown that AI is actually better than humans at conducting job interviews. Why? Humans get tired, have biases, and are less consistent at following a script. His results both inspired us and gave us the confidence to try something that would have sounded absurd two years ago: running the final exam with a Voice AI agent.


Why oral exams? And why now?

The core problem is almost embarrassingly simple: students now have immediate access to LLMs that can handle most exam questions we traditionally use for assessment. The old equilibrium—where take-home work could reliably measure understanding—is dead. Gone. Kaput.

OK, so we go pen and paper in the classroom. We did exactly that for the midterm. Problem solved, right?

Well, not quite. We also needed to ensure that students had done deep work on their group projects. In the past, our worry was freeriding: students offloading their work to teammates. But then, in the middle of our class, the AI landscape shifted dramatically. Gemini 3.0 dropped, and NotebookLM started generating flawless presentations. Suddenly, a student could deliver a polished, sophisticated presentation about a project they barely touched.

And we'd have no way to tell.

Oral exams were the natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. No LLM whispering in your ear. No "let me just check something real quick" while ChatGPT generates your answer. Just you, your knowledge, and an evaluator.

The problem? Oral exams are a logistical nightmare.

With 36 students and two instructors, we could maybe manage. But even at that scale, the accommodation requests started piling up immediately. "I have a flight on the 15th." "I have three other finals that day." "I'm traveling for a family event." All legitimate! But multiply that by a factor of ten for a larger class, and you're looking at a month-long hostage situation.

So: oral exams don't scale. Everyone knows this. It's why we abandoned them in the first place.

Unless you cheat.


Enter the Voice Agent

We used ElevenLabs Conversational AI to build the examiner. The platform bundles the messy parts (speech-to-text, text-to-speech, turn-taking, interruption handling, …) into something usable. And here is the thing that surprised me: a basic version for a low-stakes setting (e.g., an assignment) can be up and running in literally minutes. Minutes. Just write a prompt describing what the agent should ask the student, and you are done.

Two features mattered a lot for our setup:

  • Dynamic variables: pass the student's name, project details, and other per-student context into the conversation as parameters, to allow personalized exams
  • Workflows: build a structured flow with sub-agents instead of a single "chatty" agent trying to do everything
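Outside of any particular SDK, the dynamic-variables idea is simply per-student context rendered into the examiner's prompt at conversation start. Here is a minimal sketch; the field names, template wording, and student record are our own illustrations, not the ElevenLabs API:

```python
# Hypothetical per-student context; with ElevenLabs these fields would be
# passed as dynamic variables when the conversation starts.
STUDENTS = {
    "kr888": {
        "name": "Konstantinos",
        "project": "LinkedIn Recruiter agent that scans profiles and sends DMs",
    },
}

PROMPT_TEMPLATE = (
    "You are an oral examiner. The student is {name}. "
    "Their capstone project: {project}. "
    "Ask one question at a time about their goals, data, and evaluation."
)

def build_prompt(net_id: str) -> str:
    """Render the examiner prompt for one student."""
    student = STUDENTS[net_id]
    return PROMPT_TEMPLATE.format(**student)

print(build_prompt("kr888"))
```

The point is that personalization lives in data you control, not in the agent's memory: every exam starts from the same template plus that student's record.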

What the exam looked like

We ran a two-part oral exam.

Part 1: "Talk me through your project." The agent asks about the student's capstone project: goals, data, modeling choices, evaluation, failure modes. This is where the "LLM did my homework" strategy dies. You can paste an assignment into ChatGPT. It is much harder to improvise consistent answers about specific decisions when someone is drilling into details.

Part 2: "Now do a case." The agent picks one of the cases we discussed in class and asks questions spanning the topics we covered: basically testing whether students absorbed the material or just showed up.

To handle this structure, we split the exam into sub-agents in a workflow:

  1. Authentication agent: Asks for the student's ID and refuses to proceed without a valid one. (In a more productized version, we would integrate with NYU SSO instead of checking against a list.)
  2. Project discussion agent: Gets project context injected via parameters. The prompt includes details of each project so the agent can ask informed questions. The next step is obvious: connect retrieval over the student's submitted slides and reports so the agent can quote and probe precisely.
  3. Case discussion agent: Selects a case and runs structured questioning. Again, RAG would help with richer case details.

This "many small agents" approach is not just aesthetic. It prevents the system from drifting into unbounded conversation, and it makes debugging possible.
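The "many small agents" flow can be pictured as a tiny gatekeeper pipeline. This is a toy plain-Python illustration, not the actual ElevenLabs workflow configuration; the ID list and stage names are placeholders:

```python
# Toy sketch of the three-stage workflow: each sub-agent is a gate.
VALID_IDS = {"kr888"}  # in a productized version, check against NYU SSO

def authenticate(net_id: str) -> bool:
    """Authentication agent: refuse to proceed without a valid ID."""
    return net_id in VALID_IDS

def exam_flow(net_id: str) -> list[str]:
    """Run the sub-agents in order, stopping if authentication fails."""
    if not authenticate(net_id):
        return ["rejected"]
    return ["authenticated", "project_discussion", "case_discussion"]
```

Because each stage has one job and a defined hand-off, a failure (say, the case agent drifting off-topic) can be traced to one prompt instead of one giant conversation.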



By the Numbers

  • 36 students examined over 9 days
  • 25 minutes average (range: 9–64)
  • 65 messages per conversation on average
  • 0.42 USD per student (15 USD total), but also the $99/month ElevenLabs subscription
  • 89% of LLM grades within 1 point
  • Shortest exam (9 min) → highest score (19/20)

The economics

Let's talk money.

Total cost for 36 students: 15 USD.

That's 8 USD for Claude (the chair and heaviest grader), 2 USD for Gemini, 0.30 USD for OpenAI, and roughly 5 USD for ElevenLabs voice minutes. Forty-two cents per student.

The alternative? 36 students × 25-minute exam × 2 graders = 30 hours of human time. At TA rates (~$25/hour), that's $750. At faculty rates, it's "we don't do oral exams because they don't scale."

For $15, we got: real-time oral examination, a three-model grading council with deliberation, structured feedback with verbatim quotes, a complete audit trail, and—as you'll see—a diagnosis of our own teaching gaps.

The unit economics work. But as we will see next, the real benefit is the value delivered, not the 50x cost savings.


What broke (and how we fixed it)

The first version had problems. Here is what we learned.

1) The voice was intimidating

A few students complained that the agent sounded severe. We had cloned Foster Provost's voice because, frankly, his clone was much more accurate than the clones of our own voices. But the students found it... intense. Here is an email from a student:

I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer's voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.

And here is another:

Just got done with my oral exam. [...] I honestly didn't feel comfortable with it at all. The voice you picked was so condescending that it actually dropped my confidence. [...] I don't know why but the agent was shouting at me.

Fix: We are split on this one. We love FakeFoster. But next time we will A/B test other voices. At the end of the day, we want to optimize for comprehension, not charisma. ElevenLabs has guidance on voice and personality tuning: they treat this as a product design problem, which is probably the right way to think about it.

2) The agent stacked questions

This was the biggest real issue. The agent would ask something like: "Explain your metric choice, and also tell me what baselines you tried, and why you did not use X, and what you would do next."

That is not one question. That is four questions wearing a trench coat. The cognitive load for an oral exam is already high. Stacking questions makes it brutal.

Fix: Hard rule in the prompt: one question at a time. If you want multi-part probing, chain it across turns. For grading, we also included an "interference protocol": students who were hit with a stacked question and answered only some of its parts still received full credit.

3) Clarifications became moving targets

Student: "Can you repeat the question?"
Agent: paraphrases the question in a subtly different way

Now the student is solving a different problem than the one they were asked. Very frustrating.

Fix: Explicit instruction in the prompt: repeat verbatim when asked to repeat. No paraphrasing. Same words.

4) The agent did not let students think

Humans rush to fill silence. Agents do too. Students would pause to think, and the agent would jump in with follow-up probes or worse: interpret the silence as confusion and move on.

Fix: Tell the agent to allow think-time without probing aggressively. It made the exam feel less like an interrogation. We also increased the time-out before the agent asks "Are you there?" from 5 to 10 seconds.

5) Lack of randomization

We asked the agent to "randomly select" a case study. It did not.

From December 12–18, when Zillow was in the case list, the agent picked Zillow 88% of the time. After we removed Zillow from the prompt on December 18, the agent immediately latched onto Predictive Policing—picking it for 16 out of 21 exams on December 19 alone.

LLMs are not random. They have implicit preferences and ordering biases. Asking an LLM to "pick randomly" is like asking a human to "think of a number between 1 and 10"—you're going to get a lot of 7s.

Fix: Pass an explicit random number as a parameter and map it to cases deterministically. Do the randomization in code, not in the prompt.
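One way to sketch that fix: derive a deterministic seed per student from values you already have (net ID, exam date), and map it onto the case list in code. The case names beyond Zillow and Predictive Policing are placeholders we made up:

```python
import hashlib

# Placeholder case list; only the first two appear in the post.
CASES = ["Zillow", "Predictive Policing", "Case C", "Case D"]

def pick_case(net_id: str, exam_date: str) -> str:
    """Deterministic, roughly uniform case selection done in code.
    The result is passed to the agent as a parameter, so the LLM
    never 'chooses' anything."""
    seed = f"{net_id}:{exam_date}".encode()
    digest = int(hashlib.sha256(seed).hexdigest(), 16)
    return CASES[digest % len(CASES)]
```

Hashing keeps the choice reproducible (the same student on the same date always gets the same case, which helps auditing) while spreading students evenly across cases.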


Grading: the council deliberation actually worked

OK, so here is where things got interesting.

We graded using a "council of LLMs" approach, an idea we borrowed from Andrej Karpathy. Three models (Claude, Gemini, ChatGPT) assessed each transcript independently. Then they saw each other's assessments and revised. Finally, the chair (Claude) synthesized the final grade with evidence.
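The two-round structure is easy to sketch. `call_model` below is a stub standing in for real API calls: its baseline grades echo our Round 1 means for flavor, and its "revision" rule (move halfway toward the peer mean) is a crude stand-in for the models' actual deliberation:

```python
def call_model(model: str, transcript: str, peer_grades=None) -> float:
    """Stub grader returning a score out of 20. In Round 2 the model
    sees its peers' grades and revises."""
    base = {"claude": 13.0, "gemini": 17.0, "openai": 14.0}[model]
    if peer_grades:  # crude revision: move halfway toward the peer mean
        peer_mean = sum(peer_grades) / len(peer_grades)
        return (base + peer_mean) / 2
    return base

def council_grade(transcript: str) -> float:
    """Round 1: independent grades. Round 2: revise after seeing peers.
    The chair (Claude, in our setup) would synthesize; here we average."""
    models = ["claude", "gemini", "openai"]
    round1 = {m: call_model(m, transcript) for m in models}
    round2 = {
        m: call_model(m, transcript,
                      peer_grades=[g for k, g in round1.items() if k != m])
        for m in models
    }
    return sum(round2.values()) / len(round2)
```

Even in this toy version the mechanism is visible: the outlier (the stub "gemini") gets pulled toward the pack in Round 2, which is exactly the convergence we observed with the real models.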

Round 1 was a mess. When the models graded independently, agreement was poor: 0% of grades matched exactly, and only 23% were within 2 points. The average maximum disagreement was nearly 4 points on a 20-point scale.

And here's the kicker: Gemini was a softie: It averaged 17/20. Claude averaged 13.4/20. That's a 3.6-point gap—the difference between a B+ and a B-.

Meanwhile, Claude and OpenAI were already aligned: 70% of their grades were within 1 point of each other in Round 1.

Model    Round 1 Mean    Round 2 Mean    Change
Claude   13.4/20         13.9/20         +0.5
OpenAI   14.0/20         14.0/20         +0.0
Gemini   17.0/20         15.0/20         -2.0

Then came consultation. After each model saw the others' assessments and evidence, agreement improved dramatically:

Metric               Round 1     Round 2     Improvement
Perfect agreement    0%          21%         +21 pp
Within 1 point       0%          62%         +62 pp
Within 2 points      23%         85%         +62 pp
Mean max difference  3.93 pts    1.41 pts    -2.52 pts
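The agreement metrics above are straightforward to recompute from per-model grade lists. A small helper, assuming the lists are aligned by student (toy data, not our real grades):

```python
def agreement_stats(grades: dict[str, list[float]]) -> dict[str, float]:
    """Given per-model grade lists (one entry per student), compute the
    share of students where all models agree within 1 point, and the
    mean of the per-student maximum pairwise gap."""
    models = list(grades)
    n = len(grades[models[0]])
    within1 = 0
    max_gaps = []
    for i in range(n):
        scores = [grades[m][i] for m in models]
        gap = max(scores) - min(scores)
        max_gaps.append(gap)
        if gap <= 1:
            within1 += 1
    return {
        "within_1pt": within1 / n,
        "mean_max_gap": sum(max_gaps) / n,
    }
```

Running this once per round on the same transcripts gives the before/after columns in the table.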

Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.

Grade convergence chart

But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.

Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.

The grading was stricter than my own default. That's not a bug. Students will be evaluated outside the university, and the world is not known for grade inflation. (In case you are wondering: I graded all exams myself and asked the TA to grade them too. We mostly agreed with the LLM grades, and I mostly aligned with the softie, Gemini. But when my grades disagreed with the council's, I found the council was more consistent across students; it often graded more strictly, but more fairly.)

The feedback was better than any human would produce. The system generated structured "strengths / weaknesses / actions" summaries with verbatim quotes from the transcript. Sample feedback from the highest scorer:

"Your understanding of metric trade-offs and Goodhart's Law risks was exceptional—the hot tub example perfectly illustrated how optimizing for one metric can corrupt another."

Sample from a B- student:

"Practice articulating complete A/B testing designs: state a hypothesis, define randomization unit, specify guardrail metrics, and establish decision criteria for shipping or rolling back."

Specific. Actionable. Tied to evidence. No human grader has the time to generate that for every student.


It diagnosed our teaching gaps

Ha! This one stung.

Topic performance chart

When we analyzed performance by topic, one bar stuck out like a sore thumb: Experimentation. Mean score: 1.94 out of 4. Compare that to Problem Framing at 3.39.

The breakdown was brutal:

  • 3 students (8%) scored 0—couldn't discuss it at all
  • 7 students (19%) scored 1—superficial understanding
  • 15 students (42%) scored 2—basic understanding
  • 0 students scored 4—no one demonstrated mastery

We had rushed through A/B testing methodology in class. The external grader made it impossible to ignore.

The grading output became a mirror reflecting our own weaknesses as instructors. Ooof.

Duration ≠ Quality

One finding I found strangely fascinating: exam duration had zero correlation with score (r = -0.03). The shortest exam—9 minutes—got the highest score (19/20). The longest—64 minutes—scored 12/20.

Taking longer doesn't mean you know more. If anything, it signals struggling to articulate. Confidence is efficient.
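For the record, the statistic here is plain Pearson correlation over (duration, score) pairs. A dependency-free sketch, should you want to run the same check on your own exam logs:

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feed it the per-student durations and final scores; a value near zero, like our r = -0.03, means exam length tells you essentially nothing about performance.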


Anti-cheating (or: trust but verify)

We asked students to record themselves while taking the exam (webcam + audio). This discourages blatantly outsourcing the conversation, having multiple people in the room, or having an LLM in voice mode whispering answers. It also gives us a backup record in case something goes really badly.

And here is an underrated benefit of this whole setup: the exam is powered by guidelines, not by secret questions. We can publish exactly how the exam works—the structure, the skills being tested, the types of questions. No surprises. The LLM will pick the specific questions live, and the student will have to handle them.

This reduces anxiety and pushes students toward actual preparation instead of guessing what the instructor "wants." And it eliminates the leaked-exam problem entirely. Practice all you want—it will only make you better prepared.


What the students said

We surveyed students before releasing grades to capture their experience. Some of the results:

  • Only 13% preferred the AI oral format; 57% wanted traditional written exams, and 83% found the oral format more stressful than a written exam.
  • But here's the thing: 70% agreed it tested their actual understanding: the highest-rated item. They accepted the assessment but not the delivery.
  • At the same time, they almost universally liked the flexibility of taking the exam at their own place and time. Yes, many would have preferred a take-home exam instead, but that format is dead now.

The fix is clear: one question at a time, slower pacing, calmer tone. The concept works. The execution needs iteration.

Student survey results

Try it yourself

If you want to experiment with this approach, here are some resources:

  • Prompt for the voice agent
  • Prompt for the grading council
  • Link to try the voice agent (use Konstantinos as the name and kr888 as the net id to authenticate; the project was a "LinkedIn Recruiter, an agent that scans profiles and automatically sends personalized DMs to candidates on behalf of a recruiter. It engages in the first 3 turns of chat to answer basic questions (salary, location) before handing off to a human.")

What I would change next time

  1. Slower pacing and a calmer voice: We love you FakeFoster, but GenZ is not ready for you. Perhaps we will deploy FakePanos next time. Too bad ElevenLabs hasn't perfected thick accents yet to deliver a real Panos experience.
  2. RAG over student artifacts (slides, reports, notebooks). ElevenLabs supports this directly. If the agent can quote the student's own submission, the exam becomes much harder to game and much more diagnostically useful.
  3. Better case randomization with explicit seeding and tracking. Randomness that "feels random" is not enough. Pass explicit parameters.
  4. Audit triggers in grading. If the LLM committee disagrees beyond a threshold, flag for human review. The point of a committee is not to pretend the result is always certain; it is to surface uncertainty.
  5. Accessibility defaults. Offer practice runs, allow extra time, and provide alternatives when voice interaction creates unnecessary barriers.

The bigger point

Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression. In our case, we wanted to check that the students who worked on the team projects actually contributed and understood what they submitted; we could not do that with in-class pen-and-paper exams.

Assessments need to evolve towards formats that reward understanding, decision-making, and real-time reasoning. Oral exams used to be the standard until they could not scale. Now, AI is making them scalable again.

And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

Fight fire with fire.


Thanks to Brian Jabarian for the inspiration and for giving us confidence that these interviews will work, Foster Provost for lending his voice to create the FakeFoster agent (sorry, students found you intimidating!), and Andrej Karpathy for the council-of-LLMs idea.

Saturday, March 22, 2025

Training LLaMA using LibGen: Hack, a Theft, or Just Fair Use?

Imagine you're building a Large Language Model. You need data—lots of it. If you can find text data of high quality, vetted, truthful, and useful, it would be... great! So, naturally, you head online and find a treasure trove of books neatly indexed, conveniently downloadable, and completely free. The catch? You're looking at LibGen—one of the most infamous pirate libraries on the internet.

This isn't hypothetical. Recently, Meta made headlines for allegedly training their flagship LLM, LLaMA, on content from LibGen. But—can you even do that?

Let's unpack the legal mess behind the scenes, step-by-step.

First: Is Using LibGen Even Legal?

Short answer: Absolutely not. Downloading copyrighted books from LibGen is textbook piracy. Think of it like grabbing a handful of snacks at the supermarket without paying—it's convenient but totally illegal.

Second: Does Training an AI Change the Equation?

Here's where it gets fuzzy. In the U.S., you can claim "fair use"—the idea that some copying is permissible if you're transforming the original work into something new and valuable. (We covered this in an earlier blog post.)

Remember the Google Books case? Google scanned millions of books without permission. Authors sued, but courts sided with Google, citing fair use. The logic was that indexing books for search purposes created something valuable without substituting the original.

Consider another example: the Authors Guild v. HathiTrust case. Libraries scanned books to help visually impaired readers and enable text search. Courts also ruled this fair use, emphasizing the transformative nature and public benefit. However, both these cases involved legally acquired copies—not pirated ones.

So, could Meta's training of LLaMA fall under the same umbrella? Possibly yes, under the same fair-use theory. There is a subtle difference, though: Google used legally accessible copies (from libraries), while Meta reportedly took a different route. Legally speaking, when we talk about copyright and fair use in the US, the source of the copyrighted data does not directly affect the outcome. (Although it can affect the attitude of a jury or a judge if they believe that the defendant acted in bad faith.)

Third: What About the EU?

If you thought U.S. law was tricky, the EU adds another layer of complexity. They don't have a broad "fair use" policy, but they've introduced exceptions specifically for Text and Data Mining (TDM). Good news for researchers and AI developers, right? Except there's a big "BUT": EU law explicitly requires lawful access. Pirate libraries like LibGen don't qualify.

In other words, in Europe, using LibGen isn't just risky—it's explicitly illegal.

Fourth: Is there a Legal Defense for using LibGen?

There is a very reasonable argument that training an AI is transformative—after all, an LLM doesn't copy books; it learns from them. Consider also the LAION case from Germany. LAION, a nonprofit, scraped images from stock photo sites to train AI models. The court allowed it, but crucially because LAION had legitimate access and was a non-commercial entity. The outcome might differ sharply for a commercial giant sourcing pirated content.

There is also the counterargument from authors and publishers that LLMs themselves create (for competitive reasons) a market for licensing content, as the different LLM providers try to get access to exclusive, licensed content as a differentiating factor, in the same way that various streaming companies compete to get exclusive access to films, shows, and TV series. It is a bit of a circular argument (without free training of LLMs, can the LLMs get good enough to create a licensing market?), but we will have to wait for the courts to decide.

Fifth: What's the Risk Here?

For researchers at universities or small startups, casually using LibGen might seem harmless. The risks escalate quickly when you're a global company. Training on "presumed free" copyrighted data differs from "willful infringement"—the legal term for knowingly breaking copyright law.

The fact that LLaMA is open source is a significant factor here, as there is less of a direct profit motive, but when the trainer is a trillion-dollar company, the courts may see things differently. We will see...

After all, while pirates make great movie characters, they're generally less popular in courtrooms.