From Paper to Practice: How We Parse Handwritten Worksheets with Vision AI
Most math practice happens on paper, invisible to any digital system. Here's how we use vision AI to read handwritten worksheets — mistakes and all — and feed that data into a skill model that actually learns from it.
The Missing Data Problem
Most math practice doesn't happen on a screen. It happens on paper — workbook pages, printed worksheets, flashcard drills at the kitchen table. A kid might do thirty addition problems before dinner and none of it shows up in any digital system.
Our app uses a skill model (conjunctive BKT) to figure out what a student knows and what they should practice next. But a skill model can only work with data it has. If a student nails two-digit addition on paper all week, the system doesn't know — it might keep serving those same problems online, wasting time on skills the student has already mastered.
Parents and teachers need a way to feed offline work into the system. Not as a rough summary ("she did okay on subtraction"), but as structured data: every problem, every answer, every mistake. That's what the worksheet parser does.
How It Works
The pipeline is straightforward: snap a photo, crop it, send it to a vision AI, review the results, and feed the structured data into the skill model.
Each step matters. Cropping removes margins and non-math content so the AI focuses on problems. The AI extracts structured data — operands, operators, the student's answer, bounding box coordinates — not just raw text. Human review catches the inevitable misreads. And the skill model update is the whole point: the system now knows about practice it couldn't see before.
Teaching an AI to Be Honest
Here's a problem we didn't expect: LLMs want to be helpful. If a student writes "42" as their answer, the AI has a tendency to read the problem in a way that makes 42 correct. It will subtly misread a "35" as "30" or interpret a smudged digit in whatever way produces a matching answer.
This is exactly backwards from what we need. The mistakes are the most valuable data in the entire worksheet. A wrong answer tells the skill model precisely which sub-skills need work. If the AI "helpfully" corrects errors, we lose that signal entirely.
The fix is aggressive prompt engineering. We frame the task as pure transcription, not evaluation. The prompt explicitly warns the model that students make mistakes and that those mistakes are expected and valuable:
YOU ARE A TRANSCRIBER, NOT AN EVALUATOR.
Read the PRINTED problem terms FIRST — these are typeset numbers, completely independent of any handwriting. Read the HANDWRITTEN student answer SEPARATELY. NEVER let the student's answer influence how you read the printed terms.
We call this the anti-sycophancy framing. It took several iterations to get right. The key insight is that you have to tell the model why faithful transcription matters — "we NEED this data to help them improve" — not just demand accuracy. Models respond better to understanding the purpose behind the constraint.
The Minus Sign Problem
In a vertical math layout, the difference between addition and subtraction is literally one small dash. The plus sign has a vertical and horizontal stroke; the minus sign is just the horizontal stroke. At the resolution of a phone camera photo, with a child's handwriting nearby, that's a genuinely hard visual discrimination task.
Get it wrong and the consequences cascade: 45 + 17 = 62 is correct, but 45 − 17 = 62 is very wrong. The skill model would record a completely fabricated error in subtraction and a phantom success in addition.
The prompt handles this by instructing the model to look for the operator mark specifically and report its confidence separately. If the model isn't sure whether it's seeing a plus or minus sign, it says so — and the review step catches it.
Not All Readings Are Equal
The AI reports confidence scores for each problem on two dimensions: how confident it is about the printed problem terms, and how confident it is about the student's handwritten answer. Each score is a number from 0 to 1.
Clear, printed problems with neat handwriting: high confidence on both, auto-approved. Smudged digits, messy handwriting, cropped edges, ambiguous operators: low confidence, flagged for human review.
The threshold is 0.7. Below that on either dimension, the problem gets flagged. This is deliberately conservative — we'd rather ask a parent to confirm a reading than silently record wrong data in the skill model.
The model also writes brief notes when confidence is low: "Digit could be 7 or 1", "Answer box appears empty", "Smudge obscuring tens digit." These notes make the review step fast because the reviewer knows exactly where to look and what the ambiguity is.
Watching the AI Think
Parsing doesn't happen in a black box. The system streams results in real-time using Server-Sent Events. As the vision model works through each problem, you see two things updating live: a reasoning stream showing the model's thinking process, and a problem list that populates as problems are detected.
The reasoning stream is particularly useful for debugging. You can watch the model say things like "I see a vertical subtraction problem with a borrowing notation mark" or "This digit is smudged, I think it's a 6 but it could be an 8." When something goes wrong, you can read the reasoning to understand why the model made a particular choice.
Problems also light up on the worksheet image as they're found. We extract completed problem objects from the partial JSON stream as it arrives, so the UI can highlight bounding boxes progressively — you can literally watch the AI work its way down the page.
Finding the Problems on the Page
Every detected problem gets a bounding box: a rectangle in normalized coordinates (0 to 1 across both dimensions of the image) that marks where the problem appears on the worksheet. Each problem actually gets two boxes — one for the entire problem region and a tighter one around just the answer area.
These bounding boxes drive the review UI. Tap a problem in the results list and the corresponding region highlights on the photo. See something wrong? You can adjust the box by dragging. The coordinates are normalized so they work at any display size — the same box data works whether you're reviewing on a phone or a tablet.
Bounding boxes also enable targeted re-parsing. If the model got one problem wrong but the rest are fine, you can re-parse just that problem with an adjusted crop and optionally a hint about the handwriting style. No need to re-process the entire worksheet.
Closing the Loop
Everything up to this point is infrastructure. The actual goal is simple: feed parsed problems into the same skill model that powers online practice.
Before worksheet parsing, the skill model only sees problems done in the app. For a student who does most of their practice on paper, that might be ten problems a week — not enough data to make reliable skill estimates.
After worksheet parsing, the model sees everything: the thirty workbook problems from Monday, the timed drill from Wednesday, the online session on Thursday. More data means more accurate skill estimates, which means the system can target practice where it actually matters instead of guessing.
The parsed problems flow through the same grading pipeline as online problems. Each one updates the student's mastery profile, adjusting estimates for relevant sub-skills (single-digit addition, regrouping, multi-digit subtraction, and so on). The next time the student opens the app, their practice session reflects what they've already demonstrated on paper.
What We Learned
Anti-sycophancy is a real prompt engineering challenge. It's not enough to say "be accurate." You have to explicitly tell the model that mistakes are expected and valuable, explain why, and structure the task as transcription rather than evaluation. Even then, you need confidence scores and human review as a safety net.
Streaming and reasoning make AI feel transparent. When users can watch the model think through each problem in real-time, they trust the results more — and they catch errors faster. The streaming infrastructure is more complex to build than a simple request/response, but it's worth it for the user experience.
Confidence thresholds beat binary pass/fail. Problems aren't either "parsed correctly" or "parsed incorrectly." There's a spectrum, and the system should handle different confidence levels differently. High confidence gets auto-approved. Low confidence gets flagged with a specific note about what's ambiguous. This keeps the review burden low while catching the problems that actually need human eyes.
Bounding boxes make human review fast. Without spatial anchoring, reviewing AI output means cross-referencing a list of parsed problems against the original photo, scanning back and forth to check each one. With bounding boxes, you tap a problem and see it highlighted on the image. What could be a five-minute review task becomes thirty seconds.
The system isn't perfect — handwriting recognition never is — but it doesn't need to be. It needs to be good enough that the correction step is fast and the resulting data is accurate. With confidence routing and spatial review, we're there.