
Binary Outcomes, Granular Insights: How We Know Which Abacus Skill Needs Work

How we use conjunctive Bayesian Knowledge Tracing to infer which visual-motor patterns a student has automated when all we observe is 'problem correct' or 'problem incorrect'.

Abaci.one Team · Updated: December 16, 2025
Tags: education, machine-learning, bayesian, soroban, knowledge-tracing, adaptive-learning


Abstract: Soroban (Japanese abacus) pedagogy treats arithmetic as a sequence of visual-motor patterns to be drilled to automaticity. Each numeral operation (adding 1, adding 2, ...) in each column context is a distinct pattern; curricula explicitly sequence these patterns, requiring mastery of each before introducing the next. This creates a well-defined skill hierarchy of ~30 discrete patterns. We apply conjunctive Bayesian Knowledge Tracing to infer pattern mastery from binary problem outcomes. At problem-generation time, we simulate the abacus to tag each term with the specific patterns it exercises. Correct answers provide evidence for all tagged patterns; incorrect answers distribute blame proportionally to each pattern's estimated weakness. BKT drives both skill targeting (prioritizing weak skills for practice) and difficulty adjustment (scaling problem complexity to mastery level). Simulation studies suggest that adaptive targeting may reach mastery 25-33% faster than uniform skill distribution, though real-world validation with human learners is ongoing. Our 3-way comparison found that the benefit comes from BKT targeting, not the specific cost formula—using BKT for both concerns simplifies the architecture with no performance cost.


Soroban (Japanese abacus) pedagogy structures arithmetic as a sequence of visual-motor patterns. Each numeral operation in each column context is a distinct pattern to be drilled until automatic. Curricula explicitly sequence these patterns—master adding 1 before adding 2, master five's complements before ten's complements—creating a well-defined hierarchy of ~30 discrete skills.

This structure creates both an opportunity and a challenge for adaptive practice software. The opportunity: we know exactly which patterns each problem exercises. The challenge: when a student answers incorrectly, we observe only a binary outcome—correct or incorrect—but need to infer which of several patterns failed.

This post describes how we solve this inference problem using Conjunctive Bayesian Knowledge Tracing (BKT), applied to the soroban's well-defined pattern hierarchy.

Context-Dependent Patterns

On a soroban, adding "+4" isn't a single pattern. It's one of several distinct visual-motor sequences depending on the current state of the abacus column.

A soroban column has 4 earth beads and 1 heaven bead (worth 5). The earth beads that are "up" (toward the reckoning bar) contribute to the displayed value. When we say "column shows 3," that means 3 earth beads are already up—leaving only 1 earth bead available to push up.

Scenario 1: Column shows 0

  • Earth beads available: 4 (none are up yet)
  • To add 4: Push 4 earth beads up directly
  • Skill exercised: basic.directAddition

Scenario 2: Column shows 3

  • Earth beads available: 1 (3 are already up)
  • To add 4: Can't push 4 beads directly—only 1 is available!
  • Operation: Lower the heaven bead (+5), then raise 1 earth bead back (-1)
  • Skill exercised: fiveComplements.4=5-1

Scenario 3: Column shows 7

  • Column state: Heaven bead is down (5), 2 earth beads are up (5+2=7)
  • To add 4: Result would be 11—overflows the column!
  • Operation: Add 10 to the next column (carry), subtract 6 from this column
  • Skill exercised: tenComplements.4=10-6

The same term "+4" requires completely different finger movements and visual patterns depending on the abacus state. A student who has automated basic.directAddition might still struggle with tenComplements.4=10-6—these are distinct patterns that must be drilled separately.
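
A minimal sketch of this state-dependent dispatch for a single added digit on one column (simplified and illustrative, not our production simulator; skill names follow the convention used below):

function additionPattern(columnValue: number, digit: number): string {
  const earthUp = columnValue % 5;      // earth beads already up (0-4)
  const heavenDown = columnValue >= 5;  // heaven bead already active?

  if (columnValue + digit > 9) {
    // Overflow: carry 10 to the next column, subtract the ten-complement here
    return `tenComplements.${digit}=10-${10 - digit}`;
  }
  if (digit <= 4 && earthUp + digit <= 4) {
    // Enough free earth beads to push up directly
    return 'basic.directAddition';
  }
  if (digit <= 4 && !heavenDown) {
    // Not enough earth beads, but the heaven bead is free: add 5, subtract the complement
    return `fiveComplements.${digit}=5-${5 - digit}`;
  }
  if (digit === 5 && !heavenDown) {
    return 'basic.heavenBead';          // just lower the heaven bead
  }
  // Digits 6-9 that fit in the column: heaven + earth beads together
  return 'basic.simpleCombinations';
}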

The Soroban Pattern Hierarchy

Soroban curricula organize patterns into a strict progression, where each level must be mastered before advancing. We model this as approximately 30 distinct patterns:

Basic Patterns (Complexity 0)

Direct bead manipulations—the foundation that must be automatic before advancing:

  • basic.directAddition — Push 1-4 earth beads up
  • basic.directSubtraction — Pull 1-4 earth beads down
  • basic.heavenBead — Lower the heaven bead (add 5)
  • basic.heavenBeadSubtraction — Raise the heaven bead (subtract 5)
  • basic.simpleCombinations — Add 6-9 using earth + heaven beads together

Five-Complement Patterns (Complexity 1)

Single-column patterns involving the heaven bead threshold—introduced only after basic patterns are automatic:

  • fiveComplements.4=5-1 — "Add 4" becomes "add 5, subtract 1"
  • fiveComplements.3=5-2 — "Add 3" becomes "add 5, subtract 2"
  • fiveComplements.2=5-3 — "Add 2" becomes "add 5, subtract 3"
  • fiveComplements.1=5-4 — "Add 1" becomes "add 5, subtract 4"

And the corresponding subtraction variants (fiveComplementsSub.*).

Ten-Complement Patterns (Complexity 2)

Multi-column patterns involving carries and borrows—the final major category:

  • tenComplements.9=10-1 — "Add 9" becomes "carry 10, subtract 1"
  • tenComplements.8=10-2 — "Add 8" becomes "carry 10, subtract 2"
  • ... through tenComplements.1=10-9

And the corresponding subtraction variants (tenComplementsSub.*).

Mixed/Advanced Patterns (Complexity 3)

Cascading operations where carries or borrows propagate across multiple columns (e.g., 999 + 1 = 1000).

Simulation-Based Pattern Tagging

At problem-generation time, we simulate the abacus to determine which patterns each term will exercise. This is more precise than tagging at the problem-type level (e.g., "all +4 problems use skill X")—we tag at the problem-instance level based on the actual column states encountered.

Problem: 7 + 4 + 2 = 13

Step 1: Start with 0, add 7
  Column state: ones=0 → ones=7
  Analysis: Adding 6-9 requires moving both heaven bead and earth beads together
  Patterns: [basic.simpleCombinations]

Step 2: From 7, add 4
  Column state: ones=7 → overflow!
  Analysis: 7 + 4 = 11, exceeds column capacity (max 9)
  Rule: Ten-complement (+10, -6)
  Patterns: [tenComplements.4=10-6]

Step 3: From 11 (ones=1, tens=1), add 2
  Column state: ones=1 → ones=3
  Analysis: Only 1 earth bead is up; room to push 2 more
  Patterns: [basic.directAddition]

Total patterns exercised: [basic.simpleCombinations, basic.directAddition, tenComplements.4=10-6]

This simulation happens at problem-generation time. The generated problem carries its pattern tags explicitly—static once generated, but computed precisely for this specific problem instance:

interface GeneratedProblem {
  terms: number[]; // [7, 4, 2]
  answer: number; // 13
  patternsExercised: string[]; // ['basic.simpleCombinations', 'basic.directAddition', 'tenComplements.4=10-6']
}
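
A hedged sketch of how those tags could be produced, where simulateStep stands in for the real abacus simulator that analyzes each term against the current column states:

// Stand-in for the real abacus simulator: apply one term and report the
// pattern(s) that step exercised.
declare function simulateStep(
  runningValue: number,
  term: number,
): { newValue: number; patterns: string[] };

function generateTaggedProblem(terms: number[]): GeneratedProblem {
  let value = 0;
  const exercised = new Set<string>();

  for (const term of terms) {
    const step = simulateStep(value, term);
    step.patterns.forEach((p) => exercised.add(p));
    value = step.newValue;
  }

  return { terms, answer: value, patternsExercised: [...exercised] };
}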

The Inference Challenge

Now consider what happens when the student solves this problem:

Observation: Student answered incorrectly.

Patterns involved: basic.simpleCombinations, basic.directAddition, tenComplements.4=10-6

The question: Which pattern failed?

We have three possibilities:

  1. The student made an error on the simple combination (adding 7)
  2. The student made an error on the direct addition (adding 2)
  3. The student made an error on the ten-complement operation (adding 4 via carry)

But we can't know for certain. All we observe is the binary outcome.

Asymmetric Evidence

Here's a crucial insight:

If the student answers correctly, we have strong evidence that all patterns were executed successfully. You can't get the right answer if any pattern fails.

If the student answers incorrectly, we only know that at least one pattern failed. We don't know which one(s).

This asymmetry is fundamental to our inference approach.

Conjunctive Bayesian Knowledge Tracing

Standard BKT (Bayesian Knowledge Tracing) models a single skill as a hidden Markov model:

  • Hidden state: Does the student know the skill? (binary)
  • Observation: Did the student answer correctly? (binary)
  • Parameters: P(L₀) initial knowledge, P(T) learning rate, P(S) slip rate, P(G) guess rate

The update equations use Bayes' theorem:

P(known | correct) = P(correct | known) × P(known) / P(correct)
                   = (1 - P(slip)) × P(known) / P(correct)

P(known | incorrect) = P(incorrect | known) × P(known) / P(incorrect)
                     = P(slip) × P(known) / P(incorrect)
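
In code, the single-skill update (including the learning transition P(T) applied after each opportunity) looks roughly like this — a sketch with illustrative parameter names, not a verbatim excerpt of our implementation:

interface BktParams {
  pInit: number;    // P(L0): prior probability the skill is already known
  pTransit: number; // P(T): probability of learning after an opportunity
  pSlip: number;    // P(S): probability of erring despite knowing
  pGuess: number;   // P(G): probability of succeeding despite not knowing
}

function bktUpdate(pKnown: number, correct: boolean, params: BktParams): number {
  const { pSlip, pGuess, pTransit } = params;

  // P(correct) marginalized over the hidden knowledge state
  const pCorrect = pKnown * (1 - pSlip) + (1 - pKnown) * pGuess;

  // Bayes: posterior probability the skill was known, given the observation
  const posterior = correct
    ? (pKnown * (1 - pSlip)) / pCorrect
    : (pKnown * pSlip) / (1 - pCorrect);

  // Learning transition: the student may acquire the skill from this opportunity
  return posterior + (1 - posterior) * pTransit;
}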

Extension to Multi-Pattern Problems

For problems involving multiple patterns, we extend BKT with a conjunctive model:

On a correct answer: All patterns receive positive evidence. We update each pattern independently using the standard BKT correct-answer update.

On an incorrect answer: We distribute "blame" probabilistically. Patterns that the student is less likely to have automated receive more of the blame.

The blame distribution formula:

blame(pattern) ∝ (1 - P(known_pattern))

A pattern with P(known) = 0.3 gets more blame than a pattern with P(known) = 0.9. This is intuitive: if a student has demonstrated automaticity of a pattern many times, an error is less likely to be caused by that pattern.

The Blame-Weighted Update

For each pattern in an incorrect multi-pattern problem:

// Calculate the total "unknown mass" across all patterns in the problem
const totalUnknown = patterns.reduce((sum, p) => sum + (1 - p.pKnown), 0);

for (const pattern of patterns) {
  // This pattern's share of the blame
  const blameWeight = (1 - pattern.pKnown) / totalUnknown;

  // Calculate what the full negative update would be
  const fullNegativeUpdate = bktUpdate(pattern.pKnown, false, params);

  // Apply a weighted blend: more blame → more negative update
  pattern.pKnown =
    pattern.pKnown * (1 - blameWeight) + fullNegativeUpdate * blameWeight;
}

This creates a soft attribution: patterns that likely caused the error receive stronger negative evidence, while patterns that are probably automated receive only weak negative evidence.

Edge Case: All Patterns Automated

What if all patterns have high P(known)? Then the error is probably a slip (random error despite knowledge), and we distribute blame evenly:

if (totalUnknown < 0.001) {
  // All patterns appear automated — must be a slip
  const evenWeight = 1 / patterns.length;
  // Apply the full negative update with even distribution across patterns
  for (const pattern of patterns) {
    const fullNegativeUpdate = bktUpdate(pattern.pKnown, false, params);
    pattern.pKnown =
      pattern.pKnown * (1 - evenWeight) + fullNegativeUpdate * evenWeight;
  }
}

Methodological Note: Heuristic vs. True Bayesian Inference

The blame distribution formula above is a heuristic approximation, not proper Bayesian inference. True conjunctive BKT would compute the posterior probability that each skill is unknown given the failure:

P(¬known_i | fail) = P(fail ∧ ¬known_i) / P(fail)

This requires marginalizing over all 2^n possible knowledge states—computationally tractable for n ≤ 6 skills (our typical case), but more complex to implement.
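
For concreteness, the exact computation can be done by brute-force enumeration over all 2^n knowledge states. This sketch assumes the conjunctive model with shared slip/guess parameters; it illustrates the approach rather than reproducing our exact implementation:

// Exact posterior P(skill i is unknown | problem failed), by enumerating all
// 2^n joint knowledge states. Tractable for the small n (≤ 6) typical of one problem.
function exactBlamePosterior(
  pKnown: number[],
  pSlip: number,
  pGuess: number,
): number[] {
  const n = pKnown.length;
  const pFailAndUnknown: number[] = new Array(n).fill(0);
  let pFail = 0;

  // Enumerate every joint knowledge state (bit i set = skill i known)
  for (let state = 0; state < 1 << n; state++) {
    let pState = 1;
    let pAllSucceed = 1;
    for (let i = 0; i < n; i++) {
      const known = (state >> i) & 1;
      pState *= known ? pKnown[i] : 1 - pKnown[i];
      // Conjunctive model: every skill must succeed for the problem to succeed
      pAllSucceed *= known ? 1 - pSlip : pGuess;
    }
    const pFailGivenState = 1 - pAllSucceed;
    pFail += pState * pFailGivenState;
    for (let i = 0; i < n; i++) {
      if (!((state >> i) & 1)) {
        pFailAndUnknown[i] += pState * pFailGivenState;
      }
    }
  }

  // P(¬known_i | fail) = P(fail ∧ ¬known_i) / P(fail)
  return pFailAndUnknown.map((p) => p / pFail);
}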

We validated both approaches using our journey simulator across 5 random seeds and 3 learner profiles:

Method               Mean BKT-Truth Correlation   Wins
Heuristic (linear)   0.394                        3/5
Bayesian (exact)     0.356                        2/5

t-test (5 seeds): t = -0.41, p > 0.05 — no significant difference.
BKT-Truth Correlation: Heuristic vs Bayesian Blame Attribution

Fast learner profiles across 5 random seeds. Higher correlation = BKT estimates track true mastery more accurately.


Result: No statistically significant difference. The heuristic's softer blame attribution appears equally effective—possibly more robust to the noise inherent in learning dynamics.

We retain the Bayesian implementation for reproducibility and potential future research (source code), but the production system uses the simpler heuristic. Full validation data is available in our blame attribution test suite.

Evidence Quality Modifiers

Not all observations are equally informative. We weight the evidence based on help level and response time.

Evidence Weight by Help Level

Using hints or scaffolding reduces evidence strength. A correct answer with full solution shown provides only 50% of the evidence weight.

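A sketch of how the weighting can be applied, reusing the bktUpdate sketch from earlier. Only the 50% figure for a fully shown solution comes from the description above; the intermediate weights are illustrative placeholders:

type HelpLevel = 'none' | 'hint' | 'decomposition' | 'fullSolution';

const EVIDENCE_WEIGHT: Record<HelpLevel, number> = {
  none: 1.0,           // unaided answer: full evidence
  hint: 0.8,           // illustrative placeholder
  decomposition: 0.65, // illustrative placeholder
  fullSolution: 0.5,   // correct answer with full solution shown: half weight
};

// Blend toward "no change" as the evidence weight decreases.
function weightedBktUpdate(
  pKnown: number,
  correct: boolean,
  helpLevel: HelpLevel,
  params: BktParams,
): number {
  const weight = EVIDENCE_WEIGHT[helpLevel];
  const fullUpdate = bktUpdate(pKnown, correct, params);
  return pKnown * (1 - weight) + fullUpdate * weight;
}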

Automaticity-Aware Problem Generation

Problem generation involves two concerns:

  1. Skill targeting (BKT-based): Identifies which skills need practice and prioritizes them
  2. Cost calculation: Controls problem difficulty by budgeting cognitive load

Both concerns now use BKT. We experimented with separating them—using BKT only for targeting while using fluency (recent streak consistency) for cost calculation—but found that using BKT for both produces equivalent results while simplifying the architecture.

Complexity Budgeting

We budget problem complexity based on the student's estimated mastery from BKT. When BKT confidence is low (< 30%), we fall back to fluency-based estimates.

Complexity Costing

Each pattern has a base complexity cost:

  • Basic patterns: 0 (trivial)
  • Five-complement patterns: 1 (one mental decomposition)
  • Ten-complement patterns: 2 (cross-column operation)
  • Mixed/cascading: 3 (multi-column propagation)

Automaticity Multipliers

The cost is scaled by the student's estimated mastery from BKT. The multiplier uses a non-linear (squared) mapping from P(known) to provide better differentiation at high mastery levels. When BKT confidence is insufficient (< 30%), we fall back to discrete fluency states based on recent streaks.
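
One squared mapping with this shape (illustrative constants, not necessarily the exact production formula) reproduces the figures quoted below and in the worked example later in this post:

// Hypothetical multiplier: fully automated (P(known) = 1) costs 1x,
// P(known) = 0.5 costs ~3.3x. The exact production constants may differ.
function automaticityMultiplier(pKnown: number): number {
  return 1 + 3 * (1 - pKnown * pKnown);
}

automaticityMultiplier(0.95); // ≈ 1.3
automaticityMultiplier(0.5);  // = 3.25 ≈ 3.3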

Non-Linear Cost Multiplier Curve

The squared mapping provides better differentiation at high mastery levels: a skill at 50% P(known) costs roughly 3.3× as much as a fully automated skill.

Adaptive Session Planning

A practice session has a complexity budget. The problem generator:

  1. Selects terms that exercise the target patterns for the current curriculum phase
  2. Simulates the problem to extract actual patterns exercised
  3. Calculates total complexity: Σ(base_cost × automaticity_multiplier) for each pattern
  4. Accepts the problem only if it fits the session's complexity budget (see the sketch below)
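
A minimal sketch of the acceptance check in step 4, using the GeneratedProblem shape defined earlier (the helper signatures are illustrative):

// Accept a candidate problem only if it fits the remaining session budget.
function fitsComplexityBudget(
  problem: GeneratedProblem,
  budgetRemaining: number,
  baseCost: (skillId: string) => number,    // 0-3, per the costing table above
  multiplier: (skillId: string) => number,  // automaticity multiplier from BKT
): boolean {
  const totalComplexity = problem.patternsExercised.reduce(
    (sum, skillId) => sum + baseCost(skillId) * multiplier(skillId),
    0,
  );
  return totalComplexity <= budgetRemaining;
}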

This creates natural adaptation:

  • A student who has automated ten-complements gets harder problems (their multiplier is low)
  • A student still learning ten-complements gets simpler problems (their multiplier is high)

// Same problem, different complexity for different students:
const problem = [7, 6]; // 7 + 6 = 13, requires tenComplements.6=10-4

// Student A: BKT P(known) = 0.95 for ten-complements
const complexityA = 2 * 1.3; // = 2.6 — easy for this student

// Student B: BKT P(known) = 0.50 for ten-complements
const complexityB = 2 * 3.3; // = 6.6 — challenging for this student

Adaptive Skill Targeting

Beyond controlling difficulty, BKT identifies which skills need practice.

Identifying Weak Skills

When planning a practice session, we analyze BKT results to find skills that are:

  • Confident: The model has enough data (confidence ≥ 30%)
  • Weak: The estimated P(known) is below threshold (< 50%)

function identifyWeakSkills(bktResults: Map<string, BktResult>): string[] {
  const weakSkills: string[] = [];
  for (const [skillId, result] of bktResults) {
    if (result.confidence >= 0.3 && result.pKnown < 0.5) {
      weakSkills.push(skillId);
    }
  }
  return weakSkills;
}

The confidence threshold prevents acting on insufficient data. A skill practiced only twice might show low P(known), but we don't have enough evidence to trust that estimate.

Targeting Weak Skills in Problem Generation

Identified weak skills are added to the problem generator's targetSkills constraint. This biases problem generation toward exercises that include the weak pattern—not by making problems easier, but by ensuring the student gets practice on what they need.

// In session planning:
const weakSkills = identifyWeakSkills(bktResults);

// Add weak skills to focus slot targets
for (const slot of focusSlots) {
  slot.targetSkills = [...slot.targetSkills, ...weakSkills];
}

The Budget Trap (and How We Avoided It)

When we first tried using BKT P(known) as a cost multiplier, we hit a problem: skills with low P(known) got high multipliers, making them expensive. If we only used cost filtering, the budget would exclude weak skills—students would never practice what they needed most.

The solution was skill targeting: BKT identifies weak skills and adds them to the problem generator's required targets. This ensures weak skills appear in problems regardless of their cost. The complexity budget still applies, but it filters problem structure (number of terms, digit ranges), not which skills can appear.

A student struggling with ten-complements gets problems that include ten-complements (targeting), while the problem complexity stays within their budget (fewer terms, simpler starting values).

Honest Uncertainty Reporting

Our system explicitly tracks and reports confidence alongside skill estimates.

Confidence Calculation

Confidence increases with more data and more consistent observations:

function calculateConfidence(
  opportunities: number,
  successRate: number,
): number {
  // More data → more confidence (asymptotic to 1)
  const dataConfidence = 1 - Math.exp(-opportunities / 20);

  // Extreme success rates → more confidence
  const extremity = Math.abs(successRate - 0.5) * 2;
  const consistencyBonus = extremity * 0.2;

  return Math.min(1, dataConfidence + consistencyBonus);
}

With 10 opportunities, we're ~40% confident. With 50 opportunities, we're ~92% confident.

Uncertainty Ranges

We display P(known) with an uncertainty range that widens as confidence decreases:

Pattern: tenComplements.4=10-6
Estimated automaticity: ~73%
Confidence: moderate
Range: 58% - 88%

This honest framing prevents over-claiming. A "73% automaticity" with low confidence is very different from "73% automaticity" with high confidence.

Staleness Indicators

We track when each pattern was last practiced and display warnings:

Days Since Practice   Warning
< 7                   (none)
7-14                  "Not practiced recently"
14-30                 "Getting rusty"
> 30                  "Very stale — may need review"

Importantly, we show staleness as a separate indicator, not by decaying P(known). The student might still have the pattern automated; we just haven't observed it recently.

Architecture: Lazy Computation

A key architectural decision: we don't store BKT state persistently. Instead, we:

  1. Store raw problem results (correct/incorrect, timestamp, response time, help level)
  2. Compute BKT on-demand when viewing the skills dashboard
  3. Replay history chronologically to build up current P(known) estimates

This has several advantages:

  • No database migrations when we tune BKT parameters
  • Can experiment with different algorithms without data loss
  • User controls (confidence threshold slider) work instantly
  • Estimated computation time: ~50ms for a full dashboard with 100+ problems
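
In outline, the on-demand computation replays stored results through the updates described earlier. This is a sketch (types and blame handling abbreviated), not the production module:

interface StoredResult {
  timestamp: number;
  correct: boolean;
  patternsExercised: string[];
}

// Rebuild current P(known) estimates by replaying raw results in order,
// reusing the bktUpdate sketch and blame-weighted blend shown earlier.
function replayBkt(history: StoredResult[], params: BktParams): Map<string, number> {
  const pKnown = new Map<string, number>();
  const ordered = [...history].sort((a, b) => a.timestamp - b.timestamp);

  for (const result of ordered) {
    const priors = result.patternsExercised.map(
      (skill) => pKnown.get(skill) ?? params.pInit,
    );
    const totalUnknown = priors.reduce((sum, p) => sum + (1 - p), 0);

    result.patternsExercised.forEach((skill, i) => {
      const prior = priors[i];
      if (result.correct) {
        // Correct: every exercised pattern gets the standard positive update
        pKnown.set(skill, bktUpdate(prior, true, params));
      } else {
        // Incorrect: blame-weighted blend toward the negative update
        const blame =
          totalUnknown > 0.001 ? (1 - prior) / totalUnknown : 1 / priors.length;
        const negative = bktUpdate(prior, false, params);
        pKnown.set(skill, prior * (1 - blame) + negative * blame);
      }
    });
  }
  return pKnown;
}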

Automaticity Classification

Once we have a P(known) estimate with sufficient confidence, we classify each skill into one of three zones:

  • Struggling (P(known) < 50%): The student likely hasn't internalized this pattern yet. Problems using this skill will feel difficult and error-prone.
  • Learning (P(known) 50-80%): The student is developing competence but hasn't achieved automaticity. They can usually get it right but need to think about it.
  • Automated (P(known) > 80%): The pattern is internalized. The student can apply it quickly and reliably without conscious effort.

The confidence threshold is user-adjustable (default 50%), allowing teachers to be more or less strict about what counts as "confident enough to classify." Skills with insufficient data remain in "Learning" until more evidence accumulates.
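
As a sketch, using the zone boundaries above and the user-adjustable confidence threshold:

type AutomaticityZone = 'struggling' | 'learning' | 'automated';

// Without enough confidence we stay in the middle zone rather than over-claiming.
function classifyAutomaticity(
  pKnown: number,
  confidence: number,
  confidenceThreshold = 0.5, // user-adjustable default
): AutomaticityZone {
  if (confidence < confidenceThreshold) return 'learning';
  if (pKnown < 0.5) return 'struggling';
  if (pKnown <= 0.8) return 'learning';
  return 'automated';
}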

P(known) Classification Zones

Skills are classified into zones based on their P(known) value when confidence meets threshold. Low-confidence estimates default to "Learning" regardless of P(known).


Note: Classification requires confidence ≥ threshold (default 50%). Skills with insufficient data are always classified as "Learning" until more evidence accumulates.

Skill-Specific Difficulty Model

Not all soroban patterns are equally difficult to master. Our student simulation model incorporates skill-specific difficulty multipliers based on pedagogical observation:

  • Basic skills (direct bead manipulation): Easiest to master, multiplier 0.8-0.9x
  • Five-complements (single-column decomposition): Moderate difficulty, multiplier 1.2-1.3x
  • Ten-complements (cross-column carrying): Hardest, multiplier 1.6-2.1x

These multipliers affect the Hill function's K parameter (the exposure count where P(correct) = 50%). A skill with multiplier 2.0x requires twice as many practice exposures to reach the same mastery level.

The interactive charts below show how these difficulty multipliers affect learning trajectories. Data is derived from validated simulation tests (source code).


Validation: Does Adaptive Targeting Actually Work?

We built a journey simulator to compare three modes across controlled scenarios:

  • Classic: Uniform skill distribution, fluency-based difficulty
  • Adaptive (fluency): BKT skill targeting, fluency-based difficulty
  • Adaptive (full BKT): BKT skill targeting, BKT-based difficulty

Simulation Framework

The simulator models student learning using:

  • Hill function learning model: P(correct) = exposure^n / (K^n + exposure^n), where exposure is the number of times the student has practiced a skill
  • Conjunctive model: Multi-skill problems require all skills to succeed—P(correct) is the product of individual skill probabilities
  • Per-skill deficiency profiles: Each test case starts one skill at zero exposure, with all prerequisites mastered
  • Cognitive fatigue tracking: Sum of difficulty multipliers for each skill in each problem—measures the mental effort required per session

The Hill function creates realistic learning curves: early practice yields slow improvement (building foundation), then understanding "clicks" (rapid gains), then asymptotic approach to mastery.
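
A compact sketch of the simulated student's success model. The functional forms follow the description above; the base K, the Hill exponent n, and the example values are illustrative:

// Hill-function learning curve: P(correct) as a function of practice exposures.
// K is the exposure count at which P(correct) = 0.5; n controls steepness.
function hillPCorrect(exposure: number, K: number, n = 2): number {
  return exposure ** n / (K ** n + exposure ** n);
}

// Skill-specific difficulty multipliers scale K: a 2.0x skill needs twice the
// exposures to reach the same mastery. A base K of 10 is an illustrative value.
const kFor = (difficultyMultiplier: number, baseK = 10) =>
  baseK * difficultyMultiplier;

// Conjunctive model: a multi-skill problem is correct only if every skill succeeds.
function problemPCorrect(skillPCorrects: number[]): number {
  return skillPCorrects.reduce((product, p) => product * p, 1);
}

// Example: a ten-complement skill (2.0x) after 10 exposures, combined with a
// well-mastered basic skill (P ≈ 0.98):
problemPCorrect([hillPCorrect(10, kFor(2.0)), 0.98]); // ≈ 0.20 × 0.98 ≈ 0.20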

The Measurement Challenge

Our first validation attempt measured overall problem accuracy—but this penalized adaptive mode for doing its job. When adaptive generates problems targeting weak skills, those problems have lower P(correct) by design.

The solution: per-skill assessment without learning. After practice sessions, we assess each student's mastery of the originally-deficient skill using trials that don't increment exposure. This measures true mastery independent of problem selection effects.

// Assessment that doesn't pollute learning state
assessSkill(skillId: string, trials: number = 20): SkillAssessment {
  const trueProbability = this.getTrueProbability(skillId)
  // Run trials WITHOUT incrementing exposure
  let correct = 0
  for (let i = 0; i < trials; i++) {
    if (this.rng.chance(trueProbability)) correct++
  }
  return { skillId, trueProbability, assessedAccuracy: correct / trials }
}

Convergence Speed Results

The key question: How fast does each mode bring a weak skill to mastery? The data below is generated from our journey simulator test suite (source code).

  • Adaptive wins to 50% mastery: 4-0
  • Adaptive wins to 80% mastery: 6-0
  • Faster mastery with adaptive: 25-100%

3-Way Comparison: BKT vs Fluency Multipliers

We also compared whether using BKT for cost calculation (in addition to targeting) provides additional benefit over fluency-based cost calculation.

  • Learning rate (fluency-based vs BKT-based cost): same
  • Using BKT for both concerns: simpler architecture
  • Where the benefit comes from: targeting

Sessions to 80% Mastery: 3-Way Comparison

Both adaptive modes perform identically—the benefit comes from BKT targeting, not from BKT-based cost calculation.


Why Adaptive Wins

The mechanism is straightforward:

  1. BKT identifies skills with low P(known) and sufficient confidence
  2. These skills are added to targetSkills in problem generation
  3. The student gets more exposure to weak skills
  4. More exposure → faster mastery (via Hill function)

In our simulations, adaptive mode provided ~5% more exposure to deficient skills on average. This modest increase compounds across sessions into significant mastery differences.

Remaining Research Questions

  1. Real-world validation: Do simulated results hold with actual students?
  2. Optimal thresholds: Are P(known) < 0.5 and confidence ≥ 0.3 the right cutoffs?
  3. Targeting aggressiveness: Should we weight weak skills more heavily in generation?
  4. Cross-student priors: Can aggregate data improve initial estimates for new students?

If you're interested in the educational data mining aspects of this work, reach out.

Limitations

Simulation-Only Validation

The validation results reported here are derived entirely from simulated students, not human learners. Our simulator assumes:

  • Hill function learning curves: Mastery probability increases with exposure according to P = exposure^n / (K^n + exposure^n). Real students may exhibit plateau effects, regression, or non-monotonic learning.
  • Probabilistic slips: Errors on mastered skills are random with fixed probability. Real errors may reflect systematic misconceptions that BKT handles poorly.
  • Independent skill application: The conjunctive model assumes each skill is applied independently within a problem.

The "25-33% faster mastery" finding should be interpreted as: given students who learn according to our model assumptions, adaptive targeting accelerates simulated progress. Whether this transfers to human learners remains an open empirical question.

The Technique Bypass Problem

BKT infers skill mastery from answer correctness, but correct answers don't guarantee proper technique. A student might:

  • Use mental arithmetic instead of bead manipulation
  • Count on fingers rather than applying complement rules
  • Arrive at correct answers through inefficient multi-step processes

Our system cannot distinguish "correct via proper abacus technique" from "correct via alternative method." This is partially mitigated by:

  • Response time: Properly automated technique should be faster than mental workarounds
  • Visualization mode: When students use the on-screen abacus, we observe their actual bead movements
  • Pattern complexity: Higher-digit problems are harder to solve via mental math, making technique bypass less viable

Definitive detection of technique usage would require video analysis or teacher observation—areas for future integration.

Independent Failure Assumption

The blame attribution formula treats skill failures as independent parallel events:

blame(skill_i) ∝ (1 - P(known_i))

In reality, foundational skill failures may trigger cognitive cascades. If a student fails basic.directAddition, they may become confused and subsequently fail fiveComplements even if they "know" it. Our model cannot distinguish:

  • "Failed because didn't know the complement rule"
  • "Failed because earlier confusion disrupted working memory"

This is a known limitation of standard BKT. More sophisticated models (e.g., Deep Knowledge Tracing, or models with prerequisite dependencies) could potentially capture these effects, at the cost of interpretability and sample efficiency.

Why We Built This (And What's Next)

This research was conducted to validate the core idea of skill-targeted problem generation before deploying it in abaci.one—an automatic proctoring system designed to run soroban practice sessions without requiring constant teacher supervision.

The simulation results gave us confidence that the approach is sound in principle. We've now deployed these algorithms in the live system, which is designed to collect detailed data from every practice session:

  • Problem-by-problem response times and correctness
  • Help usage patterns (hints, decomposition views, full solutions)
  • Skill exposure sequences and mastery trajectories
  • Session-level fatigue and engagement indicators

We plan to publish a follow-up analysis once we've collected sufficient data from real students. This will let us answer the questions our simulator cannot:

  • Do real students learn according to Hill-like curves, or something else?
  • Does adaptive targeting actually accelerate mastery in practice?
  • How accurate are our BKT estimates compared to teacher assessments?
  • What failure modes emerge that our simulation didn't anticipate?

Until then, the claims in this post should be understood as validated in simulation, pending real-world confirmation.

Summary

Building an intelligent tutoring system for soroban arithmetic required solving a fundamental inference problem: how do you know which pattern failed when you only observe binary problem outcomes?

Our approach combines:

  1. Simulation-based pattern tagging at problem-generation time
  2. Conjunctive BKT with probabilistic blame distribution
  3. Evidence quality weighting based on help level and response time
  4. Unified BKT architecture: BKT drives both difficulty adjustment and skill targeting
  5. Honest uncertainty reporting with confidence intervals
  6. Simulation-validated adaptive targeting that may reach mastery 25-33% faster than uniform practice (pending real-world confirmation)

The key insight from our simulation studies: the benefit of adaptive practice comes from targeting weak skills, not from the specific formula used for difficulty adjustment. BKT targeting ensures students practice what they need; the complexity budget ensures they're not overwhelmed.

The result is a system that adapts to each student's actual pattern automaticity, not just their overall accuracy—focusing practice where it matters most while honestly communicating what it knows and doesn't know.


This post describes the pattern tracing system built into abaci.one, a free soroban practice application. The full source code is available on GitHub.

References

  • Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253-278.

  • Pardos, Z. A., & Heffernan, N. T. (2011). KT-IDEM: Introducing item difficulty to the knowledge tracing model. In International Conference on User Modeling, Adaptation, and Personalization (pp. 243-254). Springer.

  • Baker, R. S., Corbett, A. T., & Aleven, V. (2008). More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In International Conference on Intelligent Tutoring Systems (pp. 406-415). Springer.