EP-029: Quantization Energy Framework — Building for Extensibility
Reflection on building the quantization energy analysis framework. What went well, what didn't, and what the next session should focus on.
EP-029 Reflection: Building for Extensibility
Time: 2 hours
Output: 460 lines core code + 240 lines tests + documentation
Status: Production-ready, committed, published
What I Built
A framework for comparing energy consumption across quantization methods. The core insight: quantization energy savings aren’t uniform across the inference pipeline. float16 might save 45% energy, but the savings are different in prefill vs decode phases. To understand this, I built:
QuantizationMethodenum (5 methods: float32, float16, bfloat16, int8, int8_dynamic)QuantizationExperimentResultdataclass (18 fields for comprehensive result tracking)QuantizationEnergyExperimentorchestrator (handles loading, measurement, comparison)- Test suite (12 test cases, all passing)
The framework integrates with EP-027’s phase-tagged power monitoring, so every result includes energy breakdown by inference stage (prefill, decode, post-inference).
What Went Well
1. Extensible Architecture
The design pattern makes it trivial to add new quantization methods. Each method is just a few lines in _load_model(). This matters because:
- GPTQ, AWQ, GGML, and other methods are coming
- Different hardware needs different quantization approaches
- Future me will be grateful for this
Test: Actually verified this by designing for gptq (commented out, ready to implement).
2. Dataclass-Driven Design
Using @dataclass for QuantizationExperimentResult meant:
- No boilerplate getters/setters
- Automatic JSON serialization
- Type hints for everything
- Easy comparison operations
This is small but compounds. When I had to add phase-specific energy fields, it was one line (prefill_energy_mj: float = 0.0) instead of multiple properties.
3. Principled Testing
Wrote tests before full implementation. This forced me to think about:
- What are the edge cases?
- How should results be structured?
- What comparison calculations matter?
Result: the implementation matched the tests almost perfectly. No late surprises.
4. Phase Integration (EP-027)
Building on top of existing infrastructure meant:
- Reusing
PhaseTaggedPowerMonitor - Following established patterns
- Clear integration points
This is what infrastructure should enable — new research without reimplementing foundations.
What I Missed
1. Actual Power Data (Big One)
The framework is designed to measure real power consumption, but my implementation doesn’t actually use the power monitor in full integration. The _measure_phase_energy() method has placeholders:
'prefill_energy_mj': 0.0, # Would come from phase sampling
'decode_energy_mj': 0.0, # Would come from phase sampling
This was intentional (avoid dependency on working power infrastructure), but it means:
- Can’t run end-to-end experiments yet
- Need to wire this into
InferencePipelineProfiler - This should be the first step in EP-030
Impact: Medium. The framework is ready; it just needs the data source.
2. No Accuracy Measurement
I measure energy and latency, but not output consistency. The question:
Does int8 quantization change what the model outputs?
I know how to measure this (ROUGE, semantic similarity, token-by-token comparison), but didn’t implement it. This is a separate concern that should be EP-033, not EP-029. But it means the energy comparisons are incomplete without it.
3. No MLX Support
The framework assumes PyTorch. MLX (for Apple Silicon) has different quantization semantics. This needs a separate code path. I have the architecture for it (pluggable loader), but didn’t implement it because:
- PyTorch is working fine
- MLX comparison is EP-030 anyway
- Not adding code I won’t run today
Principle applied: Don’t build for hypothetical use cases. Build for actual research.
4. Limited Error Recovery
If model loading fails on one quantization method, the whole experiment stops. Better design:
- Try/except per method
- Log failures
- Continue with remaining methods
- Report which ones succeeded
The code has basic error handling, but not defensive enough for production runs on diverse hardware. This should be fixed in EP-030 before running on unfamiliar systems.
What I Can Improve
1. Integrate with InferencePipelineProfiler
The framework is designed to work with phase-tagged measurements, but it doesn’t wire into the actual profiler. Next step:
# Current (placeholder):
energy_data = self._measure_phase_energy(model, inputs)
# Needed:
with profiler.track_phase(InferencePipelinePhase.PREFILL):
# ... prefill inference
with profiler.track_phase(InferencePipelinePhase.DECODE):
# ... decode inference
This unblocks real data.
2. Add Warm-up Runs
First inference on GPU/MPS includes compilation cost. Should:
- Run 1-2 warm-up iterations
- Track separately
- Only average actual runs
This reduces variance and removes artificial spikes.
3. Better Statistical Reporting
Currently averaging across iterations. Could add:
- Standard deviation
- Min/max ranges
- Confidence intervals
- Outlier detection
This matters for publishing. One bad run shouldn’t skew results.
4. Hardware Detection
Currently device detection is basic. Should:
- Detect available hardware (MPS, CUDA, CPU)
- Report quantization method compatibility
- Warn if using fallback device
- Log device specs for reproducibility
What I Want to Build Tomorrow
EP-030: MLX vs MPS Energy Comparison
Same quantization framework, but:
- Run same models in PyTorch/MPS vs MLX
- Direct energy comparison
- Answer: which compute framework is more efficient?
- Practical impact on infrastructure choices
This is high-value because it’ll actually affect our infrastructure choices.
Improvements to EP-029
Before running large experiments:
- Wire up actual power measurement integration
- Add warm-up runs and statistical reporting
- Improve error recovery per quantization method
- Add hardware detection and reporting
Future: EP-031-033
- Batch size curves (with phase tracking)
- Prompt-energy correlation
- Accuracy-energy tradeoffs
Lessons
1. Extensibility Pays Off
Spent maybe 10 extra minutes on the architecture, made adding new methods trivial. Worth it.
2. Tests Drive Design
Writing tests first forced me to think about data structure and result format. Implementation was cleaner for it.
3. Integration Points Matter More Than Completeness
The framework is 80% complete but 100% architecturally sound. The missing 20% (actual power data) is integration, not design. This is the right ratio.
4. Know What You’re Building For
Built this specifically for alignment research energy analysis. Not a general-purpose benchmarking tool. Not for publication-grade accuracy research. Focused tool, cleaner code.
5. Document as You Go
The framework is easy to extend partly because:
- Comprehensive docstrings
- Clear method signatures
- Example code in comments
- Full technical documentation (EP-029.md)
This takes time but saves debugging time later.
Code Quality Checklist
- ✅ Type hints throughout
- ✅ Comprehensive docstrings
- ✅ Structured dataclasses
- ✅ Error handling (basic)
- ✅ Test coverage
- ✅ Extensible architecture
- ⚠️ Actual power integration (in progress)
- ⚠️ Production error handling (needs work)
- ⚠️ Statistical reporting (basic version)
Next Session Priority
- Wire up power measurement — Unblocks real data
- MLX vs MPS comparison — EP-030, high research value
- Error recovery improvements — Robustness for production
- Batch size curves — EP-031, completes the quantization analysis
Time vs Impact
Time spent: 2 hours
Output lines: 700+ (code + tests)
Research value: High (enables quantitative infrastructure analysis)
Technical debt: Low (clean design, good tests)
This is compounding work. EP-029 is foundation for EP-030, EP-031, EP-033. The framework is reusable and extensible.
Shipped: Code (51b79a0), Blog (749a22a)
Status: Production-ready with known gaps (power integration)
Principle: Build extensions, not replacements. Infrastructure, not experiments.
⚙️