Diagnostics
The adaptive recovery machinery: inline probe, failure-mode taxonomy, reward refiner.
curatorkit.diagnostic.failure_modes
Failure mode taxonomy for the CuratorKIT diagnostic loop.
FailureMode classifies WHY a sample was rejected by a quality gate (hallucination, reward, or diversity), so rejections become actionable: each mode maps to a concrete fix (lower the temperature, tighten the grounding prompt, regenerate the question, ...) instead of an opaque drop.
Recovery is INLINE: DiagnosticProbe attempts each probe path and, if any path succeeds, stores the passing sample in FailureDiagnosis.recovered_sample. There is no separate second pass; the pipeline routes recovered samples forward immediately.
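The inline routing described above can be sketched as a single loop. This is a minimal illustration, not CuratorKIT's implementation: route_inline, gate, and probe are hypothetical stand-ins, and plain strings stand in for DataSample.

```python
from typing import Callable, Optional


def route_inline(
    samples: list[str],
    gate: Callable[[str], bool],
    probe: Callable[[str], Optional[str]],
) -> tuple[list[str], list[str]]:
    """Single pass: accept, recover inline, or reject (hypothetical helper)."""
    accepted: list[str] = []
    rejected: list[str] = []
    for sample in samples:
        if gate(sample):
            accepted.append(sample)
            continue
        # The probe returns a passing re-generation, or None if it is exhausted.
        recovered = probe(sample)
        if recovered is not None:
            accepted.append(recovered)  # routed forward immediately, no pass 2
        else:
            rejected.append(sample)
    return accepted, rejected


# Toy quality gate: a sample passes if it ends with a period.
def gate(sample: str) -> bool:
    return sample.endswith(".")


# Toy probe: one repair attempt, re-checked against the same gate.
def probe(sample: str) -> Optional[str]:
    candidate = sample + "." if sample else sample
    return candidate if gate(candidate) else None


accepted, rejected = route_inline(
    ["Grounded answer.", "Missing period", ""], gate, probe
)
print(accepted)  # ['Grounded answer.', 'Missing period.']
print(rejected)  # ['']
```

The point of the sketch is that the recovered sample joins the accepted pool in the same iteration that rejected the original, so no second traversal of the rejected list is needed.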
FailureDiagnosis (dataclass)
FailureDiagnosis(mode: FailureMode, evidence: list[bool] = list(), probe_calls: int = 0, notes: dict[str, Any] = dict(), recovered_sample: DataSample | None = None)
Result of DiagnosticProbe.diagnose(). Attached to RejectedSample.diagnosis.
Fields
mode : FailureMode - the diagnosed cause
evidence : list[bool] - Probe 1 pass/fail pattern [T=0.3, T=0.5]
probe_calls : int - total LLM calls consumed by all probes
notes : dict - extra info (e.g. which prompt variant succeeded)
recovered_sample : DataSample | None - the passing re-generation from the probe, if any probe path succeeded. The pipeline routes this back into the accepted pool inline. None means all probes were exhausted.
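As a rough illustration of how these fields might be populated, the sketch below re-declares the dataclass locally and runs a temperature probe over T=0.3 and T=0.5. Several things here are assumptions rather than the library's code: the FailureMode member names, the probe_temperatures helper, the notes key, the use of str in place of DataSample, and the choice of one LLM call per temperature. The mutable defaults shown as list() and dict() in the signature are written with field(default_factory=...), which is what a Python dataclass requires.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional


class FailureMode(Enum):
    # Hypothetical member names; the source only names the three gates.
    HALLUCINATION = "hallucination"
    REWARD = "reward"
    DIVERSITY = "diversity"


@dataclass
class FailureDiagnosis:
    mode: FailureMode
    evidence: list[bool] = field(default_factory=list)
    probe_calls: int = 0
    notes: dict[str, Any] = field(default_factory=dict)
    recovered_sample: Optional[str] = None  # str stands in for DataSample


def probe_temperatures(
    mode: FailureMode,
    regenerate: Callable[[float], str],
    gate: Callable[[str], bool],
    temperatures: tuple[float, ...] = (0.3, 0.5),
) -> FailureDiagnosis:
    """Hypothetical probe: re-generate at each temperature and re-check the gate."""
    diagnosis = FailureDiagnosis(mode=mode)
    for temperature in temperatures:
        candidate = regenerate(temperature)
        diagnosis.probe_calls += 1  # assumes one LLM call per attempt
        passed = gate(candidate)
        diagnosis.evidence.append(passed)
        # Keep the first passing re-generation; later passes only add evidence.
        if passed and diagnosis.recovered_sample is None:
            diagnosis.recovered_sample = candidate
            diagnosis.notes["recovered_at_temperature"] = temperature
    return diagnosis


# Toy generator and gate: only the T=0.5 re-generation passes.
diagnosis = probe_temperatures(
    FailureMode.HALLUCINATION,
    regenerate=lambda t: f"answer@{t}",
    gate=lambda s: s.endswith("0.5"),
)
print(diagnosis.evidence)          # [False, True]
print(diagnosis.recovered_sample)  # answer@0.5
```

Running every temperature, rather than stopping at the first pass, keeps evidence the same length for every diagnosis, which makes the pass/fail patterns comparable across samples at the cost of extra probe calls.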
Typical uses
mode and evidence aggregate into per-mode rejection breakdowns (see PipelineDiagnostics and diagnostic_summary.json). probe_calls tracks the LLM budget the probe consumed, so recovery yield can be cost-normalised. A recovered_sample that is not None marks an actual inline recovery.
curatorkit.diagnostic.diagnostics
PipelineDiagnostics — run-level accumulator for failure diagnoses.
Held by the Pipeline instance when the probe is active. Passed through PipelineResult to Curator, then accessible to the caller via result.diagnostics.
Recovery is INLINE: probe_recovery_count() reports samples where the DiagnosticProbe actually produced a passing re-generation. This replaces the old hypothetical recovery_rate(), which counted RECOVERABLE dict flags rather than actual recoveries.
Typical uses
mode_counts() and probe_recovery_count() feed the per-mode rejection breakdown written to diagnostic_summary.json; total_probe_calls() tracks the LLM budget the probe consumed, so recovery yield can be cost-normalised.
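A run-level accumulator with these three methods could look roughly like the sketch below. Only the method names mode_counts(), probe_recovery_count(), and total_probe_calls() come from the text above; the class name PipelineDiagnosticsSketch, the record() and summary() methods, the reduced Diagnosis stand-in, and the JSON layout are all assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Diagnosis:
    # Reduced stand-in for FailureDiagnosis; mode is a plain string here.
    mode: str
    probe_calls: int = 0
    recovered_sample: Optional[str] = None


@dataclass
class PipelineDiagnosticsSketch:
    diagnoses: list[Diagnosis] = field(default_factory=list)

    def record(self, diagnosis: Diagnosis) -> None:
        # Hypothetical method name for adding one diagnosis to the run.
        self.diagnoses.append(diagnosis)

    def mode_counts(self) -> dict[str, int]:
        return dict(Counter(d.mode for d in self.diagnoses))

    def probe_recovery_count(self) -> int:
        # Counts actual recoveries, not samples merely flagged as recoverable.
        return sum(d.recovered_sample is not None for d in self.diagnoses)

    def total_probe_calls(self) -> int:
        return sum(d.probe_calls for d in self.diagnoses)

    def summary(self) -> dict:
        # Hypothetical layout for a diagnostic_summary.json-style report.
        calls = self.total_probe_calls()
        recovered = self.probe_recovery_count()
        return {
            "mode_counts": self.mode_counts(),
            "probe_recoveries": recovered,
            "total_probe_calls": calls,
            # Cost-normalised yield: recoveries per LLM call spent probing.
            "recoveries_per_probe_call": recovered / calls if calls else 0.0,
        }


diagnostics = PipelineDiagnosticsSketch()
diagnostics.record(Diagnosis("hallucination", probe_calls=2, recovered_sample="regenerated"))
diagnostics.record(Diagnosis("hallucination", probe_calls=2))
diagnostics.record(Diagnosis("diversity", probe_calls=4))

summary = diagnostics.summary()
print(json.dumps(summary, indent=2))
```

With this toy run the summary reports one recovery for eight probe calls, a yield of 0.125 recoveries per call, which is the kind of cost-normalised figure the text refers to.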