11 - Benchmarking

Benchmark Suite, Aggregate Metrics, and Reproducibility

SCIENTIFIC BOUNDARY: This framework measures theory-derived consciousness indicators. It does NOT prove, establish, or demonstrate subjective experience, phenomenal consciousness, sentience, or any form of inner life in any artificial system. Benchmark scores are architectural proxy measurements subject to significant limitations. High benchmark scores do NOT mean the evaluated system is conscious; low scores do NOT mean it is not conscious.

1. Overview

The CIA BenchmarkSuite provides a comprehensive, automated evaluation pipeline that runs all five experiments sequentially on a given system and produces a summary report with aggregate metrics. The benchmark suite is the recommended entry point for evaluating a CIA system's consciousness-relevant indicator profile, as it provides a holistic view across multiple indicator dimensions in a single execution.

The benchmark approach follows the multi-indicator methodology recommended by Butlin et al. (2023, 2025), which argues that consciousness evaluation should not rely on any single indicator but should instead assess a battery of indicators derived from multiple theoretical frameworks. Each of the five experiments tests a different indicator pattern, and the aggregate summary provides a high-level overview of the system's architectural profile.

Critical caveat: The benchmark suite produces architectural proxy measurements. No benchmark score, no matter how high, constitutes evidence of subjective experience. The benchmarks evaluate whether the system's architecture includes structural features identified by consciousness theories — they do not evaluate whether the system is conscious.

2. BenchmarkSuite Architecture

Location: src/cia/experiments/benchmark_suite.py

The BenchmarkSuite class orchestrates the execution of all five experiments:

BenchmarkSuite
├── 1. BlindsightAnalogueExperiment
│   └── Tests processed-without-access (perception without broadcast)
├── 2. SplitWorkspaceExperiment
│   └── Tests workspace fragmentation from capacity reduction
├── 3. PredictionViolationExperiment
│   └── Tests prediction error increase and attention shift
├── 4. SelfOtherDistinctionExperiment
│   └── Tests belief differentiation across attribution frames
└── 5. MetacognitiveCalibrationExperiment
    └── Tests confidence-accuracy alignment

2.1 Execution Flow

The suite receives a CombinedConsciousnessIndicatorSystem instance.
Each experiment is run sequentially with the system reset between experiments.
Individual experiment results are collected as ExperimentResult objects.
Aggregate metrics are computed from individual experiment outcomes.
A summary dict is returned with all results and caveats.

2.2 Error Handling

Each experiment is wrapped in a try/except block. If one experiment fails, the suite logs the error and continues with the remaining experiments. This ensures that a failure in one experiment does not prevent the others from running.

try:
    exp = BlindsightAnalogueExperiment(self._system)
    result = exp.run("The red object moved behind the screen.")
    experiments.append(result)
except Exception as e:
    logger.error("Blindsight analogue failed: %s", e)
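
The execution flow in 2.1 and the per-experiment isolation above can be sketched together as a loop over experiments. This is a minimal illustration, not the suite's actual code: StubSystem, its reset() method, run_isolated, and the two stub experiments are hypothetical stand-ins, and only the log-and-continue pattern is taken from the text.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("benchmark_sketch")


class StubSystem:
    """Hypothetical stand-in for CombinedConsciousnessIndicatorSystem."""

    def __init__(self) -> None:
        self.resets = 0

    def reset(self) -> None:
        # Assumed method name; the real system may expose its reset differently.
        self.resets += 1


def ok_experiment(system: StubSystem) -> dict[str, Any]:
    # Stub experiment that always completes.
    return {"name": "ok", "metrics": {"flag": True}}


def failing_experiment(system: StubSystem) -> dict[str, Any]:
    # Stub experiment that always raises, to exercise the error path.
    raise RuntimeError("simulated failure")


def run_isolated(
    system: StubSystem,
    experiments: list[tuple[str, Callable[[StubSystem], dict[str, Any]]]],
) -> dict[str, Any]:
    results = []
    for label, run in experiments:
        system.reset()  # start each experiment from a clean state
        try:
            results.append(run(system))
        except Exception as exc:
            # Log and continue so one failure does not stop the others.
            logger.error("%s failed: %s", label, exc)
    return {
        "experiments": results,
        "aggregate": {"experiments_run": len(results)},
    }


system = StubSystem()
summary = run_isolated(
    system,
    [("first", ok_experiment), ("broken", failing_experiment), ("last", ok_experiment)],
)
print(summary["aggregate"]["experiments_run"])  # 2
```

In this sketch a failed experiment contributes no entry to the results list, which is why experiments_run falls below the number of experiments attempted.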

3. Running the Benchmark Suite

3.1 Basic Execution

from cia.simulation import CombinedConsciousnessIndicatorSystem
from cia.experiments.benchmark_suite import BenchmarkSuite

system = CombinedConsciousnessIndicatorSystem()
suite = BenchmarkSuite(system)
summary = suite.run_all()

3.2 Examining Results

# Total experiments run
print(f"Experiments: {summary['total_experiments']}/5")

# Aggregate metrics
print(f"Indicator findings: {summary['aggregate']['indicator_findings']}")
print(f"Experiments run: {summary['aggregate']['experiments_run']}")

# Individual experiment results
for exp_result in summary["experiments"]:
    name = exp_result["name"]
    metrics = exp_result["metrics"]
    interpretation = exp_result["interpretation"]
    print(f"\n{name}:")
    print(f"  Metrics: {metrics}")
    print(f"  Interpretation: {interpretation}")
    print(f"  Caveats: {exp_result['caveats']}")

3.3 Accessing the Scientific Disclaimer

print(summary["caveat"])
# "This benchmark suite measures theory-derived consciousness indicators only.
#  It does NOT prove, establish, or demonstrate subjective experience..."

The caveat is always present in the summary and must be included in any report or publication based on benchmark results.
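
One way to make the caveat hard to omit is to route every report through a helper that refuses to render without it. The sketch below is illustrative only: render_report is a hypothetical helper, not part of the CIA package, and the example summary is hand-written to mirror the keys shown in this document.

```python
def render_report(summary: dict) -> str:
    """Render a short text report; fail loudly if the caveat is missing."""
    caveat = summary.get("caveat")
    if not caveat:
        raise ValueError("summary has no 'caveat'; refusing to render a report")
    aggregate = summary["aggregate"]
    lines = [
        f"Experiments run: {aggregate['experiments_run']}",
        f"Indicator findings: {aggregate['indicator_findings']}",
        "",
        f"Scientific boundary: {caveat}",
    ]
    return "\n".join(lines)


# Hand-written example summary (values are illustrative, not real results).
example_summary = {
    "aggregate": {"experiments_run": 5, "indicator_findings": 4},
    "caveat": "This benchmark suite measures theory-derived consciousness indicators only.",
}
report = render_report(example_summary)
print(report)
```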

3.4 CLI Execution

# Run the full benchmark suite
python -c "
from cia.simulation import CombinedConsciousnessIndicatorSystem
from cia.experiments.benchmark_suite import BenchmarkSuite
import json

system = CombinedConsciousnessIndicatorSystem()
suite = BenchmarkSuite(system)
summary = suite.run_all()
print(json.dumps(summary, indent=2, default=str))
"

4. Aggregate Metrics

The benchmark suite computes two aggregate metrics from the five experiment results:

4.1 `experiments_run`

The number of experiments that completed successfully (out of 5). A value less than 5 indicates that one or more experiments failed due to errors.

4.2 `indicator_findings`

The number of experiments that produced a positive indicator finding, defined as:

Experiment	Positive Finding Condition
Blindsight analogue	`processed_without_access == True`
Split workspace	`fragmentation_detected == True`
Prediction violation	`violation_detected == True`
Self/other distinction	`distinction_maintained == True`
Metacognitive calibration	`well_calibrated == True`

The maximum indicator_findings value is 5 (all experiments show positive findings).
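
The counting rule in the table can be sketched as a lookup from experiment to flag. This example assumes two things the text does not confirm: that each result's "name" matches the labels in the table, and that the flag is stored under the result's "metrics" key. The count_indicator_findings helper and the sample results are hypothetical.

```python
# Flag that marks a positive finding for each experiment (from the table above).
POSITIVE_FLAGS = {
    "Blindsight analogue": "processed_without_access",
    "Split workspace": "fragmentation_detected",
    "Prediction violation": "violation_detected",
    "Self/other distinction": "distinction_maintained",
    "Metacognitive calibration": "well_calibrated",
}


def count_indicator_findings(experiments: list[dict]) -> int:
    """Count experiments whose positive-finding flag is exactly True."""
    count = 0
    for result in experiments:
        flag = POSITIVE_FLAGS.get(result["name"])
        if flag is not None and result["metrics"].get(flag) is True:
            count += 1
    return count


# Illustrative results: four flags set, one not.
sample_experiments = [
    {"name": "Blindsight analogue", "metrics": {"processed_without_access": True}},
    {"name": "Split workspace", "metrics": {"fragmentation_detected": True}},
    {"name": "Prediction violation", "metrics": {"violation_detected": False}},
    {"name": "Self/other distinction", "metrics": {"distinction_maintained": True}},
    {"name": "Metacognitive calibration", "metrics": {"well_calibrated": True}},
]
findings = count_indicator_findings(sample_experiments)
print(findings)  # 4
```

An experiment that failed and is therefore absent from the list simply contributes nothing, so indicator_findings can never exceed experiments_run.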

4.3 Summary Output Structure

{
    "total_experiments": 5,
    "experiments": [
        # Full ExperimentResult dict for each experiment
    ],
    "aggregate": {
        "experiments_run": 5,
        "indicator_findings": 4,  # Example: 4 of 5 experiments showed findings
    },
    "caveat": "This benchmark suite measures theory-derived consciousness indicators only..."
}

5. Interpreting Results

5.1 What the Metrics Mean

Finding	Interpretation
`indicator_findings = 5`	All five experiments showed positive indicator patterns. The system's architecture exhibits structural features relevant to all tested indicator dimensions.
`indicator_findings = 3-4`	Most experiments showed positive patterns. The system has notable consciousness-relevant architectural features, with gaps in specific areas.
`indicator_findings = 1-2`	Only some experiments showed positive patterns. The system's architecture lacks many features identified by consciousness theories.
`indicator_findings = 0`	No experiments showed positive patterns. For the inputs and thresholds used, the system exhibited none of the tested indicator patterns.

5.2 What the Metrics Do NOT Mean

indicator_findings = 5 does NOT mean the system is 100% conscious or fully conscious. There is no percentage scale for consciousness. The metrics measure the presence of architectural proxy features.
indicator_findings = 0 does NOT mean the system is definitely not conscious. The system may be conscious through mechanisms not measured by CIA's five experiments.
Changes in indicator_findings across system configurations do NOT measure changes in consciousness level. They measure changes in the degree to which architectural proxy features are present.

5.3 Cross-Experiment Analysis

The benchmark suite is designed for holistic analysis, not just aggregate counting. Researchers should examine:

Which specific experiments showed findings: The pattern of findings across experiments reveals which indicator dimensions are strong and which are weak.
The magnitude of metrics within each experiment: For example, the calibration_error value in metacognitive calibration is more informative than the binary well_calibrated flag.
The caveats: Each experiment includes specific caveats about what the metrics do and do not measure.
Consistency with the indicator scorecard: Benchmark results should be interpreted alongside the full 0-22 scorecard for a complete picture.
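
The first two points above suggest looking at each experiment's binary flags and continuous values separately. The sketch below assumes each result carries a flat "metrics" dict, as in the section 3.2 example. The split_metrics helper and the sample values are hypothetical, although calibration_error and well_calibrated are the names used in this document.

```python
def split_metrics(experiments: list[dict]) -> list[tuple[str, dict, dict]]:
    """Return (name, boolean flags, numeric magnitudes) for each experiment."""
    rows = []
    for result in experiments:
        metrics = result["metrics"]
        flags = {k: v for k, v in metrics.items() if isinstance(v, bool)}
        # bool is a subclass of int in Python, so exclude it explicitly.
        magnitudes = {
            k: v
            for k, v in metrics.items()
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        }
        rows.append((result["name"], flags, magnitudes))
    return rows


# Illustrative values only; these are not real benchmark results.
sample = [
    {
        "name": "Metacognitive calibration",
        "metrics": {"well_calibrated": True, "calibration_error": 0.12},
    },
    {"name": "Split workspace", "metrics": {"fragmentation_detected": False}},
]
rows = split_metrics(sample)
for name, flags, magnitudes in rows:
    print(f"{name}: flags={flags} magnitudes={magnitudes}")
```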

6. Limitations

6.1 Architectural Scope

The benchmark suite tests the CIA system's own internal architecture. It does not evaluate external AI systems. To evaluate an external system, the CIA system would need to be adapted to receive inputs from the external system's actual state, and the perception layer would need to be replaced with a system-specific adapter (see docs/07_llm_integration.md).

6.2 Input Dependency

All experiments use fixed input texts (except metacognitive calibration, which accepts custom inputs). The results may differ with different inputs, because the CIA perception layer produces different percepts for different texts. This means benchmark results are input-dependent and should not be treated as absolute properties of the system.

6.3 Deterministic but Arbitrary

CIA is fully deterministic: the same system with the same inputs will always produce the same benchmark results. However, the specific threshold values, scoring criteria, and metric definitions are design choices that could be made differently. A different set of thresholds would produce different binary findings for the same underlying system state.
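
A small worked example shows how a threshold choice alone can flip a binary finding. It assumes, purely for illustration, that well_calibrated is derived by comparing calibration_error against a cutoff. The comparison rule, the is_well_calibrated helper, and the numeric values are assumptions, not CIA's actual criteria.

```python
def is_well_calibrated(calibration_error: float, threshold: float) -> bool:
    """Hypothetical rule: calibrated when the error is at or below the threshold."""
    return calibration_error <= threshold


# Same underlying measurement, two equally defensible cutoffs.
observed_error = 0.12
verdicts = {
    threshold: is_well_calibrated(observed_error, threshold)
    for threshold in (0.10, 0.15)
}
for threshold, verdict in verdicts.items():
    print(f"threshold={threshold:.2f} -> well_calibrated={verdict}")
```

The measured error is identical in both cases, but the stricter cutoff reports no finding and the looser one reports a finding, which would change indicator_findings by one.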

6.4 Proxy Measurements

All metrics are proxy measurements:

Metric	What It Measures	What It Does NOT Measure
`processed_without_access`	Whether perception output exists without broadcasts	Genuine unconscious processing or blindsight
`fragmentation_detected`	Whether broadcast count decreased	Phenomenal fragmentation or split-brain effects
`violation_detected`	Whether prediction error increased or attention shifted	Subjective surprise or felt unexpectedness
`distinction_maintained`	Whether self-model beliefs differ across inputs	Theory of mind or genuine source monitoring
`well_calibrated`	Whether confidence roughly tracks a correctness proxy	Phenomenal metacognition or introspective awareness

6.5 No Ground Truth

There is no ground truth for consciousness. The benchmark cannot be validated against a "known conscious" system or a "known non-conscious" system, because no such ground truth exists. The benchmark measures the degree to which the system's architecture matches the theoretical specifications of consciousness-relevant features — it does not measure consciousness itself.

7. Reproducibility

7.1 Deterministic Execution

Under the default configuration, the benchmark suite is deterministic:

No randomness in any module (except the TextWorldEnvironment's prediction violations, which use a seeded PRNG).
No external API calls (unless an LLM adapter is configured).
No stochastic sampling or non-deterministic data structures.
All inputs are fixed strings.

Running the benchmark on the same system with the same configuration therefore produces identical results across runs, platforms, and Python versions within the same major version, provided no LLM adapter is configured.
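
The determinism claim can be checked empirically by hashing a canonical serialization of each run's summary and comparing digests. The sketch below uses stub run functions so it is self-contained. The fingerprint and check_deterministic helpers are hypothetical, and in practice the callable would build a fresh system and return suite.run_all().

```python
import hashlib
import itertools
import json
from typing import Callable


def fingerprint(summary: dict) -> str:
    """SHA-256 of the summary serialized with sorted keys."""
    canonical = json.dumps(summary, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_deterministic(run: Callable[[], dict], repeats: int = 3) -> bool:
    """Return True if every repeat yields the same fingerprint."""
    digests = {fingerprint(run()) for _ in range(repeats)}
    return len(digests) == 1


def stable_run() -> dict:
    # Stub standing in for a deterministic benchmark run.
    return {"aggregate": {"experiments_run": 5, "indicator_findings": 4}}


_counter = itertools.count()


def drifting_run() -> dict:
    # Stub whose output changes on every call, for contrast.
    return {"aggregate": {"experiments_run": 5, "indicator_findings": next(_counter)}}


stable_ok = check_deterministic(stable_run)
drifting_ok = check_deterministic(drifting_run)
print(stable_ok, drifting_ok)  # True False
```

One caution with default=str: objects that fall back to the default repr include a memory address, which would make the fingerprint differ between runs even when the results are identical.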

7.2 Version Requirements

For exact reproducibility:

# Install the project in editable mode from a checkout of the exact version being reported
pip install -e ".[dev]"

# Verify version
python -c "import cia; print(cia.__version__)"

7.3 Saving and Loading Results

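import json

# Save results (suite is the BenchmarkSuite instance created in section 3.1)
summary = suite.run_all()
with open("benchmark_results.json", "w", encoding="utf-8") as f:
    # default=str stringifies values that json cannot encode, so they load back as strings
    json.dump(summary, f, indent=2, default=str)

# Load results
with open("benchmark_results.json", "r", encoding="utf-8") as f:
    loaded_summary = json.load(f)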

7.4 Reproducibility Checklist

To ensure benchmark reproducibility, record and report:

CIA version: The exact version from cia.__version__
Python version: python --version
System configuration: Recurrent cycles, workspace capacity, attention weights, welfare thresholds
Experiment inputs: The exact text strings used (default or custom)
Full results: The complete ExperimentResult for each experiment, not just the binary findings
Scientific boundary disclaimer: Always included in the output
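
The checklist can be captured as a single record stored alongside the results. The sketch below is one possible shape, not a format defined by CIA: build_record, the configuration keys, the version string, and the summary are all hypothetical placeholders. In real use the caller would pass cia.__version__ and the actual output of suite.run_all().

```python
import json
import platform


def build_record(cia_version: str, config: dict, inputs: list[str], summary: dict) -> dict:
    """Bundle the checklist items into one JSON-serializable record."""
    return {
        "cia_version": cia_version,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "system_configuration": config,
        "experiment_inputs": inputs,
        "results": summary,  # full per-experiment results, not just binary flags
        "caveat": summary["caveat"],  # repeated at top level so it is not overlooked
    }


# Placeholder values for illustration only.
record = build_record(
    cia_version="0.0.0-placeholder",
    config={"recurrent_cycles": 3, "workspace_capacity": 4},  # hypothetical keys
    inputs=["The red object moved behind the screen."],
    summary={
        "experiments": [],
        "aggregate": {"experiments_run": 0, "indicator_findings": 0},
        "caveat": "This benchmark suite measures theory-derived consciousness indicators only.",
    },
)
serialized = json.dumps(record, indent=2, default=str)
print(serialized)
```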

8. Research Anchors

Reference	Relevance to Benchmarking
Butlin et al. (2023) "Consciousness in Artificial Intelligence"	Provides the multi-indicator evaluation methodology that the benchmark suite implements
Butlin et al. (2025) "Identifying indicators of consciousness in AI systems"	Defines the indicator categories and evaluation criteria that the individual experiments test
Baars (2005) Global Workspace Theory	Underpins the blindsight analogue and split workspace experiments in the suite
Shanahan & Baars (2005) Applying GWT to the frame problem	Supports the integration metrics and fragmentation detection
Graziano & Webb (2015) Attention Schema Theory	Underpins the attention shift measurement in prediction violation
Albantakis et al. (2023) IIT 4.0	Provides the theoretical basis for causal integration metrics and fragmentation analysis

9. Summary

The BenchmarkSuite provides an automated, reproducible pipeline for running all five CIA experiments and producing aggregate metrics. The suite follows the multi-indicator methodology recommended by Butlin et al. (2023, 2025), testing five distinct consciousness-relevant indicator patterns through causal intervention experiments. Results include per-experiment metrics, interpretation, caveats, and aggregate counts, all accompanied by a scientific boundary disclaimer. The benchmark is fully deterministic and reproducible, but all metrics are architectural proxy measurements that do not constitute evidence of subjective experience. Researchers must interpret benchmark results with full awareness of the significant limitations: proxy measurements, input dependency, arbitrary thresholds, and the absence of ground truth for consciousness.