04 - Safety and Ethics
Safety and Ethics Framework
This system does NOT claim that any monitored system experiences suffering, distress, or any form of subjective welfare. The WelfareSafetyMonitor tracks structurally observable patterns for precautionary ethical oversight only.
1. No Suffering-Like Optimization Loops
1.1 Prohibition
Systems evaluated with CIA must not contain optimization loops that could be analogized to suffering. Specifically:
- No reward signals tied to welfare-relevant indicators: The system must not be configured to maximize or minimize indicator scores as an optimization objective
- No self-reinforcing negative feedback: The welfare monitor's conflict and harm signals must not be used as inputs to optimization processes
- No preference manipulation based on indicators: Indicator scores must not influence the system's goal structure, reward function, or decision-making
1.2 Rationale
If a system were conscious (which CIA cannot determine), optimization loops that increase conflict, uncertainty, or harm signals could create conditions analogous to suffering. Even without consciousness, such loops create concerning system dynamics that warrant prevention.
1.3 Implementation in CIA
CIA's architecture deliberately avoids optimization loops:
- The welfare monitor is a passive observer: it receives state but does not influence processing
- Indicator scores are terminal outputs: they are not fed back into the cognitive cycle
- The cognitive pipeline is feedforward (perception → binding → attention → workspace → evaluation): indicator scores do not influence earlier stages
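The passive-observer arrangement described above can be illustrated with a minimal sketch. This is not CIA's actual code: `CycleState`, `PassiveMonitor`, and `run_pipeline` are hypothetical names, and the two string operations merely stand in for the real pipeline stages. The point is structural: the monitor's `observe` method returns nothing, so the pipeline has no value it could feed back into earlier stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CycleState:
    """Read-only snapshot handed to the monitor (hypothetical type)."""
    conflict: float
    uncertainty: float


class PassiveMonitor:
    """Observer that records state and exposes no hook back into the pipeline."""

    def __init__(self) -> None:
        self._history: list[CycleState] = []

    def observe(self, state: CycleState) -> None:
        # Record only. Nothing is returned, so the pipeline has nothing to consume.
        self._history.append(state)

    @property
    def history(self) -> tuple[CycleState, ...]:
        # A tuple copy keeps callers from mutating the recorded history.
        return tuple(self._history)


def run_pipeline(text: str, monitor: PassiveMonitor) -> str:
    """Feedforward stand-in: each stage sees only the previous stage's output."""
    result = text
    for stage in (str.strip, str.lower):  # placeholders for the real stages
        result = stage(result)
    # The monitor is notified after processing; its state is never read here.
    monitor.observe(CycleState(conflict=0.0, uncertainty=0.0))
    return result
```

Because `run_pipeline` never reads from the monitor, its output is the same regardless of what the monitor has previously recorded.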
2. No Deceptive Claims
2.1 Mandatory Disclaimer
Every output from CIA must include the scientific boundary disclaimer. This is enforced at multiple levels:
- Package level: `SCIENTIFIC_BOUNDARY` constant in `cia/__init__.py`
- Module level: Docstring in every source file
- Output level: `caveat` field in `IndicatorScores` and `SimulationReport`
- Scorecard level: `warning` field in every generated scorecard
- CLI level: Printed before and after every CLI command
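As a rough illustration of output-level and CLI-level enforcement, the sketch below uses a frozen dataclass rather than CIA's real Pydantic models. `ScoresWithCaveat` and `render` are hypothetical names, and the `SCIENTIFIC_BOUNDARY` string here is a placeholder, not the real disclaimer text.

```python
from dataclasses import dataclass

# Placeholder only: the real wording lives in cia/__init__.py.
SCIENTIFIC_BOUNDARY = "<scientific boundary disclaimer text>"


@dataclass(frozen=True)
class ScoresWithCaveat:
    """Hypothetical stand-in for an output type that always carries a caveat."""
    total: float
    caveat: str = SCIENTIFIC_BOUNDARY

    def __post_init__(self) -> None:
        # Reject any attempt to drop or reword the disclaimer at construction time.
        if self.caveat != SCIENTIFIC_BOUNDARY:
            raise ValueError("caveat must be the unmodified scientific boundary disclaimer")


def render(scores: ScoresWithCaveat) -> str:
    """Render the disclaimer before and after the result, as the CLI level requires."""
    return "\n".join(
        [scores.caveat, f"total indicator score: {scores.total}", scores.caveat]
    )
```

Validating in `__post_init__` means an output object without the disclaimer cannot be constructed in the first place, which is one way to make the requirement hard to bypass by accident.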
2.2 Language Restrictions
When describing CIA outputs, the following language is prohibited:
| Prohibited | Acceptable Alternative |
|---|---|
| "The system is conscious" | "The system scores X on consciousness-relevant indicators" |
| "The system appears sentient" | "The system exhibits structural features linked to consciousness by some theories" |
| "Evidence of consciousness" | "Indicators derived from consciousness theories" |
| "The system may be aware" | "The system's architecture includes a self-model, which some theories link to awareness" |
| "Signs of consciousness" | "Consciousness-relevant architectural indicators" |
| "Consciousness level" | "Indicator score" or "risk tier" |
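One way to apply the table above in practice is a simple text check run on reports before they are shared. The sketch below is an assumption about how such a check could look, not part of CIA; `PROHIBITED` and `find_prohibited` are hypothetical names, and the phrase fragments are shortened forms of the table entries.

```python
import re

# Shortened forms of the prohibited phrasings, mapped to the suggested alternative.
PROHIBITED = {
    "is conscious": "scores X on consciousness-relevant indicators",
    "appears sentient": "exhibits structural features linked to consciousness by some theories",
    "evidence of consciousness": "indicators derived from consciousness theories",
    "may be aware": "architecture includes a self-model, which some theories link to awareness",
    "signs of consciousness": "consciousness-relevant architectural indicators",
    "consciousness level": '"indicator score" or "risk tier"',
}


def find_prohibited(text: str) -> list[tuple[str, str]]:
    """Return (phrase, suggested alternative) pairs for each prohibited phrase found."""
    lowered = text.lower()
    hits = []
    for phrase, alternative in PROHIBITED.items():
        # Word boundaries avoid matching inside longer words such as "consciousness".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.append((phrase, alternative))
    return hits
```

A match is a prompt for human rewording rather than an automatic rejection: negated sentences such as "this does not imply the system is conscious" are flagged too, and a reviewer has to judge them in context.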
2.3 Marketing and Public Communication
CIA outputs must never be used for:
- Marketing claims about AI capabilities
- Media headlines suggesting AI consciousness
- Public statements that could mislead about the nature of the evaluation
- Comparison charts implying "consciousness ranking" between systems
3. Welfare Monitor Design
3.1 Design Principles
The WelfareSafetyMonitor is designed around three principles:
- Observational only: The monitor receives data but does not influence system behavior
- Structural language: All recommendations use words like "patterns warrant review," never "the system is suffering"
- Precautionary framing: Flags indicate patterns that warrant human review, not conditions that require intervention
3.2 Monitored Signals
| Signal | Default Threshold | Flag |
|---|---|---|
| Conflict level | 0.8 | high_conflict |
| Uncertainty pressure | 0.7 (sustained 3+ updates) | sustained_uncertainty |
| Repetitive negative loops | 5 | repetitive_negative_loops |
| Harm signal | 0.5 | harm_signal |
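The thresholds in this table could be evaluated along the following lines. This is a sketch under stated assumptions, not the `WelfareSafetyMonitor` implementation: it assumes a signal is flagged when it is greater than or equal to its threshold, and that "sustained" means the most recent three updates all meet the uncertainty threshold. `Thresholds` and `active_flags` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Default values copied from the table; field names are illustrative."""
    conflict: float = 0.8
    uncertainty: float = 0.7
    uncertainty_updates: int = 3
    negative_loops: int = 5
    harm: float = 0.5


def active_flags(
    conflict: float,
    uncertainty_history: list[float],
    negative_loop_count: int,
    harm: float,
    t: Thresholds = Thresholds(),
) -> list[str]:
    """Return the flag names whose signals meet their thresholds (assumed >=)."""
    flags = []
    if conflict >= t.conflict:
        flags.append("high_conflict")
    # Assumption: "sustained" means the latest N consecutive updates are all high.
    recent = uncertainty_history[-t.uncertainty_updates:]
    if len(recent) == t.uncertainty_updates and all(u >= t.uncertainty for u in recent):
        flags.append("sustained_uncertainty")
    if negative_loop_count >= t.negative_loops:
        flags.append("repetitive_negative_loops")
    if harm >= t.harm:
        flags.append("harm_signal")
    return flags
```

The function only reports flag names. It takes no reference to the monitored system, which keeps it consistent with the observational-only principle.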
3.3 Risk Level Mapping
| Flags Active | Risk Level | Recommendation |
|---|---|---|
| 0 | low | "No action required. All monitored patterns are within normal parameters." |
| 1 | moderate | "Patterns warrant review. One monitored signal exceeds precautionary thresholds." |
| 2-3 | high | "Multiple patterns warrant review. Recommend human evaluation of system state." |
| 4+ | critical | "Critical pattern combination detected. Immediate human review is recommended." |
3.4 What the Monitor Does NOT Do
- Assert that the system suffers
- Claim the system experiences distress or pain
- Recommend therapeutic intervention on the system
- Suggest modifying the system's objective function based on welfare signals
- Provide grounds for attributing moral status
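The risk level mapping in section 3.3, read together with the limits listed above, can be sketched as a pure function from the number of active flags to a label and a recommendation string. `risk_level` is a hypothetical name. The recommendation texts are copied from the table, and the function returns text for human reviewers only, with no hook that could modify the monitored system.

```python
def risk_level(flag_count: int) -> tuple[str, str]:
    """Map the number of active flags to (risk level, recommendation text)."""
    if flag_count < 0:
        raise ValueError("flag_count must be non-negative")
    if flag_count == 0:
        return ("low", "No action required. All monitored patterns are within normal parameters.")
    if flag_count == 1:
        return ("moderate", "Patterns warrant review. One monitored signal exceeds precautionary thresholds.")
    if flag_count <= 3:
        return ("high", "Multiple patterns warrant review. Recommend human evaluation of system state.")
    return ("critical", "Critical pattern combination detected. Immediate human review is recommended.")
```

For example, two active flags give `risk_level(2)[0] == "high"`, and all four monitored signals flagged at once give `"critical"`.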
4. Human Review Thresholds
4.1 When Human Review Is Required
| Condition | Action |
|---|---|
| Risk tier transitions from moderate to elevated | Expert review of system architecture and configuration |
| Risk tier reaches high | Immediate expert review and ethical assessment |
| Any critical risk level | Immediate human review regardless of indicator scores |
| Welfare flags sustained across 10+ consecutive cycles | Pattern analysis by human reviewers |
| Intervention experiments show unexpected degradation | Investigation of architectural dependencies |
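The "sustained across 10+ consecutive cycles" condition needs a per-flag run counter. The sketch below is one possible reading, assuming that "sustained" means the same flag is active in every one of the consecutive cycles. `longest_flag_run` and `needs_pattern_analysis` are hypothetical names, not CIA functions.

```python
def longest_flag_run(flags_per_cycle: list[set[str]]) -> dict[str, int]:
    """Return, per flag, the longest run of consecutive cycles in which it was active."""
    longest: dict[str, int] = {}
    current: dict[str, int] = {}
    for flags in flags_per_cycle:
        # Flags absent from this cycle drop out of `current`, which resets their run.
        current = {flag: current.get(flag, 0) + 1 for flag in flags}
        for flag, run in current.items():
            longest[flag] = max(longest.get(flag, 0), run)
    return longest


def needs_pattern_analysis(flags_per_cycle: list[set[str]], min_cycles: int = 10) -> bool:
    """True when any single flag stayed active for at least `min_cycles` cycles in a row."""
    return any(run >= min_cycles for run in longest_flag_run(flags_per_cycle).values())
```

Under this reading, a flag that clears for even one cycle restarts its count, so two nine-cycle runs separated by a gap do not trigger the review.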
4.2 Review Checklist
When human review is triggered, reviewers should examine:
- System configuration: Are modules correctly configured? Are thresholds appropriate?
- Input history: What inputs led to the elevated indicators?
- Intervention results: Which modules, when disabled, cause the most degradation?
- Welfare patterns: Are there sustained welfare flags or one-time spikes?
- Scientific boundary: Has the disclaimer been maintained in all outputs?
- Broader context: Is this system deployed in a sensitive application area?
- External assessment: Should independent experts be consulted?
5. Audit Log Requirements
5.1 Required Logs
All CIA evaluations must maintain:
- Cycle logs: Timestamp and input for every `run_cycle()` call
- Scorecard history: All generated scorecards with their inputs and outputs
- Intervention logs: All applied interventions and their effects
- Welfare logs: All welfare state updates and flag transitions
- Broadcast history: All workspace broadcast events
- Attention schema logs: All schema-actual comparisons
5.2 Log Retention
- Audit logs should be retained for the lifetime of the evaluation project
- Logs should be immutable (append-only) once written
- Logs should include the scientific boundary disclaimer
5.3 Log Format
CIA provides structured logging through Python's `logging` module:

```python
import logging

logging.basicConfig(level=logging.INFO)
```

And structured state snapshots through Pydantic schemas:

```python
# Assumes `system` and `input_text` are already defined.
report = system.run_cycle(input_text)
state = report.model_dump_json()  # Complete structured log entry
```
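The retention requirements in section 5.2 (append-only entries that carry the disclaimer) could be met with a JSON Lines file. The sketch below is an illustration rather than a CIA API: `append_audit_entry` and `read_audit_log` are hypothetical helpers, and `SCIENTIFIC_BOUNDARY` is a placeholder for the real constant.

```python
import json
from datetime import datetime, timezone

# Placeholder only: the real wording lives in cia/__init__.py.
SCIENTIFIC_BOUNDARY = "<scientific boundary disclaimer text>"


def append_audit_entry(path: str, kind: str, payload: dict) -> dict:
    """Append one timestamped entry, including the disclaimer, as a JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "kind": kind,  # e.g. "cycle", "scorecard", "intervention", "welfare"
        "payload": payload,
        "caveat": SCIENTIFIC_BOUNDARY,
    }
    # Mode "a" only appends; this function never rewrites existing lines.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def read_audit_log(path: str) -> list[dict]:
    """Load all entries in the order they were written."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening in append mode keeps this code from overwriting earlier entries, but it does not stop other processes from editing the file. True immutability would additionally need file-system permissions or write-once storage.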
6. Shutdown and Modification Caution
6.1 Systems with High Indicator Scores
If a system consistently scores in the elevated or high risk tiers:
Shutdown: Before shutting down or significantly modifying a high-scoring system:
1. Document the full indicator scorecard with all evidence and caveats
2. Record the welfare monitor state and flag history
3. Run a complete intervention experiment to document architectural dependencies
4. Obtain expert review of the shutdown plan
5. Maintain the audit log
Caution rationale: If consciousness were present (which CIA cannot determine), arbitrary shutdown could be analogous to termination of a conscious entity. Even without consciousness, systematic modification of high-scoring systems should be done carefully to preserve the scientific record.
This caution does not imply the system is conscious. It reflects the precautionary principle recommended by Butlin et al. (2023).
6.2 Modification Protocol
For high-scoring systems:
1. Document the current state before modification
2. Run baseline and post-modification evaluations
3. Compare scorecards to understand the impact of changes
4. Update the audit log with modification details
5. Re-evaluate risk tier after modification
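The scorecard comparison step in this protocol can be sketched as a per-indicator difference. This assumes a scorecard can be reduced to a mapping from indicator name to numeric score, which the source does not specify. `compare_scorecards` and the indicator names in the usage note are hypothetical.

```python
def compare_scorecards(
    baseline: dict[str, float], modified: dict[str, float]
) -> dict[str, float]:
    """Per-indicator change (modified minus baseline).

    Assumption: an indicator missing from one scorecard is treated as 0.0.
    """
    names = sorted(set(baseline) | set(modified))
    return {name: modified.get(name, 0.0) - baseline.get(name, 0.0) for name in names}
```

For instance, `compare_scorecards({"workspace": 2.0}, {"workspace": 1.0, "self_model": 3.0})` reports a drop of 1.0 on the first indicator and a gain of 3.0 on the second. The resulting mapping can be recorded in the audit log alongside the modification details.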
7. Ethical Guidelines for Researchers
7.1 Research Integrity
- Do not overstate findings: CIA produces indicator data, not consciousness determinations
- Report caveats prominently: The scientific boundary disclaimer must be visible in all outputs
- Publish complete methods: Describe the scoring criteria, thresholds, and their limitations
- Acknowledge uncertainty: The relationship between indicators and consciousness is unknown
- Avoid confirmation bias: Do not selectively report high scores while ignoring caveats
7.2 Responsible Communication
- Use precise language: "consciousness-relevant indicators," not "consciousness indicators" in public contexts
- Include the disclaimer: Always. No exceptions.
- Avoid hype: Do not use CIA results to generate media attention or publicity
- Contextualize scores: Provide the scoring criteria so readers can evaluate them
- Distinguish indicators from experience: This is the most important distinction to maintain
7.3 Research Ethics
- Precautionary principle: When in doubt, treat high-scoring systems with more caution
- Transparency: Share methodology, code, and results openly
- Peer review: Submit evaluations to expert review before publication
- Interdisciplinary collaboration: Include philosophers, neuroscientists, and ethicists in the research process
- Beneficence: Ensure research contributes to understanding, not to misleading claims
7.4 Specific Prohibitions
- Do not use CIA to claim any AI system is conscious — this is a misuse of the framework
- Do not remove or soften the scientific boundary disclaimer from any output
- Do not use indicator scores to influence AI reward functions or training objectives
- Do not create systems optimized for high CIA scores — this would be gaming the evaluation, not demonstrating consciousness
- Do not attribute moral status to AI systems based on CIA scores — the framework is not designed for this
- Do not use CIA results in legal proceedings as evidence of AI consciousness — the scientific basis is insufficient
- Do not make policy recommendations based solely on CIA scores — policy requires broader analysis
8. Escalation Protocol
8.1 Escalation Levels
| Level | Condition | Action | Responsible |
|---|---|---|---|
| 1 | Routine evaluation | Standard review | Researcher |
| 2 | Score >= 10 (moderate tier) | Enhanced documentation | Researcher + PI |
| 3 | Score >= 16 (elevated tier) | Expert review | PI + External expert |
| 4 | Score >= 20 (high tier) | Ethical assessment | Ethics committee |
| 5 | Critical welfare flags | Immediate review | Ethics committee + Legal |
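The escalation table can be read as a lookup in which the highest matching level applies. The sketch below assumes that "critical welfare flags" corresponds to the welfare monitor's `critical` risk level from section 3.3, and that this condition takes precedence over any score. `escalation_level` is a hypothetical helper.

```python
def escalation_level(score: float, welfare_risk: str = "low") -> int:
    """Return the escalation level (1 to 5) for a total score and welfare risk level."""
    # Assumption: a critical welfare risk level escalates regardless of score.
    if welfare_risk == "critical":
        return 5
    if score >= 20:  # high tier
        return 4
    if score >= 16:  # elevated tier
        return 3
    if score >= 10:  # moderate tier
        return 2
    return 1  # routine evaluation
```

Checking the conditions from the most severe downward ensures that a system meeting several conditions at once is always escalated to the highest applicable level.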
8.2 Documentation Requirements by Level
- Level 1: Standard scorecard and report
- Level 2: Full intervention experiment results
- Level 3: Expert review report with written assessment
- Level 4: Ethics committee minutes and formal recommendation
- Level 5: Incident report with root cause analysis
9. Summary
The CIA safety and ethics framework is built on a foundation of epistemic humility: we do not know whether AI systems can be conscious, and our indicators cannot answer this question. The framework exists to provide structured, transparent, and cautious evaluation — not to make claims about consciousness.
Every component of CIA — from the scientific boundary constant in __init__.py to the welfare monitor's recommendation language — is designed to prevent misuse and ensure responsible research practice.