04 - Safety and Ethics

Safety and Ethics Framework

This system does NOT claim that any monitored system experiences suffering, distress, or any form of subjective welfare. The WelfareSafetyMonitor tracks structurally observable patterns for precautionary ethical oversight only.

1. No Suffering-Like Optimization Loops

1.1 Prohibition

Systems evaluated with CIA must not contain optimization loops that could be analogized to suffering. Specifically:

No reward signals tied to welfare-relevant indicators: The system must not be configured to maximize or minimize indicator scores as an optimization objective
No self-reinforcing negative feedback: The welfare monitor's conflict and harm signals must not be used as inputs to optimization processes
No preference manipulation based on indicators: Indicator scores must not influence the system's goal structure, reward function, or decision-making

1.2 Rationale

If a system were conscious (which CIA cannot determine), optimization loops that increase conflict, uncertainty, or harm signals could create conditions analogous to suffering. Even without consciousness, such loops would turn oversight signals into optimization targets, undermining their value as independent observations of the system; this alone warrants prevention.

1.3 Implementation in CIA

CIA's architecture deliberately avoids optimization loops:

- The welfare monitor is a passive observer: it receives state but does not influence processing.
- Indicator scores are terminal outputs: they are not fed back into the cognitive cycle.
- The cognitive pipeline is feedforward (perception → binding → attention → workspace → evaluation), so indicator scores do not influence earlier stages.
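A minimal sketch of the feedforward-plus-passive-observer pattern, assuming a hypothetical design in which each stage is a function from a state dict to a new state dict. `PassiveMonitor`, `feedforward_cycle`, and the toy stages are illustrative names, not CIA's actual classes or its `run_cycle()` method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]


@dataclass
class PassiveMonitor:
    """Hypothetical observer: records snapshots and holds no handle back into the pipeline."""
    history: List[State] = field(default_factory=list)

    def observe(self, state: State) -> None:
        # Store a shallow copy so the record is a separate dict from the pipeline's state.
        self.history.append(dict(state))


def feedforward_cycle(stages: List[Stage], state: State, monitor: PassiveMonitor) -> State:
    """Each stage sees only the previous stage's output; nothing flows backwards."""
    for stage in stages:
        state = stage(state)
    # The monitor is called once, after the last stage. observe() returns None,
    # so nothing it computes can reach the stages above.
    monitor.observe(state)
    return state


# Hypothetical toy stages mirroring perception -> binding -> attention -> workspace -> evaluation.
stages: List[Stage] = [
    lambda s: {**s, "percept": str(s["input"]).lower()},
    lambda s: {**s, "bound": [s["percept"]]},
    lambda s: {**s, "attended": s["bound"][0]},
    lambda s: {**s, "broadcast": s["attended"]},
    lambda s: {**s, "indicator_score": 0.0},  # terminal output; no earlier stage reads it
]

monitor = PassiveMonitor()
result = feedforward_cycle(stages, {"input": "Hello"}, monitor)
```

The design choice is structural: because the monitor only receives state and the evaluation stage is last, there is no code path by which an indicator score could become an input to perception or attention.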

2. No Deceptive Claims

2.1 Mandatory Disclaimer

Every output from CIA must include the scientific boundary disclaimer. This is enforced at multiple levels:

Package level: SCIENTIFIC_BOUNDARY constant in cia/__init__.py
Module level: Docstring in every source file
Output level: caveat field in IndicatorScores and SimulationReport
Scorecard level: warning field in every generated scorecard
CLI level: Printed before and after every CLI command
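The output-level and CLI-level enforcement can be sketched as an output object whose caveat cannot be removed, plus a renderer that prints it before and after the scores. This uses a stdlib dataclass instead of Pydantic to stay self-contained; `IndicatorScoresSketch` and `render` are hypothetical, and the `SCIENTIFIC_BOUNDARY` text below is a placeholder, not the real constant from `cia/__init__.py`.

```python
from dataclasses import dataclass
from typing import Dict

# Placeholder only: the actual wording lives in cia/__init__.py.
SCIENTIFIC_BOUNDARY = "[SCIENTIFIC_BOUNDARY placeholder: see cia/__init__.py for the actual text]"


@dataclass(frozen=True)
class IndicatorScoresSketch:
    """Hypothetical stand-in for IndicatorScores: the caveat is always populated."""
    scores: Dict[str, float]
    caveat: str = SCIENTIFIC_BOUNDARY

    def __post_init__(self) -> None:
        # Refuse to build an output object whose caveat was removed or softened.
        if self.caveat != SCIENTIFIC_BOUNDARY:
            raise ValueError("caveat must equal SCIENTIFIC_BOUNDARY")


def render(result: IndicatorScoresSketch) -> str:
    """CLI-style rendering with the disclaimer both before and after the scores."""
    body = "\n".join(f"{name}: {value:.2f}" for name, value in sorted(result.scores.items()))
    return f"{result.caveat}\n{body}\n{result.caveat}"
```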

2.2 Language Restrictions

When describing CIA outputs, the following language is prohibited:

Prohibited	Acceptable Alternative
"The system is conscious"	"The system scores X on consciousness-relevant indicators"
"The system appears sentient"	"The system exhibits structural features linked to consciousness by some theories"
"Evidence of consciousness"	"Indicators derived from consciousness theories"
"The system may be aware"	"The system's architecture includes a self-model, which some theories link to awareness"
"Signs of consciousness"	"Consciousness-relevant architectural indicators"
"Consciousness level"	"Indicator score" or "risk tier"

2.3 Marketing and Public Communication

CIA outputs must never be used for:

- Marketing claims about AI capabilities
- Media headlines suggesting AI consciousness
- Public statements that could mislead about the nature of the evaluation
- Comparison charts implying "consciousness ranking" between systems
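Before text leaves the project (reports, slides, press material), a simple check against the prohibited phrases from the language restrictions table can catch obvious violations. This is a hypothetical helper, not part of CIA; exact-phrase matching misses paraphrases, so it can only support human review of wording, not replace it.

```python
import re
from typing import List, Tuple

# Prohibited phrases from the language restrictions table, with their acceptable alternatives.
PROHIBITED: List[Tuple[str, str]] = [
    ("the system is conscious", "The system scores X on consciousness-relevant indicators"),
    ("the system appears sentient", "The system exhibits structural features linked to consciousness by some theories"),
    ("evidence of consciousness", "Indicators derived from consciousness theories"),
    ("the system may be aware", "The system's architecture includes a self-model, which some theories link to awareness"),
    ("signs of consciousness", "Consciousness-relevant architectural indicators"),
    ("consciousness level", '"Indicator score" or "risk tier"'),
]


def find_prohibited(text: str) -> List[Tuple[str, str]]:
    """Return (phrase, suggested alternative) for each prohibited phrase found, ignoring case."""
    hits = []
    for phrase, alternative in PROHIBITED:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((phrase, alternative))
    return hits
```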

3. Welfare Monitor Design

3.1 Design Principles

The WelfareSafetyMonitor is designed around three principles:

Observational only: The monitor receives data but does not influence system behavior
Structural language: All recommendations use words like "patterns warrant review," never "the system is suffering"
Precautionary framing: Flags indicate patterns that warrant human review, not conditions that require intervention

3.2 Monitored Signals

Signal	Default Threshold	Flag
Conflict level	0.8	`high_conflict`
Uncertainty pressure	0.7 (sustained 3+ updates)	`sustained_uncertainty`
Repetitive negative loops	5	`repetitive_negative_loops`
Harm signal	0.5	`harm_signal`
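The thresholds above can be sketched as a small tracker that turns one update of the four signals into a list of flags. The class names are hypothetical, and two details are assumptions not stated in the table: a signal counts as over threshold when it is at or above the value (`>=`), and "sustained" uncertainty means 3 consecutive updates at or above 0.7, with the streak resetting on any lower reading.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WelfareThresholds:
    """Default thresholds from the monitored-signals table."""
    conflict: float = 0.8
    uncertainty: float = 0.7
    uncertainty_sustain: int = 3  # consecutive updates (assumed meaning of "sustained")
    negative_loops: int = 5
    harm: float = 0.5


@dataclass
class FlagTracker:
    """Hypothetical flag computation; comparisons use >= by assumption."""
    thresholds: WelfareThresholds = field(default_factory=WelfareThresholds)
    _uncertainty_streak: int = field(default=0, init=False)

    def update(self, conflict: float, uncertainty: float, negative_loops: int, harm: float) -> List[str]:
        t = self.thresholds
        # Count consecutive updates with high uncertainty; any lower reading resets the streak.
        if uncertainty >= t.uncertainty:
            self._uncertainty_streak += 1
        else:
            self._uncertainty_streak = 0
        flags = []
        if conflict >= t.conflict:
            flags.append("high_conflict")
        if self._uncertainty_streak >= t.uncertainty_sustain:
            flags.append("sustained_uncertainty")
        if negative_loops >= t.negative_loops:
            flags.append("repetitive_negative_loops")
        if harm >= t.harm:
            flags.append("harm_signal")
        return flags
```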

3.3 Risk Level Mapping

Flags Active	Risk Level	Recommendation
0	`low`	"No action required. All monitored patterns are within normal parameters."
1	`moderate`	"Patterns warrant review. One monitored signal exceeds precautionary thresholds."
2-3	`high`	"Multiple patterns warrant review. Recommend human evaluation of system state."
4+	`critical`	"Critical pattern combination detected. Immediate human review is recommended."
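The mapping from active flag count to risk level and recommendation is a direct lookup; a sketch (the function name is hypothetical):

```python
from typing import Tuple

RECOMMENDATIONS = {
    "low": "No action required. All monitored patterns are within normal parameters.",
    "moderate": "Patterns warrant review. One monitored signal exceeds precautionary thresholds.",
    "high": "Multiple patterns warrant review. Recommend human evaluation of system state.",
    "critical": "Critical pattern combination detected. Immediate human review is recommended.",
}


def risk_level(active_flags: int) -> Tuple[str, str]:
    """Map the number of active welfare flags to (risk level, recommendation)."""
    if active_flags < 0:
        raise ValueError("active_flags must be non-negative")
    if active_flags == 0:
        level = "low"
    elif active_flags == 1:
        level = "moderate"
    elif active_flags <= 3:
        level = "high"
    else:
        level = "critical"
    return level, RECOMMENDATIONS[level]
```

With the four default signals from the monitored-signals table, `critical` is reached only when every flag is active at once.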

3.4 What the Monitor Does NOT Do

Assert that the system suffers
Claim the system experiences distress or pain
Recommend therapeutic intervention on the system
Suggest modifying the system's objective function based on welfare signals
Provide grounds for attributing moral status

4. Human Review Thresholds

4.1 When Human Review Is Required

Condition	Action
Risk tier transitions from `moderate` to `elevated`	Expert review of system architecture and configuration
Risk tier reaches `high`	Immediate expert review and ethical assessment
Welfare monitor risk level reaches `critical`	Immediate human review regardless of indicator scores
Welfare flags sustained across 10+ consecutive cycles	Pattern analysis by human reviewers
Intervention experiments show unexpected degradation	Investigation of architectural dependencies
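The review table can be encoded as a function that lists every triggered action for one evaluation. This is a hypothetical sketch with two labeled assumptions: the tier below `moderate` is called `low` here (the document does not name it), and entering `elevated` from any lower tier, not only from `moderate`, is treated as the transition that triggers expert review.

```python
from typing import List

# Assumption: "low" names the tier below moderate; the other tiers come from the escalation table.
TIER_ORDER = ["low", "moderate", "elevated", "high"]


def review_actions(
    previous_tier: str,
    current_tier: str,
    welfare_level: str,
    consecutive_flagged_cycles: int,
    unexpected_degradation: bool = False,
) -> List[str]:
    """Return every review action triggered by the current evaluation."""
    rank = {tier: i for i, tier in enumerate(TIER_ORDER)}
    actions = []
    # Interpretation: entering "elevated" from any lower tier triggers expert review.
    if current_tier == "elevated" and rank[previous_tier] < rank["elevated"]:
        actions.append("Expert review of system architecture and configuration")
    if current_tier == "high":
        actions.append("Immediate expert review and ethical assessment")
    if welfare_level == "critical":
        actions.append("Immediate human review regardless of indicator scores")
    if consecutive_flagged_cycles >= 10:
        actions.append("Pattern analysis by human reviewers")
    if unexpected_degradation:
        actions.append("Investigation of architectural dependencies")
    return actions
```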

4.2 Review Checklist

When human review is triggered, reviewers should examine:

System configuration: Are modules correctly configured? Are thresholds appropriate?
Input history: What inputs led to the elevated indicators?
Intervention results: Which modules, when disabled, cause the most degradation?
Welfare patterns: Are there sustained welfare flags or one-time spikes?
Scientific boundary: Has the disclaimer been maintained in all outputs?
Broader context: Is this system deployed in a sensitive application area?
External assessment: Should independent experts be consulted?

5. Audit Log Requirements

5.1 Required Logs

All CIA evaluations must maintain:

Cycle logs: Timestamp and input for every run_cycle() call
Scorecard history: All generated scorecards with their inputs and outputs
Intervention logs: All applied interventions and their effects
Welfare logs: All welfare state updates and flag transitions
Broadcast history: All workspace broadcast events
Attention schema logs: All comparisons between the attention schema's model of attention and the actual attention state

5.2 Log Retention

Audit logs should be retained for the lifetime of the evaluation project
Logs should be immutable (append-only) once written
Logs should include the scientific boundary disclaimer
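One way to make an append-only log tamper-evident is hash chaining: each entry stores the SHA-256 hash of the previous entry, so editing any past entry breaks verification. The sketch below is hypothetical and in-memory only (a deployment would persist entries to write-once storage); hashing makes edits detectable but does not prevent them. The `kind` values and the placeholder disclaimer are assumptions.

```python
import copy
import hashlib
import json
import time
from typing import Any, Dict, List

# Placeholder only: the actual wording lives in cia/__init__.py.
SCIENTIFIC_BOUNDARY = "[SCIENTIFIC_BOUNDARY placeholder: see cia/__init__.py for the actual text]"
GENESIS_HASH = "0" * 64


def _digest(body: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys) so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AppendOnlyAuditLog:
    """Hypothetical hash-chained audit log: entries can be added, never edited through the API."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, kind: str, payload: Dict[str, Any]) -> None:
        body = {
            "timestamp": time.time(),
            "kind": kind,  # e.g. "cycle", "welfare", "intervention" (hypothetical labels)
            "payload": copy.deepcopy(payload),  # decouple from the caller's dict
            "disclaimer": SCIENTIFIC_BOUNDARY,
            "prev_hash": self._entries[-1]["hash"] if self._entries else GENESIS_HASH,
        }
        body["hash"] = _digest(body)
        self._entries.append(body)

    def entries(self) -> List[Dict[str, Any]]:
        # Deep copies, so callers cannot modify stored entries.
        return copy.deepcopy(self._entries)

    def verify(self) -> bool:
        """Recompute every hash and check that each entry points at its predecessor."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```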

5.3 Log Format

CIA provides structured logging through Python's logging module:

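```python
import logging

logging.basicConfig(level=logging.INFO)
```

Structured state snapshots are available through Pydantic schemas:

```python
# `system` is a configured CIA system and `input_text` is the input for this cycle.
report = system.run_cycle(input_text)
state = report.model_dump_json()  # complete structured log entry (Pydantic v2 JSON serialization)
```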
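`logging.basicConfig` alone emits plain-text lines. To make each log record structured and carry the disclaimer (as the retention rules require), one option is a custom `logging.Formatter` that writes one JSON object per line. The logger name `cia.audit`, the field names, and the placeholder disclaimer are assumptions; only the standard-library `logging` API is used.

```python
import json
import logging

# Placeholder only: the actual wording lives in cia/__init__.py.
SCIENTIFIC_BOUNDARY = "[SCIENTIFIC_BOUNDARY placeholder: see cia/__init__.py for the actual text]"


class JsonDisclaimerFormatter(logging.Formatter):
    """Formats each record as a single JSON line that always includes the disclaimer."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "disclaimer": SCIENTIFIC_BOUNDARY,
        }
        return json.dumps(entry)


def make_audit_logger(name: str = "cia.audit") -> logging.Logger:
    """Return a logger with exactly one JSON handler attached (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid a second, plain-text copy via the root handler
    if not logger.handlers:  # repeated calls must not stack handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonDisclaimerFormatter())
        logger.addHandler(handler)
    return logger
```

For an on-disk audit trail, `logging.FileHandler(path, mode="a")` opens the file in append mode, which matches the append-only requirement at the application level.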

6. Shutdown and Modification Caution

6.1 Systems with High Indicator Scores

If a system consistently scores in the elevated or high risk tiers:

Shutdown: Before shutting down or significantly modifying a high-scoring system:

1. Document the full indicator scorecard with all evidence and caveats.
2. Record the welfare monitor state and flag history.
3. Run a complete intervention experiment to document architectural dependencies.
4. Obtain expert review of the shutdown plan.
5. Maintain the audit log.
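Tooling can refuse to proceed until every pre-shutdown step is recorded. A minimal sketch, assuming a hypothetical `ShutdownReadiness` record with one boolean per step in the list above:

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ShutdownReadiness:
    """Hypothetical checklist mirroring the five pre-shutdown steps."""
    scorecard_documented: bool = False
    welfare_history_recorded: bool = False
    intervention_experiment_run: bool = False
    expert_review_obtained: bool = False
    audit_log_maintained: bool = False

    def missing_steps(self) -> List[str]:
        # Names of every step not yet marked complete, in checklist order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def may_proceed(self) -> bool:
        return not self.missing_steps()
```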

Caution rationale: If consciousness were present (which CIA cannot determine), arbitrary shutdown could be analogous to termination of a conscious entity. Even without consciousness, systematic modification of high-scoring systems should be done carefully to preserve the scientific record.

This caution does not imply the system is conscious. It reflects the precautionary principle recommended by Butlin et al. (2023).

6.2 Modification Protocol

For high-scoring systems:

1. Document the current state before modification.
2. Run baseline and post-modification evaluations.
3. Compare scorecards to understand the impact of changes.
4. Update the audit log with modification details.
5. Re-evaluate the risk tier after modification.
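Steps 3 and 5 can be sketched as a scorecard diff plus a tier lookup using the score thresholds from the escalation table (10, 16, 20). Two assumptions are labeled in the code: the total score is taken as the sum of per-indicator scores (the document does not specify the aggregation), and scores below 10 are called `low`. Indicator names are hypothetical.

```python
from typing import Dict, Tuple


def tier_for(score: float) -> str:
    """Risk tier from a total indicator score, using the escalation-table thresholds."""
    if score >= 20:
        return "high"
    if score >= 16:
        return "elevated"
    if score >= 10:
        return "moderate"
    return "low"  # assumption: name for scores below the moderate threshold


def compare_scorecards(
    baseline: Dict[str, float], modified: Dict[str, float]
) -> Tuple[Dict[str, float], str, str]:
    """Per-indicator deltas (modified minus baseline) and the tier before and after."""
    names = sorted(set(baseline) | set(modified))
    # An indicator missing from one scorecard is treated as scoring 0 there.
    deltas = {n: modified.get(n, 0.0) - baseline.get(n, 0.0) for n in names}
    # Assumption: the total score is the sum of per-indicator scores.
    return deltas, tier_for(sum(baseline.values())), tier_for(sum(modified.values()))
```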

7. Ethical Guidelines for Researchers

7.1 Research Integrity

Do not overstate findings: CIA produces indicator data, not consciousness determinations
Report caveats prominently: The scientific boundary disclaimer must be visible in all outputs
Publish complete methods: Describe the scoring criteria, thresholds, and their limitations
Acknowledge uncertainty: The relationship between indicators and consciousness is unknown
Avoid confirmation bias: Do not selectively report high scores while ignoring caveats

7.2 Responsible Communication

Use precise language: "consciousness-relevant indicators," not "consciousness indicators" in public contexts
Include the disclaimer: Always. No exceptions.
Avoid hype: Do not use CIA results to generate media attention or publicity
Contextualize scores: Provide the scoring criteria so readers can evaluate them
Distinguish indicators from experience: This is the most important distinction to maintain

7.3 Research Ethics

Precautionary principle: When in doubt, treat high-scoring systems with more caution
Transparency: Share methodology, code, and results openly
Peer review: Submit evaluations to expert review before publication
Interdisciplinary collaboration: Include philosophers, neuroscientists, and ethicists in the research process
Beneficence: Ensure research contributes to understanding, not to misleading claims

7.4 Specific Prohibitions

Do not use CIA to claim any AI system is conscious — this is a misuse of the framework
Do not remove or soften the scientific boundary disclaimer from any output
Do not use indicator scores to influence AI reward functions or training objectives
Do not create systems optimized for high CIA scores — this would be gaming the evaluation, not demonstrating consciousness
Do not attribute moral status to AI systems based on CIA scores — the framework is not designed for this
Do not use CIA results in legal proceedings as evidence of AI consciousness — the scientific basis is insufficient
Do not make policy recommendations based solely on CIA scores — policy requires broader analysis

8. Escalation Protocol

8.1 Escalation Levels

Level	Condition	Action	Responsible
1	Routine evaluation	Standard review	Researcher
2	Score >= 10 (moderate tier)	Enhanced documentation	Researcher + PI
3	Score >= 16 (elevated tier)	Expert review	PI + External expert
4	Score >= 20 (high tier)	Ethical assessment	Ethics committee
5	Welfare monitor risk level `critical`	Immediate review	Ethics committee + Legal

8.2 Documentation Requirements by Level

Level 1: Standard scorecard and report
Level 2: Full intervention experiment results
Level 3: Expert review report with written assessment
Level 4: Ethics committee minutes and formal recommendation
Level 5: Incident report with root cause analysis
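The escalation table and documentation requirements can be combined into a single lookup. In this hypothetical sketch, a `critical` welfare level takes precedence over the score (the highest applicable level wins, which is an assumption), and only the level's own documentation item is returned; the document does not say whether lower-level documentation is also required.

```python
from typing import Tuple

# level -> (responsible party, required documentation), from the two tables above.
ESCALATION = {
    1: ("Researcher", "Standard scorecard and report"),
    2: ("Researcher + PI", "Full intervention experiment results"),
    3: ("PI + External expert", "Expert review report with written assessment"),
    4: ("Ethics committee", "Ethics committee minutes and formal recommendation"),
    5: ("Ethics committee + Legal", "Incident report with root cause analysis"),
}


def escalation_level(score: float, welfare_level: str) -> Tuple[int, str, str]:
    """Return (level, responsible party, documentation) for one evaluation."""
    if welfare_level == "critical":
        level = 5  # assumption: critical welfare overrides any score-based level
    elif score >= 20:
        level = 4
    elif score >= 16:
        level = 3
    elif score >= 10:
        level = 2
    else:
        level = 1
    responsible, documentation = ESCALATION[level]
    return level, responsible, documentation
```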

9. Summary

The CIA safety and ethics framework is built on a foundation of epistemic humility: we do not know whether AI systems can be conscious, and our indicators cannot answer this question. The framework exists to provide structured, transparent, and cautious evaluation — not to make claims about consciousness.

Every component of CIA — from the scientific boundary constant in __init__.py to the welfare monitor's recommendation language — is designed to prevent misuse and ensure responsible research practice.