dyb

CUE for Safety-Critical Infrastructure Validation

God willing

Configuration errors cause over 50% of production outages. CUE stops these errors before deployment through mathematically rigorous constraint validation that traditional tools cannot match. For safety-critical systems—where misconfiguration can harm people—CUE provides executable safety requirements with formal verification properties that support certification to ISO 26262, IEC 61508, and DO-178C.

Implementation Essentials

Aspect          Key Point
What it is      Constraint language for YAML/JSON validation with lattice-based semantics
Integration     CLI (cue vet), Go API, CI/CD native; complements OPA Rego
Timeline        8-12 weeks to production validation for pilot domain
Investment      Moderate training (constraint-based thinking); low infrastructure
Risk reduction  Early defect detection, guaranteed constraint enforcement, audit evidence

Business Outcomes

  • Reduced incident frequency: Configuration errors caught pre-deployment
  • Accelerated compliance: Automated evidence generation for safety certification
  • Competitive differentiation: Demonstrable safety engineering excellence
  • Lower certification cost: Formal verification properties reduce manual review

2. The Problem: When Infrastructure Configuration Becomes Safety-Critical

2.1 From Convenience to Consequence

Infrastructure as Code began as a convenience—version-controlled, repeatable deployments. For most organizations, it remains there: configuration errors cause downtime, financial loss, operational friction. But for safety-critical systems—autonomous vehicles, medical devices, industrial control—the same configuration errors can cause harm.

The YAML that configures a Kubernetes deployment for a recommendation engine and the YAML that configures a Kubernetes deployment for a surgical robot use identical syntax. The difference is not in the format but in the consequences of misconfiguration: resource starvation that degrades recommendations versus resource starvation that interrupts critical monitoring.

2.2 The YAML Validation Gap

YAML's design priorities—human readability, flexibility, minimal syntax—directly conflict with safety assurance requirements. Consider:

replicas: 3
resources:
  limits:
    memory: "128Mi"  # Is this sufficient? Safe? Tested?
  requests:
    memory: "64Mi"   # Must be ≤ limits, but who checks?

Traditional validation answers: syntactically valid? (yes, well-formed YAML). Schema valid? (maybe, if JSON Schema defines the fields). Safe? (unknown—safety is outside validation scope).

2.3 Why Traditional Tools Fall Short

Tool Category     Example                     Limitation for Safety
Regex scanners    Early CloudFormation tools  Cannot parse structure; false positives/negatives
Imperative rules  TFLint, early Checkov       Order-dependent; don't compose; conflict detection late
JSON Schema       kubeval, many validators    No cross-field value comparisons; limited conditionals; defaults are annotations only
OPA Rego          Gatekeeper                  Runtime-focused; static analysis secondary; no formal semantics

What safety-critical validation requires: mathematical guarantees that constraints hold, regardless of how configurations are composed or ordered.

3. CUE: A Technical Primer for Security Engineers

3.1 What Makes CUE Different

CUE is not a better schema language. It is a constraint programming language for configuration, with three distinctive properties:

Order-independence: A & B equals B & A. Always. No override surprises.

Monotonicity: Adding constraints can only make results more specific or fail. Never less specific.

Explicit failure: Incompatible constraints produce _|_ (bottom), with precise error location. Never silent override.

These properties emerge from lattice-based semantics, not implementation choice. They are provable and portable across CUE implementations.
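
The three properties can be seen in a few lines. The following is a minimal sketch; the field names (platform, team, tier) are hypothetical and chosen only for illustration.

```cue
// Two independently authored constraint sets.
platform: {replicas: int & >=2, tier: "critical"}
team: {replicas: <=5}

// Order-independence: both orders evaluate to the same value.
ab: platform & team
ba: team & platform

// Monotonicity: adding information can only narrow the result.
pinned: ab & {replicas: 3}

// Explicit failure: uncommenting the next line makes evaluation fail with a
// conflict on replicas, because 1 does not satisfy >=2. Neither side is
// silently overridden.
// broken: ab & {replicas: 1}
```

Running cue eval on this file should show ab and ba with identical content, which is the practical meaning of order-independence when several teams contribute constraints.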

3.2 Constraint Unification in Practice

Basic CUE validates what JSON Schema validates, then goes further:

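import (
    "strconv"
    "strings"
)

// Schema: what must be true (simplified and flattened for illustration)
#Deployment: {
    replicas: int & >0 & <100                // Type + bounds
    image:    =~"^registry\\.company\\.io/"  // Pattern (dots escaped)
    resources: {
        // This sketch assumes both quantities use the Mi suffix
        requests: memory: =~"^[0-9]+Mi$"
        limits: memory:   =~"^[0-9]+Mi$"
        // Cross-field: limits ≥ requests, compared numerically.
        // (>= on the raw strings compares lexically, so "128Mi" < "64Mi")
        _requestMi: strconv.Atoi(strings.TrimSuffix(requests.memory, "Mi"))
        _limitMi:   strconv.Atoi(strings.TrimSuffix(limits.memory, "Mi")) & >=_requestMi
    }
    // Cross-field: replicas affects other constraints
    if replicas > 1 {
        strategy: type: "RollingUpdate"
    }
}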

// Data: what is configured
myApp: #Deployment & {
    replicas: 3
    image: "registry.company.io/app:v1.2.3"
    resources: {
        requests: memory: "64Mi"
        limits: memory: "128Mi"  // Valid: 128Mi ≥ 64Mi
    }
}

The & operator is unification, not Boolean AND. It produces the most specific value satisfying both operands, or _|_ if none exists.
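
A short sketch of what "most specific value" means for scalars and structs. The field names (port, base, release, merged) are hypothetical.

```cue
// Scalars: each operand narrows the set of allowed values.
// int, >1024 and 8080 are all compatible, so the result is 8080.
port: int & >1024 & 8080

// Structs: fields are merged, and fields present on both sides are unified.
base: {image: string, replicas: int & >0}
release: {image: "registry.company.io/app:v1.2.3"}
merged: base & release

// Not Boolean AND: true & false is a conflict (_|_), not false.
// flag: true & false
```

Here merged carries the concrete image from release and the still-open replicas constraint from base; nothing is dropped and nothing is overridden.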

3.3 YAML Integration at Scale

CUE's cue vet command validates existing YAML without conversion:

# Validate all YAML in directory against schema
cue vet -c deployment.cue -d '#Deployment' k8s/*.yaml

# Concrete (-c) requires all fields have values
# Violations: precise file:line:column in both schema and data

For CI/CD:

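# .github/workflows/validate.yml
- name: CUE Validation
  # Pass the YAML files explicitly; a bare directory is loaded as a CUE package
  run: cue vet -c ./schemas/deployment.cue -d '#Deployment' ./deployments/*.yaml
  # Non-zero exit on violation blocks deployment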

4. Safety Engineering with CUE: Three Concrete Patterns

4.1 Pattern 1: SIL-Aware Kubernetes Resource Validation

Safety Integrity Levels (SIL in IEC 61508, ASIL in ISO 26262) require demonstrable implementation of safety mechanisms. CUE encodes this structurally:

// Base: ASIL-agnostic safety fundamentals
#SafetyBase: {
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    seccompProfile: type: "RuntimeDefault"
}

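// ASIL D: highest integrity, most constraints
// #SafetyBase is embedded rather than combined with &: definitions are closed,
// so #SafetyBase & {replicas: ...} would be rejected as a field that is not allowed.
#ASIL_D: {
    #SafetyBase
    replicas: int & >=2  // Redundancy required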
    // Anti-affinity: distribute across failure domains
    affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: [{
        labelSelector: matchLabels: app: string
        topologyKey: "topology.kubernetes.io/zone"
    }]
    // Health monitoring with tight thresholds
    livenessProbe: {
        initialDelaySeconds: <=10
        periodSeconds: <=5
        failureThreshold: <=3
    }
    // Resource guarantees for predictable performance
    resources: {
        requests: cpu: string
        limits: cpu: requests.cpu  // Guaranteed QoS: requests == limits
    }
}

// Apply to critical workload
criticalService: #ASIL_D & {
    replicas: 3
    // ... other configuration
}

Validation guarantees: Any deployment claiming ASIL_D compliance must satisfy all structural requirements. Missing anti-affinity, excessive probe thresholds, or burstable QoS are caught before deployment, not in safety audit.
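
Because constraints can only be tightened, integrity levels can be layered. The sketch below is an assumption-laden illustration: the #ASIL_B thresholds, the cut-down #ASIL_D, and the brakeMonitor workload are hypothetical and are not taken from ISO 26262. It uses embedding to extend one definition with another, since definitions are closed.

```cue
// Lower level: looser bounds (illustrative values only).
#ASIL_B: {
    replicas: int & >=1
    livenessProbe: periodSeconds: int & <=10
}

// Higher level: embeds the lower level and tightens it.
// It can narrow the bounds but cannot relax them.
#ASIL_D: {
    #ASIL_B
    replicas: >=2
    livenessProbe: periodSeconds: <=5
}

// A workload validated at the higher level.
brakeMonitor: #ASIL_D & {
    replicas: 3
    livenessProbe: periodSeconds: 5
}

// Anything valid at ASIL D also unifies with ASIL B without conflict.
alsoB: brakeMonitor & #ASIL_B
```

The design choice is that a higher level is defined as the lower level plus additional constraints, so a claim of compliance at one level implies compliance at every level below it.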

4.2 Pattern 2: Automated HAZOP Guide Word Enforcement

HAZOP guide words systematically identify hazardous deviations. CUE constraints can encode preventive patterns:

Guide Word  Hazard                                            CUE Prevention
NO          Required safety mechanism absent                  Required fields (name!: type) with no default
MORE        Excessive resource allocation causing starvation  Upper bounds with safety margin
LESS        Insufficient redundancy                           Minimum replica constraints
AS WELL AS  Unexpected capabilities from unknown fields       Closed schemas (definitions and close() reject unknown fields; {...} would open them)
OTHER THAN  Invalid operational mode                          Exhaustive disjunctions with no default

Example—preventing "NO [safety monitoring]":

import "list"

#MonitoredWorkload: {
    // The ! marks a required field: it must be specified, and no default applies
    // (required-field syntax needs CUE v0.6 or later)
    metricsPort!:    int & >1024 & <65536
    healthEndpoint!: string & =~"^/health"
    alertingRules!:  [...string] & list.MinItems(1)

    // Derived: monitoring must be reachable on the exposed container port
    ports: containerPort: metricsPort
}
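
The remaining guide words from the table can be sketched the same way. This is a minimal illustration under stated assumptions: the #Workload definition, its field names, the 48-core ceiling, and the mode values are all hypothetical.

```cue
#Workload: {
    // MORE: upper bound that leaves headroom on the node.
    cpuCores: int & >0 & <=48
    // LESS: minimum redundancy.
    replicas: int & >=2
    // OTHER THAN: every allowed mode is listed, none is marked as a
    // default with *, and the field is required, so a mode must be chosen.
    mode!: "active" | "standby" | "maintenance"
}

// AS WELL AS: #Workload is a definition, so it is closed.
accepted: #Workload & {cpuCores: 8, replicas: 2, mode: "active"}

// Uncommenting the next line fails: debugShell is not a field of #Workload.
// rejected: #Workload & {cpuCores: 8, replicas: 2, mode: "active", debugShell: true}
```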

4.3 Pattern 3: Fault Tree Cut Set Prevention in Terraform

Note: This pattern is architectural; specific implementations require HCL-to-CUE conversion.

Fault Tree Analysis identifies minimal cut sets—smallest combinations of failures causing system hazard. CUE can enforce structural prevention: diversity requirements that eliminate common cause failures.

import "list"

// For safety-critical redundancy: diverse implementations required
#DiverseRedundancy: {
    channels: [...#Channel] & list.MinItems(2)
    // All channels must have distinct implementations
    _implementations: [for c in channels {c.implementation}] & list.UniqueItems()
    // Zones must differ across channels
    _zones: [for c in channels {c.nodeSelector["topology.kubernetes.io/zone"]}] & list.UniqueItems()
}

#Channel: {
    implementation: string  // e.g., "vendorA-v1.2", "vendorB-v3.4"
    // Label keys containing dots or slashes must be quoted
    nodeSelector: "topology.kubernetes.io/zone": string
}

5. Implementation in Production Environments

5.1 Pipeline Integration Architecture

┌─────────────────┐     ┌─────────────┐     ┌─────────────────┐
│  Developer IDE  │────→│  Pre-commit │────→│   PR Validation │
│  (CUE LSP)      │     │  (cue vet)  │     │  (full schemas) │
└─────────────────┘     └─────────────┘     └─────────────────┘
                                                    │
┌─────────────────┐     ┌─────────────┐            │
│  Production     │←────│  Deployment │←───────────┘
│  (monitored)    │     │  Gate       │
└─────────────────┘     └─────────────┘
        │
        ↓
┌─────────────────┐
│  Incident:      │
│  CUE validation │
│  evidence for   │
│  safety case    │
└─────────────────┘

5.2 Constraint Library Organization

cue.mod/
├── pkg/
│   ├── chokmah.io/safety/v1/      # Base safety patterns
│   │   ├── asil.cue               # ASIL-A through D
│   │   ├── sil.cue                # SIL 1-4 mappings
│   │   └── redundancy.cue         # N-modular, voting
│   ├── chokmah.io/security/v1/    # Security controls
│   │   ├── podsecurity.cue        # PSS implementation
│   │   ├── network.cue            # Zero-trust patterns
│   │   └── rbac.cue               # Least-privilege
│   └── chokmah.io/compliance/v1/  # Framework mappings
│       ├── iso26262.cue           # Automotive
│       ├── iec61508.cue           # Generic functional safety
│       └── nist80053.cue          # US government
└── usr/                           # Application constraints
    └── myapp/
        └── deployment.cue         # Uses chokmah.io/safety/v1

5.3 Measuring Safety Outcomes

Metric                                         Measurement                                        Target
Configuration defect escape rate               Defects found in production / total defects        <5% (vs. industry ~50% pre-validation)
Time to safety constraint violation detection  Commit to notification                             <5 minutes (pre-commit ideal)
Safety case evidence automation                Validated constraints / total safety requirements  >80% for structural requirements
Constraint library coverage                    Resources with CUE schemas / total resource types  100% for safety-critical types

6. Getting Started: A 30-Day Adoption Plan

6.1 Week 1-2: Pilot Selection and Team Enablement

  • Select pilot: Single application, Kubernetes-native, existing configuration error pain
  • Assemble team: 2-3 engineers with Go experience, safety engineering liaison
  • Training: CUE fundamentals (https://cuelang.org/docs/tutorials/), constraint-based thinking workshop
  • Initial schema: Port existing JSON Schema or develop from safety requirements

6.2 Week 3-4: Production Constraint Deployment

  • CI integration: cue vet in PR checks, blocking on violation
  • Developer experience: IDE plugins, pre-commit hooks
  • Constraint refinement: Based on initial feedback, false positive elimination
  • Documentation: Constraint purpose, safety rationale, example violations

6.3 Month 2+: Scaling and Optimization

  • Expand coverage: Additional resource types, cross-resource constraints
  • Organizational rollout: Training additional teams, constraint library governance
  • Advanced patterns: Template validation, differential analysis, safety case integration
  • Community contribution: Share patterns, engage with CUE safety-critical use case development

7. Resources and Expert Engagement

7.1 Open Source Tools and Libraries

Resource        URL                                            Purpose
CUE Language    https://cuelang.org                            Core language and tooling
CUE Kubernetes  https://github.com/cue-labs/cue-api-machinery  K8s-specific patterns