God willing.
CUE does one thing that JSON Schema and OPA cannot: it makes constraint composition provably order-independent and monotone. For a safety-critical system, that property is the difference between a validation framework you can reason about formally and one you have to test exhaustively. Start with a pilot, instrument the defect escape rate, and you will have the evidence your safety case needs.
Prologue
Configuration errors cause over 50% of production outages (Puppet State of DevOps 2023). CUE catches the subset of these errors that can be expressed as constraints, and it does so before deployment, using unification semantics whose composition properties are formally defined rather than tool-specific. For safety-critical systems, where misconfiguration can harm people, CUE lets teams write safety requirements as executable constraints whose validation records support certification to ISO 26262, IEC 61508, and DO-178C.
This guide proves a specific claim: CUE's lattice-based semantics enables safety constraint verification properties that JSON Schema and OPA cannot provide. We demonstrate this across three patterns with CUE code; the first two target production use and the third is conceptual. The primary reader is a defense or automotive systems engineer choosing configuration validation tooling.
Summary
| Aspect | Key Point |
|---|---|
| What it is | Constraint language for YAML/JSON validation with lattice-based semantics |
| Integration | CLI (cue vet), Go API, CI/CD native; complements OPA Rego |
| Timeline | 4-8 weeks to pilot validation for one domain; 6+ months to full production rollout |
| Investment | Moderate training (constraint-based thinking); low infrastructure |
| Risk reduction | Pre-deployment defect detection, guaranteed constraint enforcement, audit evidence |
Business Outcomes
- Reduced incident frequency: Configuration errors caught pre-deployment
- Accelerated compliance: Automated evidence generation for safety certification
- Competitive differentiation: Demonstrable safety engineering excellence
- Lower certification cost: Formal verification properties reduce manual review
2. The Problem: When Infrastructure Configuration Becomes Safety-Critical
2.1 From Convenience to Consequence
Infrastructure as Code began as a convenience: configuration that is version-controlled and repeatable. For most organizations it stays there, and configuration errors cause downtime and financial loss. For safety-critical systems (autonomous vehicles, medical devices, industrial control), the same errors can cause harm.
The YAML that configures a Kubernetes deployment for a recommendation engine and the YAML that configures one for a surgical robot use identical syntax. The difference is not format but consequences of misconfiguration: resource starvation that degrades recommendations versus resource starvation that interrupts critical monitoring.
2.2 The YAML Validation Gap
YAML's design priorities (human readability, flexibility, and minimal syntax) directly conflict with safety assurance requirements. Consider:
    replicas: 3
    resources:
      limits:
        memory: "128Mi"   # Is this sufficient? Safe? Tested?
      requests:
        memory: "64Mi"    # Must be <= limits, but who checks?
Traditional validation answers: syntactically valid? (yes, well-formed YAML). Schema valid? (maybe, if JSON Schema defines the fields). Safe? (unknown, safety is outside validation scope).
2.3 Why Traditional Tools Fall Short
| Tool Category | Example | Limitation for Safety |
|---|---|---|
| Regex scanners | Early CloudFormation tools | Cannot parse structure; false positives/negatives |
| Imperative rules | TFLint, early Checkov | Order-dependent; don't compose; conflict detection late |
| JSON Schema | kubeval, many validators | No cross-field value comparisons; limited conditionals; `default` is an annotation only and is never applied |
| OPA Rego | Gatekeeper | Datalog-derived semantics; no lattice ordering; static analysis is secondary to runtime policy |
OPA's Rego is powerful at runtime policy enforcement but lacks the formal monotonicity and order-independence guarantees that emerge from lattice semantics. See the CUE specification (cuelang.org/docs/references/spec/) versus OPA's policy language reference for the contrast.
What safety-critical validation requires: mathematical guarantees that constraints hold, regardless of how configurations are composed or ordered.
3. CUE: A Technical Primer for Systems Engineers
3.1 What Makes CUE Different
CUE is not a better schema language. It is a constraint programming language for configuration, with three distinctive properties:
Order-independence: A & B equals B & A. Always. No override surprises.
Monotonicity: Adding constraints can only make results more specific or fail. Never less specific.
Explicit failure: Incompatible constraints produce _|_ (bottom), with precise error location. Never silent override.
These properties emerge from lattice-based semantics, not implementation choice. They are provable and portable across CUE implementations.
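The three properties can be observed directly with `cue eval`. The following is a minimal sketch; the field names (`bounds`, `redundant`, `ab`, `ba`, `narrowed`, `conflict`) are hypothetical and chosen only for illustration.

```cue
// Two constraints on the same field, authored independently.
bounds: {replicas: int & >0 & <100}
redundant: {replicas: >=2}

// Order-independence: ab and ba evaluate to the same value.
ab: bounds & redundant
ba: redundant & bounds

// Monotonicity: a further constraint can only narrow the value,
// here to the concrete replica count 3.
narrowed: ab & {replicas: 3}

// Explicit failure: uncommenting the next line makes evaluation fail
// with an error, because 1 does not satisfy >=2.
// conflict: ab & {replicas: 1}
```

Because unification is commutative, it does not matter whether the redundancy rule comes from a platform team's package and the bounds from an application team's file, or the reverse.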
3.2 Constraint Unification in Practice
Basic CUE validates what JSON Schema validates, then goes further:
    // Schema: what must be true
    #Deployment: {
        replicas: int & >0 & <100                 // type + bounds
        image:    =~"^registry\\.company\\.io/"   // pattern; dots escaped so they match literally
        resources: {
            requests: memory: =~"^[0-9]+Mi$"
            limits: memory:   =~"^[0-9]+Mi$"
            // Cross-field: limits must be at least as large as requests.
            // In production CUE, use integer extraction via strconv.Atoi
            // or enforce the policy at the CI layer via cue vet + a Go validator.
            // The constraint below is conceptual pseudocode:
            // limits.memory >= requests.memory
        }
        if replicas > 1 {
            strategy: type: "RollingUpdate"
        }
    }

    // Data: what is configured
    myApp: #Deployment & {
        replicas: 3
        image:    "registry.company.io/app:v1.2.3"
        resources: {
            requests: memory: "64Mi"
            limits: memory:   "128Mi"
        }
    }
The & operator is unification, not Boolean AND. It produces the most specific value satisfying both operands, or _|_ if none exists.
Note on string-valued memory comparisons: CUE does not interpret "128Mi" as a quantity. Comparing the raw strings orders them lexically, which ranks "64Mi" above "128Mi", so `limits.memory >= requests.memory` is not a numeric check. Enforcing memory limit >= request requires extracting the integers first: (a) inside CUE with the standard `strings` and `strconv` packages, (b) in a Go wrapper, or (c) at a CI step outside CUE. Do not treat the cross-field comment above as executable; it is architectural intent.
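One way to make the limit >= request rule executable inside CUE is sketched below. It assumes every quantity uses the `Mi` suffix, as the regular expressions require; the definition name `#Resources` and the hidden field names are hypothetical. `strings.TrimSuffix` and `strconv.Atoi` are CUE standard library functions.

```cue
import (
	"strconv"
	"strings"
)

#Resources: {
	requests: memory: =~"^[0-9]+Mi$"
	limits: memory:   =~"^[0-9]+Mi$"

	// Hidden helper fields: strip the "Mi" suffix and parse the rest as an integer.
	_requestMi: strconv.Atoi(strings.TrimSuffix(requests.memory, "Mi"))
	_limitMi:   strconv.Atoi(strings.TrimSuffix(limits.memory, "Mi"))

	// Unifying with true turns a false comparison into a validation error.
	_limitCoversRequest: true & (_limitMi >= _requestMi)
}

accepted: #Resources & {
	requests: memory: "64Mi"
	limits: memory:   "128Mi"
}

// Uncommenting this value makes evaluation fail, because 32 < 64.
// rejected: #Resources & {
// 	requests: memory: "64Mi"
// 	limits: memory:   "32Mi"
// }
```

Mixed units such as `Gi` would need a conversion table. At that point a Go wrapper or a dedicated CI step may be the simpler option.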
3.3 YAML Integration at Scale
CUE's cue vet command validates existing YAML without conversion:
    # Validate all YAML in directory against schema
    cue vet -c deployment.cue -d '#Deployment' k8s/*.yaml
    # -c (concrete) requires all fields to have values
    # Violations: precise file:line:column in both schema and data
Directory layout matters. See Section 5.2 for the recommended library structure before wiring this into CI.
For CI/CD:
    # .github/workflows/validate.yml
    - name: CUE Validation
      run: cue vet -c ./schemas ./deployments
      # Non-zero exit on violation blocks deployment
4. Safety Engineering with CUE: Three Patterns
4.1 Pattern 1: SIL-Aware Kubernetes Resource Validation
Safety Integrity Levels require demonstrable implementation of safety mechanisms. CUE encodes this structurally:
    // Base: ASIL-agnostic safety fundamentals
    #SafetyBase: {
        runAsNonRoot:             true
        readOnlyRootFilesystem:   true
        allowPrivilegeEscalation: false
        seccompProfile: type: "RuntimeDefault"
    }

    // ASIL D: highest integrity, most constraints.
    // #SafetyBase is embedded rather than unified with &. Definitions are closed,
    // so "#SafetyBase & {replicas: ...}" would reject the added fields.
    #ASIL_D: {
        #SafetyBase
        replicas: >=2
        affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: [{
            labelSelector: matchLabels: app: string
            topologyKey: "topology.kubernetes.io/zone"
        }]
        livenessProbe: {
            initialDelaySeconds: <=10
            periodSeconds:       <=5
            failureThreshold:    <=3
        }
        resources: {
            requests: cpu: string
            limits: cpu:   requests.cpu // Guaranteed QoS: requests == limits
        }
    }

    criticalService: #ASIL_D & {
        replicas: 3
        // ... other configuration
    }
Validation guarantees: Any deployment claiming ASIL_D compliance must satisfy all structural requirements. Missing anti-affinity, excessive probe thresholds, or burstable QoS are caught before deployment, not in a safety audit.
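To illustrate what "caught before deployment" looks like, here is a reduced sketch with one accepted and two rejected claims. The definitions are cut down to two fields for brevity, and the value names (`compliant`, `underReplicated`, `rootAllowed`) are hypothetical. The base definition is embedded inside `#ASIL_D` because CUE definitions are closed and would otherwise reject the extra `replicas` field.

```cue
#SafetyBase: {
	runAsNonRoot: true
}

#ASIL_D: {
	#SafetyBase // embedding extends the closed base definition
	replicas: int & >=2
}

// Accepted: satisfies every constraint; runAsNonRoot is filled in as true.
compliant: #ASIL_D & {replicas: 3}

// Rejected if uncommented: 1 is out of bound for >=2.
// underReplicated: #ASIL_D & {replicas: 1}

// Rejected if uncommented: false conflicts with the required value true.
// rootAllowed: #ASIL_D & {replicas: 3, runAsNonRoot: false}
```

In each rejected case the error names both the data field and the schema line it conflicts with, which is the information a reviewer needs to trace a failure back to the safety requirement.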
4.2 Pattern 2: Automated HAZOP Guide Word Enforcement
HAZOP guide words systematically identify hazardous deviations. CUE constraints encode preventive patterns:
| Guide Word | Hazard | CUE Prevention |
|---|---|---|
| NO | Required safety mechanism absent | Mandatory fields with no default (`cue vet -c` reports an unset field as incomplete) |
| MORE | Excessive resource allocation causing starvation | Upper bounds |
| LESS | Insufficient redundancy | Minimum replica constraints |
| AS WELL AS | Unexpected capabilities from unknown fields | Closed schemas (definitions such as #Name reject undeclared fields; a trailing `...` would reopen them) |
| OTHER THAN | Invalid operational mode | Exhaustive disjunctions with no default |
Example, preventing "NO [safety monitoring]":
    import "list"

    #MonitoredWorkload: {
        // A regular field with a constraint but no concrete value must be
        // supplied by the data; if it is omitted, cue vet -c reports the
        // field as incomplete and validation fails.
        metricsPort:    int & >1024 & <65536
        healthEndpoint: string & =~"^/health"
        alertingRules:  [...string] & list.MinItems(1)
    }
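The AS WELL AS and OTHER THAN rows can be sketched the same way. The names `#Mode`, `#Controller`, `watchdogSeconds`, and `debugShell` are hypothetical and exist only for this illustration.

```cue
// OTHER THAN: an exhaustive disjunction with no default marked.
#Mode: "active" | "standby" | "maintenance"

// AS WELL AS: a definition is closed, so undeclared fields are rejected.
#Controller: {
	mode:            #Mode
	watchdogSeconds: int & >0 & <=30
}

valid: #Controller & {
	mode:            "standby"
	watchdogSeconds: 10
}

// Rejected if uncommented: debugShell is not declared in #Controller.
// extraField: #Controller & {mode: "active", watchdogSeconds: 10, debugShell: true}

// Rejected if uncommented: "degraded" matches no branch of #Mode.
// badMode: #Controller & {mode: "degraded", watchdogSeconds: 10}
```

Leaving the disjunction without a default is deliberate: an omitted `mode` stays non-concrete and fails `cue vet -c` instead of silently selecting a mode.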
4.3 Pattern 3: Fault Tree Cut Set Prevention (Conceptual)
The following is architectural intent. Mapping it to executable Terraform HCL requires HCL-to-CUE conversion tooling.
Fault Tree Analysis identifies minimal cut sets, the smallest failure combinations that cause a system hazard. CUE can enforce structural diversity requirements that eliminate common-cause failures:
    import "list"

    // Two or more channels required; all must have distinct implementations
    #DiverseRedundancy: {
        channels: [...#Channel] & list.MinItems(2)
        // list.UniqueItems is a CUE built-in. Called with an argument it returns
        // a bool, so the result is unified with true to make duplicates fail.
        _implCheck: true & list.UniqueItems([for c in channels {c.implementation}])
    }

    #Channel: {
        implementation: string // e.g., "vendorA-v1.2", "vendorB-v3.4"
        nodeSelector: "topology.kubernetes.io/zone": string
    }
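A usage sketch for the diversity rule follows, reduced to the `implementation` field. The value names `diverse` and `commonCause` are hypothetical. The uniqueness result is unified with `true` so that a duplicate implementation is reported as an error and not merely evaluated to `false`. How `list.MinItems` treats the still-open list inside the definition has varied between CUE releases, so treat this as a sketch to confirm against the CUE version in use.

```cue
import "list"

#Channel: {
	implementation: string
}

#DiverseRedundancy: {
	channels: [...#Channel] & list.MinItems(2)
	_distinct: true & list.UniqueItems([for c in channels {c.implementation}])
}

// Accepted: two channels, two different implementations.
diverse: #DiverseRedundancy & {
	channels: [
		{implementation: "vendorA-v1.2"},
		{implementation: "vendorB-v3.4"},
	]
}

// Rejected if uncommented: both channels share one implementation,
// which is the common-cause failure the pattern is meant to exclude.
// commonCause: #DiverseRedundancy & {
// 	channels: [
// 		{implementation: "vendorA-v1.2"},
// 		{implementation: "vendorA-v1.2"},
// 	]
// }
```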
5. Implementation in Production Environments
5.1 Pipeline Integration Architecture
+-----------------+ +-------------+ +-----------------+
| Developer IDE |---->| Pre-commit |---->| PR Validation |
| (CUE LSP) | | (cue vet) | | (full schemas) |
+-----------------+ +-------------+ +-----------------+
|
+-----------------+ +-------------+ |
| Production |<----| Deployment |<-----------+
| (monitored) | | Gate |
+-----------------+ +-------------+
|
v
+-----------------+
| Incident: |
| CUE validation |
| schema + vet |
| output stored |
| as audit trail |
| for ISO 26262 |
| Table 9 (SW |
| V&V evidence) |
+-----------------+
ISO 26262 Table 9 (Part 6) lists software verification and validation methods. CUE's cue vet output, combined with version-controlled schema files, constitutes documented evidence that structural safety requirements were checked at every deployment. Store schema files in git; store cue vet logs in your artifact repository. Cite schema version and log hash in the safety case.
5.2 Constraint Library Organization
cue.mod/
+-- pkg/
| +-- chokmah.io/safety/v1/
| | +-- asil.cue
| | +-- sil.cue
| | +-- redundancy.cue
| +-- chokmah.io/security/v1/
| | +-- podsecurity.cue
| | +-- network.cue
| | +-- rbac.cue
| +-- chokmah.io/compliance/v1/
| +-- iso26262.cue
| +-- iec61508.cue
| +-- nist80053.cue
+-- usr/
+-- myapp/
+-- deployment.cue
5.3 Measuring Safety Outcomes
| Metric | Measurement | Target |
|---|---|---|
| Configuration defect escape rate | Defects found in production / total defects | <5% (for context, Puppet State of DevOps 2023 attributes over 50% of production outages to configuration errors) |
| Time to safety constraint violation detection | Commit to notification | <5 minutes (pre-commit ideal) |
| Safety case evidence automation | Validated constraints / total safety requirements | >80% for structural requirements |
| Constraint library coverage | Resources with CUE schemas / total resource types | 100% for safety-critical types |
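The escape rate row is simple arithmetic and can be checked in CUE alongside the schemas. This is a sketch only; the definition and field names are hypothetical and the counts are made-up illustration values, not measurements.

```cue
#EscapeRate: {
	foundInProduction: int & >=0
	totalDefects:      int & >0

	// In CUE, / always yields a decimal result, even for integer operands.
	rate:        foundInProduction / totalDefects
	meetsTarget: rate < 0.05
}

// Illustration: 2 escaped defects out of 50 found in total.
pilotQuarter: #EscapeRate & {
	foundInProduction: 2
	totalDefects:      50
}
```

Replacing `meetsTarget` with `true & (rate < 0.05)` would turn the target into a gate that fails evaluation when missed.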
6. Getting Started
6.1 Weeks 1-2: Pilot Selection and Team Enablement
- Select pilot: Single application, Kubernetes-native, existing configuration error pain
- Assemble team: 2-3 engineers with Go experience, safety engineering liaison
- Training: CUE fundamentals (cuelang.org/docs/tutorials/), constraint-based thinking workshop
- Initial schema: Port existing JSON Schema or derive from safety requirements
6.2 Weeks 3-4: Production Constraint Deployment
- CI integration: cue vet in PR checks, blocking on violation
- Developer experience: IDE plugins, pre-commit hooks
- Constraint refinement: Based on initial feedback, false positive elimination
- Documentation: Constraint purpose, safety rationale, example violations
6.3 Month 2 and Beyond: Scaling
Production rollout for a domain with diverse resource types typically takes 6+ months. Expect three phases: schema coverage expansion, organizational training, and safety case integration.
- Expand coverage: Additional resource types, cross-resource constraints
- Organizational rollout: Training additional teams, constraint library governance
- Advanced patterns: Template validation, differential analysis
- Safety case integration: Link schema commits to safety case work products
7. Resources
| Resource | URL | Purpose |
|---|---|---|
| CUE Language | cuelang.org | Core language and tooling |
| CUE Spec (formal semantics) | cuelang.org/docs/references/spec/ | Lattice semantics reference |
| CUE Kubernetes | github.com/cue-labs/cue-api-machinery | K8s-specific patterns |
| Puppet State of DevOps 2023 | puppet.com/resources/state-of-devops-report | Configuration defect statistics |