PRISM AI: Architecting a Cost-Effective Cloud-Native AI Platform with Kubernetes and GitOps
Jilks Smith
Here we’ll explore the DevOps and Kubernetes architecture powering Prism AI, a confidential computing platform that enables secure collaborative AI workloads. We’ll dive deep into the technical decisions and tooling choices that enable reliable scaling while maintaining enterprise-grade security and observability.
A Modern Cloud-Native Stack
Prism AI represents a modern cloud-native application built with a microservices architecture that embraces DevOps best practices from the ground up.
Microservices
The platform is built around the following microservices, each with specific responsibilities:
- Authentication Service: Identity and access management
- Users Service: User lifecycle and profile management
- Domains Service: Domain management with Redis caching
- Computations Service: AI workload orchestration and management
- Backends Service: Unified infrastructure provider abstraction. It orchestrates confidential VMs across Azure (TDX), GCP (SEV-SNP), AWS, and Ultraviolet Cloud.
- Attestation Service (am-certs): Manages secure enclave attestation and certificate issuance using OpenBao as the PKI backend.
- Certificates Service: General PKI management for secure communications
- Billing Services: Financial operations and subscription management
- UI Service: Frontend application.
Kubernetes Orchestration: The Foundation
Helm Charts as Infrastructure Code
Prism AI’s entire Kubernetes infrastructure is defined through Helm charts, following an Infrastructure as Code approach. The charts declare the following dependencies:
- External Secrets for secret management
- PgBouncer for efficient database connection pooling
- Cloudflared for secure edge tunneling
- Argo Rollouts for progressive delivery
- Kube Prometheus Stack for monitoring
- Jaeger for distributed tracing
- OpenBao for dynamic PKI and secrets
- SpiceDB for fine-grained authorization
- PostgreSQL for persistence
- Redis for caching
- NATS for event streaming
- OpenSearch for log analytics
- Fluent Bit for log collection
This dependency management approach ensures consistent deployments across environments while maintaining version control and rollback capabilities.
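To make the dependency list more concrete, here is a minimal sketch of how a few of these dependencies could be declared in an umbrella chart's Chart.yaml. The chart name, the version numbers, and the `condition` keys are assumptions for illustration, not values taken from the Prism AI repository; only three dependencies are shown.

```yaml
# Chart.yaml (illustrative sketch; names, versions, and conditions are assumptions)
apiVersion: v2
name: prism                      # hypothetical umbrella chart name
description: Umbrella chart that pulls in platform dependencies
type: application
version: 0.1.0                   # placeholder chart version
dependencies:
  - name: external-secrets
    version: "0.10.0"            # placeholder; pin to the release you have tested
    repository: https://charts.external-secrets.io
    condition: external-secrets.enabled
  - name: argo-rollouts
    version: "2.37.0"            # placeholder
    repository: https://argoproj.github.io/argo-helm
    condition: argo-rollouts.enabled
  - name: kube-prometheus-stack
    version: "60.0.0"            # placeholder
    repository: https://prometheus-community.github.io/helm-charts
    condition: kube-prometheus-stack.enabled
```

Running `helm dependency update` against a chart like this resolves the pinned versions and records them in Chart.lock, which is what makes deployments reproducible and rollbacks predictable.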
Container Strategy and Registry Management
Prism AI utilizes the GitHub Container Registry (GHCR) for managing microservices images. These images are built and tagged from the source code repository using GitHub Actions.
Development Flow:
- Latest builds tagged as latest: deployed to staging
- Release builds tagged with semantic versions: deployed to production
Security Implementation: All services use private registry authentication through Kubernetes secrets. This ensures secure image distribution while maintaining access control.
imagePullSecrets:
- name: your-container-registry-secret
Resource Management and Scaling
The Horizontal Pod Autoscaler (HPA) scales services based on CPU and memory utilization. Prism AI points its HPAs at Argo Rollouts instead of standard Deployments, so that autoscaling rules persist correctly across canary releases. Vertical Pod Autoscalers (VPA) are also employed for specific backend services to automatically adjust resource requests based on historical usage.
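As a rough illustration of an autoscaler that targets a Rollout, the sketch below points an `autoscaling/v2` HPA at an Argo Rollouts resource through `scaleTargetRef`. The resource name, replica bounds, and utilization thresholds are assumptions chosen for the example.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: computations               # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout                  # a Rollout, not a Deployment
    name: computations
  minReplicas: 2                   # assumed bounds
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed threshold
```

Because the HPA is bound to the Rollout, it keeps managing the total replica count while Argo Rollouts shifts pods between the stable and canary ReplicaSets.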
GitOps & Progressive Delivery
Prism AI goes beyond standard GitOps by implementing Progressive Delivery using Argo Rollouts.

- Production Strategy (Canary): Releases are rolled out in stepped phases (e.g., 20% -> pause -> 40% -> …). This allows the team to validate metrics before exposing the new version to 100% of traffic.
- Staging Strategy: Uses a simplified rolling update strategy to ensure rapid iteration loops.
Argo CD synchronizes these definitions from Git, ensuring that the cluster always matches the declarative state.
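The canary phases described above might be expressed in a Rollout roughly as follows. This is a sketch under stated assumptions: the service name, labels, image tag, replica count, and pause durations are hypothetical, and only the first two weight steps mentioned in the text are shown.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: auth                       # hypothetical service name
spec:
  replicas: 5                      # assumed
  selector:
    matchLabels:
      app: auth
  template:
    metadata:
      labels:
        app: auth
    spec:
      containers:
        - name: auth
          image: ghcr.io/example/auth:1.2.3   # hypothetical tag
  strategy:
    canary:
      steps:
        - setWeight: 20            # send roughly 20% to the new version
        - pause: {duration: 10m}   # assumed duration; time to check metrics
        - setWeight: 40
        - pause: {}                # wait indefinitely for a manual promotion
        # later steps omitted; the Rollout goes to 100% after the last step
```

Without a traffic router configured, Argo Rollouts approximates these weights by adjusting the ratio of canary to stable pods, so the percentages are only as precise as the replica count allows.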
Automated Image Updates
Argo CD Image Updater is used to automatically monitor container registries and update the deployments when new images are available:
annotations:
  argocd-image-updater.argoproj.io/image-list: auth=ghcr.io/example/auth
  argocd-image-updater.argoproj.io/auth.update-strategy: digest
  argocd-image-updater.argoproj.io/auth.force-update: "true"
  argocd-image-updater.argoproj.io/auth.ignore-tags: latest, master
This automation reduces manual intervention while maintaining control over the deployments.
Networking and Ingress Architecture
Secure Edge Connectivity with Cloudflare Tunnels
Before traffic reaches the cluster, Prism AI employs Cloudflare Tunnels (cloudflared) to establish a secure, outbound-only connection to the Cloudflare edge. This architecture eliminates the need to expose public IP addresses or configure complex firewall rules.
Traefik
Traefik serves as both the ingress controller and load balancer, providing structured routing and SSL termination.
Entry Points Configuration:
- Port 80: HTTP traffic (redirects to HTTPS)
- Port 443: HTTPS traffic with automatic SSL (Let’s Encrypt)
- Port 8080: Traefik dashboard (development environments only)
- Port 7018: gRPC backends for confidential agent communication
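For reference, a Traefik static configuration that declares these four entry points could look like the sketch below. The entry point names (`web`, `websecure`, `traefik`, `grpc`) are conventional choices assumed for the example, and the Let's Encrypt certificate resolver is left out for brevity.

```yaml
# traefik.yml (static configuration sketch; entry point names are assumptions)
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure            # redirect all HTTP traffic to HTTPS
          scheme: https
  websecure:
    address: ":443"
  traefik:
    address: ":8080"               # dashboard; expose in development only
  grpc:
    address: ":7018"               # gRPC backends for confidential agents
```

Routers then attach to an entry point by name, which keeps agent gRPC traffic on its own port and separate from regular web traffic.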
Unified Permissioning with SpiceDB
Prism AI employs SpiceDB (inspired by Google's Zanzibar authorization system) for its authorization layer. Instead of relying on simple roles, it defines a schema that supports fine-grained Relationship-Based Access Control (ReBAC). This enables complex permission checks (e.g., “Can User A run computation B on Domain C?”) to be evaluated with millisecond latency.
Database Architecture and Management
Prism AI uses a database-per-service pattern with PostgreSQL. This approach ensures services are isolated and have autonomy.
- auth, users, domains: Core entity application data
- spicedb: Stores the authorization graph
- computations: AI workload metadata
- backends / am-certs: Infrastructure and attestation data
Connection Efficiency with PgBouncer
To handle high-concurrency workloads without exhausting database connections, Prism AI integrates PgBouncer as a lightweight connection pooler.
Observability: Comprehensive Monitoring and Logging
Components:
- Prometheus: Metrics collection and alerting
- Grafana: Visualization and dashboards
- AlertManager: Notification routing and management
- Node Exporter: Infrastructure metrics
- Kube State Metrics: Kubernetes cluster metrics
- Cocos Manager Monitoring: Specific metrics for Confidential Computing nodes (TDX/SNP status).
Custom Dashboards:
Each service has pre-configured dashboards that provide deeper, service-specific insights on top of the default dashboards available out of the box. These dashboards are defined in JSON files and imported into Grafana on service creation via ConfigMaps:
{{- range $path, $content := .Files.Glob "files/dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Release.Name }}-dashboard-{{ base $path | trimSuffix ".json" }}
  namespace: {{ $.Release.Namespace }}
  labels:
    grafana_dashboard: "1"
data:
  {{ base $path }}: |-
{{ ($content | toString) | indent 4 }}
{{- end }}
Alerting Strategy: We have defined custom PrometheusRule sets to catch issues proactively:
- Application: High Error Rate (>5%), Latency Spikes (P95 > 2s).
- Cocos Infrastructure: Specific alerts for CocosManagerDown or AttestationFailed, so that problems in the confidential computing substrate surface immediately.
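A PrometheusRule along these lines could encode the thresholds listed above. Treat it as a sketch: the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `service` and `status` labels, the `cocos-manager` job name, the `release` label, and the `for` durations are all assumptions, since the actual metric names are not given here.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prism-alerts                     # hypothetical name
  labels:
    release: kube-prometheus-stack       # assumed; must match the Prometheus ruleSelector
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          # share of 5xx responses over all responses, per service
          expr: |
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum by (service) (rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
        - alert: HighLatencyP95
          # 95th percentile request latency above 2 seconds
          expr: |
            histogram_quantile(0.95,
              sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
            ) > 2
          for: 10m
          labels:
            severity: warning
    - name: cocos
      rules:
        - alert: CocosManagerDown
          expr: up{job="cocos-manager"} == 0   # assumed scrape job name
          for: 2m
          labels:
            severity: critical
```

The `for` clause keeps short blips from paging anyone: an alert only fires once its expression has stayed true for the whole window.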
Centralized Logging with OpenSearch
Prism AI uses Fluent Bit for log collection and forwarding, OpenSearch for log storage and indexing, and OpenSearch Dashboards for visualization.
fluent-bit:
  inputs: |
    [INPUT]
        Name   tail
        Path   /var/log/containers/*.log
        Parser cri
  outputs: |
    [OUTPUT]
        Name  opensearch
        Host  opensearch-cluster-master
        Index prism-logs
Security
Secret Management & Advanced PKI
- External Secrets Operator: Syncs static secrets (DB passwords, API keys) from Google Secret Manager.
- OpenBao (PKI): For the high-security confidential environments, we use OpenBao as a dynamic secret backend. The am-certs service interacts with OpenBao to issue short-lived X.509 certificates to computing nodes, enabling mutual TLS (mTLS) and secure attestation without long-lived credentials.
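To show what syncing a static secret can look like, here is a minimal ExternalSecret sketch. The store name, secret names, and key are hypothetical, and it assumes a ClusterSecretStore backed by Google Secret Manager already exists. The `v1beta1` API version is also an assumption and may differ depending on the operator release in use.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: users-db-credentials       # hypothetical name
spec:
  refreshInterval: 1h              # assumed; how often to re-read the source secret
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager       # hypothetical store backed by Google Secret Manager
  target:
    name: users-db-credentials     # Kubernetes Secret created by the operator
  data:
    - secretKey: password          # key inside the Kubernetes Secret
      remoteRef:
        key: users-db-password     # hypothetical secret name in Secret Manager
```

The manifest only references the secret by name, so it is safe to keep in Git; the value itself never leaves the secret manager until the operator writes it into the cluster.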
Network Security
Micro-Segmentation with Network Policies: Kubernetes network policies restrict inter-service communication, implementing zero-trust networking principles.
- Ingress Allow-listing: Services only accept traffic from specific upstream callers.
- Egress lockdown: Outbound traffic is restricted to essential dependencies like the database or specific external APIs.
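A NetworkPolicy that combines both rules for a single service might look like the following sketch. The labels, the caller (`auth`), the application port, and the choice of PgBouncer as the only database path are assumptions made for illustration.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: users-restrict-traffic     # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: users                   # assumed label on the Users Service pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: auth            # only this upstream caller is allowed in
      ports:
        - protocol: TCP
          port: 8080               # assumed service port
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: pgbouncer       # database access goes through the pooler
      ports:
        - protocol: TCP
          port: 6432               # PgBouncer's default listen port
    - to:
        - namespaceSelector: {}    # allow DNS lookups in any namespace
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

The DNS rule is easy to forget: once egress is restricted, pods can no longer resolve service names unless port 53 is explicitly allowed.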
Container Security
For container security, the following measures have been applied:
- Private container registries with pull secrets
- Non-root container execution
- Resource limits to prevent resource exhaustion attacks
- Regular security scanning through CI/CD pipelines
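Several of these measures map directly onto pod spec fields. The sketch below is illustrative only: the pod name, user ID, image tag, and resource figures are assumptions, and `allowPrivilegeEscalation`, `readOnlyRootFilesystem`, and the dropped capabilities are common hardening extras that go beyond the measures listed above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: auth-example               # hypothetical name
spec:
  imagePullSecrets:
    - name: your-container-registry-secret
  securityContext:
    runAsNonRoot: true             # refuse to start containers that run as root
    runAsUser: 10001               # assumed non-root UID
  containers:
    - name: auth
      image: ghcr.io/example/auth:1.2.3   # hypothetical tag
      securityContext:
        allowPrivilegeEscalation: false   # extra hardening, not stated above
        readOnlyRootFilesystem: true      # extra hardening, not stated above
        capabilities:
          drop: ["ALL"]
      resources:
        requests:
          cpu: 100m                # assumed figures
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```

In practice the same fields would sit in the pod template of a Rollout or Deployment rather than in a bare Pod.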
Data Persistence and Backup Strategy
Backup with Velero
To keep user data available and recoverable when an incident occurs, Prism AI uses Velero for disaster recovery, with backups running on a regular schedule. See https://velero.io/ for more information on configuring backups. The backup setup covers:
- Kubernetes object backup to DigitalOcean Spaces
- Persistent volume snapshots
- Scheduled backups with configurable retention
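A scheduled backup with a retention period can be declared with a Velero Schedule resource. In this sketch the cron expression, namespace, storage location name, and retention period are assumptions; it also presumes a BackupStorageLocation pointing at the object storage bucket has already been configured.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prism-daily                # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"            # assumed: every day at 02:00
  template:
    includedNamespaces:
      - prism                      # hypothetical application namespace
    snapshotVolumes: true          # also snapshot persistent volumes
    storageLocation: default       # assumed BackupStorageLocation name
    ttl: 168h0m0s                  # assumed retention: 7 days
```

Each run produces a Backup object that Velero deletes once its `ttl` expires, which is how the retention period is enforced.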
CI/CD Pipeline Architecture
The CI/CD pipeline supports multiple environments with different promotion strategies:
Branching Strategy
- main branch: Staging environment (latest image tags)
- production branch: Production environment (semantic versioning)
Automated Testing
- Unit tests live in the source repositories and run on GitHub Actions.
- UI tests with Playwright and load tests with Artillery run before production promotion.
- Manual QA tests are performed before promotion to production.
Container Build and Distribution
The container build and distribution process is automated through GitHub Actions and takes place in the following sequence:
- Code commit triggers GitHub Actions
- Docker images built and tested
- Helm repository is updated with new image references
- Argo CD detects changes and syncs to cluster
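The first two steps of this sequence could be implemented with a workflow similar to the sketch below. The workflow name, branch filter, and image path are assumptions (the image path reuses the example registry path from earlier), and the test stage is left out because the test commands are not described here.

```yaml
# .github/workflows/build.yml (illustrative sketch)
name: build-and-push
on:
  push:
    branches: [main]               # commits to main feed the staging environment
permissions:
  contents: read
  packages: write                  # lets GITHUB_TOKEN push to GHCR
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/example/auth:latest   # hypothetical image path
```

A release workflow would follow the same shape but trigger on version tags and push a semantic-version image tag instead of latest.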
Deployment Environments and Configuration Management
Environment-Specific Configurations
The platform supports multiple deployment environments through Helm values files:
- staging.yaml: Contains development and testing configurations
- production.yaml: Contains production-ready configuration with enhanced security
The key differences between the files are:
- Resource allocations scaled for production workloads
- Enhanced security configurations
- External secret management enabled for production
Some best-practice takeaways from the environment configurations:
- Sensitive data is not stored in Git
- Environment variables are configured through environment-specific values
- Helm template validation and linting in CI/CD
Conclusion
There are many lessons to draw from the Prism AI architecture, CI/CD, and DevOps practices, especially around optimising cost without compromising on quality or security. The platform makes good use of production-ready open-source tools. It also runs primarily on DigitalOcean, which is typically cheaper than the larger cloud providers. Let’s look at some of the considerations below:
Scalability Considerations
- Service Decomposition: Clear service boundaries enable independent scaling
- Database Strategy: Database-per-service prevents bottlenecks
- Async Communication: Event-driven architecture improves resilience
Operational Excellence
- Observability First: Comprehensive monitoring from day one
- GitOps Adoption: Declarative infrastructure management
- Automated Testing: Continuous validation of deployments
- Disaster Recovery: Regular backup testing and restoration procedures
Security Implementation
- Zero Trust Networking: Network policies restrict communication
- Least Privilege Access: RBAC controls limit access scope
- Secret Rotation: Automated secret management and rotation
- Container Security: Private registries and security scanning
It is also worth mentioning a few areas of improvement:
- Service Mesh Integration: Istio or Linkerd for advanced traffic management
- Chaos Engineering: Automated failure testing with Chaos Monkey
- Cost Optimization: Advanced resource scheduling and spot instance usage
- Multi-Cloud Deployment: Cross-cloud redundancy and disaster recovery
- Kubernetes Operators: Custom operators for application lifecycle management
- Advanced Scheduling: Topology-aware scheduling and resource optimization
Happy Orchestrating 🙂!