PRISM AI: Architecting a Cost-Effective Cloud-Native AI Platform with Kubernetes and GitOps

Here we’ll explore the DevOps and Kubernetes architecture powering Prism AI, a confidential computing platform that enables secure collaborative AI workloads. We’ll dive deep into the technical decisions and tooling choices that enable reliable scaling while maintaining enterprise-grade security and observability.

A Modern Cloud-Native Stack

Prism AI represents a modern cloud-native application built with a microservices architecture that embraces DevOps best practices from the ground up.

Microservices

The platform is built around the following microservices, each with specific responsibilities:

Authentication Service: Identity and access management
Users Service: User lifecycle and profile management
Domains Service: Domain management with Redis caching
Computations Service: AI workload orchestration and management
Backends Service: Unified infrastructure provider abstraction. It orchestrates confidential VMs across Azure (TDX), GCP (SEV-SNP), AWS, and Ultraviolet Cloud.
Attestation Service (am-certs): Manages secure enclave attestation and certificate issuance using OpenBao as the PKI backend.
Certificates Service: General PKI management for secure communications
Billing Services: Financial operations and subscription management
UI Service: Frontend application.

Kubernetes Orchestration: The Foundation

Helm Charts as Infrastructure Code

Prism AI’s entire Kubernetes infrastructure is defined through Helm charts, Infrastructure as Code. The following dependencies are defined in the charts:

External Secrets for secret management
PgBouncer for efficient database connection pooling
Cloudflared for secure edge tunneling
Argo Rollouts for progressive delivery
Kube Prometheus Stack for monitoring
Jaeger for distributed tracing
OpenBao for dynamic PKI and secrets
SpiceDB for fine-grained authorization
PostgreSQL for persistence
Redis for caching
NATS for event streaming
OpenSearch for log analytics
Fluent Bit for log collection

This dependency management approach ensures consistent deployments across environments while maintaining version control and rollback capabilities.

Container Strategy and Registry Management

Prism AI utilizes the GitHub Container Registry (GHCR) for managing microservices images. These images are built and tagged from the source code repository using GitHub Actions.

Development Flow:

Latest builds tagged as latest: deployed to staging
Release builds tagged with semantic versions: deployed to production

Security Implementation: All services use private registry authentication through Kubernetes secrets. This ensures secure image distribution while maintaining access control.

imagePullSecrets:
 - name: your-container-registry-secret

Resource Management and Scaling

For CPU and memory utilization and management, Horizontal Pod Autoscaler is used. Uniquely, Prism AI targets Argo Rollouts instead of standard Deployments, ensuring that autoscaling rules persist correctly across canary releases. Vertical Pod Autoscalers (VPA) are also employed for specific backend services to automatically adjust resource requests based on historical usage.

GitOps & Progressive Delivery

Prism AI goes beyond standard GitOps by implementing Progressive Delivery using Argo Rollouts.

Prism AI GitOps Workflow

Production Strategy (Canary): Releases are rolled out in stepped phases (e.g., 20% -> pause -> 40% -> …). This allows the team to validate metrics before exposing the new version to 100% of traffic.
Staging Strategy: Uses a simplified rolling update strategy to ensure rapid iteration loops. Argo CD synchronizes these definitions from Git, ensuring that the cluster always matches the declarative state.

Automated Image Updates

Argo CD Image Updater is used to automatically monitor container registries and update the deployments when new images are available:

annotations:
  argocd-image-updater.argoproj.io/image-list: auth=ghcr.io/example/auth
  argocd-image-updater.argoproj.io/auth.update-strategy: digest
  argocd-image-updater.argoproj.io/auth.force-update: "true"
  argocd-image-updater.argoproj.io/auth.ignore-tags: latest, mastery

This automation reduces manual intervention while maintaining control over the deployments.

Networking and Ingress Architecture

Secure Edge Connectivity with Cloudflare Tunnels

Before traffic reaches the cluster, Prism AI employs Cloudflare Tunnels (cloudflared) to establish a secure, outbound-only connection to the Cloudflare edge. This architecture eliminates the need to expose public IP addresses or configure complex firewall rules.

Traefik

Traefik serves as both the ingress controller and load balancer, providing structured routing and SSL termination.

Entry Points Configuration:

Port 80: HTTP traffic (redirects to HTTPS)
Port 443: HTTPS traffic with automatic SSL (Let’s Encrypt)
Port 8080: Traefik dashboard on development
Port 7018: gRPC backends for confidential agent communication

Unified Permissioning with SpiceDB

Prism AI employs SpiceDB (based on Google Zanzibar) for its authorization layer. Instead of simple roles, it defines a schema allowing fine-grained Relationship-Based Access Control (ReBAC). This enables complex permission checks (e.g., “Can User A run computation B on Domain C?”) to be evaluated with millisecond latency.

Database Architecture and Management

Prism AI uses a database-per-service pattern with PostgreSQL. This approach ensures services are isolated and have autonomy.

auth, users, domains: Core entity application data
spicedb: Stores the authorization graph
computations: AI workload metadata
backends/am-certs: Infrastructure and Attestation data

Connection Efficiency with PgBouncer

To handle high-concurrency workloads without exhausting database connections, Prism AI integrates PgBouncer as a lightweight connection pooler.

Observability: Comprehensive Monitoring and Logging

Components:

Prometheus: Metrics collection and alerting
Grafana: Visualization and dashboards
AlertManager: Notification routing and management
Node Exporter: Infrastructure metrics
Kube State Metrics: Kubernetes cluster metrics
Cocos Manager Monitoring: Specific metrics for Confidential Computing nodes (TDX/SNP status).

Custom Dashboards:

There are pre-configured dashboards for each service, which provide deeper and custom insights in addition to the default dashboards that are provided out of the box by Grafana. These dashboards are defined in JSON files and imported into Grafana on service creation via ConfigMaps:

{{- range $path, $content := .Files.Glob "files/dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Release.Name }}-dashboard-{{ base $path | trimSuffix ".json" }}
  namespace: {{ $.Release.Namespace }}
  labels:
    grafana_dashboard: "1"
data:
  {{ base $path }}: |-
{{ ($content| toString) | indent 4 }}
{{- end }}

Alerting Strategy: We have defined custom PrometheusRule sets to catch issues proactively:

Application: High Error Rate (>5%), Latency Spikes (P95 > 2s).
Cocos Infrastructure: Specific alerts for CocosManagerDown or AttestationFailed, ensuring the confidential computing substrate is always healthy.

Centralized Logging with OpenSearch

Prism makes use of Fluent Bit for log collection and forwarding, OpenSearch for log storage and indexing, and OpenSearch Dashboards for visualization.

fluent-bit:
  inputs: |
    [INPUT]
        Name: tail
        Path: /var/log/containers/*.log
        Parser: cri
  outputs: |
    [OUTPUT]
        Name: opensearch
        Host: opensearch-cluster-master
        Index: prism-logs

Security

Secret Management & Advanced PKI

External Secrets Operator: Syncs static secrets (DB passwords, API keys) from Google Secret Manager.
OpenBao (PKI): For the high-security confidential environments, we use OpenBao as a dynamic secret backend. The am-certs service interacts with OpenBao to issue short-lived X.509 certificates to computing nodes, enabling mutual TLS (mTLS) and secure attestation without long-lived credentials.

Network Security

Micro-Segmentation with Network Policies: Kubernetes network policies restrict inter-service communication, implementing zero-trust networking principles.

Ingress Allow-listing: Services only accept traffic from specific upstream callers.
Egress lockdown: Outbound traffic is restricted to essential dependencies like the database or specific external APIs.

Container Security

For container security, the following measures have been applied:

Private container registries with pull secrets
Non-root container execution
Resource limits to prevent resource exhaustion attacks
Regular security scanning through CI/CD pipelines

Data Persistence and Backup Strategy

Backup with Velero

In order to ensure resilience and availability of user data whenever an incident occurs, Prism AI uses Velero for disaster recovery. Backups are done regularly. You can have a look at https://velero.io/ for more information on configuring backups.

Kubernetes object backup to DigitalOcean Spaces
Persistent volume snapshots
Scheduled backups with configurable retention

CI/CD Pipeline Architecture

The CI/CD pipeline supports multiple environments with different promotion strategies:

Branching Strategy

main branch: Staging environment (latest image tags)
production branch: Production environment (semantic versioning)

Automated Testing

Unit tests in source repositories and run on GitHub Actions.
UI tests with Playwright and Load Test with Artillery before production promotion.
Manual QA tests before promotion to production.

Container Build and Distribution

The container build and distribution process is automated through GitHub Actions and takes place in the following sequence:

Code commit triggers GitHub Actions
Docker images built and tested
Helm repository is updated with new image references
Argo CD detects changes and syncs to cluster

Deployment Environments and Configuration Management

Environment-Specific Configurations

The platform supports multiple deployment environments through Helm values files:

staging.yaml: This contains development and testing configurations
production.yaml: This contains production-ready configuration with enhanced security

Major key differences between the files are:

Resource allocations scaled for production workloads
Enhanced security configurations
External secret management enabled for production

Some best practices takeaway from the environment configurations:

Sensitive data is not stored in Git
Environment variables are configured through environment-specific values
Helm template validation and linting in CI/CD

Conclusion

There are a lot of lessons that can be drawn from the Prism AI architecture, CI/CD, and DevOps practices, especially when it comes to cost optimisation without compromising on excellence and security. The platform makes good use of open-source production-ready tools. They also utilize DigitalOcean primarily, which is a lot cheaper than other Cloud Providers. Let’s look at some of the considerations below:

Scalability Considerations

Service Decomposition: Clear service boundaries enable independent scaling
Database Strategy: Database-per-service prevents bottlenecks
Async Communication: Event-driven architecture improves resilience

Operational Excellence

Observability First: Comprehensive monitoring from day one
GitOps Adoption: Declarative infrastructure management
Automated Testing: Continuous validation of deployments
Disaster Recovery: Regular backup testing and restoration procedures

Security Implementation

Zero Trust Networking: Network policies restrict communication
Least Privilege Access: RBAC controls limit access scope
Secret Rotation: Automated secret management and rotation
Container Security: Private registries and security scanning

It is also worth mentioning a few areas of improvement:

Service Mesh Integration: Istio or Linkerd for advanced traffic management
Chaos Engineering: Automated failure testing with Chaos Monkey
Cost Optimization: Advanced resource scheduling and spot instance usage
Multi-Cloud Deployment: Cross-cloud redundancy and disaster recovery
Kubernetes Operators: Custom operators for application lifecycle management
Advanced Scheduling: Topology-aware scheduling and resource optimization

Learn More:

Happy Orchestrating 🙂!

Products

Documentation

Projects

prism ai PRISM AI: Architecting a Cost-Effective Cloud-Native AI Platform with Kubernetes and GitOps