Senior Platform / DevOps Engineer (Kubernetes, Helm, Python, Terraform)

Posted 6 days ago

Role Summary

We are seeking a hands-on Platform/DevOps Engineer to build and operate a cloud-native AI model hosting platform. The role requires strong Kubernetes fundamentals, Helm-based deployment management, Python automation, and Terraform-driven infrastructure provisioning across GKE and AKS. The engineer will also implement and troubleshoot OAuth/OIDC authentication (Google OAuth and Microsoft Entra ID) and ensure secure, scalable deployment lifecycles.

This role is execution-heavy and requires a deep understanding of how Kubernetes, cloud infrastructure, identity, and application code interact in production systems.

Experience Level

  1. 1 – 3 years in Platform / DevOps / Infrastructure engineering.
  2. Proven experience delivering production Kubernetes systems.
  3. Comfortable working independently and owning end-to-end delivery.

Key Responsibilities

Kubernetes & Platform Engineering

  • Design, deploy, and operate Kubernetes workloads using Deployments, Services, Ingress, ConfigMaps, Secrets, and RBAC.
  • Implement and manage multi-GPU workloads (T4 / L4 / H100) using node labels, taints/tolerations, node affinity, and GPU resource requests.
  • Debug scheduling, networking, and runtime issues using kubectl, cluster logs, and metrics.
  • Implement deployment lifecycle operations including create, update, rollback, and delete with clean resource deallocation.
  • Enforce namespace isolation, least-privilege RBAC, and secure service-account usage.

Helm (Mandatory)

  • Author and maintain Helm charts for application and model deployments.
  • Parameterize deployments via values.yaml for:
    • GPU type and resource limits
    • Image versions
    • Environment-specific configuration (dev/stage/prod)
  • Implement Helm-based upgrade and rollback strategies.
  • Validate rendered manifests and ensure reproducible deployments.

Python (Mandatory)

  • Develop Python services and scripts for:
    • Kubernetes API interactions (listing nodes, deployments, pods)
    • Deployment orchestration and automation
    • Validation of cluster resources (GPU availability, scheduling constraints)
  • Use the Kubernetes Python client to manage CRDs, Deployments, and Services.
  • Implement backend logic supporting deployment workflows and user actions.
  • Write unit and integration tests for deployment and infrastructure logic.

Infrastructure as Code – Terraform (Mandatory)

  • Provision and manage cloud infrastructure using Terraform, including:
    • GKE and AKS clusters
    • GPU node pools
    • Networking (VPC/VNet, subnets, firewall rules)
    • IAM / RBAC integrations
  • Maintain reusable Terraform modules and environment-specific configurations.
  • Ensure infrastructure changes are version-controlled and reproducible.
  • Coordinate Terraform changes with Kubernetes/Helm workflows so that infrastructure and application deployments stay aligned.

Cloud Platforms – GKE / AKS

  • Operate Kubernetes clusters on Google Kubernetes Engine (GKE) and Azure Kubernetes Service (AKS).
  • Configure GPU-enabled node pools and validate driver/device-plugin readiness.
  • Manage cluster access, IAM roles, and service-account permissions.
  • Troubleshoot cloud-specific issues related to networking, storage, and node provisioning.

Authentication & Security (OAuth/OIDC)

  • Implement OAuth/OIDC authentication flows for:
    • Google OAuth
    • Microsoft Entra ID (Work/School accounts)
  • Configure and troubleshoot app registrations, redirect URIs, CORS, scopes, and claims.
  • Validate tokens and securely map user identity (email, subject, tenant).
  • Ensure that no passwords are stored and that authentication complies with enterprise SSO requirements.

Testing & Reliability

  • Develop automated test scripts to validate:
    • Helm chart rendering
    • GPU scheduling and resource allocation
    • Deployment creation and deletion
    • OAuth authentication flows
  • Document operational runbooks and deployment procedures.
  • Support production readiness and incident debugging.

Required Skills (Must-Have)

  • Kubernetes: strong hands-on experience with core objects, RBAC, scheduling, and troubleshooting.
  • Helm: authoring and maintaining Helm charts (mandatory).
  • Python: backend development and Kubernetes automation (mandatory).
  • Terraform: infrastructure provisioning and management (mandatory).
  • Docker: containerization and image optimization.
  • GKE and/or AKS: production experience operating Kubernetes in the cloud.
  • OAuth/OIDC: practical implementation experience (Google OAuth, Microsoft Entra ID).

Nice-to-Have

  • GPU workload experience (NVIDIA device plugin, gpu-operator).
  • GitOps tools (Argo CD, Flux).
  • Observability tooling (Prometheus, Grafana).
  • CRD design and Kubernetes API extensions.
  • CI/CD pipeline integration.
