Mushroom Solutions Inc
make life better
Role Summary
We are seeking a hands-on Platform/DevOps Engineer to build and operate a cloud-native AI model hosting platform. The role requires strong Kubernetes fundamentals, Helm-based deployment management, Python automation, and Terraform-driven infrastructure provisioning across GKE and AKS. The engineer will also implement and troubleshoot OAuth/OIDC authentication (Google OAuth and Microsoft Entra ID) and ensure secure, scalable deployment lifecycles.
This role is execution-heavy and requires deep understanding of how Kubernetes, cloud infrastructure, identity, and application code interact in production systems.
Experience Level
- 1 to 3 years of experience in Platform / DevOps / Infrastructure engineering.
- Proven experience delivering production Kubernetes systems.
- Comfortable working independently and owning end-to-end delivery.
Key Responsibilities
Kubernetes & Platform Engineering
- Design, deploy, and operate Kubernetes workloads using Deployments, Services, Ingress, ConfigMaps, Secrets, and RBAC.
- Implement and manage multi-GPU workloads (T4 / L4 / H100) using node labels, taints/tolerations, node affinity, and GPU resource requests.
- Debug scheduling, networking, and runtime issues using kubectl, cluster logs, and metrics.
- Implement deployment lifecycle operations including create, update, rollback, and delete with clean resource deallocation.
- Enforce namespace isolation, least-privilege RBAC, and secure service-account usage.
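For illustration, the GPU scheduling work described above could look roughly like the following Python sketch, which builds a pod spec fragment as a plain dictionary. The `gpu_pod_spec` helper, the container name, and the image are hypothetical. The node label key and values follow GKE's accelerator naming and the toleration matches the `nvidia.com/gpu` taint commonly applied to GPU nodes; these are assumptions to verify against the actual cluster, and AKS node pools use different labels.

```python
import json

# Assumption: GKE-style accelerator label and values; verify against your cluster.
ACCELERATOR_LABEL = "cloud.google.com/gke-accelerator"
GPU_LABEL_VALUES = {
    "T4": "nvidia-tesla-t4",
    "L4": "nvidia-l4",
    "H100": "nvidia-h100-80gb",
}


def gpu_pod_spec(image, gpu_type, gpu_count=1):
    """Return a pod spec fragment that targets nodes with the given GPU type.

    Hypothetical helper: combines a node selector, a toleration for the
    GPU taint, and an nvidia.com/gpu resource limit.
    """
    if gpu_type not in GPU_LABEL_VALUES:
        raise ValueError(f"unsupported GPU type: {gpu_type}")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        # Schedule only onto nodes labelled with the requested accelerator.
        "nodeSelector": {ACCELERATOR_LABEL: GPU_LABEL_VALUES[gpu_type]},
        # Tolerate the taint that keeps non-GPU pods off GPU nodes.
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "containers": [
            {
                "name": "model-server",  # hypothetical container name
                "image": image,
                # GPUs are requested through limits on the extended resource.
                "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(gpu_pod_spec("example/model-server:1.0.0", "L4", 2), indent=2))
```

Such a fragment would sit under `spec.template.spec` of a Deployment. Node affinity could replace the node selector where softer placement rules are needed.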
Helm (Mandatory)
- Author and maintain Helm charts for application and model deployments.
- Parameterize deployments via values.yaml for:
  - GPU type and resource limits
  - Image versions
  - Environment-specific configuration (dev/stage/prod)
- Implement Helm-based upgrade and rollback strategies.
- Validate rendered manifests and ensure reproducible deployments.
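As a rough illustration of the parameterization described above, the Python sketch below mimics how environment-specific overrides layer on top of base values: nested maps are merged key by key, and later values win. The value names (`image.tag`, `gpu.type`, `replicas`) and the `deep_merge` helper are hypothetical, and the sketch does not model every Helm merge rule (for example, its handling of null values).

```python
import copy

# Hypothetical base values, as they might appear in values.yaml.
BASE_VALUES = {
    "image": {"repository": "example/model-server", "tag": "1.0.0"},
    "gpu": {"type": "T4", "count": 1},
    "replicas": 1,
}

# Hypothetical production overrides, as they might appear in values-prod.yaml.
PROD_OVERRIDES = {
    "image": {"tag": "1.2.0"},
    "gpu": {"type": "H100", "count": 8},
    "replicas": 3,
}


def deep_merge(base, override):
    """Return a new dict where override is layered on top of base.

    Nested dicts are merged recursively; any other value in override
    replaces the value in base. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    print(deep_merge(BASE_VALUES, PROD_OVERRIDES))
```

A test script could compare the merged values against the rendered manifests to confirm that each environment deploys the intended image tag and GPU configuration.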
Python (Mandatory)
- Develop Python services and scripts for:
  - Kubernetes API interactions (listing nodes, deployments, pods)
  - Deployment orchestration and automation
  - Validation of cluster resources (GPU availability, scheduling constraints)
- Use the Kubernetes Python client to manage CRDs, Deployments, and Services.
- Implement backend logic supporting deployment workflows and user actions.
- Write unit and integration tests for deployment and infrastructure logic.
Infrastructure as Code – Terraform (Mandatory)
- Provision and manage cloud infrastructure using Terraform, including:
  - GKE and AKS clusters
  - GPU node pools
  - Networking (VPC/VNet, subnets, firewall rules)
  - IAM / RBAC integrations
- Maintain reusable Terraform modules and environment-specific configurations.
- Ensure infrastructure changes are version-controlled and reproducible.
- Coordinate Terraform changes with Kubernetes/Helm workflows so that infrastructure and application deployments stay aligned.
Cloud Platforms – GKE / AKS
- Operate Kubernetes clusters on Google Kubernetes Engine (GKE) and Azure Kubernetes Service (AKS).
- Configure GPU-enabled node pools and validate driver/device-plugin readiness.
- Manage cluster access, IAM roles, and service-account permissions.
- Troubleshoot cloud-specific issues related to networking, storage, and node provisioning.
Authentication & Security (OAuth/OIDC)
- Implement OAuth/OIDC authentication flows for:
  - Google OAuth
  - Microsoft Entra ID (Work/School accounts)
- Configure and troubleshoot app registrations, redirect URIs, CORS, scopes, and claims.
- Validate tokens and securely map user identity (email, subject, tenant).
- Ensure that no passwords are stored and that authentication complies with enterprise SSO requirements.
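To give a sense of the token validation and identity mapping mentioned above, here is a minimal Python sketch that checks the standard `iss`, `aud`, and `exp` claims of an ID token and extracts email, subject, and tenant. It assumes the token signature has already been verified by a JWT library against the provider's published keys; the `map_identity` helper and the sample values are hypothetical. The `tid` claim is how Microsoft Entra ID identifies the tenant; Google tokens do not carry it.

```python
import time


def map_identity(claims, audience, allowed_issuers, now=None):
    """Validate already-decoded ID token claims and return a user identity.

    Assumption: signature verification happened before this call.
    Raises ValueError when a required check fails.
    """
    now = time.time() if now is None else now

    if claims.get("iss") not in allowed_issuers:
        raise ValueError("untrusted issuer")

    # The aud claim may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        raise ValueError("token was not issued for this application")

    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")

    if not claims.get("sub"):
        raise ValueError("missing subject")

    return {
        "issuer": claims["iss"],
        "subject": claims["sub"],
        "email": claims.get("email"),
        "tenant": claims.get("tid"),  # present for Entra ID, absent for Google
    }


if __name__ == "__main__":
    sample = {
        "iss": "https://accounts.google.com",
        "aud": "example-client-id",  # hypothetical client ID
        "sub": "1234567890",
        "email": "user@example.com",
        "exp": time.time() + 300,
    }
    print(map_identity(sample, "example-client-id", {"https://accounts.google.com"}))
```

Keying user records on the issuer and subject pair, rather than on email alone, avoids collisions when the same address appears under more than one identity provider.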
Testing & Reliability
- Develop automated test scripts to validate:
  - Helm chart rendering
  - GPU scheduling and resource allocation
  - Deployment creation and deletion
  - OAuth authentication flows
- Document operational runbooks and deployment procedures.
- Support production readiness and incident debugging.
Required Skills (Must-Have)
- Kubernetes: strong hands-on experience with core objects, RBAC, scheduling, and troubleshooting.
- Helm: authoring and maintaining Helm charts (mandatory).
- Python: backend development and Kubernetes automation (mandatory).
- Terraform: infrastructure provisioning and management (mandatory).
- Docker: containerization and image optimization.
- GKE and/or AKS: production experience operating Kubernetes in the cloud.
- OAuth/OIDC: practical implementation experience (Google OAuth, Microsoft Entra ID).
Nice-to-Have
- GPU workload experience (NVIDIA device plugin, NVIDIA GPU Operator).
- GitOps tools (Argo CD, Flux).
- Observability tooling (Prometheus, Grafana).
- CRD design and Kubernetes API extensions.
- CI/CD pipeline integration.