The Job
- Own and improve SLOs, SLIs, and error budgets for critical services across playback, login, subscription, recommendation, and API layers.
- Build and maintain observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog) to proactively detect and resolve issues.
- Drive incident management, root cause analysis (RCA), and postmortem culture for service outages and performance degradation.
- Automate repetitive operational tasks via IaC (Terraform), CI/CD (GitHub Actions), and scripting (Python, Bash, or Go).
- Collaborate with backend, frontend, and data teams to design fault-tolerant, scalable infrastructure (GKE, Cloud Run, Cloud CDN, etc.).
- Work closely with security and platform teams to ensure system hardening, compliance, and zero-trust principles.
- Continuously assess infrastructure cost and performance trade-offs to optimize cloud spend (GCP preferred).
- Contribute to the evolution of our deployment strategy (blue/green, canary, A/B), especially during high-traffic events (e.g. livestreams, premieres).
Your Skills and Experience
- 5+ years of experience as an SRE, DevOps engineer, or production engineer in large-scale environments.
- Strong knowledge of Linux internals, networking, and systems performance tuning.
- Deep experience with Kubernetes, containers, and service mesh technologies (Istio or Linkerd).
- Proficiency with cloud platforms (preferably GCP), including IAM, Compute Engine, GKE, Cloud CDN, and Cloud Logging.
- Solid experience with monitoring, logging, and alerting stacks (e.g. Prometheus, Grafana, ELK, Loki, Datadog).
- Strong scripting or programming skills in Python, Go, or Bash.
- Familiarity with CI/CD, IaC, and GitOps tools (Terraform, Helm, ArgoCD, Cloud Build).
- Clear communication skills and a calm, analytical approach to solving complex problems in high-pressure environments.
Nice to Have
- Experience supporting real-time media systems or video streaming platforms.
- Knowledge of multi-region HA, failover, and edge optimization strategies (especially for Asia-Pacific markets).
- Familiarity with error budgets, chaos engineering, and resiliency testing.
- Background in supporting platform services for experimentation (A/B), personalization, or user engagement.
Why You'll Love Working Here
- Own the reliability of a platform used by 20M+ users with large-scale live events and high concurrency.
- Work in a modern, cloud-native environment (GCP, Kubernetes, Kafka, Iceberg, Cloud CDN).
- Be part of a highly autonomous engineering culture focused on velocity, quality, and learning.
- Influence architecture and process for the next generation of entertainment infrastructure in Vietnam and beyond.