1. Site Reliability Engineering & Infrastructure Management (35%)
• Define SRE roadmap: Build and implement an SRE roadmap to create
cross-functional systems that meet the company's scalability
requirements
• Multi-cloud deployment: Deploy and manage services across
on-premises, Azure, and AWS environments
• Infrastructure as Code: Implement deployment and configuration
automation using tools like Terraform, Ansible, or CloudFormation
• System monitoring: Set up and maintain monitoring systems
(Prometheus, Grafana) to identify potential issues
• Performance optimization: Tune services such as Elasticsearch and
Logstash, and improve overall system performance
• Service continuity: Ensure system and service continuity and
implement disaster recovery strategies
2. DevOps & CI/CD Pipeline (25%)
• Application modernization: Standardize and automate application
development & deployment pipelines
• CI/CD implementation: Design and maintain CI/CD pipelines using
GitLab CI, Jenkins, or Azure DevOps
• Configuration management: Manage and control deployment flows,
and standardize configurations so that changes can be tracked
• Container orchestration: Deploy and manage containerized
applications using Docker and Kubernetes
• GitOps practices: Implement GitOps workflows with ArgoCD or
FluxCD
• Secret management: Implement secure secret management
solutions
3. MLOps & Machine Learning Infrastructure (25%)
• ML pipeline automation: Design and implement automated ML
pipelines for SmartCity AI models
• Model deployment & serving: Set up model training, versioning, and
serving infrastructure
• Experiment tracking: Implement experiment tracking and model
registry systems
• Data pipeline management: Manage data pipelines and data lakes
for ML workloads
• Feature stores: Set up and maintain feature stores for ML model
consistency
• ML monitoring: Monitor ML model performance, detect drift, and
automate retraining
• A/B testing infrastructure: Design infrastructure for A/B testing ML
models in production
4. SmartCity Specific Infrastructure (15%)
• IoT infrastructure: Manage infrastructure for IoT device integration
and data streaming
• Edge computing: Deploy and manage edge computing solutions for
real-time processing
• Data streaming: Implement real-time data streaming with Kafka or
similar tools for SmartCity applications
• Security compliance: Ensure compliance with security standards for
government/municipal systems
• High availability: Design fault-tolerant systems for critical SmartCity
services
Required Qualifications:
• Bachelor's degree in Computer Science, Software Engineering, Information Technology, or equivalent
• 2+ years of experience in DevOps/SRE roles, including hands-on Linux experience and site reliability responsibilities
• Good understanding of Docker and Kubernetes container orchestration
• Experience with monitoring systems: Grafana, Prometheus
• Hands-on experience with CI/CD tools: GitLab CI, Jenkins, or Azure DevOps
• Experience with centralized logging solutions: ELK stack or similar
• Cloud experience: AWS or Azure (deployment, management, and optimization)
• Programming skills: Bash and Python for automation and scripting