Manage and optimize on-premises bare metal servers to ensure scalability, reliability, and cost-efficiency - including right-sizing hardware, performance tuning, capacity planning, and lifecycle management.
Install, rack, cable, and commission physical servers (Dell, HPE, Supermicro, or equivalent), including RAID configuration, firmware/BIOS updates, and out-of-band management via IPMI, iDRAC, or iLO.
Design, deploy, and operate virtualization platforms (VMware vSphere/ESXi, Proxmox, or Hyper-V), including clustering, HA, live migration, snapshots, and resource optimization.
Configure and maintain on-premises storage (SAN/NAS, iSCSI/NFS, RAID arrays) and ensure data integrity, performance, and capacity planning.
Install, configure, and optimize NVIDIA GPU servers for AI/ML workloads - including driver installation, CUDA/cuDNN setup, NVIDIA Container Toolkit, GPU passthrough, vGPU, or MIG partitioning where applicable.
Monitor GPU health and utilization using tools such as nvidia-smi, DCGM Exporter, and Prometheus/Grafana.
Administer Linux servers (Ubuntu, Rocky/CentOS, or equivalent).
Configure and maintain core network infrastructure: VLANs, switches (L2/L3), firewalls (Fortinet, Sophos, pfSense, or similar), VPN, and basic routing.
Design and operate backup and disaster recovery solutions (Veeam, Acronis, or equivalent); perform regular restore tests.
Monitor hardware health, manage physical infrastructure in the server room/data center (rack layout, UPS, cooling, cabling), and handle capacity planning.
Troubleshoot complex infrastructure issues end-to-end (hardware, OS, network, storage, virtualization, GPU) and implement root-cause solutions.
Partner with the
DevOps Engineer to provide reliable infrastructure for CI/CD pipelines, Kubernetes clusters, and application workloads - ensuring compute, storage, and network resources meet application needs.
Automate repetitive infrastructure tasks using Bash, Python, PowerShell, or Ansible.
Apply security best practices across infrastructure: patching, hardening, access control, network segmentation, and compliance with internal standards.
Document infrastructure architecture, runbooks, and operational procedures to support team knowledge and continuity.
Participate in on-call rotations for critical infrastructure incidents
Must have:
3-5 years of experience in Infrastructure Engineering, Systems Administration, or a closely related role with heavy focus on on-premises environments.
Hands-on experience with physical server provisioning and management: hardware installation, RAID configuration, firmware updates, iDRAC/iLO/IPMI.
Strong experience with server virtualization - VMware vSphere/ESXi, Proxmox, or Hyper-V (clustering, HA, vMotion/live migration, resource pools).
Solid administration of Linux servers (Ubuntu, Rocky/CentOS).
Practical knowledge of enterprise networking: VLANs, managed switches, firewalls, VPN, basic routing.
Experience with storage systems: SAN/NAS, iSCSI/NFS, RAID, capacity planning.
Experience implementing and maintaining backup and disaster recovery (Veeam, Acronis, or equivalent).
Proficiency in scripting with Bash, Python, or PowerShell for automation.
Experience with monitoring and logging tools (Grafana, Prometheus, Zabbix, ELK, Datadog, or similar).
Understanding of SLA, SLI, and SLO concepts and how they apply to infrastructure reliability.
Experience in on-call rotations and incident response
Strong problem-solving and analytical skills to address complex technical challenges.
Preferred Qualifications (High Priority):
Hands-on experience installing and managing NVIDIA GPU servers - including driver setup, CUDA/cuDNN, NVIDIA Container Toolkit, GPU passthrough, vGPU, or MIG.
Experience monitoring GPU metrics with DCGM Exporter, Prometheus, or similar tools.
Working knowledge of Kubernetes - enough to collaborate effectively with the DevOps Engineer on cluster infrastructure, especially around GPU device plugins, networking, and storage.
Experience with Ansible (or Terraform) for infrastructure automation and configuration management.
Experience supporting AI/ML teams running training or inference workloads on GPUs.
Nice to have (Bonus Points):
IT Support / Helpdesk experience - able to assist end users when needed (hardware issues, AD account management, peripherals).
Industry certifications such as VMware VCP, Microsoft MCSA/MCSE, Red Hat RHCSA/RHCE, Cisco CCNA, or equivalent.
Experience with containerization (Docker) beyond GPU use cases.
Familiarity with Infrastructure as Code (Terraform).
Exposure to Git and collaborating with development/DevOps teams via version control.
Good command of English (reading/writing technical documentation; basic communication).