The HPC Systems Administrator has worked in state-of-the-art, GPU-centric HPC data centers and has demonstrated ownership of and accountability for production environments. The candidate must be self-sufficient, be comfortable in a fast-paced greenfield environment, and have a growth mindset.
Education and Experience
· 5-7 years of experience, with at least 2 years of hands-on responsibilities in a GPU HPC environment
· Strong understanding of AI and ML workloads, including deep learning frameworks (TensorFlow, PyTorch, etc.)
· Strong knowledge of cluster management, scheduling, and reporting in enterprise and HPC data center operations, using tools such as Kubernetes, Singularity, Slurm, SlurmDBD, and Grafana
· Demonstrated mastery in at least one major Linux server distribution with comfort and proficiency in a variety of Linux server distributions
· Strong understanding of security trade-offs with various architectures and virtualization techniques
· Comfort with, and a track record of, scripting and systems and process automation using tools such as Bash, Python, and Ansible
· Proven experience learning complex new systems and technologies
· Full-stack technical depth in HPC Linux clustered environments
· Demonstrated ability to contribute to and constantly improve cutting-edge HPC systems environments
· Knowledge and understanding of advanced HPC systems architectures, technologies, packages, and workloads, in line with industry standards
· Excellent verbal and written communication skills, with the ability to communicate complex concepts to non-technical internal and external stakeholders
· Proven track record of effectively prioritizing heavy workloads in a fast-moving environment
· Positive and constructive attitude with strong attention to detail and the ability to work productively with others
· Must be comfortable in a rapidly growing startup environment as well as in an enterprise-level production data center environment
· Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field is required
· Strong engineering background with good judgment, rationale, and technical aptitude
Primary Job Duties
· Accountable for the build-out, documentation, and administration of a new HPC data center environment, including monitoring, cluster management, and server systems
· Evaluate and implement cluster management and scheduling systems for ML, AI, and Deep Learning workloads
· Responsible for building, deploying, and managing HPC system images
· Ensure high availability and uptime to meet customer service level agreements and company excellence standards
· Challenge the status quo with a relentless focus on constant iteration and improvement, including but not limited to automation, monitoring, alerting, updates, and technical refreshes
· Display a hands-on approach, working cross-functionally with stakeholders and peers to accomplish individual, team, and company goals
· Ensure compliance with industry regulations and standards such as SOC 1 & 2, ISO 27001/2, etc.
· Stay current with the latest trends and technologies, and ensure that the company’s infrastructure is competitive with comparable state-of-the-art systems
· Develop key metrics and provide regular reports to senior management on the status of systems, deployments, and customer workloads