Company
Applied Digital
Full Time
Closes: 1 April 2023
Applications have closed
HPC Systems Administrator

Job Summary

The HPC Systems Administrator has worked in state-of-the-art GPU-centric HPC data centers and has proven ownership and accountability over production-level environments. The candidate must be self-sufficient, be comfortable in a fast-paced greenfield environment, and have a growth mindset.

Education and Experience

·       5-7 years of experience, with at least 2 years of hands-on responsibilities in a GPU HPC environment

·       Strong understanding of AI and ML workloads, including deep learning frameworks (TensorFlow, PyTorch, etc.).

·       Strong knowledge of cluster management, scheduling, and reporting tools used in enterprise and HPC data center operations, such as Kubernetes, Singularity, Slurm, SlurmDB, and Grafana

·       Demonstrated mastery of at least one major Linux server distribution, with working proficiency in several others

·       Strong understanding of security trade-offs with various architectures and virtualization techniques

·       Comfort with, and a track record of, scripting and systems and process automation using tools such as Bash, Python, and Ansible

·       Proven experience learning complex new systems and technologies

·       Full-stack technical depth of HPC Linux clustered environments

·       Demonstrated ability to contribute to and continually improve cutting-edge HPC systems environments

·       Knowledge and understanding of advanced HPC systems architectures, technologies, packages, and workloads, in line with industry standards

·       Excellent verbal and written communication skills, with the ability to communicate complex concepts to non-technical internal and external stakeholders

·       Proven track record of effectively prioritizing heavy workloads in a fast-moving environment

·       Positive and constructive attitude with strong attention to detail and the ability to work productively with others

·       Must be comfortable in a rapidly growing startup environment while operating an enterprise-level production data center environment

·       Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field is required

·       Strong engineering background with good judgment, rationale, and technical aptitude

Primary Job Duties

·       Accountable for the build-out, documentation, and administration of a new HPC data center environment, including monitoring, cluster management, and server systems

·       Evaluate and implement cluster management and scheduling systems for ML, AI, and Deep Learning workloads

·       Responsible for building, deploying, and managing HPC system images

·       Ensure high availability and uptime to meet customer service level agreements and company excellence standards

·       Challenge the status quo with a relentless focus on constant iteration and improvement, including but not limited to automation, monitoring, alerting, updates, and technical refreshes

·       Take a hands-on approach, working cross-functionally with stakeholders and peers to accomplish individual, team, and company goals

·       Ensure compliance with industry regulations and standards such as SOC 1 and SOC 2 and ISO 27001/2

·       Stay current with the latest trends and technologies, and ensure that the company’s infrastructure is competitive with comparable state-of-the-art systems

·       Develop key metrics and provide regular reports to senior management on the status of systems, deployments, and customer workloads