The HPC Systems Administrator has worked in state-of-the-art, GPU-centric HPC data centers and has demonstrated ownership of and accountability for production environments. The candidate must be self-sufficient, be comfortable in a fast-paced greenfield environment, and have a growth mindset.
Education and Experience
· 5-7 years of experience, with at least 2 years of hands-on responsibilities in a GPU HPC environment
· Strong understanding of AI and ML workloads, including deep learning frameworks (TensorFlow, PyTorch, etc.)
· Strong knowledge of cluster management, scheduling, and reporting in enterprise and HPC data center operations, using tools such as Kubernetes, Singularity, Slurm, SlurmDBD, and Grafana
· Demonstrated mastery in at least one major Linux server distribution with comfort and proficiency in a variety of Linux server distributions
· Strong understanding of security trade-offs with various architectures and virtualization techniques
· Comfort with, and a track record of, scripting and systems and process automation using tools such as Bash, Python, and Ansible
· Proven experience learning complex new systems and technologies
· Full-stack technical depth in HPC Linux clustered environments
· Demonstrated ability to contribute to and constantly improve cutting-edge HPC systems environments
· Knowledge and understanding of advanced HPC systems architectures, technologies, packages, and workloads, in line with industry standards
· Excellent verbal and written communication skills, with the ability to communicate complex concepts to non-technical internal and external stakeholders
· Proven track record of effectively prioritizing heavy workloads in a fast-moving environment
· Positive and constructive attitude with strong attention to detail and the ability to work productively with others
· Must be comfortable in a rapidly growing startup environment as well as in an enterprise-level production data center environment
· Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field is required
· Strong engineering background with good judgment, rationale, and technical aptitude
Primary Job Duties
· Accountable for the build-out, documentation, and administration of a new HPC data center environment, including monitoring, cluster management, and server systems
· Evaluate and implement cluster management and scheduling systems for ML, AI, and Deep Learning workloads
· Responsible for building, deploying, and managing HPC system images
· Ensure high availability and uptime to meet customer service level agreements and company excellence standards
· Challenge the status quo with a relentless focus on constant iteration and improvement, including but not limited to automation, monitoring, alerting, updates, and technical refreshes
· Display a hands-on approach, working cross-functionally with stakeholders and peers to accomplish individual, team, and company goals
· Ensure compliance with industry regulations and standards such as SOC 1 & 2, ISO 27001/2, etc.
· Stay current with the latest trends and technologies, and ensure that the company’s infrastructure is competitive with comparable state-of-the-art systems
· Develop key metrics and provide regular reports to senior management on the status of systems, deployments, and customer workloads