Company
Applied Digital
Info
Full Time
Closes: 1 April 2023
  • April 1, 2023
  • HPC Systems Administrator

    Job Summary

    The HPC Systems Administrator has worked in state-of-the-art GPU-centric HPC data centers and has proven ownership and accountability over production-level environments. The candidate must be self-sufficient, be comfortable in a fast-paced greenfield environment, and have a growth mindset.

    Education and Experience

    ·       5-7 years of experience, with at least 2 years of hands-on responsibilities in a GPU HPC environment

    ·       Strong understanding of AI and ML workloads, including deep learning frameworks such as TensorFlow and PyTorch

    ·       Strong knowledge of the cluster management, scheduling, and reporting tools used in enterprise and HPC data center operations, such as Kubernetes, Singularity, Slurm, SlurmDB, and Grafana

    ·       Demonstrated mastery of at least one major Linux server distribution, with working proficiency in several others

    ·       Strong understanding of security trade-offs with various architectures and virtualization techniques

    ·       Comfort with, and a track record of, scripting and automating systems and processes using tools such as Bash, Python, and Ansible

    ·       Proven experience learning complex new systems and technologies

    ·       Full-stack technical depth of HPC Linux clustered environments

    ·       Demonstrated ability to contribute to and continually improve cutting-edge HPC systems environments

    ·       Knowledge and understanding of advanced HPC systems architectures, technologies, packages, and workloads, in line with industry standards

    ·       Excellent verbal and written communication skills, with the ability to communicate complex concepts to non-technical internal and external stakeholders

    ·       Proven track record of effectively prioritizing heavy workloads in a fast-moving environment

    ·       Positive and constructive attitude, strong attention to detail, and the ability to work productively with others

    ·       Must be comfortable working in a rapidly growing startup that operates an enterprise-level production data center environment

    ·       Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field is required

    ·       Strong engineering background with good judgment, rationale, and technical aptitude

    Primary Job Duties

    ·       Accountable for the build-out, documentation, and administration of a new HPC data center environment, including monitoring, cluster management, and server systems

    ·       Evaluate and implement cluster management and scheduling systems for ML, AI, and deep learning workloads

    ·       Responsible for building, deploying, and managing HPC system images

    ·       Ensure high availability and uptime to meet customer service level agreements and company excellence standards

    ·       Challenge the status quo with a relentless focus on constant iteration and improvement, including but not limited to automation, monitoring, alerting, updates, and technical refreshes

    ·       Take a hands-on approach, working cross-functionally with stakeholders and peers to accomplish individual, team, and company goals

    ·       Ensure compliance with industry regulations and standards such as SOC 1 & 2 and ISO 27001/2

    ·       Stay current with the latest trends and technologies, and ensure that the company’s infrastructure is competitive with comparable state-of-the-art systems

    ·       Develop key metrics and provide regular reports to senior management on the status of systems, deployments, and customer workloads