HPC Systems Engineer
About Applied Digital:
Applied Digital (NASDAQ: APLD) operates next-generation data centers for high-performance compute and Machine Learning applications. As a rapidly growing publicly traded company, we seek an HPC Systems Engineer who understands state-of-the-art HPC data center system hardware and systems design.
The Data Center Systems Engineer [HPC] will be a subject matter expert in High Performance Computing (HPC) infrastructure and storage, working with a team responsible for engineering, deploying, and supporting HPC-based clusters and data centers. You will provide technical guidance to a small team of system administrators and perform the activities required to design, build, support, and automate large, complex HPC data center server systems.
The Systems Engineer will also own Applied Digital’s data center IT systems for existing cryptocurrency mining, including design and engineering decisions. Additionally, this role will lead a team of systems engineers at data centers across North America, with both technical architecture and management responsibilities.
Primary Job Duties
Engineer HPC computer systems based on customer requirements, budgets, timelines, and parts availability.
Design and implement scalable systems, automation, and architectures.
Support the HPC design, construction, engineering, operations, networking, and storage teams.
Enhance system efficiency, robustness, and scalability.
Lead capacity planning to help determine compute and storage growth needs.
Apply in-depth HPC and Linux expertise to collaborate with stakeholders across IT and domain disciplines to expand HPC use cases.
Evaluate, analyze, and integrate HPC technologies such as job schedulers, high performance interconnects, networked filesystems, cybersecurity, cluster management, virtualization, networking, performance tuning, and data center planning.
Own the job scheduler (such as SLURM), including configuration, optimization, and advanced features.
Assist customers with dataset storage and systems to support their requirements.
Help customers optimize and troubleshoot complex ML/AI jobs and pipelines.
Act as the senior engineer assessing innovative technologies and integrating existing commercial and open-source automation solutions.
Work closely with the network team to define and design network requirements for systems environments.
Qualifications
Experience engineering, developing, deploying, and operating large-scale distributed systems
Experience as a systems, data center, or DevOps engineer in a complex HPC data center environment
Experience with Job Schedulers for High Performance Computing (HPC) systems, including consideration of resilience, memory, scalability, and central processing unit (CPU) footprint.
Experience doing performance analysis studies of automation and applications on HPC system architectures.
Experience working with containerization and microservice technologies such as Kubernetes (K8s), Docker, and Singularity
Experience implementing and supporting High-Performance Compute (HPC) clusters
Experience with Virtualization, Windows, and Linux-based operating systems in HPC environments
Experience with various Processor architectures (e.g., CPU, GPU, FPGA)
Experience with assorted Memory architectures (e.g., DRAM, DDR, HBM, persistent memories)
Experience with large-scale storage and filesystems (e.g., Flash, NVMe, HDD)
Experience with enterprise automation development (Python, shell scripting, etc.) and with the processes needed to communicate with data scientists and ML engineers and to orchestrate large-scale server clusters effectively
Experience with Open Cloud Platform (OCP) a plus
Knowledge of systems management, logging, and monitoring systems
Demonstrated networking knowledge across all OSI layers