The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and computational science expertise.
The Operations Group within the ALCF supports the mission of our science users by maintaining stable and performant systems at all levels, including hardware, networking, and software. The ALCF seeks candidates for multiple aspects of research in the context of operations and performance, including the following areas:
Scheduling, including smarter scheduling of data and compute resources, and mixed simulation/MPI and AI workloads.
Scalable event sourcing and microservice/event-based processing systems as a basis for a massively scalable HPC job scheduler.
Disaggregated memory, with particular interest in the impact and mitigation of increased latency of memory access due to remote memory in HPC applications.
Kubernetes, especially as integrated into HPC systems.
Scheduling APIs and cross-compatibility between HPC systems.
Recent PhD in a related field.
Comprehensive knowledge of C/C++ programming under Unix/Linux.
Comprehensive knowledge of Python programming under Unix/Linux.
Comprehensive knowledge of systems programming.
Considerable knowledge of parallel algorithms, I/O architectures, and performance evaluation and tuning.
Considerable expertise in parallel programming, multicore systems, threading, and scientific application codes.
Considerable software development, writing, and communication skills.
Good collaborative skills, including the ability to work well with other divisions, laboratories, and universities.
Good self-motivation to participate in the project team's research and to balance that against intensive code development.
Ability to create, maintain, and support high-quality software.
Knowledge of Scala/Akka or another functional language or actor system is a plus.
Ability to model Argonne’s Core Values: Impact, Safety, Respect, Integrity, and Teamwork.