Argonne National Laboratory
Postdoctoral Appointee – Heterogeneity in High Performance Computing

Position Description

The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and computational science expertise.

Heterogeneity in High Performance Computing (HPC) has never been greater, and most exascale systems deployed in the US will be accelerator-based. Understanding the performance of applications running on sizeable fractions of those machines will be a challenge, as the scale and complexity of those applications will be unprecedented. Nonetheless, the hybrid nature of these platforms offers a common opportunity to better understand how applications interact with the accelerator, which is where most of the computing power will reside. Indeed, the Application Programming Interfaces (APIs) of those accelerators, such as CUDA and OpenCL, are well-defined entry points that can be traced. By applying techniques derived from Model Centric Debugging to those APIs, it is possible to capture most of the accelerator-related context and events. Those techniques have already been leveraged in an HPC context for debugging and for porting an HPC application from CUDA to OpenCL.

At Argonne, in order to meet those challenges, we’ve been developing a collection of Model Centric Tracing tools that cover the APIs that will be encountered on exascale platforms. To meet the scalability and performance requirements, those tracers are based on LTTng and work in a similar manner to LTTng CLUST, but with Model Centric Tracing in mind. The fine level of control over the granularity of the captured traces allows those tracers to be used for a variety of purposes:

  •  Profiling accelerator usage of HPC applications,
  •  Debugging accelerator usage,
  •  Capturing traces that can be reinjected in simulation frameworks,
  •  Extracting kernels for replay, allowing study and tuning in a sandbox,
  •  Lightweight and transparent monitoring of platform usage.

Most of these uses remain to be invented or perfected and offer many opportunities for new research. In this context, Argonne’s ALCF is looking for a postdoctoral appointee to perform research and development on the collection of tracers and their uses. In particular, with the exascale Aurora platform expected next year, integration of the tracing framework and its scalability will be an important topic. Another important objective will be to collaborate with application developers to help them leverage the possibilities offered by the tracers. The work will take place in a multi-disciplinary environment and will offer opportunities to interact with a wide range of talents from the whole spectrum of HPC research. The successful candidate will be expected to present their work at major symposia and publish it in journals.

Position Requirements

  •  Recent PhD in a related field.
  •  Comprehensive knowledge of C/C++ programming under Unix/Linux.
  •  Comprehensive knowledge of one or more libraries and tools such as OpenCL, CUDA/HIP, ROCm, and Level Zero.
  •  Comprehensive knowledge of system programming.
  •  Considerable knowledge of parallel algorithms, I/O architectures, performance evaluation, and tuning.
  •  Considerable expertise in parallel programming, multicore systems, threading, and scientific application codes.
  •  Considerable software development, writing, and communication skills.
  •  Good collaborative skills, including the ability to work well with other divisions, laboratories, and universities.
  •  Good self-motivation to get involved and participate in the project team’s research, and to balance that against intense code development.
  •  Ability to create, maintain, and support high-quality software.
  •  Ability to model Argonne’s Core Values: Impact, Safety, Respect, Integrity, and Teamwork.

Application Documents:

  •  Cover letter (optional); uploaded as a PDF document
  •  Resume