Onsite

Node-Level Performance Optimization

This course covers advanced topics in code optimization for x86 platforms (Intel and AMD CPUs). We discuss techniques for analyzing and maximizing both single-core and multi-core performance within a single node. The topics include instruction-level parallelism, vectorization, and efficient utilization of cache and memory. The course consists of lectures and hands-on exercises.

Learning outcomes

– Awareness of features and internal workings of x86 CPUs
– Ability to analyze and assess single-node performance
– Ability to vectorize computations
– Ability to optimize cache and memory access

Prerequisites

– Good knowledge of C/C++ or Fortran
– Good knowledge of threading using OpenMP
– Basic knowledge of modern CPU architectures

Agenda

Day 1
– Overview of performance engineering
– General overview of modern multicore CPUs
– Main memory performance
– Performance analysis tools

Day 2
– Deeper dive into caches
– Detailed look at Intel and AMD CPUs
– Advanced vectorization
– Additional optimization topics