Performance Analysis and Optimization
Coordinated by: Călin Bîră
Master: Advanced Computing in Embedded Systems (in English)
Rooms: A415
Description:
This course introduces the basic tools engineers need to evaluate and improve application performance. It is part of the core competency set because it is a prerequisite for more advanced system and architecture design and implementation.
The objectives are to train students to:
- profile an application using specific tools;
- identify and mitigate bottlenecks, and write high-performance code;
- understand the operating system, CPU microarchitecture and runtime concepts relevant to performance optimization;
- identify the major blocks of a compiler / interpreter and give examples from language internals;
- identify major hotspots in a compiler / interpreter and propose optimizations.
Lectures take 25% of the time and serve as a theoretical introduction to the technology. Labs take the remaining 75% and are a hands-on approach to solving a particular problem.
Students are taught practical estimation, speed optimization and code-size / RAM optimization. Profiling is done using Valgrind and its Cachegrind tool.
- Utilized architectures & technologies: 8-bit PIC C/ASM, x64, x64-SIMD, GP-GPU (OpenCL), custom hardware.
- Speed optimizations: g++ compiler levels, x64 Look-Up Table, OpenMP, SSE4/AVX/AVX2 intrinsics, dedicated hardware accelerator, OpenCL GPU/CPU accelerator
- Code-size / RAM optimization: 8-bit PIC (microcontroller) programmed in C and ASM
Course contents
Course 1: Introduction, course and tools presentation.
Course 2: x64 implementation of bit-order inversion for 100 million 32-bit unsigned integers. Starting from a naive implementation, students are guided through the loop version, the loop-unrolled version, compiler optimization levels and, eventually, the 8-bit LUT version. Homework: 16-bit and 32-bit LUT implementations. Discussions about caches in general and the micro-op cache in particular, I/O and CPU performance.
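The step from the loop version to the 8-bit LUT version can be sketched as follows. This is a minimal illustration, not the course's reference solution: the function names (reverse_bits_loop, make_byte_lut, reverse_bits_lut, reverse_all) are hypothetical, and the table is built at first use rather than hard-coded.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Naive loop version: moves one bit per iteration, 32 iterations per value.
inline std::uint32_t reverse_bits_loop(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// Builds the 256-entry table: entry b holds byte b with its 8 bits reversed.
inline std::array<std::uint8_t, 256> make_byte_lut() {
    std::array<std::uint8_t, 256> lut{};
    for (unsigned b = 0; b < 256; ++b) {
        unsigned r = 0;
        for (int i = 0; i < 8; ++i)
            if (b & (1u << i)) r |= 1u << (7 - i);
        lut[b] = static_cast<std::uint8_t>(r);
    }
    return lut;
}

// 8-bit LUT version: reverse each byte through the table, then swap byte order.
inline std::uint32_t reverse_bits_lut(std::uint32_t x) {
    static const std::array<std::uint8_t, 256> lut = make_byte_lut();
    return (std::uint32_t{lut[x & 0xFFu]} << 24) |
           (std::uint32_t{lut[(x >> 8) & 0xFFu]} << 16) |
           (std::uint32_t{lut[(x >> 16) & 0xFFu]} << 8) |
           std::uint32_t{lut[(x >> 24) & 0xFFu]};
}

// Applies the LUT version in place to a whole buffer of values.
inline void reverse_all(std::vector<std::uint32_t>& data) {
    for (auto& v : data) v = reverse_bits_lut(v);
}
```

The 256-byte table fits comfortably in the L1 data cache, which is one reason the 8-bit variant is a useful baseline before trying larger tables.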
Course 3: ML-L3 Infrared remote control for Nikon DSLR.
Task: Given the waveform of the triggering signal, drive the GPIOs of a very weak MCU using the least amount of code size and RAM. First, write space-optimized C code (by analyzing the main loop). Students are taught to look into program memory and to run the code step by step in the GPIO simulator. Introduction to digital system design and breadboarding.
Hardware: 8-bit Microchip PIC10F200 (the cheapest MCU found, with 33 instructions / 256 words of program memory / 16 bytes of RAM).
Software: MPLAB IDE with PICC C compiler and MPASM assembler.
Course 4: Same as Course 3, but the implementation is now in ASM. A prototype is built and programmed with the hex file. The prototype is tested, and photos are downloaded from the DSLR.
Course 5: SIFT matching speed optimization (computation of the L1 or L2 distance between 2 groups of points in 128-dimensional space). Introduction to SIMD (SSE / AVX / NEON).
Software tools: Intel Intrinsics Guide
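As an illustration of the SIMD idea, the sketch below computes the L1 distance between two 128-dimensional descriptors with SSE2 intrinsics. It assumes each descriptor is stored as 128 unsigned bytes, which is a common SIFT layout but is not stated in the course description. The names kDim, l1_scalar and l1_simd are hypothetical. On targets without SSE2 the SIMD function falls back to the scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

constexpr std::size_t kDim = 128;  // assumed descriptor length, one byte per dimension

// Scalar reference: sum of absolute differences over all 128 dimensions.
inline std::uint32_t l1_scalar(const std::uint8_t* a, const std::uint8_t* b) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < kDim; ++i)
        sum += static_cast<std::uint32_t>(std::abs(int(a[i]) - int(b[i])));
    return sum;
}

// SIMD version: processes 16 dimensions per iteration.
inline std::uint32_t l1_simd(const std::uint8_t* a, const std::uint8_t* b) {
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < kDim; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // _mm_sad_epu8 sums |a - b| over each group of 8 bytes
        // and leaves the two partial sums in the two 64-bit lanes.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    // Fold the upper 64-bit lane onto the lower one and extract the result.
    __m128i hi = _mm_srli_si128(acc, 8);
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(_mm_add_epi64(acc, hi)));
#else
    return l1_scalar(a, b);  // portable fallback when SSE2 is unavailable
#endif
}
```

The largest possible result is 128 * 255 = 32640, so the 32-bit return type cannot overflow. Comparing l1_simd against l1_scalar on the same inputs is a simple way to validate the vectorized path before timing it.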
Course 6: g++ compiler optimization levels. OpenMP. Applied to the FFT256k Cooley-Tukey algorithm using 16-bit real / 16-bit imaginary numbers.
Course 7: Optimizing DES/3DES (a software-resistant, symmetric encryption algorithm) as a bit-sliced implementation of the S-boxes. The S-boxes are treated as hardware circuits and are minimized accordingly using 2-input AND/OR/XOR/... gates, depending on the instructions of the 64-bit machine we use.
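The bit-slicing idea can be illustrated with a toy circuit. In the sketch below, each 64-bit word holds one input wire for 64 independent instances, so one bitwise instruction evaluates one gate for all 64 instances at once. The 3-input, 2-output function is hypothetical and is NOT a DES S-box (real DES S-boxes map 6 bits to 4 bits); the names Sliced3, Sliced2, toy_sbox_sliced and toy_sbox_scalar are also assumptions made for illustration.

```cpp
#include <cstdint>

// One 64-bit word per input wire: bit k of each word belongs to instance k.
struct Sliced3 { std::uint64_t a, b, c; };
struct Sliced2 { std::uint64_t out0, out1; };

// Toy circuit (hypothetical, not a DES S-box), written as 2-input gates:
//   out0 = (a AND b) XOR c
//   out1 = (a OR  c) XOR b
// Four bitwise instructions evaluate the circuit for 64 instances in parallel.
inline Sliced2 toy_sbox_sliced(const Sliced3& in) {
    return { (in.a & in.b) ^ in.c, (in.a | in.c) ^ in.b };
}

// Scalar reference for a single instance; each argument is 0 or 1.
// Returns out1 in bit 1 and out0 in bit 0.
inline unsigned toy_sbox_scalar(unsigned a, unsigned b, unsigned c) {
    unsigned out0 = (a & b) ^ c;
    unsigned out1 = (a | c) ^ b;
    return (out1 << 1) | out0;
}
```

In a real bit-sliced cipher the input blocks must first be transposed into this one-wire-per-word layout, and the cost of that transposition is part of the performance estimate.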
Course 8: ASIFT C++ profiling and optimizations, leading to an estimate of the performance gain achievable with a custom hardware accelerator. Estimating the required I/O performance.
Course 9: Introduction to GP-GPU, CUDA, OpenCL. Parallel addition of two vectors.
Course 10: Matrix-column normalization in OpenCL. Comparing CPU and CPU + GPU configurations using the same OpenCL code. Comparison with non-OpenCL CPU code.