Parallel Programming: Writing and Tuning for Performance and Scalability


Course Highlights:

“Programming on Multicore: Writing and Tuning for Performance and Scalability” is a two-day course (with an optional third day) that explains how the underlying multicore architecture affects program performance, and teaches the knowledge and skills needed to write and tune programs for maximum performance and scalability.

The course covers an overview of parallel programming, an overview of multicore architectures, techniques to improve cache performance, techniques to improve virtual memory performance, scalability of parallel programs, programming on graphics processing units (GPUs), and parallel programming using Google’s MapReduce. It is designed to equip professionals or code developers who have experience in C or C++ programming with the ability to measure, tune, and maximize program performance and scalability on a given platform. Roughly 30% of the course is hands-on programming, solving several problems that are relevant to real-world scenarios. The hands-on programming is based on the industry-standard OpenMP, as well as on an emerging programming model called MapReduce.

Objective of the course:

This course focuses on how to maximize the performance of sequential and parallel programs, and the scalability of parallel programs. As the number of cores on a chip continues to increase rapidly, programmers who can write and tune programs that exploit a large number of cores will keep a competitive edge.

It is widely known that parallel program performance is often non-portable across platforms. Writing parallel programs for a multicore platform requires a good understanding of the specific multicore architecture that forms the target platform. Consider the following questions:

  1. Do you know how cache configurations affect program performance? Cache configurations include the number of cache levels, cache size, cache block size, and whether the cache is shared by multiple cores or is private per core.
  2. Do you know how to identify and pinpoint the source of performance bottlenecks?
  3. Do you know how to write and tune programs in order to maximize their performance on a specific multicore architecture?
  4. Do you know how virtual memory management affects program performance, and how to write or tune programs for better virtual memory performance?
  5. Do you know how various synchronization primitives differ in terms of latency and scalability, and which one to use for a particular situation?
  6. Do you know how to write and tune programs for Graphics Processing Units (GPUs) and for the MapReduce model?

If you are curious about the answers to the above questions, this course is for you. Code developers who know how to exploit the potential of multicore processors will have a significant competitive advantage over those who do not.

The course objective is to equip code developers and other professionals with the foundational concepts and techniques for measuring, tuning, and maximizing program performance, and with the practical skills to write high-performing and scalable programs.

Who Should Attend:

Code developers or other professionals who have experience in C or C++ programming. The course does not assume prior knowledge of multicore architecture or parallel programming. However, participants are encouraged to take the complementary course “Fundamentals of Parallel Programming”, which provides skills in identifying, expressing, and exploiting parallelism in multicore systems, in order to acquire a complete skill set in multicore programming.

Course Outline:

1. Overview of Parallel Programming
- Refresher on OpenMP
- Types of parallelism: data parallelism, function parallelism, and pipeline parallelism
- Why parallel program performance is not portable across platforms
- Trends in current and future multicore processors and their implications for software development

2. Overview of Multicore Architecture
- Memory hierarchy organization of current multicore systems
- Thread contexts vs. cores on chip vs. cores across chips, and their implications for performance
- Coherent vs. non-coherent multicore architecture
- Two diverging markets: few powerful processor cores versus many simple processor cores
- Future trends: heterogeneous multicore, integration of CPUs and GPUs, systems on chip, and accelerators

3. Tools for Measuring and Diagnosing Program Performance
- Clock timer
- Hardware counter profiling
- Code profiling: identifying hot spots
- Microbenchmarks and their uses
- Memory benchmark and how to extract memory hierarchy parameters from the target machine
- Techniques for removing volatility of program performance
- How to pin threads to specific CPUs

4. Program Performance and Cache Organization
- Overview of caching mechanism
- Types of cache misses: cold, capacity, conflict, true sharing, and false sharing
- Techniques to reduce various types of cache misses
- Program behavior in spatial locality and temporal locality
- Various loop transformations: blocking, unrolling, and interchange
- Improving cache performance for dense matrix algorithms
- Improving cache performance of classes and structures

5. Program Performance and Virtual Memory Organization
- Overview of virtual memory
- Various roles of paging
- Page table organization and Translation Lookaside Buffer (TLB)
- Implications of page table organizations on performance
- Techniques to improve virtual memory performance

6. Scalability of Parallel Programs
- Amdahl’s Law
- Load balancing techniques: static vs. dynamic task distribution and scheduling
- Scalable lock design and implementation: Peterson, atomic instruction-based, ticket, array-based
- Scalable barrier design and implementations

7. Pessimistic vs. Optimistic Concurrency
- Pessimistic vs. optimistic concurrency
- Lock-free data structures
- Multiple Compare-and-Swap (MCAS) and Software transactional memory (STM)
- Major limitations of MCAS and STM

8. Programming on GPUs (Optional third day material)
- Overview of general-purpose graphics processing units (GP-GPUs)
- Programming model for GPUs
- Expressing parallelism with CUDA
- Memory model for GPUs
- Thread creation and scheduling mechanism
- Optimizing memory performance in GPUs

9. Programming with MapReduce (Optional third day material)
- Overview of MapReduce programming model
- Specifying map() and reduce() functions
- Advantages and drawbacks
- Limitations of MapReduce

Notes:
- Students can take this course as a two-day course covering Modules 1-7.
- Students can also take this course as a three-day course covering Modules 1-9.
- Each module includes Q&A sessions.
- Modules 3, 4, 6, and 9 involve hands-on programming exercises.

Who We Are
We are a professional organisation providing training services to companies. We offer a comprehensive range of training courses, workshops and seminars covering every aspect of engineering.

We provide various training programs that meet the immediate and future needs of engineers. The training is delivered as seminars, hands-on workshops, project-based tutorials, or a mixture of these, to bring the maximum learning benefit to the engineers.
Our Trainers
We have a quality pool of leading authorities, worldwide experts and fully trained professionals who are constantly striving to uncover the pitfalls and best practices of modern technology development.
     
All rights reserved by
Omniscient International