Platforms: OpenMP on Shared Memory Architectures
Contents
OpenMP Overview
What is OpenMP?
- Stands for Open Multi-Processing: open specifications for multi-processing, developed through collaborative work between interested parties from the hardware and software industry, government, and academia.
- OpenMP is an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.
- Is supported in C, C++, and Fortran, on most processor architectures and operating systems, including Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms.
- It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
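The small C sketch below illustrates all three of these components at once; the suggested file name hello.c and the thread count of 4 are purely illustrative.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Compiler directive: start a parallel region. */
    #pragma omp parallel
    {
        /* Runtime library routines: query the team. */
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

/* Environment variable: the team size can be set before running, e.g.
 *   export OMP_NUM_THREADS=4
 * Compile with an OpenMP-aware compiler, e.g.  gcc -fopenmp hello.c  */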
Why use OpenMP?
- OpenMP provides a standard among a variety of shared memory architectures/platforms.
- OpenMP establishes a simple and limited set of directives for programming shared memory machines. Significant parallelism can be implemented by using just three or four directives.
- OpenMP provides the capability to incrementally parallelize a serial program, unlike message-passing libraries such as MPI, which typically require an all-or-nothing approach (a small example follows this list).
- OpenMP provides the capability to implement both coarse-grain and fine-grain parallelism.
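As an illustration of the incremental style mentioned above, the sketch below parallelizes one serial loop by adding a single directive, leaving the rest of the program untouched; the array names and size are invented for the example.

#define N 1000000

double a[N], b[N];

/* Adding one directive to the serial loop is enough to run its
 * iterations in parallel; nothing else in the program changes. */
void scale(double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = factor * b[i];
}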
OpenMP Programming Model
- OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm. A shared memory process consists of multiple threads. OpenMP is an explicit, non-automatic programming model which offers the programmer full control over parallelization.
- OpenMP uses the fork-join model of parallel execution:
(Picture taken from http://www.ocgy.ubc.ca/~yzq/books/OpenMP.html)
- When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team. Starting from the beginning of the parallel region, the code is duplicated and all threads execute that code. There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point. If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is undefined.
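A minimal sketch of the fork-join behaviour just described, assuming a C compiler with OpenMP support: the master thread forks a team at the PARALLEL directive, every member of the team (the master has thread number 0) executes the duplicated code, and only the master continues past the implied barrier.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Serial region: master thread only\n");

    #pragma omp parallel                      /* fork: a team of threads is created */
    {
        int tid = omp_get_thread_num();       /* the master has tid == 0 */
        printf("Parallel region: hello from thread %d\n", tid);
    }                                         /* implied barrier, then join */

    printf("Serial region again: only the master continues\n");
    return 0;
}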
Much more information on OpenMP can be found in the sources listed at the end of this section.
OpenMP Compliant Shared Memory Architectures
Background
-
Parallel programming on shared memory machines has always been an important area in high performance computing (HPC). However, the utilization of such platforms has never been straightforward for the programmer.
-
The Message Passing Interface (MPI) commonly used on massively parallel distributed memory architectures offers good scalability and portability, but is non-trivial to implement with codes originally written for serial machines. It also fails to take advantage of the architecture of shared memory platforms.
-
The data-parallel extension to Fortran 90, High Performance Fortran (HPF), offers easier implementation, but lacks the efficiency and functionality of MPI.
-
Over the years there have been several other products from both hardware and software vendors which have offered scalability and performance on a particular platform, but the issue of portability has always been raised when using these products.
-
OpenMP is the proposed industry standard Application Program Interface (API) for shared memory programming. It is based on a combination of compiler directives, library routines and environment variables that can be used to specify shared memory parallelism in Fortran and C/C++ programs. OpenMP is intended to provide a model for parallel programming that is portable across shared memory architectures from different vendors. In relation to other parallel programming techniques it lies between HPF and MPI in that it has the ease of use of HPF, in the form of compiler directives, combined with the functionality of MPI.
Popular Architectures
The shared memory architecture consists of a number of processors which each have access to a global memory store via some interconnect or bus. The key feature is the use of a single address space across the whole memory system, so that all the processors have the same view of memory. The processors communicate with one another by one processor writing data into a location in memory and another processor reading the data. With this type of communication the time to access any piece of data is the same, as all of the communication goes through the bus. The advantage of this type of architecture is that it is easy to program, as there are no explicit communications between processors; communication is handled via the global memory store.
(Picture taken from anusf.anu.edu.au/~dbs900/OpenMP/openmp)
Since different threads communicate with each other by reading and writing shared memory, the latencies involved in these communications are an important factor for overall performance. Two different kinds of latency can be distinguished: first, the latency to access main memory; second, the latency that occurs in the direct communication between two threads.
- Latency to main memory: With rapidly increasing clock frequencies, the latency to main memory measured in clock cycles keeps growing. Techniques have therefore been developed to reduce or hide this latency; examples are caches, out-of-order execution, and data prefetching. Nevertheless, as soon as these optimization techniques fail, the cost of a memory access is dominated by the latency.
- Interprocess memory latency: In OpenMP, threads communicate with each other by sharing variables. How long this takes depends on the specifics of the architecture: one thread has to write to main memory while another thread has to read from it. In a bus-based system such as a PC, the second thread can fetch the cache line while it is being written; on other systems, the whole cache line has to be read before it can be written.
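This cache-line traffic becomes visible, for example, when several threads repeatedly update variables that happen to share one cache line; a common remedy is to pad per-thread data so that each thread writes to its own line. The sketch below is illustrative only, and the 64-byte line size is an assumption, not something taken from the text above.

#include <omp.h>

#define NTHREADS   4
#define LINE_SIZE  64            /* assumed cache-line size in bytes */

/* Padding each per-thread counter to a full cache line keeps the
 * threads from repeatedly passing one shared line back and forth. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};

struct padded_counter counters[NTHREADS];

void count_events(long n)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < n; i++)
            counters[tid].value++;    /* each thread stays on its own line */
    }
}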
Three of the main platforms for shared memory architectures are presented:
- PC: A dual-processor Pentium 4 PC with 2 gigabytes of RDRAM memory. It serves as an example of a simple bus-based system.
- SGI Origin2000: An Origin 2000 system is composed of nodes linked together by an interconnection network. It uses the distributed shared memory S2MP (Scalable Shared-Memory Multiprocessing) architecture.
- Sun HPC servers: A range of UNIX server computers produced by Sun Microsystems from 1996 to 2001. These systems were based on the 64-bit UltraSPARC microprocessor architecture.
Multi-Core PCs
Processors have been consistently getting faster. But the more rapidly they can perform instructions, the quicker they need to receive the values of operands from memory. Unfortunately, the speed with which data can be read from and written to memory has not increased at the same rate. In response, the vendors have built computers with hierarchical memory systems, in which a small, expensive, and very fast memory called cache memory, or cache for short, supplies the processor with data and instructions at high rates. Each processor of an SMP needs its own private cache if it is to be fed quickly; hence, not all memory is shared.
Data is copied into cache from main memory: blocks of consecutive memory locations are transferred at a time. Since the cache is very small in comparison to main memory, a new block may displace data that was previously copied in. An operation can be (almost) immediately performed if the values it needs are available in cache. But if they are not, there will be a delay while the corresponding data is retrieved from main memory. Hence, it is important to manage cache carefully.
(Picture taken from http://mitpress.mit.edu/books/chapters/0262533022chap1)
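Because a whole block of consecutive memory locations is transferred at a time, traversing data in the order it is laid out in memory makes full use of every block brought into cache. A small sketch, assuming a row-major C array with an invented size:

#define N 1024

double m[N][N];

double sum_all(void)
{
    double sum = 0.0;
    /* Row-major traversal: consecutive iterations touch consecutive
     * memory locations, so each cache line fetched is fully used.
     * Interchanging the two loops would stride through memory and
     * waste most of every line copied into cache. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}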
A processor is basically a unit that reads and executes program instructions, which are fixed-length (typically 32 or 64 bits) or variable-length chunks of data. The data in the instruction tells the processor what to do. The instructions are very basic things like reading data from memory or sending data to the user display, but they are processed so rapidly that we experience the results as the smooth operation of a program.
A core is the part of the processor which performs the reading and executing of instructions. Single-core processors can only execute one instruction at a time. As the name implies, multi-core processors are composed of more than one core; a very common example is a dual-core processor. The advantage of a multi-core processor over a single-core one is that it can either use both of its cores to accomplish a single task, or it can spawn threads that divide the work between its cores, so that the task completes in roughly half the time it would take on a single-core processor. Multi-core processors can also execute multiple tasks at the same time; a common example is watching a movie in Windows Media Player while your dual-core processor runs a background virus check. A multi-core processor is a shared memory processor: all cores share the same memory, and all cores are on the same chip.
OpenMP divides tasks into threads; a thread is the smallest unit of processing that can be scheduled by an operating system. The master thread assigns tasks to worker threads, and the threads then execute the tasks in parallel using the multiple cores of a processor.
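As a sketch of how independent pieces of work can be divided between cores, the sections construct below lets different threads of the team run two unrelated tasks at the same time; the two functions are empty stand-ins invented for the example.

#include <omp.h>

void play_movie_frame(void)      { /* stand-in for one independent task */ }
void scan_file_for_viruses(void) { /* stand-in for another independent task */ }

void run_both(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        play_movie_frame();          /* may run on one core */

        #pragma omp section
        scan_file_for_viruses();     /* may run on another core */
    }
}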
SGI Origin2000
This is effectively a hybrid shared and distributed memory architecture.
The memory is physically distributed across nodes, with the two processors located at each node having equal access to their local memory. It is a shared memory platform in the sense that all other nodes have similar access to this memory but are physically more distant; nevertheless, it can still be programmed as a symmetric multi-processor (SMP) machine. As the number of nodes accessing this memory increases, a bottleneck will arise, but this is a limitation one would expect.
Sun HPC Servers
Servers such as the Enterprise 3000 or Enterprise 10000. These are true shared memory boxes, with the E3000 containing 1 to 6 processors and the E10000 between 4 and 64 processors.
Citation/Sources Used
Using OpenMP: Portable Shared Memory Parallel Programming, by Barbara Chapman, Gabriele Jost, and Ruud van der Pas.
OpenMP: A Parallel Programming Model for Shared Memory Architectures, by Paul Graham.
OpenMP Tutorials, by Blaise Barney, Lawrence Livermore National Laboratory.