References
2011
ASPLOS
On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, and 2 more authors
In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Mar 2011
Power-efficient, massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed, but how to achieve those gains through software approaches on modern GPUs remains an open question. This paper presents a systematic exploration of how to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of these dynamic irregularities, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among the optimizations. Its optimization overhead is largely transparent to GPU kernel executions, so it does not jeopardize the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07x and 2.5x for a variety of applications.
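The data reordering and job swapping transformations can be pictured with a small sketch. The CUDA fragment below is only an illustration of the general idea, not G-Streamline's actual code: the kernel, the labeling scheme, and the helper names (process, buildJobMap) are assumptions for the example, and the kernel launch, data transfers, and the runtime adaptation logic are omitted.

```cuda
// A minimal sketch of job swapping driven by a host-side job map.
// Assumption: each element's "label" (e.g., its branch outcome) is known at
// run time; sorting job indices by label groups like jobs into the same
// warps, removing branch divergence. Fully coalesced accesses would
// additionally require relocating the data itself (data reordering).
#include <vector>
#include <numeric>
#include <algorithm>

__global__ void process(const float *data, const int *jobMap,
                        float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int job = jobMap[tid];                  // job swapping: thread tid runs job `job`
    float v = data[job];
    out[job] = (v > 0.0f) ? v * 2.0f : -v;  // warp-mates now take the same path
}

// Host side: build the thread-to-job map by sorting job indices on their
// runtime labels (hypothetical helper, not part of the framework's API).
std::vector<int> buildJobMap(const std::vector<int> &label) {
    std::vector<int> map(label.size());
    std::iota(map.begin(), map.end(), 0);
    std::stable_sort(map.begin(), map.end(),
                     [&](int a, int b) { return label[a] < label[b]; });
    return map;
}
```

In the spirit of the abstract, the map construction would run on the CPU and overlap with GPU kernel executions, which is how the optimization overhead stays largely transparent to the kernels.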
2010
PPoPP
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?
Eddy Z. Zhang, Yunlian Jiang, and Xipeng Shen
In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Jan 2010
Most modern Chip Multiprocessors (CMP) feature shared on-chip cache. For multithreaded applications, the sharing reduces communication latency among co-running threads, but it also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors; the influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors at the program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between how current multithreaded applications are developed and compiled and the CMP architectures they run on. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
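The closing observation, that thread-to-core placement determines whether cache sharing turns into reuse or contention, can be illustrated with a small affinity sketch. This is not the paper's code; the core IDs are an assumption about one particular machine's topology and would have to be read from the real hardware (e.g., /sys/devices/system/cpu/*/topology on Linux).

```cpp
// Sketch: pin two threads that share data onto two cores assumed to share a
// cache, so their communication stays on chip instead of thrashing it.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

void worker() {
    // ... this thread's share of the parallel work ...
}

void pin_to_core(std::thread &t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread t0(worker), t1(worker);
    pin_to_core(t0, 0);   // cores 0 and 1: assumed cache-sharing pair
    pin_to_core(t1, 1);
    t0.join();
    t1.join();
    return 0;
}
```

Whether such a placement helps depends, as the measurement shows, on how much data the threads actually share; data-parallel threads with disjoint working sets may instead prefer cores that do not share a cache.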
ICS
Streamlining GPU Applications on the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, and 1 more author
In Proceedings of the 24th ACM International Conference on Supercomputing, Jun 2010
Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphics processing units) have quickly emerged as an influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of conditional statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradation, impairing the adoption of GPUs for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation into the use of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably the conflict between the large remapping overhead and the need for the remapping to happen on the fly, because thread divergences depend on runtime values. It offers a solution to this challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. Finally, it reports significant performance improvements produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.
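As a rough illustration of the two remapping mechanisms named in the abstract, the contrast below sketches reference redirection versus data layout transformation for a trivially divergent kernel. The kernels and names are hypothetical stand-ins, not the paper's implementation; the host-side steps (computing the redirection array or the new layout, e.g. with something like the LAM algorithm, and the CPU-GPU pipelining that hides their cost) are omitted.

```cuda
// (a) Reference redirection: data stays in place; each thread is redirected
//     to an element whose branch outcome matches its warp-mates', at the
//     cost of one extra indirection per access.
__global__ void kernelRedirect(const float *x, const int *redirect,
                               float *y, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int i = redirect[tid];
    y[i] = (x[i] > 0.0f) ? sqrtf(x[i]) : 0.0f;
}

// (b) Data layout transformation: the host physically reorders the elements
//     (e.g., all "true"-branch elements first), so the unmodified kernel
//     runs divergence-free and keeps contiguous, coalesced accesses.
__global__ void kernelDirect(const float *xReordered, float *yReordered, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float v = xReordered[tid];
    yReordered[tid] = (v > 0.0f) ? sqrtf(v) : 0.0f;
}
```

Roughly, redirection is cheap to set up but adds an indirection to every access, while layout transformation pays a one-time data-movement cost, which is the kind of overhead the CPU-GPU pipelining scheme is meant to hide.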