Register Allocation

The popularity of general-purpose Graphics Processing Units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. This architecture keeps the context of a significant number of threads in registers to enable fast “context switches” when the processor is stalled due to execution dependences, memory requests, and so on. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, extensively studied over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade whole-application performance.

Is it possible for compilers to make register allocation decisions that maximize whole GPU application performance? We tackle this important question from two different aspects in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We are able to achieve up to 1.70 times speedup over the baseline of the traditional register allocation scheme that maximizes single-thread performance.
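The link between per-thread register usage and the number of concurrent threads can be illustrated with a small calculation. The sketch below is only an approximation: the function name, the 65,536-register file, and the 2,048-thread limit are assumptions chosen for illustration, and real hardware applies additional allocation granularity rules that are ignored here.

```python
def max_threads_by_registers(register_file_size: int,
                             regs_per_thread: int,
                             hw_thread_limit: int,
                             warp_size: int = 32) -> int:
    """Threads that fit when the register file is split evenly among threads.

    Simplified model (assumption): registers are handed out to whole warps,
    and no other resource limits the thread count.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Count only whole warps that fit into the register file.
    warps = register_file_size // (regs_per_thread * warp_size)
    # The hardware scheduler also caps the number of resident threads.
    return min(warps * warp_size, hw_thread_limit)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    for regs in (32, 64, 128):
        threads = max_threads_by_registers(65536, regs, 2048)
        print(f"{regs} registers/thread -> {threads} concurrent threads")
```

Under these assumed numbers, doubling the registers used by each thread halves the number of threads that can be resident, which is the contention that a single-thread register allocator does not see.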

References

2016

Middleware
Orion: A Framework for GPU Occupancy Tuning

Ari B. Hayes, Lingda Li, Daniel Chavarría-Miranda, Shuaiwen Leon Song, and Eddy Z. Zhang

In Proceedings of the 17th International Middleware Conference, Jun 2016


An important feature of modern GPU architectures is variable occupancy. Occupancy measures the ratio between the actual number of threads actively running on a GPU and the maximum number of threads that can be scheduled on a GPU. High-occupancy execution enables a large number of threads to run simultaneously and to hide memory latency, but may increase resource contention. Low-occupancy execution leads to less resource contention, but is less capable of hiding memory latency. Occupancy tuning is an important and challenging problem. The performance of a program running at two different occupancy levels can differ by a factor of three to four. We introduce Orion, the first GPU program occupancy tuning framework. The Orion framework automatically generates and chooses occupancy-adaptive code for any given GPU program. It is capable of finding the (near-)optimal occupancy level by combining static and dynamic tuning techniques. We demonstrate the efficiency of Orion with twelve representative benchmarks from the Rodinia benchmark suite and CUDA SDK evaluated on two different GPU architectures, obtaining up to 1.61 times speedup, 62.5% memory resource saving, and 6.7% energy saving compared to the baseline of optimized code compiled by nvcc.
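The occupancy ratio defined in the abstract, and the idea of choosing among occupancy levels, can be sketched as follows. This is a deliberately naive illustration, not Orion's method: it picks the level with the lowest measured run time by exhaustive comparison, whereas the abstract describes a combination of static and dynamic tuning. The function names and the timing table are hypothetical.

```python
from typing import Callable, Iterable


def occupancy(active_threads: int, max_threads: int) -> float:
    """Ratio of threads actually running to the maximum schedulable threads."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return active_threads / max_threads


def pick_occupancy_level(levels: Iterable[int],
                         run_time: Callable[[int], float]) -> int:
    """Return the thread count whose measured run time is lowest.

    `run_time` stands in for running (or modelling) the kernel at a given
    thread count; it is an assumption of this sketch.
    """
    return min(levels, key=run_time)


if __name__ == "__main__":
    # Made-up timings (seconds) for one kernel at four occupancy levels.
    measured = {512: 3.0, 1024: 1.2, 1536: 1.5, 2048: 2.1}
    best = pick_occupancy_level(measured, measured.__getitem__)
    print(f"best level: {best} threads, occupancy {occupancy(best, 2048):.2f}")
```

In this made-up table neither the lowest nor the highest occupancy wins, which mirrors the trade-off between latency hiding and resource contention described above.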
@inproceedings{hayesetalmiddleware16,
  author    = {Hayes, Ari B. and Li, Lingda and Chavarr\'{\i}a-Miranda, Daniel and Song, Shuaiwen Leon and Zhang, Eddy Z.},
  title     = {Orion: A Framework for GPU Occupancy Tuning},
  year      = {2016},
  isbn      = {9781450343008},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/2988336.2988355},
  doi       = {10.1145/2988336.2988355},
  booktitle = {Proceedings of the 17th International Middleware Conference},
  articleno = {18},
  numpages  = {13},
  keywords  = {Shared Memory Allocation, GPU Compiler, Concurrent Program Compilation, Register Allocation, Occupancy Tuning},
  location  = {Trento, Italy},
  series    = {Middleware '16}
}

2014

ICS
Unified On-Chip Memory Allocation for SIMT Architecture

Ari B. Hayes and Eddy Z. Zhang

In Proceedings of the 28th ACM International Conference on Supercomputing, Jun 2014


The popularity of general-purpose Graphics Processing Units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. This architecture keeps the context of a significant number of threads in registers to enable fast “context switches” when the processor is stalled due to execution dependences, memory requests, and so on. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, extensively studied over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade whole-application performance. Is it possible for compilers to make register allocation decisions that maximize whole GPU application performance? We tackle this important question from two different aspects in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we discovered that it is possible to automatically determine, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure. We are able to achieve up to 1.70 times speedup over the baseline of the traditional register allocation scheme that maximizes single-thread performance.
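To make the idea of an on-chip memory partition concrete, the sketch below enumerates how many of a thread's live values stay in registers and how many move to scratch-pad (shared) memory, then keeps the split that admits the most concurrent threads. It is a brute-force toy under stated assumptions, not the paper's allocation framework or characterization model: the function name, the 4-byte value size, the resource sizes, and the `min_regs` floor used as a stand-in for "good single-thread performance" are all hypothetical, and warp-level allocation granularity is ignored.

```python
from typing import Tuple


def best_partition(live_values: int,
                   register_file: int,
                   shared_bytes: int,
                   hw_thread_limit: int,
                   min_regs: int,
                   bytes_per_value: int = 4) -> Tuple[int, int]:
    """Return (registers per thread, concurrent threads) for the best split.

    Each thread needs `live_values` on-chip slots; `regs` of them are
    registers and the rest are placed in shared memory.
    """
    best = None
    for regs in range(min_regs, live_values + 1):
        in_shared = live_values - regs
        by_regs = register_file // regs
        # With nothing in shared memory, shared memory imposes no limit.
        by_shared = (shared_bytes // (in_shared * bytes_per_value)
                     if in_shared else hw_thread_limit)
        threads = min(by_regs, by_shared, hw_thread_limit)
        # Prefer more threads; on ties prefer more registers per thread.
        if best is None or (threads, regs) > (best[1], best[0]):
            best = (regs, threads)
    return best


if __name__ == "__main__":
    # Illustrative sizes only: 65,536 registers, 48 KiB of shared memory.
    regs, threads = best_partition(80, 65536, 49152, 2048, min_regs=32)
    print(f"{regs} registers/thread, {threads} concurrent threads")
```

With these assumed sizes, keeping all 80 values in registers allows 819 threads, while moving some values to shared memory allows more, which illustrates why treating both on-chip memories together can raise concurrency.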
@inproceedings{hayesetalics14,
  author    = {Hayes, Ari B. and Zhang, Eddy Z.},
  title     = {Unified On-Chip Memory Allocation for SIMT Architecture},
  year      = {2014},
  isbn      = {9781450326421},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/2597652.2597685},
  booktitle = {Proceedings of the 28th ACM International Conference on Supercomputing},
  pages     = {293--302},
  numpages  = {10},
  keywords  = {concurrency, register allocation, compiler optimization, gpu, shared memory allocation},
  location  = {Munich, Germany},
  series    = {ICS '14}
}