[CUDA] Chap9. Parallel histogram


In practice, whenever a large volume of data needs to be analyzed to distill interesting events, histograms are likely to be used as a foundational computation.

 

Note that multiple threads need to update the same counter (e.g., the bin that counts the letters m through p), a conflict referred to as output interference. Programmers must understand the concepts of race conditions and atomic operations in order to confidently handle such output interference in their parallel code.

 

In both examples, undesirable outcomes are caused by a phenomenon called a read-modify-write race condition, in which the outcome of two or more simultaneous update operations varies depending on the relative timing of the operations involved. Some outcomes are correct, and some are incorrect.
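To make this concrete, here is a minimal sketch of the kind of text-histogram kernel this chapter builds on (the 7-bin layout, kernel name, and bounds checks are illustrative assumptions rather than the book's verbatim figure). The CUDA atomicAdd intrinsic performs the read, the modify, and the write as one indivisible operation, so concurrent updates to the same bin can no longer interleave:

```cuda
// Naive parallel histogram: one input character per thread.
// Assumes lowercase text; the 26 letters are grouped into 7 bins of 4.
__global__ void histo_kernel(char *data, unsigned int length,
                             unsigned int *histo) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        int alphabet_position = data[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26) {
            // atomicAdd makes the read-modify-write on the bin counter
            // indivisible, preventing the race condition described above.
            atomicAdd(&histo[alphabet_position / 4], 1);
        }
    }
}
```

Correctness comes at a price: while one thread's atomic update is in flight, every other thread targeting the same bin must wait, which is what the following discussion of atomic throughput is about.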


One approach to improving the throughput of atomic operations is to reduce the access latency to the heavily contended locations. Cache memories are the primary tool for reducing memory access latency. For this reason, modern GPUs allow atomic operations to be performed in the last-level cache, which is shared among all streaming multiprocessors (SMs). During an atomic operation, if the updated variable is found in the last-level cache, it is updated in the cache. If it cannot be found in the last-level cache, it triggers a cache miss and is brought into the cache, where it is updated. Since the variables that are updated by atomic operations tend to be heavily accessed by many threads, these variables tend to remain in the cache once they have been brought in from DRAM.

Because the access time to the last-level cache is tens of cycles rather than hundreds of cycles, the throughput of atomic operations is improved by at least an order of magnitude compared to early generations of GPUs. This is an important reason why most modern GPUs support atomic operations in the last-level cache.


Privatization

Another approach to improving the throughput of atomic operations is to alleviate contention by directing traffic away from the heavily contended locations. This can be achieved with a technique referred to as privatization, which is commonly used to address heavy output interference in parallel computing. The idea is to replicate highly contended output data structures into private copies so that each subset of threads can update its own copy. The benefit is that the private copies can be accessed with much less contention and often at much lower latency, which can dramatically increase the throughput of updating the data structures. The downside is that the private copies need to be merged into the original data structure after the computation completes, so one must carefully balance the level of contention against the merging cost. Therefore, in massively parallel systems, privatization is typically done for subsets of threads rather than for individual threads.

 

If each thread block maintains its own private copy, contention is experienced only between threads in the same block, and with the rest of the grid only when the private copies are being merged at the end.

 

Fig. 9.9 presents a simple kernel that creates and associates a private copy of the histogram with every block. In this scheme, at most 1024 threads (one block) work on any single copy of the histogram. In this kernel the private histograms reside in global memory; they will likely be cached in the L2 cache for reduced latency and improved throughput. A sketch of such a kernel follows.
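Below is a sketch of what such a kernel might look like, assuming NUM_BINS is 7 and that histo was allocated with gridDim.x * NUM_BINS counters, with block 0's section doubling as the public copy (the names and layout are assumptions, not the verbatim Fig. 9.9):

```cuda
#define NUM_BINS 7  // 26 lowercase letters grouped into bins of 4

// Privatization in global memory: block b updates counters
// histo[b*NUM_BINS .. b*NUM_BINS+NUM_BINS-1]; block 0's section
// serves as the public copy that all other blocks merge into.
__global__ void histo_private_kernel(char *data, unsigned int length,
                                     unsigned int *histo) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        int alphabet_position = data[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26) {
            // Atomic contention is now limited to threads of the same block.
            atomicAdd(&histo[blockIdx.x * NUM_BINS + alphabet_position / 4], 1);
        }
    }
    if (blockIdx.x > 0) {
        __syncthreads();  // wait until this block's private copy is final
        // Merge this block's private copy into the public copy.
        for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
            unsigned int binValue = histo[blockIdx.x * NUM_BINS + bin];
            if (binValue > 0) {
                atomicAdd(&histo[bin], binValue);
            }
        }
    }
}
```

Note that __syncthreads() is safe inside the branch because blockIdx.x is uniform across a block, so all of a block's threads take the same path.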


Coarsening

We have seen that privatization is effective in reducing the contention of atomic operations and that storing the privatized histogram in shared memory reduces the latency of each atomic operation. However, the overhead of privatization is the need to commit the private copies to the public copy, and this commit operation is done once per thread block. Thread coarsening reduces this overhead by launching fewer blocks and having each thread process multiple input elements, so that fewer private copies need to be committed, as sketched below.
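Here is a sketch that combines shared-memory privatization with interleaved partitioning, one of the two coarsening strategies this chapter compares (identifiers and the 7-bin layout are illustrative assumptions): each thread strides through the input by the total number of threads in the grid, so consecutive threads read consecutive elements and global memory accesses stay coalesced, and each block commits its private copy exactly once.

```cuda
#define NUM_BINS 7  // 26 lowercase letters grouped into bins of 4

__global__ void histo_coarse_kernel(char *data, unsigned int length,
                                    unsigned int *histo) {
    // Per-block private histogram in low-latency shared memory.
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        histo_s[bin] = 0u;
    }
    __syncthreads();

    // Interleaved partitioning: stride by the total thread count so that
    // threads in a warp touch consecutive elements in each iteration.
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = tid; i < length; i += blockDim.x * gridDim.x) {
        int alphabet_position = data[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26) {
            atomicAdd(&histo_s[alphabet_position / 4], 1);
        }
    }
    __syncthreads();

    // Commit once per block instead of once per input element.
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int binValue = histo_s[bin];
        if (binValue > 0) {
            atomicAdd(&histo[bin], binValue);
        }
    }
}
```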


Aggregation

 Some datasets have a large concentration of identical data values in localized areas. For example, in pictures of the sky, there can be large patches of pixels of identical value. Such a high concentration of identical values causes heavy contention and reduced throughput of parallel histogram computation.

 

For such datasets, a simple yet effective optimization is for each thread to aggregate consecutive updates into a single update if they are updating the same element of the histogram (Merrill, 2015). Such aggregation reduces the number of atomic operations on the highly contended histogram elements, thus improving the effective throughput of the computation.

 

Fig. 9.15 shows an aggregated text histogram kernel. The key changes compared to the kernel in Fig. 9.14 are as follows: Each thread declares an additional accumulator variable (line 09) that keeps track of the number of updates aggregated so far, and a prevBinIdx variable (line 10) that tracks the index of the histogram bin that was last encountered and is being aggregated. Each thread initializes the accumulator to zero, indicating that no updates have been aggregated initially, and prevBinIdx to -1 so that no alphabet input will match it.
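The sketch below reconstructs that logic from the description above rather than copying Fig. 9.15, so the identifiers and shared-memory scaffolding are assumptions:

```cuda
#define NUM_BINS 7  // 26 lowercase letters grouped into bins of 4

__global__ void histo_aggregated_kernel(char *data, unsigned int length,
                                        unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        histo_s[bin] = 0u;
    }
    __syncthreads();

    unsigned int accumulator = 0;  // updates aggregated so far
    int prevBinIdx = -1;           // last bin seen; -1 matches no input
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = tid; i < length; i += blockDim.x * gridDim.x) {
        int alphabet_position = data[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26) {
            int bin = alphabet_position / 4;
            if (bin == prevBinIdx) {
                ++accumulator;  // same bin again: extend the run, no atomic
            } else {
                if (accumulator > 0) {
                    // Flush the previous run with a single atomic update.
                    atomicAdd(&histo_s[prevBinIdx], accumulator);
                }
                accumulator = 1;
                prevBinIdx = bin;
            }
        }
    }
    if (accumulator > 0) {  // flush the final run
        atomicAdd(&histo_s[prevBinIdx], accumulator);
    }
    __syncthreads();

    // Commit the block's private copy to the public histogram.
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int binValue = histo_s[bin];
        if (binValue > 0) {
            atomicAdd(&histo[bin], binValue);
        }
    }
}
```

On data with long runs of identical values, each run costs one atomic instead of one per element; on uniformly random data the extra bookkeeping buys little, which is why aggregation pays off mainly for the concentrated datasets described above.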


Summary

Computing histograms is important for analyzing large datasets. It also represents an important class of parallel computation patterns in which the output location of each thread is data-dependent, which makes it infeasible to apply the owner-computes rule. It is therefore a natural vehicle for introducing the concept of read-modify-write race conditions and the practical use of atomic operations that ensure the integrity of concurrent read-modify-write operations to the same memory location.

 

 Unfortunately, as we explained in this chapter, atomic operations have much lower throughput than simpler memory read or write operations because their throughput is approximately the inverse of two times the memory latency. Thus in the presence of heavy contention, histogram computation can have surprisingly low computation throughput. Privatization is introduced as an important optimization technique that systematically reduces contention and can further enable the use of shared memory, which supports low latency and thus high throughput. In fact, supporting fast atomic operations among threads in a block is an important use case of the shared memory.
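As a back-of-the-envelope illustration (the latency figure is an assumption for the sake of arithmetic, not a number from this chapter): if each DRAM read and each DRAM write takes roughly 200 ns, then consecutive atomics on the same location are serialized at one per read-plus-write round trip:

```latex
\text{atomic throughput} \approx \frac{1}{2 \times \text{latency}}
  = \frac{1}{2 \times 200\,\text{ns}} = 2.5 \times 10^{6}\ \text{atomics per second}
```

That is orders of magnitude below the throughput of ordinary pipelined reads and writes, which is why contention on a few histogram bins can dominate the whole computation.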

 

 Coarsening was also applied to reduce the number of private copies that need to be merged, and different coarsening strategies that use contiguous partitioning and interleaved partitioning were compared. Finally, for datasets that cause heavy contention, aggregation can also lead to significantly higher execution speed.
