2026. 3. 19. 21:01 · ComputerScience/Parallel Programming
Chap3. Multidimensional grids and data
In this chapter, we will look more generally at how threads are organized and learn how threads and blocks can be used to process multidimensional arrays. Multiple examples will be used throughout the chapter, including converting a colored image to a grayscale image, blurring an image, and matrix multiplication. These examples also serve to familiarize the reader with reasoning about data parallelism before we proceed to discuss the GPU architecture, memory organization, and performance optimizations in the upcoming chapters.
As discussed in Chap2, Heterogeneous Data Parallel Computing, these threads are organized into a two-level hierarchy: a grid consists of one or more blocks, and each block consists of one or more threads. All threads in a block share the same block index, which can be accessed via the built-in blockIdx variable. Each thread also has a thread index, which can be accessed via the built-in threadIdx variable.
However, the ANSI C standard on the basis of which CUDA C was developed requires the number of columns in Pin to be known at compile time for Pin to be accessed as a 2D array. Unfortunately, this information is not known at compile time for dynamically allocated arrays. In fact, part of the reason why one uses dynamically allocated arrays is to allow the sizes and dimensions of these arrays to vary according to the data size at runtime.
Thus the number of columns in a dynamically allocated 2D array is, by design, not known at compile time. As a result, in current CUDA C, programmers need to explicitly linearize, or "flatten," a dynamically allocated 2D array into an equivalent 1D array. In reality, all multidimensional arrays in C are linearized anyway, since C stores them in a flat, row-major memory layout.

To implement matrix multiplication using CUDA, we can map the threads in the grid to the elements of the output matrix P with the same approach that we used for colorToGrayscaleConversion. That is, each thread is responsible for calculating one P element. The row and column indices of the P element calculated by each thread are derived from its block and thread indices: row = blockIdx.y * blockDim.y + threadIdx.y and col = blockIdx.x * blockDim.x + threadIdx.x.
Since the size of a grid is limited by the maximum number of blocks per grid and threads per block, the size of the largest output matrix P that can be handled by matrixMulKernel is also limited by these constraints. When output matrices larger than this limit must be computed, one can divide the output matrix into submatrices whose sizes can be covered by a grid and use the host code to launch a different grid for each submatrix. Alternatively, we can change the kernel code so that each thread calculates more P elements. We will explore both options later in this book.


Summary
CUDA grids and blocks are multidimensional, with up to three dimensions. The multidimensionality of grids and blocks is useful for organizing threads to be mapped to multidimensional data. The kernel execution configuration parameters, given between <<< and >>> in a kernel launch, define the dimensions of a grid and its blocks. Unique coordinates in blockIdx and threadIdx allow threads of a grid to identify themselves and their domains of data. It is the programmer's responsibility to use these variables in kernel functions so that the threads can properly identify the portion of the data to process.
When accessing multidimensional data, programmers will often have to linearize multidimensional indices into a 1D offset. The reason is that dynamically allocated multidimensional arrays in C are typically stored as 1D arrays in row-major order. We use examples of increasing complexity to familiarize the reader with the mechanics of processing multidimensional arrays with multidimensional grids. These skills will be foundational for understanding parallel patterns and their associated optimization techniques.