A variant of the previous matrix multiplication can be used to illustrate how strided accesses to global memory, as well as shared memory bank conflicts, are handled. This variant simply uses the transpose of A rather than B; that is, C = AAᵀ.
As in the previous section, this example assumes compute capability 1.x; on devices of compute capability 2.x, the tile would have a width of 32 rather than 16.
A simple implementation for C = AAᵀ is shown in Unoptimized handling of strided accesses to global memory.
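The referenced listing is not reproduced here; a minimal sketch of such a kernel, assuming a 16-wide tile and using illustrative names simpleMultiply, a, c, and M for the kernel, input matrix, output matrix, and matrix width, might look like this:

```cuda
#define TILE_DIM 16  // tile width assumed in this example

// Sketch only, not the guide's exact listing: each thread computes one
// element of C as the dot product of the row-th and col-th rows of A.
__global__ void simpleMultiply(float *a, float *c, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        // a[col*TILE_DIM+i] is the strided access discussed below
        sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
    }
    c[row*M+col] = sum;
}
```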
In Unoptimized handling of strided accesses to global memory, the row-th, col-th element of C is obtained by taking the dot product of the row-th and col-th rows of A. The effective bandwidth for this kernel is 1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA GeForce GTX 8800. These results are substantially lower than the corresponding measurements for the C = AB kernel. The difference is in how threads in a half warp access elements of A in the second term, a[col*TILE_DIM+i], for each iteration i. For a half warp of threads, col represents sequential columns of the transpose of A, and therefore col*TILE_DIM represents a strided access of global memory with a stride of 16. This results in uncoalesced memory accesses on devices with compute capability 1.1 or lower and plenty of wasted bandwidth on devices with compute capability 1.2 or higher. The way to avoid strided access is to use shared memory as before, except in this case a half warp reads a row of A into a column of a shared memory tile, as shown in Optimized handling of strided accesses using coalesced reads from global memory.
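As before, the referenced listing is not reproduced here; a sketch of the coalesced version, under the same assumptions as the sketch above (coalescedMultiply, aTile, and transposedTile are illustrative names), might be:

```cuda
#define TILE_DIM 16  // tile width assumed in this example

// Sketch only, not the guide's exact listing: each half warp reads a
// contiguous row of A from global memory (a coalesced access) and writes
// it into a column of transposedTile in shared memory.
__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM + threadIdx.x];
    __syncthreads();

    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row*M+col] = sum;
}
```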
Optimized handling of strided accesses using coalesced reads from global memory uses the shared transposedTile to avoid uncoalesced accesses in the second term in the dot product, and the shared aTile technique from the previous example to avoid uncoalesced accesses in the first term. The effective bandwidth of this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained by the final kernel for C = AB. The cause of the difference is shared memory bank conflicts. The reads of transposedTile within the for loop are free of conflicts, because the threads of each half warp read across rows of the tile, resulting in unit stride across the banks. Bank conflicts arise instead when the tile is copied from global memory into shared memory: to keep the global loads coalesced, each half warp reads a row of A sequentially but writes it into a column of the tile, and with a 16×16 tile this means successive threads write with a stride of 16 banks, so all 16 threads of a half warp contend for a single bank. These 16-way bank conflicts are very expensive. The simple remedy is to pad the shared memory array so that it has an extra column, as in the following declaration:
```cuda
__shared__ float transposedTile[TILE_DIM][TILE_DIM+1];
```
This padding eliminates the conflicts entirely, because now the stride between threads is 17 banks (33 banks for compute capability 2.x), which, because of the modular arithmetic used to compute bank indices, is equivalent to a unit stride. After this change, the effective bandwidth is 30.4 GBps on an NVIDIA GeForce GTX 280 and 15.6 GBps on an NVIDIA GeForce GTX 8800, which is comparable to the results from the final C = AB kernel.
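To see the modular arithmetic concretely, the following host-side snippet (an illustration only, assuming the 16 banks and four-byte bank words of compute capability 1.x) computes the bank touched by each thread of a half warp as it writes a column of the tile, with and without the padding:

```cuda
#include <stdio.h>

#define TILE_DIM  16  // tile width assumed throughout this example
#define NUM_BANKS 16  // banks on compute capability 1.x (32 on 2.x)

int main(void)
{
    // Thread t of a half warp writes transposedTile[t][c] for a fixed
    // column c; the element's bank is its word offset modulo NUM_BANKS.
    int c = 0;  // any fixed column shows the same pattern
    for (int t = 0; t < 16; t++) {
        int unpadded = (t * TILE_DIM       + c) % NUM_BANKS;  // stride 16: all threads hit one bank
        int padded   = (t * (TILE_DIM + 1) + c) % NUM_BANKS;  // stride 17: banks 0 through 15
        printf("thread %2d: unpadded bank %2d, padded bank %2d\n",
               t, unpadded, padded);
    }
    return 0;
}
```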
The results of these optimizations are summarized in Table 2.
Table 2. Performance improvements optimizing C = AAᵀ matrix multiplication

| Optimization | NVIDIA GeForce GTX 280 | NVIDIA GeForce GTX 8800 |
|---|---|---|
| No optimization | 1.1 GBps | 0.5 GBps |
| Using shared memory to coalesce global reads | 24.9 GBps | 13.2 GBps |
| Removing bank conflicts | 30.4 GBps | 15.6 GBps |
These results should be compared with those in Table 1 for the C = AB kernels. As these tables show, judicious use of shared memory can dramatically improve performance.
The examples in this section have illustrated three reasons to use shared memory: