
Memory clock vs gpu clock

Memory bandwidth

Shane Cook, in CUDA Programming, 2013

Memory bandwidth and latency are key considerations in almost all applications, but especially so for GPU applications. Bandwidth refers to the amount of data that can be moved to or from a given destination. In the GPU case we're concerned primarily about the global memory bandwidth. Latency refers to the time the operation takes to complete. Memory latency is designed to be hidden on GPUs by running threads from other warps.

When a warp accesses a memory location that is not available, the hardware issues a read or write request to the memory. This request will be automatically combined or coalesced with requests from other threads in the same warp, provided the threads access adjacent memory locations and the start of the memory area is suitably aligned.

The size of memory transactions varies significantly between Fermi and the older versions. In compute 1.x devices (G80, GT200), the coalesced memory transaction size would start off at 128 bytes per memory access. This would then be reduced to 64 or 32 bytes if the total region being accessed by the coalesced threads was small enough and within the same 32-byte aligned block. This memory was not cached, so if threads did not access consecutive memory addresses, it led to a rapid drop-off in memory bandwidth. Thus, if thread 0 reads addresses 0, 1, 2, 3, 4, …, 31 and thread 1 reads addresses 32, 33, 34, …, 63, they will not be coalesced. In fact, the hardware will issue one read request of at least 32 bytes for each thread. The bytes not used will be fetched from memory and simply discarded. Thus, without careful consideration of how memory is used, you can easily receive a tiny fraction of the actual bandwidth available on the device.

The situation in Fermi and Kepler is much improved from this perspective. Fermi, unlike compute 1.x devices, fetches memory in transactions of either 32 or 128 bytes. By default every memory transaction is a 128-byte cache line fetch. Thus, one crucial difference is that access by a stride other than one, but within 128 bytes, now results in cached access instead of another memory fetch. This makes the GPU model from Fermi onwards considerably easier to program than previous generations. One of the key areas to consider is the number of memory transactions in flight. Each memory transaction feeds into a queue and is individually executed by the memory subsystem.
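To make the coalescing rule concrete, here is a small back-of-envelope model (not real hardware behavior, just the cache-line arithmetic) of how many 128-byte fetches a 32-thread warp triggers on a Fermi-class GPU when each thread loads one 4-byte float at `base + tid * stride`:

```python
CACHE_LINE = 128  # bytes per default cache-line fetch on Fermi
WARP_SIZE = 32    # threads per warp

def cache_lines_touched(base, stride_bytes, elem_bytes=4):
    """Count the distinct 128-byte cache lines a warp's loads fall into."""
    lines = set()
    for tid in range(WARP_SIZE):
        addr = base + tid * stride_bytes
        for b in range(elem_bytes):  # an element may straddle a line boundary
            lines.add((addr + b) // CACHE_LINE)
    return len(lines)

# Unit stride (coalesced): 32 threads x 4 bytes = 128 bytes -> 1 fetch.
print(cache_lines_touched(0, 4))    # 1
# Stride of 32 floats (128 bytes): each thread hits its own line -> 32 fetches.
print(cache_lines_touched(0, 128))  # 32
```

A 32x difference in transaction count for the same amount of useful data is exactly the kind of bandwidth collapse the text warns about.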

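The "tiny fraction of the actual bandwidth" claim for compute 1.x can be put into numbers. Assuming the worst case described above, where an uncoalesced warp issues one minimum 32-byte request per thread while each thread only needs a 4-byte float, the useful share of the fetched data is:

```python
USEFUL_BYTES_PER_THREAD = 4    # one float actually consumed
FETCHED_BYTES_PER_THREAD = 32  # minimum read request per thread on compute 1.x

efficiency = USEFUL_BYTES_PER_THREAD / FETCHED_BYTES_PER_THREAD
print(f"bus efficiency: {efficiency:.1%}")  # 12.5%
```

In other words, seven out of every eight bytes crossing the memory bus are fetched and simply discarded.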





