
CUDA Learning Series (5): Basic GPU Algorithms: Reduce, Scan, Histogram

Source: 程序員人生 | Published: 2015-07-03 08:53:05 | Views: 6859

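Meow~ Before we knew it, we have reached Lecture 5 of the CUDA learning series. The previous lectures mainly covered the basic software and hardware structure of the GPU, memory management, task types and so on. This lecture introduces three basic GPU algorithms: reduce, scan, and histogram. They are used very frequently in parallel algorithms, and for each of them we discuss what it does and what it is used for, along with its serial and parallel implementations.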

## 1. Task complexity

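Task complexity has two parts: step complexity (the number of sequential steps needed when independent operations run in parallel) & work complexity (the total number of operations to perform).
e.g. In the tree-structured diagram below, each node is an operand and each edge is an operation; edges on the same level represent the same operation. What are the step complexity & work complexity of the task this diagram represents?

[Figure: tree operation]

Ans:
step complexity: 3;
work complexity: 6.
More concrete examples follow below.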

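## 2. Reduce

Motivation: consider the task 1+2+3+4+…

1. The simplest sequential evaluation order is ((1+2)+3)+4…
2. Because addition can be regrouped freely, the operations do not have to depend on one another. Reduce exploits this to cut the number of steps compared with the serial implementation.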

### 2.1 what is reduce?

Reduce input:

1. set of elements
2. reduction operation
    1. binary: two inputs, one output
    2. associative: (a@b)@c = a@(b@c), where @ denotes the operator. e.g. + and bitwise AND both qualify; a^b (exponentiation) and subtraction do not.

![2. add_tree.png](http://img.blog.csdn.net/20150213145544805)
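The associativity requirement can be checked on small numbers. The following is a minimal C++ sketch (the helper names `group_left`, `group_right` and `same_either_way` are made up for this illustration); a single triple can only demonstrate or refute the property, it does not prove it in general.

```cpp
#include <cassert>

// Two ways of grouping a three-operand expression under a binary operator.
template <typename Op>
int group_left(int a, int b, int c, Op op) { return op(op(a, b), c); }   // (a @ b) @ c

template <typename Op>
int group_right(int a, int b, int c, Op op) { return op(a, op(b, c)); }  // a @ (b @ c)

// True when both groupings agree for this particular triple.
template <typename Op>
bool same_either_way(int a, int b, int c, Op op) {
    return group_left(a, b, c, op) == group_right(a, b, c, op);
}

inline int add(int x, int y) { return x + y; }
inline int bit_and(int x, int y) { return x & y; }
inline int sub(int x, int y) { return x - y; }

// Example checks: + and & can be regrouped, subtraction cannot.
inline void check_examples() {
    assert(same_either_way(5, 3, 2, add));      // (5+3)+2 == 5+(3+2)
    assert(same_either_way(6, 3, 5, bit_and));  // (6&3)&5 == 6&(3&5)
    assert(!same_either_way(5, 3, 2, sub));     // (5-3)-2 = 0, but 5-(3-2) = 4
}
```

Only operators that pass this kind of regrouping are safe to evaluate in tree order.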

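### 2.1.1 Serial implementation of Reduce

Each step of a serial reduce depends on the result of the step before it. For the example above, adding n numbers takes n-1 additions performed one after another, so work complexity and step complexity are both O(n). Our goal is to parallelize the operations and bring the step complexity down, e.g. for add: serial reduce -> parallel reduce.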
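As a rough illustration of the serial case, here is a small CPU-side C++ sketch. `ReduceResult` and `serial_reduce_add` are names invented for this example; the step counter is only there to make the O(n) chain visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of a reduction plus the number of dependent steps it took.
struct ReduceResult {
    float value;
    std::size_t steps;
};

// Serial reduce for addition: ((in[0] + in[1]) + in[2]) + ...
// Every add needs the previous partial sum, so n elements take n - 1
// dependent steps and n - 1 adds in total, i.e. O(n) for both.
inline ReduceResult serial_reduce_add(const std::vector<float>& in) {
    assert(!in.empty());
    ReduceResult r{in[0], 0};
    for (std::size_t i = 1; i < in.size(); ++i) {
        r.value += in[i];
        ++r.steps;
    }
    return r;
}
```

For four elements this performs three adds in three steps; the parallel version keeps the number of adds but shortens the chain.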

### 2.1.2 Parallel implementation of Reduce

![3. parallel_add.png](http://img.blog.csdn.net/20150213145641025)

In other words, we have brought the step complexity down to $\log_2 n$. If we run a parallel reduce add over $2^{10}$ numbers, the step complexity is 10. The first step of this parallel reduce needs 512 additions, which is no big deal for a modern GPU. But what if we want to add $2^{20}$ numbers? Then we have to take the GPU's parallel capacity into account: if the GPU can perform at most 512 operations in parallel, we should split the $2^{20}$ numbers into 1024*1024 (1024 groups in total) and add $2^{10}$ numbers at a time. Relating the task size to the number of available processors in this way is the subject of Brent's theorem. Let's look at it concretely:

![4. brent’s theory.png](http://img.blog.csdn.net/20150213145804704)

That is, we proceed in two steps: in the first step we split the input into 1024 blocks and each block sums its own elements; in the second step we sum the 1024 partial results with a single block of 1024 threads. kernel code:

__global__ void parallel_reduce_kernel(float *d_out, float* d_in){
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;

    // Split the active threads into two halves by thread ID and add the right half onto the left half, which halves the number of elements per iteration; iterate until only one element is left (assumes blockDim.x is a power of two)
    for(unsigned int s = blockDim.x / 2 ; s>0; s>>=1){
        if(tid<s){
            d_in[myID] += d_in[myID + s];
        }
        __syncthreads(); //ensure all adds at one iteration are done
    }
    if (tid == 0){
        d_out[blockIdx.x] = d_in[myID];
    }
}

Quiz: Looking at the code above, where can it be optimized?
Ans: In the previous lecture we discussed the speed of global, shared & local memory. Here the operations on global memory can be switched to shared memory to speed things up:

__global__ void parallel_shared_reduce_kernel(float *d_out, float* d_in){
    int myID = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x;
    extern __shared__ float sdata[];
    sdata[tid] = d_in[myID];
    __syncthreads();

    // Split the active threads into two halves by thread ID and add the right half onto the left half, which halves the number of elements per iteration; iterate until only one element is left (assumes blockDim.x is a power of two)
    for(unsigned int s = blockDim.x / 2 ; s>0; s>>=1){
        if(tid<s){
            sdata[tid] += sdata[tid + s]; // sdata is per-block, so index it with tid rather than myID
        }
        __syncthreads(); //ensure all adds at one iteration are done
    }
    if (tid == 0){
        d_out[blockIdx.x] = sdata[0]; // the block's sum ends up in the first shared element
    }
}

One more point about the optimized code: when launching it, remember the general kernel launch form from Lecture 3:

kernel<<<grid of blocks, block of threads, shmem>>>

The last item (the shared memory size in bytes) has to be given when the kernel is called, i.e.:

parallel_shared_reduce_kernel<<<blocks, threads, threads*sizeof(float)>>>(data_out, data_in);
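The halving loop and the two-step scheme (1024 blocks first, then one block over the partial sums) can be checked without a GPU. The following C++ sketch is a CPU stand-in, not the kernel itself: `block_reduce` and `two_stage_reduce` are hypothetical helper names, and the inner loop over `tid` plays the role of the threads that would run concurrently. Within one halving step the reads (indices `s` and above) and writes (indices below `s`) never overlap, so running the "threads" one after another gives the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for one thread block: sums data[base .. base + block_size)
// in place using the same halving loop as the kernel.
inline float block_reduce(std::vector<float>& data, std::size_t base,
                          std::size_t block_size, std::size_t* steps = nullptr) {
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);  // power of two
    assert(base + block_size <= data.size());
    for (std::size_t s = block_size / 2; s > 0; s >>= 1) {
        for (std::size_t tid = 0; tid < s; ++tid) {
            data[base + tid] += data[base + tid + s];
        }
        if (steps) ++*steps;  // one parallel step per halving
    }
    return data[base];
}

// Two-stage reduce: stage 1 reduces each block to one partial sum,
// stage 2 reduces the partial sums with a single block.
inline float two_stage_reduce(std::vector<float> data, std::size_t block_size) {
    assert(data.size() % block_size == 0);
    const std::size_t blocks = data.size() / block_size;
    std::vector<float> partial(blocks);
    for (std::size_t b = 0; b < blocks; ++b) {
        partial[b] = block_reduce(data, b * block_size, block_size);
    }
    return block_reduce(partial, 0, blocks);  // blocks must also be a power of two
}
```

With a block size of 1024, a single block takes 10 halving steps, and $2^{20}$ inputs produce exactly 1024 partial sums for the second stage.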

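OK, so here comes the question: of these two versions (parallel_reduce_kernel and parallel_shared_reduce_kernel), how many times more global memory bandwidth does parallel_reduce_kernel use than parallel_shared_reduce_kernel? Ans: consider the read and write operations of each version: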

parallel_reduce_kernel

| Times | Read Ops | Write Ops |
|---|---|---|
| 1 | 1024 | 512 |
| 2 | 512 | 256 |
| 3 | 256 | 128 |
| … | | |
| n | 1 | 1 |

parallel_shared_reduce_kernel

| Times | Read Ops | Write Ops |
|---|---|---|
| 1 | 1024 | 1 |

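So parallel_reduce_kernel needs about 3 times the global memory bandwidth of parallel_shared_reduce_kernel: summing the rows gives roughly 3,000 global memory operations for the first version against roughly 1,000 for the second.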
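The factor of about 3 can be reproduced by counting global memory accesses per block. This C++ sketch uses a deliberately simple model that is an assumption of this example: every array access in the kernel source counts as one operation, and caching and coalescing are ignored. `MemOps`, `global_version_ops` and `shared_version_ops` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

struct MemOps {
    std::size_t reads;
    std::size_t writes;
};

// Global-memory traffic of one block of parallel_reduce_kernel:
// each active thread reads two floats and writes one per iteration,
// then thread 0 copies the result to d_out (one read, one write).
inline MemOps global_version_ops(std::size_t block_size) {
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);  // power of two
    MemOps ops{0, 0};
    for (std::size_t s = block_size / 2; s > 0; s >>= 1) {
        ops.reads += 2 * s;
        ops.writes += s;
    }
    ops.reads += 1;
    ops.writes += 1;
    return ops;
}

// Global-memory traffic of one block of parallel_shared_reduce_kernel:
// every thread loads one float into shared memory, thread 0 writes one result.
inline MemOps shared_version_ops(std::size_t block_size) {
    return MemOps{block_size, 1};
}
```

For a block of 1024 threads this model gives 2047 reads and 1024 writes (3071 operations) for the global version against 1024 reads and 1 write (1025 operations) for the shared version, a ratio of just under 3.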

## 3. Scan

### 3.1 what is scan?

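Example:
- input: 1,2,3,4
- operation: Add
- output: 1,3,6,10 (out[i] = sum(in[0..i]), with in[i] included)

Purpose: to handle problems that are otherwise hard to parallelize.

Off the top of our heads, an O(n) solution to the problem above is out[i] = out[i-1] + in[i]. Now let's introduce scan.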

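Inputs to scan:

- input array
- operation: binary & associative (same as for reduce)
- identity element [I op a = a], where I is the identity element

quiz: what is the identity for addition, multiplication, logical AND, logical OR?
Ans:

| op | Identity |
|---|---|
| addition | 0 |
| multiplication | 1 |
| logical OR \|\| | False |
| logical AND && | True |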

### 3.2 what scan does?

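| I/O | content |
|---|---|
| input | [ $a_0$, $a_1$, $a_2$, …, $a_n$ ] |
| output | [ $I$, $a_0$, $a_0 \otimes a_1$, …, $a_0 \otimes a_1 \otimes \dots \otimes a_{n-1}$ ] |

where $\otimes$ is the scan operator and $I$ is the identity element of $\otimes$. This is the exclusive form of scan: the output starts with $I$ and element i does not include $a_i$ itself, whereas the add example in 3.1 is the inclusive form.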
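To make the role of the identity element concrete, here is a small serial C++ sketch of the exclusive form. `exclusive_scan_serial` is a name chosen for this example (it is not a library function), and the `assert` is just a spot check that the supplied identity satisfies I op a = a for the first element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive scan: out[i] combines in[0..i-1]; out[0] is the identity.
template <typename Op>
std::vector<int> exclusive_scan_serial(const std::vector<int>& in, int identity, Op op) {
    assert(in.empty() || op(identity, in[0]) == in[0]);  // I op a == a
    std::vector<int> out(in.size());
    int acc = identity;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = acc;          // write the running value before folding in in[i]
        acc = op(acc, in[i]);
    }
    return out;
}
```

With addition and identity 0, the input 1,2,3,4 gives 0,1,3,6; with multiplication and identity 1 it gives 1,1,2,6.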

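### 3.2.1 Serial implementation of Scan

Very simple:

// inclusive scan; op is any binary associative operator, identity is its identity element
int acc = identity;
for(int i = 0; i < n; i++){   // n = number of elements
    acc = op(acc, elements[i]);
    out[i] = acc;
}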

work complexity: $O(n)$
step complexity: $O(n)$

So how do we parallelize the scan problem?

### 3.2.2 Parallel implementation of Scan

To parallelize scan, we can compute the n outputs in parallel: output element i equals $a_0 \otimes a_1 \otimes \dots \otimes a_i$, which is a reduce operation.

Q: What do the work complexity and step complexity of the problem become?
Ans:

- step complexity: determined by the longest of the n reductions, i.e. $O(\log_2 n)$
- work complexity: each output element is computed on its own, for a total of 0+1+2+…+(n-1) operations, so the complexity is $O(n^2)$.
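The quadratic work of this naive approach is easy to confirm by counting adds. The C++ sketch below is a CPU-side illustration only (`ScanCount` and `naive_scan_add` are made-up names): on a GPU the n reductions could run side by side, here they simply run one after another so that the adds can be counted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ScanCount {
    std::vector<int> out;
    std::size_t work;  // total number of adds
};

// Naive inclusive scan: output i is computed by its own, independent
// reduction over in[0..i], which costs i adds.
inline ScanCount naive_scan_add(const std::vector<int>& in) {
    ScanCount r{std::vector<int>(in.size()), 0};
    for (std::size_t i = 0; i < in.size(); ++i) {
        int acc = in[0];
        for (std::size_t j = 1; j <= i; ++j) {
            acc += in[j];
            ++r.work;
        }
        r.out[i] = acc;
    }
    // 0 + 1 + ... + (n-1) = n(n-1)/2, i.e. O(n^2)
    assert(in.empty() || r.work == in.size() * (in.size() - 1) / 2);
    return r;
}
```

For 4 inputs this is 6 adds, but for 1024 inputs it is already 523,776, compared with 1023 for the serial scan.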

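So the step complexity has come down, but unfortunately the work complexity has gone up. How do we fix that? There are two scan algorithms:

| | more step efficient | more work efficient |
|---|---|---|
| hillis + steele (1986) | √ | |
| blelloch (1990) | | √ |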

### Hillis + Steele

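For scan with addition, the Hillis + Steele algorithm works as follows:

[Figure: hillis + steele]

That is, at each step:
step 0: out[i] = in[i] + in[i-1];
step 1: out[i] = in[i] + in[i-2];
step 2: out[i] = in[i] + in[i-4];
Each step takes the previous step's output as its in. If an element does not exist (the index falls below 0), it counts as 0. For an input of up to 8 elements, the output of step 2 is the result of the add scan (think about why; we will analyze it in a moment).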
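The steps above can be simulated on the CPU to see the doubling offsets at work. This C++ sketch is an illustration under stated assumptions, not the author's kernel: `HillisSteeleResult` and `hillis_steele_scan_add` are hypothetical names, and two buffers are used because on a GPU every position of a pass must read the previous pass's output rather than values already updated in the same pass.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct HillisSteeleResult {
    std::vector<int> out;
    std::size_t steps;  // number of passes
    std::size_t work;   // total adds over all passes
};

// Hillis + Steele inclusive scan for addition, simulated on the CPU.
// The pass with offset 2^d adds the element 2^d positions to the left;
// a missing neighbor counts as 0, so those positions are simply copied.
inline HillisSteeleResult hillis_steele_scan_add(const std::vector<int>& in) {
    HillisSteeleResult r{in, 0, 0};
    std::vector<int> next(in.size());
    for (std::size_t offset = 1; offset < in.size(); offset <<= 1) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (i >= offset) {
                next[i] = r.out[i] + r.out[i - offset];
                ++r.work;
            } else {
                next[i] = r.out[i];
            }
        }
        std::swap(r.out, next);  // this pass's output is the next pass's input
        ++r.steps;
    }
    // The number of passes is the smallest k with 2^k >= n.
    assert(in.size() < 2 || (std::size_t{1} << r.steps) >= in.size());
    return r;
}
```

For the input 1..8 the passes produce 1,3,5,7,9,11,13,15, then 1,3,6,10,14,18,22,26, then the final 1,3,6,10,15,21,28,36, using 3 passes and 7+6+4 = 17 adds.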

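So here comes the question...
Q: What are the work complexity and step complexity of the hillis + steele algorithm?

Hillis + steele Algorithm complexity

| | $O(\log n)$ | $O(\sqrt{n})$ | $O(n)$ | $O(n \log n)$ | $O(n^2)$ |
|---|---|---|---|---|---|
| work complexity | | | | √ | |
| step complexity | √ | | | | |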

Explanation:


step complexity:
Step i takes the previous step's output as in and computes out[idx] = in[idx] + in[idx - 2^i]. The offset doubles at every step, so step complexity = $O(\log n)$
work complexity:
workload = $(n-1) + (n-2) + (n-4) + \dots$, a sum of $\log n$ terms. It can be approximated by a matrix of length $\log n$ and width n (matching the figure above), so the complexity is $O(n \log n)$.
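The workload sum can also be evaluated exactly. Assuming n is a power of two (an assumption made here only to keep the number of terms equal to $\log_2 n$), the terms form a geometric series:

$$
\sum_{i=0}^{\log_2 n - 1} \left(n - 2^i\right) = n \log_2 n - \left(2^{\log_2 n} - 1\right) = n \log_2 n - (n - 1) = O(n \log n)
$$

For n = 8 this gives $8 \cdot 3 - 7 = 17$, which matches $(8-1) + (8-2) + (8-4) = 7 + 6 + 4$.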