您當(dāng)前位置：首頁 > 服務(wù)器 > 大數(shù)據(jù)處理算法三：分而治之/hash映射 + hash統(tǒng)計(jì) + 堆/快速/歸并排序

大數(shù)據(jù)處理算法三：分而治之/hash映射 + hash統(tǒng)計(jì) + 堆/快速/歸并排序

來源：程序員人生發(fā)布時(shí)間：2015-07-03 08:38:36 閱讀次數(shù)：4576次

百度面試題1、海量日志數(shù)據(jù)，提取出某日訪問百度次數(shù)最多的那個(gè)IP。

IP 是32位的，最多有個(gè)2^32個(gè)IP。一樣可以采取映照的方法，比如模1000，把全部大文件映照為1000個(gè)小文件，再找出每一個(gè)小文中出現(xiàn)頻率最大的 IP（可以采取hash_map進(jìn)行頻率統(tǒng)計(jì)，然后再找出頻率最大的幾個(gè)）及相應(yīng)的頻率。然后再在這1000個(gè)最大的IP中，找出那個(gè)頻率最大的IP，即為所求。

百度面試題2、搜索引擎會通過日志文件把用戶每次檢索使用的所有檢索串都記錄下來，每一個(gè)查詢串的長度為1⑵55字節(jié)。

假定目前有1千萬個(gè)記錄（這些查詢串的重復(fù)度比較高，雖然總數(shù)是1千萬，但如果除去重復(fù)后，不超過3百萬個(gè)。1個(gè)查詢串的重復(fù)度越高，說明查詢它的用戶越多，也就是越熱門。），請你統(tǒng)計(jì)最熱門的10個(gè)查詢串，要求使用的內(nèi)存不能超過1G。

第 1步借用hash統(tǒng)計(jì)進(jìn)行預(yù)處理：先對這批海量數(shù)據(jù)預(yù)處理(保護(hù)1個(gè)Key為Query字串，Value為該Query出現(xiàn)次數(shù)，即Hashmap(Query，Value)，每次讀取1 個(gè)Query，如果該字串不在Table中，那末加入該字串，并且將Value值設(shè)為1；如果該字串在Table中，那末將該字串的計(jì)數(shù)加1便可。終究我們在O(N)（N為1千萬，由于要遍歷全部數(shù)組1遍才能統(tǒng)計(jì)處每一個(gè)query出現(xiàn)的次數(shù)）的時(shí)間復(fù)雜度內(nèi)用Hash表完成了統(tǒng)計(jì)；
第2步借用堆排序找出最熱門的10個(gè)查詢串：時(shí)間復(fù)雜度為N'*logK。保護(hù)1個(gè)K(該題目中是10)大小的小根堆，然后遍歷3百萬個(gè)Query，分別和根元素進(jìn)行對照（對照value的值），找出10個(gè)value值最大的query
終究的時(shí)間復(fù)雜度是：O（N） + N'*O（logK），（N為1000萬，N’為300萬）

或：采取trie樹，關(guān)鍵字域存該查詢串出現(xiàn)的次數(shù)，沒有出現(xiàn)為0。最后用10個(gè)元素的最小推來對出現(xiàn)頻率進(jìn)行排序。

我們先看HashMap 實(shí)現(xiàn)

1. HashMap的數(shù)據(jù)結(jié)構(gòu)

數(shù)據(jù)結(jié)構(gòu)中有數(shù)組和鏈表來實(shí)現(xiàn)對數(shù)據(jù)的存儲，但這二者基本上是兩個(gè)極端。

數(shù)組

數(shù)組存儲區(qū)間是連續(xù)的，占用內(nèi)存嚴(yán)重，故空間復(fù)雜的很大。但數(shù)組的2分查找時(shí)間復(fù)雜度小，為O(1)；數(shù)組的特點(diǎn)是：尋址容易，插入和刪除困難；

鏈表

鏈表存儲區(qū)間離散，占用內(nèi)存比較寬松，故空間復(fù)雜度很小，但時(shí)間復(fù)雜度很大，達(dá)O（N）。鏈表的特點(diǎn)是：尋址困難，插入和刪除容易。

哈希表

那末我們能不能綜合二者的特性，做出1種尋址容易，插入刪除也容易的數(shù)據(jù)結(jié)構(gòu)？答案是肯定的，這就是我們要提起的哈希表。哈希表（(Hash table）既滿足了數(shù)據(jù)的查找方便，同時(shí)不占用太多的內(nèi)容空間，使用也10分方便。

　　哈希表有多種不同的實(shí)現(xiàn)方法，我接下來解釋的是最經(jīng)常使用的1種方法―― 拉鏈法，我們可以理解為“鏈表的數(shù)組”

我用java 自己實(shí)現(xiàn)了1個(gè)HashMap，固然這比較簡點(diǎn)，不過能說明大概原理，改有的功能基本上有了

index=hashCode(key)=key％16

哈希算法很多，下面我用了java自帶的，固然你也能夠用別的

/** * 自定義 HashMap * @author JYC506 * * @param <K> * @param <V> */ public class HashMap<K, V> { private static final int CAPACTITY = 16; transient Entry<K, V>[] table = null; @SuppressWarnings("unchecked") public HashMap() { super(); table = new Entry[CAPACTITY]; } /* 哈希算法 */ private final int toHashCode(Object obj) { int h = 0; if (obj instanceof String) { return StringHash.toHashCode((String) obj); } h ^= obj.hashCode(); h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4); } /*放入hashMap*/ public void put(K key, V value) { int hashCode = this.toHashCode(key); int index = hashCode % CAPACTITY; if (table[index] == null) { table[index] = new Entry<K, V>(key, value, hashCode); } else { for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) { if (entry.hashCode == hashCode && (entry.key == key || key.equals(entry.key))) { entry.value = value; return; } } Entry<K, V> entry2 = table[index]; Entry<K, V> entry3 = new Entry<K, V>(key, value, hashCode); entry3.nextEntry = entry2; table[index] = entry3; } } /*獲得值*/ public V get(K key) { int hashCode = this.toHashCode(key); int index = hashCode % CAPACTITY; if (table[index] == null) { return null; } else { for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) { if (entry.hashCode == hashCode && (entry.key == key || key.equals(entry.key))) { return entry.value; } } return null; } } /*刪除*/ public void remove(K key){ int hashCode = this.toHashCode(key); int index = hashCode % CAPACTITY; if (table[index] == null) { return ; } else { Entry<K, V> parent=null; for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) { if (entry.hashCode == hashCode && (entry.key == key || key.equals(entry.key))) { if(parent!=null){ parent.nextEntry=entry.nextEntry; entry=null; return ; } } parent=entry; } } } public static void main(String[] args) { HashMap<String,String> map=new HashMap<String,String>(); map.put("1", "2"); map.put("1", "3"); map.put("3", "哈哈哈"); System.out.println(map.get("1")); System.out.println(map.get("3")); map.remove("1"); System.out.println(map.get("1")); } } class Entry<K, V> { K key; V value; int hashCode; Entry<K, V> nextEntry; public Entry(K key, V value, int hashCode) { super(); this.key = key; this.value = value; this.hashCode = hashCode; } } /* 字符串hash算法 */ class StringHash { public static final int toHashCode(String str) { /* 我用java自帶的 */ return str.hashCode(); } }

生活不易，碼農(nóng)辛苦
如果您覺得本網(wǎng)站對您的學(xué)習(xí)有所幫助,可以手機(jī)掃描二維碼進(jìn)行捐贈
程序員人生

------分隔線----------------------------

上一篇 Linux定時(shí)任務(wù)中用腳本代替可執(zhí)行程序的一個(gè)好處

下一篇 oracle已經(jīng)卸載了但為什么在服務(wù)里還是有看到

分享到:

------分隔線----------------------------

為碼而活

積分：4237

15粉絲

7關(guān)注

欄目熱點(diǎn)

多多色-多人伦交性欧美在线观看-多人伦精品一区二区三区视频-多色视频-免费黄色视屏网站-免费黄色在线

大數(shù)據(jù)處理算法三：分而治之/hash映射 + hash統(tǒng)計(jì) + 堆/快速/歸并排序