A simple Pig code example: counting clicks and impressions by industry for reports

Source: 程序員人生   Published: 2014-11-12 09:01:32   Views: 3075

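Note: run Pig scripts with run or exec. Apart from cd and ls, it is best to avoid the other file commands inside a script. This code uses rm and mv as an example, and they are easy to get wrong.

Also, Pig only actually loads and processes data when it reaches a STORE or DUMP. Until then it only parses the code and does not touch the data. So before an rm you must check whether the file in question has already been generated. If the file to be removed has not been generated yet, you can write the result to a third (temporary) file and then rename it with mv.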
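The two notes above can be sketched roughly as follows. This is a minimal illustration and not part of the original script: the script name report.pig, the parameter values, and every path under /tmp/demo are hypothetical. It uses rmf, Pig's variant of rm that does not fail when the path is missing, in place of rm.

```pig
-- From the Grunt shell, exec runs a script in batch mode and accepts parameters.
-- (report.pig and the values below are hypothetical.)
-- exec -param date=20141111 -param year=2014 -param file_path=/user/xxx report.pig

-- Nothing is read yet: LOAD and DISTINCT only declare relations.
logs = LOAD '/tmp/demo/input' USING PigStorage(',') AS (guid:chararray, log_type:chararray);
uniq = DISTINCT logs;

-- Clear a leftover temporary directory; rmf does not fail if it is missing.
rmf /tmp/demo/output_tmp
-- STORE is the statement that makes Pig actually read and process the data.
STORE uniq INTO '/tmp/demo/output_tmp' USING PigStorage(',');

-- Swap the fresh result into place: drop the old directory, then rename the temporary one.
rmf /tmp/demo/output
mv /tmp/demo/output_tmp /tmp/demo/output
```

Writing to a temporary directory first means the previous result is only removed once a new one has been written, which is the same rm-then-mv pattern the script below uses.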


SET job.name 'test_age_reporth_istorical';-- set the job name; check the job status (failed or succeeded) at http://172.XX.XX.XX:50030/jobtracker.jsp

SET job.priority HIGH;--job priority


--register the jar used to read sequence files and write the analysis result files
REGISTER piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); --define an alias for the loader that reads binary (sequence) files


--$date is a parameter passed in from outside
%default Cleaned_Log /user/C/data/XXX/cleaned/$date/*/part*


%default AD_Data /user/XXX/data/xxx/metadata/ad/part*
%default Campaign_Data /user/xxx/data/xxx/metadata/campaign/part*
%default Social_Data /user/xxx/data/report/socialdata/part*


--all output paths:
%default Industry_Path $file_path/report/historical/age/$year/industry
%default Industry_SUM $file_path/report/historical/age/$year/industry_sum
%default Industry_TMP $file_path/report/historical/age/$year/industry_tmp


%default Industry_Brand_Path $file_path/report/historical/age/$year/industry_brand
%default Industry_Brand_SUM $file_path/report/historical/age/$year/industry_brand_sum
%default Industry_Brand_TMP $file_path/report/historical/age/$year/industry_brand_tmp


%default ALL_Path $file_path/report/historical/age/$year/all
%default ALL_SUM $file_path/report/historical/age/$year/all_sum
%default ALL_TMP $file_path/report/historical/age/$year/all_tmp


%default output_path /user/xxx/tmp/result




origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',') --read the log files
AS (ad_network_id:chararray,
    xxx_ad_id:chararray,
    guid:chararray,
    id:chararray,
    create_time:chararray,
    action_time:chararray,
    log_type:chararray, 
    ad_id:chararray,
    positioning_method:chararray,
    location_accuracy:chararray,
    lat:chararray, 
    lon:chararray,
    cell_id:chararray,
    lac:chararray,
    mcc:chararray,
    mnc:chararray,
    ip:chararray,
    connection_type:chararray,
    android_id:chararray,
    android_advertising_id:chararray,
    openudid:chararray,
    mac_address:chararray,
    uid:chararray,
    density:chararray,
    screen_height:chararray,
    screen_width:chararray,
    user_agent:chararray,
    app_id:chararray,
    app_category_id:chararray,
    device_model_id:chararray,
    carrier_id:chararray,
    os_id:chararray,
    device_type:chararray,
    os_version:chararray,
    country_region_id:chararray,
    province_region_id:chararray,
    city_region_id:chararray,
    ip_lat:chararray,
    ip_lon:chararray,
    quadkey:chararray);


--loading metadata/ad(adId,campaignId) 
metadata_ad = LOAD '$AD_Data' USING PigStorage(',') AS (adId:chararray, campaignId:chararray);


--loading metadata/campaign(campaignId, industryId, brandId)
metadata_campaign = LOAD '$Campaign_Data' USING PigStorage(',') AS (campaignId:chararray, industryId:chararray, brandId:chararray);


--ad and campaign for inner join
joinAdCampaignByCampaignId = JOIN metadata_ad BY campaignId,metadata_campaign BY campaignId;--(adId,campaignId,campaignId,industryId,brandId)
--filtering out redundant column of joinAdCampaignByCampaignId
joined_ad_campaign_data = FOREACH joinAdCampaignByCampaignId GENERATE $0 AS adId,$3 AS industryId,$4 AS brandId; --(adId,industryId,brandId)


--extract column for analyzing
origin_historical_age = FOREACH origin_cleaned_data GENERATE xxx_ad_id,guid,log_type;--(xxx_ad_id,guid,log_type)
--distinct
distinct_origin_historical_age = DISTINCT origin_historical_age;--(xxx_ad_id,guid,log_type)


--loading metadata_social(guid_social, sex, age, income, edu, hobby)
metadata_social = LOAD '$Social_Data' USING PigStorage(',') AS (guid_social:chararray, sex:chararray, age:chararray, income:chararray, edu:chararray, hobby:chararray);
--extract needed column in metadata_social
social_age = FOREACH metadata_social GENERATE guid_social,age;


--join socialData(metadata_social) and logData(distinct_origin_historical_age):
joinedByGUID = JOIN social_age BY guid_social, distinct_origin_historical_age BY guid;
--(guid_social, age; xxx_ad_id,guid,log_type)




--generating analyzing age data
joined_orgin_age_data = FOREACH joinedByGUID GENERATE xxx_ad_id,guid,log_type,age;
joinedByAdId = JOIN joined_ad_campaign_data BY adId, joined_orgin_age_data BY xxx_ad_id; --(adId,industryId,brandId,xxx_ad_id,guid,log_type,age)
--filtering
all_current_data = FOREACH joinedByAdId GENERATE guid,log_type,industryId,brandId,age; --(guid,log_type,industryId,brandId,age)


--for industry analyzing
industry_current_data = FOREACH all_current_data GENERATE industryId,guid,age,log_type;  --(industryId,guid,age,log_type)


--load all in the path "industry"
industry_existed_Data = LOAD '$Industry_Path' USING PigStorage(',') AS (industryId:chararray,guid:chararray,age:chararray,log_type:chararray);


--merge with history data 
union_Industry = UNION industry_existed_Data, industry_current_data;
distict_union_industry = DISTINCT union_Industry;
group_industry = GROUP distict_union_industry BY ($2,$0,$3);
count_guid_for_industry = FOREACH group_industry GENERATE FLATTEN(group),COUNT($1.$1);


rm $Industry_SUM;
STORE count_guid_for_industry INTO '$Industry_SUM' USING PigStorage(',');


--storing union industry data(current and history)
STORE distict_union_industry INTO '$Industry_TMP' USING PigStorage(',');
rm $Industry_Path
mv $Industry_TMP $Industry_Path


--counting guid for industry and brand 
industry_brand_current = FOREACH all_current_data GENERATE age,industryId,brandId,log_type,guid;
--(age,industryId,brandId,log_type,guid)


--load history data of industry_brand
industry_brand_history = LOAD '$Industry_Brand_Path' USING PigStorage(',') AS(age:chararray, industryId:chararray, brandId:chararray, log_type:chararray, guid:chararray);


--union all data of industry_brand
union_industry_brand = UNION industry_brand_current,industry_brand_history;
unique_industry_brand = DISTINCT union_industry_brand;
--(age,industryId,brandId,log_type,guid)


--counting users' number for industry and brand
group_industry_brand = GROUP unique_industry_brand BY ($0,$1,$2,$3);
count_guid_for_industry_brand = FOREACH group_industry_brand GENERATE FLATTEN(group),COUNT($1.$4);


rm $Industry_Brand_SUM;
STORE count_guid_for_industry_brand INTO '$Industry_Brand_SUM' USING PigStorage(',');


STORE unique_industry_brand INTO '$Industry_Brand_TMP' USING PigStorage(',');
rm $Industry_Brand_Path;
mv $Industry_Brand_TMP $Industry_Brand_Path


--counting user number for age and logtype
current_data = FOREACH all_current_data GENERATE age,log_type,guid;--(age,log_type,guid)


--load history data of age and logtype
history_data = LOAD '$ALL_Path' USING PigStorage(',') AS(age:chararray,log_type:chararray,guid:chararray);


--union current and history data
union_all_data = UNION history_data, current_data;
unique_all_data = DISTINCT union_all_data;


--count users' number
group_all_data = GROUP unique_all_data BY ($0,$1);
count_guid_for_age_logtype = FOREACH group_all_data GENERATE FLATTEN(group),COUNT($1.$2);


rm $ALL_SUM;
STORE count_guid_for_age_logtype INTO '$ALL_SUM' USING PigStorage(',');


STORE unique_all_data INTO '$ALL_TMP' USING PigStorage(',');
rm $ALL_Path
mv $ALL_TMP $ALL_Path
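The three report sections above all follow the same pattern: merge the history with the current data, remove duplicates, then count per group. A minimal sketch of that pattern, using field names instead of positional references such as $0 and $1; the input and output paths under /tmp/demo and all alias names are hypothetical:

```pig
old_rows = LOAD '/tmp/demo/history' USING PigStorage(',') AS (age:chararray, log_type:chararray, guid:chararray);
new_rows = LOAD '/tmp/demo/current' USING PigStorage(',') AS (age:chararray, log_type:chararray, guid:chararray);

-- Merge both sets and keep each (age, log_type, guid) tuple only once.
merged = UNION old_rows, new_rows;
uniq = DISTINCT merged;

-- One group per (age, log_type). Because uniq holds no duplicate tuples,
-- COUNT returns the number of distinct non-null guids in each group.
grouped = GROUP uniq BY (age, log_type);
counts = FOREACH grouped GENERATE FLATTEN(group) AS (age, log_type), COUNT(uniq.guid) AS users;

STORE counts INTO '/tmp/demo/counts' USING PigStorage(',');
```

Named fields make the grouping keys readable at a glance, whereas positional references silently change meaning if the column order of an input changes.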
