Using Hive user-defined functions: parsing the user agent
Source: 程序员人生  Published: 2014-11-07 08:58:23
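I wanted to analyze operating system, browser, and browser version usage from our log data, but Hive's built-in functions cannot parse a user-agent string directly, so the answer is to write a UDF. The user agent describes the user's current operating system and browser version, and looks like this: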
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29
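To make the structure of such a string concrete, here is a minimal sketch that pulls the `Name/version` product tokens out of it with a regular expression. This is only an illustration of the format, not how useragentutils works internally; the class name `UaTokens` and the method `productTokens` are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UaTokens {
    // Matches "Name/1.2.3" style product tokens such as "Chrome/31.0.1650.63".
    private static final Pattern PRODUCT =
            Pattern.compile("([A-Za-z]+)/(\\d+(?:\\.\\d+)*)");

    // Returns product name -> version, in order of appearance.
    // Hypothetical helper for illustration only.
    public static Map<String, String> productTokens(String userAgent) {
        Map<String, String> tokens = new LinkedHashMap<>();
        Matcher m = PRODUCT.matcher(userAgent);
        while (m.find()) {
            tokens.put(m.group(1), m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";
        System.out.println(productTokens(ua));
    }
}
```

A real parser has to do much more than this (map tokens to browser families, read the platform section in parentheses, handle bots and mobile devices), which is why a dedicated library is used in the rest of the post.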
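The UA can be parsed with an open-source toolkit, useragentutils.jar. Simply referencing the jar from the project is not enough, though: the library's classes also have to be available to Hive and Hadoop at run time, so here the library's source code is imported into the project and packaged together with the UDF. The project structure should look like this:

[Figure: project structure]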
The following code outputs the operating system, browser, and browser version:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import eu.bitwalker.useragentutils.UserAgent;

// Simple UDF: takes a raw user-agent string and returns "OS BROWSER VERSION".
public class ParseUserAgent_UDF extends UDF {
    public Text evaluate(final Text userAgent) {
        // A NULL column value arrives as null; return NULL instead of throwing.
        if (userAgent == null) {
            return null;
        }
        UserAgent ua = new UserAgent(userAgent.toString());
        return new Text(ua.getOperatingSystem() + " " + ua.getBrowser() + " " + ua.getBrowserVersion());
    }
}
Usage: package the class into a jar, then in Hive run:
add jar xx.jar;
create temporary function ua_parse as 'com.xx.ParseUserAgent_UDF';
select ua_parse(ua) from table_name limit 3;
Result:
WINDOWS_7 CHROME21 21.0.1180.89
WINDOWS_7 CHROME33 33.0.1750.146
WINDOWS_7 CHROME21 21.0.1180.89
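This approach takes one row in and produces one row out, with all three values packed into a single string column, so it cannot be used directly for statistical analysis.
The version below uses a UDTF (User Defined Table Generating Function) instead, which processes one row and generates multiple columns.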
import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import eu.bitwalker.useragentutils.UserAgent;

// UDTF: takes one user-agent string and emits one row with three columns
// (system, browser, version).
public class ParseUserAgent_UDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("ua_parse takes exactly one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("ua_parse takes a string as its parameter");
        }
        // Declare the three output columns, all of type string.
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("system");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("browser");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("version");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) {
        try {
            // Skip rows with no argument or a NULL user agent.
            if (args == null || args.length == 0 || args[0] == null) {
                return;
            }
            String input = args[0].toString();
            // "OS BROWSER VERSION" -> one output row with three columns.
            String[] result = ua_parse(input).split(" ");
            forward(result);
        } catch (Exception e) {
            // A row that fails is logged and skipped rather than failing the query.
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }

    // Returns "OS BROWSER VERSION", separated by single spaces.
    public String ua_parse(String userAgent) {
        UserAgent ua = new UserAgent(userAgent);
        return ua.getOperatingSystem() + " " + ua.getBrowser() + " " + ua.getBrowserVersion();
    }
}
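The step in process() that turns the space-joined string into output columns relies on the string always splitting into exactly three parts. The sketch below isolates that step in plain Java, without any Hive classes, and shows one possible way to guarantee three fields by padding missing ones with null (which Hive would show as NULL). The class `UaFields` and method `toFields` are hypothetical names for this illustration, not part of the original code.

```java
import java.util.Arrays;

public class UaFields {
    // Number of output columns declared in initialize(): system, browser, version.
    static final int FIELD_COUNT = 3;

    // Splits "OS BROWSER VERSION" into exactly FIELD_COUNT entries.
    // Missing trailing fields become null; anything after the second
    // space stays in the last field.
    public static String[] toFields(String parsed) {
        if (parsed == null) {
            return new String[FIELD_COUNT];
        }
        String[] parts = parsed.trim().split(" ", FIELD_COUNT);
        return Arrays.copyOf(parts, FIELD_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toFields("WINDOWS_7 CHROME21 21.0.1180.89")));
        System.out.println(Arrays.toString(toFields("UNKNOWN UNKNOWN")));
    }
}
```

Whether padding is needed depends on what the parsing library can return; the sketch only assumes that the three values are joined by single spaces, as in ua_parse above.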
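With ParseUserAgent_UDTF registered as ua_parse (add jar and create temporary function, as before), the requests can be counted per browser:
select t.browser,count(*) c from (select ua_parse(ua) as (system,browser,version) from table_name) t group by t.browser order by c desc;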
Top 10:
CHROME31 987220571
UNKNOWN 708890045
IE8 420021677
IE7 411500373
MOBILE_SAFARI 291920740
IE6 217574865
IE11 179582201
IE9 165160040
CHROME30 158623163
CHROME21 155192489
A lot of user agents still go unrecognized!
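For readers less familiar with SQL, the group-by-and-sort step of the query can be mimicked in plain Java over rows shaped like the UDTF output. This is only a sketch of the logic, not of how Hive executes the query; `BrowserCount` and `countByBrowser` are hypothetical names, and the sample rows are the three results shown earlier in the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BrowserCount {
    // Each row is {system, browser, version}; counts rows per browser
    // and returns the entries sorted by count, highest first.
    public static List<Map.Entry<String, Long>> countByBrowser(List<String[]> rows) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] row : rows) {
            counts.merge(row[1], 1L, Long::sum);   // group by browser
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed()); // order by c desc
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"WINDOWS_7", "CHROME21", "21.0.1180.89"},
                new String[] {"WINDOWS_7", "CHROME33", "33.0.1750.146"},
                new String[] {"WINDOWS_7", "CHROME21", "21.0.1180.89"});
        for (Map.Entry<String, Long> e : countByBrowser(rows)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```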
References: http://blog.csdn.net/ruidongliu/article/details/8791865
http://computerdragon.blog.51cto.com/6235984/1288567