A Custom Hive SQL Job Analysis Tool
Published: 2019-03-05


Hive SQL Job Analysis Tool

Preface

In the big-data world, the arrival of Hive dramatically lowered the learning cost of using Hadoop. Users write queries with SQL-like syntax to pull the data they need out of Hive tables, and, most importantly, those SQL statements are ultimately translated into MapReduce jobs for execution. That is where Hive's real power lies; you can loosely think of Hive as a shell sitting on top of Hadoop. Note, however, that a Hive query does not map one-to-one onto a single generated job: a large SQL statement is split into several smaller jobs that run independently, distinguished by different job names while sharing a common base name. A natural question then follows: for an extremely long Hive SQL statement, how do we tell which sub-query ate up most of the execution time? JobHistory only records each sub-job's running time, not the corresponding sub-SQL. Once we can recover that mapping, we can locate the problem much faster.


The SQL Inside Hive Sub-Jobs

To achieve the goal above, we first have to find out where these sub-SQL statements actually live. As mentioned earlier, Hive sits on top of Hadoop, so the information about Hive-submitted jobs is also stored on HDFS. If you look at the file types under the JobHistory directory, you will find files with the following suffixes:

  • .jhist files: the JobHistory files, which record each job's runtime information (launch time, finish time, task counts, and so on).
  • .xml files with conf in the name: the configuration the job was submitted with.

Opening one of these conf .xml files with vim, we can check whether it contains the property we are after: hive.query.string. It does, so the target has been found; what remains is extracting this useful piece of information, as sketched below.
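For reference, the relevant entry in a job's conf XML looks roughly like the following (the query text here is only an illustration); it is exactly the <value>...</value> pair that the parsing code later in this article cuts out:

<property>
  <name>hive.query.string</name>
  <value>INSERT OVERWRITE TABLE tmp.t_user_stat SELECT ... FROM dw.t_user_visit ...</value>
</property>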


Analyzing Hive SQL Jobs with a Program

Now that we know where the source data lives, the simplest and quickest approach is to parse the files line by line, do text matching, and pick out the key information. Anyone can write that kind of code. The program first takes an HDFS directory as input, namely the JobHistory storage directory plus a concrete date sub-directory. The parsing code is attached at the end of this article.
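For orientation, driving the tool looks roughly like the sketch below. The class and constructor are the ones listed later in this article; the history path and thread count are placeholders to be replaced with your own values.

// Hypothetical driver class; the path and thread count are examples only.
public class AnalyseMain {
    public static void main(String[] args) {
        // JobHistory done directory plus a concrete date sub-directory.
        String historyDir = "/mr-history/done/2019/03/05";
        // "dateTimeDir" is the dirType that enables the final batch write to MySQL.
        HiveSqlAnalyseTool tool = new HiveSqlAnalyseTool("dateTimeDir", historyDir, 10);
        tool.readJobInfoFiles();
    }
}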

While debugging the analysis program we ran into a few problems, and the lessons from them are worth recording:


1. Chinese characters in the Hive SQL came out garbled during parsing

This one is quite annoying. The SQL may contain Chinese comments, and Hadoop stores Chinese text in UTF-8, so the fix is simply to decode the data as UTF-8 when reading it out of the file, as the code below shows:

InputStreamReader isr = new InputStreamReader(in, "UTF-8");
BufferedReader br = new BufferedReader(isr);
String str;
while ((str = br.readLine()) != null) {
    // read and process each line here
}

2. Single-threaded file parsing was too slow

In the test environment the problem was invisible: the file volume was small, so parsing simply worked. In the production environment, however, there were tens of thousands of job files and the program could no longer keep up; parsing the files and writing the results into MySQL took more than 60 minutes. We therefore switched to a multi-threaded design and ran 10 parsing threads, which made things considerably faster.
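The listings later in this article implement this with hand-rolled threads that pull files one at a time from a shared queue through a synchronized getOneFile() method. Purely as an alternative sketch (not what the listings use), the same idea with the JDK's built-in thread pool would look roughly like this, where parseFileInfo stands for whatever per-file parsing you implement:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.FileStatus;

public class PoolBasedParser {
    // Parse all files with a fixed pool of worker threads.
    public void parseAll(List<FileStatus> files, int threadNum) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        for (final FileStatus fs : files) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    parseFileInfo(fs); // same per-file text matching as in the listings below
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void parseFileInfo(FileStatus fs) {
        // placeholder for the .jhist / conf.xml parsing logic
    }
}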


3. Writing the result data into MySQL was too slow

With that change the processing speed went up, but writing into the database was still too slow. In one test, 10 parsing threads finished tens of thousands of files in about 8 minutes, yet inserting the results took around 20 minutes even though the volume was only tens of thousands of rows. Switching to batched inserts did not change much either, so this slowness was never really solved; our suspicion is that some rows contain extremely long Hive SQL statements.
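The DbClient class is not included in the listings below. A minimal sketch of what a batched insertDataBatch could look like follows; the table and column names are assumptions chosen to match the fields the tool collects, and the loop assumes the String[] record uses the index order shown in the BaseValues sketch further down. For MySQL, adding rewriteBatchedStatements=true to the JDBC URL is usually needed before batching pays off, and very long SQL strings can still dominate the insert time.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class DbClientSketch {
    // Example URL; rewriteBatchedStatements=true lets the MySQL driver merge the batch.
    private static final String URL =
            "jdbc:mysql://127.0.0.1:3306/hive_stat?rewriteBatchedStatements=true";

    public void insertDataBatch(HashMap<String, String[]> dataInfos) {
        String sql = "INSERT INTO hive_sql_stat "
                + "(job_id, job_name, user_name, hive_sql, start_time, finish_time, "
                + "map_task_num, reduce_task_num) VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Map.Entry<String, String[]> entry : dataInfos.entrySet()) {
                String[] record = entry.getValue();
                for (int i = 0; i < record.length; i++) {
                    ps.setString(i + 1, record[i]); // columns follow the BaseValues index order
                }
                ps.addBatch();
            }
            ps.executeBatch(); // send the whole batch in one round trip
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}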


The Parsing Code

Below is the main analysis code of the program, split into a tool class and a parser thread class; both are listed in full here:

Main tool class

package org.apache.hadoop.mapreduce.v2.hs.tool.sqlanalyse;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.fs.UnsupportedFileSystemException;

public class HiveSqlAnalyseTool {
    private int threadNum;
    private String dirType;
    private String jobHistoryPath;

    private FileContext doneDirFc;
    private Path doneDirPrefixPath;

    // Shared work queue for the parser threads and the per-job result records, keyed by job id.
    private LinkedList<FileStatus> fileStatusList;
    private HashMap<String, String[]> dataInfos;
    private DbClient dbClient;

    public HiveSqlAnalyseTool(String dirType, String jobHistoryPath, int threadNum) {
        this.threadNum = threadNum;
        this.dirType = dirType;
        this.jobHistoryPath = jobHistoryPath;
        this.dataInfos = new HashMap<String, String[]>();
        this.fileStatusList = new LinkedList<FileStatus>();
        this.dbClient = new DbClient(BaseValues.DB_URL, BaseValues.DB_USER_NAME,
                BaseValues.DB_PASSWORD, BaseValues.DB_HIVE_SQL_STAT_TABLE_NAME);

        try {
            doneDirPrefixPath = FileContext.getFileContext(new Configuration())
                    .makeQualified(new Path(this.jobHistoryPath));
            doneDirFc = FileContext.getFileContext(doneDirPrefixPath.toUri());
        } catch (UnsupportedFileSystemException e) {
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        }
    }

    public void readJobInfoFiles() {
        List<FileStatus> files = new ArrayList<FileStatus>();
        try {
            files = scanDirectory(doneDirPrefixPath, doneDirFc, files);
        } catch (IOException e) {
            e.printStackTrace();
        }

        if (files != null) {
            // Parse every file once in the main thread (prints per-job info and fills dataInfos).
            for (FileStatus fs : files) {
                parseFileInfo(fs);
            }
            System.out.println("files num is " + files.size());
            System.out.println("fileStatusList size is " + fileStatusList.size());

            // Start the parser threads, which drain the shared fileStatusList queue.
            ParseThread[] threads = new ParseThread[threadNum];
            for (int i = 0; i < threadNum; i++) {
                System.out.println("thread " + i + " start run");
                threads[i] = new ParseThread(this, fileStatusList, dataInfos);
                threads[i].start();
            }

            for (int i = 0; i < threadNum; i++) {
                System.out.println("thread " + i + " join run");
                try {
                    if (threads[i] != null) {
                        threads[i].join();
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        } else {
            System.out.println("files is null");
        }

        printStatDatas();
    }

    protected List<FileStatus> scanDirectory(Path path, FileContext fc,
            List<FileStatus> jhStatusList) throws IOException {
        path = fc.makeQualified(path);
        System.out.println("dir path is " + path.getName());
        try {
            RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
            while (fileStatusIter.hasNext()) {
                FileStatus fileStatus = fileStatusIter.next();
                Path filePath = fileStatus.getPath();

                if (fileStatus.isFile()) {
                    jhStatusList.add(fileStatus);
                    fileStatusList.add(fileStatus);
                } else if (fileStatus.isDirectory()) {
                    // Recurse into date sub-directories.
                    scanDirectory(filePath, fc, jhStatusList);
                }
            }
        } catch (FileNotFoundException fe) {
            System.out.println("Error while scanning directory " + path);
        }
        return jhStatusList;
    }

    private void parseFileInfo(FileStatus fs) {
        String str;
        String username;
        String fileType;
        String jobId;
        String jobName;
        String hiveSql;

        int startPos;
        int endPos;
        int hiveSqlFlag;
        long launchTime;
        long finishTime;
        int mapTaskNum;
        int reduceTaskNum;

        String xmlNameFlag;
        String launchTimeFlag;
        String finishTimeFlag;
        String launchMapFlag;
        String launchReduceFlag;

        Path path;
        FileSystem fileSystem;
        InputStream in;

        fileType = "";
        hiveSql = "";
        jobId = "";
        jobName = "";
        username = "";
        hiveSqlFlag = 0;
        launchTime = 0;
        finishTime = 0;
        mapTaskNum = 0;
        reduceTaskNum = 0;

        // Markers used for plain-text matching in the conf XML and the .jhist JSON.
        xmlNameFlag = "<value>";
        launchTimeFlag = "\"launchTime\":";
        finishTimeFlag = "\"finishTime\":";
        launchMapFlag = "\"Launched map tasks\"";
        launchReduceFlag = "\"Launched reduce tasks\"";

        path = fs.getPath();
        str = path.getName();
        if (str.endsWith(".xml")) {
            // job_xxx_conf.xml: the job configuration, which contains hive.query.string.
            fileType = "config";
            endPos = str.lastIndexOf("_");
            jobId = str.substring(0, endPos);
        } else if (str.endsWith(".jhist")) {
            // job_xxx-....jhist: the job history, which contains times and task counts.
            fileType = "info";
            endPos = str.indexOf("-");
            jobId = str.substring(0, endPos);
        } else {
            return;
        }

        try {
            fileSystem = path.getFileSystem(new Configuration());
            in = fileSystem.open(path);
            InputStreamReader isr;
            BufferedReader br;

            // Decode as UTF-8 so Chinese comments in the SQL are not garbled.
            isr = new InputStreamReader(in, "UTF-8");
            br = new BufferedReader(isr);

            while ((str = br.readLine()) != null) {
                if (str.contains("mapreduce.job.user.name")) {
                    startPos = str.indexOf(xmlNameFlag);
                    endPos = str.indexOf("</value>");
                    username = str.substring(startPos + xmlNameFlag.length(), endPos);
                } else if (str.contains("mapreduce.job.name")) {
                    startPos = str.indexOf(xmlNameFlag);
                    endPos = str.indexOf("</value>");
                    jobName = str.substring(startPos + xmlNameFlag.length(), endPos);
                } else if (str.contains("hive.query.string")) {
                    // The SQL may span several lines, so keep appending until </value>.
                    hiveSqlFlag = 1;
                    hiveSql = str;
                } else if (hiveSqlFlag == 1) {
                    hiveSql += str;
                    if (str.contains("</value>")) {
                        startPos = hiveSql.indexOf(xmlNameFlag);
                        endPos = hiveSql.indexOf("</value>");
                        hiveSql = hiveSql.substring(startPos + xmlNameFlag.length(), endPos);
                        hiveSqlFlag = 0;
                    }
                } else if (str.startsWith("{\"type\":\"JOB_INITED\"")) {
                    startPos = str.indexOf(launchTimeFlag);
                    str = str.substring(startPos + launchTimeFlag.length());
                    endPos = str.indexOf(",");
                    launchTime = Long.parseLong(str.substring(0, endPos));
                } else if (str.startsWith("{\"type\":\"JOB_FINISHED\"")) {
                    mapTaskNum = parseTaskNum(launchMapFlag, str);
                    reduceTaskNum = parseTaskNum(launchReduceFlag, str);
                    startPos = str.indexOf(finishTimeFlag);
                    str = str.substring(startPos + finishTimeFlag.length());
                    endPos = str.indexOf(",");
                    finishTime = Long.parseLong(str.substring(0, endPos));
                }
            }

            System.out.println("jobId is " + jobId);
            System.out.println("jobName is " + jobName);
            System.out.println("username is " + username);
            System.out.println("map task num is " + mapTaskNum);
            System.out.println("reduce task num is " + reduceTaskNum);
            System.out.println("launchTime is " + launchTime);
            System.out.println("finishTime is " + finishTime);
            System.out.println("hive query sql is " + hiveSql);
        } catch (IOException e) {
            e.printStackTrace();
        }

        if (fileType.equals("config")) {
            insertConfParseData(jobId, jobName, username, hiveSql);
        } else if (fileType.equals("info")) {
            insertJobInfoParseData(jobId, launchTime, finishTime, mapTaskNum, reduceTaskNum);
        }
    }

    private void insertConfParseData(String jobId, String jobName, String username, String sql) {
        String[] array;
        if (dataInfos.containsKey(jobId)) {
            array = dataInfos.get(jobId);
        } else {
            array = new String[BaseValues.DB_COLUMN_HIVE_SQL_LEN];
        }

        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBID] = jobId;
        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBNAME] = jobName;
        array[BaseValues.DB_COLUMN_HIVE_SQL_USERNAME] = username;
        array[BaseValues.DB_COLUMN_HIVE_SQL_HIVE_SQL] = sql;

        dataInfos.put(jobId, array);
    }

    private void insertJobInfoParseData(String jobId, long launchTime, long finishedTime,
            int mapTaskNum, int reduceTaskNum) {
        String[] array;
        if (dataInfos.containsKey(jobId)) {
            array = dataInfos.get(jobId);
        } else {
            array = new String[BaseValues.DB_COLUMN_HIVE_SQL_LEN];
        }

        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBID] = jobId;
        array[BaseValues.DB_COLUMN_HIVE_SQL_START_TIME] = String.valueOf(launchTime);
        array[BaseValues.DB_COLUMN_HIVE_SQL_FINISH_TIME] = String.valueOf(finishedTime);
        array[BaseValues.DB_COLUMN_HIVE_SQL_MAP_TASK_NUM] = String.valueOf(mapTaskNum);
        array[BaseValues.DB_COLUMN_HIVE_SQL_REDUCE_TASK_NUM] = String.valueOf(reduceTaskNum);

        dataInfos.put(jobId, array);
    }

    private int parseTaskNum(String flag, String jobStr) {
        int taskNum;
        int startPos;
        int endPos;
        String tmpStr;

        taskNum = 0;
        tmpStr = jobStr;
        startPos = tmpStr.indexOf(flag);
        if (startPos == -1) {
            return 0;
        }

        tmpStr = tmpStr.substring(startPos + flag.length());
        endPos = tmpStr.indexOf("}");
        tmpStr = tmpStr.substring(0, endPos);
        taskNum = Integer.parseInt(tmpStr.split(":")[1]);

        return taskNum;
    }

    private void printStatDatas() {
        if (dbClient != null) {
            dbClient.createConnection();
        }

        if (dataInfos != null) {
            System.out.println("map data size is " + dataInfos.size());
            if (dbClient != null && dirType.equals("dateTimeDir")) {
                dbClient.insertDataBatch(dataInfos);
            }
        }

        if (dbClient != null) {
            dbClient.closeConnection();
        }
    }

    public synchronized FileStatus getOneFile() {
        FileStatus fs = null;
        if (fileStatusList != null && fileStatusList.size() > 0) {
            fs = fileStatusList.poll();
        }
        return fs;
    }

    public synchronized void addDataToMap(String jobId, String[] values) {
        if (dataInfos != null) {
            dataInfos.put(jobId, values);
        }
    }
}
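The code above also references a BaseValues constants class that is not part of the listing. A minimal sketch, with index values and connection settings chosen purely for illustration, could look like this:

package org.apache.hadoop.mapreduce.v2.hs.tool.sqlanalyse;

// Hypothetical constants class; the listings only require that these names exist.
public class BaseValues {
    // Database connection settings (placeholders).
    public static final String DB_URL = "jdbc:mysql://127.0.0.1:3306/hive_stat";
    public static final String DB_USER_NAME = "user";
    public static final String DB_PASSWORD = "password";
    public static final String DB_HIVE_SQL_STAT_TABLE_NAME = "hive_sql_stat";

    // Column positions inside the per-job String[] record.
    public static final int DB_COLUMN_HIVE_SQL_JOBID = 0;
    public static final int DB_COLUMN_HIVE_SQL_JOBNAME = 1;
    public static final int DB_COLUMN_HIVE_SQL_USERNAME = 2;
    public static final int DB_COLUMN_HIVE_SQL_HIVE_SQL = 3;
    public static final int DB_COLUMN_HIVE_SQL_START_TIME = 4;
    public static final int DB_COLUMN_HIVE_SQL_FINISH_TIME = 5;
    public static final int DB_COLUMN_HIVE_SQL_MAP_TASK_NUM = 6;
    public static final int DB_COLUMN_HIVE_SQL_REDUCE_TASK_NUM = 7;
    public static final int DB_COLUMN_HIVE_SQL_LEN = 8;
}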

Parser thread class

package org.apache.hadoop.mapreduce.v2.hs.tool.sqlanalyse;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.LinkedList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParseThread extends Thread {
    private HiveSqlAnalyseTool tool;
    private LinkedList<FileStatus> fileStatus;
    private HashMap<String, String[]> dataInfos;

    public ParseThread(HiveSqlAnalyseTool tool, LinkedList<FileStatus> fileStatus,
            HashMap<String, String[]> dataInfos) {
        this.tool = tool;
        this.fileStatus = fileStatus;
        this.dataInfos = dataInfos;
    }

    @Override
    public void run() {
        FileStatus fs;
        // Keep pulling files from the shared queue until it is drained.
        while (fileStatus != null && !fileStatus.isEmpty()) {
            fs = tool.getOneFile();
            if (fs == null) {
                // Another thread drained the queue between the check and the poll.
                break;
            }
            parseFileInfo(fs);
        }
        super.run();
    }

    private void parseFileInfo(FileStatus fs) {
        String str;
        String username;
        String fileType;
        String jobId;
        String jobName;
        String hiveSql;

        int startPos;
        int endPos;
        int hiveSqlFlag;
        long launchTime;
        long finishTime;
        int mapTaskNum;
        int reduceTaskNum;

        String xmlNameFlag;
        String launchTimeFlag;
        String finishTimeFlag;
        String launchMapFlag;
        String launchReduceFlag;

        Path path;
        FileSystem fileSystem;
        InputStream in;

        fileType = "";
        hiveSql = "";
        jobId = "";
        jobName = "";
        username = "";
        hiveSqlFlag = 0;
        launchTime = 0;
        finishTime = 0;
        mapTaskNum = 0;
        reduceTaskNum = 0;

        // Markers used for plain-text matching in the conf XML and the .jhist JSON.
        xmlNameFlag = "<value>";
        launchTimeFlag = "\"launchTime\":";
        finishTimeFlag = "\"finishTime\":";
        launchMapFlag = "\"Launched map tasks\"";
        launchReduceFlag = "\"Launched reduce tasks\"";

        path = fs.getPath();
        str = path.getName();
        if (str.endsWith(".xml")) {
            fileType = "config";
            endPos = str.lastIndexOf("_");
            jobId = str.substring(0, endPos);
        } else if (str.endsWith(".jhist")) {
            fileType = "info";
            endPos = str.indexOf("-");
            jobId = str.substring(0, endPos);
        } else {
            return;
        }

        try {
            fileSystem = path.getFileSystem(new Configuration());
            in = fileSystem.open(path);
            InputStreamReader isr;
            BufferedReader br;

            // Decode as UTF-8 so Chinese comments in the SQL are not garbled.
            isr = new InputStreamReader(in, "UTF-8");
            br = new BufferedReader(isr);

            while ((str = br.readLine()) != null) {
                if (str.contains("mapreduce.job.user.name")) {
                    startPos = str.indexOf(xmlNameFlag);
                    endPos = str.indexOf("</value>");
                    username = str.substring(startPos + xmlNameFlag.length(), endPos);
                } else if (str.contains("mapreduce.job.name")) {
                    startPos = str.indexOf(xmlNameFlag);
                    endPos = str.indexOf("</value>");
                    jobName = str.substring(startPos + xmlNameFlag.length(), endPos);
                } else if (str.contains("hive.query.string")) {
                    // The SQL may span several lines, so keep appending until </value>.
                    hiveSqlFlag = 1;
                    hiveSql = str;
                } else if (hiveSqlFlag == 1) {
                    hiveSql += str;
                    if (str.contains("</value>")) {
                        startPos = hiveSql.indexOf(xmlNameFlag);
                        endPos = hiveSql.indexOf("</value>");
                        hiveSql = hiveSql.substring(startPos + xmlNameFlag.length(), endPos);
                        hiveSqlFlag = 0;
                    }
                } else if (str.startsWith("{\"type\":\"JOB_INITED\"")) {
                    startPos = str.indexOf(launchTimeFlag);
                    str = str.substring(startPos + launchTimeFlag.length());
                    endPos = str.indexOf(",");
                    launchTime = Long.parseLong(str.substring(0, endPos));
                } else if (str.startsWith("{\"type\":\"JOB_FINISHED\"")) {
                    mapTaskNum = parseTaskNum(launchMapFlag, str);
                    reduceTaskNum = parseTaskNum(launchReduceFlag, str);
                    startPos = str.indexOf(finishTimeFlag);
                    str = str.substring(startPos + finishTimeFlag.length());
                    endPos = str.indexOf(",");
                    finishTime = Long.parseLong(str.substring(0, endPos));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Hand the parsed record back to the tool through its synchronized map method.
        if (fileType.equals("config")) {
            insertConfParseData(jobId, jobName, username, hiveSql);
        } else if (fileType.equals("info")) {
            insertJobInfoParseData(jobId, launchTime, finishTime, mapTaskNum, reduceTaskNum);
        }
    }

    private void insertConfParseData(String jobId, String jobName, String username, String sql) {
        String[] array;
        if (dataInfos.containsKey(jobId)) {
            array = dataInfos.get(jobId);
        } else {
            array = new String[BaseValues.DB_COLUMN_HIVE_SQL_LEN];
        }

        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBID] = jobId;
        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBNAME] = jobName;
        array[BaseValues.DB_COLUMN_HIVE_SQL_USERNAME] = username;
        array[BaseValues.DB_COLUMN_HIVE_SQL_HIVE_SQL] = sql;

        tool.addDataToMap(jobId, array);
    }

    private void insertJobInfoParseData(String jobId, long launchTime, long finishedTime,
            int mapTaskNum, int reduceTaskNum) {
        String[] array;
        if (dataInfos.containsKey(jobId)) {
            array = dataInfos.get(jobId);
        } else {
            array = new String[BaseValues.DB_COLUMN_HIVE_SQL_LEN];
        }

        array[BaseValues.DB_COLUMN_HIVE_SQL_JOBID] = jobId;
        array[BaseValues.DB_COLUMN_HIVE_SQL_START_TIME] = String.valueOf(launchTime);
        array[BaseValues.DB_COLUMN_HIVE_SQL_FINISH_TIME] = String.valueOf(finishedTime);
        array[BaseValues.DB_COLUMN_HIVE_SQL_MAP_TASK_NUM] = String.valueOf(mapTaskNum);
        array[BaseValues.DB_COLUMN_HIVE_SQL_REDUCE_TASK_NUM] = String.valueOf(reduceTaskNum);

        tool.addDataToMap(jobId, array);
    }

    private int parseTaskNum(String flag, String jobStr) {
        int taskNum;
        int startPos;
        int endPos;
        String tmpStr;

        taskNum = 0;
        tmpStr = jobStr;
        startPos = tmpStr.indexOf(flag);
        if (startPos == -1) {
            return 0;
        }

        tmpStr = tmpStr.substring(startPos + flag.length());
        endPos = tmpStr.indexOf("}");
        tmpStr = tmpStr.substring(0, endPos);
        taskNum = Integer.parseInt(tmpStr.split(":")[1]);

        return taskNum;
    }
}

Summary

With the tool above we can quickly analyze how Hive SQL jobs ran. The program parses the JobHistory files, extracts the details of each job, and stores the results in a database. In practice this noticeably improves the efficiency of debugging and tuning Hive jobs. If you are interested in code analysis of other parts of YARN and Hadoop, more can be found at the source linked below.

Reposted from: http://xvng.baihongyu.com/
