Saturday, February 23, 2013

Hive+Xml Processing


                                  Hive+Xml Processing
First to understand XPATH() : by using this to parse XML data into String array.
Example: small xml data
   <rec><name>Babu</name><age>25</age><sex>male</sex></rec>
 <rec><name>Radha</name><age>23</age><sex>female</sex></rec>
NOTE:  xml data converted into hive table in two steps process
1.      Convert the xml data into array format  
2.      Array data can be converted into hive table format.
Process :
Step1: create the hive table
      Ex: Hive>create table hivexml(str string);
Step2: load the xmldata into hive table
     EX:    Hive>load data local inpath ‘xmlfile’ into table hivexml;
Ø  By this step load all local xml data into your hive table astise, so we can convert that data into STRING ARRAY format by using XPATH(), And then we can convert the array data into normal hive table data,
Step3:  convert the xml data into array format
     EX:   Hive>select xpath(str,’rec/*/text()’) from xmlhive;
Ø  OutPut:  ["Babu”,”25”,”male”]
                [“Radha”,”23”,”female”]

Explanation of ‘rec/*/text()’
                     rec: its define Node of xml same as XML DATA  (Check the xml data)
                       *: its define all the fields of xml data.
If you want specific fields simply mansion it  like below
Ex: Hive>select xpath(str,’rec/name/text()’)from xmlhive;

Ø  OutPut:  [“Babu”]
                [“Radha”]
Step4:  crate the HIVE table required columns
   EX: Hive> create  table newhivexml(name string,age  int,sex  string);
Ø  After creating the table to load the xml array format data into newhivexml table like below
Step5: 
     Hive> insert overwrite table newhivexml select xpath_string(str,'rec/name'),xpath_string(str,'rec/age'),xpath_string(str,'rec/sex')from hivexml;

Hive>select * from  newhivwxml ;
To get the data in table format like below.
name    age    sex
Babu    25      male
Radha   23      female

Thank you.
This note only for to get some basic idea purpose give me your feedback                                



9 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Hi kali,

    I am following your blog post for Hive+Xml Processing on aws ec2 instance and running the commands on centos. All was working fine until i ran the following command :
    "select xpath(data,'Food_Display_Table/*/text()') from food_display_table_xml;"

    This command is giving me the following error message:


    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
    Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:225)
    org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)

    at org.apache.hadoop.ipc.Client.call(Client.java:1118)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at $Proxy11.mkdirs(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
    org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1641)
    ... 26 more
    Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x)'
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask

    ReplyDelete
  4. hi kaali,
    thank you for the stuff.

    the code which you gave is only working in hive 0.7 version but not in 10.1 .....may i know why?

    ReplyDelete
  5. Thanks for the post, In this complex environment business need to present there company data in meaningful way.So user easily understand it .Sqiar (http://www.sqiar.com/blog) which is in UK,provide services like Tableau and Data Warehousing etc .In these services sqiar experts convert company data into meaningful way.

    ReplyDelete
  6. Thanks for sharing this informative information. For more you may refer http://www.s4techno.com/hadoop-training-in-pune/

    ReplyDelete