Wednesday, August 25, 2010

XML Processing in Hadoop

In this post, I will describe how to process XML files using Hadoop. XML files can be processed with Hadoop streaming, but here we will take another approach that is more efficient than streaming. The details of the streaming approach can be found at the following link:
http://hadoop.apache.org/common/docs/r0.17.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

We will use Mahout's XmlInputFormat class to process the XML files. For this, we need three things:

1- Driver class to run the program
2- Mapper Class
3- XmlInputFormat class

To keep the example simple, I am not using any reducers. Now let's write the code for each of these pieces.
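
For concreteness, the rest of the post assumes that each record in the input file looks roughly like the snippet below. The element names (record, tag, tag1, tag2) are placeholders chosen purely for illustration; substitute the tags from your own XML.

<record>
  <tag>
    <tag1>value one
      <tag2>value two</tag2>
    </tag1>
  </tag>
</record>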

Driver Class:

Here is the code for the driver class, which is explained below.




import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
*
* @author root
*/
public class ParserDriver {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            runJob(args[0], args[1]);
        } catch (IOException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void runJob(String input, String output) throws IOException {

        Configuration conf = new Configuration();

        // Start and end tags that delimit one record (placeholder names; use your own)
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");
        conf.set("io.serializations",
                "org.apache.hadoop.io.serializer.JavaSerialization,"
                        + "org.apache.hadoop.io.serializer.WritableSerialization");

        Job job = new Job(conf, "jobName");

        FileInputFormat.setInputPaths(job, input);
        job.setJarByClass(ParserDriver.class);
        job.setMapperClass(MyParserMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);

        // Delete the output directory if it already exists
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }

        try {
            job.waitForCompletion(true);
        } catch (InterruptedException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ClassNotFoundException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
The code is mostly self-explanatory. You need to define the start and end tags that delimit a single record in the XML file; they are set in the following lines (here <record> and </record> are just placeholders for your own record element):

conf.set("xmlinput.start", "<record>");
conf.set("xmlinput.end", "</record>");

Then you need to set the input and output paths (which I am taking as command-line arguments) and the mapper class.

Next we will define our mapper.

Mapper:

To parse the XML records, you need a parser library. There are many ways to parse XML in Java, for example with a SAX or DOM parser; I have used the JDOM library here. Below is the code for the mapper class, which is explained afterwards.


import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

/**
 *
 * @author root
 */
public class MyParserMapper extends
        Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value1, Context context)
            throws IOException, InterruptedException {

        // value1 holds one complete XML record, as delimited by the start/end tags
        String xmlString = value1.toString();

        SAXBuilder builder = new SAXBuilder();
        Reader in = new StringReader(xmlString);
        String value = "";
        try {
            Document doc = builder.build(in);
            Element root = doc.getRootElement();

            String tag1 = root.getChild("tag").getChild("tag1").getTextTrim();
            String tag2 = root.getChild("tag").getChild("tag1").getChild("tag2").getTextTrim();

            value = tag1 + "," + tag2;
            context.write(NullWritable.get(), new Text(value));
        } catch (JDOMException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

The code is very simple: you receive one record in value1, parse out the fields you need, and then emit the result using
context.write(NullWritable.get(), new Text(value));

I did not need a key, so I use NullWritable; the value contains the comma-delimited record after parsing.
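
If you want to sanity-check the JDOM extraction logic outside of Hadoop, a minimal standalone test along the following lines should work. It assumes the placeholder record layout shown at the top of the post.

import java.io.StringReader;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class ParseCheck {
    public static void main(String[] args) throws Exception {
        // One record, using the placeholder tags from the example above
        String record = "<record><tag><tag1>value one<tag2>value two</tag2></tag1></tag></record>";
        Document doc = new SAXBuilder().build(new StringReader(record));
        Element root = doc.getRootElement();
        String tag1 = root.getChild("tag").getChild("tag1").getTextTrim();
        String tag2 = root.getChild("tag").getChild("tag1").getChild("tag2").getTextTrim();
        // Should print: value one,value two
        System.out.println(tag1 + "," + tag2);
    }
}

Each line of the job's output will have this same comma-delimited form.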

Next, I am also providing the Mahout XmlInputFormat class code, ported to the new Hadoop API.

Mahout XmlInputFormat (Compatible with the New Hadoop API):


import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;


/**
* Reads records that are delimited by a specific begin/end tag.
*/
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit is, TaskAttemptContext tac) {
        return new XmlRecordReader();
    }

    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private DataOutputBuffer buffer = new DataOutputBuffer();
        private LongWritable key = new LongWritable();
        private Text value = new Text();

        @Override
        public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) is;
            startTag = tac.getConfiguration().get(START_TAG_KEY).getBytes("utf-8");
            endTag = tac.getConfiguration().get(END_TAG_KEY).getBytes("utf-8");

            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path file = fileSplit.getPath();

            FileSystem fs = file.getFileSystem(tac.getConfiguration());
            fsin = fs.open(fileSplit.getPath());
            fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (fsin.getPos() < end) {
                if (readUntilMatch(startTag, false)) {
                    try {
                        buffer.write(startTag);
                        if (readUntilMatch(endTag, true)) {
                            value.set(buffer.getData(), 0, buffer.getLength());
                            key.set(fsin.getPos());
                            return true;
                        }
                    } finally {
                        buffer.reset();
                    }
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return (fsin.getPos() - start) / (float) (end - start);
        }

        @Override
        public void close() throws IOException {
            fsin.close();
        }

        private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();
                // end of file:
                if (b == -1) return false;
                // save to buffer:
                if (withinBlock) buffer.write(b);

                // check if we're matching:
                if (b == match[i]) {
                    i++;
                    if (i >= match.length) return true;
                } else i = 0;
                // see if we've passed the stop point:
                if (!withinBlock && i == 0 && fsin.getPos() >= end) return false;
            }
        }
    }
}

To run this code, include the necessary jar files (jdom.jar, hadoop-core.jar) and package everything into a single jar file. You can find instructions for building a single jar at the following link:

http://java.sun.com/developer/technicalArticles/java_warehouse/single_jar/
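
As a rough sketch (file names and paths are only examples and will differ on your system), compiling and packaging can look like this; Hadoop adds any dependency jars placed in a lib/ directory inside the job jar to the task classpath:

mkdir -p build/lib
javac -classpath hadoop-core.jar:jdom.jar -d build ParserDriver.java MyParserMapper.java XmlInputFormat.java
cp jdom.jar build/lib/
jar -cvf MyParser.jar -C build .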

Next, run the job from the terminal with the following command:


hadoop jar MyParser.jar /user/root/Data/file.xml outputhere
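
Since the job runs without reducers, the parsed records land in the output directory as part files. Assuming the default output file naming, you can inspect the result with something like:

hadoop fs -cat outputhere/part-*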


Conclusion:

In this way, we can process large amounts of XML data using Hadoop and the Mahout XmlInputFormat.

Comments:

  1. This is exactly what I needed. I have just saved lot of time translating mahout example to the new api.
    Thanks!!

  2. Hi Shuja,

    I have a doubt with XML processing. HDFS splits files into 64 MB chunks, and your program is going to lose records divided between the end of one chunk and the start of the next one.

  3. Can you help me resolve this

    11/12/30 05:34:45 INFO input.FileInputFormat: Total input paths to process : 1
    11/12/30 05:34:45 INFO mapred.JobClient: Running job: job_201112300438_0010
    11/12/30 05:34:46 INFO mapred.JobClient: map 0% reduce 0%
    11/12/30 05:34:56 INFO mapred.JobClient: Task Id : attempt_201112300438_0010_m_000000_0, Status : FAILED
    java.lang.NullPointerException
    at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:65)
    at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:46)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

  4. for nested document tag --> improved code

    private boolean readUntilMatch(byte[] match, boolean withinBlock)
            throws IOException {
        int i = 0;
        int j = 0;
        int nestedTags = 0;
        while (true) {
            int b = fsin.read();
            // end of file:
            if (b == -1)
                return false;
            // save to buffer:
            if (withinBlock)
                buffer.write(b);

            // while searching for the end tag, check whether we match the start tag again (nested tag)
            if (withinBlock && b == startTag[j]) {
                j++;
                if (j >= startTag.length) {
                    nestedTags++;
                    j = 0;
                }
            } else {
                j = 0;
            }

            // check if we're matching:
            if (b == match[i]) {
                i++;
                if (i >= match.length) {
                    if (nestedTags == 0) { // break the loop if there were no nested tags
                        return true;
                    } else {
                        --nestedTags; // else decrement the count
                        i = 0;        // reset the index
                    }
                }
            } else {
                i = 0;
            }
            // see if we've passed the stop point:
            if (!withinBlock && i == 0 && fsin.getPos() >= end)
                return false;
        }
    }
