Wednesday, August 25, 2010

XML Processing in Hadoop

In this post, I will describe how to process XML files using Hadoop. XML files can be processed with Hadoop Streaming, but here we will use another approach that is more efficient than streaming. The details of streaming can be found at the following link:
http://hadoop.apache.org/common/docs/r0.17.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

We will use the Mahout XmlInputFormat class to process the XML files. For this, we need three things:

1- Driver class to run the program
2- Mapper class
3- XmlInputFormat class

I am not using reducers, to keep the example simple. Now let's do some programming to put these pieces together.

Driver Class:

Here is the code for the driver class, which is explained below.




import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParserDriver {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            runJob(args[0], args[1]);
        } catch (IOException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void runJob(String input, String output) throws IOException {

        Configuration conf = new Configuration();

        // Define the tags that delimit one record in the XML file.
        conf.set("xmlinput.start", "<startingTag>");
        conf.set("xmlinput.end", "</endingTag>");
        conf.set("io.serializations",
                "org.apache.hadoop.io.serializer.JavaSerialization,"
              + "org.apache.hadoop.io.serializer.WritableSerialization");

        Job job = new Job(conf, "jobName");

        FileInputFormat.setInputPaths(job, input);
        job.setJarByClass(ParserDriver.class);
        job.setMapperClass(MyParserMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Delete the output directory if it already exists.
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }

        try {
            job.waitForCompletion(true);
        } catch (InterruptedException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ClassNotFoundException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
The code is mostly self-explanatory. You need to define the start and end tags that delimit a single record in the XML file; this is done in the following lines:

conf.set("xmlinput.start", "<startingTag>");
conf.set("xmlinput.end", "</endingTag>");

You also need to set the input and output paths (which I am taking as command-line arguments) and the mapper class.
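To make the delimiting concrete, here is a small standalone sketch, with hypothetical tag names and sample content not taken from the post, of how the record reader conceptually slices a file into records: everything from a start tag through its matching end tag becomes one map() value.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordTagDemo {

    // Slices 'content' into records, where each record runs from a start tag
    // through the next end tag, inclusive. This mirrors what XmlInputFormat
    // does on a byte stream, but on an in-memory string for illustration.
    public static List<String> extractRecords(String content, String startTag, String endTag) {
        List<String> records = new ArrayList<String>();
        int pos = 0;
        while ((pos = content.indexOf(startTag, pos)) != -1) {
            int close = content.indexOf(endTag, pos);
            if (close == -1) break; // unterminated record: stop
            records.add(content.substring(pos, close + endTag.length()));
            pos = close + endTag.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical input file containing two records.
        String fileContent = "<records><record>a</record><record>b</record></records>";
        for (String record : extractRecords(fileContent, "<record>", "</record>")) {
            System.out.println(record);
        }
        // prints: <record>a</record> then <record>b</record>
    }
}
```

Each string produced this way is what a single call to map() receives as its value.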

Next we will define our mapper.

Mapper:

To parse the XML records, you need a parser library. There are several ways to parse XML in Java, for example with a SAX or DOM parser. I have used the JDOM library. Here is the code for the mapper class, which is explained below.


import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;

public class MyParserMapper extends
        Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value1, Context context)
            throws IOException, InterruptedException {

        // value1 holds one complete XML record produced by XmlInputFormat.
        String xmlString = value1.toString();
        SAXBuilder builder = new SAXBuilder();
        Reader in = new StringReader(xmlString);

        try {
            Document doc = builder.build(in);
            Element root = doc.getRootElement();
            String tag1 = root.getChild("tag").getChild("tag1").getTextTrim();
            String tag2 = root.getChild("tag").getChild("tag1").getChild("tag2").getTextTrim();
            String value = tag1 + "," + tag2;
            context.write(NullWritable.get(), new Text(value));
        } catch (JDOMException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

The code is very simple: the record arrives in value1, it is parsed, and the result is emitted with

context.write(NullWritable.get(), new Text(value));

I did not need a key, so I use NullWritable; the value contains the comma-delimited record after parsing.
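JDOM is not part of the JDK, so if you just want to try the parsing step in isolation, a similar extraction can be sketched with the JDK's built-in DOM parser. This is an illustrative alternative, not the post's method, and the record structure below is a hypothetical flat example rather than the nested tags used in the mapper above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomParseDemo {

    // Parses one XML record and returns the comma-delimited line the mapper
    // would emit. Tag names are hypothetical examples.
    public static String parseRecord(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        String tag1 = root.getElementsByTagName("tag1").item(0).getTextContent().trim();
        String tag2 = root.getElementsByTagName("tag2").item(0).getTextContent().trim();
        return tag1 + "," + tag2;
    }

    public static void main(String[] args) throws Exception {
        String record = "<record><tag1> hello </tag1><tag2> world </tag2></record>";
        System.out.println(parseRecord(record)); // prints: hello,world
    }
}
```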

Next, I am also providing the Mahout XmlInputFormat class code, which is compatible with the new Hadoop API.

Mahout XmlInputFormat (Compatible with the New Hadoop API):


import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/**
 * Reads records that are delimited by a specific begin/end tag.
 */
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit is, TaskAttemptContext tac) {
        return new XmlRecordReader();
    }

    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private DataOutputBuffer buffer = new DataOutputBuffer();
        private LongWritable key = new LongWritable();
        private Text value = new Text();

        @Override
        public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) is;
            startTag = tac.getConfiguration().get(START_TAG_KEY).getBytes("utf-8");
            endTag = tac.getConfiguration().get(END_TAG_KEY).getBytes("utf-8");

            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path file = fileSplit.getPath();

            FileSystem fs = file.getFileSystem(tac.getConfiguration());
            fsin = fs.open(fileSplit.getPath());
            fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (fsin.getPos() < end) {
                if (readUntilMatch(startTag, false)) {
                    try {
                        buffer.write(startTag);
                        if (readUntilMatch(endTag, true)) {
                            value.set(buffer.getData(), 0, buffer.getLength());
                            key.set(fsin.getPos());
                            return true;
                        }
                    } finally {
                        buffer.reset();
                    }
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return (fsin.getPos() - start) / (float) (end - start);
        }

        @Override
        public void close() throws IOException {
            fsin.close();
        }

        private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();
                // end of file:
                if (b == -1) return false;
                // save to buffer:
                if (withinBlock) buffer.write(b);

                // check if we're matching:
                if (b == match[i]) {
                    i++;
                    if (i >= match.length) return true;
                } else i = 0;
                // see if we've passed the stop point:
                if (!withinBlock && i == 0 && fsin.getPos() >= end) return false;
            }
        }
    }
}
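The heart of XmlRecordReader is readUntilMatch, which scans the stream one byte at a time and tracks how much of the target tag has been matched so far. The same matching logic can be exercised in isolation against an in-memory stream; this is a sketch for illustration only, with the file-split bounds check and buffering omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TagScanDemo {

    // Consumes bytes from 'in' until the sequence 'match' has been read;
    // returns true if found, false on end of stream. Same partial-match
    // logic as XmlRecordReader.readUntilMatch (without the split check).
    public static boolean readUntilMatch(InputStream in, byte[] match) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;      // end of stream
            if (b == match[i]) {
                i++;                        // one more byte of the tag matched
                if (i >= match.length) return true;
            } else {
                i = 0;                      // mismatch: restart matching
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "noise<rec>payload</rec>".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readUntilMatch(in, "<rec>".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

After a successful call, the stream is positioned just past the tag, which is why nextKeyValue can write the start tag into the buffer and then continue scanning for the end tag.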

To run this code, include the necessary jar files (jdom.jar, hadoop-core.jar), and you also need to package everything into a single jar file. You can find instructions for building a single jar file at the following link:

http://java.sun.com/developer/technicalArticles/java_warehouse/single_jar/

Next, run the following command in the terminal to start the job.


hadoop jar MyParser.jar /user/root/Data/file.xml outputhere


Conclusion:

In this way, we can process large amounts of XML data using Hadoop and the Mahout XML input format.

7 comments:

  1. This is exactly what I needed. I have just saved lot of time translating mahout example to the new api.
    Thanks!!

  2. Hi Shuja,

    I have a doubt about XML processing. HDFS splits files into chunks of 64 MB, and your program is going to lose records divided between the end of one chunk and the start of the next.

  3. Can you help me resolve this

    11/12/30 05:34:45 INFO input.FileInputFormat: Total input paths to process : 1
    11/12/30 05:34:45 INFO mapred.JobClient: Running job: job_201112300438_0010
    11/12/30 05:34:46 INFO mapred.JobClient: map 0% reduce 0%
    11/12/30 05:34:56 INFO mapred.JobClient: Task Id : attempt_201112300438_0010_m_000000_0, Status : FAILED
    java.lang.NullPointerException
    at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:65)
    at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:46)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

  4. for nested document tag --> improved code

    private boolean readUntilMatch(byte[] match, boolean withinBlock)
            throws IOException {
        int i = 0;
        int j = 0;
        int nestedTags = 0;
        while (true) {
            int b = fsin.read();
            // end of file:
            if (b == -1)
                return false;
            // save to buffer:
            if (withinBlock)
                buffer.write(b);

            // while searching for the end tag, also watch for another
            // occurrence of the start tag (a nested tag)
            if (withinBlock && b == startTag[j]) {
                j++;
                if (j >= startTag.length) {
                    nestedTags++;
                    j = 0;
                }
            } else {
                j = 0;
            }

            // check if we're matching:
            if (b == match[i]) {
                i++;
                if (i >= match.length) {
                    if (nestedTags == 0) // break the loop if there were no nested tags
                        return true;
                    else {
                        --nestedTags; // else decrement the count
                        i = 0;        // and reset the index
                    }
                }
            } else {
                i = 0;
            }
            // see if we've passed the stop point:
            if (!withinBlock && i == 0 && fsin.getPos() >= end)
                return false;
        }
    }
