One way to process XML documents in Hadoop is to use streaming, as described here:
http://hadoop.apache.org/common/docs/r0.17.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
Instead, we will use the Mahout XmlInputFormat class to process the XML files. For processing XML files, we need three things:
1- Driver class to run the program
2- Mapper class
3- XmlInputFormat class
I am not using reducers, to keep the example simple. Now let's do some programming to work these things out.
Driver Class:
Here is the code for the driver class, which is explained below.
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 *
 * @author root
 */
public class ParserDriver {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            runJob(args[0], args[1]);
        } catch (IOException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void runJob(String input, String output) throws IOException {
        Configuration conf = new Configuration();
        // start and end tags that delimit one XML record
        conf.set("xmlinput.start", "<startingTag>");
        conf.set("xmlinput.end", "</endingTag>");
        conf.set("io.serializations",
                "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
        Job job = new Job(conf, "jobName");
        FileInputFormat.setInputPaths(job, input);
        job.setJarByClass(ParserDriver.class);
        job.setMapperClass(MyParserMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        // delete the output directory if it already exists
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        try {
            job.waitForCompletion(true);
        } catch (InterruptedException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        } catch (ClassNotFoundException ex) {
            Logger.getLogger(ParserDriver.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
conf.set("xmlinput.start", "<startingTag>");
conf.set("xmlinput.end", "</endingTag>");
Then you need to set input path, output path which i am taking as command line arguments, need to set mapper class.
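As a quick illustration (the tag names below are placeholders I made up; substitute whatever delimits a record in your own data), suppose the records are wrapped in a <record> element, so you set xmlinput.start to "<record>" and xmlinput.end to "</record>". Each of the two blocks below would then reach the mapper as one separate value:

<records>
  <record>
    <tag>
      <tag1>first value
        <tag2>second value</tag2>
      </tag1>
    </tag>
  </record>
  <record>
    <tag>
      <tag1>third value
        <tag2>fourth value</tag2>
      </tag1>
    </tag>
  </record>
</records>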
Next, we will define our mapper.
Mapper:
To parse the XML files, you need a parser library. There are many ways to parse an XML file in Java, such as using a SAX or DOM parser. I have used the JDOM library to parse the XML file. Here is the code for the mapper class, which is explained below.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
/**
*
* @author root
*/
public class MyParserMapper extends
        Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Text value1, Context context)
            throws IOException, InterruptedException {
        // value1 holds one complete XML record produced by XmlInputFormat
        String xmlString = value1.toString();
        SAXBuilder builder = new SAXBuilder();
        Reader in = new StringReader(xmlString);
        String value = "";
        try {
            Document doc = builder.build(in);
            Element root = doc.getRootElement();
            String tag1 = root.getChild("tag").getChild("tag1").getTextTrim();
            String tag2 = root.getChild("tag").getChild("tag1").getChild("tag2").getTextTrim();
            value = tag1 + "," + tag2;
            context.write(NullWritable.get(), new Text(value));
        } catch (JDOMException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(MyParserMapper.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
The code is very simple: you get the record in value1, parse the data, and then emit it using
context.write(NullWritable.get(), new Text(value));
I did not need a key, so I use NullWritable, and the value contains the comma-delimited record after parsing.
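For the hypothetical sample input shown earlier, the job output would contain one comma-delimited line per record, along these lines:

first value,second value
third value,fourth value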
Next, I am also providing the Mahout XmlInputFormat class code, which is compatible with the new Hadoop API.
Mahout XmlInputFormat (Compatible with the New Hadoop API):
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
/**
 * Reads records that are delimited by a specific begin/end tag.
 */
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit is, TaskAttemptContext tac) {
        return new XmlRecordReader();
    }

    public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private DataOutputBuffer buffer = new DataOutputBuffer();
        private LongWritable key = new LongWritable();
        private Text value = new Text();

        @Override
        public void initialize(InputSplit is, TaskAttemptContext tac) throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) is;
            startTag = tac.getConfiguration().get(START_TAG_KEY).getBytes("utf-8");
            endTag = tac.getConfiguration().get(END_TAG_KEY).getBytes("utf-8");
            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(tac.getConfiguration());
            fsin = fs.open(fileSplit.getPath());
            fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // Look for a new record only while the current position is inside
            // this split; a record that starts in this split but ends in the
            // next one is still read completely, because readUntilMatch(endTag,
            // true) keeps reading past the split boundary until the end tag.
            if (fsin.getPos() < end) {
                if (readUntilMatch(startTag, false)) {
                    try {
                        buffer.write(startTag);
                        if (readUntilMatch(endTag, true)) {
                            value.set(buffer.getData(), 0, buffer.getLength());
                            key.set(fsin.getPos());
                            return true;
                        }
                    } finally {
                        buffer.reset();
                    }
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return (fsin.getPos() - start) / (float) (end - start);
        }

        @Override
        public void close() throws IOException {
            fsin.close();
        }

        private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();
                // end of file:
                if (b == -1) return false;
                // save to buffer:
                if (withinBlock) buffer.write(b);
                // check if we're matching:
                if (b == match[i]) {
                    i++;
                    if (i >= match.length) return true;
                } else i = 0;
                // see if we've passed the stop point:
                if (!withinBlock && i == 0 && fsin.getPos() >= end) return false;
            }
        }
    }
}
To run this code, include the necessary jar files (jdom.jar, hadoop-core.jar), and you also need to build everything into a single jar file. You can find instructions for making a single jar file at the following link:
http://java.sun.com/developer/technicalArticles/java_warehouse/single_jar/
Next, give the following command on the terminal to run the job:
hadoop jar MyParser.jar /user/root/Data/file.xml outputhere
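Once the job finishes, the parsed records end up in part files under the output directory. Assuming the default file naming for a map-only job, a command like the following should print them:

hadoop fs -cat outputhere/part-m-00000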
Conclusion:
In this way, we can process large amounts of XML data using Hadoop and the Mahout XML input format.
Comments:
This is exactly what I needed. I have just saved a lot of time translating the Mahout example to the new API. Thanks!
Hi Shuja,
I have a doubt with XML processing. HDFS splits files into chunks of 64 MB, and your program is going to lose records divided between the end of one chunk and the start of the next one. Can you help me resolve this?
11/12/30 05:34:45 INFO input.FileInputFormat: Total input paths to process : 1
11/12/30 05:34:45 INFO mapred.JobClient: Running job: job_201112300438_0010
11/12/30 05:34:46 INFO mapred.JobClient: map 0% reduce 0%
11/12/30 05:34:56 INFO mapred.JobClient: Task Id : attempt_201112300438_0010_m_000000_0, Status : FAILED
java.lang.NullPointerException
at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:65)
at xmlhadoop.ParserDriver$MyParserMapper1.map(ParserDriver.java:46)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
For nested document tags, here is improved code:
private boolean readUntilMatch(byte[] match, boolean withinBlock)
throws IOException {
int i = 0;
int j = 0;
int nestedTags = 0;
while (true) {
int b = fsin.read();
// end of file:
if (b == -1)
return false;
// save to buffer:
if (withinBlock)
buffer.write(b);
// check if we're matching for start tag again(nested tag) if you come here for search of end tag
if (withinBlock && b == startTag[j]) {
j++;
if (j >= startTag.length) {
nestedTags++;
j = 0;
}
}else {
j = 0;
}
// check if we're matching:
if (b == match[i]) {
i++;
if (i >= match.length) {
if(nestedTags==0) // Break the loop if there were no nested tags
return true;
else {
--nestedTags; // Else decrement the count
i = 0; // Reset the index
}
}
} else {
i = 0;
}
// see if we've passed the stop point:
if (!withinBlock && i == 0 && fsin.getPos() >= end)
return false;
}
}
Thx :)
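To see what the fix buys you, consider a hypothetical record whose delimiting element also appears nested inside itself (tag names made up for illustration). With xmlinput.start set to "<node>" and xmlinput.end set to "</node>", the original readUntilMatch stops at the first </node> (after "child") and truncates the record, while the version above counts the nested <node> and reads on to the matching outer </node>:

<node>
  <name>parent</name>
  <node>
    <name>child</name>
  </node>
</node>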
Hi,
I would like to know where to keep my input file and how to access it.
I try to run this from Eclipse and get the below exception:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:815)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:798)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:486)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:527)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:504)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:305)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:133)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:144)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at com.streamingedge.hadoop.avro.ParserDriver.runJob(ParserDriver.java:71)
at com.streamingedge.hadoop.avro.ParserDriver.main(ParserDriver.java:28)