Hello Friends,
After publishing my blogs on my POC for processing Excel files in .xls format (97-2003 Format), I was getting requests to provide one usecase to showcase the processing of newer version of Excel File ie; .xlsx file (2007-13 Format).
To all my readers, hereby I am presenting this blog to read Both Old (.xls) and New (.xlsx) extension Excel file. As in my previous blog I have used the Excel Input Format inspired from the website https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/, here also the same has been modified to accept both kind of Excel Inputs using Apache POI.
The Apache POI is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2013).
The Old Excel Input coding required HSSF method to parse the excel sheet where as the New Excel Input coding required XSSF method to parse the excel sheet. In my coding I have used a common method to parse any of the two kind of Excel Sheet.
I will provide you with all possible inputs to drive the same by yourself.
I have used Apache POI 3.15 for my usecase in this blog.
You can download the Apache POI from any of the two below link.
The script requires additional third-party Java libraries to be manually imported into the environment.
STEP 1: Download the Apache POI binary distribution from the above link. .
STEP 1: Download the Apache POI binary distribution from the above link. .
- These scripts have been verified against version 3.15. The scripts may be compatible with other versions but not guaranteed.
- Obtain the *.zip file for Windows or *.tar.gz file for Linux. Examples:
- poi-bin-3.15-20160924.zip
- poi-bin-3.15-20160924.tar.gz
- poi-3.15.jar
- poi-ooxml-3.15.jar
- poi-ooxml-schemas-3.15.jar
- poi-examples-3.15
- poi-excelant-3.15
- poi-scratchpad-3.15
- Under ooxml-lib/folder xmlbeans-2.6.0.jar
- Entire jar under lib folder of POI-3.15.
Now its time to look into the Excel Input Format Coding. Though the only modification done is in Excel Parser but still I am giving entire coding for ease of usage:-
EXCEL INPUT FORMAT
(CUSTOM INPUT FORMAT TO READ ANY EXCEL FILES)
(CUSTOM INPUT FORMAT TO READ ANY EXCEL FILES)
package com.poc;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
return new ExcelRecordReader();
}
}
EXCEL RECORD READER
(TO READ EXCEL FILE AND SEND AS KEY, VALUE FORMAT)
(TO READ EXCEL FILE AND SEND AS KEY, VALUE FORMAT)
package com.poc;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class ExcelRecordReader extends RecordReader<LongWritable, Text> {
private LongWritable key;
private Text value;
private InputStream is;
private String[] strArrayofLines;
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
is = fileIn;
String line = new ExcelParser().parseExcelData(is);
this.strArrayofLines = line.split("\n");
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable(0);
value = new Text(strArrayofLines[0]);
} else {
if (key.get() < (this.strArrayofLines.length - 1)) {
long pos = (int) key.get();
key.set(pos + 1);
value.set(this.strArrayofLines[(int) (pos + 1)]);
pos++;
} else {
return false;
}
}
if (key == null || value == null) {
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return 0;
}
@Override
public void close() throws IOException {
if (is != null) {
is.close();
}
}
}
EXCEL PARSER
(TO PARSE ANY EXCEL SHEET)
In this coding I have modified the code to use the common method to parse any kind of excel sheet.
package com.poc;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.EncryptedDocumentException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
public class ExcelParser {
private static final Log LOG = LogFactory.getLog(ExcelParser.class);
private StringBuilder currentString = null;
private long bytesRead = 0;
@SuppressWarnings("deprecation")
public String parseExcelData(InputStream is) {
try {
Workbook workbook = WorkbookFactory.create(is);
// Taking first sheet from the workbook
Sheet sheet = workbook.getSheetAt(0);
// Iterate through each rows from first sheet
Iterator<Row> rowIterator = sheet.iterator();
currentString = new StringBuilder();
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
// For each row, iterate through each columns
Iterator<Cell> cellIterator = row.cellIterator();
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
switch (cell.getCellType()) {
case Cell.CELL_TYPE_BOOLEAN:
bytesRead++;
currentString.append(cell.getBooleanCellValue() + "\t");
break;
case Cell.CELL_TYPE_NUMERIC:
bytesRead++;
currentString.append(cell.getNumericCellValue() + "\t");
break;
case Cell.CELL_TYPE_STRING:
bytesRead++;
currentString.append(cell.getStringCellValue() + "\t");
break;
}
}
currentString.append("\n");
}
is.close();
} catch (IOException e) {
LOG.error("IO Exception : File not found " + e);
} catch (EncryptedDocumentException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InvalidFormatException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return currentString.toString();
}
public long getBytesRead() {
return bytesRead;
}
}
EXCEL MAPPER
(HAVING SIMPLE MAPPER CODING JUST TO DISPLAY THE CONTENTS)
package com.poc;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text(""), value);
}
}
EXCEL INPUT DRIVER
(MAIN DRIVER CLASS)
package com.poc;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class PocDriver {
static public int count = 0;
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
args = parser.getRemainingArgs();
Job job = new Job(conf, "Any_Excel_File");
job.setJarByClass(PocDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(ExcelInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setNumReduceTasks(0);
job.setMapperClass(TestMapper.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
That's all Friends...
Now get onto your computer & put your Thinking cap, take any Excel Format and have fun working around this coding. You can take examples from any of my previous blog and put this Excel Input Format, Excel Record Reader & Excel parser to check its working.
Hope you all understood the procedures...
Please do notify me for any corrections...
Kindly leave a comment for any queries/clarification...
I would like to thank Mr. Kalyan rambariki for sharing
his valuable suggestion for checking POI jar files.
(Detailed Description of each phase to be added soon).
ALL D BEST...
I read Many Post about Excel and Other Courses but I really Impressed about your Writing Way and How to Express to words.. It’s really helpful for us Thanks for sharing,, keep writing
ReplyDeleteAdvanced excel training in delhi
Advanced excel training in Noida
Advanced excel training in Gurgaon
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download Now
Delete>>>>> Download Full
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download LINK
>>>>> Download Now
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download Full
>>>>> Download LINK 2p
Excellent article!!! Good work, your concept is really helpful for me. Thanks for your contribution in sharing such wonderful information...
ReplyDeleteAdvanced Excel Training In Noida
C C++ Training Institutes in Noida
I am satisfied to read your wonderful post and this content was very interesting. Truly well post and Keep doing...
ReplyDeletePega Training in Chennai
Pega Course in Chennai
Excel Training in Chennai
Corporate Training in Chennai
Embedded System Course Chennai
Linux Training in Chennai
Spark Training in Chennai
Tableau Training in Chennai
Pega Training in Tambaram
Pega Training in Porur
Such a great blog.Thanks for sharing.........
ReplyDeleteSoftware Testing Training in Chennai
Software Testing Course in Bangalore
Software Testing Training in Coimbatore
Software Testing Course in Madurai
Best Software Testing Institute in Bangalore
Software Testing Training in Bangalore
Software Testing Training Institute in Bangalore
Tally Course in Bangalore
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download Now
ReplyDelete>>>>> Download Full
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download LINK
>>>>> Download Now
Understanding Hadoop By Mahesh Maharana: Hadoop - Excel Input Format To Read Any Excel File >>>>> Download Full
>>>>> Download LINK
Kocaeli
ReplyDeleteDenizli
Bartın
Kocaeli
Adana
K7U
yozgat
ReplyDeletetunceli
hakkari
zonguldak
adıyaman
7OG
amasya evden eve nakliyat
ReplyDeleteeskişehir evden eve nakliyat
ardahan evden eve nakliyat
manisa evden eve nakliyat
karaman evden eve nakliyat
2İHM60
21CB4
ReplyDeleteBingöl Evden Eve Nakliyat
Ünye Çelik Kapı
Antep Lojistik
Iğdır Evden Eve Nakliyat
Manisa Şehir İçi Nakliyat
Bingöl Şehir İçi Nakliyat
Kars Şehir İçi Nakliyat
Düzce Parça Eşya Taşıma
Muş Evden Eve Nakliyat