Wednesday, January 11, 2017

HADOOP - EXCEL INPUT FORMAT TO READ ANY EXCEL FILE

Hello Friends,


After publishing my blogs on my POC for processing Excel files in the .xls format (the 97-2003 format), I received requests to provide a use case showing how to process the newer Excel format, i.e. .xlsx files (the 2007-13 format).

To all my readers, I hereby present this blog on reading both the old (.xls) and the new (.xlsx) Excel file formats. As in my previous blog, I have used the Excel Input Format inspired by https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/; here it has been modified to accept both kinds of Excel input using Apache POI.

Apache POI is a project to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java, and likewise MS Word and MS PowerPoint files. Apache POI is your Java Excel solution (for Excel 97-2013).

The old Excel input code required the HSSF classes to parse the spreadsheet, whereas the new Excel input code requires the XSSF classes. In my code I have used a common method that can parse either kind of Excel sheet.
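For the curious: the two formats are easy to tell apart mechanically. An .xls file is an OLE2 compound document and an .xlsx file is a ZIP package, each beginning with its own magic bytes, and this header sniffing is how POI's WorkbookFactory picks the right parser. A stdlib-only sketch of that detection (a hypothetical class for illustration, not POI's actual code):

```java
public class FormatSniffer {

    // OLE2 (.xls) files begin with D0 CF 11 E0; ZIP (.xlsx) files begin with 'P','K'.
    static String sniff(byte[] header) {
        if (header.length >= 4
                && (header[0] & 0xFF) == 0xD0 && (header[1] & 0xFF) == 0xCF
                && (header[2] & 0xFF) == 0x11 && (header[3] & 0xFF) == 0xE0) {
            return "xls";
        }
        if (header.length >= 2 && header[0] == 'P' && header[1] == 'K') {
            return "xlsx";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // First four bytes of an old-format workbook.
        System.out.println(sniff(new byte[]{(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0}));
    }
}
```

Because WorkbookFactory does this check for us, the parser shown later in this blog never needs to care which extension the input file has.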

I will provide you with everything you need to try this yourself.

I have used Apache POI 3.15 for my usecase in this blog. 

You can download Apache POI from the official Apache POI download page.




This code requires additional third-party Java libraries to be manually added to the environment.

STEP 1: Download the Apache POI binary distribution from the link above.
  • These scripts have been verified against version 3.15. They may be compatible with other versions, but this is not guaranteed.
  • Obtain the *.zip file for Windows or *.tar.gz file for Linux. Examples:
    • poi-bin-3.15-20160924.zip
    • poi-bin-3.15-20160924.tar.gz
STEP 2: Extract the following jars from the archive, copy them to the Hadoop lib directory, and add them to the build path in Eclipse when compiling the code. (To be on the safer side, I also copied every jar contained in the lib and ooxml-lib folders.)
    • poi-3.15.jar
    • poi-ooxml-3.15.jar
    • poi-ooxml-schemas-3.15.jar
    • poi-examples-3.15.jar
    • poi-excelant-3.15.jar
    • poi-scratchpad-3.15.jar
    • xmlbeans-2.6.0.jar (under the ooxml-lib/ folder)
    • every jar under the lib folder of poi-3.15
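The two steps above can be sketched as shell commands. The paths here are assumptions (adjust POI_HOME and HADOOP_LIB to your own installation); the jar names match the list above:

```shell
# Sketch: stage the POI jars where Hadoop (and Eclipse) can see them.
# POI_HOME and HADOOP_LIB are assumed locations -- change them for your setup.
POI_HOME=${POI_HOME:-$HOME/poi-3.15}
HADOOP_LIB=${HADOOP_LIB:-$HADOOP_HOME/lib}

if [ -d "$POI_HOME" ]; then
  cp "$POI_HOME"/poi-*.jar                      "$HADOOP_LIB"/
  cp "$POI_HOME"/ooxml-lib/xmlbeans-2.6.0.jar   "$HADOOP_LIB"/
  cp "$POI_HOME"/lib/*.jar                      "$HADOOP_LIB"/
else
  echo "POI_HOME not found: $POI_HOME -- extract the POI archive there first"
fi
```

If you compile in Eclipse, add the same jars to the project's build path; at runtime they must be visible to Hadoop, either via its lib directory as above or via -libjars.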

Now it's time to look at the Excel Input Format code. Although the only modification is in the Excel Parser, I am giving the entire code for ease of use:

EXCEL INPUT FORMAT 
(CUSTOM INPUT FORMAT TO READ ANY EXCEL FILES)

package com.poc;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new ExcelRecordReader();
    }
}


EXCEL RECORD READER 
(TO READ EXCEL FILE AND SEND AS KEY, VALUE FORMAT)


package com.poc;


import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ExcelRecordReader extends RecordReader<LongWritable, Text> {

    private LongWritable key;
    private Text value;
    private InputStream is;
    private String[] strArrayofLines;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        final Path file = split.getPath();

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        // Parse the whole workbook up front; each sheet row becomes one line.
        is = fileIn;
        String line = new ExcelParser().parseExcelData(is);
        this.strArrayofLines = line.split("\n");
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (key == null) {
            // First call: emit line 0.
            key = new LongWritable(0);
            value = new Text(strArrayofLines[0]);
        } else if (key.get() < (this.strArrayofLines.length - 1)) {
            // Subsequent calls: advance to the next parsed line.
            long pos = key.get();
            key.set(pos + 1);
            value.set(this.strArrayofLines[(int) (pos + 1)]);
        } else {
            return false;
        }
        return key != null && value != null;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
        if (is != null) {
            is.close();
        }
    }
}
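The reader above hands the mapper (line number, line) pairs until the parsed lines run out. That advancing logic can be exercised on its own with plain Java, no Hadoop required; LineWalkDemo below is a hypothetical stand-in for illustration, not part of the job:

```java
// Stdlib-only sketch of the nextKeyValue() walk: the reader emits
// (line number, line) pairs until the array of parsed lines is exhausted.
public class LineWalkDemo {

    static String walk(String parsed) {
        String[] lines = parsed.split("\n");
        StringBuilder out = new StringBuilder();
        long key = 0;                          // first call: key starts at line 0
        out.append(key).append(':').append(lines[0]).append(';');
        while (key < lines.length - 1) {       // same bound as the record reader
            key++;
            out.append(key).append(':').append(lines[(int) key]).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two tab-separated rows, as produced by the parser.
        System.out.println(walk("a\tb\nc\td"));
    }
}
```

Note that this input format is only safe when each Excel file lands in a single split, since initialize() parses the entire file from the beginning.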


EXCEL PARSER 
(TO PARSE ANY EXCEL SHEET)

Here I have modified the code to use a common method, WorkbookFactory, that can parse either kind of Excel sheet.


package com.poc;

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.EncryptedDocumentException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelParser {

    private static final Log LOG = LogFactory.getLog(ExcelParser.class);
    private StringBuilder currentString = new StringBuilder();
    private long bytesRead = 0;

    @SuppressWarnings("deprecation")
    public String parseExcelData(InputStream is) {
        try {
            // WorkbookFactory detects the format and returns an HSSFWorkbook
            // (.xls) or an XSSFWorkbook (.xlsx) behind the common interface.
            Workbook workbook = WorkbookFactory.create(is);

            // Take the first sheet from the workbook.
            Sheet sheet = workbook.getSheetAt(0);

            // Iterate through each row of the first sheet.
            Iterator<Row> rowIterator = sheet.iterator();
            currentString = new StringBuilder();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();

                // For each row, iterate through its cells.
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();

                    switch (cell.getCellType()) {
                    case Cell.CELL_TYPE_BOOLEAN:
                        bytesRead++;
                        currentString.append(cell.getBooleanCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_NUMERIC:
                        bytesRead++;
                        currentString.append(cell.getNumericCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_STRING:
                        bytesRead++;
                        currentString.append(cell.getStringCellValue() + "\t");
                        break;
                    }
                }
                currentString.append("\n");
            }
            is.close();
        } catch (IOException e) {
            LOG.error("IO Exception : File not found " + e);
        } catch (EncryptedDocumentException e) {
            LOG.error("The workbook is password-protected " + e);
        } catch (InvalidFormatException e) {
            LOG.error("The input is not a valid Excel file " + e);
        }
        return currentString.toString();
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
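One thing worth noting about the parser's output: each supported cell value is appended followed by a tab, and numeric cells always come back from getNumericCellValue() as doubles, so an integer cell 12 is emitted as 12.0. A stdlib-only sketch of that flattening (a hypothetical class, with plain Objects standing in for POI cells):

```java
// Stdlib-only sketch of how the parser flattens one row of cells into
// a tab-separated string. Boolean/Double/String mimic the three cell
// types the parser handles; anything else is skipped, as in the parser.
public class FlattenDemo {

    static String flattenRow(Object[] cells) {
        StringBuilder sb = new StringBuilder();
        for (Object c : cells) {
            if (c instanceof Boolean || c instanceof Double || c instanceof String) {
                sb.append(c).append('\t');   // mirrors currentString.append(value + "\t")
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A boolean cell, a numeric cell, and a string cell.
        System.out.println(flattenRow(new Object[]{true, 12.0, "foo"}));
    }
}
```

If you need integers to appear without the trailing .0, you would have to format the numeric value yourself before appending it; the code in this blog keeps POI's default double representation.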

EXCEL MAPPER
(HAVING SIMPLE MAPPER CODING JUST TO DISPLAY THE CONTENTS)

package com.poc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(new Text(""), value);
    }
}


EXCEL INPUT DRIVER 
(MAIN DRIVER CLASS)

package com.poc;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PocDriver {

    static public int count = 0;

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();

        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "Any_Excel_File");
        job.setJarByClass(PocDriver.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(ExcelInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(0);   // map-only job: mapper output goes straight to HDFS

        job.setMapperClass(TestMapper.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
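A typical way to submit the job is sketched below. The jar name and HDFS paths are assumptions, so adjust them to your setup; the driver class name comes from the code above:

```shell
# Sketch: submit the map-only Excel job. Jar name and HDFS paths are
# placeholders -- the output directory must not already exist.
JOB_CMD="hadoop jar excel-poc.jar com.poc.PocDriver /user/hduser/excel-input /user/hduser/excel-output"

if command -v hadoop >/dev/null 2>&1; then
  $JOB_CMD
else
  echo "hadoop not on PATH; on the cluster/client machine run: $JOB_CMD"
fi
```

Since the job runs with zero reducers, the part-m-* files in the output directory will contain the tab-separated rows exactly as the parser produced them.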

That's all, Friends...

Now get onto your computer, put on your thinking cap, take any Excel format and have fun working with this code. You can take examples from any of my previous blogs and plug in this Excel Input Format, Excel Record Reader and Excel Parser to check that it works.


Hope you all understood the procedure...
Please do notify me of any corrections...
Kindly leave a comment for any queries/clarifications...
I would like to thank Mr. Kalyan Rambariki for sharing
his valuable suggestion about checking the POI jar files.
(A detailed description of each phase will be added soon.)
ALL D BEST...
