Wednesday, January 11, 2017

HADOOP - EXCEL INPUT FORMAT TO READ ANY EXCEL FILE

Hello Friends,


After publishing my blogs on my POC for processing Excel files in the .xls format (the 97-2003 format), I received requests to provide a use case showing how to process the newer Excel format, i.e. .xlsx files (the 2007-13 format).

To all my readers, I hereby present this blog on reading both the old (.xls) and the new (.xlsx) Excel file formats. As in my previous blog, I have used the Excel Input Format inspired by https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/; here it has been modified to accept both kinds of Excel input using Apache POI.

Apache POI is a project to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java, and likewise MS Word and MS PowerPoint files. Apache POI is your Java Excel solution (for Excel 97-2013).

The old Excel input code required the HSSF classes to parse the spreadsheet, whereas the new Excel input code requires the XSSF classes. In my code I have used a common method that can parse either kind of Excel sheet.
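For the curious: the two formats are easy to tell apart mechanically. An .xls file is an OLE2 compound document and an .xlsx file is a ZIP package, each beginning with its own magic bytes, and this header sniffing is how POI's WorkbookFactory picks the right parser. A stdlib-only sketch of that detection (a hypothetical class for illustration, not POI's actual code):

```java
public class FormatSniffer {

    // OLE2 (.xls) files begin with D0 CF 11 E0; ZIP (.xlsx) files begin with 'P','K'.
    static String sniff(byte[] header) {
        if (header.length >= 4
                && (header[0] & 0xFF) == 0xD0 && (header[1] & 0xFF) == 0xCF
                && (header[2] & 0xFF) == 0x11 && (header[3] & 0xFF) == 0xE0) {
            return "xls";
        }
        if (header.length >= 2 && header[0] == 'P' && header[1] == 'K') {
            return "xlsx";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // First four bytes of an old-format workbook.
        System.out.println(sniff(new byte[]{(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0}));
    }
}
```

Because WorkbookFactory does this check for us, the parser shown later in this blog never needs to care which extension the input file has.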

I will provide you with everything you need to try this yourself.

I have used Apache POI 3.15 for my usecase in this blog. 

You can download Apache POI from the official Apache POI download page.




This code requires additional third-party Java libraries to be manually added to the environment.

STEP 1: Download the Apache POI binary distribution from the link above.
  • These scripts have been verified against version 3.15. They may be compatible with other versions, but this is not guaranteed.
  • Obtain the *.zip file for Windows or *.tar.gz file for Linux. Examples:
    • poi-bin-3.15-20160924.zip
    • poi-bin-3.15-20160924.tar.gz
STEP 2: Extract the following jars from the archive, copy them to the Hadoop lib directory, and add them to the build path in Eclipse when compiling the code. (To be on the safer side, I also copied every jar contained in the lib and ooxml-lib folders.)
    • poi-3.15.jar
    • poi-ooxml-3.15.jar
    • poi-ooxml-schemas-3.15.jar
    • poi-examples-3.15.jar
    • poi-excelant-3.15.jar
    • poi-scratchpad-3.15.jar
    • xmlbeans-2.6.0.jar (under the ooxml-lib/ folder)
    • every jar under the lib folder of poi-3.15
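The two steps above can be sketched as shell commands. The paths here are assumptions (adjust POI_HOME and HADOOP_LIB to your own installation); the jar names match the list above:

```shell
# Sketch: stage the POI jars where Hadoop (and Eclipse) can see them.
# POI_HOME and HADOOP_LIB are assumed locations -- change them for your setup.
POI_HOME=${POI_HOME:-$HOME/poi-3.15}
HADOOP_LIB=${HADOOP_LIB:-$HADOOP_HOME/lib}

if [ -d "$POI_HOME" ]; then
  cp "$POI_HOME"/poi-*.jar                      "$HADOOP_LIB"/
  cp "$POI_HOME"/ooxml-lib/xmlbeans-2.6.0.jar   "$HADOOP_LIB"/
  cp "$POI_HOME"/lib/*.jar                      "$HADOOP_LIB"/
else
  echo "POI_HOME not found: $POI_HOME -- extract the POI archive there first"
fi
```

If you compile in Eclipse, add the same jars to the project's build path; at runtime they must be visible to Hadoop, either via its lib directory as above or via -libjars.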

Now it's time to look at the Excel Input Format code. Although the only modification is in the Excel Parser, I am giving the entire code for ease of use:

EXCEL INPUT FORMAT 
(CUSTOM INPUT FORMAT TO READ ANY EXCEL FILES)

package com.poc;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new ExcelRecordReader();
    }
}


EXCEL RECORD READER 
(TO READ EXCEL FILE AND SEND AS KEY, VALUE FORMAT)


package com.poc;


import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ExcelRecordReader extends RecordReader<LongWritable, Text> {

    private LongWritable key;
    private Text value;
    private InputStream is;
    private String[] strArrayofLines;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        final Path file = split.getPath();

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        // Parse the whole workbook up front; each sheet row becomes one line.
        is = fileIn;
        String line = new ExcelParser().parseExcelData(is);
        this.strArrayofLines = line.split("\n");
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (key == null) {
            // First call: emit line 0.
            key = new LongWritable(0);
            value = new Text(strArrayofLines[0]);
        } else if (key.get() < (this.strArrayofLines.length - 1)) {
            // Subsequent calls: advance to the next parsed line.
            long pos = key.get();
            key.set(pos + 1);
            value.set(this.strArrayofLines[(int) (pos + 1)]);
        } else {
            return false;
        }
        return key != null && value != null;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
        if (is != null) {
            is.close();
        }
    }
}
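The reader above hands the mapper (line number, line) pairs until the parsed lines run out. That advancing logic can be exercised on its own with plain Java, no Hadoop required; LineWalkDemo below is a hypothetical stand-in for illustration, not part of the job:

```java
// Stdlib-only sketch of the nextKeyValue() walk: the reader emits
// (line number, line) pairs until the array of parsed lines is exhausted.
public class LineWalkDemo {

    static String walk(String parsed) {
        String[] lines = parsed.split("\n");
        StringBuilder out = new StringBuilder();
        long key = 0;                          // first call: key starts at line 0
        out.append(key).append(':').append(lines[0]).append(';');
        while (key < lines.length - 1) {       // same bound as the record reader
            key++;
            out.append(key).append(':').append(lines[(int) key]).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two tab-separated rows, as produced by the parser.
        System.out.println(walk("a\tb\nc\td"));
    }
}
```

Note that this input format is only safe when each Excel file lands in a single split, since initialize() parses the entire file from the beginning.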


EXCEL PARSER 
(TO PARSE ANY EXCEL SHEET)

Here I have modified the code to use a common method, WorkbookFactory, that can parse either kind of Excel sheet.


package com.poc;

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.EncryptedDocumentException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelParser {

    private static final Log LOG = LogFactory.getLog(ExcelParser.class);
    private StringBuilder currentString = new StringBuilder();
    private long bytesRead = 0;

    @SuppressWarnings("deprecation")
    public String parseExcelData(InputStream is) {
        try {
            // WorkbookFactory detects the format and returns an HSSFWorkbook
            // (.xls) or an XSSFWorkbook (.xlsx) behind the common interface.
            Workbook workbook = WorkbookFactory.create(is);

            // Take the first sheet from the workbook.
            Sheet sheet = workbook.getSheetAt(0);

            // Iterate through each row of the first sheet.
            Iterator<Row> rowIterator = sheet.iterator();
            currentString = new StringBuilder();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();

                // For each row, iterate through its cells.
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();

                    switch (cell.getCellType()) {
                    case Cell.CELL_TYPE_BOOLEAN:
                        bytesRead++;
                        currentString.append(cell.getBooleanCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_NUMERIC:
                        bytesRead++;
                        currentString.append(cell.getNumericCellValue() + "\t");
                        break;

                    case Cell.CELL_TYPE_STRING:
                        bytesRead++;
                        currentString.append(cell.getStringCellValue() + "\t");
                        break;
                    }
                }
                currentString.append("\n");
            }
            is.close();
        } catch (IOException e) {
            LOG.error("IO Exception : File not found " + e);
        } catch (EncryptedDocumentException e) {
            LOG.error("The workbook is password-protected " + e);
        } catch (InvalidFormatException e) {
            LOG.error("The input is not a valid Excel file " + e);
        }
        return currentString.toString();
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
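One thing worth noting about the parser's output: each supported cell value is appended followed by a tab, and numeric cells always come back from getNumericCellValue() as doubles, so an integer cell 12 is emitted as 12.0. A stdlib-only sketch of that flattening (a hypothetical class, with plain Objects standing in for POI cells):

```java
// Stdlib-only sketch of how the parser flattens one row of cells into
// a tab-separated string. Boolean/Double/String mimic the three cell
// types the parser handles; anything else is skipped, as in the parser.
public class FlattenDemo {

    static String flattenRow(Object[] cells) {
        StringBuilder sb = new StringBuilder();
        for (Object c : cells) {
            if (c instanceof Boolean || c instanceof Double || c instanceof String) {
                sb.append(c).append('\t');   // mirrors currentString.append(value + "\t")
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A boolean cell, a numeric cell, and a string cell.
        System.out.println(flattenRow(new Object[]{true, 12.0, "foo"}));
    }
}
```

If you need integers to appear without the trailing .0, you would have to format the numeric value yourself before appending it; the code in this blog keeps POI's default double representation.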

EXCEL MAPPER
(HAVING SIMPLE MAPPER CODING JUST TO DISPLAY THE CONTENTS)

package com.poc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TestMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(new Text(""), value);
    }
}


EXCEL INPUT DRIVER 
(MAIN DRIVER CLASS)

package com.poc;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PocDriver {

    static public int count = 0;

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();

        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = new Job(conf, "Any_Excel_File");
        job.setJarByClass(PocDriver.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(ExcelInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(0);   // map-only job: mapper output goes straight to HDFS

        job.setMapperClass(TestMapper.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
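A typical way to submit the job is sketched below. The jar name and HDFS paths are assumptions, so adjust them to your setup; the driver class name comes from the code above:

```shell
# Sketch: submit the map-only Excel job. Jar name and HDFS paths are
# placeholders -- the output directory must not already exist.
JOB_CMD="hadoop jar excel-poc.jar com.poc.PocDriver /user/hduser/excel-input /user/hduser/excel-output"

if command -v hadoop >/dev/null 2>&1; then
  $JOB_CMD
else
  echo "hadoop not on PATH; on the cluster/client machine run: $JOB_CMD"
fi
```

Since the job runs with zero reducers, the part-m-* files in the output directory will contain the tab-separated rows exactly as the parser produced them.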

That's all, Friends...

Now get onto your computer, put on your thinking cap, take any Excel format and have fun working with this code. You can take examples from any of my previous blogs and plug in this Excel Input Format, Excel Record Reader and Excel Parser to check that it works.


Hope you all understood the procedure...
Please do notify me of any corrections...
Kindly leave a comment for any queries/clarifications...
I would like to thank Mr. Kalyan Rambariki for sharing
his valuable suggestion about checking the POI jar files.
(A detailed description of each phase will be added soon.)
ALL D BEST...
