Sunday, March 12, 2017

A USECASE ON TRAVEL APP

Dear Friends,


Welcome Back....

Day by day I am learning different things, which I like to share with you all.
As a great person said, "Learning is a journey, not a destination".

I was a little busy the last few days, acquiring different know-how on Hadoop. One of the travel app providers asked me whether I could solve this usecase: find out where the demand for their services is, so that they can decide how to give offers/discounts to lure customers into using their services.

In this blog, I use this usecase to solve the problem and help the client take decisions for better business development. The same approach can be used to find the demand in a particular state or region. You can use this blog as a reference and modify it according to your needs. Hope you all enjoy solving this usecase.

Eclipse IDE :- Neon.1
Hadoop Version:- 1.2.1
Ubuntu Version:- 12.04 LTS
Jars Used:- Apache POI 3.15


Problem Statement:


The travel app agency has loads and loads of data about various trips made all over India. With the existing data they are unable to check where the demand is highest and where they should use offers or discounts to lure customers.

1. Find the areas/places with the highest and the lowest demand, so that appropriate offers and discounts can be given.

2. Remove duplicate entries and segregate customers according to area/place.

3. Use Graphical representation for decision making.

The data is in EXCEL format, for which I used a third-party jar from Apache (POI) to parse the excel data. (You can find the same in my previous blogs.) I have created a small list of data as an example for this usecase, which you can replace with a larger data set for a more precise workflow/understanding.

The data is in the following format:- (Download the data file HERE)




In our usecase the data is in excel format, which we will load into HDFS for storage. Using MR with a custom input format we will extract the required data, then load it into a HIVE table, segregated by region, and finally run a count query to find the regions with the highest and lowest customer demand. Tableau is then used to represent the result graphically.

I kept this usecase as simple as I could. You can try different variations on the same to get the desired result. For example, in this usecase I was also asked to map the output along with the driver ID to find the best-performing driver.


1. LOAD THE DATA IN HDFS:-


Our first step is to load the data into HDFS; I hope you all know how to do that by now. Still, as part of my blog, I will keep it here.

To load the data into HDFS, give the below commands:-

Command > hadoop fs -mkdir /Input               (To make a directory)

Command > hadoop fs -put Trip.xlsx /Input               (To put the file in HDFS)
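
To confirm the file reached HDFS, you can list the directory (a quick sanity check using the same paths as above):

Command > hadoop fs -ls /Input               (To verify the file is present)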




2.     MAP REDUCE CODES:-


(DRIVER CLASS)

package com.poc.trip;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PocDriver {

static public int count = 0;

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();

GenericOptionsParser parser = new GenericOptionsParser(conf, args);
args = parser.getRemainingArgs();

Job job = new Job(conf, "Trip_Log");
job.setJarByClass(PocDriver.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

// Custom input format (Apache POI based) that feeds each Excel row to the mapper as a tab-separated line
job.setInputFormatClass(ExcelInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// LazyOutputFormat avoids creating empty part files, since all output is written through MultipleOutputs
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setNumReduceTasks(0); // map-only job: all the work happens in the mapper

job.setMapperClass(MyMap.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}
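
The driver references ExcelInputFormat, which I have not reproduced here (the actual class comes from my earlier blog on parsing Excel with the Apache POI jar). For readers who do not have it handy, below is a minimal sketch of what such an input format could look like; the class and member names (ExcelRecordReader, rows, current) are placeholders of my own, not the exact code used in this usecase. It reads the whole .xlsx file in one split and turns every row of the first sheet into a tab-separated line, which is what MyMap expects.

package com.poc.trip;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

@Override
protected boolean isSplitable(JobContext context, Path filename) {
// An .xlsx file cannot be split on byte boundaries, so read it as a whole
return false;
}

@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
return new ExcelRecordReader();
}

public static class ExcelRecordReader extends RecordReader<LongWritable, Text> {

private List<String> rows = new ArrayList<String>();
private int current = -1;
private LongWritable key = new LongWritable();
private Text value = new Text();

@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
Path path = ((FileSplit) split).getPath();
InputStream in = path.getFileSystem(context.getConfiguration()).open(path);
XSSFWorkbook workbook = new XSSFWorkbook(in);
DataFormatter formatter = new DataFormatter();
Sheet sheet = workbook.getSheetAt(0);
// Convert every row of the first sheet into one tab-separated line
for (Row row : sheet) {
StringBuilder sb = new StringBuilder();
for (Cell cell : row) {
if (sb.length() > 0)
sb.append("\t");
sb.append(formatter.formatCellValue(cell));
}
rows.add(sb.toString());
}
workbook.close();
in.close();
}

@Override
public boolean nextKeyValue() {
current++;
if (current >= rows.size())
return false;
key.set(current);
value.set(rows.get(current));
return true;
}

@Override
public LongWritable getCurrentKey() { return key; }

@Override
public Text getCurrentValue() { return value; }

@Override
public float getProgress() {
return rows.isEmpty() ? 1.0f : (float) (current + 1) / rows.size();
}

@Override
public void close() {
}
}
}

If your excel sheet has a header row, you may want to skip row 0 in this reader or filter it out in the mapper; otherwise it will simply land in the "Other" bucket.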

MY MAPPER 
(HAVING MAPPER LOGIC)

package com.poc.trip;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyMap extends Mapper<LongWritable, Text, Text, Text> {
// MultipleOutputs lets us write each record to a state-specific file under /TripData
MultipleOutputs<Text, Text> mos;

@Override
public void setup(Context context) {
mos = new MultipleOutputs<Text, Text>(context);
}

@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Each value is one Excel row, delivered by ExcelInputFormat as a tab-separated line
String[] str1 = value.toString().split("\t");
// Keep only the required columns (0, 1, 3, 5, 6 and 7); column 3 holds the place/state
String sr1 = str1[0] + "\t" + str1[1] + "\t" + str1[3] + "\t" + str1[5] + "\t" + str1[6] + "\t" + str1[7];

if (str1[3].contains("Bihar")) {
mos.write(new Text("Bihar"), new Text(sr1), ("/TripData/Bihar"));
} else if (str1[3].contains("Pondicherry")) {
mos.write(new Text("Pondicherry"), new Text(sr1), ("/TripData/Pondicherry"));

} else if (str1[3].contains("Uttarakhand")) {
mos.write(new Text("Uttarakhand"), new Text(sr1), "/TripData/Uttarakhand");

} else if (str1[3].contains("Chhattisgarh")) {
mos.write(new Text("Chhattisgarh"), new Text(sr1), "/TripData/Chhattisgarh");

} else if (str1[3].contains("Goa")) {
mos.write(new Text("Goa"), new Text(sr1), "/TripData/Goa");

} else if (str1[3].contains("Assam")) {
mos.write(new Text("Assam"), new Text(sr1), "/TripData/Assam");

} else if (str1[3].contains("Himachal_Pradesh")) {
mos.write(new Text("Himachal_Pradesh"), new Text(sr1), "/TripData/Himachal_Pradesh");

} else {
mos.write(new Text("Other"), new Text(sr1), "/TripData/Other");

}
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
}
}

We are not using a reducer, as the entire work is done in the map phase itself; there is no reduce work needed.

3. EXECUTING THE MAP REDUCE CODE


Run the jar file with the below command:-

Command >  hadoop jar trip.jar com.poc.trip.PocDriver /Input/Trip.xlsx /Tripout
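
Once the job finishes, you can check the per-state output under /TripData. The exact file names come from MultipleOutputs and typically look like Bihar-m-00000; adjust the names if your layout differs:

Command > hadoop fs -ls /TripData

Command > hadoop fs -cat /TripData/Bihar-m-00000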





4. Create HIVE table and upload Processed data:-

In this blog I will execute the HIVE commands directly from the terminal instead of going inside the HIVE shell. Each command takes a little longer this way, but I am doing it to show a different approach. (You can also go inside the HIVE shell and run the same commands.)

First, create a database and an external table in it to hold the MR output data:-

You can achieve the same with the following command:-

Command >  hive -e 'create database Trip; use Trip;'



Create the table in a script file and save it as a .sql file:

Follow the below commands for creation of the table.

Command > nano hivetable.sql

Script        > create external table Trip (state string, sid int, tid int, address string, trip int)
                  > row format delimited
                  > fields terminated by '\t'
                  > stored as textfile location '/TripData';



Now run the script file using the following command.

Command > hive -f hivetable.sql 



This will create an external table over the data already present in /TripData, so no separate load step is needed.
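
To verify that the table was created and is reading the MR output, you can describe it and peek at a few rows (assuming the table landed in the default database, as in the script above):

Command > hive -e 'describe Trip;'

Command > hive -e 'select * from Trip limit 5;'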

Now it's time to query the data and count the number of trips made in each state. Follow the below command for the same.

Command > hive -e 'Select state, count(*) from trip group by state;' > /home/gopal/Desktop/Tripcount.txt

This will create a text file and store the output in it in tab-delimited format.
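
You can view the redirected output directly on the local machine:

Command > cat /home/gopal/Desktop/Tripcount.txt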




To get the output in the desired delimited format (comma-separated here), we can use a script. Follow the below commands:

Command > nano hiveout.sql

Script > INSERT OVERWRITE LOCAL DIRECTORY '/home/gopal/Desktop/Tripcount'
           > ROW FORMAT DELIMITED
           > FIELDS TERMINATED BY ','
           > select state, count(*) from trip group by state;


Now run the script file to get the desired output with the below command.

Command > hive -f hiveout.sql
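
The comma-separated result lands as one or more files inside the local directory /home/gopal/Desktop/Tripcount (Hive usually names them 000000_0 and so on). You can check the contents with:

Command > cat /home/gopal/Desktop/Tripcount/000000_0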




Now use the Tableau software to give a graphical representation of the result.

I have used Tableau and exported the results into a PDF file.



Hope you all understood the procedures... 
Please do notify me for any corrections...
Kindly leave a comment for any queries/clarification...

ALL D BEST...
