Dear Friends,
We know that Hadoop's HIVE component is very good for structured data processing.
Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. This includes defining what fields of data will be stored and how that data will be stored: data type (numeric, currency, alphabetic, name, date, address).
Structured data has the advantage of being easily entered, stored, queried and analyzed. At one time, because of the high cost and performance limitations of storage, memory and processing, relational databases and spreadsheets using structured data were the only way to effectively manage data. Anything that couldn't fit into a tightly organized structure would have to be stored on paper in a filing cabinet.
Hadoop has an advantage over this by using its component HIVE. Though HDFS solved the storing part but MR code writing & processing is bit lengthy. So, Hive is used for processing large amount of structured data more conveniently then any other component. Even though it uses MR as internal process the query writing is much simpler and easier comparing that with MR.
This is the use-case given to me by one of my friend and asked me not to use MR programming in this use-case.
Seriously, working with MR I am used to think like that but this POC made me to think in different way and I am glad that I solved it with using MR Programming. (Though MR is the background internal process happens in all Hadoop's components).
USE-CASE DEFINITION
Medicare Companies (Medical Insurance) tie up with different Hospitals for their customers to have cashless/problem free health checkups. Patients on recovery from certain heath issue don't go for checkup again (In fear of Cost/Travel/Time). They may no longer visit a hospital, or consult doctors altogether till they have some health issues again.
But a Medicare Companies shouldn’t give up on their customers without a fair try. Whether your customers have gone to the competition or just gone silent, they are worth your time and effort to revert them again with excellent proposals/policies.
There are number of Medicare-card holders who visit different Hospitals regularly. If some medicare-card holder doesn’t visit a particular Hospital for a period of time, that Hospital consider them as PASSIVE customer. Same way, if some Medicare-care holder visit a Hospital for first time, that customer is called ACTIVE. The problem is to find out ACTIVE & PASSIVE Mdicare-Card Holder for every Hospital from the data provided, on monthly basis to know how their business are performing ?
This will create a next steps for a luring that medicare-card holder who turned passive over a period of time by giving them offers, discounts etc.
Note # Passive period varies from Hospitals , For APPOLO, it may be 90 days, but for small Hospitals like DAVITA, DIALYSIS Care it may be 30 days.
PROBLEM STATEMENT
1. If PASSIVE period is 31 days & application is executing on 1 APRIL 2014 (New Financial Year) , Then the count of the Passive medicare-care holder that occurred as of 31st December 2013. It implies that the records to be used as quarterly basis.
2. If a Medicare-Card Holder is appearing first time in a Hospital or it has not appeared within period of 90 Days in same Hospital then it will be considered as new customer.
3. Output Report should contain daily Reports , Monthly Reports as well as Quarterly Reports.
4. Data to be used :
PASSIVE.info : Information about Passive period of a Hospital
HealthCare.csv: Total Transactions of the Hospitals WRT Medicare-Card Holder for the period of 2 years.
You can download the sample data used in this use-case from HERE.
USE_CASE SOLUTION ARCHITECTURE
In this Use-Case we will first sort the number of Hospitals for whom we need to find Active & Passive customers, by using HIVE partition table. Then using CTAS command in HIVE we will further divide each Hospital into Active & Passive customers based on their last date of activity and keep them in a separate table.
After the above is done now based on our requirement we can put any query such as JOIN to find only Active/Passive customers or both, quarterly data,etc.
1. LOAD THE DATA IN HDFS:-
Our first step is to load the data in HDFS, Hope you all know that by this time. Still as a part of my Blog I will keep it.
To load the data in HDFS give the below command:-
Command > hadoop fs -mkdir /Health (To make a directory)
Command > hadoop fs -put HealthCare.csv /Health (To put the file in HDFS)
2. CREATE A DATABASE AND EXTERNAL TABLE TO KEEP THE DATA:-
Now as we have our data in HDFS and we need to work in HIVE, we have to create a database for the same. The benefit of creating an external table is that we don't have to copy the entire data to Hive's warehouse but can point to the location where the actual data is residing. To that if we delete the external table only table schema information will be lost not the actual data.
Below is the command for creating the database and external table:
Command > create database HealthCare;
Command > create external table health
> (cid int,card int,age int,disease string, hospital string,date date,address string,unit int)
> row format delimited
> fields terminated by ','> stored as textfile location '/Health'
> tblproperties ("skip.header.line.count"="1"); (Used to remove head line)
Check that all data is showing or not by below command.
Command > select * from health;
(If everything is fine then you will see all results)
3. CREATE PARTITION TABLE FOR KEEPING EACH HOSPITAL'S RECORDS:-
Now we will create an external partition table to diving the entire records into individual hospitals and keep each records separately by below command:
Command > create external table HealthPart
> (cid int,card int,age int,disease string, phospital string,date date,address string,unit int) > PARTITIONED BY (hospital string) > row format delimited
> fields terminated by ','
> stored as textfile location '/PartHealth';
After creating the table insert data from Health table and it will automatically divide it into Hospital names. Below is the command for inserting data from Health table:
Command > insert overwrite table healthpart PARTITION (hospital)
> select cid,card,age,disease,hospital,date,address,unit,hospital from health;
4. CREATE TABLE FROM PARTITION TABLE FOR ACTIVE & PASSIVE CUSTOMERS FOR EACH HOSPITAL:-
Now we have to create two separate tables containing Active & Passive customers respectively. This can be achieved by CTAS(Create Table As Select), creating a table by selecting columns from partition table. Below is the command for achieving the same:-
Command > create table newdavita as select cid, card, age, disease, phospital, date, address, unit
> from healthpart
> where hospital='DAVITA'
> and date>date_sub('2014-04-01',30) and date<'2014-04-01';
date_sub('yyyy-mm-dd',no. of days):- Is used for taking date with difference as no. of days. (For Davita its 30 and Appolo its 90, so on..)
(This command will take selected columns from partition table based on the date for Active customers)
Command > create table olddavita as select cid, card, age, disease, phospital, date, address, unit
> from healthpart
> where hospital='DAVITA'
> and date<date_sub('2014-04-01',30) and date>'2014-01-01';
(This command will take selected columns from partition table based on the date for Passive customers). Here 30 is used as no. of days for Passiveness.
NOTE:- Do the same for each and every Hospital. You will get each Hospital's Active & Passive Customers.
(As APPOLO is having 90 days for passiveness all within that date range will be Active customer for that quarter and no Passive customer will be there.)
Check with command > select * from olddavita; to confirm about the data with date range.
NOTE:- You can workout with different date range according to the requirement, for month & year also.
5. REMOVE DUPLICATE ENTRIES FROM NEW & OLD TABLE:-
You have achieved the Active and Passive customer data from each Part-Hospital table. But the fact may be there a member may have come before Passive time difference and after passive time difference. So many Member Card holder will be registered in both New & Old table.
We will only consider the Member coming in Active data and remove them from Passive data for data overwrite or duplicate's. This can be achieve through JOIN Command, which you can find below:-
For Getting Active Data without overlapping from Passive Data:-
Command > select a.cid,a.card,a.date,a.phospital from newdavita a left outer join olddavita b
> on a.card=b.card WHERE a.card IS NULL;
(Here we are taking everything on left hand side ie; Active data and joining with right hand side ie; Passive data and using null for removing the same/duplicate data entries from Active data.)
OR Using semi joinCommand > select a.cid,a.card,a.date,a.hospital from newdavita a left semi join olddavita b
> on a.card=b.card;
> on a.card=b.card WHERE a.card IS NULL;
(Here we are taking everything on left hand side ie; Active data and joining with right hand side ie; Passive data and using null for removing the same/duplicate data entries from Passive data.)
NOTE:- For my use-case it didn't worked as both data contained same members. You can try to make some changes and try.
After removing duplicates/overlapping entries from Active & Passive data. Use group by option to group the same Member card number for display. Below is the command for the same.
Command > select card,max(date) from newdavita group by card;
This will remove repeating card number and return only 1 card number for each entries.
6. CONSOLIDATE EVERY NEW TABLE & OLD TABLE TO GET ACTIVE & PASSIVE CUSTOMERS BY CTAS & UNION ALL:-
Now, Once we have all Active & Passive customers for each Hospital, consolidate it to get Active & Passive Customers for the Quarter.
Below is the command to join all Active customer data by UNION ALL and keep it in a table for further queries/analysis:-
Command > create table active_customer as select * from newdavita
> union all select * from newappolo;
NOTE:- You can join multiple table using Union All to keep in one table like for eg; CTAS select * from new1 union all select * from new2 union all select * from new3 union all select * from new4 and so on....
STORE DATA IN HDFS:-
If you want to save the Active/Passive Customer Union data then below is the command:-
Command > insert overwrite directory '/NewCustomer'
> select * from newdavita union all
> select * from newappolo;
STORE DATA IN LFS:-
1. From Hive shell:-
If you want to keep it in .csv format give the below command:-
Command > insert overwrite local directory '/home/gopal/Desktop/NewCustomer'
> row format delimited
> fields terminated by ','
> select * from active_customer;
2. From Terminal:-
If you want to keep it in .tsv format follow the below procedure:-
2. Open a new terminal:-
Goto the directory where hive is running/Metastore is present and give the below command:-
Command > hive -e 'use healthcare; select * from newdavita union all select * from newappolo' > /home/gopal/Desktop/ActiveCustomer.tsv
Now go on and try different aspects of query to get variable results like monthly active and passive customers, count of each hospital customers, etc.
Hope you all understood the procedures...
Please do notify me for any corrections...
Kindly leave a comment for any queries/clarification...
ALL D BEST...