Data Wrangling
The process of turning raw data into a more appropriate format. In the end, the result should have acceptable data quality.
Contents included in this notebook:
- Data Exploration
- Data Cleaning
- Merging Datasets
This notebook will also introduce you to pandas, which facilitates these manipulation tasks.
Datasets:
from os.path import join
import numpy as np
import pandas as pd
%pdb 1
Automatic pdb calling has been turned ON
- Samsung after-sales service in Iran
Data Exploration
The purpose of this part is to gain a better understanding of the data. What is the data about? What is its structure? What values does it contain? …
Let’s read the Samsung after-sales service dataset into a DataFrame.
service_df = pd.read_csv(join('Datasets', 'service', 'samdata.csv'))
service_df
No | Service_type | Cost_Type | Product_Group | Serial_No | City | Product_Date | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Normal Warranty | AUDIO - HOME THEAT | 2F/LAHFZ7E-IP-X | Tehran | NaN | 2/5/2012 | 2/5/2012 | 1/21/2013 | 0 | 351 | 0 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
1 | 2 | 1 | Normal Warranty | REFRIGERATOR-SBS | AIP244AACA/4E7 | Tehran | NaN | 7/7/2012 | 9/10/2012 | 1/31/2013 | 65 | 208 | 1 | SHORT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 3437481.60 | 687486.320 | 2748885.280 |
2 | 3 | 1 | Normal Warranty | REFRIGERATOR-SBS | JK-C4BBCAEKL72 | Tehran | NaN | 7/7/2012 | 8/15/2012 | 1/31/2013 | 38 | 208 | 1 | OPEN | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 4228756.56 | 845751.312 | 3383005.248 |
3 | 4 | 0 | Out of Warranty | MONITOR - TFT LCD | IP2JO4/-A-47 | Tehran | NaN | 8/7/2012 | 8/7/2012 | 1/13/2013 | 0 | 158 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
4 | 5 | 0 | Out of Warranty | AUDIO - MP3 Player | AU/BCQE2A/P | Tehran | NaN | 8/7/2012 | 8/9/2012 | 1/7/2013 | 2 | 153 | 1 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Medium | Tech07 | 120000 | 765488.60 | 177087.820 | 708381.680 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3131 | 3132 | 0 | Normal Warranty | LCD-40 | IP2JRIP/QE2KLKL | REZVANSHAHR | NaN | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420000 | 0.00 | 84000.000 | 336000.000 |
3132 | 3133 | 0 | Normal Warranty | LED46 | IP2JRC/&/2/IP | ESFAHAN | NaN | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420000 | 0.00 | 84000.000 | 336000.000 |
3133 | 3134 | 0 | Extended Warranty | HARD DISK DRIVE | S2/BJ2/B2KL4Z2 | Tehran | 40552.0 | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Minimum | Tech07 | 50000 | 0.00 | 10000.000 | 40000.000 |
3134 | 3135 | 0 | Out of Warranty | HARD DISK DRIVE | S2/BJ2/SC2/QE2 | Tehran | 38825.0 | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Free Of Charge | Tech07 | 0 | 0.00 | 0.000 | 0.000 |
3135 | 3136 | 0 | Out of Warranty | DVD PLAYER/ RECORD | A&QEKLCDQAEA7AZ | Tehran | 38448.0 | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | MECHANICAL PARTS REPLACEMENT | Minor | Tech18 | 150000 | 1488482.00 | 328686.400 | 1318785.600 |
3136 rows × 22 columns
shape
attribute indicates how large the resulting DataFrame is.
service_df.shape
(3136, 22)
print("pandas only displays " + str(pd.options.display.max_rows) + " rows" )
print("and " + str(pd.options.display.max_columns) + " columns at max by default" )
pandas only displays 60 rows
and None columns at max by default
The dataset is not that big, and besides, it only has 2 more columns than the pandas maximum of displayed columns. So let’s tell pandas to show all columns for convenience.
pd.set_option('display.max_columns', None)
Use loc[]
and iloc[]
to select rows in a specific range: loc[] selects by index label, iloc[] by integer position.
service_df.iloc[26:33:3]
No | Service_type | Cost_Type | Product_Group | Serial_No | City | Product_Date | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26 | 27 | 0 | Normal Warranty | DIGITAL CAMERA | AAKLMCN/CIPE/GT | Tehran | NaN | 11/22/2012 | 11/22/2012 | 1/26/2013 | 0 | 65 | 1 | DEFORMED | OUTCASE CRACK | ELECTRICAL PARTS REPLACEMENT | Major | Tech11 | 250000 | 0.0 | 50000.0 | 200000.0 |
29 | 30 | 0 | Normal Warranty | TFT18 | H2MQ2/2/472 | Tehran | NaN | 12/1/2012 | 12/1/2012 | 1/6/2013 | 0 | 36 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech16 | 0 | 0.0 | 0.0 | 0.0 |
32 | 33 | 0 | Normal Warranty | TFT18 | H2MQ2/AE27 | Tehran | NaN | 12/2/2012 | 12/2/2012 | 1/6/2013 | 0 | 35 | 1 | WORN-OUT | NO POWER | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech16 | 0 | 0.0 | 0.0 | 0.0 |
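The difference between the two matters once the index is not the default 0…n-1 range. A minimal sketch on toy data (not the service dataset):

```python
import pandas as pd

# a tiny frame with non-default integer index labels -- illustration only
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[101, 102, 103, 104])

by_position = df.iloc[0:2]   # integer positions; the end is exclusive
by_label = df.loc[101:102]   # index labels; the end is INCLUSIVE

print(by_position.equals(by_label))  # True -- both pick the first two rows
```

Note the inclusive end of loc[] slicing, which often surprises people coming from plain Python slicing.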
sample()
returns random rows to look at.
service_df.sample(13)
No | Service_type | Cost_Type | Product_Group | Serial_No | City | Product_Date | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
130 | 131 | 0 | Normal Warranty | LED40 | IP2JQC/KLIP2IPIP | Tehran | NaN | 12/18/2012 | 12/18/2012 | 1/30/2013 | 0 | 43 | 1 | WORN-OUT | NO PICTURE/NO RASTER | SET EXCHANGE | Free Of Charge | Tech23 | 0 | 0.00 | 0.000 | 0.000000e+00 |
1807 | 1808 | 0 | Out of Warranty | MONITOR - TFT LCD | H2MPA/2KL74/ | Tehran | 38083.0 | 1/13/2013 | 1/14/2013 | 1/23/2013 | 1 | 10 | 0 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech16 | 0 | 0.00 | 0.000 | 0.000000e+00 |
2971 | 2972 | 1 | Normal Warranty | LED40 | IP2JQA/&427A | Tehran | NaN | 1/27/2013 | 1/27/2013 | 1/28/2013 | 0 | 1 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech15 | 420000 | 14086824.28 | 2803384.856 | 1.161354e+07 |
3043 | 3044 | 0 | Out of Warranty | DIGITAL CAMERA | AAKLMCN/CAE/LAJ | Tehran | 40808.0 | 1/28/2013 | 1/28/2013 | 1/30/2013 | 0 | 2 | 1 | DEFORMED | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech11 | 250000 | 584648.70 | 166828.740 | 6.677189e+05 |
2331 | 2332 | 0 | Normal Warranty | LCD-40 | IP2JRA/72A/KL | Tehran | NaN | 1/20/2013 | 1/20/2013 | 1/21/2013 | 0 | 1 | 1 | WORN-OUT | OTHER PROBLEMS | SET EXCHANGE | Free Of Charge | Tech14 | 0 | 0.00 | 0.000 | 0.000000e+00 |
101 | 102 | 0 | Out of Warranty | MONITOR - TFT LCD | H2MM2/A/4IPKL | Tehran | 38081.0 | 12/16/2012 | 12/17/2012 | 1/5/2013 | 1 | 20 | 0 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech16 | 0 | 0.00 | 0.000 | 0.000000e+00 |
3096 | 3097 | 0 | Normal Warranty | LED40 | IP2JQA/&KL/4/ | Tehran | NaN | 1/28/2013 | 1/28/2013 | 1/31/2013 | 0 | 3 | 1 | WORN-OUT | NO PICTURE/NO RASTER | SET EXCHANGE | Free Of Charge | Tech23 | 0 | 0.00 | 0.000 | 0.000000e+00 |
2181 | 2182 | 0 | Normal Warranty | DVD PLAYER | ZT2QARDBC/2&IP2Z | Tehran | NaN | 1/19/2013 | 1/19/2013 | 1/20/2013 | 0 | 1 | 1 | WORN-OUT | READ ERROR | ELECTRICAL PARTS REPLACEMENT | Major | Tech18 | 170000 | 1840063.68 | 402012.736 | 1.608051e+06 |
2141 | 2142 | 0 | Normal Warranty | LED40 | IP2JQA/&QE4/ | Tehran | NaN | 1/19/2013 | 1/19/2013 | 1/21/2013 | 0 | 2 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Minor | Tech14 | 215000 | 0.00 | 43000.000 | 1.720000e+05 |
935 | 936 | 0 | Normal Warranty | LCD-40 | IP2JRA/7/7&IP | Tehran | NaN | 1/5/2013 | 1/5/2013 | 1/6/2013 | 0 | 1 | 1 | WORN-OUT | OTHER PROBLEMS | SET EXCHANGE | Free Of Charge | Tech08 | 0 | 0.00 | 0.000 | 0.000000e+00 |
1698 | 1699 | 0 | Normal Warranty | LCD-40 | IP2JQA/&2772 | Tehran | NaN | 1/10/2013 | 1/19/2013 | 1/20/2013 | 8 | 10 | 1 | WORN-OUT | OTHER PROBLEMS | SET EXCHANGE | Free Of Charge | Tech13 | 0 | 0.00 | 0.000 | 0.000000e+00 |
343 | 344 | 1 | Normal Warranty | LED40 | IP2JR2/-2/IP4 | Tehran | NaN | 12/27/2012 | 12/31/2012 | 1/1/2013 | 4 | 5 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech10 | 420000 | 24887810.32 | 5063582.064 | 2.025433e+07 |
1875 | 1876 | 1 | Normal Warranty | LCD- 46 | IP2JQ7/42/42 | Tehran | NaN | 1/14/2013 | 1/17/2013 | 1/20/2013 | 3 | 6 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000000e+00 |
The dataset itself has a column of index values. Let’s make sure it’s really unique for each row; then we can consider the No
column as the index.
service_df['No'].is_unique
True
service_df.set_index("No", inplace=True)
head(n)
returns first n rows.
tail(n)
returns last n rows.
The default value of n is 5.
service_df.head()
Service_type | Cost_Type | Product_Group | Serial_No | City | Product_Date | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||||
1 | 0 | Normal Warranty | AUDIO - HOME THEAT | 2F/LAHFZ7E-IP-X | Tehran | NaN | 2/5/2012 | 2/5/2012 | 1/21/2013 | 0 | 351 | 0 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
2 | 1 | Normal Warranty | REFRIGERATOR-SBS | AIP244AACA/4E7 | Tehran | NaN | 7/7/2012 | 9/10/2012 | 1/31/2013 | 65 | 208 | 1 | SHORT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 3437481.60 | 687486.320 | 2748885.280 |
3 | 1 | Normal Warranty | REFRIGERATOR-SBS | JK-C4BBCAEKL72 | Tehran | NaN | 7/7/2012 | 8/15/2012 | 1/31/2013 | 38 | 208 | 1 | OPEN | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 4228756.56 | 845751.312 | 3383005.248 |
4 | 0 | Out of Warranty | MONITOR - TFT LCD | IP2JO4/-A-47 | Tehran | NaN | 8/7/2012 | 8/7/2012 | 1/13/2013 | 0 | 158 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
5 | 0 | Out of Warranty | AUDIO - MP3 Player | AU/BCQE2A/P | Tehran | NaN | 8/7/2012 | 8/9/2012 | 1/7/2013 | 2 | 153 | 1 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Medium | Tech07 | 120000 | 765488.60 | 177087.820 | 708381.680 |
columns
attribute returns all column names.
service_df.columns
Index(['Service_type', 'Cost_Type', 'Product_Group', 'Serial_No', 'City',
'Product_Date', 'Receipt_Date', 'Appoint_Date', 'Complete_Date',
'TAT01', 'TAT02', 'Job_Satus', 'Defect_Des', 'Symptom_Desc',
'Repair_Action_Desc', 'Labor_Charge_Desc', 'Engineer',
'Labor_Charge_Amount', 'Parts_Amount', 'Discount_Amount',
'Total_Invoice_Amount'],
dtype='object')
dtypes
shows each column’s data type.
service_df.dtypes
Service_type int64
Cost_Type object
Product_Group object
Serial_No object
City object
Product_Date float64
Receipt_Date object
Appoint_Date object
Complete_Date object
TAT01 int64
TAT02 int64
Job_Satus int64
Defect_Des object
Symptom_Desc object
Repair_Action_Desc object
Labor_Charge_Desc object
Engineer object
Labor_Charge_Amount int64
Parts_Amount float64
Discount_Amount float64
Total_Invoice_Amount float64
dtype: object
info()
function summarizes the dataset.
service_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3136 entries, 1 to 3136
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Service_type 3136 non-null int64
1 Cost_Type 3136 non-null object
2 Product_Group 3126 non-null object
3 Serial_No 3127 non-null object
4 City 3136 non-null object
5 Product_Date 724 non-null float64
6 Receipt_Date 3136 non-null object
7 Appoint_Date 3136 non-null object
8 Complete_Date 3136 non-null object
9 TAT01 3136 non-null int64
10 TAT02 3136 non-null int64
11 Job_Satus 3136 non-null int64
12 Defect_Des 3136 non-null object
13 Symptom_Desc 3136 non-null object
14 Repair_Action_Desc 3136 non-null object
15 Labor_Charge_Desc 3136 non-null object
16 Engineer 3136 non-null object
17 Labor_Charge_Amount 3136 non-null int64
18 Parts_Amount 3136 non-null float64
19 Discount_Amount 3136 non-null float64
20 Total_Invoice_Amount 3136 non-null float64
dtypes: float64(4), int64(5), object(12)
memory usage: 539.0+ KB
For numerical columns, the describe()
function shows statistical summaries of each column.
service_df.describe()
Service_type | Product_Date | TAT01 | TAT02 | Job_Satus | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|
count | 3136.000000 | 724.000000 | 3136.000000 | 3136.000000 | 3136.000000 | 3136.000000 | 3.136000e+03 | 3.136000e+03 | 3.136000e+03 |
mean | 0.105230 | 39227.758287 | 1.289222 | 6.647959 | 0.977360 | 129877.232143 | 2.303481e+06 | 4.844214e+05 | 1.924006e+06 |
std | 0.306898 | 1152.746282 | 5.060422 | 14.012700 | 0.148778 | 176843.065376 | 6.682868e+06 | 1.349372e+06 | 5.400547e+06 |
min | 0.000000 | 33870.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
25% | 0.000000 | 38451.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
50% | 0.000000 | 38823.000000 | 0.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
75% | 0.000000 | 40188.000000 | 1.000000 | 6.000000 | 1.000000 | 250000.000000 | 7.510464e+04 | 8.738318e+04 | 3.827746e+05 |
max | 1.000000 | 40813.000000 | 81.000000 | 351.000000 | 1.000000 | 540000.000000 | 4.480716e+07 | 8.088432e+06 | 3.635773e+07 |
Columns can be accessed as attributes of the DataFrame. Each column is, in fact, a Series. We can count the occurrences of each distinct value with the value_counts()
function.
service_df.Product_Group.value_counts()
LED40 1008
LCD-40 529
DIGITAL CAMERA 282
LED46 250
PRINTER - MONO LAS 151
DVD PLAYER 119
AUDIO - MP3 Player 109
HARD DISK DRIVE 102
AUDIO - HOME THEAT 102
REFRIGERATOR-SBS 68
DVD PLAYER/ RECORD 65
PLASMA-42 62
CAMCORDER 57
PLASMA-43 49
PRINTER - COLOR LA 35
MONITOR - TFT LCD 24
VACUUM CLEANER 24
TFT18 18
WASHING MACHINE 16
PLASMA-51 15
LED55 15
OVEN - MICROWAVE 11
LCD- 46 8
PLASMA-50 4
AIR CONDITIONER 3
Name: Product_Group, dtype: int64
We can also group rows by columns and get each group’s size.
service_df.groupby(by =["Product_Group", "Cost_Type"]).size()
Product_Group Cost_Type
AIR CONDITIONER Normal Warranty 2
Out of Warranty 1
AUDIO - HOME THEAT Normal Warranty 3
Out of Warranty 99
AUDIO - MP3 Player Normal Warranty 28
Out of Warranty 81
CAMCORDER Normal Warranty 5
Out of Warranty 52
DIGITAL CAMERA Normal Warranty 149
Out of Warranty 133
DVD PLAYER Normal Warranty 53
Out of Warranty 66
DVD PLAYER/ RECORD Extended Warranty 11
Normal Warranty 18
Out of Warranty 36
HARD DISK DRIVE Extended Warranty 63
Normal Warranty 14
Out of Warranty 25
LCD- 46 Normal Warranty 5
Out of Warranty 3
LCD-40 Normal Warranty 522
Out of Warranty 7
LED40 Normal Warranty 1003
Out of Warranty 5
LED46 Normal Warranty 247
Out of Warranty 3
LED55 Normal Warranty 14
Out of Warranty 1
MONITOR - TFT LCD Normal Warranty 5
Out of Warranty 19
OVEN - MICROWAVE Normal Warranty 3
Out of Warranty 8
PLASMA-42 Normal Warranty 55
Out of Warranty 7
PLASMA-43 Normal Warranty 45
Out of Warranty 4
PLASMA-50 Normal Warranty 3
Out of Warranty 1
PLASMA-51 Normal Warranty 12
Out of Warranty 3
PRINTER - COLOR LA Normal Warranty 14
Out of Warranty 21
PRINTER - MONO LAS Normal Warranty 57
Out of Warranty 94
REFRIGERATOR-SBS Normal Warranty 32
Out of Warranty 36
TFT18 Extended Warranty 2
Normal Warranty 15
Out of Warranty 1
VACUUM CLEANER Normal Warranty 1
Out of Warranty 23
WASHING MACHINE Out of Warranty 16
dtype: int64
In fact, the result of the command above is a Series with a MultiIndex. Using unstack()
, we reshape the Series into a DataFrame.
service_df.groupby(by =["Product_Group", "Cost_Type"]).size().unstack()
Cost_Type | Extended Warranty | Normal Warranty | Out of Warranty |
---|---|---|---|
Product_Group | |||
AIR CONDITIONER | NaN | 2.0 | 1.0 |
AUDIO - HOME THEAT | NaN | 3.0 | 99.0 |
AUDIO - MP3 Player | NaN | 28.0 | 81.0 |
CAMCORDER | NaN | 5.0 | 52.0 |
DIGITAL CAMERA | NaN | 149.0 | 133.0 |
DVD PLAYER | NaN | 53.0 | 66.0 |
DVD PLAYER/ RECORD | 11.0 | 18.0 | 36.0 |
HARD DISK DRIVE | 63.0 | 14.0 | 25.0 |
LCD- 46 | NaN | 5.0 | 3.0 |
LCD-40 | NaN | 522.0 | 7.0 |
LED40 | NaN | 1003.0 | 5.0 |
LED46 | NaN | 247.0 | 3.0 |
LED55 | NaN | 14.0 | 1.0 |
MONITOR - TFT LCD | NaN | 5.0 | 19.0 |
OVEN - MICROWAVE | NaN | 3.0 | 8.0 |
PLASMA-42 | NaN | 55.0 | 7.0 |
PLASMA-43 | NaN | 45.0 | 4.0 |
PLASMA-50 | NaN | 3.0 | 1.0 |
PLASMA-51 | NaN | 12.0 | 3.0 |
PRINTER - COLOR LA | NaN | 14.0 | 21.0 |
PRINTER - MONO LAS | NaN | 57.0 | 94.0 |
REFRIGERATOR-SBS | NaN | 32.0 | 36.0 |
TFT18 | 2.0 | 15.0 | 1.0 |
VACUUM CLEANER | NaN | 1.0 | 23.0 |
WASHING MACHINE | NaN | NaN | 16.0 |
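If the NaN cells above are a nuisance, unstack() accepts a fill_value argument so that missing combinations come out as 0 instead. A toy sketch (not the service dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Group": ["TV", "TV", "HDD"],
    "Cost_Type": ["Normal Warranty", "Out of Warranty", "Normal Warranty"],
})

# missing (group, cost type) combinations become 0 instead of NaN
counts = df.groupby(["Product_Group", "Cost_Type"]).size().unstack(fill_value=0)
print(counts.loc["HDD", "Out of Warranty"])  # 0
```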
Let’s count the NaN values in each column.
service_df.isna().sum()
Service_type 0
Cost_Type 0
Product_Group 10
Serial_No 9
City 0
Product_Date 2412
Receipt_Date 0
Appoint_Date 0
Complete_Date 0
TAT01 0
TAT02 0
Job_Satus 0
Defect_Des 0
Symptom_Desc 0
Repair_Action_Desc 0
Labor_Charge_Desc 0
Engineer 0
Labor_Charge_Amount 0
Parts_Amount 0
Discount_Amount 0
Total_Invoice_Amount 0
dtype: int64
Let’s see the rows with NaN values.
service_df[service_df[['Product_Group','Serial_No']].isna().any(axis=1)]
Service_type | Cost_Type | Product_Group | Serial_No | City | Product_Date | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||||
412 | 0 | Out of Warranty | NaN | 2P-/BAXS-E/KLKLK | Tehran | 38818.0 | 12/29/2012 | 12/31/2012 | 1/8/2013 | 2 | 10 | 1 | WORN-OUT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech16 | 0 | 0.00 | 0.000 | 0.000 |
542 | 1 | Normal Warranty | PRINTER - MONO LAS | NaN | Tehran | NaN | 12/31/2012 | 12/31/2012 | 1/1/2013 | 0 | 1 | 1 | WORN-OUT | OTHER PROBLEMS | OTHER REPAIR | Maximum | Tech20 | 350000 | 0.00 | 70000.000 | 280000.000 |
626 | 0 | Out of Warranty | NaN | ZADCKLVKSAE2KL2K | MASHHAD | 38814.0 | 1/1/2013 | 1/8/2013 | 1/13/2013 | 7 | 12 | 1 | DEFORMED | Test OK | No Action | Free Of Charge | Tech08 | 0 | 0.00 | 0.000 | 0.000 |
704 | 0 | Out of Warranty | NaN | A/&2GV2B&E/KL2K | Tehran | 40548.0 | 1/1/2013 | 1/2/2013 | 1/5/2013 | 1 | 4 | 1 | Software Bug | ZOOM PROBLEM | FIRMWARE UPGRADE | Major | Tech08 | 250000 | 0.00 | 50000.000 | 200000.000 |
876 | 0 | Out of Warranty | VACUUM CLEANER | NaN | Tehran | NaN | 1/2/2013 | 1/6/2013 | 1/6/2013 | 4 | 4 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech17 | 0 | 0.00 | 0.000 | 0.000 |
934 | 0 | Normal Warranty | NaN | ZTAPKLCABA/2IP24L | Tehran | NaN | 1/5/2013 | 1/5/2013 | 1/5/2013 | 0 | 0 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Major | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1059 | 0 | Normal Warranty | NaN | ZTAPKLCIBAE&&/D | Tehran | NaN | 1/5/2013 | 1/6/2013 | 1/7/2013 | 1 | 2 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Medium | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1207 | 0 | Normal Warranty | NaN | ZTAPKLCIBB/&IP47A | Tehran | NaN | 1/6/2013 | 1/6/2013 | 1/10/2013 | 0 | 4 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech14 | 170000 | 1158705.72 | 265741.144 | 1062864.576 |
1213 | 0 | Out of Warranty | NaN | 2P7IPBAKPQEKL27/N | Tehran | 38080.0 | 1/6/2013 | 1/6/2013 | 1/7/2013 | 0 | 1 | 1 | WORN-OUT | Test OK | No Action | Free Of Charge | Tech20 | 0 | 0.00 | 0.000 | 0.000 |
1235 | 0 | Out of Warranty | VACUUM CLEANER | NaN | Tehran | NaN | 1/6/2013 | 1/8/2013 | 1/9/2013 | 2 | 3 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Minor | Tech17 | 145000 | 15526.44 | 32105.288 | 128421.152 |
1550 | 0 | Normal Warranty | NaN | ZTAPKLCIB2/A/7KLZ | Tehran | NaN | 1/9/2013 | 1/9/2013 | 1/9/2013 | 0 | 0 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Medium | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1571 | 0 | Normal Warranty | NaN | ZTAPKLCIBA/A4QEP | ARDEBIL | NaN | 1/9/2013 | 1/9/2013 | 1/27/2013 | 0 | 18 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech18 | 0 | 0.00 | 0.000 | 0.000 |
1617 | 0 | Out of Warranty | NaN | A/&KLKLVBB&E/2AN | Tehran | 40548.0 | 1/9/2013 | 1/10/2013 | 1/13/2013 | 1 | 4 | 1 | DEFORMED | NO POWER | ELECTRICAL PARTS REPLACEMENT | Major | Tech08 | 250000 | 0.00 | 50000.000 | 200000.000 |
1814 | 0 | Out of Warranty | DVD PLAYER | NaN | Tehran | NaN | 1/14/2013 | 1/14/2013 | 1/26/2013 | 0 | 12 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Minor | Tech14 | 120000 | 868118.56 | 187823.812 | 781285.648 |
2032 | 0 | Out of Warranty | DIGITAL CAMERA | NaN | Tehran | NaN | 1/15/2013 | 1/15/2013 | 1/21/2013 | 0 | 6 | 1 | DEFORMED | OTHER PROBLEMS | MECHANICAL PARTS REPLACEMENT | Free Of Charge | Tech11 | 0 | 0.00 | 0.000 | 0.000 |
2036 | 0 | Out of Warranty | VACUUM CLEANER | NaN | Tehran | NaN | 1/16/2013 | 1/20/2013 | 1/20/2013 | 4 | 4 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech17 | 0 | 18858.40 | 3871.880 | 15887.520 |
2269 | 0 | Out of Warranty | AUDIO - HOME THEAT | NaN | Tehran | NaN | 1/20/2013 | 1/20/2013 | 1/20/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech18 | 0 | 1450087.28 | 280018.456 | 1160077.824 |
2772 | 0 | Out of Warranty | DVD PLAYER | NaN | Tehran | NaN | 1/24/2013 | 1/24/2013 | 1/26/2013 | 0 | 2 | 1 | WORN-OUT | OTHER PROBLEMS | MECHANICAL PARTS REPLACEMENT | Minimum | Tech18 | 70000 | 868118.56 | 187823.812 | 751285.648 |
2892 | 0 | Out of Warranty | DIGITAL CAMERA | NaN | ABHAR | NaN | 1/27/2013 | 1/27/2013 | 1/28/2013 | 0 | 1 | 1 | DEFORMED | OTHER PROBLEMS | SCRAP | Free Of Charge | Tech11 | 0 | 0.00 | 0.000 | 0.000 |
Data Cleaning
In this part, based on what we want to do with the data, we restructure, reformat, and validate it.
To start, let’s remove columns that are not useful for our hypothetical purpose.
- Does
Serial_No
really help the model decide how long a repair takes?
- Are the
Product_Date
values meaningful?
If the answers to these questions are no, we can drop those columns.
service_df = service_df.drop(columns=['Product_Date', 'Serial_No'])
service_df
Service_type | Cost_Type | Product_Group | City | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||
1 | 0 | Normal Warranty | AUDIO - HOME THEAT | Tehran | 2/5/2012 | 2/5/2012 | 1/21/2013 | 0 | 351 | 0 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
2 | 1 | Normal Warranty | REFRIGERATOR-SBS | Tehran | 7/7/2012 | 9/10/2012 | 1/31/2013 | 65 | 208 | 1 | SHORT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 3437481.60 | 687486.320 | 2748885.280 |
3 | 1 | Normal Warranty | REFRIGERATOR-SBS | Tehran | 7/7/2012 | 8/15/2012 | 1/31/2013 | 38 | 208 | 1 | OPEN | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0 | 4228756.56 | 845751.312 | 3383005.248 |
4 | 0 | Out of Warranty | MONITOR - TFT LCD | Tehran | 8/7/2012 | 8/7/2012 | 1/13/2013 | 0 | 158 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0 | 0.00 | 0.000 | 0.000 |
5 | 0 | Out of Warranty | AUDIO - MP3 Player | Tehran | 8/7/2012 | 8/9/2012 | 1/7/2013 | 2 | 153 | 1 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Medium | Tech07 | 120000 | 765488.60 | 177087.820 | 708381.680 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3132 | 0 | Normal Warranty | LCD-40 | REZVANSHAHR | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420000 | 0.00 | 84000.000 | 336000.000 |
3133 | 0 | Normal Warranty | LED46 | ESFAHAN | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420000 | 0.00 | 84000.000 | 336000.000 |
3134 | 0 | Extended Warranty | HARD DISK DRIVE | Tehran | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Minimum | Tech07 | 50000 | 0.00 | 10000.000 | 40000.000 |
3135 | 0 | Out of Warranty | HARD DISK DRIVE | Tehran | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Free Of Charge | Tech07 | 0 | 0.00 | 0.000 | 0.000 |
3136 | 0 | Out of Warranty | DVD PLAYER/ RECORD | Tehran | 1/31/2013 | 1/31/2013 | 1/31/2013 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | MECHANICAL PARTS REPLACEMENT | Minor | Tech18 | 150000 | 1488482.00 | 328686.400 | 1318785.600 |
3136 rows × 19 columns
The dataset at hand has some NaN values. What do they mean? Does the value exist but was not provided, or is there simply nothing to fill that field?
possible solutions
- drop data
- drop the whole row
- drop the whole column
- Imputation
- replace it by mean
- replace it by frequency
- adding an unknown category
- …
It’s important to have a logical reason for the choice. For example, it’s not reasonable to fill NaN values with the mean when NaN occupies half of the column.
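The imputation options above can be sketched with fillna(). A toy illustration on made-up Series (not the service dataset):

```python
import numpy as np
import pandas as pd

# toy columns with missing values -- illustration only
numeric = pd.Series([1.0, np.nan, 3.0])
categorical = pd.Series(["TV", None, "HDD"])

by_mean = numeric.fillna(numeric.mean())             # replace by mean
by_mode = categorical.fillna(categorical.mode()[0])  # replace by most frequent value
as_unknown = categorical.fillna("UNKNOWN")           # add an "unknown" category

print(by_mean[1], as_unknown[1])  # 2.0 UNKNOWN
```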
Where are the NaN values?
service_df.isna().sum()
Service_type 0
Cost_Type 0
Product_Group 10
City 0
Receipt_Date 0
Appoint_Date 0
Complete_Date 0
TAT01 0
TAT02 0
Job_Satus 0
Defect_Des 0
Symptom_Desc 0
Repair_Action_Desc 0
Labor_Charge_Desc 0
Engineer 0
Labor_Charge_Amount 0
Parts_Amount 0
Discount_Amount 0
Total_Invoice_Amount 0
dtype: int64
Let’s see rows with NaN values.
service_df[pd.isna(service_df.Product_Group)]
Service_type | Cost_Type | Product_Group | City | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||
412 | 0 | Out of Warranty | NaN | Tehran | 12/29/2012 | 12/31/2012 | 1/8/2013 | 2 | 10 | 1 | WORN-OUT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech16 | 0 | 0.00 | 0.000 | 0.000 |
626 | 0 | Out of Warranty | NaN | MASHHAD | 1/1/2013 | 1/8/2013 | 1/13/2013 | 7 | 12 | 1 | DEFORMED | Test OK | No Action | Free Of Charge | Tech08 | 0 | 0.00 | 0.000 | 0.000 |
704 | 0 | Out of Warranty | NaN | Tehran | 1/1/2013 | 1/2/2013 | 1/5/2013 | 1 | 4 | 1 | Software Bug | ZOOM PROBLEM | FIRMWARE UPGRADE | Major | Tech08 | 250000 | 0.00 | 50000.000 | 200000.000 |
934 | 0 | Normal Warranty | NaN | Tehran | 1/5/2013 | 1/5/2013 | 1/5/2013 | 0 | 0 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Major | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1059 | 0 | Normal Warranty | NaN | Tehran | 1/5/2013 | 1/6/2013 | 1/7/2013 | 1 | 2 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Medium | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1207 | 0 | Normal Warranty | NaN | Tehran | 1/6/2013 | 1/6/2013 | 1/10/2013 | 0 | 4 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech14 | 170000 | 1158705.72 | 265741.144 | 1062864.576 |
1213 | 0 | Out of Warranty | NaN | Tehran | 1/6/2013 | 1/6/2013 | 1/7/2013 | 0 | 1 | 1 | WORN-OUT | Test OK | No Action | Free Of Charge | Tech20 | 0 | 0.00 | 0.000 | 0.000 |
1550 | 0 | Normal Warranty | NaN | Tehran | 1/9/2013 | 1/9/2013 | 1/9/2013 | 0 | 0 | 1 | WORN-OUT | READ ERROR | MECHANICAL PARTS REPLACEMENT | Medium | Tech12 | 0 | 1158705.72 | 231741.144 | 826864.576 |
1571 | 0 | Normal Warranty | NaN | ARDEBIL | 1/9/2013 | 1/9/2013 | 1/27/2013 | 0 | 18 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech18 | 0 | 0.00 | 0.000 | 0.000 |
1617 | 0 | Out of Warranty | NaN | Tehran | 1/9/2013 | 1/10/2013 | 1/13/2013 | 1 | 4 | 1 | DEFORMED | NO POWER | ELECTRICAL PARTS REPLACEMENT | Major | Tech08 | 250000 | 0.00 | 50000.000 | 200000.000 |
Based on the task you are about to perform once the data is prepared, decide how much row dropping is acceptable. If 10 rows is not a lot, we can drop them.
service_df.dropna(inplace=True)
What cities are in the dataset?
print(sorted(service_df.City.unique()))
['ABADAN', 'ABHAR', 'AHVAZ', 'AMOL', 'ARAK', 'ARDEBI', 'ARDEBIL', 'BABOL', 'BABOLSAR', 'BANDAR ABBAS', 'BANDAR ANZALI', 'BANDARABBAS', 'BARDSIR', 'BEHBAHAN', 'BIJAR', 'BIRJAND', 'BOJNORD', 'BOKAN', 'BOMEHEN', 'BOROJERD', 'CHABAHAR', 'CHALOOS', 'CHALOUS', 'CHENARAN', 'DEZFOL', 'DEZFOUL', 'DOROD', 'ESFAHAN', 'ESFAHN', 'ESFERAYN', 'ESLAM ABAD', 'ESLAM SHAHR', 'ESLAMSHAHR', 'FARDIS', 'FARHANG', 'GARMSAR', 'GHAEM SHAHR', 'GHAEMSHAHR', 'GHARCHAK', 'GOLPAYGAN', 'GONBAD', 'GONBAD KAVOS', 'GORGAN', 'HAMEDAN', 'HASHTGERD', 'HSMEDAN', 'ILAM', 'IZEH', 'JAHROM', 'JEYROFT', 'KARAJ', 'KASHAN', 'KASHMAR', 'KERMAN', 'KERMANSHAH', 'KHOMAM', 'KHOMEIN', 'KHORAM ABAD', 'KHOY', 'KhoramAbad', 'LAHIJAN', 'LANGEROD', 'LANGEROOD', 'MAHMOUD ABAD', 'MAHSHAHR', 'MALAYER', 'MANJIL', 'MASHHAD', 'MEHRSHAHR', 'MIYANEH', 'NAHAVAND', 'NIYSHABOUR', 'NOR', 'NOSHAHR', 'NOUR', 'NOUSHAHR', 'OROMIYEH', 'PARS ABAD', 'QAZVIN', 'QOM', 'RAFSANJAN', 'RASHT', 'REZVANSHAHR', 'ROBAT KARIM', 'ROBATKARIM', 'RODSAR', 'SABZEVAR', 'SALMAS', 'SARAF ZADEH', 'SARBANDAR', 'SARI', 'SAVAD KOOH', 'SAVEH', 'SAVOJBOLAGH', 'SEMNAN', 'SHAHR GHODS', 'SHAHRE GHODS', 'SHAHRE KORD', 'SHAHRGHODS', 'SHAHROD', 'SHAHROOD', 'SHAHRYAR', 'SHAZAND', 'SHIRAZ', 'SIRJAN', 'TABAS', 'TABRIZ', 'TALESH', 'TEHRAN', 'TONEKABON', 'Tehran', 'VARAMIN', 'YAZD', 'ZAHEDAN', 'ZANJAN']
Look at the values above. What do you see? We have multiple spellings of the same city names:
- ‘ARDEBI’, ‘ARDEBIL’
- ‘BANDAR ABBAS’, ‘BANDARABBAS’
- ‘CHALOOS’, ‘CHALOUS’
- ‘DEZFOL’, ‘DEZFOUL’
- ‘ESFAHAN’, ‘ESFAHN’
- ‘ESLAMSHAHR’, ‘ESLAM SHAHR’
- ‘GHAEM SHAHR’, ‘GHAEMSHAHR’
- ‘HAMEDAN’, ‘HSMEDAN’
- ‘KHORAM ABAD’, ‘KhoramAbad’
- ‘LANGEROD’, ‘LANGEROOD’
- ‘NOR’, ‘NOUR’
- ‘NOSHAHR’, ‘NOUSHAHR’
- ‘ROBAT KARIM’, ‘ROBATKARIM’
- ‘SHAHR GHODS’, ‘SHAHRE GHODS’, ‘SHAHRGHODS’
- ‘SHAHROD’, ‘SHAHROOD’
- ‘TEHRAN’, ‘Tehran’
So we should transform them into a common format. This is data standardization: we pick one spelling as canonical and map the variants onto it.
city_dictionary = {'ARDEBIL': 'ARDEBI', 'BANDARABBAS': 'BANDAR ABBAS', 'CHALOUS': 'CHALOOS', 'DEZFOUL': 'DEZFOL', 'ESFAHN': 'ESFAHAN', 'ESLAM SHAHR': 'ESLAMSHAHR', 'GHAEMSHAHR': 'GHAEM SHAHR', 'HAMEDAN': 'HSMEDAN', 'KhoramAbad': 'KHORAM ABAD', 'LANGEROOD': 'LANGEROD', 'NOUR': 'NOR', 'NOUSHAHR': 'NOSHAHR', 'ROBATKARIM': 'ROBAT KARIM','SHAHRE GHODS': 'SHAHR GHODS', 'SHAHRGHODS': 'SHAHR GHODS', 'SHAHROOD': 'SHAHROD', 'Tehran': 'TEHRAN'}
service_df = service_df.replace({'City': city_dictionary})
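A manual dictionary only catches the variants we spotted by eye; a cheap case-and-whitespace normalization pass is a useful complement. A sketch on toy values (not applied to service_df above):

```python
import pandas as pd

cities = pd.Series(["Tehran", "TEHRAN", " tehran "])

# strip stray whitespace and unify case before (or instead of) manual mapping
normalized = cities.str.strip().str.upper()
print(normalized.unique())  # ['TEHRAN']
```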
Now let’s check whether the DataFrame has duplicate rows.
service_df.duplicated().any()
True
Unfortunately, it does. Since we’re treating the DataFrame as an analytical base table (ABT), it’s better to drop them.
service_df.drop_duplicates(inplace=True)
service_df.shape
(1971, 19)
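On a toy frame (made up for illustration), duplicated() flags each repeat of an earlier row and drop_duplicates() keeps only the first occurrence:

```python
import pandas as pd

# Toy frame: row 1 is an exact copy of row 0.
toy = pd.DataFrame({'City': ['TEHRAN', 'TEHRAN', 'SHIRAZ'],
                    'Cost': [100, 100, 250]})
print(toy.duplicated().tolist())  # [False, True, False]: repeats of earlier rows
deduped = toy.drop_duplicates()   # keeps the first occurrence
print(deduped.shape)              # (2, 2)
```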
Let’s look again at the dtypes; we can convert some columns to more suitable, memory-efficient types.
service_df.dtypes
Service_type int64
Cost_Type object
Product_Group object
City object
Receipt_Date object
Appoint_Date object
Complete_Date object
TAT01 int64
TAT02 int64
Job_Satus int64
Defect_Des object
Symptom_Desc object
Repair_Action_Desc object
Labor_Charge_Desc object
Engineer object
Labor_Charge_Amount int64
Parts_Amount float64
Discount_Amount float64
Total_Invoice_Amount float64
dtype: object
The Receipt_Date, Appoint_Date and Complete_Date columns are stored as object; let’s convert them to the datetime type.
Remember that we have already explored the data, so we know the dates use the %m/%d/%Y format.
service_df[["Receipt_Date", "Appoint_Date", "Complete_Date"]] = service_df[["Receipt_Date", "Appoint_Date", "Complete_Date"]].apply(lambda col: pd.to_datetime(col, format='%m/%d/%Y'))
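If we were less sure of the format, pd.to_datetime also accepts errors='coerce', which turns unparseable entries into NaT instead of raising. A small sketch with made-up values:

```python
import pandas as pd

# Made-up dates: one entry does not match the expected %m/%d/%Y format.
raw = pd.Series(['2/5/2012', '7/7/2012', 'not-a-date'])
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')
print(parsed.isna().sum())  # 1: the malformed entry became NaT
```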
Now let’s look at the unique values of each column with object dtype.
for column in service_df.select_dtypes(include=[object]).columns:
print(column, sorted(service_df[column].unique()), sep="\n\t")
print("-------------------------")
Cost_Type
['Extended Warranty', 'Normal Warranty', 'Out of Warranty']
-------------------------
Product_Group
['AIR CONDITIONER ', 'AUDIO - HOME THEAT', 'AUDIO - MP3 Player', 'CAMCORDER', 'DIGITAL CAMERA', 'DVD PLAYER', 'DVD PLAYER/ RECORD', 'HARD DISK DRIVE', 'LCD- 46', 'LCD-40', 'LED40', 'LED46', 'LED55', 'MONITOR - TFT LCD ', 'OVEN - MICROWAVE', 'PLASMA-42', 'PLASMA-43', 'PLASMA-50', 'PLASMA-51', 'PRINTER - COLOR LA', 'PRINTER - MONO LAS', 'REFRIGERATOR-SBS', 'TFT18', 'VACUUM CLEANER', 'WASHING MACHINE']
-------------------------
City
['ABADAN', 'ABHAR', 'AHVAZ', 'AMOL', 'ARAK', 'ARDEBIL', 'BABOL', 'BABOLSAR', 'BANDAR ABBAS', 'BANDAR ANZALI', 'BARDSIR', 'BEHBAHAN', 'BIJAR', 'BIRJAND', 'BOJNORD', 'BOKAN', 'BOMEHEN', 'BOROJERD', 'CHABAHAR', 'CHALOOS', 'CHENARAN', 'DEZFOL', 'DOROD', 'ESFAHAN', 'ESFERAYN', 'ESLAM ABAD', 'ESLAMSHAHR', 'FARDIS', 'FARHANG', 'GARMSAR', 'GHAEM SHAHR', 'GHARCHAK', 'GOLPAYGAN', 'GONBAD', 'GONBAD KAVOS', 'GORGAN', 'HAMEDAN', 'HASHTGERD', 'ILAM', 'IZEH', 'JAHROM', 'JEYROFT', 'KARAJ', 'KASHAN', 'KASHMAR', 'KERMAN', 'KERMANSHAH', 'KHOMAM', 'KHOMEIN', 'KHORAM ABAD', 'KHOY', 'LAHIJAN', 'LANGEROD', 'MAHMOUD ABAD', 'MAHSHAHR', 'MALAYER', 'MANJIL', 'MASHHAD', 'MEHRSHAHR', 'MIYANEH', 'NAHAVAND', 'NIYSHABOUR', 'NOR', 'NOSHAHR', 'OROMIYEH', 'PARS ABAD', 'QAZVIN', 'QOM', 'RAFSANJAN', 'RASHT', 'REZVANSHAHR', 'ROBAT KARIM', 'RODSAR', 'SABZEVAR', 'SALMAS', 'SARAF ZADEH', 'SARBANDAR', 'SARI', 'SAVAD KOOH', 'SAVEH', 'SAVOJBOLAGH', 'SEMNAN', 'SHAHR GHODS', 'SHAHRE KORD', 'SHAHROD', 'SHAHRYAR', 'SHAZAND', 'SHIRAZ', 'SIRJAN', 'TABAS', 'TABRIZ', 'TALESH', 'TEHRAN', 'TONEKABON', 'VARAMIN', 'YAZD', 'ZAHEDAN', 'ZANJAN']
-------------------------
Defect_Des
['COLD SOLDER', 'DEFORMED', 'DIRTY', 'DRY', 'LEAKING', 'OPEN', 'SHORT', 'Software Bug', 'WORN-OUT', 'WRONG PARTS']
-------------------------
Symptom_Desc
['BEATING ON PICTURE', 'BUTTON/DIAL NOT WORKING', 'Broken External Flash', 'Cosmetic replacement(customer request)', 'DISPLAY DIM', 'Defective Date Mode', 'Defective Eye Cup', 'Defective LCD Touch Screen', 'Defective Usb Cable', 'EXCESSIVE COLOR', 'Exterior Problem/damage', 'FAULTY COUNTER/TIMER', 'FAULTY OSD', 'FIRMWARE UPAGRADE REQUESTED', 'HORIZONTAL LINE PRINTED', 'MEMORY ERASED/MISSED', 'NO AUDIO', 'NO LOADING', 'NO PICTURE/NO RASTER', 'NO POWER', 'No Power', 'OTHER FAULTY INDICATION', 'OTHER PROBLEMS', 'OUTCASE CRACK', 'PICTURE SLIDING', 'PICTURE TOO BRIGHT', 'PICTURE TOO DARK', 'PIXEL MISSING', 'POOR FOCUS', 'POWER OFF NOT WORKING', 'POWER ON, BUT NO OPERATION', 'Power On, But No Operation', 'READ ERROR', 'RESET NOT WORKING', 'SHORT BATTERY CHARGING', 'SHORT BATTERY LIFE', 'SNOWY PICTURE', 'SOUND CLIPPING', 'Test OK', 'UNIT AUTO-OFF', 'VERTICAL LINE PRINTED']
-------------------------
Repair_Action_Desc
['ADJUSTMENT', 'CHANGE LCD', 'CHANGE OUTCASE', 'CLEANING', 'Cosmetic replacement(customer request)', "DECK ASS'Y REPAIR/REPLACEMENT", 'DECK CLEANING', 'ELECTRICAL PARTS REPLACEMENT', 'FIRMWARE UPGRADE', 'HEAD CLEANING', 'MECHANICAL PARTS REPLACEMENT', 'MECHANICAL RE-ASSEMBLING', 'No Action', 'OTHER CLEANING', 'OTHER ELECTRICAL ADJUSTMENT', 'OTHER MECHANICAL ADJUSTMENT', 'OTHER REPAIR', 'PARTS REMOVAL', 'PCB REPLACEMENT', 'PICKUP CLEANING', 'REPAIR REFUSAL', 'SCRAP', 'SET EXCHANGE', 'SOFTWARE UPGRADE', 'SOLDERING', 'TELECOM CARRIER PROBLEM']
-------------------------
Labor_Charge_Desc
['Free Of Charge', 'Major', 'Maximum', 'Medium', 'Minimum', 'Minor']
-------------------------
Engineer
['Tech01', 'Tech02', 'Tech03', 'Tech04', 'Tech05', 'Tech06', 'Tech07', 'Tech08', 'Tech10', 'Tech11', 'Tech12', 'Tech13', 'Tech14', 'Tech15', 'Tech16', 'Tech17', 'Tech18', 'Tech20', 'Tech21', 'Tech22', 'Tech23', 'Tech24']
-------------------------
It seems all columns with object dtype can be treated as categorical, so let’s convert them to the category dtype.
for column in service_df.select_dtypes(include=[object]).columns:
service_df[column] = service_df[column].astype('category')
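The category dtype stores a small integer code per row plus one copy of each label, which typically cuts memory use for low-cardinality string columns. A toy comparison on made-up data:

```python
import pandas as pd

# Made-up column: 1000 repeats of one label.
s = pd.Series(['Normal Warranty'] * 1000)
as_cat = s.astype('category')
obj_bytes = s.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the category version uses far fewer bytes
```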
Now let’s look at the unique values of each column with integer dtype.
for column in service_df.select_dtypes(include=['int64']).columns:
print(column, sorted(service_df[column].unique()), sep="\n\t")
print("-------------------------")
Service_type
[0, 1]
-------------------------
TAT01
[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 26, 27, 28, 30, 31, 32, 33, 34, 36, 37, 38, 40, 42, 44, 50, 54, 55, 58, 64, 65, 67, 68, 81]
-------------------------
TAT02
[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50, 51, 52, 54, 56, 58, 60, 65, 68, 73, 75, 77, 78, 80, 81, 82, 84, 87, 88, 105, 122, 135, 153, 158, 208, 351]
-------------------------
Job_Satus
[0, 1]
-------------------------
Labor_Charge_Amount
[0, 50000, 70000, 85000, 120000, 145000, 150000, 170000, 180000, 200000, 215000, 240000, 250000, 300000, 350000, 360000, 400000, 420000, 540000]
-------------------------
Service_type and Job_Satus can also be treated as categorical, since each takes only the values 0 and 1.
service_df['Service_type'] = service_df['Service_type'].astype('category')
service_df['Job_Satus'] = service_df['Job_Satus'].astype('category')
Scaling and Normalization
The Labor_Charge_Amount, Parts_Amount, Discount_Amount and Total_Invoice_Amount columns contain very large values. We can scale them down by expressing the amounts in thousands.
service_df['Labor_Charge_Amount'] = service_df['Labor_Charge_Amount'].div(1000)
service_df['Parts_Amount'] = service_df['Parts_Amount'].div(1000)
service_df['Discount_Amount'] = service_df['Discount_Amount'].div(1000)
service_df['Total_Invoice_Amount'] = service_df['Total_Invoice_Amount'].div(1000)
service_df.head()
Service_type | Cost_Type | Product_Group | City | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||
1 | 0 | Normal Warranty | AUDIO - HOME THEAT | TEHRAN | 2012-02-05 | 2012-02-05 | 2013-01-21 | 0 | 351 | 0 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Free Of Charge | Tech15 | 0.0 | 0.00000 | 0.000000 | 0.000000 |
2 | 1 | Normal Warranty | REFRIGERATOR-SBS | TEHRAN | 2012-07-07 | 2012-09-10 | 2013-01-31 | 65 | 208 | 1 | SHORT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0.0 | 3437.48160 | 687.486320 | 2748.885280 |
3 | 1 | Normal Warranty | REFRIGERATOR-SBS | TEHRAN | 2012-07-07 | 2012-08-15 | 2013-01-31 | 38 | 208 | 1 | OPEN | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0.0 | 4228.75656 | 845.751312 | 3383.005248 |
4 | 0 | Out of Warranty | MONITOR - TFT LCD | TEHRAN | 2012-08-07 | 2012-08-07 | 2013-01-13 | 0 | 158 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0.0 | 0.00000 | 0.000000 | 0.000000 |
5 | 0 | Out of Warranty | AUDIO - MP3 Player | TEHRAN | 2012-08-07 | 2012-08-09 | 2013-01-07 | 2 | 153 | 1 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Medium | Tech07 | 120.0 | 765.48860 | 177.087820 | 708.381680 |
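Dividing by 1000 only changes the unit. If a model needed values on a common scale, min-max normalization would map a column into [0, 1]; a sketch on made-up amounts (not applied in this notebook):

```python
import pandas as pd

# Made-up amounts; min-max formula: (x - min) / (max - min).
amounts = pd.Series([0.0, 120.0, 420.0, 540.0])
scaled = (amounts - amounts.min()) / (amounts.max() - amounts.min())
print(scaled.tolist())  # smallest value maps to 0.0, largest to 1.0
```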
Dummy variables
How about fitting a regression model? Regression requires numerical inputs, but the DataFrame still has categorical columns. To address this, each categorical variable can be encoded as a set of indicator (dummy) variables: one binary column per category.
dummy_variable_cost_type = pd.get_dummies(service_df["Cost_Type"])
dummy_variable_cost_type
Extended Warranty | Normal Warranty | Out of Warranty | |
---|---|---|---|
No | |||
1 | 0 | 1 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 1 | 0 |
4 | 0 | 0 | 1 |
5 | 0 | 0 | 1 |
... | ... | ... | ... |
3132 | 0 | 1 | 0 |
3133 | 0 | 1 | 0 |
3134 | 1 | 0 | 0 |
3135 | 0 | 0 | 1 |
3136 | 0 | 0 | 1 |
1971 rows × 3 columns
As you can see, exactly one cell in each row has the value one; this is a one-hot encoding.
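Because the three dummy columns always sum to one, they are perfectly collinear, which can be a problem in a regression design matrix. get_dummies supports drop_first=True to drop one level as the baseline; a side note sketched on toy data (not used in this notebook):

```python
import pandas as pd

# Toy series with the same three Cost_Type levels.
s = pd.Series(['Normal Warranty', 'Out of Warranty', 'Extended Warranty'])
full = pd.get_dummies(s)                      # 3 columns, one per level
reduced = pd.get_dummies(s, drop_first=True)  # 2 columns; dropped level is the baseline
print(full.shape[1], reduced.shape[1])        # 3 2
```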
Merging Datasets
So we have the service_df DataFrame and the dummy_variable_cost_type DataFrame, and we want to merge them into one. In this case we append the columns of dummy_variable_cost_type to service_df.
The pandas user guide on merging, joining and concatenating gives a comprehensive description of the available options.
service_df = pd.concat([service_df, dummy_variable_cost_type], axis=1)
Dropping the original Cost_Type column since it’s now redundant.
service_df.drop('Cost_Type', axis = 1, inplace=True)
service_df
Service_type | Product_Group | City | Receipt_Date | Appoint_Date | Complete_Date | TAT01 | TAT02 | Job_Satus | Defect_Des | Symptom_Desc | Repair_Action_Desc | Labor_Charge_Desc | Engineer | Labor_Charge_Amount | Parts_Amount | Discount_Amount | Total_Invoice_Amount | Extended Warranty | Normal Warranty | Out of Warranty | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No | |||||||||||||||||||||
1 | 0 | AUDIO - HOME THEAT | TEHRAN | 2012-02-05 | 2012-02-05 | 2013-01-21 | 0 | 351 | 0 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Free Of Charge | Tech15 | 0.0 | 0.00000 | 0.000000 | 0.000000 | 0 | 1 | 0 |
2 | 1 | REFRIGERATOR-SBS | TEHRAN | 2012-07-07 | 2012-09-10 | 2013-01-31 | 65 | 208 | 1 | SHORT | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0.0 | 3437.48160 | 687.486320 | 2748.885280 | 0 | 1 | 0 |
3 | 1 | REFRIGERATOR-SBS | TEHRAN | 2012-07-07 | 2012-08-15 | 2013-01-31 | 38 | 208 | 1 | OPEN | OTHER PROBLEMS | OTHER REPAIR | Free Of Charge | Tech03 | 0.0 | 4228.75656 | 845.751312 | 3383.005248 | 0 | 1 | 0 |
4 | 0 | MONITOR - TFT LCD | TEHRAN | 2012-08-07 | 2012-08-07 | 2013-01-13 | 0 | 158 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Free Of Charge | Tech15 | 0.0 | 0.00000 | 0.000000 | 0.000000 | 0 | 0 | 1 |
5 | 0 | AUDIO - MP3 Player | TEHRAN | 2012-08-07 | 2012-08-09 | 2013-01-07 | 2 | 153 | 1 | WORN-OUT | OTHER PROBLEMS | SOFTWARE UPGRADE | Medium | Tech07 | 120.0 | 765.48860 | 177.087820 | 708.381680 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3132 | 0 | LCD-40 | REZVANSHAHR | 2013-01-31 | 2013-01-31 | 2013-01-31 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420.0 | 0.00000 | 84.000000 | 336.000000 | 0 | 1 | 0 |
3133 | 0 | LED46 | ESFAHAN | 2013-01-31 | 2013-01-31 | 2013-01-31 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | ELECTRICAL PARTS REPLACEMENT | Major | Tech13 | 420.0 | 0.00000 | 84.000000 | 336.000000 | 0 | 1 | 0 |
3134 | 0 | HARD DISK DRIVE | TEHRAN | 2013-01-31 | 2013-01-31 | 2013-01-31 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Minimum | Tech07 | 50.0 | 0.00000 | 10.000000 | 40.000000 | 1 | 0 | 0 |
3135 | 0 | HARD DISK DRIVE | TEHRAN | 2013-01-31 | 2013-01-31 | 2013-01-31 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | REPAIR REFUSAL | Free Of Charge | Tech07 | 0.0 | 0.00000 | 0.000000 | 0.000000 | 0 | 0 | 1 |
3136 | 0 | DVD PLAYER/ RECORD | TEHRAN | 2013-01-31 | 2013-01-31 | 2013-01-31 | 0 | 0 | 1 | WORN-OUT | OTHER PROBLEMS | MECHANICAL PARTS REPLACEMENT | Minor | Tech18 | 150.0 | 1488.48200 | 328.686400 | 1318.785600 | 0 | 0 | 1 |
1971 rows × 21 columns
Now let’s store the prepared DataFrame. Say you are only interested in “HARD DISK DRIVE” products; to_csv() and to_pickle() will persist the result.
Harddrive_df = service_df.loc[service_df['Product_Group'] == 'HARD DISK DRIVE']
Harddrive_df.to_csv('Harddrive.csv')
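Unlike CSV, a pickle round-trip preserves pandas dtypes such as category and datetime. A sketch with a toy frame and a temporary file (the file name is illustrative):

```python
import os
import tempfile

import pandas as pd

# Toy frame with a category column; 'toy.pkl' is a made-up file name.
df = pd.DataFrame({'grp': pd.Series(['a', 'b', 'a'], dtype='category')})
path = os.path.join(tempfile.mkdtemp(), 'toy.pkl')
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored['grp'].dtype)  # the category dtype survived the round trip
```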
Iran’s Earthquakes through history
Opening Excel Files
As another example, let’s look at this new dataset.
earthquake_dfs = pd.read_excel(join('Datasets', 'earthquake', 'data.xls'), sheet_name=None)
earthquake_dfs.keys()
dict_keys(['Historical', '1900-1963', '1964-2000'])
The data is divided across three sheets. Let’s look at two of them.
earthquake_df1 = earthquake_dfs['1900-1963']
earthquake_df1.head()
No. | Date | Unnamed: 2 | Unnamed: 3 | Time | Location | Unnamed: 6 | Unnamed: 7 | Magnitude | Unnamed: 9 | Ref. | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Y | M | D | NaN | Lat. | Long. | Dept | mb | Ms | NaN |
1 | 1.0 | 1900 | 6 | NaN | NaN | 38.5 | 43.3 | NaN | NaN | 5 | MEA |
2 | 2.0 | 1900 | 7 | 12 | NaN | 40.28 | 43.1 | NaN | NaN | 5.9 | ULM |
3 | 3.0 | 1901 | 4 | 3 | 57.0 | 40 | 44.3 | 15 | NaN | 3.9 | MEA |
4 | 4.0 | 1902 | NaN | NaN | NaN | 39 | 43.3 | NaN | NaN | 5 | MEA |
earthquake_df2 = earthquake_dfs['1964-2000']
earthquake_df2.head()
No. | Date | Unnamed: 2 | Unnamed: 3 | Time | Location | Unnamed: 6 | Unnamed: 7 | Magnitude | Unnamed: 9 | Unnamed: 10 | Ref. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Y | M | D | NaN | Lat. | Long. | Dept | mb | Ms | Mw | NaN |
1 | 1.0 | 1964 | 1 | 12 | 124550.0 | 31.52 | 49.23 | 56 | 5.1 | NaN | NaN | ISC |
2 | 2.0 | 1964 | 1 | 19 | 91353.0 | 26.79 | 54 | 38 | 5.6 | NaN | NaN | ISC |
3 | 3.0 | 1964 | 1 | 30 | 122310.0 | 32.7 | 47.8 | 33 | NaN | NaN | NaN | PDE |
4 | 4.0 | 1964 | 2 | 5 | 102423.0 | 40.37 | 46.1 | 9 | 4.5 | NaN | NaN | ISC |
The second row is part of the header too. We could have passed header=[0, 1] to read_excel to handle this, but let’s fix it manually for now.
earthquake_df1.columns = ['No', 'Year', 'Month', 'Day', 'Time', 'Lat', 'Long', 'Dept', 'mb', 'Ms', 'ref']
earthquake_df1 = earthquake_df1.iloc[1:].reset_index(drop=True)
earthquake_df2.columns = ['No', 'Year', 'Month', 'Day', 'Time', 'Lat', 'Long', 'Dept', 'mb', 'Ms', 'Mw', 'ref']
earthquake_df2 = earthquake_df2.iloc[1:].reset_index(drop=True)
earthquake_df1.head()
No | Year | Month | Day | Time | Lat | Long | Dept | mb | Ms | ref | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1900 | 6 | NaN | NaN | 38.5 | 43.3 | NaN | NaN | 5 | MEA |
1 | 2.0 | 1900 | 7 | 12 | NaN | 40.28 | 43.1 | NaN | NaN | 5.9 | ULM |
2 | 3.0 | 1901 | 4 | 3 | 57.0 | 40 | 44.3 | 15 | NaN | 3.9 | MEA |
3 | 4.0 | 1902 | NaN | NaN | NaN | 39 | 43.3 | NaN | NaN | 5 | MEA |
4 | 5.0 | 1902 | 2 | 13 | 93906.0 | 40.72 | 48.71 | 33 | 6 | NaN | B77 |
Concatenation
We concatenate these two DataFrames into one. Notice how the resulting columns are the union of the two DataFrames’ columns; rows lacking a column (here, Mw for the 1900–1963 data) get NaN.
earthquake_df_all = pd.concat([earthquake_df1, earthquake_df2])
earthquake_df_all.head()
No | Year | Month | Day | Time | Lat | Long | Dept | mb | Ms | ref | Mw | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1900 | 6 | NaN | NaN | 38.5 | 43.3 | NaN | NaN | 5 | MEA | NaN |
1 | 2.0 | 1900 | 7 | 12 | NaN | 40.28 | 43.1 | NaN | NaN | 5.9 | ULM | NaN |
2 | 3.0 | 1901 | 4 | 3 | 57.0 | 40 | 44.3 | 15 | NaN | 3.9 | MEA | NaN |
3 | 4.0 | 1902 | NaN | NaN | NaN | 39 | 43.3 | NaN | NaN | 5 | MEA | NaN |
4 | 5.0 | 1902 | 2 | 13 | 93906.0 | 40.72 | 48.71 | 33 | 6 | NaN | B77 | NaN |
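The union behavior can be seen on a toy pair of frames (made up for illustration): a column present in only one frame is filled with NaN in the other frame's rows:

```python
import pandas as pd

# Toy frames: only the second has an Mw column.
left = pd.DataFrame({'No': [1], 'Ms': [5.0]})
right = pd.DataFrame({'No': [2], 'Ms': [5.9], 'Mw': [6.1]})
both = pd.concat([left, right], ignore_index=True)
print(sorted(both.columns))        # ['Ms', 'Mw', 'No']: the union of the columns
print(both['Mw'].isna().tolist())  # [True, False]: the left frame's row gets NaN
```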
Imputation
Now let’s replace the NaNs in the mb column with the mean of that column (mean imputation).
earthquake_df_all['mb'] = earthquake_df_all['mb'].fillna(earthquake_df_all['mb'].mean())
earthquake_df_all.head()
No | Year | Month | Day | Time | Lat | Long | Dept | mb | Ms | ref | Mw | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1900 | 6 | NaN | NaN | 38.5 | 43.3 | NaN | 4.459897 | 5 | MEA | NaN |
1 | 2.0 | 1900 | 7 | 12 | NaN | 40.28 | 43.1 | NaN | 4.459897 | 5.9 | ULM | NaN |
2 | 3.0 | 1901 | 4 | 3 | 57.0 | 40 | 44.3 | 15 | 4.459897 | 3.9 | MEA | NaN |
3 | 4.0 | 1902 | NaN | NaN | NaN | 39 | 43.3 | NaN | 4.459897 | 5 | MEA | NaN |
4 | 5.0 | 1902 | 2 | 13 | 93906.0 | 40.72 | 48.71 | 33 | 6.000000 | NaN | B77 | NaN |
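Mean imputation on a toy series (made-up magnitudes): the filled series keeps the original mean, but its variance shrinks, which is worth remembering before any statistical analysis:

```python
import numpy as np
import pandas as pd

# Made-up magnitudes with two gaps; the column mean here is 5.0.
mb = pd.Series([4.0, np.nan, 6.0, np.nan])
filled = mb.fillna(mb.mean())
print(filled.tolist())  # [4.0, 5.0, 6.0, 5.0]
```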
Further work can be done on this dataframe.
Json files (Bonus)
In this part, we will use the requests library to gather search results from a web API that returns JSON.
import requests
import json
query = {'q': 'playstation 5', 'sort': 'popularity', 'size' : 50}
response = requests.get("https://api.torob.com/v4/base-product/search/", params=query)
response_json = response.json()
response_json.keys()
dict_keys(['results', 'count', 'max_price', 'min_price', 'next', 'previous', 'spellcheck', 'available_filters', 'filter_by_category_title', 'categories', 'parent_categories'])
response_df = pd.json_normalize(response_json['results'])
response_df.head()
image_url | discount_info | random_key | name1 | name2 | more_info_url | web_client_absolute_url | price | price_text | price_text_mode | shop_text | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://storage.torob.com/backend-api/base/ima... | [] | 2e71cfe4-7b77-4e80-837b-de93b5fb95d7 | کنسول بازی سونی PlayStation 5 Digital پلی استی... | Playstation 5 Digital Editoin | https://api.torob.com/v4/base-product/details-... | /p/2e71cfe4-7b77-4e80-837b-de93b5fb95d7/%DA%A9... | 20120000 | از ۲۰٫۱۲۰٫۰۰۰ تومان | active | در ۱۰۱ فروشگاه |
1 | https://storage.torob.com/backend-api/base/ima... | [] | 9825c33c-07e6-4ac7-8bd6-12015ebb3e3c | کنسول بازی سونی PlayStation 5 Standard پلی اس... | Playstation 5 Standard Editoin | https://api.torob.com/v4/base-product/details-... | /p/9825c33c-07e6-4ac7-8bd6-12015ebb3e3c/%DA%A9... | 21700000 | از ۲۱٫۷۰۰٫۰۰۰ تومان | active | در ۱۳۶ فروشگاه |
2 | https://pcjanebi.com/wp-content/uploads/2021/0... | [] | 9687bb29-671b-42e4-986b-5f56ca3ade59 | باندل کنسول بازی سونی PlayStation 5 Standard پ... | Standard PlayStation 5 bundle+DualSense+Pulse+... | https://api.torob.com/v4/base-product/details-... | /p/9687bb29-671b-42e4-986b-5f56ca3ade59/%D8%A8... | 29990000 | از ۲۹٫۹۹۰٫۰۰۰ تومان | active | در ۵ فروشگاه |
3 | https://storage.torob.com/backend-api/base/ima... | [] | 3d19000e-b1a3-48de-8ef2-70c23c89b534 | پلی استیشن 5 دیجیتال به همراه بازی باندل 2 دسته | https://api.torob.com/v4/base-product/details-... | /p/3d19000e-b1a3-48de-8ef2-70c23c89b534/%D9%BE... | 23000000 | ۲۳٫۰۰۰٫۰۰۰ تومان | active | در مرکزخریددات کام | |
4 | https://storage.torob.com/backend-api/base/ima... | [] | 88d90bf1-b3c9-4a6f-8abc-0605f0e8a008 | کنسول بازی سونی پلی استیشن 5 ظرفیت 825 گیگابایت | https://api.torob.com/v4/base-product/details-... | /p/88d90bf1-b3c9-4a6f-8abc-0605f0e8a008/%DA%A9... | 21299000 | ۲۱٫۲۹۹٫۰۰۰ تومان | active | در ژیوان |
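json_normalize flattens nested JSON into columns, joining nested keys with dots. A self-contained sketch on made-up records (independent of the live API above):

```python
import pandas as pd

# Made-up records mimicking an API response with a nested 'price' object.
records = [
    {'name': 'console A', 'price': {'amount': 100, 'currency': 'IRR'}},
    {'name': 'console B', 'price': {'amount': 200, 'currency': 'IRR'}},
]
flat = pd.json_normalize(records)
print(sorted(flat.columns))  # nested keys become 'price.amount', 'price.currency'
```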