Data Wrangling

The process of transforming raw data into a more appropriate format; in the end, the result should have acceptable data quality.

Contents included in this notebook:

  • Data Exploration
  • Data Cleaning
  • Merging Datasets

This notebook also introduces Pandas, which facilitates these manipulation tasks.

Datasets:

  • Samsung after-sales service in Iran

from os.path import join
import numpy as np
import pandas as pd
%pdb 1
Automatic pdb calling has been turned ON

Data Exploration

The purpose of this part is to gain a better understanding of the data. What is the data about? What is its structure? What values does it contain? …

Let’s read the Samsung after-sales service dataset into a DataFrame.

service_df = pd.read_csv(join('Datasets', 'service', 'samdata.csv'))
service_df
No Service_type Cost_Type Product_Group Serial_No City Product_Date Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
0 1 0 Normal Warranty AUDIO - HOME THEAT 2F/LAHFZ7E-IP-X Tehran NaN 2/5/2012 2/5/2012 1/21/2013 0 351 0 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Free Of Charge Tech15 0 0.00 0.000 0.000
1 2 1 Normal Warranty REFRIGERATOR-SBS AIP244AACA/4E7 Tehran NaN 7/7/2012 9/10/2012 1/31/2013 65 208 1 SHORT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 3437481.60 687486.320 2748885.280
2 3 1 Normal Warranty REFRIGERATOR-SBS JK-C4BBCAEKL72 Tehran NaN 7/7/2012 8/15/2012 1/31/2013 38 208 1 OPEN OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 4228756.56 845751.312 3383005.248
3 4 0 Out of Warranty MONITOR - TFT LCD IP2JO4/-A-47 Tehran NaN 8/7/2012 8/7/2012 1/13/2013 0 158 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0 0.00 0.000 0.000
4 5 0 Out of Warranty AUDIO - MP3 Player AU/BCQE2A/P Tehran NaN 8/7/2012 8/9/2012 1/7/2013 2 153 1 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Medium Tech07 120000 765488.60 177087.820 708381.680
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3131 3132 0 Normal Warranty LCD-40 IP2JRIP/QE2KLKL REZVANSHAHR NaN 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420000 0.00 84000.000 336000.000
3132 3133 0 Normal Warranty LED46 IP2JRC/&/2/IP ESFAHAN NaN 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420000 0.00 84000.000 336000.000
3133 3134 0 Extended Warranty HARD DISK DRIVE S2/BJ2/B2KL4Z2 Tehran 40552.0 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Minimum Tech07 50000 0.00 10000.000 40000.000
3134 3135 0 Out of Warranty HARD DISK DRIVE S2/BJ2/SC2/QE2 Tehran 38825.0 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Free Of Charge Tech07 0 0.00 0.000 0.000
3135 3136 0 Out of Warranty DVD PLAYER/ RECORD A&QEKLCDQAEA7AZ Tehran 38448.0 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS MECHANICAL PARTS REPLACEMENT Minor Tech18 150000 1488482.00 328686.400 1318785.600

3136 rows × 22 columns

The shape attribute indicates how large the resulting DataFrame is.

service_df.shape
(3136, 22)
print("pandas only displays " + str(pd.options.display.max_rows) + " rows" ) 
print("and " + str(pd.options.display.max_columns) + " columns at max by default" ) 
pandas only displays 60 rows
and None columns at max by default

The dataset is not that big, and it has only a couple more columns than pandas displays by default, so let’s tell pandas to show all columns for convenience.

 pd.set_option('display.max_columns', None)

Use loc[] and iloc[] to select rows in a specific range; loc is label-based, while iloc is position-based.

service_df.iloc[26:33:3]
No Service_type Cost_Type Product_Group Serial_No City Product_Date Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
26 27 0 Normal Warranty DIGITAL CAMERA AAKLMCN/CIPE/GT Tehran NaN 11/22/2012 11/22/2012 1/26/2013 0 65 1 DEFORMED OUTCASE CRACK ELECTRICAL PARTS REPLACEMENT Major Tech11 250000 0.0 50000.0 200000.0
29 30 0 Normal Warranty TFT18 H2MQ2/2/472 Tehran NaN 12/1/2012 12/1/2012 1/6/2013 0 36 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech16 0 0.0 0.0 0.0
32 33 0 Normal Warranty TFT18 H2MQ2/AE27 Tehran NaN 12/2/2012 12/2/2012 1/6/2013 0 35 1 WORN-OUT NO POWER ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech16 0 0.0 0.0 0.0
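The difference between the two indexers is easy to see on a toy frame (the frame and values below are illustrative, not from the dataset): iloc slices by position and excludes the endpoint, while loc slices by label and includes it.

```python
import pandas as pd

# Toy frame with a non-default index to show the difference between the indexers.
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[101, 102, 103, 104])

# iloc is positional: the rows at positions 1 and 2 (endpoint excluded).
by_position = df.iloc[1:3]

# loc is label-based, and the slice includes both endpoints.
by_label = df.loc[102:103]

assert by_position.equals(by_label)
```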

sample(n) returns n random rows to look at.

service_df.sample(13)
No Service_type Cost_Type Product_Group Serial_No City Product_Date Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
130 131 0 Normal Warranty LED40 IP2JQC/KLIP2IPIP Tehran NaN 12/18/2012 12/18/2012 1/30/2013 0 43 1 WORN-OUT NO PICTURE/NO RASTER SET EXCHANGE Free Of Charge Tech23 0 0.00 0.000 0.000000e+00
1807 1808 0 Out of Warranty MONITOR - TFT LCD H2MPA/2KL74/ Tehran 38083.0 1/13/2013 1/14/2013 1/23/2013 1 10 0 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech16 0 0.00 0.000 0.000000e+00
2971 2972 1 Normal Warranty LED40 IP2JQA/&427A Tehran NaN 1/27/2013 1/27/2013 1/28/2013 0 1 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech15 420000 14086824.28 2803384.856 1.161354e+07
3043 3044 0 Out of Warranty DIGITAL CAMERA AAKLMCN/CAE/LAJ Tehran 40808.0 1/28/2013 1/28/2013 1/30/2013 0 2 1 DEFORMED OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech11 250000 584648.70 166828.740 6.677189e+05
2331 2332 0 Normal Warranty LCD-40 IP2JRA/72A/KL Tehran NaN 1/20/2013 1/20/2013 1/21/2013 0 1 1 WORN-OUT OTHER PROBLEMS SET EXCHANGE Free Of Charge Tech14 0 0.00 0.000 0.000000e+00
101 102 0 Out of Warranty MONITOR - TFT LCD H2MM2/A/4IPKL Tehran 38081.0 12/16/2012 12/17/2012 1/5/2013 1 20 0 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech16 0 0.00 0.000 0.000000e+00
3096 3097 0 Normal Warranty LED40 IP2JQA/&KL/4/ Tehran NaN 1/28/2013 1/28/2013 1/31/2013 0 3 1 WORN-OUT NO PICTURE/NO RASTER SET EXCHANGE Free Of Charge Tech23 0 0.00 0.000 0.000000e+00
2181 2182 0 Normal Warranty DVD PLAYER ZT2QARDBC/2&IP2Z Tehran NaN 1/19/2013 1/19/2013 1/20/2013 0 1 1 WORN-OUT READ ERROR ELECTRICAL PARTS REPLACEMENT Major Tech18 170000 1840063.68 402012.736 1.608051e+06
2141 2142 0 Normal Warranty LED40 IP2JQA/&QE4/ Tehran NaN 1/19/2013 1/19/2013 1/21/2013 0 2 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Minor Tech14 215000 0.00 43000.000 1.720000e+05
935 936 0 Normal Warranty LCD-40 IP2JRA/7/7&IP Tehran NaN 1/5/2013 1/5/2013 1/6/2013 0 1 1 WORN-OUT OTHER PROBLEMS SET EXCHANGE Free Of Charge Tech08 0 0.00 0.000 0.000000e+00
1698 1699 0 Normal Warranty LCD-40 IP2JQA/&2772 Tehran NaN 1/10/2013 1/19/2013 1/20/2013 8 10 1 WORN-OUT OTHER PROBLEMS SET EXCHANGE Free Of Charge Tech13 0 0.00 0.000 0.000000e+00
343 344 1 Normal Warranty LED40 IP2JR2/-2/IP4 Tehran NaN 12/27/2012 12/31/2012 1/1/2013 4 5 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech10 420000 24887810.32 5063582.064 2.025433e+07
1875 1876 1 Normal Warranty LCD- 46 IP2JQ7/42/42 Tehran NaN 1/14/2013 1/17/2013 1/20/2013 3 6 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0 0.00 0.000 0.000000e+00

The dataset itself has a column of index values. Let’s make sure it is really unique for each row; then we can use the No column as the index.

service_df['No'].is_unique
True
service_df.set_index("No", inplace=True)
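The same pattern, sketched on a small illustrative frame: check uniqueness first, then promote the column to the index.

```python
import pandas as pd

# Illustrative frame; the real dataset's No column plays the same role.
df = pd.DataFrame({"No": [1, 2, 3], "City": ["Tehran", "Karaj", "Qom"]})

# Only promote a column to the index once we know it has no duplicates;
# set_index itself does not check uniqueness unless verify_integrity=True is passed.
if df["No"].is_unique:
    df = df.set_index("No")

assert df.index.is_unique
```

Passing verify_integrity=True to set_index is an alternative that raises an error on duplicate labels instead of requiring a separate check.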

head(n) returns the first n rows and tail(n) returns the last n rows; n defaults to 5.

service_df.head()
Service_type Cost_Type Product_Group Serial_No City Product_Date Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
No
1 0 Normal Warranty AUDIO - HOME THEAT 2F/LAHFZ7E-IP-X Tehran NaN 2/5/2012 2/5/2012 1/21/2013 0 351 0 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Free Of Charge Tech15 0 0.00 0.000 0.000
2 1 Normal Warranty REFRIGERATOR-SBS AIP244AACA/4E7 Tehran NaN 7/7/2012 9/10/2012 1/31/2013 65 208 1 SHORT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 3437481.60 687486.320 2748885.280
3 1 Normal Warranty REFRIGERATOR-SBS JK-C4BBCAEKL72 Tehran NaN 7/7/2012 8/15/2012 1/31/2013 38 208 1 OPEN OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 4228756.56 845751.312 3383005.248
4 0 Out of Warranty MONITOR - TFT LCD IP2JO4/-A-47 Tehran NaN 8/7/2012 8/7/2012 1/13/2013 0 158 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0 0.00 0.000 0.000
5 0 Out of Warranty AUDIO - MP3 Player AU/BCQE2A/P Tehran NaN 8/7/2012 8/9/2012 1/7/2013 2 153 1 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Medium Tech07 120000 765488.60 177087.820 708381.680

The columns attribute returns all column names.

service_df.columns
Index(['Service_type', 'Cost_Type', 'Product_Group', 'Serial_No', 'City',
       'Product_Date', 'Receipt_Date', 'Appoint_Date', 'Complete_Date',
       'TAT01', 'TAT02', 'Job_Satus', 'Defect_Des', 'Symptom_Desc',
       'Repair_Action_Desc', 'Labor_Charge_Desc', 'Engineer',
       'Labor_Charge_Amount', 'Parts_Amount', 'Discount_Amount',
       'Total_Invoice_Amount'],
      dtype='object')

dtypes shows each column’s data type.

service_df.dtypes
Service_type              int64
Cost_Type                object
Product_Group            object
Serial_No                object
City                     object
Product_Date            float64
Receipt_Date             object
Appoint_Date             object
Complete_Date            object
TAT01                     int64
TAT02                     int64
Job_Satus                 int64
Defect_Des               object
Symptom_Desc             object
Repair_Action_Desc       object
Labor_Charge_Desc        object
Engineer                 object
Labor_Charge_Amount       int64
Parts_Amount            float64
Discount_Amount         float64
Total_Invoice_Amount    float64
dtype: object

The info() function summarizes the dataset.

service_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3136 entries, 1 to 3136
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Service_type          3136 non-null   int64  
 1   Cost_Type             3136 non-null   object 
 2   Product_Group         3126 non-null   object 
 3   Serial_No             3127 non-null   object 
 4   City                  3136 non-null   object 
 5   Product_Date          724 non-null    float64
 6   Receipt_Date          3136 non-null   object 
 7   Appoint_Date          3136 non-null   object 
 8   Complete_Date         3136 non-null   object 
 9   TAT01                 3136 non-null   int64  
 10  TAT02                 3136 non-null   int64  
 11  Job_Satus             3136 non-null   int64  
 12  Defect_Des            3136 non-null   object 
 13  Symptom_Desc          3136 non-null   object 
 14  Repair_Action_Desc    3136 non-null   object 
 15  Labor_Charge_Desc     3136 non-null   object 
 16  Engineer              3136 non-null   object 
 17  Labor_Charge_Amount   3136 non-null   int64  
 18  Parts_Amount          3136 non-null   float64
 19  Discount_Amount       3136 non-null   float64
 20  Total_Invoice_Amount  3136 non-null   float64
dtypes: float64(4), int64(5), object(12)
memory usage: 539.0+ KB

For numerical columns, the describe() function shows summary statistics.

service_df.describe()
Service_type Product_Date TAT01 TAT02 Job_Satus Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
count 3136.000000 724.000000 3136.000000 3136.000000 3136.000000 3136.000000 3.136000e+03 3.136000e+03 3.136000e+03
mean 0.105230 39227.758287 1.289222 6.647959 0.977360 129877.232143 2.303481e+06 4.844214e+05 1.924006e+06
std 0.306898 1152.746282 5.060422 14.012700 0.148778 176843.065376 6.682868e+06 1.349372e+06 5.400547e+06
min 0.000000 33870.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
25% 0.000000 38451.000000 0.000000 1.000000 1.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
50% 0.000000 38823.000000 0.000000 3.000000 1.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
75% 0.000000 40188.000000 1.000000 6.000000 1.000000 250000.000000 7.510464e+04 8.738318e+04 3.827746e+05
max 1.000000 40813.000000 81.000000 351.000000 1.000000 540000.000000 4.480716e+07 8.088432e+06 3.635773e+07
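A minimal illustration of describe() on a toy frame (values are made up): numeric columns are summarized by default, and include='object' summarizes the categorical ones instead.

```python
import pandas as pd

df = pd.DataFrame({"TAT01": [0, 0, 1, 5], "City": ["A", "B", "A", "C"]})

# By default, only numeric columns are described.
stats = df.describe()
assert "City" not in stats.columns

# include='object' summarizes categoricals: count, unique, top, freq.
obj_stats = df.describe(include="object")
```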

Columns can be accessed as attributes of the DataFrame. Each column is in fact a Series, and we can count the occurrences of each value using the value_counts() function.

service_df.Product_Group.value_counts()
LED40                 1008
LCD-40                 529
DIGITAL CAMERA         282
LED46                  250
PRINTER - MONO LAS     151
DVD PLAYER             119
AUDIO - MP3 Player     109
HARD DISK DRIVE        102
AUDIO - HOME THEAT     102
REFRIGERATOR-SBS        68
DVD PLAYER/ RECORD      65
PLASMA-42               62
CAMCORDER               57
PLASMA-43               49
PRINTER - COLOR LA      35
MONITOR - TFT LCD       24
VACUUM CLEANER          24
TFT18                   18
WASHING MACHINE         16
PLASMA-51               15
LED55                   15
OVEN - MICROWAVE        11
LCD- 46                  8
PLASMA-50                4
AIR CONDITIONER          3
Name: Product_Group, dtype: int64
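On a small illustrative Series, value_counts() looks like this; the normalize=True variant returns relative frequencies instead of raw counts.

```python
import pandas as pd

s = pd.Series(["LED40", "LCD-40", "LED40", "LED40", "LCD-40"])

# Counts are sorted by frequency, descending.
counts = s.value_counts()

# Relative frequencies rather than raw counts.
shares = s.value_counts(normalize=True)
```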

We can also group rows by one or more columns and get the size of each group.

service_df.groupby(by =["Product_Group", "Cost_Type"]).size()
Product_Group       Cost_Type        
AIR CONDITIONER     Normal Warranty         2
                    Out of Warranty         1
AUDIO - HOME THEAT  Normal Warranty         3
                    Out of Warranty        99
AUDIO - MP3 Player  Normal Warranty        28
                    Out of Warranty        81
CAMCORDER           Normal Warranty         5
                    Out of Warranty        52
DIGITAL CAMERA      Normal Warranty       149
                    Out of Warranty       133
DVD PLAYER          Normal Warranty        53
                    Out of Warranty        66
DVD PLAYER/ RECORD  Extended Warranty      11
                    Normal Warranty        18
                    Out of Warranty        36
HARD DISK DRIVE     Extended Warranty      63
                    Normal Warranty        14
                    Out of Warranty        25
LCD- 46             Normal Warranty         5
                    Out of Warranty         3
LCD-40              Normal Warranty       522
                    Out of Warranty         7
LED40               Normal Warranty      1003
                    Out of Warranty         5
LED46               Normal Warranty       247
                    Out of Warranty         3
LED55               Normal Warranty        14
                    Out of Warranty         1
MONITOR - TFT LCD   Normal Warranty         5
                    Out of Warranty        19
OVEN - MICROWAVE    Normal Warranty         3
                    Out of Warranty         8
PLASMA-42           Normal Warranty        55
                    Out of Warranty         7
PLASMA-43           Normal Warranty        45
                    Out of Warranty         4
PLASMA-50           Normal Warranty         3
                    Out of Warranty         1
PLASMA-51           Normal Warranty        12
                    Out of Warranty         3
PRINTER - COLOR LA  Normal Warranty        14
                    Out of Warranty        21
PRINTER - MONO LAS  Normal Warranty        57
                    Out of Warranty        94
REFRIGERATOR-SBS    Normal Warranty        32
                    Out of Warranty        36
TFT18               Extended Warranty       2
                    Normal Warranty        15
                    Out of Warranty         1
VACUUM CLEANER      Normal Warranty         1
                    Out of Warranty        23
WASHING MACHINE     Out of Warranty        16
dtype: int64

The result of the command above is in fact a Series with a MultiIndex. Using unstack(), we reshape the Series into a DataFrame.

service_df.groupby(by =["Product_Group", "Cost_Type"]).size().unstack()
Cost_Type Extended Warranty Normal Warranty Out of Warranty
Product_Group
AIR CONDITIONER NaN 2.0 1.0
AUDIO - HOME THEAT NaN 3.0 99.0
AUDIO - MP3 Player NaN 28.0 81.0
CAMCORDER NaN 5.0 52.0
DIGITAL CAMERA NaN 149.0 133.0
DVD PLAYER NaN 53.0 66.0
DVD PLAYER/ RECORD 11.0 18.0 36.0
HARD DISK DRIVE 63.0 14.0 25.0
LCD- 46 NaN 5.0 3.0
LCD-40 NaN 522.0 7.0
LED40 NaN 1003.0 5.0
LED46 NaN 247.0 3.0
LED55 NaN 14.0 1.0
MONITOR - TFT LCD NaN 5.0 19.0
OVEN - MICROWAVE NaN 3.0 8.0
PLASMA-42 NaN 55.0 7.0
PLASMA-43 NaN 45.0 4.0
PLASMA-50 NaN 3.0 1.0
PLASMA-51 NaN 12.0 3.0
PRINTER - COLOR LA NaN 14.0 21.0
PRINTER - MONO LAS NaN 57.0 94.0
REFRIGERATOR-SBS NaN 32.0 36.0
TFT18 2.0 15.0 1.0
VACUUM CLEANER NaN 1.0 23.0
WASHING MACHINE NaN NaN 16.0
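The groupby/unstack pipeline, sketched on a toy frame (values are illustrative): combinations that never occur become NaN after unstacking, just as in the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Group": ["LED40", "LED40", "LCD-40", "LCD-40", "LCD-40"],
    "Cost_Type": ["Normal Warranty", "Out of Warranty",
                  "Normal Warranty", "Normal Warranty", "Normal Warranty"],
})

# Grouping by two columns produces a Series with a MultiIndex.
sizes = df.groupby(["Product_Group", "Cost_Type"]).size()
assert isinstance(sizes.index, pd.MultiIndex)

# unstack() pivots the inner index level into columns;
# missing group combinations become NaN.
table = sizes.unstack()
```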

Let’s count the NaN values in each column.

service_df.isna().sum()
Service_type               0
Cost_Type                  0
Product_Group             10
Serial_No                  9
City                       0
Product_Date            2412
Receipt_Date               0
Appoint_Date               0
Complete_Date              0
TAT01                      0
TAT02                      0
Job_Satus                  0
Defect_Des                 0
Symptom_Desc               0
Repair_Action_Desc         0
Labor_Charge_Desc          0
Engineer                   0
Labor_Charge_Amount        0
Parts_Amount               0
Discount_Amount            0
Total_Invoice_Amount       0
dtype: int64

Let’s see the rows with NaN values.

service_df[service_df[['Product_Group','Serial_No']].isna().any(axis=1)]
Service_type Cost_Type Product_Group Serial_No City Product_Date Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
No
412 0 Out of Warranty NaN 2P-/BAXS-E/KLKLK Tehran 38818.0 12/29/2012 12/31/2012 1/8/2013 2 10 1 WORN-OUT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech16 0 0.00 0.000 0.000
542 1 Normal Warranty PRINTER - MONO LAS NaN Tehran NaN 12/31/2012 12/31/2012 1/1/2013 0 1 1 WORN-OUT OTHER PROBLEMS OTHER REPAIR Maximum Tech20 350000 0.00 70000.000 280000.000
626 0 Out of Warranty NaN ZADCKLVKSAE2KL2K MASHHAD 38814.0 1/1/2013 1/8/2013 1/13/2013 7 12 1 DEFORMED Test OK No Action Free Of Charge Tech08 0 0.00 0.000 0.000
704 0 Out of Warranty NaN A/&2GV2B&E/KL2K Tehran 40548.0 1/1/2013 1/2/2013 1/5/2013 1 4 1 Software Bug ZOOM PROBLEM FIRMWARE UPGRADE Major Tech08 250000 0.00 50000.000 200000.000
876 0 Out of Warranty VACUUM CLEANER NaN Tehran NaN 1/2/2013 1/6/2013 1/6/2013 4 4 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech17 0 0.00 0.000 0.000
934 0 Normal Warranty NaN ZTAPKLCABA/2IP24L Tehran NaN 1/5/2013 1/5/2013 1/5/2013 0 0 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Major Tech12 0 1158705.72 231741.144 826864.576
1059 0 Normal Warranty NaN ZTAPKLCIBAE&&/D Tehran NaN 1/5/2013 1/6/2013 1/7/2013 1 2 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Medium Tech12 0 1158705.72 231741.144 826864.576
1207 0 Normal Warranty NaN ZTAPKLCIBB/&IP47A Tehran NaN 1/6/2013 1/6/2013 1/10/2013 0 4 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech14 170000 1158705.72 265741.144 1062864.576
1213 0 Out of Warranty NaN 2P7IPBAKPQEKL27/N Tehran 38080.0 1/6/2013 1/6/2013 1/7/2013 0 1 1 WORN-OUT Test OK No Action Free Of Charge Tech20 0 0.00 0.000 0.000
1235 0 Out of Warranty VACUUM CLEANER NaN Tehran NaN 1/6/2013 1/8/2013 1/9/2013 2 3 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Minor Tech17 145000 15526.44 32105.288 128421.152
1550 0 Normal Warranty NaN ZTAPKLCIB2/A/7KLZ Tehran NaN 1/9/2013 1/9/2013 1/9/2013 0 0 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Medium Tech12 0 1158705.72 231741.144 826864.576
1571 0 Normal Warranty NaN ZTAPKLCIBA/A4QEP ARDEBIL NaN 1/9/2013 1/9/2013 1/27/2013 0 18 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech18 0 0.00 0.000 0.000
1617 0 Out of Warranty NaN A/&KLKLVBB&E/2AN Tehran 40548.0 1/9/2013 1/10/2013 1/13/2013 1 4 1 DEFORMED NO POWER ELECTRICAL PARTS REPLACEMENT Major Tech08 250000 0.00 50000.000 200000.000
1814 0 Out of Warranty DVD PLAYER NaN Tehran NaN 1/14/2013 1/14/2013 1/26/2013 0 12 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Minor Tech14 120000 868118.56 187823.812 781285.648
2032 0 Out of Warranty DIGITAL CAMERA NaN Tehran NaN 1/15/2013 1/15/2013 1/21/2013 0 6 1 DEFORMED OTHER PROBLEMS MECHANICAL PARTS REPLACEMENT Free Of Charge Tech11 0 0.00 0.000 0.000
2036 0 Out of Warranty VACUUM CLEANER NaN Tehran NaN 1/16/2013 1/20/2013 1/20/2013 4 4 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech17 0 18858.40 3871.880 15887.520
2269 0 Out of Warranty AUDIO - HOME THEAT NaN Tehran NaN 1/20/2013 1/20/2013 1/20/2013 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech18 0 1450087.28 280018.456 1160077.824
2772 0 Out of Warranty DVD PLAYER NaN Tehran NaN 1/24/2013 1/24/2013 1/26/2013 0 2 1 WORN-OUT OTHER PROBLEMS MECHANICAL PARTS REPLACEMENT Minimum Tech18 70000 868118.56 187823.812 751285.648
2892 0 Out of Warranty DIGITAL CAMERA NaN ABHAR NaN 1/27/2013 1/27/2013 1/28/2013 0 1 1 DEFORMED OTHER PROBLEMS SCRAP Free Of Charge Tech11 0 0.00 0.000 0.000
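Both NaN checks, demonstrated on a small made-up frame: isna().sum() counts missing values per column, and isna().any(axis=1) flags rows with a NaN in any of the selected columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product_Group": ["LED40", np.nan, "DVD PLAYER"],
    "Serial_No": ["A1", "B2", np.nan],
    "City": ["Tehran", "Karaj", "Qom"],
})

# Per-column NaN counts.
na_counts = df.isna().sum()

# Rows where any of the listed columns is NaN; axis=1 evaluates across columns.
flagged = df[df[["Product_Group", "Serial_No"]].isna().any(axis=1)]
```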

Data Cleaning

In this part, based on what we want to do with the data, we restructure, reformat, and validate it.

For a start, let’s remove columns that are not useful for our hypothetical purpose.

  • Does Serial_No really help a model decide how much time a repair takes?
  • Are the Product_Date values meaningful?

If the answers to these questions are no, we can drop those columns.

service_df = service_df.drop(columns=['Product_Date', 'Serial_No'])
service_df
Service_type Cost_Type Product_Group City Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
No
1 0 Normal Warranty AUDIO - HOME THEAT Tehran 2/5/2012 2/5/2012 1/21/2013 0 351 0 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Free Of Charge Tech15 0 0.00 0.000 0.000
2 1 Normal Warranty REFRIGERATOR-SBS Tehran 7/7/2012 9/10/2012 1/31/2013 65 208 1 SHORT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 3437481.60 687486.320 2748885.280
3 1 Normal Warranty REFRIGERATOR-SBS Tehran 7/7/2012 8/15/2012 1/31/2013 38 208 1 OPEN OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0 4228756.56 845751.312 3383005.248
4 0 Out of Warranty MONITOR - TFT LCD Tehran 8/7/2012 8/7/2012 1/13/2013 0 158 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0 0.00 0.000 0.000
5 0 Out of Warranty AUDIO - MP3 Player Tehran 8/7/2012 8/9/2012 1/7/2013 2 153 1 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Medium Tech07 120000 765488.60 177087.820 708381.680
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3132 0 Normal Warranty LCD-40 REZVANSHAHR 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420000 0.00 84000.000 336000.000
3133 0 Normal Warranty LED46 ESFAHAN 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420000 0.00 84000.000 336000.000
3134 0 Extended Warranty HARD DISK DRIVE Tehran 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Minimum Tech07 50000 0.00 10000.000 40000.000
3135 0 Out of Warranty HARD DISK DRIVE Tehran 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Free Of Charge Tech07 0 0.00 0.000 0.000
3136 0 Out of Warranty DVD PLAYER/ RECORD Tehran 1/31/2013 1/31/2013 1/31/2013 0 0 1 WORN-OUT OTHER PROBLEMS MECHANICAL PARTS REPLACEMENT Minor Tech18 150000 1488482.00 328686.400 1318785.600

3136 rows × 19 columns

The dataset at hand has some NaN values. What do they mean? Does the value exist but was not provided, or is there simply nothing to fill that field?

Possible solutions:

  • drop data
    • drop the whole row
    • drop the whole column
  • imputation
    • replace it with the mean
    • replace it with the most frequent value
    • add an unknown category

It’s important to have a logical reason for the choice; for example, it’s not reasonable to fill NaN values with the mean when NaN occupies half of the column.
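Two of the imputation strategies above, sketched on a toy frame (values are made up): the mean for a numeric column, the most frequent value for a categorical one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TAT01": [0, 2, np.nan, 4],
    "Product_Group": ["LED40", np.nan, "LED40", "LCD-40"],
})

# Numeric column: replace NaN with the column mean.
df["TAT01"] = df["TAT01"].fillna(df["TAT01"].mean())

# Categorical column: replace NaN with the most frequent value (mode);
# an explicit "unknown" category would be the other option.
df["Product_Group"] = df["Product_Group"].fillna(df["Product_Group"].mode()[0])
```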

Where are the NaN values?

service_df.isna().sum()
Service_type             0
Cost_Type                0
Product_Group           10
City                     0
Receipt_Date             0
Appoint_Date             0
Complete_Date            0
TAT01                    0
TAT02                    0
Job_Satus                0
Defect_Des               0
Symptom_Desc             0
Repair_Action_Desc       0
Labor_Charge_Desc        0
Engineer                 0
Labor_Charge_Amount      0
Parts_Amount             0
Discount_Amount          0
Total_Invoice_Amount     0
dtype: int64

Let’s see rows with NaN values.

service_df[pd.isna(service_df.Product_Group)]
Service_type Cost_Type Product_Group City Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
No
412 0 Out of Warranty NaN Tehran 12/29/2012 12/31/2012 1/8/2013 2 10 1 WORN-OUT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech16 0 0.00 0.000 0.000
626 0 Out of Warranty NaN MASHHAD 1/1/2013 1/8/2013 1/13/2013 7 12 1 DEFORMED Test OK No Action Free Of Charge Tech08 0 0.00 0.000 0.000
704 0 Out of Warranty NaN Tehran 1/1/2013 1/2/2013 1/5/2013 1 4 1 Software Bug ZOOM PROBLEM FIRMWARE UPGRADE Major Tech08 250000 0.00 50000.000 200000.000
934 0 Normal Warranty NaN Tehran 1/5/2013 1/5/2013 1/5/2013 0 0 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Major Tech12 0 1158705.72 231741.144 826864.576
1059 0 Normal Warranty NaN Tehran 1/5/2013 1/6/2013 1/7/2013 1 2 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Medium Tech12 0 1158705.72 231741.144 826864.576
1207 0 Normal Warranty NaN Tehran 1/6/2013 1/6/2013 1/10/2013 0 4 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech14 170000 1158705.72 265741.144 1062864.576
1213 0 Out of Warranty NaN Tehran 1/6/2013 1/6/2013 1/7/2013 0 1 1 WORN-OUT Test OK No Action Free Of Charge Tech20 0 0.00 0.000 0.000
1550 0 Normal Warranty NaN Tehran 1/9/2013 1/9/2013 1/9/2013 0 0 1 WORN-OUT READ ERROR MECHANICAL PARTS REPLACEMENT Medium Tech12 0 1158705.72 231741.144 826864.576
1571 0 Normal Warranty NaN ARDEBIL 1/9/2013 1/9/2013 1/27/2013 0 18 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech18 0 0.00 0.000 0.000
1617 0 Out of Warranty NaN Tehran 1/9/2013 1/10/2013 1/13/2013 1 4 1 DEFORMED NO POWER ELECTRICAL PARTS REPLACEMENT Major Tech08 250000 0.00 50000.000 200000.000

Based on the task you are about to perform once the data is prepared, decide whether dropping rows is reasonable. If 10 rows out of 3136 is not a lot, we can drop them.

service_df.dropna(inplace=True)
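dropna() also accepts a subset= argument that restricts the NaN check to specific columns; a small illustrative sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product_Group": ["LED40", np.nan, "DVD PLAYER"],
    "City": ["Tehran", "Karaj", np.nan],
})

# Default dropna() removes every row containing at least one NaN;
# subset= limits the check to the named columns.
only_pg = df.dropna(subset=["Product_Group"])
all_cols = df.dropna()
```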

What cities are in the dataset?

print(sorted(service_df.City.unique()))
['ABADAN', 'ABHAR', 'AHVAZ', 'AMOL', 'ARAK', 'ARDEBI', 'ARDEBIL', 'BABOL', 'BABOLSAR', 'BANDAR ABBAS', 'BANDAR ANZALI', 'BANDARABBAS', 'BARDSIR', 'BEHBAHAN', 'BIJAR', 'BIRJAND', 'BOJNORD', 'BOKAN', 'BOMEHEN', 'BOROJERD', 'CHABAHAR', 'CHALOOS', 'CHALOUS', 'CHENARAN', 'DEZFOL', 'DEZFOUL', 'DOROD', 'ESFAHAN', 'ESFAHN', 'ESFERAYN', 'ESLAM ABAD', 'ESLAM SHAHR', 'ESLAMSHAHR', 'FARDIS', 'FARHANG', 'GARMSAR', 'GHAEM SHAHR', 'GHAEMSHAHR', 'GHARCHAK', 'GOLPAYGAN', 'GONBAD', 'GONBAD KAVOS', 'GORGAN', 'HAMEDAN', 'HASHTGERD', 'HSMEDAN', 'ILAM', 'IZEH', 'JAHROM', 'JEYROFT', 'KARAJ', 'KASHAN', 'KASHMAR', 'KERMAN', 'KERMANSHAH', 'KHOMAM', 'KHOMEIN', 'KHORAM ABAD', 'KHOY', 'KhoramAbad', 'LAHIJAN', 'LANGEROD', 'LANGEROOD', 'MAHMOUD ABAD', 'MAHSHAHR', 'MALAYER', 'MANJIL', 'MASHHAD', 'MEHRSHAHR', 'MIYANEH', 'NAHAVAND', 'NIYSHABOUR', 'NOR', 'NOSHAHR', 'NOUR', 'NOUSHAHR', 'OROMIYEH', 'PARS ABAD', 'QAZVIN', 'QOM', 'RAFSANJAN', 'RASHT', 'REZVANSHAHR', 'ROBAT KARIM', 'ROBATKARIM', 'RODSAR', 'SABZEVAR', 'SALMAS', 'SARAF ZADEH', 'SARBANDAR', 'SARI', 'SAVAD KOOH', 'SAVEH', 'SAVOJBOLAGH', 'SEMNAN', 'SHAHR GHODS', 'SHAHRE GHODS', 'SHAHRE KORD', 'SHAHRGHODS', 'SHAHROD', 'SHAHROOD', 'SHAHRYAR', 'SHAZAND', 'SHIRAZ', 'SIRJAN', 'TABAS', 'TABRIZ', 'TALESH', 'TEHRAN', 'TONEKABON', 'Tehran', 'VARAMIN', 'YAZD', 'ZAHEDAN', 'ZANJAN']

Look at the values above. What do you see? We have multiple spellings of the same city name:

  • ‘ARDEBI’, ‘ARDEBIL’
  • ‘BANDAR ABBAS’, ‘BANDARABBAS’
  • ‘CHALOOS’, ‘CHALOUS’
  • ‘DEZFOL’, ‘DEZFOUL’
  • ‘ESFAHAN’, ‘ESFAHN’
  • ‘ESLAMSHAHR’, ‘ESLAM SHAHR’
  • ‘GHAEM SHAHR’, ‘GHAEMSHAHR’
  • ‘HAMEDAN’, ‘HSMEDAN’
  • ‘KHORAM ABAD’, ‘KhoramAbad’
  • ‘LANGEROD’, ‘LANGEROOD’
  • ‘NOR’, ‘NOUR’
  • ‘NOSHAHR’, ‘NOUSHAHR’
  • ‘ROBAT KARIM’, ‘ROBATKARIM’
  • ‘SHAHR GHODS’, ‘SHAHRE GHODS’, ‘SHAHRGHODS’
  • ‘SHAHROD’, ‘SHAHROOD’
  • ‘TEHRAN’, ‘Tehran’

So we should transform them into a common format; this is data standardization.

city_dictionary = {
    'ARDEBIL': 'ARDEBI', 'BANDARABBAS': 'BANDAR ABBAS', 'CHALOUS': 'CHALOOS',
    'DEZFOUL': 'DEZFOL', 'ESFAHN': 'ESFAHAN', 'ESLAM SHAHR': 'ESLAMSHAHR',
    'GHAEMSHAHR': 'GHAEM SHAHR', 'HSMEDAN': 'HAMEDAN', 'KhoramAbad': 'KHORAM ABAD',
    'LANGEROOD': 'LANGEROD', 'NOUR': 'NOR', 'NOUSHAHR': 'NOSHAHR',
    'ROBATKARIM': 'ROBAT KARIM', 'SHAHRE GHODS': 'SHAHR GHODS',
    'SHAHRGHODS': 'SHAHR GHODS', 'SHAHROOD': 'SHAHROD', 'Tehran': 'TEHRAN'
}
service_df = service_df.replace({'City': city_dictionary})
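The column-scoped replace pattern, on a toy frame (the mapping below is a shortened, illustrative version of the one above):

```python
import pandas as pd

df = pd.DataFrame({"City": ["Tehran", "TEHRAN", "CHALOUS", "KARAJ"]})

# Nesting the mapping under the column name limits the substitution
# to that column; other columns are left untouched.
mapping = {"Tehran": "TEHRAN", "CHALOUS": "CHALOOS"}
df = df.replace({"City": mapping})
```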

Now let’s check whether the DataFrame has duplicate rows.

service_df.duplicated().any()
True

Unfortunately, it does. Since we’re treating the DataFrame as an analytical base table (ABT), it’s better to drop them.

service_df.drop_duplicates(inplace=True)
service_df.shape
(1971, 19)
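The duplicate check and removal, sketched on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() marks every repeat of an earlier row.
assert df.duplicated().any()

# keep='first' (the default) retains the first occurrence of each group.
deduped = df.drop_duplicates()
```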

Let’s look at the dtypes again; we can convert some columns to more suitable types.

service_df.dtypes
Service_type              int64
Cost_Type                object
Product_Group            object
City                     object
Receipt_Date             object
Appoint_Date             object
Complete_Date            object
TAT01                     int64
TAT02                     int64
Job_Satus                 int64
Defect_Des               object
Symptom_Desc             object
Repair_Action_Desc       object
Labor_Charge_Desc        object
Engineer                 object
Labor_Charge_Amount       int64
Parts_Amount            float64
Discount_Amount         float64
Total_Invoice_Amount    float64
dtype: object

Receipt_Date, Appoint_Date and Complete_Date are of object dtype; let’s convert them to datetime. Remember that we have already explored the data, so we know the dates use the %m/%d/%Y format.

service_df[["Receipt_Date", "Appoint_Date", "Complete_Date"]] = service_df[["Receipt_Date", "Appoint_Date", "Complete_Date"]].apply(lambda col: pd.to_datetime(col, format='%m/%d/%Y'))
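The same conversion on a small illustrative frame; the first pair of dates mirrors the first row of the dataset, and passing an explicit format avoids pandas having to guess whether the data is month-first or day-first.

```python
import pandas as pd

df = pd.DataFrame({
    "Receipt_Date": ["2/5/2012", "12/31/2012"],
    "Complete_Date": ["1/21/2013", "1/1/2013"],
})

# Convert both columns at once; format='%m/%d/%Y' matches the raw strings.
cols = ["Receipt_Date", "Complete_Date"]
df[cols] = df[cols].apply(lambda col: pd.to_datetime(col, format="%m/%d/%Y"))

# Datetime columns support arithmetic, e.g. turnaround time in days.
tat = (df["Complete_Date"] - df["Receipt_Date"]).dt.days
```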

Now let’s see all the values of each column with object dtype.

for column in service_df.select_dtypes(include=[object]).columns:
    print(column, sorted(service_df[column].unique()), sep="\n\t")
    print("-------------------------")
Cost_Type
	['Extended Warranty', 'Normal Warranty', 'Out of Warranty']
-------------------------
Product_Group
	['AIR CONDITIONER ', 'AUDIO - HOME THEAT', 'AUDIO - MP3 Player', 'CAMCORDER', 'DIGITAL CAMERA', 'DVD PLAYER', 'DVD PLAYER/ RECORD', 'HARD DISK DRIVE', 'LCD- 46', 'LCD-40', 'LED40', 'LED46', 'LED55', 'MONITOR - TFT LCD ', 'OVEN - MICROWAVE', 'PLASMA-42', 'PLASMA-43', 'PLASMA-50', 'PLASMA-51', 'PRINTER - COLOR LA', 'PRINTER - MONO LAS', 'REFRIGERATOR-SBS', 'TFT18', 'VACUUM CLEANER', 'WASHING MACHINE']
-------------------------
City
	['ABADAN', 'ABHAR', 'AHVAZ', 'AMOL', 'ARAK', 'ARDEBI', 'BABOL', 'BABOLSAR', 'BANDAR ABBAS', 'BANDAR ANZALI', 'BARDSIR', 'BEHBAHAN', 'BIJAR', 'BIRJAND', 'BOJNORD', 'BOKAN', 'BOMEHEN', 'BOROJERD', 'CHABAHAR', 'CHALOOS', 'CHENARAN', 'DEZFOL', 'DOROD', 'ESFAHAN', 'ESFERAYN', 'ESLAM ABAD', 'ESLAMSHAHR', 'FARDIS', 'FARHANG', 'GARMSAR', 'GHAEM SHAHR', 'GHARCHAK', 'GOLPAYGAN', 'GONBAD', 'GONBAD KAVOS', 'GORGAN', 'HASHTGERD', 'HSMEDAN', 'ILAM', 'IZEH', 'JAHROM', 'JEYROFT', 'KARAJ', 'KASHAN', 'KASHMAR', 'KERMAN', 'KERMANSHAH', 'KHOMAM', 'KHOMEIN', 'KHORAM ABAD', 'KHOY', 'LAHIJAN', 'LANGEROD', 'MAHMOUD ABAD', 'MAHSHAHR', 'MALAYER', 'MANJIL', 'MASHHAD', 'MEHRSHAHR', 'MIYANEH', 'NAHAVAND', 'NIYSHABOUR', 'NOR', 'NOSHAHR', 'OROMIYEH', 'PARS ABAD', 'QAZVIN', 'QOM', 'RAFSANJAN', 'RASHT', 'REZVANSHAHR', 'ROBAT KARIM', 'RODSAR', 'SABZEVAR', 'SALMAS', 'SARAF ZADEH', 'SARBANDAR', 'SARI', 'SAVAD KOOH', 'SAVEH', 'SAVOJBOLAGH', 'SEMNAN', 'SHAHR GHODS', 'SHAHRE KORD', 'SHAHROD', 'SHAHRYAR', 'SHAZAND', 'SHIRAZ', 'SIRJAN', 'TABAS', 'TABRIZ', 'TALESH', 'TEHRAN', 'TONEKABON', 'VARAMIN', 'YAZD', 'ZAHEDAN', 'ZANJAN']
-------------------------
Defect_Des
	['COLD SOLDER', 'DEFORMED', 'DIRTY', 'DRY', 'LEAKING', 'OPEN', 'SHORT', 'Software Bug', 'WORN-OUT', 'WRONG PARTS']
-------------------------
Symptom_Desc
	['BEATING ON PICTURE', 'BUTTON/DIAL NOT WORKING', 'Broken External Flash', 'Cosmetic replacement(customer request)', 'DISPLAY DIM', 'Defective Date Mode', 'Defective Eye Cup', 'Defective LCD Touch Screen', 'Defective Usb Cable', 'EXCESSIVE COLOR', 'Exterior Problem/damage', 'FAULTY COUNTER/TIMER', 'FAULTY OSD', 'FIRMWARE UPAGRADE REQUESTED', 'HORIZONTAL LINE PRINTED', 'MEMORY ERASED/MISSED', 'NO AUDIO', 'NO LOADING', 'NO PICTURE/NO RASTER', 'NO POWER', 'No Power', 'OTHER FAULTY INDICATION', 'OTHER PROBLEMS', 'OUTCASE CRACK', 'PICTURE SLIDING', 'PICTURE TOO BRIGHT', 'PICTURE TOO DARK', 'PIXEL MISSING', 'POOR FOCUS', 'POWER OFF NOT WORKING', 'POWER ON, BUT NO OPERATION', 'Power On, But No Operation', 'READ ERROR', 'RESET NOT WORKING', 'SHORT BATTERY CHARGING', 'SHORT BATTERY LIFE', 'SNOWY PICTURE', 'SOUND CLIPPING', 'Test OK', 'UNIT AUTO-OFF', 'VERTICAL LINE PRINTED']
-------------------------
Repair_Action_Desc
	['ADJUSTMENT', 'CHANGE LCD', 'CHANGE OUTCASE', 'CLEANING', 'Cosmetic replacement(customer request)', "DECK ASS'Y REPAIR/REPLACEMENT", 'DECK CLEANING', 'ELECTRICAL PARTS REPLACEMENT', 'FIRMWARE UPGRADE', 'HEAD CLEANING', 'MECHANICAL PARTS REPLACEMENT', 'MECHANICAL RE-ASSEMBLING', 'No Action', 'OTHER CLEANING', 'OTHER ELECTRICAL ADJUSTMENT', 'OTHER MECHANICAL ADJUSTMENT', 'OTHER REPAIR', 'PARTS REMOVAL', 'PCB REPLACEMENT', 'PICKUP CLEANING', 'REPAIR REFUSAL', 'SCRAP', 'SET EXCHANGE', 'SOFTWARE UPGRADE', 'SOLDERING', 'TELECOM CARRIER PROBLEM']
-------------------------
Labor_Charge_Desc
	['Free Of Charge', 'Major', 'Maximum', 'Medium', 'Minimum', 'Minor']
-------------------------
Engineer
	['Tech01', 'Tech02', 'Tech03', 'Tech04', 'Tech05', 'Tech06', 'Tech07', 'Tech08', 'Tech10', 'Tech11', 'Tech12', 'Tech13', 'Tech14', 'Tech15', 'Tech16', 'Tech17', 'Tech18', 'Tech20', 'Tech21', 'Tech22', 'Tech23', 'Tech24']
-------------------------

It seems all columns with object dtype can be treated as the category type.

for column in service_df.select_dtypes(include=[object]).columns:
    service_df[column] = service_df[column].astype('category')
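Why bother? The category dtype stores each distinct string once and represents the column with small integer codes, which saves memory when values repeat a lot. A toy illustration (not the service data):

```python
import pandas as pd

# Repeated strings stored as full objects vs. as integer codes plus a
# small table of categories.
s = pd.Series(["TEHRAN", "KARAJ", "TEHRAN"] * 1000)
cat = s.astype("category")
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```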

Now let's see all the values of each column with integer dtype:

for column in service_df.select_dtypes(include=['int64']).columns:
    print(column, sorted(service_df[column].unique()), sep="\n\t")
    print("-------------------------")
Service_type
	[0, 1]
-------------------------
TAT01
	[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 26, 27, 28, 30, 31, 32, 33, 34, 36, 37, 38, 40, 42, 44, 50, 54, 55, 58, 64, 65, 67, 68, 81]
-------------------------
TAT02
	[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41, 42, 43, 44, 45, 46, 47, 48, 50, 51, 52, 54, 56, 58, 60, 65, 68, 73, 75, 77, 78, 80, 81, 82, 84, 87, 88, 105, 122, 135, 153, 158, 208, 351]
-------------------------
Job_Satus
	[0, 1]
-------------------------
Labor_Charge_Amount
	[0, 50000, 70000, 85000, 120000, 145000, 150000, 170000, 180000, 200000, 215000, 240000, 250000, 300000, 350000, 360000, 400000, 420000, 540000]
-------------------------

Service_type and Job_Satus can also be treated as categorical.

service_df['Service_type'] = service_df['Service_type'].astype('category')
service_df['Job_Satus'] = service_df['Job_Satus'].astype('category')

Scaling and Normalization

The values of Labor_Charge_Amount, Parts_Amount, Discount_Amount and Total_Invoice_Amount are quite large; we can scale them down (here, to thousands).

service_df['Labor_Charge_Amount'] = service_df['Labor_Charge_Amount'].div(1000)
service_df['Parts_Amount'] = service_df['Parts_Amount'].div(1000)
service_df['Discount_Amount'] = service_df['Discount_Amount'].div(1000)
service_df['Total_Invoice_Amount'] = service_df['Total_Invoice_Amount'].div(1000)
service_df.head()
Service_type Cost_Type Product_Group City Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount
No
1 0 Normal Warranty AUDIO - HOME THEAT TEHRAN 2012-02-05 2012-02-05 2013-01-21 0 351 0 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Free Of Charge Tech15 0.0 0.00000 0.000000 0.000000
2 1 Normal Warranty REFRIGERATOR-SBS TEHRAN 2012-07-07 2012-09-10 2013-01-31 65 208 1 SHORT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0.0 3437.48160 687.486320 2748.885280
3 1 Normal Warranty REFRIGERATOR-SBS TEHRAN 2012-07-07 2012-08-15 2013-01-31 38 208 1 OPEN OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0.0 4228.75656 845.751312 3383.005248
4 0 Out of Warranty MONITOR - TFT LCD TEHRAN 2012-08-07 2012-08-07 2013-01-13 0 158 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0.0 0.00000 0.000000 0.000000
5 0 Out of Warranty AUDIO - MP3 Player TEHRAN 2012-08-07 2012-08-09 2013-01-07 2 153 1 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Medium Tech07 120.0 765.48860 177.087820 708.381680
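Dividing by 1000 only changes the unit. If a model needs values on a common [0, 1] scale, min-max normalization is a standard alternative; a toy sketch (not applied to service_df):

```python
import pandas as pd

# Min-max scaling: (x - min) / (max - min) maps a column into [0, 1].
amounts = pd.Series([0.0, 120.0, 420.0, 540.0])
scaled = (amounts - amounts.min()) / (amounts.max() - amounts.min())
print(scaled.tolist())
```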

Dummy variables

What if we want to fit a regression model? All variables should then be numerical, whereas the dataframe still has some categorical columns.

To address this, each categorical variable can be encoded as a set of numerical indicator (dummy) variables.

dummy_variable_cost_type = pd.get_dummies(service_df["Cost_Type"])
dummy_variable_cost_type
Extended Warranty Normal Warranty Out of Warranty
No
1 0 1 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 1
... ... ... ...
3132 0 1 0
3133 0 1 0
3134 1 0 0
3135 0 0 1
3136 0 0 1

1971 rows × 3 columns

As you can see, exactly one cell in each row has the value one.
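Because the dummy columns always sum to one, one of them is redundant for a regression model (perfect collinearity). get_dummies can drop one level and keep it as the implicit baseline; a toy sketch:

```python
import pandas as pd

# drop_first=True drops the first (alphabetical) level, leaving
# linearly independent indicator columns.
cost = pd.Series(["Normal Warranty", "Out of Warranty", "Extended Warranty"])
dummies = pd.get_dummies(cost, drop_first=True)
print(dummies.columns.tolist())
```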

Merging Datasets

So we have the service_df and dummy_variable_cost_type DataFrames, and we want to merge them into one. In this case, we append the columns of dummy_variable_cost_type to service_df.

This link provides a comprehensive description of merging.

service_df = pd.concat([service_df, dummy_variable_cost_type], axis=1)

Dropping the original "Cost_Type" column since it's now redundant.

service_df.drop('Cost_Type', axis = 1, inplace=True)
service_df
Service_type Product_Group City Receipt_Date Appoint_Date Complete_Date TAT01 TAT02 Job_Satus Defect_Des Symptom_Desc Repair_Action_Desc Labor_Charge_Desc Engineer Labor_Charge_Amount Parts_Amount Discount_Amount Total_Invoice_Amount Extended Warranty Normal Warranty Out of Warranty
No
1 0 AUDIO - HOME THEAT TEHRAN 2012-02-05 2012-02-05 2013-01-21 0 351 0 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Free Of Charge Tech15 0.0 0.00000 0.000000 0.000000 0 1 0
2 1 REFRIGERATOR-SBS TEHRAN 2012-07-07 2012-09-10 2013-01-31 65 208 1 SHORT OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0.0 3437.48160 687.486320 2748.885280 0 1 0
3 1 REFRIGERATOR-SBS TEHRAN 2012-07-07 2012-08-15 2013-01-31 38 208 1 OPEN OTHER PROBLEMS OTHER REPAIR Free Of Charge Tech03 0.0 4228.75656 845.751312 3383.005248 0 1 0
4 0 MONITOR - TFT LCD TEHRAN 2012-08-07 2012-08-07 2013-01-13 0 158 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Free Of Charge Tech15 0.0 0.00000 0.000000 0.000000 0 0 1
5 0 AUDIO - MP3 Player TEHRAN 2012-08-07 2012-08-09 2013-01-07 2 153 1 WORN-OUT OTHER PROBLEMS SOFTWARE UPGRADE Medium Tech07 120.0 765.48860 177.087820 708.381680 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3132 0 LCD-40 REZVANSHAHR 2013-01-31 2013-01-31 2013-01-31 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420.0 0.00000 84.000000 336.000000 0 1 0
3133 0 LED46 ESFAHAN 2013-01-31 2013-01-31 2013-01-31 0 0 1 WORN-OUT OTHER PROBLEMS ELECTRICAL PARTS REPLACEMENT Major Tech13 420.0 0.00000 84.000000 336.000000 0 1 0
3134 0 HARD DISK DRIVE TEHRAN 2013-01-31 2013-01-31 2013-01-31 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Minimum Tech07 50.0 0.00000 10.000000 40.000000 1 0 0
3135 0 HARD DISK DRIVE TEHRAN 2013-01-31 2013-01-31 2013-01-31 0 0 1 WORN-OUT OTHER PROBLEMS REPAIR REFUSAL Free Of Charge Tech07 0.0 0.00000 0.000000 0.000000 0 0 1
3136 0 DVD PLAYER/ RECORD TEHRAN 2013-01-31 2013-01-31 2013-01-31 0 0 1 WORN-OUT OTHER PROBLEMS MECHANICAL PARTS REPLACEMENT Minor Tech18 150.0 1488.48200 328.686400 1318.785600 0 0 1

1971 rows × 21 columns

Now let's store the prepared dataframe, keeping only the rows you are interested in, say the "HARD DISK DRIVE" products. to_csv() and to_pickle() will do this.

Harddrive_df = service_df.loc[service_df['Product_Group'] == 'HARD DISK DRIVE']
Harddrive_df.to_csv('Harddrive.csv') 
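Unlike CSV, a pickle round-trip preserves pandas dtypes such as category and datetime64. A minimal sketch with a toy frame and a temporary file (the filename is arbitrary):

```python
import os
import tempfile

import pandas as pd

# A toy frame with a categorical column; CSV would flatten it to text.
df = pd.DataFrame({
    "Product_Group": pd.Categorical(["HARD DISK DRIVE", "HARD DISK DRIVE"]),
    "TAT01": [0, 3],
})
path = os.path.join(tempfile.mkdtemp(), "Harddrive.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)
print(restored.dtypes["Product_Group"])
```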

Iran’s Earthquakes through history

Opening Excel Files

As another example, let's look at this new dataset.

earthquake_dfs = pd.read_excel(join('Datasets', 'earthquake', 'data.xls'), sheet_name=None)
earthquake_dfs.keys()
dict_keys(['Historical', '1900-1963', '1964-2000'])

The data is divided among three sheets; let's look at two of them.

earthquake_df1 = earthquake_dfs['1900-1963']
earthquake_df1.head()
No. Date Unnamed: 2 Unnamed: 3 Time Location Unnamed: 6 Unnamed: 7 Magnitude Unnamed: 9 Ref.
0 NaN Y M D NaN Lat. Long. Dept mb Ms NaN
1 1.0 1900 6 NaN NaN 38.5 43.3 NaN NaN 5 MEA
2 2.0 1900 7 12 NaN 40.28 43.1 NaN NaN 5.9 ULM
3 3.0 1901 4 3 57.0 40 44.3 15 NaN 3.9 MEA
4 4.0 1902 NaN NaN NaN 39 43.3 NaN NaN 5 MEA
earthquake_df2 = earthquake_dfs['1964-2000']
earthquake_df2.head()
No. Date Unnamed: 2 Unnamed: 3 Time Location Unnamed: 6 Unnamed: 7 Magnitude Unnamed: 9 Unnamed: 10 Ref.
0 NaN Y M D NaN Lat. Long. Dept mb Ms Mw NaN
1 1.0 1964 1 12 124550.0 31.52 49.23 56 5.1 NaN NaN ISC
2 2.0 1964 1 19 91353.0 26.79 54 38 5.6 NaN NaN ISC
3 3.0 1964 1 30 122310.0 32.7 47.8 33 NaN NaN NaN PDE
4 4.0 1964 2 5 102423.0 40.37 46.1 9 4.5 NaN NaN ISC

The second row is part of the header too. We could have passed header=[0, 1] to read_excel to handle this, but let's do it manually.

earthquake_df1.columns = ['No', 'Year', 'Month', 'Day', 'Time', 'Lat', 'Long', 'Dept', 'mb', 'Ms', 'ref']
earthquake_df1 = earthquake_df1.iloc[1:].reset_index(drop=True)
earthquake_df2.columns = ['No', 'Year', 'Month', 'Day', 'Time', 'Lat', 'Long', 'Dept', 'mb', 'Ms', 'Mw', 'ref']
earthquake_df2 = earthquake_df2.iloc[1:].reset_index(drop=True)
earthquake_df1.head()
No Year Month Day Time Lat Long Dept mb Ms ref
0 1.0 1900 6 NaN NaN 38.5 43.3 NaN NaN 5 MEA
1 2.0 1900 7 12 NaN 40.28 43.1 NaN NaN 5.9 ULM
2 3.0 1901 4 3 57.0 40 44.3 15 NaN 3.9 MEA
3 4.0 1902 NaN NaN NaN 39 43.3 NaN NaN 5 MEA
4 5.0 1902 2 13 93906.0 40.72 48.71 33 6 NaN B77
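The same fix in isolation, on a toy frame: row 0 carries the leftover second header level, so we assign clean column names and drop that row.

```python
import pandas as pd

# Row 0 holds the sub-header ('Y', 'Lat.'); the real data starts at row 1.
raw = pd.DataFrame({"Date": ["Y", "1900"], "Location": ["Lat.", "38.5"]})
raw.columns = ["Year", "Lat"]
clean = raw.iloc[1:].reset_index(drop=True)
print(clean)
```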

Concatenation

We concatenate these two dataframes into one. Notice how the resulting columns are the union of the two dataframes' columns.

earthquake_df_all = pd.concat([earthquake_df1, earthquake_df2])
earthquake_df_all.head()
No Year Month Day Time Lat Long Dept mb Ms ref Mw
0 1.0 1900 6 NaN NaN 38.5 43.3 NaN NaN 5 MEA NaN
1 2.0 1900 7 12 NaN 40.28 43.1 NaN NaN 5.9 ULM NaN
2 3.0 1901 4 3 57.0 40 44.3 15 NaN 3.9 MEA NaN
3 4.0 1902 NaN NaN NaN 39 43.3 NaN NaN 5 MEA NaN
4 5.0 1902 2 13 93906.0 40.72 48.71 33 6 NaN B77 NaN
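The column-union behavior is easy to see on toy frames: cells a frame never had are filled with NaN.

```python
import pandas as pd

# Frame 'a' has an Ms column, frame 'b' an Mw column; concat keeps both.
a = pd.DataFrame({"No": [1], "Ms": [5.0]})
b = pd.DataFrame({"No": [2], "Mw": [6.1]})
both = pd.concat([a, b], ignore_index=True)
print(both)
```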

Imputation

Let's also replace the NaNs in the mb column with the mean of that column.

earthquake_df_all['mb'] = earthquake_df_all['mb'].fillna(earthquake_df_all['mb'].mean())
earthquake_df_all.head()
No Year Month Day Time Lat Long Dept mb Ms ref Mw
0 1.0 1900 6 NaN NaN 38.5 43.3 NaN 4.459897 5 MEA NaN
1 2.0 1900 7 12 NaN 40.28 43.1 NaN 4.459897 5.9 ULM NaN
2 3.0 1901 4 3 57.0 40 44.3 15 4.459897 3.9 MEA NaN
3 4.0 1902 NaN NaN NaN 39 43.3 NaN 4.459897 5 MEA NaN
4 5.0 1902 2 13 93906.0 40.72 48.71 33 6.000000 NaN B77 NaN
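Mean imputation in isolation, on a toy column:

```python
import pandas as pd

# NaNs are replaced by the mean of the observed values (here 5.5).
mb = pd.Series([5.0, None, 6.0, None])
filled = mb.fillna(mb.mean())
print(filled.tolist())
```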

Further work can be done on this dataframe.

Json files (Bonus)

In this part, we will use the requests library to gather search results.

import requests
import json
query = {'q': 'playstation 5', 'sort': 'popularity', 'size' : 50}
response = requests.get("https://api.torob.com/v4/base-product/search/", params=query)
response_json =  response.json()
response_json.keys()
dict_keys(['results', 'count', 'max_price', 'min_price', 'next', 'previous', 'spellcheck', 'available_filters', 'filter_by_category_title', 'categories', 'parent_categories'])
response_df = pd.json_normalize(response_json['results'])
response_df.head()
image_url discount_info random_key name1 name2 more_info_url web_client_absolute_url price price_text price_text_mode shop_text
0 https://storage.torob.com/backend-api/base/ima... [] 2e71cfe4-7b77-4e80-837b-de93b5fb95d7 کنسول بازی سونی PlayStation 5 Digital پلی استی... Playstation 5 Digital Editoin https://api.torob.com/v4/base-product/details-... /p/2e71cfe4-7b77-4e80-837b-de93b5fb95d7/%DA%A9... 20120000 از ۲۰٫۱۲۰٫۰۰۰ تومان active در ۱۰۱ فروشگاه
1 https://storage.torob.com/backend-api/base/ima... [] 9825c33c-07e6-4ac7-8bd6-12015ebb3e3c کنسول بازی سونی PlayStation 5 Standard پلی اس... Playstation 5 Standard Editoin https://api.torob.com/v4/base-product/details-... /p/9825c33c-07e6-4ac7-8bd6-12015ebb3e3c/%DA%A9... 21700000 از ۲۱٫۷۰۰٫۰۰۰ تومان active در ۱۳۶ فروشگاه
2 https://pcjanebi.com/wp-content/uploads/2021/0... [] 9687bb29-671b-42e4-986b-5f56ca3ade59 باندل کنسول بازی سونی PlayStation 5 Standard پ... Standard PlayStation 5 bundle+DualSense+Pulse+... https://api.torob.com/v4/base-product/details-... /p/9687bb29-671b-42e4-986b-5f56ca3ade59/%D8%A8... 29990000 از ۲۹٫۹۹۰٫۰۰۰ تومان active در ۵ فروشگاه
3 https://storage.torob.com/backend-api/base/ima... [] 3d19000e-b1a3-48de-8ef2-70c23c89b534 پلی استیشن 5 دیجیتال به همراه بازی باندل 2 دسته https://api.torob.com/v4/base-product/details-... /p/3d19000e-b1a3-48de-8ef2-70c23c89b534/%D9%BE... 23000000 ۲۳٫۰۰۰٫۰۰۰ تومان active در مرکزخریددات کام
4 https://storage.torob.com/backend-api/base/ima... [] 88d90bf1-b3c9-4a6f-8abc-0605f0e8a008 کنسول بازی سونی پلی استیشن 5 ظرفیت 825 گیگابایت https://api.torob.com/v4/base-product/details-... /p/88d90bf1-b3c9-4a6f-8abc-0605f0e8a008/%DA%A9... 21299000 ۲۱٫۲۹۹٫۰۰۰ تومان active در ژیوان
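pd.json_normalize flattens nested JSON records into columns, joining nested keys with dots. A toy record for illustration (hypothetical field names, not the Torob schema):

```python
import pandas as pd

# Nested dicts become dotted column names: 'price.amount', etc.
records = [{"name": "console A", "price": {"amount": 100, "currency": "IRR"}}]
df = pd.json_normalize(records)
print(df.columns.tolist())
```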