Label encoding, Mapping and One hot encoding

This post outlines data pre-processing of categorical variables using Label encoding, Mapping and One Hot Encoding.
# Import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import warnings  # Used below to suppress all warnings and keep the printed output clean
warnings.filterwarnings("ignore")

# Create a simple dataframe
customer_demo = pd.DataFrame({'customerID':['A021', 'B341', 'C006', 'D122', 'E874', 'F442', 'G433', 'H343', 'I532', 'J451'],
                              'gender':['F', 'M', 'M', 'M', 'F',  'M', 'F', 'M', 'F', 'M'],
                              'affluency':['Low', 'High', 'Medium', 'Low', 'Medium', 'High', 'High', 'High', 'Low', 'Low'],
                              'region':['West', 'Central', 'East', 'East', 'Central', 'East', 'West', 'Central', 'Central', 'West']
                           })
print("Initial dataset:")
print(customer_demo)                           


"""
1. Label Encoding
This is the simplest way to encode categorical values. Label Encoding converts each distinct value in a 
column to an integer. 
LabelEncoder sorts the distinct values before numbering them, so gender 'F' is encoded as 0 and 'M' as 1.
""" 

# Label encoding
le = LabelEncoder()
customer_demo['gender_cat'] = le.fit_transform(customer_demo.gender)
# print(customer_demo)
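
To check which integer each label received, the fitted encoder can be inspected. The following is a minimal sketch on a standalone copy of the gender column; the names label_to_code and decoded are illustrative and not part of the original post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Standalone copy of the gender column from the example dataframe
gender = pd.Series(['F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M'])

le = LabelEncoder()
gender_cat = le.fit_transform(gender)

# classes_ holds the sorted unique labels; each label's position is its code
label_to_code = {str(label): code for code, label in enumerate(le.classes_)}
print(label_to_code)  # {'F': 0, 'M': 1}

# inverse_transform turns the integer codes back into the original labels
decoded = le.inverse_transform(gender_cat)
print(list(decoded) == list(gender))  # True
```

Keeping the fitted encoder around is useful when the same mapping has to be applied to new data later with le.transform().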


"""
2. Mapping
LabelEncoder finds the unique values present in a column and maps them to the range [0, n-1], where n is the 
number of unique values in the column. The values are numbered in sorted (alphabetical) order. Thus, in the 
previous case, 'F' is mapped to 0 and 'M' is mapped to 1.

If we use the same approach to encode the affluency column, the mapping will be:
High: 0
Low: 1
Medium: 2

This mapping is not ideal since affluency is an ordinal categorical feature. The three categories, Low, Medium 
and High, have an order associated with them. We would like to have this mapping instead:
Low: 0
Medium: 1
High: 2

There are multiple ways to achieve this. One of the simplest is to pass a dictionary to the pandas Series 
map() method, as shown below.
"""

# Unique values in affluency
print(customer_demo.affluency.unique())

# Define mapping dictionary
mapping_dict = {'Low':0, 'Medium':1, 'High':2}

# Encode affluency column
customer_demo['affluency_cat'] = customer_demo.affluency.map(mapping_dict)
# print(customer_demo)
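
One caveat with map() is that any value missing from the dictionary becomes NaN without an error. The sketch below adds a simple check for that and, as one possible alternative, uses an ordered pandas Categorical to obtain the same codes. The names missing, ordered and affluency_codes are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Standalone copy of the affluency column from the example dataframe
affluency = pd.Series(['Low', 'High', 'Medium', 'Low', 'Medium',
                       'High', 'High', 'High', 'Low', 'Low'])
mapping_dict = {'Low': 0, 'Medium': 1, 'High': 2}

affluency_cat = affluency.map(mapping_dict)

# map() yields NaN for values absent from the dictionary, so check for them
missing = affluency[affluency_cat.isna()].unique()
if len(missing) > 0:
    raise ValueError(f"Values without a mapping: {list(missing)}")

# Alternative sketch: an ordered Categorical assigns codes in the listed order
ordered = pd.Categorical(affluency, categories=['Low', 'Medium', 'High'], ordered=True)
affluency_codes = pd.Series(ordered.codes, index=affluency.index)

print(affluency_codes.tolist() == affluency_cat.tolist())  # True
```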


"""
3. One Hot Encoding
When a column has several categories with no natural order, a common alternative approach is One Hot Encoding. 
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) 
value to that column. This avoids implying an order or weight between categories that does not exist, but it 
has the downside of adding more columns to the data set.

Again, there are multiple ways to do One Hot Encoding. The pandas get_dummies function is used here; 
other techniques exist. We encode 'region' in the following code cell. Three extra columns are 
created, one for each unique value in region: region_Central, region_East and region_West. 
For each row, only the dummy column that matches its region has value 1.
"""
print()
# duplicate region column to keep original values
customer_demo['region_cat'] = customer_demo.region

# convert region_cat to dummy variables.
customer_demo_final = pd.get_dummies(customer_demo, columns=["region_cat"], prefix=["region"],
                                     dtype=int)  # 0/1 columns; recent pandas defaults to True/False
print("Final pre-processing output:")
print(customer_demo_final)
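
As a quick sanity check on the dummy columns, each row should contain exactly one 1. The sketch below runs get_dummies on a standalone copy of the region column and also shows the optional drop_first argument, which drops the first category's column to reduce the number of added columns. The names dummies, row_sums and dummies_reduced are illustrative, and using drop_first is an assumption about what a given model may need, not something the original post does.

```python
import pandas as pd

# Standalone copy of the region column from the example dataframe
region = pd.Series(['West', 'Central', 'East', 'East', 'Central',
                    'East', 'West', 'Central', 'Central', 'West'], name='region')

# dtype=int forces 0/1 values instead of True/False
dummies = pd.get_dummies(region, prefix='region', dtype=int)
print(dummies.columns.tolist())  # ['region_Central', 'region_East', 'region_West']

# Exactly one dummy column is set per row
row_sums = dummies.sum(axis=1)
print((row_sums == 1).all())  # True

# Optional: drop the first category; it is implied when the other columns are all 0
dummies_reduced = pd.get_dummies(region, prefix='region', dtype=int, drop_first=True)
print(dummies_reduced.columns.tolist())  # ['region_East', 'region_West']
```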


"""
You may then drop the original categorical columns and proceed with your modeling.
Note that other encoding techniques exist for columns with a large number of categories; 
those can be explored when the need arises.
"""

