Unlocking the Power of Categorical Data: Strategies for Encoding and Feature Transformation
Introduction:
In the world of machine learning and data analysis, encoding plays a crucial role in converting categorical features into numerical values. This transformation is necessary because machine learning models typically work with numerical data. In this article, we will delve into the different types of encoding techniques, focusing on nominal and ordinal encoding. We will explore their definitions, use cases, and implementation in Python using popular libraries like scikit-learn. By the end of this article, you will have a clear understanding of how to encode categorical data effectively.
Encoding is the technique of converting the categorical features of a dataset into numerical values. Since machine learning models expect numerical input during training, it is important to transform categorical features into numeric ones.
Encoding
There are two types of encoding:
Nominal Encoding
Nominal encoding is used when there is no particular order or ranking among the categories.
Each category is assigned a unique numerical value without any specific order.
The assigned numerical values are arbitrary and do not represent any inherent order.
Examples: Red, Green, and Blue can be encoded as 1, 2, 3 or 3, 1, 2.
Ordinal Encoding
Ordinal encoding is used when there is an inherent order or ranking among the categories.
Each category is assigned a numerical value based on its relative position or rank.
The assigned numerical values convey the order or hierarchy among the categories.
Examples: Low, Medium, and High can be encoded as 0, 1, 2 or 1, 2, 3 (a small sketch of both approaches follows below).
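To make the distinction concrete, here is a minimal sketch using pandas with hypothetical 'color' and 'size' columns (the names and values are invented purely for illustration):
import pandas as pd

# Toy data for illustration only
demo = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green'],
                     'size': ['Low', 'High', 'Medium', 'Low']})

# Nominal: the codes carry no order, so one-hot (dummy) columns are a safe choice
print(pd.get_dummies(demo['color']))

# Ordinal: map each category to its rank explicitly
print(demo['size'].map({'Low': 0, 'Medium': 1, 'High': 2}))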
Ordinal Encoding
Ordinal encoding is essentially about assigning ranks to the categories in a dataset. Let us understand it using an example dataset.
We will apply ordinal encoding to the first two columns of the dataset, 'review' and 'education', since their categories follow a ranking.
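As a small, hypothetical stand-in for that dataset (the column names come from the code below; the values are invented for illustration), we can build a DataFrame like this:
import pandas as pd

# Hypothetical sample data with ranked 'review'/'education' columns and a 'purchased' target
df = pd.DataFrame({
    'review': ['Poor', 'Good', 'Average', 'Good', 'Poor'],
    'education': ['UG', 'PG', 'School', 'UG', 'PG'],
    'purchased': ['No', 'Yes', 'No', 'Yes', 'No']
})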
First, we will split the data into training and testing sets using sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Split the features ('review', 'education') from the target ('purchased')
x_train, x_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1), df['purchased'], test_size=0.2, random_state=0)
Now we will apply ordinal encoding using the sklearn library:
from sklearn.preprocessing import OrdinalEncoder

# Pass the category order explicitly so the ranks match the intended hierarchy
oe = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
The encoder will assign ranks following the order we passed in: Poor < Average < Good become 0, 1, 2, and similarly School < UG < PG become 0, 1, 2.
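A minimal sketch of fitting the encoder and inspecting the result (reusing the x_train and x_test splits from above):
# Fit on the training data, then transform both splits with the learned mapping
x_train_encoded = oe.fit_transform(x_train[['review', 'education']])
x_test_encoded = oe.transform(x_test[['review', 'education']])
print(x_train_encoded)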
Looking at the transformed output, we can see that in the review column Poor is assigned 0, Average 1, and Good 2, while in the education column School is assigned 0, UG 1, and PG 2.
That is ordinal encoding, as simple as that.
Nominal Encoding
Nominal encoding is a technique used to convert categorical features into numerical values when there is no inherent order or ranking among the categories. One common method to perform nominal encoding is by using the OneHotEncoder class from the scikit-learn library.
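As a hypothetical stand-in for the car dataset used in this example (only the 'fuel' and 'owner' columns appear in the code below; the other columns and all values are invented for illustration):
import pandas as pd

# Hypothetical car-sales data: columns 0-3 are features, the last column is the target
df = pd.DataFrame({
    'brand': ['Maruti', 'Hyundai', 'Honda', 'Maruti', 'Toyota', 'Hyundai', 'Honda', 'Maruti', 'Toyota', 'Maruti'],
    'km_driven': [45000, 30000, 60000, 52000, 25000, 41000, 38000, 70000, 15000, 33000],
    'fuel': ['Petrol', 'Diesel', 'CNG', 'Petrol', 'Diesel', 'Petrol', 'CNG', 'Diesel', 'Petrol', 'CNG'],
    'owner': ['First Owner', 'Second Owner', 'First Owner', 'Third Owner', 'First Owner',
              'Second Owner', 'Third Owner', 'First Owner', 'Second Owner', 'Third Owner'],
    'selling_price': [450000, 600000, 350000, 300000, 800000, 550000, 320000, 280000, 750000, 400000]
})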
First, let's split our dataset into training and testing sets:
from sklearn.model_selection import train_test_split

# Columns 0-3 are the features; the last column is the target
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2, random_state=2)
X_train.head()
We will be applying OneHotEncoder to two categorical features, namely 'fuel' and 'owner'. The X_train.head() call above lets us inspect the first few rows of the training data before encoding.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Note: in scikit-learn 1.2+ the 'sparse' parameter is named 'sparse_output'
ohe = OneHotEncoder(drop='first', sparse=False, dtype=np.int32)
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])
In the code above, we create an instance of the OneHotEncoder class with the parameter 'drop' set to 'first' to drop the first column of each encoded feature (to avoid multicollinearity). We also set 'sparse' to False to obtain a dense array and 'dtype' to np.int32 to ensure integer encoding.
By calling the fit_transform method on the selected categorical features from the training set, we transform the categorical data into a numerical representation. This process creates a new column for each unique category in the selected features (minus the dropped first category), with binary values indicating the presence or absence of that category. The encoded data, stored in X_train_new, can now be used for further analysis or machine learning tasks.
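As a rough follow-up sketch (assuming scikit-learn 1.0+ for get_feature_names_out), you can inspect the generated columns and stitch the encoded array back together with the columns that were left untouched:
# Names of the generated dummy columns
print(ohe.get_feature_names_out())

# Recombine the encoded columns with the columns that were not one-hot encoded
X_train_rest = X_train.drop(columns=['fuel', 'owner']).to_numpy()
X_train_final = np.hstack([X_train_rest, X_train_new])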
Remember to apply the same encoding scheme to the testing set (X_test) to ensure consistency in feature representation.
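In practice this means calling transform (not fit_transform) on the test split, so the test data is encoded with the categories learned from the training data:
# Reuse the already fitted encoder on the test set
X_test_new = ohe.transform(X_test[['fuel', 'owner']])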
Label Encoding
According to the official documentation of the scikit-learn library, the LabelEncoder transformer should be used to encode target values, i.e. y, and not the input X.
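A minimal sketch of LabelEncoder on a string target (the labels here are hypothetical):
from sklearn.preprocessing import LabelEncoder

# Encode string class labels as integers ('No' -> 0, 'Yes' -> 1)
le = LabelEncoder()
y = ['No', 'Yes', 'No', 'Yes', 'No']   # hypothetical target labels
print(le.fit_transform(y))             # [0 1 0 1 0]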
Conclusion:
Encoding categorical data is an essential step in preparing data for machine learning models. In this article, we discussed two common encoding techniques: nominal and ordinal encoding. Nominal encoding is used when there is no particular order among categories, and each category is assigned a unique numerical value. On the other hand, ordinal encoding is applied when there is a clear order or ranking among categories, and the assigned numerical values convey this hierarchy. We also explored the implementation of encoding techniques using libraries like scikit-learn, providing practical examples along the way. Armed with this knowledge, you are now equipped to tackle categorical data encoding in your machine-learning projects with confidence. Happy encoding!