Whether you admit it or not, data cleaning is rarely easy: in most cases it is time-consuming and tedious, but it is also essential. If you have been through the process, you know what I mean, and that is exactly why this article was written: to make data cleaning easier for the reader. A while ago I realized that many datasets share similar patterns when it comes to cleaning, so I started compiling data cleaning code that applies to these common scenarios. Since the scenarios involve different kinds of datasets, this article focuses on showing and explaining what each snippet does, so that the reader can put it to use more easily.
A data cleaning toolbox
In the code snippets below, the data cleaning code is wrapped in a number of functions, and the purpose of each is intuitive. You can use these snippets directly; at most they require a few small parameter changes, such as column names, to fit your own DataFrame.
1. Drop multiple columns
def drop_multiple_col(col_names_list, df):
    '''
    AIM -> Drop multiple columns based on their column names
    INPUT -> List of column names, df
    OUTPUT -> updated df with dropped columns
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df
Sometimes, not all columns are useful for our data analysis work. Therefore, "df.drop" is a convenient way to delete the columns you select.
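As a quick sanity check (the column names here are made up for illustration), the function above boils down to a single "df.drop" call:

```python
import pandas as pd

# hypothetical toy frame; 'b' and 'c' are the columns we no longer need
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# same call that drop_multiple_col wraps
df.drop(['b', 'c'], axis=1, inplace=True)
print(df.columns.tolist())  # ['a']
```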
2. Convert dtypes
def change_dtypes(col_int, col_float, df):
    '''
    AIM -> Changing dtypes to save memory
    INPUT -> List of column names (int, float), df
    OUTPUT -> updated df with smaller memory
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    return df
When we are dealing with larger datasets, converting "dtypes" saves memory. If you are interested in learning how to use "Pandas" with large data, I highly recommend reading the article "Why and How to Use Pandas with Large Data" (https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c).
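A minimal sketch of the memory effect, on a tiny made-up frame (pandas defaults numeric columns to 64-bit dtypes, so the downcast roughly halves their footprint):

```python
import pandas as pd

# toy numeric frame; pandas infers int64 / float64 by default
df = pd.DataFrame({'n': [1, 2, 3], 'x': [0.5, 1.5, 2.5]})
before = df.memory_usage(deep=True).sum()

df['n'] = df['n'].astype('int32')    # int64 -> int32
df['x'] = df['x'].astype('float32')  # float64 -> float32
after = df.memory_usage(deep=True).sum()

print(before, '->', after)  # total bytes shrink after downcasting
```

Note that 'int32' and 'float32' trade range and precision for memory, so check your value ranges before downcasting.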
3. Convert categorical variables to numerical variables
def convert_cat2num(df):
    # Convert categorical variables to numerical variables
    num_encode = {'col_1': {'YES': 1, 'NO': 0},
                  'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
    df.replace(num_encode, inplace=True)
There are some machine learning models that require variables to be in numerical form. In this case, we need to convert the categorical variables into numerical variables and then use them as inputs to the model. For data visualization tasks, I recommend that you keep the categorical variables so that the visualization results are more clearly interpreted and easy to understand.
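To see the mapping in action on a toy frame (the column names and labels mirror the illustrative dictionary above and are not from any real dataset):

```python
import pandas as pd

# toy frame mirroring the num_encode mapping above
df = pd.DataFrame({'col_1': ['YES', 'NO', 'YES'],
                   'col_2': ['WON', 'DRAW', 'LOSE']})
num_encode = {'col_1': {'YES': 1, 'NO': 0},
              'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
df.replace(num_encode, inplace=True)

print(df['col_1'].tolist())  # [1, 0, 1]
print(df['col_2'].tolist())  # [1, 0, 0]
```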
4. Checking for missing data
def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)
If you want to check how much missing data is in each column, this is probably the fastest way to do it. This method gives you a better idea of which columns have more missing data and helps you decide what action you should take next in your data cleaning and data analysis efforts.
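A quick illustration on a toy frame with a known number of NaNs per column (column names are arbitrary):

```python
import numpy as np
import pandas as pd

# toy frame: 'a' has 2 missing values, 'b' has 1, 'c' has none
df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [1, 2, np.nan],
                   'c': [1, 2, 3]})
missing = df.isnull().sum().sort_values(ascending=False)

print(missing)  # counts per column, most-missing first
```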
5. Remove strings from columns
def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'].replace('\n', '', regex=True, inplace=True)

    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'].replace('&#.*', '', regex=True, inplace=True)
Sometimes you may see newline characters or other strange symbols in a string column. You can easily handle this with df['col_1'].replace, where "col_1" is a column in the DataFrame df.
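A small sketch on made-up dirty values; note that assigning the result back (instead of using inplace=True on a selected column) sidesteps chained-assignment warnings on newer pandas versions:

```python
import pandas as pd

# toy column containing a newline and an HTML entity fragment
df = pd.DataFrame({'col_1': ['hello\nworld', 'text&#160;junk']})

df['col_1'] = df['col_1'].replace('\n', '', regex=True)    # drop newlines
df['col_1'] = df['col_1'].replace('&#.*', '', regex=True)  # drop '&#' and all that follows

print(df['col_1'].tolist())  # ['helloworld', 'text']
```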
6. Remove spaces from a column
def remove_col_white_space(df, col):
    # remove white space at the beginning of string
    df[col] = df[col].str.lstrip()
Messy data produces all sorts of surprises, and leading spaces in strings are among the most common. This method is useful when you want to strip the spaces at the beginning of the strings in a column.
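A quick check on toy values (the column name is arbitrary); "str.lstrip" removes only leading whitespace, leaving interior spaces untouched:

```python
import pandas as pd

col = 'col_1'  # hypothetical column name
df = pd.DataFrame({col: ['  padded', 'clean', ' one space']})
df[col] = df[col].str.lstrip()

print(df[col].tolist())  # ['padded', 'clean', 'one space']
```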
7. Concatenate two columns of string data (under certain conditions)
def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df[mask]['col_1'] + df[mask]['col_2']
    col_new.replace('pil', ' ', regex=True, inplace=True)  # replace the 'pil' with empty space
This method is useful when you want to combine two string columns under certain conditions. For example, you may want to concatenate the first and second columns when the first column ends with certain letters. Depending on your needs, you can also remove those ending letters after the concatenation is done.
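Traced on a toy frame (values invented for illustration): only the row whose 'col_1' ends in 'pil' is kept by the mask and concatenated, and the 'pil' substring is then replaced with a space:

```python
import pandas as pd

# toy frame; only the first row's col_1 ends with 'pil'
df = pd.DataFrame({'col_1': ['redpil', 'blue'], 'col_2': ['X', 'Y']})

mask = df['col_1'].str.endswith('pil', na=False)
col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']  # 'redpil' + 'X'
col_new = col_new.replace('pil', ' ', regex=True)        # 'redpilX' -> 'red X'

print(col_new.tolist())  # ['red X']
```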
8. Convert timestamps (from string type to datetime format)
import pandas as pd

def convert_str_datetime(df):
    '''
    AIM -> Convert datetime(String) to datetime(format we want)
    INPUT -> df
    OUTPUT -> updated df with new datetime format
    '''
    df.insert(loc=2, column='timestamp',
              value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
When working with time series data, you may encounter timestamp columns stored as strings. This means we may have to convert the data from string format to a "datetime" format of our choosing in order to use it for meaningful analysis and presentation.
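A minimal sketch on a made-up frame (the function above assumes a string column named "transdate"; the other columns here are placeholders). The parsed column is inserted at position 2, just as "df.insert(loc=2, ...)" specifies:

```python
import pandas as pd

# toy frame with the timestamp stored as a string in 'transdate'
df = pd.DataFrame({'id': [1], 'note': ['a'],
                   'transdate': ['2020-01-15 10:30:00.000']})

df.insert(loc=2, column='timestamp',
          value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

print(df.columns.tolist())  # 'timestamp' now sits at position 2
```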