Pre-processing database using pre-written functions
Written by: Nir Barazida
The Medium blog-post
The full package usage is elaborated in this Jupyter Notebook
How many times have you received a raw database and performed the same actions to pre-process it?
- Check for missing values and fill them if necessary.
- Combine low appearance categories under one umbrella category.
- Plot your data distribution and check for outliers.
- Drop outliers by different methods.
If you said yes to all of the above, you have reached the right place!
The main purpose of the 'NBprocessing' package is to make our lives as data scientists easier, or better yet - more efficient.
The 'NBprocessing' package stores most of the generic functions that we all use on a daily basis, such as removing outliers, filling missing values, etc.
Run from your command line prompt:
pip install NBprocessing
It will also install all the dependent packages, such as pandas, numpy, seaborn, etc.
-
Categorical - contains functions that are relevant to categorical features:
remove_categories(database, column_name, categories_to_drop)
fill_na_by_ratio(database, column_name)
combine_categories(database, column_name, category_name="other", threshold=0.01)
categories_not_in_common(train, test, column_name)
category_ratio(database, columns_to_check=None, num_categories=5)
label_encoder_features(database, features_to_encode)
OHE(database, features_list=None)
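To illustrate what the two encoding helpers above are about, here is a minimal sketch in plain pandas (this is an illustration of label encoding and one-hot encoding in general, not the NBprocessing implementation):

```python
import pandas as pd

# Toy categorical column (hypothetical example data).
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "electric"]})

# Label encoding: map each category to an integer code.
df["fuel_code"] = df["fuel"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
ohe = pd.get_dummies(df["fuel"], prefix="fuel")
print(ohe.columns.tolist())  # ['fuel_diesel', 'fuel_electric', 'fuel_gas']
```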
-
Continuous - contains functions that are relevant to continuous features:
remove_outliers_by_boundaries(database, column_name, bot_qu, top_qu)
fill_na_timedate(database, column_name)
get_num_outliers_by_value(database, filter_dict_up, filter_dict_down)
remove_outliers_by_value(database, filter_dict_up, filter_dict_down)
-
General - contains general functions:
missing_values(database)
split_and_check(database, column_name, test_size=0.3)
-
Plot - contains plots functions:
plot_missing_value_heatmap(database)
plot_corr_heat_map(database)
count_plot(database, column_list=None)
distribution_plot(database, column_list=None)
world_map_plot(database, locations_column, feature, title=None, color_bar_title=None)
from NBprocessing import NBcategorical
from NBprocessing import NBcontinuous
from NBprocessing import NBplot
from NBprocessing import NBgeneral
All usage of the package functions is reviewed in detail in this Jupyter Notebook.
-
Categorical:
-
Fill missing values in a categorical feature by the ratio of the categories:
fill_na_by_ratio(database, column_name)
Fill all missing values in the given column by the ratio of the categories in the column.
Because the ratios do not sum to exactly one, any remaining missing values will be filled with the most common category in the column.
First, we would like to sum all missing values in every categorical feature.
-
Second, let's explore the ratio of every category in the feature 'fuel', with and without missing values.
-
Last, we would like to fill the missing values while keeping the ratio of the categories.
To do so we will use the fill_na_by_ratio function.
As we can see from the above, all the missing values were filled and we managed to keep the ratio of the categories.
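The idea behind this can be sketched in plain pandas: compute the category ratios, then fill each missing cell by sampling categories according to those ratios. This is a minimal illustration of the technique, not the package's actual implementation (the function name and data below are hypothetical):

```python
import numpy as np
import pandas as pd

def fill_na_by_ratio_sketch(df, column_name, seed=0):
    """Fill NaNs in a categorical column by sampling the existing categories
    according to their observed ratios, so the distribution is preserved."""
    rng = np.random.default_rng(seed)
    ratios = df[column_name].value_counts(normalize=True)  # NaNs are ignored
    mask = df[column_name].isna()
    fills = rng.choice(ratios.index, size=mask.sum(), p=ratios.values)
    df.loc[mask, column_name] = fills
    return df

# Toy 'fuel' column with missing values.
df = pd.DataFrame({"fuel": ["gas"] * 6 + ["diesel"] * 3 + [None] * 3})
df = fill_na_by_ratio_sketch(df, "fuel")
print(df["fuel"].isna().sum())  # 0
```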
-
-
Combine low appearance categories under one category:
combine_categories(database, column_name, category_name="other", threshold=0.01)
Receives a threshold that is the minimum relative part of the category within the column.
All categories that are less than this threshold will be combined under the same category under the name 'category_name'.
The method will return a list with the names of all categories that were combined under 'category_name'.
With this list, the user will be able to apply the same action to the test set (assuming that the data was already split into train and test sets).
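The logic described above can be sketched in a few lines of pandas. This is an illustrative sketch of the technique (the function name and toy data are hypothetical, not the package's actual code):

```python
import pandas as pd

def combine_categories_sketch(df, column_name, category_name="other", threshold=0.01):
    """Replace categories whose relative frequency is below `threshold`
    with one umbrella category, and return the combined category names
    so the same mapping can be applied to a test set."""
    ratios = df[column_name].value_counts(normalize=True)
    rare = list(ratios[ratios < threshold].index)
    df[column_name] = df[column_name].replace(rare, category_name)
    return rare

# Toy column: 'kia' (4%) and 'tata' (1%) fall below a 5% threshold.
df = pd.DataFrame({"make": ["ford"] * 50 + ["bmw"] * 45 + ["kia"] * 4 + ["tata"]})
rare = combine_categories_sketch(df, "make", threshold=0.05)
print(rare)  # ['kia', 'tata']
```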
-
-
Continuous:
-
Remove outliers by top and bottom percentage of data boundaries:
remove_outliers_by_boundaries(database, column_name, bot_qu, top_qu)
The theory behind it:
The number of outliers follows a binomial distribution with parameter p, which can generally be well-approximated by a Poisson distribution with λ = pn. For a normal distribution with a cutoff of 3 standard deviations from the mean, p is approximately 0.3%, so for 1000 trials the number of samples whose deviation exceeds 3 sigma can be approximated by a Poisson distribution with λ = 3. Thus, in a normal distribution, the top and bottom boundaries should contain about 99.7% of the data. However, not all data follows a normal distribution, so the user is able to change the top and bottom boundaries.
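The 99.7% figure can be verified directly from the normal CDF. This quick check is not part of the package, just a sanity check on the theory:

```python
import math

def normal_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

within_3 = normal_within(3)
print(f"{within_3:.4f}")  # 0.9973 -> roughly 0.3% of samples fall outside 3 sigma
# Expected number of 3-sigma outliers in 1000 trials, i.e. lambda = p * n:
print(f"lambda = {(1 - within_3) * 1000:.1f}")
```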
Before removing the rows, the function prints a message stating how many rows will be removed and what percentage of the database will be lost. The user inputs 'y' (yes) to proceed or 'n' (no) to cancel the action. If the user chooses 'yes', the method drops the rows and prints the new database shape.
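The core of this operation, stripped of the confirmation prompt, can be sketched with pandas quantiles. This is an illustrative sketch under the assumption that the boundaries are quantile-based, not the package's actual code:

```python
import pandas as pd

def remove_outliers_by_boundaries_sketch(df, column_name, bot_qu, top_qu):
    """Keep only rows whose value lies between the bottom and top quantiles.
    (Illustrative only -- the real function also asks the user to confirm.)"""
    low = df[column_name].quantile(bot_qu)
    high = df[column_name].quantile(top_qu)
    n_drop = ((df[column_name] < low) | (df[column_name] > high)).sum()
    print(f"dropping {n_drop} rows ({100 * n_drop / len(df):.1f}% of the data)")
    return df[df[column_name].between(low, high)]

# Toy data: 99 ordinary prices plus one extreme value.
df = pd.DataFrame({"price": list(range(1, 100)) + [10_000]})
cleaned = remove_outliers_by_boundaries_sketch(df, "price", 0.01, 0.99)
print(cleaned.shape)  # (98, 1)
```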
Let's see a live example:
-
-
Plot: