
Data Preprocessing help


Outliers

Why do outliers occur?

  • Human error
  • Measurement error
  • Variability in data

What is the impact of outliers on our model?

  • They cause various problems in statistical analysis
  • They can significantly shift the mean and the standard deviation

How to identify outliers using graphs

  • Box plot
  • Scatter plot
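A minimal plotting sketch for both, assuming data is a 1-D numeric array and matplotlib is installed:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# box plot: points beyond the whiskers are outlier candidates
ax1.boxplot(data)
ax1.set_title('Box plot')
# scatter plot: isolated points far from the main cloud stand out
ax2.scatter(range(len(data)), data)
ax2.set_title('Scatter plot')
plt.show()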

How to identify (methods)

  • Standard deviation method
## Standard deviation method (when data has a normal/Gaussian distribution)
data_mean = data.mean()
data_std = data.std()
cut_off = data_std * 3
lower_bound, upper_bound = data_mean - cut_off, data_mean + cut_off
# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
# remove outliers
outliers_removed = [x for x in data if lower_bound <= x <= upper_bound]
  • Z-score
## Z-score
from scipy import stats
z_values = np.abs(stats.zscore(data))
threshold = 3
# identify outliers
print(np.where(z_values > threshold))
# keep only the rows where every z-score is below the threshold (assumes 2-D data)
data_o = data[(z_values < threshold).all(axis=1)]
  • IQR
## Interquartile range (for mildly skewed data)
q25 = np.percentile(data, 25)
q75 = np.percentile(data, 75)
IQR = q75 - q25
cut_off = IQR * 1.5
lower_bound, upper_bound = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
# remove outliers
outliers_removed = [x for x in data if lower_bound <= x <= upper_bound]
## Interquartile range (for highly skewed data)
q25 = np.percentile(data, 25)
q75 = np.percentile(data, 75)
IQR = q75 - q25
cut_off = IQR * 3
lower_bound, upper_bound = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]
# remove outliers
outliers_removed = [x for x in data if lower_bound <= x <= upper_bound]
  • Box-Cox transform can also be used for highly skewed data (see the Box-Cox sketch after this list)
  • log transformation
np.log(data[i])
don't forget to apply the inverse (np.exp) when interpreting the results.
  • Isolation forest (see the Isolation Forest sketch after this list)
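A minimal Box-Cox sketch, assuming data is strictly positive (a requirement of the transform); inv_boxcox from scipy.special restores the original scale:

from scipy import stats
from scipy.special import inv_boxcox

# fit the transform; boxcox also returns the fitted lambda
transformed, fitted_lambda = stats.boxcox(data)
# ... work with the transformed values ...
# invert back to the original scale
restored = inv_boxcox(transformed, fitted_lambda)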
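And a minimal Isolation Forest sketch, assuming data is a 1-D numpy array; the contamination value here is an illustrative guess, not a recommendation:

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
# fit_predict returns -1 for outliers and 1 for inliers
labels = iso.fit_predict(data.reshape(-1, 1))
data_clean = data[labels == 1]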

Should we remove outliers?

  • It depends on the use case.
  • Suppose you have a credit card fraud detection dataset: you must keep the anomalies in order to detect the frauds.
  • Another example is sales data, where spikes are very important even though they are outliers compared to the normal data.

Z-score = (x - mean) / standard deviation
68% of the data points lie within +/- 1 standard deviation.
95% of the data points lie within +/- 2 standard deviations.
99.7% of the data points lie within +/- 3 standard deviations.
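A quick numpy check of these percentages on simulated normal data (the sample size is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
for k in (1, 2, 3):
    # fraction of points within +/- k standard deviations
    print(k, (np.abs(sample) <= k).mean())  # ~0.68, ~0.95, ~0.997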

Categorical Feature Encoding

Ordinal Encoding

When feature categories have an order associated with them

(low, medium, high)

OrdinalEncoding

OrdinalEncoder() from sklearn
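A minimal sketch, with a made-up 'size' column to show how the category order is passed explicitly:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

df = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})
# pass the order explicitly: low < medium < high
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
# low -> 0.0, medium -> 1.0, high -> 2.0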

LabelEncoding

LabelEncoder() from sklearn
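A minimal sketch; note that LabelEncoder is meant for target labels and assigns integers in alphabetical order:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(['dog', 'cat', 'dog', 'bird'])
print(le.classes_)   # ['bird' 'cat' 'dog']
print(y_encoded)     # [2 1 2 0]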

Target Guided Ordinal Encoding

"""
[ordinal the labels accrding to the target,
Replace the value by joiint probability of being 1 or 0,
highest mean -> highest rank]
"""
ordinal_labls = data.groupby([i])[target].mean().sort_values().index
lvl_dict = {k:i for i, k in enumerate(ordinal_labls, 0)}
data[i] = data[i].map(lvl_dict)
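A toy demo of the snippet above (the column names are made up):

import pandas as pd

df = pd.DataFrame({'city': ['a', 'b', 'a', 'c', 'b', 'c'],
                   'target': [1, 0, 1, 1, 0, 0]})
ordinal_labels = df.groupby('city')['target'].mean().sort_values().index
lvl_dict = {k: rank for rank, k in enumerate(ordinal_labels)}
df['city'] = df['city'].map(lvl_dict)
# means: b=0.0, c=0.5, a=1.0 -> ranks: b=0, c=1, a=2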

Nominal Encoding

When feature categories have no order associated with them

(state, fruit, etc.)

One hot Encoding

OneHotEncoder() from sklearn
get_dummies from pandas
pd.get_dummies(data, drop_first=True)
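A minimal OneHotEncoder sketch (the 'fruit' column is made up; sparse_output requires scikit-learn >= 1.2):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'cherry']})
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['fruit']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['fruit']))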

KDD Orange (Extension of one hot encoding)

What if a feature has more than 500 different categories?

SOLUTION -> (KDD Orange research paper) limit one-hot encoding to the 10 most frequent labels of the feature

# let's find the top 10 most frequent categories for the variable i
data[i].value_counts().sort_values(ascending=False).head(10)
# grab them
top_labels = [y for y in data[i].value_counts().sort_values(ascending=False).head(10).index]
def one_hot_encoding(data, variable, top):
    for label in top:
        data[variable + '_' + label] = np.where(data[variable] == label, 1, 0)
# Advantages
"""
Straightforward,
does not require variable exploration,
keeps the feature space small
"""
# Disadvantages
"""
Does not add any information that makes the variable more predictive,
discards the information of the ignored labels
"""

Mean Encoding

"""
[Monotic relationship between label and target,
Capture the info within the labels,
Prones to overfitting]
EX - PINCODE
"""
dict_val= data.groupby([i])[target].mean().to_dict()
data[i] = data[i].map(dict_val)

Count / Frequency encoding

"""
[Replace each label of the categorical variable by the count,
this is amount of time a categry appeas inside a feature,
Does not increase the feature space]
[If some of labels have same count, Then diff categry replace with
the same count result that it will loose a lot of information]
"""
count = data[i].value_counts().to_dict()
data[i] = data[i].map(count)