Data Preprocessing help
Outliers
Why do outliers occur?
- Human error
- Measurement error
- Natural variability in the data
What is the impact of outliers on our model?
- They cause various problems in statistical analysis
- They can significantly shift the mean and the standard deviation
How to identify outliers using graphs
- Box plot
- Scatter plot
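A minimal matplotlib sketch of both plots, assuming `data` is a 1-D numeric array (the planted values and file name here are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 200), [95, 3])  # two planted outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(data)                    # points beyond the whiskers are outlier candidates
ax1.set_title("Box plot")
ax2.scatter(range(len(data)), data)  # isolated points far from the cloud stand out
ax2.set_title("Scatter plot")
fig.savefig("outliers.png")
```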
How to identify outliers (methods)
- Standard deviation method
- Z-score
- IQR
## Standard deviation method (when data has a normal / Gaussian distribution)

```python
data_mean = data.mean()
data_std = data.std()
cut_off = data_std * 3
lower_bound, upper_bound = data_mean - cut_off, data_mean + cut_off

# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# remove outliers
outliers_removed = [x for x in data if lower_bound < x < upper_bound]
```
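As a quick check of the 3-sigma rule above on synthetic data (the variable names and planted values here are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.append(rng.normal(100, 10, 1000), [200.0])  # one planted outlier

mean, std = data.mean(), data.std()
cut_off = 3 * std
lower, upper = mean - cut_off, mean + cut_off
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # the planted 200.0 should be flagged (a few natural tail points may be too)
```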
## Z-score

```python
import numpy as np
from scipy import stats

z = np.abs(stats.zscore(data))
threshold = 3
print(np.where(z > threshold))                # positions of outliers
data_o = data_o[(z < threshold).all(axis=1)]  # keep only rows with no outlier in any column
```
## Interquartile range (for mildly skewed data)

```python
import numpy as np

q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
IQR = q75 - q25
cut_off = IQR * 1.5
lower_bound, upper_bound = q25 - cut_off, q75 + cut_off

# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# remove outliers
outliers_removed = [x for x in data if lower_bound < x < upper_bound]
```
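A small sketch of the 1.5×IQR fence on a made-up sample (the data values are illustrative only):

```python
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
                 12, 14, 17, 19, 107, 10, 13, 12, 14, 12])

q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q75 - q25
cut_off = 1.5 * iqr
lower, upper = q25 - cut_off, q75 + cut_off
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # the extreme values 102 and 107 fall outside the fence
```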
## Interquartile range (for highly skewed data)

```python
import numpy as np

q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
IQR = q75 - q25
cut_off = IQR * 3
lower_bound, upper_bound = q25 - cut_off, q75 + cut_off

# identify outliers
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# remove outliers
outliers_removed = [x for x in data if lower_bound < x < upper_bound]
```

- Box-Cox transform can also be used for highly skewed data
- Log transformation

```python
data[i] = np.log(data[i])  # don't forget to inverse-transform later with np.exp
```

- Isolation forest
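A minimal isolation-forest sketch with scikit-learn; the `contamination` value is an assumption you would tune, and the planted points are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (200, 2)),
                    [[8.0, 8.0], [-9.0, 7.0]]])  # two planted outliers

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)   # -1 = outlier, 1 = inlier
X_clean = X[labels == 1]
print((labels == -1).sum())   # number of points flagged as outliers
```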
Should we remove outliers?
- It depends on the use case
- Suppose you have a card-fraud-detection dataset: you need the anomalies to detect the frauds
- Another example is sales data, where spikes are outliers compared to the normal data but carry very important information
Z-score = (x - mean) / standard deviation
- 68% of the data points lie within +/- 1 standard deviation
- 95% of the data points lie within +/- 2 standard deviations
- 99.7% of the data points lie within +/- 3 standard deviations

Categorical Feature Encoding
Ordinal Encoding
When feature categories have an order associated with them
(e.g. low, medium, high)
- `OrdinalEncoder()` from sklearn
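A sketch with sklearn's `OrdinalEncoder`, passing the category order explicitly (the column name is made up):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# give the encoder the intended order, otherwise it sorts alphabetically
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["size_enc"] = enc.fit_transform(df[["size"]])
print(df["size_enc"].tolist())  # → [0.0, 2.0, 1.0, 0.0]
```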
Label Encoding
- `LabelEncoder()` from sklearn

Target Guided Ordinal Encoding
Order the labels according to their mean target value and replace each label by its rank (highest mean -> highest rank):

```python
ordinal_labels = data.groupby([i])[target].mean().sort_values().index
lvl_dict = {label: rank for rank, label in enumerate(ordinal_labels)}
data[i] = data[i].map(lvl_dict)
```

Nominal Encoding
When feature categories have no order associated with them
(e.g. state, fruit)

One hot Encoding
- `OneHotEncoder()` from sklearn
- `get_dummies` from pandas: `pd.get_dummies(data, drop_first=True)`

KDD Orange (Extension of one hot encoding)
What if a feature has more than 500 different categories?
SOLUTION -> (KDD Orange research paper) limit one hot encoding to the 10 most frequent labels of the feature
```python
# find the 10 most frequent categories of variable i
data[i].value_counts().sort_values(ascending=False).head(10)

# grab them
top_labels = list(data[i].value_counts().sort_values(ascending=False).head(10).index)

def one_hot_encoding(data, variable, top):
    for label in top:
        data[variable + '_' + str(label)] = np.where(data[variable] == label, 1, 0)
```

Advantages: straightforward; does not require variable exploration; low feature space.
Disadvantages: does not add any information that makes the variable more predictive; loses the information of the ignored labels.

Mean Encoding
Creates a monotonic relationship between label and target; captures the information within the labels; prone to overfitting. Example: pincode.

```python
dict_val = data.groupby([i])[target].mean().to_dict()
data[i] = data[i].map(dict_val)
```

Count / Frequency encoding
Replace each label of the categorical variable by its count, i.e. the number of times the category appears in the feature. Does not increase the feature space. Caveat: if some labels have the same count, different categories get replaced with the same value, losing information.

```python
count = data[i].value_counts().to_dict()
data[i] = data[i].map(count)
```
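To tie the last two encodings together, a toy end-to-end sketch (the column and target names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "B", "A", "C", "B", "A"],
    "target": [1, 0, 1, 0, 1, 0],
})

# mean (target) encoding: A -> 2/3, B -> 1/2, C -> 0
mean_map = df.groupby("city")["target"].mean().to_dict()
df["city_mean"] = df["city"].map(mean_map)

# count / frequency encoding: A -> 3, B -> 2, C -> 1
count_map = df["city"].value_counts().to_dict()
df["city_count"] = df["city"].map(count_map)

print(df)
```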