Cleaning data… that sucks

Physics

theoretical mechanics

The preprocessing of data - or what is left of it.

Author

Fujimiya Amane

Published

March 11, 2021

Modified

September 10, 2021

Well, everything needs cleaning, and data is no exception: even a single bad example can skew the whole dataset and the model trained on it.

Scaling feature values

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:

Helps gradient descent converge more quickly.
Helps avoid the “NaN trap,” in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.

One way to scale numerical data is to linearly map \([min,max]\) onto a standard range, for example \([-1,1]\). Another way is Z-score scaling, which is based on the normal distribution:

\[val_{scl}(i) = \frac{val_{i} - \mu}{\sigma}\]

where

\[\begin{cases} val_{i} \longrightarrow \text{original value} \\ \mu = \displaystyle{\frac{1}{n}\sum_{j=1}^{n}val_{j}} \longrightarrow \text{mean value} \\ \sigma \longrightarrow \text{standard deviation} \end{cases}\]

Handling extreme outliers

We have two techniques:

Logarithmic scaling
Clipping (capping) values

Either may or may not be enough on its own. It is usually advisable to combine them: a logarithmic scale smooths out the distribution but does not always deal with the outliers, which clipping can then cap.
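As a rough illustration of the two scaling schemes and the two outlier techniques above, here is a minimal sketch in plain Python. The function names, the sample values, and the clipping bounds are assumptions made for this example. The Z-score version takes the value minus the mean, divided by the population standard deviation.

```python
import math
import statistics


def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range to stretch; map everything to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def log_scale(values):
    """Compress a long tail with log(1 + v); assumes every value is >= 0."""
    return [math.log1p(v) for v in values]


def clip(values, lower, upper):
    """Cap every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature in its natural range of 100 to 900.
raw = [100, 300, 500, 700, 900]
scaled = min_max_scale(raw)        # [-1.0, -0.5, 0.0, 0.5, 1.0]
standardized = z_score_scale(raw)  # mean 0, standard deviation 1

# Hypothetical feature with one extreme outlier: clip first, then take the log.
with_outlier = [1, 5, 50, 1000]
capped = clip(with_outlier, 0, 100)  # [1, 5, 50, 100]
smoothed = log_scale(capped)
```

Whether to clip before or after taking the logarithm depends on the data; the order here is only one possible choice.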

Binning

Binning means splitting a feature whose relationship with the label is non-linear into “bins,” within each of which the relationship is mostly linear. The model can then learn a separate weight for each bin instead of forcing one linear relationship onto the whole range. In other words, each bin stands for a different situation, and the model learns the mathematical relationship that holds for that specific bin.

![[Pasted image 20240121160334.png]]

For this example, we can split the range into equal bins:

![[Pasted image 20240121160415.png]]

Instead of having one floating-point feature, we now have 11 distinct boolean features (LatitudeBin1, LatitudeBin2, …, LatitudeBin11). Having 11 separate features is somewhat inelegant, so let’s unite them into a single 11-element vector. Doing so will enable us to represent latitude 37.4 as follows:

latitude_vec[11] = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
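A vector like this can be built with a small helper. The sketch below is only an illustration: the function name is made up, and the latitude range of 32 to 43 degrees (one bin per degree) is an assumption chosen so that 37.4 lands in the sixth of the 11 bins, as in the vector above.

```python
def one_hot_bin(value, low, high, n_bins):
    """Return a one-hot list marking which equal-width bin `value` falls into."""
    width = (high - low) / n_bins
    index = int((value - low) // width)
    # Values outside [low, high) are pushed into the first or last bin.
    index = max(0, min(n_bins - 1, index))
    vec = [0] * n_bins
    vec[index] = 1
    return vec


# Assumed range: 32 to 43 degrees of latitude, 11 bins of one degree each.
latitude_vec = one_hot_bin(37.4, low=32.0, high=43.0, n_bins=11)
print(latitude_vec)
```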

Scrubbing

Until now, we’ve assumed that all the data used for training and testing was trustworthy. In real life, many examples in data sets are unreliable due to one or more of the following:

Omitted values. For instance, a person forgot to enter a value for a house’s age.
Duplicate examples. For example, a server mistakenly uploaded the same logs twice.
Bad labels. For instance, a person mislabeled a picture of an oak tree as a maple.
Bad feature values. For example, someone typed in an extra digit, or a thermometer was left out in the sun.

Once detected, you typically “fix” bad examples by removing them from the data set. To detect omitted values or duplicated examples, you can write a simple program. Detecting bad feature values or labels can be far trickier.
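The "simple program" for omitted values and duplicates could look something like the sketch below. It assumes each example is a dict with hashable values and that an omitted value shows up as a missing key or as None. The function names and the sample records are hypothetical.

```python
def find_omitted(examples, required_fields):
    """Return the indices of examples where a required field is absent or None."""
    return [
        i
        for i, example in enumerate(examples)
        if any(example.get(field) is None for field in required_fields)
    ]


def drop_duplicates(examples):
    """Keep only the first occurrence of each identical example."""
    seen = set()
    unique = []
    for example in examples:
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(example.items()))
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


# Hypothetical housing records: one has no age, one is uploaded twice.
houses = [
    {"id": 1, "age": 12, "price": 250000},
    {"id": 2, "age": None, "price": 310000},
    {"id": 1, "age": 12, "price": 250000},
]

missing = find_omitted(houses, ["age", "price"])
deduplicated = drop_duplicates(houses)
```

Bad labels and bad feature values cannot be caught this mechanically, which is why they are the trickier case.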

In addition to detecting bad individual examples, you must also detect bad data in the aggregate. Histograms are a great mechanism for visualizing your data in the aggregate. In addition, getting statistics like the following can help:

Maximum and minimum
Mean and median
Standard deviation
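These statistics are all available in Python's standard `statistics` module. The sketch below uses made-up temperature readings in which the last value plays the role of the thermometer left out in the sun; the `summarize` helper is a hypothetical name.

```python
import statistics


def summarize(values):
    """Collect the aggregate statistics listed above for one numeric feature."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical readings; the last one is suspiciously high.
temperatures = [21.5, 22.0, 21.8, 22.3, 85.0]
summary = summarize(temperatures)
for name, value in summary.items():
    print(name, value)
```

A mean far above the median, together with a large standard deviation, is a hint that a few extreme values are worth a closer look.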

Consider generating lists of the most common values for discrete features. For example, does the number of examples with country:uk match the number you expect? Should language:jp really be the most common language in your data set?
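One way to get such a list is `collections.Counter`. This is a minimal sketch assuming the examples are dicts; the helper name and the sample rows are invented for illustration.

```python
from collections import Counter


def most_common_values(examples, feature, n=3):
    """Return the n most frequent values of a discrete feature with their counts."""
    counts = Counter(ex[feature] for ex in examples if feature in ex)
    return counts.most_common(n)


# Hypothetical examples with two discrete features.
rows = [
    {"country": "uk", "language": "en"},
    {"country": "uk", "language": "en"},
    {"country": "jp", "language": "jp"},
    {"country": "us", "language": "en"},
]

top_countries = most_common_values(rows, "country", n=1)
top_languages = most_common_values(rows, "language", n=2)
```

Comparing these counts with what you expect from the data source is a quick sanity check.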