Step 2-B: Pre-processing Data

The raw data that you get directly from your sources is almost never in the format you need to perform analysis on. There are two main goals in the data pre-processing step. The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis.

A very important part of data preparation is to address quality issues in your data. Real-world data is messy. There are many examples of quality issues in data from real applications: inconsistent data, such as a customer's address recorded differently at two sales locations, with the two recordings disagreeing; duplicate customer records; missing values, such as a missing customer age in a demographic study; invalid data, such as a six-digit zip code; and outliers, such as a sensor failure causing values to be much higher or lower than expected for a period of time.

Since we get the data downstream, we usually have little control over how it is collected, so preventing data quality problems at collection time is often not an option. We have the data that we get, and we have to address quality issues by detecting and correcting them. Here are some approaches we can take to address these quality issues. We can remove data records with missing values. We can merge duplicate records; this requires a way to resolve conflicting values, and perhaps it makes sense to retain the newer value whenever there is a conflict. For missing or invalid values, the best estimate of a reasonable value can be used as a replacement. For example, a missing age value for an employee can be estimated from the employee's length of employment. Outliers can also be removed if they are not important to the task. To address data quality issues effectively, knowledge about the application, such as how the data was collected, the user population, and the intended uses of the application, is important. This domain knowledge is essential to making informed decisions on how to handle incomplete or incorrect data.

The second part of preparing data is to manipulate the clean data into the format needed for analysis. This step is known by many names: data manipulation, data preprocessing, data wrangling, and even data munging. Common operations in this step include scaling, transformation, feature selection, dimensionality reduction, and data manipulation.

Scaling involves changing the range of values to lie within a specified range, such as from zero to one. This is done to prevent features with large values from dominating the results. For example, in analyzing data with height and weight, the magnitude of the weight values is much greater than that of the height values, so scaling all values to be between zero and one equalizes the contributions of the height and weight features.

Various transformations can be performed on the data to reduce noise and variability. One such transformation is aggregation. Aggregating data generally results in data with less variability, which may help with your analysis. For example, daily sales figures may have many sharp changes; aggregating values to weekly or monthly sales figures will result in smoother data. Other filtering techniques can also be used to remove variability in the data.
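To make the cleaning, scaling, and aggregation ideas above concrete, here is a minimal sketch using pandas. The library choice and all column names (customer_id, age, zip_code, sales, updated_at) are assumptions made for illustration, not part of the original lecture; the exact rules you apply would depend on your own data and domain knowledge.

```python
# A minimal sketch, assuming a small pandas DataFrame of hypothetical customer
# sales records with the kinds of quality issues described above.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age":         [34, 34, None, 29, 250],              # missing and invalid values
    "zip_code":    ["94109", "94109", "123456", "10001", "60601"],  # one invalid zip
    "sales":       [120.0, 130.0, 75.5, 60.0, 88.0],
    "updated_at":  pd.to_datetime(
        ["2023-01-02", "2023-01-05", "2023-01-03", "2023-01-04", "2023-01-06"]),
})

# Merge duplicate customer records, retaining the newest value on conflict.
df = (df.sort_values("updated_at")
        .drop_duplicates(subset="customer_id", keep="last"))

# Treat invalid values as missing: a zip code must be five digits,
# and an age must fall in a plausible range.
df.loc[~df["zip_code"].str.fullmatch(r"\d{5}"), "zip_code"] = None
df["age"] = df["age"].where(df["age"].between(0, 120))

# Impute a reasonable estimate for missing ages (here, the median),
# then drop records still missing a required field.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["zip_code"])

# Scaling: rescale the sales feature to the range [0, 1] (min-max scaling).
df["sales_scaled"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: roll noisy daily figures up to weekly totals to reduce variability.
weekly_sales = df.set_index("updated_at")["sales"].resample("W").sum()

print(df)
print(weekly_sales)
```

In practice each of these choices (drop versus impute, which value wins a conflict, how wide a "plausible" range is) is exactly where the domain knowledge mentioned above comes in.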
Of course, aggregation and filtering come at the cost of less detailed data, so these factors must be weighed for the specific application.

Feature selection can involve removing redundant or irrelevant features, combining features, and creating new features. During the data exploration step, you might have discovered that two features are correlated; in that case, one of them can be removed without negatively affecting the analysis results. For example, the purchase price of a product and the amount of sales tax paid are likely to be correlated, so eliminating the sales tax amount will be beneficial. Removing redundant or irrelevant features makes the subsequent analysis much simpler. In other cases, you may want to combine features or create new ones. For example, adding the applicant's education level as a feature to a loan approval application would make sense. There are also algorithms that automatically determine the most relevant features based on various mathematical properties.

Dimensionality reduction is useful when the data set has a large number of dimensions. It involves finding a smaller subset of dimensions that captures most of the variation in the data. This reduces the dimensionality of the data while eliminating irrelevant features and makes the analysis simpler. A technique commonly used for dimensionality reduction is called principal component analysis, or PCA.

Raw data often has to be manipulated into the correct format for analysis. For example, from samples recording daily changes in stock prices, we may want to capture price changes for particular market segments, like real estate or health care. This would require determining which stocks belong to which market segment, grouping them together, and perhaps computing the mean, range, and standard deviation for each group. A short sketch of these feature selection, PCA, and grouping operations is included after the summary below.

In summary, data preparation is a very important part of the data science process. In fact, this is where you will spend most of your time on any data science effort. It can be a tedious process, but it is a crucial step. Always remember: garbage in, garbage out. If you don't spend the time and effort to create good data for the analysis, you will not get good results, no matter how sophisticated your analysis techniques are.
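As a companion to the cleaning sketch above, here is a minimal sketch of the feature selection, dimensionality reduction (PCA), and group-wise summarization operations described in this section, using pandas and scikit-learn. The data, the 0.95 correlation cutoff, and the column and segment names (ticker, segment, price, sales_tax, daily_change) are all assumptions made for illustration.

```python
# A minimal sketch, assuming a tiny hypothetical table of daily stock data.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

prices = pd.DataFrame({
    "ticker":       ["AAA", "BBB", "CCC", "DDD"],
    "segment":      ["real_estate", "real_estate", "health_care", "health_care"],
    "price":        [100.0, 250.0, 80.0, 120.0],
    "sales_tax":    [8.0, 20.0, 6.4, 9.6],      # perfectly correlated with price
    "daily_change": [1.2, -0.5, 0.8, 2.1],
})

# Feature selection: drop one of a pair of highly correlated features.
numeric = prices[["price", "sales_tax", "daily_change"]]
corr = numeric.corr().abs()
if corr.loc["price", "sales_tax"] > 0.95:
    numeric = numeric.drop(columns=["sales_tax"])

# Dimensionality reduction with PCA: standardize the remaining features and
# project them onto the components that capture most of the variance.
X = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Data manipulation: group stocks by market segment and summarize each group
# with the mean, standard deviation, and range of the daily price changes.
summary = prices.groupby("segment")["daily_change"].agg(
    mean="mean",
    std="std",
    value_range=lambda s: s.max() - s.min(),
)
print(summary)
```

On real data you would inspect the explained variance ratio to decide how many principal components to keep, rather than fixing the number in advance as this sketch does.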