TL Consulting Group

ai

How Exploratory Data Analysis (EDA) Can Improve Your Data Understanding Capability

How Exploratory Data Analysis (EDA) Can Improve Your Data Understanding Capability Can EDA help to make my phone upgrade decision more precise? You may have heard the term Exploratory Data Analysis (or EDA for short) and wondered what EDA is all about. Recently, one of the Sales team members at TL Consulting Group were thinking of buying a new phone but they were overwhelmed by the many options and they needed to make a decision suited best to their work needs, i.e. Wait for the new iPhone or make an upgrade on the current Android phone. There can be no disagreement on the fact that doing so left them perplexed and with a number of questions that needed to be addressed before making a choice. What was the specification of the new phone and how was that phone better than their current mobile phone? To help enable curiosity and decision-making, they visited YouTube to view the new iPhone trailer and also learned more about the new iPhone via user ratings and reviews from YouTube and other websites. Then they came and asked us how we would approach it from a Data Analytics perspective in theory. And our response was, whatever investigating measures they had already taken before making the decision, this is nothing more but what ML Engineers/data analysts in their lingo call ‘Exploratory Data Analysis’. What is Exploratory Data Analysis? In an automated data pipeline, exploratory data analysis (EDA) entails using data visualisation and statistical tools to acquire insights and knowledge from the data as it travels through the pipeline. At each level of the pipeline, the goal is to find patterns, trends, anomalies, and potential concerns in the data. Exploratory Data Analysis Lifecycle To interpret the diagram and the iPhone scenario in mind, you can think of all brand-new iPhones as a “population” and to make its review, the reviewers will take some iPhones from the market which you can say is a “sample”. The reviewers will then experiment with that phone and will apply different mathematical calculations to define the “probability” if that phone is worth buying or not. It will also help to define all the good and bad properties of the new iPhone which is called “Inference “. Finally, all these outcomes will help potential customers to make their decision with confidence. Benefits of Exploratory Data Analysis The main idea of exploratory data analysis is “Garbage in, Perform Exploratory Data Analysis, possibly Garbage out.” By conducting EDA, it is possible to turn an almost usable dataset into a completely usable dataset. It includes: Key Steps of EDA The key steps involved in conducting EDA on an automated data pipeline are: Types of Exploratory Data Analysis EDA builds a robust understanding of the data, and issues associated with either the info or process. It’s a scientific approach to getting the story of the data. There are four main types of exploratory data analysis which are listed below: 1. Univariate Non-Graphical Let’s say you decide to purchase a new iPhone solely based on its battery size, disregarding all other considerations. You can use univariate non-graphical analysis which is the most basic type of data analysis because we only utilize one variable to gather the data. Knowing the underlying sample distribution and data and drawing conclusions about the population are the usual objectives of univariate non-graphical EDA. Additionally included in the analysis is outlier detection. The traits of population dispersal include: Spread: Spread serves as a gauge for how far away from the Centre we should search for the information values. Two relevant measurements of spread are the variance and the quality deviation. Because the variance is the root of the variance, it is defined as the mean of the squares of the individual deviations. Central tendency: Typical or middle values are related to the central tendency or position of the distribution. Statistics with names like mean, median, and sometimes mode are valuable indicators of central tendency; the mean is the most prevalent. The median may be preferred in cases of skewed distribution or when there is worry about outliers. Skewness and kurtosis: The distribution’s skewness and kurtosis are two more useful univariate characteristics. When compared to a normal distribution, kurtosis and skewness are two different measures of peakedness. 2. Multivariate Non-Graphical Think about a situation where you want to purchase a new iPhone solely based on the battery capacity and phone size. In either cross-tabulation or statistics, multivariate non-graphical EDA techniques are frequently used to illustrate the relationship between two or more variables. An expansion of tabulation known as cross-tabulation is very helpful for categorical data. By creating a two-way table with column headings that correspond to the amount of one variable and row headings that correspond to the amount of the opposing two variables, a cross-tabulation is preferred for two variables. All subjects that share an analogous pair of levels are then included in the counts. For each categorical variable and one quantitative variable, we create statistics for quantitative variables separately for every level of the specific variable then compare the statistics across the amount of categorical variable. It is possible that comparing medians is a robust version of one-way ANOVA, whereas comparing means is a quick version of ANOVA. 3. Univariate Graphical Different Univariate Graphics Imagine that you only want to know the latest iPhone’s speed based on its CPU benchmark results before you decide to purchase it. Since graphical approaches demand some level of subjective interpretation in addition to being quantitative and objective, they are utilized more frequently than non-graphical methods because they can provide a comprehensive picture of the facts. Some common sorts of univariate graphics are: Boxplots: Boxplots are excellent for displaying data on central tendency, showing reliable measures of location and spread, as well as information on symmetry and outliers, but they can be deceptive when it comes to multimodality. The type of side-by-side boxplots is among the simplest applications for boxplots. Histogram: A histogram, which can be a barplot

How Exploratory Data Analysis (EDA) Can Improve Your Data Understanding Capability Read More »

Data, , ,

The Importance of Feature Engineering in ML Modelling

When building Machine Learning (ML) models, we often encounter unorganised and chaotic data. In order to transform this data into explainable features, we rely on the process of feature engineering. Feature engineering plays a crucial role in the Cross Industry Standard Process for Data Mining (CRISP-DM). It is an integral part of the Data Preparation Step, responsible for organising the data effectively before it is ready for modeling. The diagram below illustrates the significance of feature engineering (FE) in the data mining process. CRISP-DM Process Model What is Feature Engineering? Feature Engineering (FE) is the process of extracting and organising important information from raw data in such a way that it fits the machine learning (ML) model. Feature Engineering Process(FE) Source: https://www.omnisci.com/technical-glossary/feature-engineering) Why is Feature Engineering Important? Feature Engineering (FE) has many benefits to offer in the CRISP-DM process. They include: Provides more flexibility and less complexity in models Faster data processing Understanding of models becomes easier Better understanding of the problem and questions to be answered Feature Engineering Techniques for Machine Learning (ML) Below is a list of feature engineering techniques and we will summarise each: Imputation: Handling Outliers Log Transformation One-Hot Encoding Scaling 1. Imputation Missing values is one of the most typical problems when it comes to data preparation. Human errors and dataflow interruptions are some of the major contributors to this problem. Moreover, missing values can detrimentally impact the performance of the ML models. An example of an imputation of NA values with Zero Imputation is frequently employed in healthcare research, such as when dealing with patient records that may have missing values for certain medical measurements. By imputing the missing data using methods like mean imputation or regression imputation, researchers can ensure that a complete dataset is available for analysis, allowing for more accurate assessments and predictions. 2. Handling Outliers Handling Outliers within datasets is an important technique with the purpose of creating an accurate representation of the data. This step must be completed prior to the model training step. There are various methods of handling outliers that include removal, replacing values, capping, and discretization. These methods will be discussed in detail in future blogs. An example of outliers Handling outliers is essential in financial analysis, for instance, when examining stock market data. By detecting and appropriately treating outliers using techniques like Winsorization or trimming, analysts can ensure that extreme values do not unduly influence statistical measures, leading to more robust and reliable insights and decision-making. 3. Log Transformation Log Transformation is one of the most prevalent methods used by data professionals. The technique transforms a skewed distribution of data into normally distributed or slightly skewed data. Therefore, making the data approximate for normal applications is required for different kinds of data analysis. Examples of Log Transformed Data Log transformation is commonly applied in skewed data distributions, such as when dealing with income or population data. By taking the logarithm of the values, the skewed distribution can be transformed into a more symmetric shape, facilitating more accurate modeling, analysis, and interpretation of the data. 4. One Hot Encoding One-Hot Encoding is a technique of preprocessing categorical variables into ML models. The encoding transforms a category variable into a binary feature for each category. It typically assigns a value of ‘1’ to the binary feature it corresponds to and all other binary features are set to ‘0’. An example of One Hot Encoding One-hot encoding is widely used in categorical data processing, such as in natural language processing tasks like sentiment analysis. By converting categorical variables into binary vectors, each representing a unique category, one-hot encoding enables machine learning algorithms to effectively interpret and utilize categorical data, facilitating accurate classification and prediction tasks. 5. Scaling Feature scaling is one of the hardest problems in data science to get right. However, it is not a mandatory step for all machine learning models. It is only applicable to distance-based machine learning models. The training model process requires data with a known set of features that need to be scaled up or down where it is deemed appropriate. The outcome of the scaling operation transforms continuous data to be similar in terms of range. The most popular techniques for scaling are Normalization and Standardisation, which will be discussed in detail in future blogs. Examples of Scaling Scaling is often used in image processing, such as when resizing images for a computer vision task. Scaling the images to a consistent size, regardless of their original dimensions ensures that the images can be properly processed and analysed, allowing for fair comparisons and accurate feature extraction in tasks like object recognition or image classification. Feature Engineering Tools There are a set of feature engineering tools that are popular in the market in terms of the capabilities it provides. We have listed a few of our recommendations: FeatureTools AutoFeat TsFresh OneBM ExploreKit Conclusion In summary, Feature Engineering is a crucial step in the CRISP-DM process before we even think about training our machine learning models. One of the core advantages include the training time of models is reduced significantly. As a result, it allows for a drastic reduction of cost in terms of utilisation of expensive computing resources. In this article, we learned a number of feature engineering techniques and tools that are used in the industry. Here at TL Consulting, our data consultants are experts at using feature engineering techniques to build highly accurate machine learning models, enabling us to deliver high-quality outcomes to support our customer’s data analytics needs. TL Consulting provides advisory and transformation services in the data analytics & engineering domain and has helped many organisations achieve their digital transformation goals. Visit TL Consulting’s data-engineering page to learn more about our service capabilities and send us an enquiry if you’d like to learn more about how our dedicated consultants can help you.

The Importance of Feature Engineering in ML Modelling Read More »

Data, , ,