Different Types of Means for EDA in Artificial Intelligence
Photo by Tingey Injury Law Firm on Unsplash
I hope you are doing great. Today we will discuss an important topic in descriptive statistics: the different types of mean. By the end of this blog you will know what the different types of mean are and when to use each one, with Python code. So let's get started.
What does EDA stand for?
EDA stands for exploratory data analysis. It is one of the most important steps in the lifecycle of any data-based project, because it uses statistical techniques and visualizations to extract insights from the data that can help in building better models.
Why is EDA important?
If you want to build a good model, keep this in mind: "the more quality time you spend with your data, the better your chances of building a good model".
The reason is that knowing the ins and outs of your data helps greatly during feature engineering, when selecting an algorithm for model training, and even when choosing the right evaluation metrics for your trained model.
Why are there different types of means?
Most people only know everyone's favourite and simplest average, the "arithmetic mean". In data science and data analysis, however, other types of means are used to find the average value in specific scenarios. This post covers four of them: arithmetic mean, weighted mean, trimmed mean and harmonic mean.
In the next couple of minutes we will see when to use each type of mean, using the Python programming language.
Arithmetic mean
The arithmetic mean, or simple mean, is one of the most widely used statistics for finding the average value of data. Keep in mind that the arithmetic mean works best when the distribution of the data is roughly symmetric (for example, close to normal). If the data is skewed, the arithmetic mean gets pulled toward the outliers and may no longer represent a typical value.
If you want to know how to check the distribution of your data and how to calculate the mean, don't worry, we will go through both from scratch.
import matplotlib.pyplot as plt
import seaborn as sns

def arithmetic_mean(numbers):
    if len(numbers) == 0:
        return 0  # To handle the case when the list is empty
    total = sum(numbers)
    mean = total / len(numbers)
    return mean

# Example usage:
numbers = [1, 2, 3, 4, 5]
mean = arithmetic_mean(numbers)
print("Arithmetic Mean:", mean)

# To check the distribution of the data, plot a histogram with a KDE curve
sns.histplot(numbers, kde=True)
plt.show()
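To see why outliers matter for the arithmetic mean, here is a small sketch that compares the mean and the median before and after one extreme value is added. The salary figures and variable names are made up for illustration; only Python's built-in `statistics` module is used.

```python
import statistics

# Hypothetical, made-up salaries (in thousands); not real data
salaries = [30, 32, 35, 36, 38, 40]
with_outlier = salaries + [400]  # the same list plus one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

# The mean more than doubles, while the median barely moves
print("Mean without outlier:", mean_before)
print("Mean with outlier:", mean_after)
print("Median without outlier:", median_before)
print("Median with outlier:", median_after)
```

A large gap between the mean and the median like this one is a quick hint that the data is skewed and that the arithmetic mean alone may be misleading.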
Trimmed mean
The trimmed mean is used when the data is skewed or, in simpler terms, when the data contains outliers.
It works by removing a fixed percentage of the highest and lowest values in a dataset before calculating the mean. For example, a 10% trimmed mean removes the highest 10% and the lowest 10% of the values and then averages what is left.
def trimmed_mean(numbers, percent):
    if len(numbers) == 0:
        return 0  # To handle the case when the list is empty
    # Calculate the number of elements to remove from each end
    num_elements = len(numbers)
    num_elements_to_remove = int(num_elements * percent / 100)
    # Sort the numbers in ascending order
    sorted_numbers = sorted(numbers)
    # Remove the specified number of elements from each end.
    # An explicit end index is used because a slice ending at -0 would be empty.
    trimmed_numbers = sorted_numbers[num_elements_to_remove:num_elements - num_elements_to_remove]
    if len(trimmed_numbers) == 0:
        return 0  # To handle the case when trimming removes every value
    # Calculate the mean of the trimmed numbers
    mean = sum(trimmed_numbers) / len(trimmed_numbers)
    return mean

# Example usage:
numbers = [1, 2, 3, 4, 5]
percent = 20  # Trim 20% of the values from each end
trimmed_mean_value = trimmed_mean(numbers, percent)
print("Trimmed Mean:", trimmed_mean_value)
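The effect of trimming is easier to see on data that actually contains an outlier. The sketch below is a self-contained illustration under assumptions: the helper name `trimmed_mean_sketch` and the data are hypothetical, and it raises an error instead of returning 0 when nothing is left to average.

```python
import statistics

def trimmed_mean_sketch(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # number of values to drop per end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("nothing left to average; check the data and percent")
    return statistics.mean(kept)

# Hypothetical data: nine typical values plus one extreme outlier
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 200]

plain = statistics.mean(data)
trimmed = trimmed_mean_sketch(data, 10)  # drops the lowest (10) and highest (200)

print("Arithmetic mean:", plain)
print("10% trimmed mean:", trimmed)
```

Here the single value 200 pulls the arithmetic mean to 31.6, while the 10% trimmed mean stays at 13.25, close to the bulk of the data.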
Weighted mean
The weighted mean is used when the values in a dataset do not all have equal importance and each value has a weight associated with it.
To better understand when to use the weighted mean, let's say we want to evaluate a student's performance based on their average exam score. Not all exam scores should count equally, because some subjects matter more than others, so in such scenarios we assign a weight to each subject.
def weighted_mean(numbers, weights):
    if len(numbers) == 0 or len(weights) == 0 or len(numbers) != len(weights):
        return 0  # To handle the case when either list is empty or their lengths don't match
    weighted_sum = 0
    total_weight = 0
    for number, weight in zip(numbers, weights):
        weighted_sum += number * weight
        total_weight += weight
    mean = weighted_sum / total_weight
    return mean

# Example usage:
numbers = [1, 2, 3, 4, 5]
weights = [0.1, 0.2, 0.3, 0.2, 0.2]
weighted_mean_value = weighted_mean(numbers, weights)
print("Weighted Mean:", weighted_mean_value)
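The exam score scenario can be sketched as follows. The subjects, scores and credit-hour weights are made-up values chosen for illustration, and the weights do not need to add up to 1 because the result is divided by their total.

```python
# Hypothetical subjects with made-up scores and credit hours (used as weights)
scores = {"Math": 90, "Physics": 80, "Art": 60}
credit_hours = {"Math": 4, "Physics": 3, "Art": 1}

# Weighted mean: sum(score * weight) / sum(weights)
weighted_sum = sum(scores[subject] * credit_hours[subject] for subject in scores)
total_weight = sum(credit_hours.values())
weighted_avg = weighted_sum / total_weight

# Simple mean for comparison: every subject counts equally
simple_avg = sum(scores.values()) / len(scores)

print("Weighted average:", weighted_avg)
print("Simple average:", round(simple_avg, 2))
```

Because the student did best in the subjects with the most credit hours, the weighted average (82.5) comes out higher than the simple average (about 76.67).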
Harmonic mean
The harmonic mean is calculated by taking the reciprocal of the arithmetic mean of the reciprocals of a set of numbers. It is often used to average rates or ratios, and for positive numbers it is always less than or equal to the arithmetic mean.
A common example of the harmonic mean in the world of AI is the F1 score, a classification metric defined as the harmonic mean of precision and recall.
def harmonic_mean(numbers):
    if len(numbers) == 0:
        return 0  # To handle the case when the list is empty
    # Note: every number must be non-zero, otherwise 1 / number fails
    reciprocal_sum = 0
    for number in numbers:
        reciprocal_sum += 1 / number
    # Use a local name that does not shadow the function name
    result = len(numbers) / reciprocal_sum
    return result

# Example usage:
numbers = [1, 2, 3, 4, 5]
harmonic_mean_value = harmonic_mean(numbers)
print("Harmonic Mean:", harmonic_mean_value)
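Two quick illustrations of where the harmonic mean fits, sketched with Python's built-in `statistics.harmonic_mean`. The trip speeds and the precision and recall values are made-up numbers, not results from a real model.

```python
import statistics

# Rates: the same distance driven at 40 km/h and then at 60 km/h (hypothetical trip).
# The true average speed over the whole trip is the harmonic mean, 48 km/h,
# not the arithmetic mean of 50 km/h.
speeds = [40, 60]
avg_speed = statistics.harmonic_mean(speeds)
naive_avg = statistics.mean(speeds)

# F1 score: harmonic mean of precision and recall (made-up values)
precision, recall = 0.5, 1.0
f1_from_hm = statistics.harmonic_mean([precision, recall])
f1_formula = 2 * precision * recall / (precision + recall)

print("Average speed (harmonic):", avg_speed)
print("Average speed (arithmetic):", naive_avg)
print("F1 via harmonic mean:", round(f1_from_hm, 4))
print("F1 via usual formula:", round(f1_formula, 4))
```

Both F1 values agree, and the harmonic mean sits closer to the smaller of the two inputs, which is why a model needs both good precision and good recall to get a high F1 score.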
Special Note
I hope you now have a good understanding of the types of means you can use during EDA. If you liked this blog or have any suggestions, kindly leave a comment below. It would mean a lot to me.