What are the meanings of the various types of averages in datasets?
- Mean == the "centre" ("center") of a dataset.
- If you have an array, array_1 = np.array([1, 2, 3, 4, 5]), its mean is 3: the sum is 1 + 2 + 3 + 4 + 5 = 15, and dividing by the number of elements in the array (5) gives 15 / 5 = 3. So 3 is the average, or mean, of this particular array.
- The mean is sensitive to outliers: a single extreme value can pull it well away from where most of the data sits.
- Median == the "middle" of a dataset.
- If you have a sorted list [1, 1, 4, 7, 8, 9, 9], then 7 is the median, because it is the middle value: three values sit below it and three above it. Note that this is not the same as being halfway between the minimum and the maximum value (which here would be 5).
- If you have a list whose length is an even number, say [1, 2, 2, 3, 4, 5, 5, 7] (8 numbers), there is no single middle value, so the median is the mean of the two middle numbers (in this case 3 and 4): (3 + 4) / 2 = 3.5.
- Of course, we're likely to be dealing with very large lists and arrays, so working out the middle numbers ourselves would become a very tedious task. We can overcome this by using the np.median function.
- The median is far less affected by outliers than the mean, because it depends on the order of the values rather than on their size.
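As a rough sketch of the points above, the snippet below runs np.mean and np.median on the same example values. The variable names odd_list, even_list and with_outlier are made up for illustration, and the extra value 100 is a hypothetical outlier that does not appear in the examples above.

```python
import numpy as np

# The example values from the notes above
array_1 = np.array([1, 2, 3, 4, 5])
odd_list = np.array([1, 1, 4, 7, 8, 9, 9])        # 7 values
even_list = np.array([1, 2, 2, 3, 4, 5, 5, 7])    # 8 values

mean_1 = np.mean(array_1)            # 15 / 5 = 3.0
median_odd = np.median(odd_list)     # middle value: 7.0
median_even = np.median(even_list)   # mean of 3 and 4: 3.5

# Hypothetical outlier (assumption, for illustration only)
with_outlier = np.append(array_1, 100)   # [1, 2, 3, 4, 5, 100]
mean_out = np.mean(with_outlier)         # 115 / 6, roughly 19.17
median_out = np.median(with_outlier)     # mean of 3 and 4: 3.5

print(mean_1, median_odd, median_even)
print(mean_out, median_out)
```

With the outlier added, the mean jumps from 3 to roughly 19.17, while the median only moves from 3 to 3.5.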
Finding Percentages:
You can use NumPy's mean function, np.mean, to work out what percentage of a dataset meets a condition. You do so by passing it a comparison (using operators such as >, < or ==). For example, if you have array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16]) and you want to find out the percentage of elements greater than 10, you could do so by using:
>>> np.mean(array_example > 10)
0.5
The result is 0.5 as a proportion, which is 50% once multiplied by 100.
Why does this work? The comparison array_example > 10 is applied to every element of the array at once and produces an array of Boolean values. Where an element is greater than 10 the result is True (which counts as 1). Where it is less than or equal to 10 the result is False (which counts as 0). The mean function then adds up the 1s and divides by the number of elements in the array (in this case 4 / 8 = 0.5). In other words, 50% of the elements are True, which in this case is the same as saying 50% of the elements in the array are greater than 10.
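To make the intermediate steps visible, here is a minimal sketch that breaks the one-liner into parts. The names mask, count_true, proportion and percentage are illustrative only; array_example holds the same values as in the example above.

```python
import numpy as np

array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16])

# Element-wise comparison: one True/False per element
mask = array_example > 10
# [ True  True False False False  True False  True]

# True counts as 1 and False as 0, so the sum is the number of matches
count_true = np.sum(mask)      # 4

# The mean is that count divided by the number of elements: 4 / 8
proportion = np.mean(mask)     # 0.5

# Multiply by 100 to express the proportion as a percentage
percentage = proportion * 100  # 50.0

print(mask)
print(count_true, proportion, percentage)
```

Note that 10 itself is not counted, because the comparison is strictly greater than; using >= 10 instead would include it.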