Percentiles with NumPy

First, a warning. Don't confuse finding the percentage of elements in an array that meet a condition with finding a percentile of that array. I've already covered the code for percentages:
find_percentages = np.mean(array_name < 81)
which returns the fraction of elements in the array that are less than 81 (a result of 0.4, for example, means 40%).

NumPy has a function for finding percentiles of arrays: np.percentile.

In its simplest form it takes two arguments: the array you are exploring, and the percentile you would like to retrieve from it. The code looks like this:
find_percentile = np.percentile(array_name, 40)
The output is the fortieth percentile of the array, that is, the value below which roughly 40% of the elements fall.
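To make the difference between a percentile and a percentage concrete, here is a minimal sketch. The scores array is invented for this example and is not from any real dataset.

```python
import numpy as np

# Hypothetical exam scores, made up purely for illustration.
scores = np.array([55, 62, 68, 70, 74, 79, 81, 85, 90, 97])

# Percentile: a VALUE from the scale of the data.
# Roughly 40% of the scores fall below this value.
fortieth_percentile = np.percentile(scores, 40)

# Percentage: a FRACTION of the elements.
# This is the share of scores that are less than 81.
fraction_below_81 = np.mean(scores < 81)

print(fortieth_percentile)  # 72.4 (NumPy interpolates between 70 and 74 by default)
print(fraction_below_81)    # 0.6, i.e. 60% of the scores
```

The first result is a score; the second is a proportion between 0 and 1.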

Quartiles, Interquartile Range and Median

Codecademy has given this elaboration of percentiles: 

Some percentiles have specific names:
  • The 25th percentile is called the first quartile
  • The 50th percentile is called the median
  • The 75th percentile is called the third quartile
The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set of numbers is a great thing to compute when we get a new dataset.
The difference between the first and third quartile is a value called the interquartile range. For example, say we have the following array:
d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]
We can calculate the 25th and 75th percentiles using np.percentile:
np.percentile(d, 25) >>> 3.5
np.percentile(d, 75) >>> 6.5
Then to find the interquartile range, we subtract the value of the 25th percentile from the value of the 75th:
6.5 - 3.5 = 3
50% of the dataset will lie within the interquartile range. The interquartile range gives us an idea of how spread out our data is. The smaller the interquartile range value, the less variance in our dataset. The greater the value, the larger the variance.
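As a small sketch of my own (not part of the Codecademy text), the five-number summary and the interquartile range for the same array d could be computed like this. The variable names are my own choice.

```python
import numpy as np

d = np.array([1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8])

# The five-number summary: minimum, first quartile, median,
# third quartile and maximum.
minimum = np.min(d)
first_quartile = np.percentile(d, 25)
median = np.percentile(d, 50)
third_quartile = np.percentile(d, 75)
maximum = np.max(d)

# The interquartile range is the distance between the quartiles.
interquartile_range = third_quartile - first_quartile

print(minimum, first_quartile, median, third_quartile, maximum)  # 1 3.5 4.0 6.5 8
print(interquartile_range)  # 3.0
```

np.percentile also accepts a list of percentiles, so np.percentile(d, [25, 50, 75]) would return all three middle values in one call.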

NumPy: Averages of Data Sets

What are the meanings of the various types of averages in datasets?

  • Mean == the "centre" ("center") of a dataset. 
    • If you have an array: array_1 = np.array([1, 2, 3, 4, 5]), the mean of this array would be 3, because 1 + 2 + 3 + 4 + 5 = 15, and 15 / 5 (the number of elements in the array) = 3. Thus, 3 is the average, or mean, of this particular array.
    • The mean is affected by outliers.
  • Median == the "middle" of a dataset. 
    • If you have a sorted list [1, 1, 4, 7, 8, 9, 9], then 7 would be the median of the list, as it is the middle element: three values lie below it and three lie above it. (It is not the point halfway between the minimum and maximum values, which here would be 5.)
    • If you have a list whose length is an even number, say [1, 2, 2, 3, 4, 5, 5, 7] (8 numbers), then the median is the halfway point between the two middle numbers (in this case 3 & 4), so the median of the list above would be 3.5.
    • Of course, we're likely to be dealing with very large lists and arrays, which may not be sorted, so working out the middle numbers ourselves would become a very tedious task. We can overcome this by using the np.median function, which does not need the data to be sorted first.
    • The median is far less affected by outliers than the mean.
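The points about outliers above can be checked with a short sketch. The value 100 is a hypothetical outlier that I have added for illustration; it is not part of the original list.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 5, 5, 7])

# Add one hypothetical, unusually large value to the same data.
values_with_outlier = np.append(values, 100)

# Without the outlier, the mean and median are close together.
print(np.mean(values))    # 3.625
print(np.median(values))  # 3.5

# With the outlier, the mean jumps but the median barely moves.
print(np.mean(values_with_outlier))    # about 14.33
print(np.median(values_with_outlier))  # 4.0
```

A single extreme value pulls the mean from 3.625 to about 14.33, while the median only shifts from 3.5 to 4.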

Finding Percentages: 

You can use NumPy's mean function, np.mean, together with a comparison operator to work out percentages from a given dataset. For example, if you have array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16]) and you want to find the percentage of elements greater than 10, you could use:
>>> np.mean(array_example > 10)
0.5

The above result is 0.5, or 50%.

Why does this work? The comparison array_example > 10 is applied to every element of the array at once and produces a new array of True and False values. Where an element is greater than 10, the result is True (which counts as 1). Where it is less than or equal to 10, the result is False (which counts as 0). The mean function then adds up these values and divides the total by the number of elements in the array (in this case 4 / 8 = 0.5). In other words, 50% of the elements in the result are True, which in this case is the same as saying 50% of the elements in the array are greater than 10.
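Breaking the calculation into its intermediate steps may make this clearer. This is just a sketch; the intermediate variable names are my own.

```python
import numpy as np

array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16])

# Step 1: the comparison produces one True/False value per element.
is_greater_than_10 = array_example > 10
print(is_greater_than_10)
# [ True  True False False False  True False  True]

# Step 2: True counts as 1 and False as 0, so the sum is a count.
count_greater_than_10 = np.sum(is_greater_than_10)
print(count_greater_than_10)  # 4

# Step 3: the mean is that count divided by the number of elements.
fraction = np.mean(is_greater_than_10)
print(fraction)        # 0.5
print(fraction * 100)  # 50.0 (as a percentage)
```

Note that 10 itself is not counted, because the comparison is strictly greater than.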





NumPy: Outliers and Sorting

Sometimes, within a given dataset, we will see elements that are unusually large or small compared with the other elements. In an array of heights, for example, we may see values that are very small or very large. These elements are known as outliers.

We can more easily identify outliers by using the NumPy sort function: np.sort(heights_array). Sorting puts the smallest and largest values at the two ends of the array, so we can then begin to identify where possible errors or anomalies lie. You can get people who are between 120cm and 190cm, but it is unlikely that a smallest measurement of 10cm, or a tallest measurement of 1200cm, is accurate.
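Here is a minimal sketch of that idea. The heights are invented for illustration, and the 120cm to 190cm cut-offs are simply an assumption taken from the example above, not a general rule for detecting outliers.

```python
import numpy as np

# Hypothetical heights in centimetres, made up for this example.
heights_array = np.array([165, 1200, 172, 10, 158, 190, 120, 181])

# np.sort returns a sorted copy; the original array is unchanged.
sorted_heights = np.sort(heights_array)
print(sorted_heights)
# [  10  120  158  165  172  181  190 1200]

# Assumed plausible range, based on the 120cm to 190cm example.
lowest_plausible = 120
highest_plausible = 190

# Select the values that fall outside the assumed range.
too_small = sorted_heights < lowest_plausible
too_large = sorted_heights > highest_plausible
suspect_heights = sorted_heights[too_small | too_large]
print(suspect_heights)  # [  10 1200]
```

Once sorted, the suspicious values sit at the very start and end of the array, which makes them easy to spot by eye as well.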


Web Development: Organizing Files and Folders

When you begin to build your website, it's a very good idea to organize your files and folders efficiently. You should have: A ...