Showing posts with label python. Show all posts

NumPy: Averages of Data Sets

What are the meanings of the various types of averages in datasets?

  • Mean == the "centre" ("center") of a dataset. 
    • If you have an array, array_1 = np.array([1, 2, 3, 4, 5]), the mean of this array is 3, because 1 + 2 + 3 + 4 + 5 = 15 and 15 / 5 (the number of elements in the array) = 3. Thus, 3 is the average, or the mean, of this particular array.
    • The mean is affected by outliers
  • Median == the "middle" of a dataset. 
    • If you have a list [1, 1, 4, 7, 8, 9, 9], then 7 is the median of the list, because it is the middle value once the list is sorted: three values sit below it and three sit above it. 
    • If you have a list whose length is an even number, say [1, 2, 2, 3, 4, 5, 5, 7] (8 numbers), then the median is the halfway point between the two middle numbers (in this case 3 and 4), so the median of the list above is 3.5. 
    • Of course, we're likely to be dealing with very large lists and arrays, so sorting them and picking out the middle numbers ourselves would become a very tedious task. We can overcome this by using the np.median function. 
    • The median is far less affected by outliers than the mean, because it depends on the position of the values rather than their size
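
To see the difference between the two averages, here is a small sketch. The arrays are made-up values (the 500 is an outlier I added on purpose), so treat it as an illustration rather than real data:

```python
import numpy as np

# Hypothetical data: the same values with and without one extreme outlier.
values = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 500])

mean_plain = np.mean(values)              # (1 + 2 + 3 + 4 + 5) / 5
median_plain = np.median(values)          # middle value of the sorted array
mean_outlier = np.mean(with_outlier)      # pulled upwards by the 500
median_outlier = np.median(with_outlier)  # still the middle value

# Even-length data: the median is the average of the two middle values.
even_median = np.median(np.array([1, 2, 2, 3, 4, 5, 5, 7]))

print(mean_plain, median_plain)
print(mean_outlier, median_outlier)
print(even_median)
```

The mean jumps from 3 to 102 when the outlier is added, while the median stays at 3.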

Finding Percentages: 

You can use NumPy's mean function, np.mean, together with a comparison operator to work out what percentage of a dataset meets a condition. For example, if you have array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16]) and you want to find the percentage of elements greater than 10, you can use: 
>>> np.mean(array_example > 10)
0.5
 
The above result is 0.5 or 50%.

Why does this work? The comparison array_example > 10 is applied to every element of the array and produces an array of True/False values. Where an element is greater than 10, the result is True (which counts as 1). Where it is less than or equal to 10, the result is False (which counts as 0). The mean function then adds up the 1s and divides by the number of elements in the array (in this case 4 / 8 = 0.5). In other words, 50% of the values are True, which is the same as saying 50% of elements in the array are greater than 10. 
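
The steps can be spelled out one at a time. This is just a sketch using the same array as above; the variable names mask, count_true and fraction are my own:

```python
import numpy as np

array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16])

mask = array_example > 10      # element-wise comparison gives a boolean array
count_true = int(mask.sum())   # True counts as 1, False counts as 0
fraction = mask.mean()         # same as count_true / mask.size

print(mask)
print(count_true, mask.size)
print(fraction * 100)          # as a percentage
```

Note that 10 itself is not counted, because the comparison is strictly "greater than".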





NumPy: Outliers and Sorting

Sometimes a dataset contains elements that are unusually large or small compared with the other elements. In an array of heights, for example, we may see values that are implausibly short or implausibly tall. These elements are known as outliers. 

We can more easily identify outliers by using the NumPy sort function, np.sort(heights_array), which pushes the extreme values to the two ends of the array. We can then begin to identify where possible errors or anomalies lie. You can get people who are between 120cm and 190cm, but it is unlikely that a smallest measurement of 10cm or a tallest measurement of 1200cm is accurate. 
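
Here is a minimal sketch of that idea. The measurements are invented for illustration, and the 120cm to 190cm cut-offs are simply the range mentioned above, not a rule:

```python
import numpy as np

# Hypothetical measurements in cm; 10 and 1200 are the suspect values.
heights_array = np.array([172, 10, 165, 190, 1200, 120, 158])

# np.sort returns a sorted copy, so the extremes end up at either end.
sorted_heights = np.sort(heights_array)

# Assumed plausible range (taken from the text above); adjust for your data.
plausible = (sorted_heights >= 120) & (sorted_heights <= 190)
suspects = sorted_heights[~plausible]

print(sorted_heights)
print(suspects)
```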


Python Pyperclip

I had problems installing the pyperclip module for Python. It was not recognised as a module. 

I was not able to install it from the command line at first. I resolved this by opening the command line as administrator and then running pip install pyperclip. (Right-click on the command line icon in the start menu and select "run as administrator").


Shebang Line

The shebang line should be the first line of your Python program. 

This is what Automate the Boring Stuff with Python author Al Sweigart says about the Shebang line.

The first line of all your Python programs should be a shebang line, which tells your computer that you want Python to execute this program. The shebang line begins with #!, but the rest depends on your operating system.
  • On Windows, the shebang line is #! python3.
  • On OS X, the shebang line is #! /usr/bin/env python3.
  • On Linux, the shebang line is #! /usr/bin/python3.
You will be able to run Python scripts from IDLE without the shebang line, but the line is needed to run them from the command line.
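
As a small illustration, this is roughly what a script with a shebang line looks like. The file name hello.py and the greeting function are made up for the example, and I have used the OS X form of the line from the list above:

```python
#! /usr/bin/env python3
# hello.py - hypothetical example script; the line above must be line 1 of the file.
import sys


def greeting():
    """Build a short message that includes the major Python version."""
    return 'Hello from Python ' + str(sys.version_info.major)


if __name__ == '__main__':
    print(greeting())
```

On Linux or OS X you would normally also make the file executable (chmod +x hello.py) before running it as ./hello.py.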

File management using openpyxl

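When using openpyxl to load workbooks with just a file name, ensure that your Excel file is saved in the folder the program runs from (usually the same folder as the .py file you are developing), or pass the full path to the file instead.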

import openpyxl, pprint
print('Opening workbook... ')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb.get_sheet_by_name('Population by Census Tract')
Otherwise, the following "file not found" error will result:

 "FileNotFoundError: [Errno 2] No such file or directory: 'censuspopdata.xlsx'"
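
One way to catch this earlier is to check for the file before handing it to openpyxl. This is only a sketch: find_workbook is a helper name I made up, and it uses just the standard library, so it assumes nothing about openpyxl itself:

```python
from pathlib import Path


def find_workbook(filename, folder=None):
    """Return the full path to filename, or raise a clearer error.

    folder defaults to the current working directory, which is where
    a bare file name such as 'censuspopdata.xlsx' is looked up.
    """
    base = Path(folder) if folder is not None else Path.cwd()
    path = base / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} not found in {base}")
    return path


# Intended use (not run here, since it needs the spreadsheet to exist):
# wb = openpyxl.load_workbook(str(find_workbook('censuspopdata.xlsx')))
```

The error message now names the folder that was searched, which makes it easier to see when the script is running from somewhere unexpected.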

TypeError: 'generator' object is not subscriptable

The code below, from chapter 12, produces the following error: "TypeError: 'generator' object is not subscriptable". 



>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> sheet = wb.active
>>> sheet.columns[1]
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    sheet.columns[1]
TypeError: 'generator' object is not subscriptable
>>> 

RESOLUTION: 
Create a list from sheet.columns: 
list(sheet.columns)[1] overcomes the generator error. The book's code appears to be outdated: in the openpyxl version I have installed, sheet.columns is a generator, which cannot be indexed directly. 
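
The same error can be reproduced without a spreadsheet. In this sketch, columns() is a made-up stand-in for sheet.columns (it is not part of openpyxl) that simply yields one tuple per column:

```python
def columns():
    """Stand-in for sheet.columns: yields one tuple of cell names per column."""
    yield ('A1', 'A2')
    yield ('B1', 'B2')


# Indexing a generator directly fails, just like sheet.columns[1].
try:
    columns()[1]
except TypeError as error:
    message = str(error)

# Converting to a list first makes indexing work.
second_column = list(columns())[1]

print(message)
print(second_column)
```

Bear in mind that list() reads every column into memory, which is fine for small sheets but may be slow for very large ones.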

Trouble Using Pip and Installing Openpyxl

Openpyxl is, according to Automate the Boring Stuff with Python, a handy Python library which allows programmers to read and modify Microsoft Excel spreadsheets from their Python programs.

But first you need to install pip on your machine.

What is pip?
"Pip is a package manager for python packages. A package contains all the files you need for a module. Modules are Python code libraries that you can include in your project". - W3 schools. 

The first difficulty I had was getting pip to work through my command line.  I resolved that issue using paths - as described in this video (https://www.youtube.com/watch?v=Jw_MuM2BOuI).

The next issue I had was with installing openpyxl itself (which you need pip to do). I received the following error:
"Could not install packages due to an EnvironmentError: [Errno 13] Permission denied:"
"Consider using the `--user` option or check the permissions" 

Eventually, I found a solution on Stack Overflow, which suggested using the following:
python -m pip install --user openpyxl
This worked.
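
To confirm that the install really worked, you can ask Python whether it can find the module. This is a small sketch using the standard library; is_installed is my own helper name:

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once openpyxl has been installed for this Python interpreter.
print(is_installed('openpyxl'))
```

If this prints False straight after a successful pip install, the package was probably installed for a different Python installation than the one running the script.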

Web Development: Organizing Files and Folders

When you begin to build your website, it's a very clever idea to organize  your files and folders efficiently. You should have: A ...