Jamesy Mc Jamesface: 2019

Aliases in SQL

Did you know you can list your tables as aliases in SQL?

Say you have a table called "quiz" with unique user_id's.

You could list your table as "q" (or whatever you like) and call its user_id column in the following way:
SELECT q.user_id
FROM quiz q
LIMIT 10;

Did you see that? Where you place your table name ("quiz"), you can place your variable name after it, i.e. "q".

This tidies up your code a little bit. Say you had two other columns called "question" and "answer" - you could select the three columns by using the code:
SELECT q.user_id, q.question, q.answer
FROM quiz q
LIMIT 10;
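
If you want to try this without a real database, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and I've written the alias with the optional AS keyword, which does the same thing as putting "q" straight after the table name.

```python
import sqlite3

# In-memory database with a hypothetical "quiz" table (the rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quiz (user_id TEXT, question TEXT, answer TEXT)")
conn.executemany(
    "INSERT INTO quiz VALUES (?, ?, ?)",
    [("u1", "style", "classic"), ("u2", "fit", "narrow")],
)

# "q" is the alias for "quiz"; AS is optional but makes the alias explicit.
with_alias = conn.execute(
    "SELECT q.user_id, q.question, q.answer "
    "FROM quiz AS q ORDER BY q.user_id LIMIT 10"
).fetchall()

# The same query with the full table name, for comparison.
without_alias = conn.execute(
    "SELECT quiz.user_id, quiz.question, quiz.answer "
    "FROM quiz ORDER BY quiz.user_id LIMIT 10"
).fetchall()

print(with_alias)
conn.close()
```

Both queries return the same rows; the alias only saves typing and keeps long queries readable.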

Below is some example code from Codecademy's Funnels exercises on their Data Science course:

SELECT q.user_id,
h.user_id IS NOT NULL AS 'is_home_try_on',
h.number_of_pairs,
p.user_id IS NOT NULL AS 'is_purchase'
FROM quiz q
LEFT JOIN home_try_on h
ON q.user_id = h.user_id
LEFT JOIN purchase p
ON p.user_id = q.user_id
LIMIT 10;
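
To see what a query like this returns, here is a small sketch that runs it against an in-memory SQLite database from Python. The three tables only have the columns the query needs and the rows are invented, so treat it as an illustration, not as Codecademy's dataset. I've also left the quotes off the column aliases and added an ORDER BY so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: three quiz takers, two home try-ons, one purchase.
conn.executescript("""
    CREATE TABLE quiz (user_id TEXT);
    CREATE TABLE home_try_on (user_id TEXT, number_of_pairs TEXT);
    CREATE TABLE purchase (user_id TEXT);
    INSERT INTO quiz VALUES ('u1'), ('u2'), ('u3');
    INSERT INTO home_try_on VALUES ('u1', '3 pairs'), ('u2', '5 pairs');
    INSERT INTO purchase VALUES ('u1');
""")

funnel = conn.execute("""
    SELECT q.user_id,
           h.user_id IS NOT NULL AS is_home_try_on,
           h.number_of_pairs,
           p.user_id IS NOT NULL AS is_purchase
    FROM quiz q
    LEFT JOIN home_try_on h ON q.user_id = h.user_id
    LEFT JOIN purchase p ON p.user_id = q.user_id
    ORDER BY q.user_id
""").fetchall()

for row in funnel:
    print(row)
conn.close()
```

In SQLite the IS NOT NULL test comes back as 1 or 0, so user u3, who only took the quiz, shows 0 for both flags and None for number_of_pairs.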

Here is a link to W3Schools' examples of how to create SQL aliases: https://www.w3schools.com/sql/sql_alias.asp

Percentiles with Numpy

First, a warning. Don't get mixed up between finding the percentage of elements in an array that meet a condition and finding a percentile of that array. I've already covered the code for getting percentages:

find_percentages = np.mean(array_name < 81)

which will return the fraction of elements in the array that are less than 81 (for example, 0.4 means 40%).

Numpy has a function to find percentiles from arrays.

It takes two arguments: the variable name for the array you are exploring, and the percentile you would like to retrieve from it. The code looks like this: find_percentile = np.percentile(array_name, 40) - the fortieth percentile of this array will be the output.
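
Here is a quick side-by-side of the two ideas, as a sketch. The scores array is made up, and the names fraction_below_81 and fortieth_percentile are just mine.

```python
import numpy as np

# Hypothetical test scores, already in ascending order for readability.
scores = np.array([55, 62, 70, 78, 81, 85, 90, 93, 97, 100])

# Percentage question: what share of the scores is below 81?
fraction_below_81 = np.mean(scores < 81)

# Percentile question: below which value do 40% of the scores fall?
fortieth_percentile = np.percentile(scores, 40)

print(fraction_below_81)     # a fraction between 0 and 1
print(fortieth_percentile)   # a value on the same scale as the scores
```

The first answer is a share of the data (4 of the 10 scores here), while the second is a score-like value that sits between 78 and 81.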

Quartiles, Inter-quartile and Median.

Codecademy has given this elaboration of percentiles:

Some percentiles have specific names:

The 25th percentile is called the first quartile
The 50th percentile is called the median
The 75th percentile is called the third quartile

The minimum, first quartile, median, third quartile, and maximum of a dataset are called a five-number summary. This set of numbers is a great thing to compute when we get a new dataset.

The difference between the first and third quartile is a value called the interquartile range. For example, say we have the following array:


d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]

We can calculate the 25th and 75th percentiles using np.percentile:


np.percentile(d, 25)
>>> 3.5
np.percentile(d, 75)
>>> 6.5

Then to find the interquartile range, we subtract the value of the 25th percentile from the value of the 75th:


6.5 - 3.5 = 3

50% of the dataset will lie within the interquartile range. The interquartile range gives us an idea of how spread out our data is. The smaller the interquartile range value, the less variance in our dataset. The greater the value, the larger the variance.
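
The five-number summary and the interquartile range for the array above can be sketched in one go, since np.percentile also accepts a list of percentiles. The variable names are my own.

```python
import numpy as np

d = [1, 2, 3, 4, 4, 4, 6, 6, 7, 8, 8]

# The 0th and 100th percentiles are the minimum and maximum.
minimum, q1, median, q3, maximum = np.percentile(d, [0, 25, 50, 75, 100])

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

print(minimum, q1, median, q3, maximum)  # 1.0 3.5 4.0 6.5 8.0
print(iqr)                               # 3.0
```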

NumPy: Averages of Data Sets

What are the meanings of the various types of averages in datasets?

Mean == the "centre" ("center") of a dataset.

If you have an array: array_1 = np.array([1,2,3,4,5]), the mean of this array would be 3, because 1 + 2 + 3 + 4 + 5 = 15, and 15 / 5 (the number of elements in the array) = 3. Thus, 3 would be the average or the mean of this particular array.
The mean is affected by outliers.

Median == the "middle" of a dataset.

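If you have a sorted list [1, 1, 4, 7, 8, 9, 9], then 7 would be the median of the list, as it is the middle element: three values sit below it and three above it. (Note that this is not the same as the point halfway between the minimum and maximum values, which would be 5.)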
If you have a list whose length is an even number, say [1, 2, 2, 3, 4, 5, 5, 7] (8 numbers), then the median is the half-way point between the two middle numbers (in this case 3 & 4), so the median of the list above would be 3.5.
Of course, we're likely to be dealing with very large lists and arrays, so working out the middle numbers ourselves would become a very tedious task. We can overcome this by using the np.median function.
The median is far less affected by outliers than the mean.
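
To see how differently the two averages react to an outlier, here is a small sketch. It reuses the list from the median example and appends a made-up extreme value of 1000.

```python
import numpy as np

values = np.array([1, 1, 4, 7, 8, 9, 9])
with_outlier = np.append(values, 1000)  # hypothetical extreme value

mean_before = np.mean(values)            # 39 / 7, roughly 5.57
median_before = np.median(values)        # 7.0
mean_after = np.mean(with_outlier)       # 1039 / 8 = 129.875
median_after = np.median(with_outlier)   # (7 + 8) / 2 = 7.5

print(mean_before, mean_after)
print(median_before, median_after)
```

One extreme value drags the mean from under 6 to almost 130, while the median only moves from 7 to 7.5.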

Finding Percentages:

You can use NumPy's mean function, np.mean, together with a comparison operator to work out percentages from a given dataset. For example, if you have array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16]) and you wanted to find out the percentage of elements greater than 10, you could do so by using:

>>> np.mean(array_example > 10)
0.5

The above result is 0.5 or 50%.

Why does this work? Well, the comparison array_example > 10 is applied to every element of the array and produces an array of True/False values. Where an element is greater than 10, the result is True (which counts as 1). Where it is less than or equal to 10, the result is False (which counts as 0). The mean function then adds up the 1s and divides by the number of elements in the array (in this case the answer would be 4 / 8 = 0.5). In other words, 50% of the results are True, which in this case is the same as saying 50% of elements in the array are greater than 10.
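
Breaking the one-liner into its steps makes this easier to follow. This is just a sketch; the names mask, count_true and fraction are my own.

```python
import numpy as np

array_example = np.array([15, 18, 9, 5, 4, 21, 10, 16])

# Step 1: the comparison runs on every element and gives a boolean array.
mask = array_example > 10

# Step 2: True counts as 1 and False as 0, so the sum is the number of matches.
count_true = mask.sum()

# Step 3: the mean is that count divided by the number of elements.
fraction = mask.mean()

print(mask.tolist())  # [True, True, False, False, False, True, False, True]
print(count_true)     # 4
print(fraction)       # 0.5
```

Note that 10 itself gives False, because the test is strictly "greater than".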

NumPy: Outliers and Sorting

Sometimes, from the range of a given dataset, we will see elements that are unusually larger or smaller than the other elements. In an array of heights, for example, we may see numbers that are very short, or very tall. These elements are known as outliers.

We can more easily identify outliers by using the NumPy sort function np.sort(heights_array). We can then begin to identify where possible errors or anomalies lie. You can get people who are between 120cm and 190cm, but it is unlikely that the smallest measurement of 10cm or the tallest measurement of 1200cm is accurate.
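
Here is a sketch of that idea with made-up heights in centimetres. The 120 to 190 cut-offs are only the rough range mentioned above, not a statistical rule.

```python
import numpy as np

# Hypothetical measurements in cm, including two obvious data-entry errors.
heights_array = np.array([165, 10, 172, 1200, 158, 190, 120, 181])

# Sorting pushes the suspicious values to the two ends of the array.
sorted_heights = np.sort(heights_array)

# Assumed plausible range, taken from the text: 120cm to 190cm inclusive.
in_range = (sorted_heights >= 120) & (sorted_heights <= 190)
plausible = sorted_heights[in_range]
suspect = sorted_heights[~in_range]

print(sorted_heights.tolist())  # [10, 120, 158, 165, 172, 181, 190, 1200]
print(suspect.tolist())         # [10, 1200]
```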

Python Pyperclip

I had problems installing the pyperclip module for Python: it was not recognised as a module.

I was not able to install it from the command line at first. I resolved this by running the command line as administrator (right-click on the command line icon in the Start menu and select "Run as administrator") and then running pip install pyperclip.

Remember this about joins in SQL...

You may be confused by the differences between joins in SQL.

A standard JOIN or inner-join in SQL will join tables where the rows are exactly matching on the column that you are joining on. It will automatically not include non-matching rows, so that you are only presented with rows that are consistent. The benefit of this is that you are only presented with consistent and accurate information. The disadvantage is that you are missing information from some items.

However, a LEFT JOIN keeps every row from the first (left) table, whether or not it has a matching row in the second table (inconsistency between tables can be caused by one table being updated, but corresponding information in another table not being updated). In this scenario, you are given all the information from the left table, but where there is no match the columns from the second table are listed as NULL.

Illustration of a left join, from codecademy.com:
https://s3.amazonaws.com/codecademy-content/courses/learn-sql/multiple-tables/left-join.gif

A CROSS JOIN creates a Cartesian Product. This means that it will allow us to combine all rows of one table with all rows of another table. If there are three rows in table A, and three rows in table B, all three rows of table A will be joined with all three rows of table B. The results of this join will have 9 rows.
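
The three joins can be compared side by side with a small sketch in Python's built-in sqlite3 module. The tables table_a and table_b and their rows are hypothetical: each has three rows, and only ids 2 and 3 appear in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, a_val TEXT);
    CREATE TABLE table_b (id INTEGER, b_val TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO table_b VALUES (2, 'b2'), (3, 'b3'), (4, 'b4');
""")

# Inner join: only ids present in both tables survive.
inner_rows = conn.execute(
    "SELECT a.id, a.a_val, b.b_val FROM table_a a "
    "JOIN table_b b ON a.id = b.id ORDER BY a.id"
).fetchall()

# Left join: every row of table_a survives; missing matches become NULL (None).
left_rows = conn.execute(
    "SELECT a.id, a.a_val, b.b_val FROM table_a a "
    "LEFT JOIN table_b b ON a.id = b.id ORDER BY a.id"
).fetchall()

# Cross join: every row of table_a paired with every row of table_b.
cross_rows = conn.execute(
    "SELECT a.id, b.id FROM table_a a CROSS JOIN table_b b"
).fetchall()

print(inner_rows)       # [(2, 'a2', 'b2'), (3, 'a3', 'b3')]
print(left_rows)        # [(1, 'a1', None), (2, 'a2', 'b2'), (3, 'a3', 'b3')]
print(len(cross_rows))  # 9
conn.close()
```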

Shebang Line

The shebang line should be the first line of your Python program.

This is what Automate the Boring Stuff with Python author Al Sweigart says about the Shebang line.

The first line of all your Python programs should be a shebang line, which tells your computer that you want Python to execute this program. The shebang line begins with #!, but the rest depends on your operating system.

On Windows, the shebang line is #! python3.

On OS X, the shebang line is #! /usr/bin/env python3.

On Linux, the shebang line is #! /usr/bin/python3.

You will be able to run Python scripts from IDLE without the shebang line, but the line is needed to run them from the command line.

SQLite strftime() Function

Did you know that strftime() is an SQLite function that allows the programmer to return a formatted date?

It takes two arguments:

strftime(format, column)

To get an hour: strftime('%H', column_name)
To get the year: strftime('%Y', column_name)
To get the month: strftime('%m', column_name)
To get the day: strftime('%d', column_name)
To get the minute: strftime('%M', column_name)
To get the second: strftime('%S', column_name)

The above is true as long as the time format is YYYY-MM-DD HH:MM:SS
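
You can check the format codes from Python with the built-in sqlite3 module. This is a sketch with a made-up timestamp; note that the codes are case-sensitive, so '%m' is the month and '%M' is the minute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
timestamp = "2019-03-07 14:05:09"  # hypothetical value in YYYY-MM-DD HH:MM:SS form

# Map a readable name to each strftime format code.
formats = {
    "year": "%Y", "month": "%m", "day": "%d",
    "hour": "%H", "minute": "%M", "second": "%S",
}

# Run SELECT strftime(format, timestamp) once per code; SQLite returns text.
parts = {
    name: conn.execute("SELECT strftime(?, ?)", (fmt, timestamp)).fetchone()[0]
    for name, fmt in formats.items()
}

print(parts)
conn.close()
```

Every part comes back as a zero-padded string, for example '03' for the month and '05' for the minute.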

More on this function can be found in the SQLite documentation on date and time functions.

File management using openpyxl

When using openpyxl to load a workbook by file name alone, make sure the Excel file is in the folder your script runs from (usually the same folder as the .py file you are developing).

import openpyxl, pprint
print('Opening workbook... ')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb.get_sheet_by_name('Population by Census Tract')

Otherwise, the following "file not found" error will result:

"FileNotFoundError: [Errno 2] No such file or directory: 'censuspopdata.xlsx'"
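
One way to get a clearer message is to build the full path yourself and check it before handing it to openpyxl. This is only a sketch using pathlib: workbook_path is a hypothetical helper of mine, not part of openpyxl, and "no_such_folder" is a deliberately missing folder to show the failure case.

```python
from pathlib import Path


def workbook_path(filename, base_dir=None):
    """Return the full path to filename inside base_dir (default: current folder)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / filename).resolve()
    if not path.is_file():
        # Fail early, and say which folder was actually searched.
        raise FileNotFoundError(f"{path} not found (looked in {base})")
    return path


# Demonstrate the failure case with a folder that is assumed not to exist.
try:
    workbook_path("censuspopdata.xlsx", base_dir="no_such_folder")
    found = True
except FileNotFoundError as error:
    found = False
    print(error)
```

In a real script you could pass Path(__file__).parent as base_dir, so the workbook is looked up next to the .py file no matter where the script is launched from, and then give the returned path to openpyxl.load_workbook.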

TypeError: 'generator' object is not subscriptable

The code below, from chapter 12 of Automate the Boring Stuff with Python, produces the following error: "TypeError: 'generator' object is not subscriptable".

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> sheet = wb.active
>>> sheet.columns[1]
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
sheet.columns[1]
TypeError: 'generator' object is not subscriptable
>>>

RESOLUTION:
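Convert sheet.columns to a list before indexing it:
list(sheet.columns)[1] overcomes the generator error. The book's sheet.columns[1] syntax is apparently outdated, since sheet.columns now returns a generator rather than something you can index directly.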
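
The same error and fix can be reproduced without a spreadsheet. In this sketch, columns() is a hypothetical stand-in for sheet.columns that simply yields made-up cell names.

```python
from itertools import islice


def columns():
    """Hypothetical stand-in for sheet.columns: yields one tuple per column."""
    yield ("A1", "A2", "A3")
    yield ("B1", "B2", "B3")
    yield ("C1", "C2", "C3")


# Indexing a generator directly fails, exactly as in the traceback above.
try:
    columns()[1]
    error_message = None
except TypeError as error:
    error_message = str(error)

# Fix 1: turn the generator into a list first, then index it.
second_column = list(columns())[1]

# Fix 2: step through the generator lazily and take only the item at index 1.
second_column_lazy = next(islice(columns(), 1, 2))

print(error_message)
print(second_column)
```

The list() version is the simplest; the islice version avoids building the whole list, which may matter for very wide sheets.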

More openpyxl learnings

Continuing our journey through Automate the Boring Stuff with Python's Excel lessons, we came across the get_column_letter and column_index_from_string openpyxl functions. When attempting to import the functions, we were given the following error:

>>> import openpyxl
>>> from openpyxl.cell import get_column_letter, column_index_from_string
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
from openpyxl.cell import get_column_letter, column_index_from_string
ImportError: cannot import name 'get_column_letter' from 'openpyxl.cell' (C:\Users\james\AppData\Roaming\Python\Python37\site-packages\openpyxl\cell\__init__.py)
>>> get_column_letter(1)
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
get_column_letter(1)
NameError: name 'get_column_letter' is not defined
>>>
=============================== RESTART: Shell ===============================
>>> import openpyxl
>>>
=============================== RESTART: Shell ===============================
>>> import openpyxl
>>> from openpyxl.cell import get_column_letter
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
from openpyxl.cell import get_column_letter
ImportError: cannot import name 'get_column_letter' from 'openpyxl.cell' (C:\Users\james\AppData\Roaming\Python\Python37\site-packages\openpyxl\cell\__init__.py)

Well, apparently Automate the Boring Stuff with Python is a little out of date now! According to a Stack Overflow contributor,

The function get_column_letter has been relocated in Openpyxl version 2.4 from openpyxl.cell to openpyxl.utils.
The current import is: from openpyxl.utils import get_column_letter

Importing from openpyxl.utils instead of openpyxl.cell worked.
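
If you want code that copes with either location, one option is to try each module in turn. This is a sketch: import_from_first is a hypothetical helper of mine, and it returns None when openpyxl is not installed at all.

```python
import importlib


def import_from_first(name, module_names):
    """Return attribute name from the first listed module that has it, else None."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # this module (or its parent package) is not available
        if hasattr(module, name):
            return getattr(module, name)
    return None


# Try the newer location first, then the one used in the book.
get_column_letter = import_from_first(
    "get_column_letter", ["openpyxl.utils", "openpyxl.cell"]
)

if get_column_letter is None:
    print("get_column_letter not found; is openpyxl installed?")
else:
    print(get_column_letter(1))
```

For a one-off script, the plain from openpyxl.utils import get_column_letter line is simpler; the helper is only worth it if the code has to run against different openpyxl versions.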

Trouble Using Pip and Installing Openpyxl

Openpyxl is, according to Automate the Boring Stuff with Python, a handy Python library which allows programmers to use Python programs with Microsoft Excel sheets.

But first you need to install pip on your machine.

What is pip?

"Pip is a package manager for python packages. A package contains all the files you need for a module. Modules are Python code libraries that you can include in your project". - W3 schools.

The first difficulty I had was getting pip to work through my command line. I resolved that issue using paths - as described in this video (https://www.youtube.com/watch?v=Jw_MuM2BOuI).

The next issue I had was with installing openpyxl itself (which you need pip to do). I received the following error:

"Could not install packages due to an EnvironmentError: [Errno 13] Permission denied:"

"Consider using the `--user` option or check the permissions"

Eventually, I found a solution on Stack Overflow, which suggested using the following:

python -m pip install --user openpyxl

This worked.
