Thursday, December 19, 2019

Regression and Residual

What is Regression?
A regression line helps us predict the change in Y for a given change in X.

Using the previous example (https://mylearningcafe.blogspot.com/2019/12/correlation.html), we can check whether the value of Y can be determined for a given value of X.





What is Residual?
A residual is the error in a prediction: the actual value minus the predicted value.
We can see this difference for the known values above.
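
As a minimal sketch of both ideas (assuming Python 3), the snippet below fits a least-squares line to a small made-up data set and computes the residual for each point. The numbers are illustrative assumptions, not the data from the earlier post, and the helper name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of products of deviations / sum of squared X deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative values only (assumed for this sketch)
hours = [1, 2, 3, 4, 5]
calories = [110, 190, 310, 390, 500]

slope, intercept = fit_line(hours, calories)

# Predicted Y for each known X, then residual = actual - predicted
predicted = [slope * x + intercept for x in hours]
residuals = [actual - pred for actual, pred in zip(calories, predicted)]

print(slope, intercept)  # 98.0 6.0
print(residuals)         # [6.0, -12.0, 10.0, -8.0, 4.0]
```

The residuals of a least-squares line sum to zero, which is a quick sanity check on the fit.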



Correlation

What is Correlation?

Correlation shows us the Direction and Strength of a linear relationship shared between 2 quantitative variables.

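It is given by the equation

r = (1 / (n - 1)) * Σ [ ((Xi - X̄) / Sx) * ((Yi - Ȳ) / Sy) ]

where

r = correlation
n = number of data points in the data set
Sx = the standard deviation of the X values
Sy = the standard deviation of the Y values
Xi, Yi = the individual data values
X̄, Ȳ = the means of the X and Y values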
For more details on the mean and standard deviation, refer to the blog post "Mode, Median, Mean, Range and Standard Deviation".

Direction is given by the slope of a line drawn along the data points.
If the slope is upwards, the correlation is positive.
If the slope is downwards, the correlation is negative.
Correlation values range from -1 to 1.
A value of 1 indicates a perfect positive correlation and a value of -1 indicates a perfect negative correlation.
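
A minimal sketch of the calculation, assuming Python 3 and the standard Pearson formula with sample standard deviations (dividing by n - 1). The data values and the helper names mean, sample_std and correlation are assumptions made for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Pearson r: average product of the standardized X and Y values."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_std(xs), sample_std(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # slopes upwards
falling = [10, 8, 6, 4, 2]   # slopes downwards

print(round(correlation(xs, rising), 6))   # 1.0
print(round(correlation(xs, falling), 6))  # -1.0
```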

[Figure: correlation is positive]

[Figure: correlation is negative]

The strength of a linear relationship increases as the correlation moves from 0 towards 1 or from 0 towards -1.

[Figures: scatter plots for r = 0, r = 0.3, r = 0.7 and r = 1]

Let's look at a calculation for a data set of "No. of hours on a treadmill" vs "Calories burnt".






We can see a nearly straight line with a positive correlation of 0.969 (very close to a perfect positive correlation).

Friday, December 13, 2019

Mode, Median, Mean, Range and Standard Deviation

Let's try to ascertain the differences between Mode, Median, Mean, Range and Standard Deviation.

Let's assume the following data set:

50, 20, 100, 150, 20, 60, 20, 15, 35

Mode:
The mode is the value that occurs most frequently.
In the above data set, 20 occurs three times (more often than any other value), hence the mode is 20.

Median:
The center point of an ordered data set.
The point to note here is that the data set must be ordered.

Hence, for the above set, let's do the ordering first.

50, 20, 100, 150, 20, 60, 20, 15, 35
becomes
15, 20, 20, 20, 35, 50, 60, 100, 150

How do we get the median?

Position of the median = (n+1)/2

In the above case, it is (9+1)/2 = 5th position, which holds the value 35.

What about when the data set has an even number of elements?
In that case, we take the average of the middle two numbers.

Let's add one more element to the above ordered data set.
15, 20, 20, 20, 35, 50, 60, 100, 150, 175

The median is the average of the middle two numbers (35 and 50), which is 42.5.
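
The two cases can be sketched in a few lines, assuming Python 3. The function name median is chosen for illustration; the data is the set used above.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2, which is index n // 2 when counting from 0
        return ordered[middle]
    # Even count: average the two values around the center
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

print(median(data))          # 35
print(median(data + [175]))  # 42.5
```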

Mean:
The mean is the average: (sum of all data values)/n

In this case
50, 20, 100, 150, 20, 60, 20, 15, 35

Mean = (50+20+100+150+20+60+20+15+35)/9 = 470/9 = 52.22

Range:
The range is simply the difference between the maximum and the minimum value.
Hence,

Range = max-min = 150 - 15 = 135
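
The mode, mean and range worked out above can be checked with a short sketch, assuming Python 3. The variable names are illustrative.

```python
from collections import Counter

data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

# Mode: the most common value and how often it occurs
mode_value, mode_count = Counter(data).most_common(1)[0]

# Mean: sum of all data values divided by n
mean = sum(data) / len(data)

# Range: max - min
data_range = max(data) - min(data)

print(mode_value, mode_count)  # 20 3
print(round(mean, 2))          # 52.22
print(data_range)              # 135
```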

Standard Deviation
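Standard deviation measures how close the values in a data set are to the mean.

Formula for standard deviation:

s = sqrt( Σ (Xi - X̄)² / (n - 1) )

where Xi is each data value, X̄ is the mean and n is the number of data values.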
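How do we calculate this?

Subtract the mean from each value, square each difference, add up the squares, divide by n - 1 and take the square root.

The mean used for the differences is 52.22.

The standard deviation is approximately 45.70.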

A small standard deviation indicates that the distribution is less spread out and the data is close to the mean.
A large standard deviation indicates that the distribution is more spread out and the data lies further from the mean.
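
A step-by-step sketch of the calculation for the data set above, assuming Python 3 and the sample standard deviation (dividing by n - 1, which is the variant that reproduces the figure quoted in this post). The result is roughly 45.7 (45.697 before rounding).

```python
import math
import statistics

data = [50, 20, 100, 150, 20, 60, 20, 15, 35]
n = len(data)

# Step 1: the mean
mean = sum(data) / n

# Step 2: squared difference of each value from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: divide the sum by n - 1 and take the square root
sample_std = math.sqrt(sum(squared_diffs) / (n - 1))

print(round(mean, 2))
print(round(sample_std, 3))

# The standard library gives the same sample standard deviation
print(round(statistics.stdev(data), 3))
```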









Tuesday, December 10, 2019

Web Scraping Introduction

What is Web Scraping?
In the simplest terms, as the name suggests, web scraping is extracting data from the web.

If the web page is correctly marked up, one can extract data by, for example, finding the <p> element whose id is "subject" and returning its text.

To get data from an HTML page, we can use the BeautifulSoup library, which builds a tree out of the various elements in a page and provides an interface to access these elements.

Pre-requisite:


pip install beautifulsoup4 requests html5lib



We shall use requests to fetch the HTML page (via its URL) and then use BeautifulSoup to access the first paragraph.

Let's try this with our own website:


from bs4 import BeautifulSoup
import requests

# Download the page and parse it into a tree of elements
webhtmlpage = requests.get("https://mylearningcafe.blogspot.com/p/welcome_9.html").text
bsoup = BeautifulSoup(webhtmlpage, 'html5lib')

# find() returns the first matching element
first = bsoup.find('p')
print(first)




$ python webscraping.py

<p class="description"><span>The cafe (of learning) never closes <br/><br/> For finance related posts, go to <a href="http://mymoneyrules.blogspot.in/">http://mymoneyrules.blogspot.in/</a></span></p>

To extract just the text, I can add:

first_text = bsoup.find('p').text
print(first_text)

I will get

The cafe (of learning) never closes  For finance related posts, go to http://mymoneyrules.blogspot.in/

This is what the data on my website looks like:



Let's first count the <li> tags:

#find count of <li> tags
li_tag = bsoup('li')
print(len(li_tag))

How to extract the Href text?

If we look closely, the main data is in "<div id='adsmiddle24552235005691491924'>"

Thus, we search for the specific div ID and loop through its <li> items to find the text.
Some <li> items have no <a> child, so a_text.a is None, and accessing .text on it raises an AttributeError since None has no such attribute. Hence, we skip those items using a try/except block.

#find count of <li> tags
li_tag = bsoup('li')
print(len(li_tag))

div_tag = bsoup.find("div", {"id": "adsmiddle24552235005691491924"})
print(len(div_tag.find_all("li")))
#print(div_tag.find_all("li"))

for a_text in div_tag.find_all("li"):
    try:
        print(a_text.a.text)
    except AttributeError:
        # this <li> has no <a> child, so skip it
        pass
        #print("skip")
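
For comparison, the same idea (count the <li> items and collect the link text inside them) can be sketched with only the standard library's html.parser instead of BeautifulSoup, assuming Python 3. The inline HTML snippet, the "links" id and the class name LinkTextParser are assumptions made for illustration and stand in for the real page.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect the text of <a> tags that sit inside <li> items."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0      # how many <li> tags are currently open
        self.in_link = False   # True while inside an <a> within an <li>
        self.li_count = 0
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
            self.li_count += 1
        elif tag == "a" and self.li_depth > 0:
            self.in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_texts[-1] += data

# Made-up markup: the second item has no link, like the "None" case above
sample_html = """
<div id="links">
  <ul>
    <li><a href="#one">Docker vs VM</a></li>
    <li>No link here</li>
    <li><a href="#two">Python Index Page</a></li>
  </ul>
</div>
"""

parser = LinkTextParser()
parser.feed(sample_html)
print(parser.li_count)    # 3
print(parser.link_texts)  # ['Docker vs VM', 'Python Index Page']
```

Here an item without a link is simply never recorded, so no exception handling is needed; BeautifulSoup remains the more convenient choice for real pages.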

Output:

$ python webscraping.py 
High level introduction to Kubernetes
Docker vs VM
Kubernetes - Features
Quick Introduction to AWS
Definition of various storage services
How to add buckets in S3
AWS Database services (Intro and how to create a DB Instance)
Statistical learning Index Page
Annotate - Lets you create personalized messages over images (meme)
Link Library - A library for all your links
Raspberry Pi Index Page
R programming Index Page
Python Index Page
Alert pop up dialog in android 
Rename package in Android
How to change version of apk for redeployment in Android Studio 
How to change an app's icon in Android Studio 
http://mylearningcafe.blogspot.in/2015/05/garbage-collection-gc.html
http://mylearningcafe.blogspot.in/2014/02/sorting-algorithms-in-java.html
http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-2.html
http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-1.html
http://mylearningcafe.blogspot.in/2014/02/automatic-resource-management-in-java-7.html







Monday, December 2, 2019

Miscellaneous

Index page for Python:
http://mylearningcafe.blogspot.com/2015/08/python-index-page.html

Some miscellaneous stuff:

Sorting


>>> list_of_items = [1,100,200,3,100,200,5]
>>> sorted_list_of_items = sorted(list_of_items)
>>> list_of_items
[1, 100, 200, 3, 100, 200, 5]
>>> sorted_list_of_items
[1, 3, 5, 100, 100, 200, 200]

Reverse Sorting

>>> sorted_list_of_items_backwards = sorted(list_of_items,key=abs,reverse=True)

>>> sorted_list_of_items_backwards
[200, 200, 100, 100, 5, 3, 1]
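
The key argument makes no visible difference for the positive numbers above. A short sketch with some negative values (made up for illustration) shows what key=abs changes:

```python
values = [-7, 2, -3, 10, 1]

# Plain ascending and descending sorts compare the values themselves
print(sorted(values))                         # [-7, -3, 1, 2, 10]
print(sorted(values, reverse=True))           # [10, 2, 1, -3, -7]

# key=abs compares absolute values, so -7 ranks just below 10
print(sorted(values, key=abs, reverse=True))  # [10, -7, -3, 2, 1]
```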

Get a list of even numbers

>>> list_of_items = [1,100,200,3,100,200,5]
>>> list_of_items
[1, 100, 200, 3, 100, 200, 5]

>>> even_number_list = [x for x in list_of_items if x % 2 == 0]

>>> even_number_list
[100, 200, 100, 200]

Randomly shuffle data

>>> import random
>>> list_of_items = [1,100,200,3,100,200,5]
>>> 
>>> list_of_items
[1, 100, 200, 3, 100, 200, 5]

>>> random.shuffle(list_of_items)

>>> list_of_items
[3, 1, 5, 100, 100, 200, 200]
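
random.shuffle reorders the list in place. If the original order must be kept, one option is random.sample, which returns a new shuffled list. A minimal sketch (the variable name shuffled_copy is illustrative; the order printed differs from run to run):

```python
import random

list_of_items = [1, 100, 200, 3, 100, 200, 5]

# Sampling all elements gives a shuffled copy and leaves the original alone
shuffled_copy = random.sample(list_of_items, len(list_of_items))

print(list_of_items)   # original order is unchanged
print(shuffled_copy)   # same elements in a random order
```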

Dictionaries and Sets



Dictionaries in Python are data structures that associate keys with values.


>>> empl_name_id_dict = {"Nitin":1,"Jonathan":2,"Brien":3}


>>> briens_details = empl_name_id_dict["Brien"]
>>> briens_details
3

>>> briens_details = empl_name_id_dict["brien"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

KeyError: 'brien'

Keys are case sensitive.

The get method returns a default value (None unless another default is given) for a non-existing key instead of raising an exception.

>>> briens_details = empl_name_id_dict.get("brien")
>>> briens_details
>>> 

>>> briens_details = empl_name_id_dict.get("Brien")
>>> briens_details
3
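
A short sketch of get with an explicit default, reusing the dictionary above. The default value 0 is an arbitrary choice for illustration.

```python
empl_name_id_dict = {"Nitin": 1, "Jonathan": 2, "Brien": 3}

# Missing key: None by default, or the second argument if one is given
print(empl_name_id_dict.get("brien"))      # None
print(empl_name_id_dict.get("brien", 0))   # 0

# Existing key: the default is ignored
print(empl_name_id_dict.get("Brien", 0))   # 3

# Membership test, also case sensitive
print("brien" in empl_name_id_dict)        # False
```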

# Adding a new element
>>> empl_name_id_dict["Matt"] = 4
>>> empl_name_id_dict
{'Nitin': 1, 'Matt': 4, 'Jonathan': 2, 'Brien': 3}

# Get the list of keys, values or items

>>> list_of_keys = empl_name_id_dict.keys()
>>> list_of_keys
['Nitin', 'Matt', 'Jonathan', 'Brien']
>>> 
>>> list_of_values = empl_name_id_dict.values()
>>> list_of_values
[1, 4, 2, 3]

>>> list_of_items = empl_name_id_dict.items()
>>> list_of_items
[('Nitin', 1), ('Matt', 4), ('Jonathan', 2), ('Brien', 3)]


A set is a data structure that represents a collection of distinct elements.

>>> list_of_items = [1,100,200,3,100,200,5]
>>> list_of_items
[1, 100, 200, 3, 100, 200, 5]
>>> 
>>> list_of_items_set = set(list_of_items)
>>> list_of_items_set

set([200, 1, 3, 100, 5])
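
Two common uses of a set are removing duplicates and fast membership tests. A minimal sketch with the same list (the name unique_items is illustrative):

```python
list_of_items = [1, 100, 200, 3, 100, 200, 5]
unique_items = set(list_of_items)

# Duplicates are dropped when the set is built
print(len(list_of_items))    # 7
print(len(unique_items))     # 5

# Membership test
print(100 in unique_items)   # True
print(42 in unique_items)    # False

# Sets are unordered; sort them when a fixed order is needed
print(sorted(unique_items))  # [1, 3, 5, 100, 200]
```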


Lists in Python


What is a list?
Simply put, a list is an ordered collection.

>>> list_of_integers = [1,4,5]
>>> len(list_of_integers)
3
>>> sum(list_of_integers)
10


How to get to an element in the list:

>>> list_of_integers[1]
4

List indexes start from 0.

>>> list_of_integers[0]
1

Slicing lists:

Lists are sliced using square brackets and a colon.

>>> first_two_elements = list_of_integers[:2]
>>> first_two_elements
[1, 4]

Let's try some more commands:

>>> lists_of_value = [100,1,50,25,34,67,78,99]
>>> len(lists_of_value)
8

>>> first_4_elements = lists_of_value[:4]
>>> first_4_elements
[100, 1, 50, 25]

>>> last_3_elements = lists_of_value[-3:]
>>> last_3_elements
[67, 78, 99]

>>> copy_of_list_elements = lists_of_value[:]
>>> copy_of_list_elements
[100, 1, 50, 25, 34, 67, 78, 99]

>>> copy_of_list_elements.extend([200,300,400])
>>> copy_of_list_elements
[100, 1, 50, 25, 34, 67, 78, 99, 200, 300, 400]
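
A short sketch showing that a copy made with [:] is independent of the original, along with two more slice forms (start:stop and a step):

```python
lists_of_value = [100, 1, 50, 25, 34, 67, 78, 99]

# Extending the copy does not change the original list
copy_of_list_elements = lists_of_value[:]
copy_of_list_elements.extend([200, 300, 400])
print(len(lists_of_value))         # 8
print(len(copy_of_list_elements))  # 11

# Elements at index 1 up to (not including) index 4
print(lists_of_value[1:4])         # [1, 50, 25]

# Every second element
print(lists_of_value[::2])         # [100, 50, 34, 78]
```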

