My Learning Cafe


Thursday, December 19, 2019

Regression and Residual

What is Regression?
A regression line lets us predict the value of Y for a given value of X, and how much Y changes for a given change in X.

Using the example from the previous post (https://mylearningcafe.blogspot.com/2019/12/correlation.html), we can check whether the value of Y can be determined for a given value of X.

What is Residual?

The residual is the error in a prediction: actual value - predicted value.
For the known data points, it is the gap between each observed value and the value the regression line predicts.
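As a minimal sketch of both ideas, the Python snippet below fits a least-squares regression line to a small data set and computes each residual as actual minus predicted. The hours and calories values are made up for illustration; they are not the data from the earlier post.

```python
# Hypothetical data: hours of exercise (X) and calories burnt (Y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
calories = [110.0, 190.0, 310.0, 390.0, 500.0]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(calories) / n

# Least-squares slope and intercept of the line Y = intercept + slope * X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, calories))
sxx = sum((x - mean_x) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual value - predicted value.
predicted = [intercept + slope * x for x in hours]
residuals = [y - p for y, p in zip(calories, predicted)]

print(f"Y = {intercept:.2f} + {slope:.2f} * X")
for x, y, p, res in zip(hours, calories, predicted, residuals):
    print(f"X={x:.0f} actual={y:.0f} predicted={p:.1f} residual={res:+.1f}")
```

A positive residual means the line predicted too low for that point, and a negative residual means it predicted too high.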

Correlation

What is Correlation?

Correlation shows the direction and strength of the linear relationship between two quantitative variables.

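It is given by the equation

$$ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{S_x} \right) \left( \frac{Y_i - \bar{Y}}{S_y} \right) $$

where

r = correlation
n = number of data points in the data set
Xi, Yi = the individual data values
X̄, Ȳ = the means of the X and Y values
Sx = the standard deviation of the X values
Sy = the standard deviation of the Y values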
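To make the symbols concrete, here is a small Python sketch. It assumes the usual sample form of the correlation: each X and Y value is standardized (mean subtracted, divided by the standard deviation), the products are summed, and the sum is divided by n - 1. The data values and the function names are made up for illustration.

```python
from math import sqrt

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [110.0, 190.0, 310.0, 390.0, 500.0]


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation as the averaged product of standardized X and Y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = sample_std(xs)
    sy = sample_std(ys)
    total = sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
                for x, y in zip(xs, ys))
    return total / (n - 1)


r = correlation(xs, ys)
print(f"r = {r:.3f}")
```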

For more details on the mean and standard deviation, refer to the following blog post:

https://mylearningcafe.blogspot.com/2019/12/mode-median-mean-range-and-standard.html

Direction is given by the slope of a line drawn through the data points.

If the slope is upwards, we deduce that the correlation is positive.

If the slope is downwards, we deduce that the correlation is negative.

Correlation values range from -1 to 1.

A value of 1 indicates perfect positive correlation and a -1 indicates perfect negative correlation.
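The two extremes can be checked with a short Python sketch. The data here is made up: one set of points lies exactly on a rising line and the other exactly on a falling line. The `correlation` helper is a hypothetical name and uses the sample standard deviation from Python's `statistics` module.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation of two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0 * x + 1.0 for x in xs]    # slope is upwards
falling = [-2.0 * x + 1.0 for x in xs]  # slope is downwards

r_up = correlation(xs, rising)
r_down = correlation(xs, falling)
print(round(r_up, 6), round(r_down, 6))
```

Up to floating-point rounding, the rising line gives 1 and the falling line gives -1.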

[Figures: correlation is positive; correlation is negative]

The linear relationship gets stronger as the correlation moves from 0 toward 1 or from 0 toward -1.

[Figures: r = 0, r = 0.3, r = 0.7, r = 1]

Let's look at a calculation for a data set of "Number of hours on a treadmill" vs "Calories burnt".

The points lie close to a straight line, with a positive correlation of 0.969 (very close to a perfect positive correlation).

Friday, December 13, 2019

Mode, Median, Mean, Range and Standard Deviation

Let's look at the differences between Mode, Median, Mean, Range and Standard Deviation.

Let's assume the following data set:

50, 20, 100, 150, 20, 60, 20, 15, 35

Mode:
Mode is the value that occurs most frequently in a data set.
In the data set above, 20 occurs three times, more than any other value, so the mode is 20.
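Counting by hand works for nine values; for a longer list, a short Python sketch like this one can do the counting, using `collections.Counter` from the standard library. The variable names are only illustrative.

```python
from collections import Counter

data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (occurs {mode_count} times)")
```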

Median:
The median is the center point of an ordered data set.
The point to note here is "ordered".

Hence, for the above set, let's do the ordering first.

50, 20, 100, 150, 20, 60, 20, 15, 35
becomes
15, 20, 20, 20, 35, 50, 60, 100, 150

How do we get the median?

Position of the median = (n+1)/2

In the above case, it's (9+1)/2 = 5, so the median is the value in the 5th position, which is 35.

What about when the data set has an even number of values?
In that case, we take the average of the middle two numbers.

Let's add one more element to the above ordered data set.
15, 20, 20, 20, 35, 50, 60, 100, 150, 175

The median is the average of the middle two numbers, 35 and 50, which is 42.5.
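Both cases (odd and even number of values) can be sketched in a few lines of Python. The `median` function below is a hand-written illustration of the steps described above; it sorts first, then picks the middle value or averages the middle two.

```python
def median(values):
    """Median of a list: order the values, then take the center."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2, which is index mid when counting from 0.
        return ordered[mid]
    # Even count: average of the middle two values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [50, 20, 100, 150, 20, 60, 20, 15, 35]
even_data = odd_data + [175]

median_odd = median(odd_data)
median_even = median(even_data)
print(median_odd, median_even)
```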

Mean:
Mean is the average: (sum of all data values) / n.

In this case
50, 20, 100, 150, 20, 60, 20, 15, 35

Mean = (50+20+100+150+20+60+20+15+35)/9 = 470/9 = 52.22
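The same arithmetic as a small Python sketch, in case you want to try it on your own numbers:

```python
data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

# Mean = (sum of all data values) / n
total = sum(data)
n = len(data)
mean_value = total / n

print(f"sum = {total}, n = {n}, mean = {mean_value:.2f}")
```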

Range:
Range is simply the difference between the maximum and the minimum value.
Hence,

Range = max-min = 150 - 15 = 135
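In Python, the built-in `max` and `min` functions give the same result. A minimal sketch (the name `data_range` is just illustrative):

```python
data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

# Range = largest value - smallest value
data_range = max(data) - min(data)

print(f"range = {max(data)} - {min(data)} = {data_range}")
```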

Standard Deviation
Standard deviation measures how close the values in a data set are to the mean.

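Formula for standard deviation:

$$ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$

How do we calculate this?

The mean used for the differences is 52.22. Subtract it from each value, square each difference, add up the squares, divide by n - 1 = 8, and take the square root.

The standard deviation is approximately 45.70.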
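The calculation can be sketched step by step in Python. This assumes the sample form of the standard deviation (dividing by n - 1 rather than n), which is the form that reproduces the value quoted above; `statistics.stdev` from the standard library uses the same form and serves as a cross-check.

```python
from math import sqrt
import statistics

data = [50, 20, 100, 150, 20, 60, 20, 15, 35]
n = len(data)

# Step 1: the mean.
mean_value = sum(data) / n

# Step 2: squared difference of each value from the mean.
squared_diffs = [(x - mean_value) ** 2 for x in data]

# Step 3: divide the sum of squares by n - 1 (sample form), then take the root.
variance = sum(squared_diffs) / (n - 1)
sample_sd = sqrt(variance)

print(f"mean = {mean_value:.2f}")
print(f"sum of squared differences = {sum(squared_diffs):.2f}")
print(f"standard deviation = {sample_sd:.3f}")
print(f"statistics.stdev   = {statistics.stdev(data):.3f}")
```

With this data the result comes out at about 45.697.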

A small standard deviation indicates that the distribution is less spread out and the data is close to the mean.
A large standard deviation indicates that the distribution is more spread out and the data is further away from the mean.