Thursday, July 30, 2020

Probability (Simple/Conditional)

Probability is a measure of how likely it is that something will occur.
All probabilities are numbers between 0 and 1. A probability of 1 indicates a 100% chance that something will occur; a probability of 0 indicates a 0% chance.

The probability of getting heads when we flip a coin once is 50%, or 0.5.

A probability between 0 and 0.5 implies the event is more likely not to occur.
A probability of exactly 0.5 implies the event is equally likely to occur or not.
A probability between 0.5 and 1 implies the event is more likely to occur.

The probability of an event occurring is expressed as

P(event) = (Number of outcomes that meet the criteria) / (Total number of possible outcomes)

The probability of flipping a coin once and getting tails is:

P(Tails) = 1/2 = 0.5

since in this case there are 2 equally likely outcomes (Heads and Tails).

The probability of rolling a die and getting the number 5 is

P(5) = 1/6 ≈ 0.17 (or about 17%)

since there are 6 possible outcomes when we roll a die (from 1 to 6).

What I described above is Theoretical Probability.

We also have Experimental Probability, which is the probability computed from trials that we actually conduct and record.

e.g., we rolled a die 7 times and got the number five 4 times.

What's the experimental probability?

P(5) = 4/7 ≈ 0.57 (57%)

As per the Law of Large Numbers, the more trials we conduct, the closer the experimental probability gets to the theoretical probability.
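
As a quick illustration, here is a minimal Python sketch (my own example, using only the standard library) that rolls a simulated die an increasing number of times and compares the experimental probability of getting a 5 with the theoretical value of 1/6:

import random

random.seed(42)  # fixed seed so the run is reproducible

for trials in (10, 100, 1000, 100000):
    # Count how many rolls came up 5
    fives = sum(1 for _ in range(trials) if random.randint(1, 6) == 5)
    print(f"{trials:>6} rolls: experimental P(5) = {fives / trials:.4f}, "
          f"theoretical P(5) = {1 / 6:.4f}")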


The Addition Rule:

Definition:
The addition rule states that the probability that at least one of two events occurs is the sum of the probabilities of each event minus the probability that both occur: P(A or B) = P(A) + P(B) - P(A and B).

e.g.
What is the probability of getting at least one "Heads" if we flip a coin 2 times?

When we flip a coin twice, the possible outcomes are:
Heads - Heads
Heads - Tails
Tails - Heads
Tails - Tails

So by the definition of the addition rule:

P(at least one Heads) = P(Heads on the first flip) + P(Heads on the second flip) - P(Heads on both flips)

P(at least one Heads) = (2/4) + (2/4) - (1/4) = 3/4 = 75%
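
To sanity-check this, here is a short Python sketch (my own illustration) that enumerates all four outcomes of two coin flips and counts those containing at least one Heads:

from itertools import product

outcomes = list(product("HT", repeat=2))          # HH, HT, TH, TT
favourable = [o for o in outcomes if "H" in o]    # outcomes with >= 1 Heads
print(len(favourable), "of", len(outcomes))       # 3 of 4
print(len(favourable) / len(outcomes))            # 0.75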


Wednesday, February 26, 2020

Polynomial Regression

Polynomial Regression is a form of regression where the relationship between the dependent variable y and the independent variable x is modeled as an nth degree polynomial.

Formula for Polynomial Regression

$y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + \dots + b_n x_1^n$

Code snippet

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Transform the single feature x into polynomial features (1, x, x^2)
poly_regr = PolynomialFeatures(degree = 2)
X_poly = poly_regr.fit_transform(X)

# Fit an ordinary linear regression on the polynomial features
reg = LinearRegression()
reg.fit(X_poly, y)

degree is the degree of the polynomial features.

Refer to the plots below with varying degree and observe that as the degree increases, the curve aligns more closely with the data.

[Plots: polynomial regression fits for degree = 2, 3, 4, and 5]
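
Such plots can be generated with a loop like the following minimal sketch (assuming X and y are the same arrays used above, with X a 2-D NumPy array):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

for degree in (2, 3, 4, 5):
    poly = PolynomialFeatures(degree = degree)
    model = LinearRegression().fit(poly.fit_transform(X), y)

    # Use a fine grid of x values so the fitted curve looks smooth
    X_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)
    plt.scatter(X, y, color = 'red')
    plt.plot(X_grid, model.predict(poly.transform(X_grid)), color = 'blue')
    plt.title(f'Degree {degree}')
    plt.show()
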
Multiple Linear Regression

Multiple Linear Regression is a regression model where we have multiple independent variables.

We need to predict values for the dependent variable as a function of the independent variables.



Formula for Multiple Linear Regression:


$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

where

y is the dependent variable
$x_1$ onwards are the independent variables
$b_1$ onwards are the coefficients (connectors between the dependent and independent variables)
$b_0$ is the constant (intercept)

Always encode any categorical data after the data import.
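
As a minimal sketch, one common way to do this is with one-hot encoding (the column index 3 below is purely illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical column (index 3), pass the rest through
ct = ColumnTransformer(
    transformers = [('encoder', OneHotEncoder(), [3])],
    remainder = 'passthrough',
    sparse_threshold = 0)  # force a dense array output
X = ct.fit_transform(X)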

Code Snippet:

from sklearn.linear_model import LinearRegression
multi_regressor = LinearRegression()
multi_regressor.fit(X_train, y_train)

# Prediction
y_pred = multi_regressor.predict(X_test)


Simple Linear Regression

Simple Linear Regression is a linear regression model where we have one dependent and one independent variable.

We need to predict values for the dependent variable as a function of the independent variable.



Formula for Simple Linear Regression:


$y = b_0 + b_1 x_1$

where

y is the dependent variable
$x_1$ is the independent variable
$b_1$ is the coefficient (connector between the dependent and independent variables)
$b_0$ is the constant (intercept)

Code Snippet:

from sklearn.linear_model import LinearRegression
simple_regressor = LinearRegression()
simple_regressor.fit(X_train, y_train)

# Prediction
y_pred = simple_regressor.predict(X_test)

Plot

import matplotlib.pyplot as plt

plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, simple_regressor.predict(X_train), color = 'blue')
plt.show()




Monday, February 3, 2020

Big Data and Hadoop Introduction

How do we define Big Data?

Big data, in layman's terms, implies large volumes of data (in petabytes [1,024 terabytes] or exabytes [1,024 petabytes]). This data can be in structured or unstructured formats.

Such large volumes of data cannot be handled by traditional data processing software.

Big data attributes:

1. Volume - as mentioned above, we are handling large volumes of data.
2. Format - data can be in structured and unstructured formats.
3. Velocity - data is generated, and must be processed, at high speed.
4. Veracity - in large data sets, the quality of data may vary and needs to be ascertained.

e.g.
Amazon and Netflix recommendation engines based on subscriber interests.
Uber geodata.
Apple Siri voice data.
Driverless cars - sensor and image data processing.


What is Hadoop?

As per the textbook definition, "Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware."


What does this imply?

- Hadoop can store large volumes of data.
- It runs on commodity hardware implying lower cost per GB.



Hadoop stores data across clusters of machines and hence can scale horizontally. This helps reduce the cost per GB as volume increases. It also implies that data may need to be fetched from different nodes rather than from a single machine.

Thursday, December 19, 2019

Regression and Residual

What is Regression?
A regression line helps us predict the change in Y for a change in X.

From the previous example (https://mylearningcafe.blogspot.com/2019/12/correlation.html), we can see how to determine the value of Y for a given value of X.





What is Residual?
The residual tells us the error in a prediction: actual value minus predicted value.
We can see the difference for the known values above.
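
As a minimal sketch (reusing the fitted simple_regressor and the X_test/y_test arrays from the regression post above, which are assumptions here), residuals can be computed directly:

# Residual = actual value - predicted value
y_pred = simple_regressor.predict(X_test)
residuals = y_test - y_pred
print(residuals)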



Correlation

What is Correlation?

Correlation shows us the direction and strength of a linear relationship between two quantitative variables.

It is denoted by the equation

$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

where

r = correlation
n = number of data points in the data set
$s_x$ = the standard deviation of x
$s_y$ = the standard deviation of y
$x_i$, $y_i$ = the individual data values
$\bar{x}$, $\bar{y}$ = the means of x and y
For more details on the mean and standard deviation, refer to the following blog post:

The direction is given by the slope (if we draw a line along the data points).
If the slope is upwards, we deduce that the correlation is positive.
If the slope is downwards, we deduce that the correlation is negative.
Correlation values range from -1 to 1.
A value of 1 indicates perfect positive correlation and -1 indicates perfect negative correlation.

[Scatter plots: positive correlation (upward slope) and negative correlation (downward slope)]

The strength of the linear relationship increases as the correlation moves from 0 toward 1 or from 0 toward -1.
Refer to the plots below.



[Scatter plots for r = 0, r = 0.3, r = 0.7, and r = 1: the points cluster more tightly around a line as r increases]

Let's look at a calculation for a dataset of "No. of hours on a treadmill" vs. "Calories burnt".






We can see a near-straight line and a positive correlation of 0.969 (very close to a perfect positive correlation).
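
The calculation can be reproduced in Python; here is a minimal sketch with NumPy (the hours/calories values below are illustrative stand-ins, since the original table is not reproduced here):

import numpy as np

hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])     # hypothetical values
calories = np.array([150, 320, 460, 610, 740, 910])  # hypothetical values

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r
r = np.corrcoef(hours, calories)[0, 1]
print(round(r, 3))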