Wednesday, February 26, 2020

Polynomial Regression

Polynomial Regression is a form of regression where the relationship between the dependent variable y and the independent variable x is modeled as an nth degree polynomial.

Formula for Polynomial Regression

<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi><mo>=</mo><msub><mi>b</mi><mn>0</mn></msub><mo>+</mo><msub><mi>b</mi><mn>1</mn></msub><msub><mi>x</mi><mn>1</mn></msub><mo>+</mo><msub><mi>b</mi><mn>2</mn></msub><msubsup><mi>x</mi><mn>1</mn><mn>2</mn></msubsup><mo>+</mo><msub><mi>b</mi><mn>3</mn></msub><msubsup><mi>x</mi><mn>1</mn><mn>3</mn></msubsup><mo>+</mo><mo>&#x22EF;</mo><mo>+</mo><msub><mi>b</mi><mi>n</mi></msub><msubsup><mi>x</mi><mn>1</mn><mi>n</mi></msubsup></math>

Code Snippet:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Transform X into polynomial features [1, x1, x1^2]
poly_regr = PolynomialFeatures(degree = 2)
X_poly = poly_regr.fit_transform(X)

# Fit an ordinary linear regression on the polynomial features
reg = LinearRegression()
reg.fit(X_poly, y)

degree is the degree of the polynomial features generated from x (2 in this example).

Refer to the plots below for varying degrees and observe that as the degree increases, the curve aligns more closely with the data. Note that a very high degree can overfit the training data.

[Plots for degrees 2, 3, 4, and 5]
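The effect of degree can be sketched with synthetic data (the generating function and variable names below are made-up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data (hypothetical generating function)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 3 - 2 * X.ravel() ** 2 + rng.normal(0, 1, 40)

scores = {}
for degree in [2, 3, 4, 5]:
    poly_regr = PolynomialFeatures(degree=degree)
    X_poly = poly_regr.fit_transform(X)
    reg = LinearRegression().fit(X_poly, y)
    # Training R^2 never decreases as the degree (model flexibility) grows
    scores[degree] = reg.score(X_poly, y)
    print(degree, round(scores[degree], 3))
```

On the training set the fit can only improve with degree; a held-out test set is needed to tell a good fit from overfitting.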
Multiple Linear Regression

Multiple Linear Regression is a regression model where we have multiple independent variables.

We need to predict values for the dependent variable as a function of the independent variables.



Formula for Multiple Linear Regression:


<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi><mo>=</mo><msub><mi>b</mi><mn>0</mn></msub><mo>+</mo><msub><mi>b</mi><mn>1</mn></msub><msub><mi>x</mi><mn>1</mn></msub><mo>+</mo><msub><mi>b</mi><mn>2</mn></msub><msub><mi>x</mi><mn>2</mn></msub><mo>+</mo><mo>&#x22EF;</mo><mo>+</mo><msub><mi>b</mi><mi>n</mi></msub><msub><mi>x</mi><mi>n</mi></msub></math>

where

y is the dependent variable
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mn>1</mn></msub></math> onwards are the independent variables
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>b</mi><mn>1</mn></msub></math> onwards are the coefficients (connecting the dependent and independent variables)
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>b</mi><mn>0</mn></msub></math> is the constant (intercept)

Always encode any categorical data after importing the dataset, since the regression model expects numeric inputs.
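One way to do this is sklearn's ColumnTransformer with OneHotEncoder. A minimal sketch, assuming a toy matrix whose first column is categorical (the data and column index are hypothetical):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical dataset: column 0 is categorical, the rest are numeric
X = np.array([["NY", 1.0, 2.0],
              ["CA", 3.0, 4.0],
              ["NY", 5.0, 6.0]], dtype=object)

# One-hot encode column 0, pass the numeric columns through unchanged
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])],
                       remainder="passthrough")
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)  # each category becomes its own indicator column
```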

Code Snippet:

from sklearn.linear_model import LinearRegression
multi_regressor = LinearRegression()
multi_regressor.fit(X_train, y_train)

# Prediction
y_pred = multi_regressor.predict(X_test)
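After fitting, intercept_ and coef_ correspond to b0 and b1..bn in the formula above. A minimal sketch with noise-free synthetic data (the generating coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data generated from y = 4 + 2*x1 + 3*x2 (hypothetical coefficients)
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 4 + 2 * X[:, 0] + 3 * X[:, 1]

multi_regressor = LinearRegression().fit(X, y)
print(multi_regressor.intercept_)  # recovers b0 = 4
print(multi_regressor.coef_)       # recovers [b1, b2] = [2, 3]
```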





Simple Linear Regression

Simple Linear Regression is a linear regression model where we have one dependent and one independent variable.

We need to predict values for the dependent variable as a function of the independent variable.



Formula for Simple Linear Regression:


<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>y</mi><mo>=</mo><msub><mi>b</mi><mn>0</mn></msub><mo>+</mo><msub><mi>b</mi><mn>1</mn></msub><msub><mi>x</mi><mn>1</mn></msub></math>

where

y is the dependent variable
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mn>1</mn></msub></math> is the independent variable
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>b</mi><mn>1</mn></msub></math> is the coefficient (connecting the dependent and independent variables)
<math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>b</mi><mn>0</mn></msub></math> is the constant (intercept)

Code Snippet:

from sklearn.linear_model import LinearRegression
simple_regressor = LinearRegression()
simple_regressor.fit(X_train, y_train)

# Prediction
y_pred = simple_regressor.predict(X_test)

Plot:

import matplotlib.pyplot as plt

# Test points in red, fitted regression line in blue
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, simple_regressor.predict(X_train), color = 'blue')
plt.show()
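The fitted line's parameters can be read back from the model: intercept_ is b0 and coef_[0] is b1. A minimal sketch with hypothetical data following y = 5 + 1.5*x plus a little noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 5 + 1.5*x with small Gaussian noise
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(80, 1))
y = 5 + 1.5 * X[:, 0] + rng.normal(0, 0.2, 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
simple_regressor = LinearRegression().fit(X_train, y_train)
print(simple_regressor.intercept_)  # b0, close to 5
print(simple_regressor.coef_[0])    # b1, close to 1.5
```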




Monday, February 3, 2020

Big Data and Hadoop Introduction

How do we define Big Data?

Big data, in layman's terms, implies a large volume of data (in petabytes [1,024 terabytes] or exabytes [1,024 petabytes]). This data can be in structured or unstructured format.

Such large volumes of data cannot be handled by traditional data processing software.

Big data attributes:

1. Volume - as mentioned above, we are handling large volumes of data.
2. Variety - data can arrive in structured and unstructured formats.
3. Velocity - data is generated, and must be processed, at high speed.
4. Veracity - in large data sets, the quality of data may vary and needs to be ascertained.

e.g.
Amazon and Netflix recommendation engines based on subscriber interests.
Uber geodata.
Apple Siri - voice data.
Driverless cars - sensor and image data processing.


What is Hadoop?

As per the textbook definition, "Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware."


What does it imply?

- Hadoop can store large volumes of data.
- It runs on commodity hardware, implying a lower cost per GB.



Hadoop stores data across clusters and hence can scale horizontally, which reduces the cost per GB as volume increases. It also implies that data must be fetched from different nodes in the cluster rather than being available on a single machine.