
Sunday, July 17, 2022

Google Cloud - Storage

 Let's talk a bit about storage in GCP.

Hard disks for VMs are persistent and are block storage. In GCP this is called a persistent disk (block storage).

  • As mentioned, this is similar to the hard drive of a computer.
  • A block storage device maps to one VM; it is not shared across VMs.
  • However, one VM can have several block storage devices attached.
  • In short: each disk attaches to a single VM at a time, but a VM can have many disks.


  • Direct-attached storage is like a hard disk inside the machine, while a storage area network (SAN) is a pool of storage devices connected via a high-speed network.
  • GCP provides two block storage options
    • Persistent Disks (a small Python sketch follows this list)
      • Network block storage, attached to the VM over a high-speed network.
      • Zonal - data replicated within a single zone
      • Regional - data replicated across multiple zones
      • It is logical to use the regional option for higher durability.
      • By default, a 10 GB persistent boot disk is attached to a VM when we create it.
    • Local SSDs
      • Local block storage, physically attached to the VM's host machine.
      • Faster (lower latency)
      • Higher performance
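
For illustration, here is a minimal sketch of creating a zonal persistent disk programmatically. It assumes the google-cloud-compute Python client (pip install google-cloud-compute); the project, zone and disk name are hypothetical.

from google.cloud import compute_v1

# Describe a 100 GB zonal persistent disk (illustrative values)
disk = compute_v1.Disk()
disk.name = "example-data-disk"
disk.size_gb = 100

# Create it in a specific project and zone, then wait for the operation
client = compute_v1.DisksClient()
operation = client.insert(
    project="my-example-project",
    zone="us-central1-a",
    disk_resource=disk,
)
operation.result()

The disk can then be attached to a VM as an additional block device.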

Filestore provides file storage that can be shared between multiple VMs.

  • Pretty logical: use file storage to store and share files across VMs.

Cloud Storage is GCP's object storage service.

  • Create a container (called a bucket in GCP) to store objects (can be done via the console)
    • Bucket names have to be globally unique
    • Location type
      • Region (low latency)
      • Dual region (2 regions) [High availability and low latency across 2 regions]
      • Multi region (multiple regions) [High availability]
    • Storage class
      • Standard
        • Short term
        • Frequently accessed
      • Nearline
        • Backups
        • Data accessed less than once a month
        • Min storage duration is 1 month (30 days)
      • Coldline
        • Disaster recovery
        • Data accessed less than once a quarter
        • Min storage duration is 90 days
      • Archive
        • Long term data preservation (backup)
        • Data accessed less than once a year
        • Min storage duration is 365 days.
    • Inexpensive
    • Auto-scales as you add data
    • Objects are stored as key-value pairs (key = object name, value = object contents)
    • Access control at the object level
    • REST APIs are available to access and modify stored objects (see the Python client sketch after this list)
    • A command line tool is also available (gsutil)
  • Now, logically, one can store any type of data in object storage.
    • But some of this data may be accessed less frequently (e.g. backup files).
    • The storage classes above help optimize cost based on access needs.
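
To make the bucket and object ideas concrete, here is a minimal sketch using the google-cloud-storage Python client (pip install google-cloud-storage). The bucket name, object key and file name are made up for illustration.

from google.cloud import storage

client = storage.Client()  # uses your default GCP credentials and project

# Create a bucket (name must be globally unique) in one region,
# using the Nearline storage class for infrequently accessed data
bucket = client.bucket("my-example-backups-bucket")
bucket.storage_class = "NEARLINE"
bucket = client.create_bucket(bucket, location="us-central1")

# Upload an object: key = "backups/db-2022-07-17.dump", value = the file's bytes
blob = bucket.blob("backups/db-2022-07-17.dump")
blob.upload_from_filename("db-2022-07-17.dump")

# Read it back
print(blob.download_as_bytes()[:100])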

I have data on premise. How do I transfer to Google cloud?
Options:
  • Online transfer:
    • Transfer to Google Cloud Storage via APIs or the CLI (gsutil) [< 1 TB] (see the sketch after this list)
    • Good for smaller transfers (not for petabyte-sized data)
  • Storage Transfer Service:
    • Petabyte-sized data
    • Set up a recurring schedule
    • Can be an incremental transfer as well.
    • Fault tolerant - resumes from where it failed.
    • Use when
      • > 1 TB of data
      • Transferring from a different cloud
  • Transfer Appliance is physical data transfer.
    • Size > 20 TB
    • Request an appliance.
    • Upload data to the appliance.
    • Ship the appliance back.
    • Google uploads the data to Cloud Storage.
    • Data is encrypted on the appliance.
    • Two appliance models
      • TA40 (up to 40 TB)
      • TA300 (up to 300 TB)
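
As a quick illustration of the online transfer option, here is a minimal sketch that copies a local directory into a bucket by calling gsutil from Python. The local path and bucket name are made up; gsutil must already be installed and authenticated.

import subprocess

# Recursively copy a local directory to Cloud Storage using parallel (-m) uploads
subprocess.run(
    ["gsutil", "-m", "cp", "-r", "/data/onprem-exports", "gs://my-example-backups-bucket/"],
    check=True,
)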
    


Saturday, October 17, 2020

ElasticSearch (Introduction)

Q. What is ElasticSearch?

ElasticSearch is an open source analytics and full text search engine.

Often used to enable search functionality for applications. One can build complex search functionality using ElasticSearch. It can also be used to aggregate data and analyze results.

Data is stored as documents (JSON Object) in ElasticSearch.

For better understanding: a document in ElasticSearch corresponds to a row in a relational database.

This document contains fields, which correspond to the columns of a relational database row.

e.g

{
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203"
}


Data is stored in Nodes. There are multiple nodes and each node stores a part of the data.

A node is an instance of ElasticSearch. A machine can run many nodes, but it is best to have one node per machine. To store large data sets, we therefore use multiple machines, each running its own node.

Set of nodes is called a Cluster.

Every document in ElasticSearch is stored within an index (all documents need to be indexed).

e.g A document containing a person's details like name, country, DOB etc. may be stored in an index named "people_index".
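
As a small sketch of how this looks in code, here is the document above being indexed into "people_index" with the official elasticsearch Python client (pip install elasticsearch). The URL and document id are illustrative assumptions.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203",
}

# Store the document in the "people_index" index
# (recent client versions use document=; older 7.x versions use body=)
es.index(index="people_index", id="10203", document=doc)

# Full-text search on the Name field
resp = es.search(index="people_index", query={"match": {"Name": "bond"}})
print(resp["hits"]["hits"])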

Indexes are broken into shards, which reside on nodes. Shards are created based on the capacity of the node(s) and placed on them. Sharding helps improve performance, since queries can be run in parallel across multiple shards.

By default, an index has one shard, and we can increase/decrease the number of shards using the split/shrink APIs.

But what if a node/shard fails? Will we lose data? 

If there is no copy of the data, we will lose data. Replication is enabled by default in ElasticSearch. Default value is 1.

Replication is configured at the index level. We can choose the # of replicas we need while creating an index (default is 1 as mentioned above). Copies of shards are created. These copies are known as "replicas" or "replica shards".
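
A minimal sketch, assuming the same Python client as above: creating an index with an explicit number of shards and replicas. The index name and values are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 2 primary shards, each with 1 replica copy, spread across the cluster's nodes
# (in 7.x clients this is passed as body={"settings": {...}})
es.indices.create(
    index="people_index",
    settings={"number_of_shards": 2, "number_of_replicas": 1},
)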

The main shard is known as the "primary shard" and a replication group (primary and replicas) is created.

Obviously, the replicas will be stored on different nodes to avoid a single point of failure.

But what if I have only one node?

Replication on a single node makes no sense, since failure of that node would still lead to data loss. We need a minimum of 2 nodes for replication to be effective.

How is replication distributed?

1 node:

  • Replication ineffective.

2 nodes:

  • With the default configuration of 1, the replica will be placed on the other node (the one not holding the primary).
  • With a configuration > 1, all replicas will be placed on the other node (the one not holding the primary).

> 2 nodes:

  • With a configuration > 1, the replicas will be distributed across multiple nodes to improve availability and fault tolerance.

What does replication achieve?

  • High availability (a replica is available in case of downtime).
  • Throughput (search queries can be routed to replicas as well, improving performance).

Can we take backups of the data?

Snapshots can be taken at any point in time (at the index level or the cluster level) and can be used to restore data as well.




     




    Monday, February 3, 2020

    Big Data and Hadoop Introduction

    How do we define Big Data?

    Big data in layman's terms implies a large volume of data (petabytes [1024 terabytes] or exabytes [1024 petabytes]). This data can be in structured or unstructured formats.

    Such large volumes of data cannot be handled by traditional data processing software.

    Big data attributes:

    1. Volume - as mentioned above, we are handling large volumes of data.
    2. Format - data can be structured or unstructured.
    3. Velocity - the speed at which data is generated and must be processed.
    4. Veracity - in large data sets, the quality of data may vary and needs to be ascertained.

    e.g
    Amazon and Netflix recommendation engines based on subscriber interests.
    Uber Geodata.
    Apple Siri - voice data.
    Driverless cars - sensor and image data processing.


    What is Hadoop?

    As per text book definition,"Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware."


    What does it imply?

    - Hadoop can store large volumes of data.
    - It runs on commodity hardware implying lower cost per GB.



    Hadoop stores data across a cluster of machines and hence can scale horizontally. This helps reduce the cost per GB as volume increases. It also implies that data may need to be fetched from different machines in the cluster rather than from a single machine.

    Thursday, December 19, 2019

    Correlation

    What is Correlation?

    Correlation shows us the Direction and Strength of a linear relationship shared between 2 quantitative variables.

    It's computed using the equation

        r = (1 / (n − 1)) × Σ [ (Xi − X̄) / Sx ] × [ (Yi − Ȳ) / Sy ]

    where

    r = the correlation
    n = the number of data points in the data set
    Xi, Yi = the individual data values of the two variables
    X̄, Ȳ = the means of the two variables
    Sx = the standard deviation of X
    Sy = the standard deviation of Y

    For more details on the mean and standard deviation, refer to the post "Mode, Median, Mean, Range and Standard Deviation" below.

    Direction is provided by the slope (if we draw a line along the data points)
    If the slope is upwards, we deduce that the correlation is positive.
    If the slope is downwards, we deduce that the correlation is negative.
    Correlation values range from -1 to 1.
    A value of 1 indicates perfect positive correlation and a -1 indicates perfect negative correlation.

    [Scatter plot: positive correlation]

    [Scatter plot: negative correlation]

    Strength of a linear relationship gets stronger as correlation increases from 0 to 1 or from 0 to -1.
    [Scatter plots for r = 0, r = 0.3, r = 0.7 and r = 1 illustrate the increasing strength.]

    Let's look at a calculation for a data set of "No of hours on a treadmill" vs "Calories burnt".






    We can see a near-straight line and a positive correlation of 0.969 (very close to a perfect positive correlation).
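
    As a quick check of the formula, here is a small Python sketch. The treadmill data below is made up for illustration (the original table is not reproduced), so the result will not be exactly 0.969, but for roughly linear data like this it comes out close to 1.

    import math

    hours    = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hours on a treadmill (illustrative)
    calories = [150, 290, 460, 590, 740, 910]   # calories burnt (illustrative)

    n = len(hours)
    mean_x = sum(hours) / n
    mean_y = sum(calories) / n

    # sample standard deviations (divide by n - 1)
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in hours) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in calories) / (n - 1))

    # r = (1 / (n - 1)) * sum of products of the standardized values
    r = sum((x - mean_x) / sx * ((y - mean_y) / sy)
            for x, y in zip(hours, calories)) / (n - 1)
    print(round(r, 3))  # a strong positive correlation, close to 1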

    Friday, December 13, 2019

    Mode, Median, Mean, Range and Standard Deviation

    Let's try to ascertain the differences between Mode, Median, Mean, Range and Standard Deviation.

    Let's assume the following data set:

    50, 20, 100, 150, 20, 60, 20, 15, 35

    Mode:
    The mode is the value that occurs most frequently.
    From the above data set, we can see that 20 occurs thrice; hence the mode for the above data set is 20.

    Median:
    Center point of an ordered data set.
    The point to note here is "ordered" data set.

    Hence, for the above set, let's do the ordering first.

    50, 20, 100, 150, 20, 60, 20, 15, 35
    becomes
    15, 20, 20, 20, 35, 50, 60, 100, 150

    How do we get the median?

    Median position = (n + 1) / 2

    In the above case, that's (9 + 1) / 2 = the 5th position, which is 35.

    How about when we have even numbers in a data set?
    In that case, we take the average of the middle two numbers.

    Let's add one more element to the above ordered data set.
    15, 20, 20, 20, 35, 50, 60, 100, 150, 175

    Median will be average of the middle two numbers which is avg of (35 and 50) which is 42.5

    Mean:
    Mean is the average. [(Sum of all data values)/n]

    In this case
    50, 20, 100, 150, 20, 60, 20, 15, 35

    Mean = (50+20+100+150+20+60+20+15+35)/9 = 470/9 = 52.22

    Range:
    Range is simply the difference between the max and the min.
    Hence,

    Range = max-min = 150 - 15 = 135

    Standard Deviation
    Standard deviation measures how close the values in a data set are to the mean.

    Formula for the (sample) standard deviation:

    s = sqrt( Σ (Xi − X̄)² / (n − 1) )

    How do we calculate this?

    Take each value's difference from the mean (52.22), square it, sum the squares, divide by n − 1, and take the square root.

    The standard deviation works out to 45.69.

    Small deviation indicates the distribution is less spread and data is close to mean.
    Large deviation indicates the distribution is more spread and data is further away from the mean.
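
    These values can be double-checked with a few lines of Python using the built-in statistics module (a quick sketch, not part of the original write-up):

    import statistics

    data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

    print(statistics.mode(data))             # 20
    print(statistics.median(data))           # 35
    print(round(statistics.mean(data), 2))   # 52.22
    print(max(data) - min(data))             # range = 135
    print(round(statistics.stdev(data), 2))  # sample standard deviation, ~45.7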









    Tuesday, December 10, 2019

    Web Scraping Introduction

    What is Web Scraping?
    In simplest terms, as the name suggests, Web Scraping is scraping the data from the Web.

    If the web page is correctly marked up, one can locate an element (for example a <p> tag, or an element with a particular id) and extract its text.

    To get data from HTML, we can use the BeautifulSoup library, which builds a tree out of the various elements in a page and provides an interface to access these elements.

    Pre-requisite:


    pip install beautifulsoup4 html5lib requests



    We shall use requests to fetch the HTML page (via its URL) and then use BeautifulSoup to access the first paragraph.

    Let's try this with our own website:


    from bs4 import BeautifulSoup
    import requests

    # fetch the page and parse it with the html5lib parser
    webhtmlpage = requests.get("https://mylearningcafe.blogspot.com/p/welcome_9.html").text
    bsoup = BeautifulSoup(webhtmlpage, 'html5lib')

    # find the first <p> element
    first = bsoup.find('p')
    print(first)




    $ python webscraping.py

    <p class="description"><span>The cafe (of learning) never closes <br/><br/> For finance related posts, go to <a href="http://mymoneyrules.blogspot.in/">http://mymoneyrules.blogspot.in/</a></span></p>

    To extract the text, if I add:

    first_text = bsoup.find('p').text
    print(first_text)

    I will get

    The cafe (of learning) never closes  For finance related posts, go to http://mymoneyrules.blogspot.in/

    This is what the data on my website looks like:



    Let's get the count of <li> tags first:

    # find the count of <li> tags
    li_tag = bsoup('li')
    print(len(li_tag))

    How do we extract the link (<a>) text?

    If we look closely, the main data is in "<div id='adsmiddle24552235005691491924'>"

    Thus, we search for that specific div id and loop through its <li> tags to find the text.
    Some <li> items have no <a> tag, so the lookup returns None, and accessing .text on it would throw an AttributeError. Hence, we skip those items using a try/except block.

    # find the count of <li> tags
    li_tag = bsoup('li')
    print(len(li_tag))

    # find the div that holds the main content, then its <li> tags
    div_tag = bsoup.find("div", {"id": "adsmiddle24552235005691491924"})
    print(len(div_tag.find_all("li")))
    # print(div_tag.find_all("li"))

    for a_text in div_tag.find_all("li"):
        try:
            print(a_text.a.text)
        except AttributeError:
            pass  # skip <li> items that have no <a> tag

    Output:

    $ python webscraping.py 
    High level introduction to Kubernetes
    Docker vs VM
    Kubernetes - Features
    Quick Introduction to AWS
    Definition of various storage services
    How to add buckets in S3
    AWS Database services (Intro and how to create a DB Instance)
    Statistical learning Index Page
    Annotate - Lets you create personalized messages over images (meme)
    Link Library - A library for all your links
    Raspberry Pi Index Page
    R programming Index Page
    Python Index Page
    Alert pop up dialog in android 
    Rename package in Android
    How to change version of apk for redeployment in Android Studio 
    How to change an app's icon in Android Studio 
    http://mylearningcafe.blogspot.in/2015/05/garbage-collection-gc.html
    http://mylearningcafe.blogspot.in/2014/02/sorting-algorithms-in-java.html
    http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-2.html
    http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-1.html
    http://mylearningcafe.blogspot.in/2014/02/automatic-resource-management-in-java-7.html