
Sunday, July 17, 2022

Google Cloud - Storage

 Let's talk a bit about storage in GCP.

Hard disks for VMs are persistent and are block storage. In GCP this is called a persistent disk (block storage).

  • As mentioned, this is similar to the hard drive of a computer.
  • A block storage device maps to one VM; it is not shared across VMs.
  • However, one VM can have several block storage devices attached.
  • In short: each disk attaches to a single VM at a time, but a VM can have many disks.


  • Direct-attached storage is like a hard disk inside the machine, while a storage area network (SAN) is a pool of storage devices connected via a high-speed network.
  • GCP provides two block storage options
    • Persistent Disks (a small Python sketch follows this list)
      • Network block storage, attached to the VM over a high-speed network.
      • Zonal - data replicated within a single zone
      • Regional - data replicated across multiple zones
      • It is logical to use the regional option for higher durability.
      • By default, a 10 GB persistent boot disk is attached to a VM when we create it.
    • Local SSDs
      • Local block storage, physically attached to the VM's host machine.
      • Faster (lower latency)
      • Higher performance
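
For illustration, here is a minimal sketch of creating a zonal persistent disk programmatically. It assumes the google-cloud-compute Python client (pip install google-cloud-compute); the project, zone and disk name are hypothetical.

from google.cloud import compute_v1

# Describe a 100 GB zonal persistent disk (illustrative values)
disk = compute_v1.Disk()
disk.name = "example-data-disk"
disk.size_gb = 100

# Create it in a specific project and zone, then wait for the operation
client = compute_v1.DisksClient()
operation = client.insert(
    project="my-example-project",
    zone="us-central1-a",
    disk_resource=disk,
)
operation.result()

The disk can then be attached to a VM as an additional block device.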

Filestore provides file storage that can be shared between multiple VMs.

  • Pretty logical: use file storage to store and share files across VMs.

Cloud Storage is GCP's object storage service.

  • Create a container (called a bucket in GCP) to store objects (can be done via the console)
    • Bucket names have to be globally unique
    • Location type
      • Region (low latency)
      • Dual region (2 regions) [High availability and low latency across 2 regions]
      • Multi region (multiple regions) [High availability]
    • Storage class
      • Standard
        • Short term
        • Frequently accessed
      • Nearline
        • Backups
        • Data accessed less than once a month
        • Min storage duration is 1 month (30 days)
      • Coldline
        • Disaster recovery
        • Data accessed less than once a quarter
        • Min storage duration is 90 days
      • Archive
        • Long term data preservation (backup)
        • Data accessed less than once a year
        • Min storage duration is 365 days.
    • Inexpensive
    • Auto-scales as you add data
    • Objects are stored as key-value pairs (key = object name, value = object contents)
    • Access control at the object level
    • REST APIs are available to access and modify stored objects (see the Python client sketch after this list)
    • A command line tool is also available (gsutil)
  • Now, logically, one can store any type of data in object storage.
    • But some of this data may be accessed less frequently (e.g. backup files).
    • The storage classes above help optimize cost based on access needs.
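
To make the bucket and object ideas concrete, here is a minimal sketch using the google-cloud-storage Python client (pip install google-cloud-storage). The bucket name, object key and file name are made up for illustration.

from google.cloud import storage

client = storage.Client()  # uses your default GCP credentials and project

# Create a bucket (name must be globally unique) in one region,
# using the Nearline storage class for infrequently accessed data
bucket = client.bucket("my-example-backups-bucket")
bucket.storage_class = "NEARLINE"
bucket = client.create_bucket(bucket, location="us-central1")

# Upload an object: key = "backups/db-2022-07-17.dump", value = the file's bytes
blob = bucket.blob("backups/db-2022-07-17.dump")
blob.upload_from_filename("db-2022-07-17.dump")

# Read it back
print(blob.download_as_bytes()[:100])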

I have data on premise. How do I transfer to Google cloud?
Options:
  • Online transfer:
    • Transfer to Google Cloud Storage via APIs or the CLI (gsutil) [< 1 TB] (see the sketch after this list)
    • Good for smaller transfers (not for petabyte-sized data)
  • Storage Transfer Service:
    • Petabyte-sized data
    • Set up a recurring schedule
    • Can be an incremental transfer as well.
    • Fault tolerant - resumes from where it failed.
    • Use when
      • > 1 TB of data
      • Transferring from a different cloud
  • Transfer Appliance is physical data transfer.
    • Size > 20 TB
    • Request an appliance.
    • Upload data to the appliance.
    • Ship the appliance back.
    • Google uploads the data to Cloud Storage.
    • Data is encrypted on the appliance.
    • Two appliance models
      • TA40 (up to 40 TB)
      • TA300 (up to 300 TB)
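
As a quick illustration of the online transfer option, here is a minimal sketch that copies a local directory into a bucket by calling gsutil from Python. The local path and bucket name are made up; gsutil must already be installed and authenticated.

import subprocess

# Recursively copy a local directory to Cloud Storage using parallel (-m) uploads
subprocess.run(
    ["gsutil", "-m", "cp", "-r", "/data/onprem-exports", "gs://my-example-backups-bucket/"],
    check=True,
)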
    


Saturday, October 17, 2020

ElasticSearch (Introduction)

Q. What is ElasticSearch?

ElasticSearch is an open source analytics and full text search engine.

Often used to enable search functionality for applications. One can build complex search functionality using ElasticSearch. It can also be used to aggregate data and analyze results.

Data is stored as documents (JSON Object) in ElasticSearch.

For better understanding: a document in ElasticSearch corresponds to a row in a relational database.

This document contains fields, which correspond to the columns of a relational database row.

e.g

{
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203"
}


Data is stored in Nodes. There are multiple nodes and each node stores a part of the data.

A node is an instance of ElasticSearch. A machine can run many nodes, but it is best to have one node per machine. To store large data sets, we therefore use multiple machines, each running its own node.

Set of nodes is called a Cluster.

Every document in ElasticSearch is stored within an index (all documents need to be indexed).

e.g A document containing a person's details like name, country, DOB etc. may be stored in an index named "people_index".
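
As a small sketch of how this looks in code, here is the document above being indexed into "people_index" with the official elasticsearch Python client (pip install elasticsearch). The URL and document id are illustrative assumptions.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "Name": "James Bond",
    "DOB": "01-01-1999",
    "EmployeeID": "10203",
}

# Store the document in the "people_index" index
# (recent client versions use document=; older 7.x versions use body=)
es.index(index="people_index", id="10203", document=doc)

# Full-text search on the Name field
resp = es.search(index="people_index", query={"match": {"Name": "bond"}})
print(resp["hits"]["hits"])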

Indexes are broken into shards, which reside on nodes. Shards are created based on the capacity of the node(s) and placed on them. Sharding helps improve performance, since queries can be run in parallel across multiple shards.

By default, an index has one shard, and we can increase/decrease the number of shards using the split/shrink APIs.

But what if a node/shard fails? Will we lose data? 

If there is no copy of the data, we will lose data. Replication is enabled by default in ElasticSearch. Default value is 1.

Replication is configured at the index level. We can choose the # of replicas we need while creating an index (default is 1 as mentioned above). Copies of shards are created. These copies are known as "replicas" or "replica shards".
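
A minimal sketch, assuming the same Python client as above: creating an index with an explicit number of shards and replicas. The index name and values are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 2 primary shards, each with 1 replica copy, spread across the cluster's nodes
# (in 7.x clients this is passed as body={"settings": {...}})
es.indices.create(
    index="people_index",
    settings={"number_of_shards": 2, "number_of_replicas": 1},
)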

The main shard is known as the "primary shard" and a replication group (primary and replicas) is created.

Obviously, the replicas will be stored on different nodes to avoid a single point of failure.

But what if I have only one node?

Replication on a single node makes no sense, since failure of that node would still lead to data loss. We need a minimum of 2 nodes for replication to be effective.

How is replication distributed?

1 node:

  • Replication ineffective.

2 nodes:

  • With the default configuration of 1, the replica will be placed on the other node (the one not holding the primary).
  • With a configuration > 1, all replicas will be placed on the other node (the one not holding the primary).

> 2 nodes:

  • With a configuration > 1, the replicas will be distributed across multiple nodes to improve availability and fault tolerance.

What does replication achieve?

  • High availability (a replica is available in case of downtime).
  • Throughput (search queries can be routed to replicas as well, improving performance).

Can we take backups of the data?

Snapshots can be taken at any point in time (at the index level or the cluster level) and can be used to restore data as well.




     




    Monday, February 3, 2020

    Big Data and Hadoop Introduction

    How do we define Big Data?

    Big data in layman's terms implies a large volume of data (petabytes [1024 terabytes] or exabytes [1024 petabytes]). This data can be in structured or unstructured formats.

    Such large volumes of data cannot be handled by traditional data processing software.

    Big data attributes:

    1. Volume - as mentioned above, we are handling large volumes of data.
    2. Format - data can be structured or unstructured.
    3. Velocity - the speed at which data is generated and must be processed.
    4. Veracity - in large data sets, the quality of data may vary and needs to be ascertained.

    e.g
    Amazon and Netflix recommendation engines based on subscriber interests.
    Uber Geodata.
    Apple Siri - voice data.
    Driverless cars - sensor and image data processing.


    What is Hadoop?

    As per text book definition,"Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware."


    What does it imply?

    - Hadoop can store large volumes of data.
    - It runs on commodity hardware implying lower cost per GB.



    Hadoop stores data across a cluster of machines and hence can scale horizontally. This helps reduce the cost per GB as volume increases. It also implies that data may need to be fetched from different machines in the cluster rather than from a single machine.

    Thursday, December 19, 2019

    Correlation

    What is Correlation?

    Correlation shows us the Direction and Strength of a linear relationship shared between 2 quantitative variables.

    It's computed using the equation

        r = (1 / (n − 1)) × Σ [ (Xi − X̄) / Sx ] × [ (Yi − Ȳ) / Sy ]

    where

    r = the correlation
    n = the number of data points in the data set
    Xi, Yi = the individual data values of the two variables
    X̄, Ȳ = the means of the two variables
    Sx = the standard deviation of X
    Sy = the standard deviation of Y

    For more details on the mean and standard deviation, refer to the post "Mode, Median, Mean, Range and Standard Deviation" below.

    Direction is provided by the slope (if we draw a line along the data points)
    If the slope is upwards, we deduce that the correlation is positive.
    If the slope is downwards, we deduce that the correlation is negative.
    Correlation values range from -1 to 1.
    A value of 1 indicates perfect positive correlation and a -1 indicates perfect negative correlation.

    [Scatter plot: positive correlation]

    [Scatter plot: negative correlation]

    Strength of a linear relationship gets stronger as correlation increases from 0 to 1 or from 0 to -1.
    [Scatter plots for r = 0, r = 0.3, r = 0.7 and r = 1 illustrate the increasing strength.]

    Let's look at a calculation for a data set of "No of hours on a treadmill" vs "Calories burnt".






    We can see a near-straight line and a positive correlation of 0.969 (very close to a perfect positive correlation).
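
    As a quick check of the formula, here is a small Python sketch. The treadmill data below is made up for illustration (the original table is not reproduced), so the result will not be exactly 0.969, but for roughly linear data like this it comes out close to 1.

    import math

    hours    = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hours on a treadmill (illustrative)
    calories = [150, 290, 460, 590, 740, 910]   # calories burnt (illustrative)

    n = len(hours)
    mean_x = sum(hours) / n
    mean_y = sum(calories) / n

    # sample standard deviations (divide by n - 1)
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in hours) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in calories) / (n - 1))

    # r = (1 / (n - 1)) * sum of products of the standardized values
    r = sum((x - mean_x) / sx * ((y - mean_y) / sy)
            for x, y in zip(hours, calories)) / (n - 1)
    print(round(r, 3))  # a strong positive correlation, close to 1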

    Friday, December 13, 2019

    Mode, Median, Mean, Range and Standard Deviation

    Let's try to ascertain the differences between Mode, Median, Mean, Range and Standard Deviation.

    Let's assume the following data set:

    50, 20, 100, 150, 20, 60, 20, 15, 35

    Mode:
    The mode is the value that occurs most frequently.
    From the above data set, we can see that 20 occurs thrice; hence the mode for the above data set is 20.

    Median:
    Center point of an ordered data set.
    The point to note here is "ordered" data set.

    Hence, for the above set, let's do the ordering first.

    50, 20, 100, 150, 20, 60, 20, 15, 35
    becomes
    15, 20, 20, 20, 35, 50, 60, 100, 150

    How do we get the median?

    Median position = (n + 1) / 2

    In the above case, that's (9 + 1) / 2 = the 5th position, which is 35.

    How about when we have even numbers in a data set?
    In that case, we take the average of the middle two numbers.

    Let's add one more element to the above ordered data set.
    15, 20, 20, 20, 35, 50, 60, 100, 150, 175

    Median will be average of the middle two numbers which is avg of (35 and 50) which is 42.5

    Mean:
    Mean is the average. [(Sum of all data values)/n]

    In this case
    50, 20, 100, 150, 20, 60, 20, 15, 35

    Mean = (50+20+100+150+20+60+20+15+35)/9 = 470/9 = 52.22

    Range:
    Range is simply the difference between the max and the min.
    Hence,

    Range = max-min = 150 - 15 = 135

    Standard Deviation
    Standard deviation measures how close the values in a data set are to the mean.

    Formula for the (sample) standard deviation:

    s = sqrt( Σ (Xi − X̄)² / (n − 1) )

    How do we calculate this?

    Take each value's difference from the mean (52.22), square it, sum the squares, divide by n − 1, and take the square root.

    The standard deviation works out to 45.69.

    Small deviation indicates the distribution is less spread and data is close to mean.
    Large deviation indicates the distribution is more spread and data is further away from the mean.
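
    These values can be double-checked with a few lines of Python using the built-in statistics module (a quick sketch, not part of the original write-up):

    import statistics

    data = [50, 20, 100, 150, 20, 60, 20, 15, 35]

    print(statistics.mode(data))             # 20
    print(statistics.median(data))           # 35
    print(round(statistics.mean(data), 2))   # 52.22
    print(max(data) - min(data))             # range = 135
    print(round(statistics.stdev(data), 2))  # sample standard deviation, ~45.7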









    Tuesday, December 10, 2019

    Web Scraping Introduction

    What is Web Scraping?
    In simplest terms, as the name suggests, Web Scraping is scraping the data from the Web.

    If the web page is correctly marked up, one can locate an element (for example a <p> tag, or an element with a particular id) and extract its text.

    To get data from HTML, we can use the BeautifulSoup library, which builds a tree out of the various elements in a page and provides an interface to access these elements.

    Pre-requisite:


    pip install beautifulsoup4 html5lib requests



    We shall use requests to fetch the HTML page (via its URL) and then use BeautifulSoup to access the first paragraph.

    Let's try this with our own website:


    from bs4 import BeautifulSoup
    import requests

    # fetch the page and parse it with the html5lib parser
    webhtmlpage = requests.get("https://mylearningcafe.blogspot.com/p/welcome_9.html").text
    bsoup = BeautifulSoup(webhtmlpage, 'html5lib')

    # find the first <p> element
    first = bsoup.find('p')
    print(first)




    $ python webscraping.py

    <p class="description"><span>The cafe (of learning) never closes <br/><br/> For finance related posts, go to <a href="http://mymoneyrules.blogspot.in/">http://mymoneyrules.blogspot.in/</a></span></p>

    To extract the text, if I add:

    first_text = bsoup.find('p').text
    print(first_text)

    I will get

    The cafe (of learning) never closes  For finance related posts, go to http://mymoneyrules.blogspot.in/

    This is what the data on my website looks like:



    Let's get the count of <li> tags first:

    # find the count of <li> tags
    li_tag = bsoup('li')
    print(len(li_tag))

    How do we extract the link (<a>) text?

    If we look closely, the main data is in "<div id='adsmiddle24552235005691491924'>"

    Thus, we search for that specific div id and loop through its <li> tags to find the text.
    Some <li> items have no <a> tag, so the lookup returns None, and accessing .text on it would throw an AttributeError. Hence, we skip those items using a try/except block.

    # find the count of <li> tags
    li_tag = bsoup('li')
    print(len(li_tag))

    # find the div that holds the main content, then its <li> tags
    div_tag = bsoup.find("div", {"id": "adsmiddle24552235005691491924"})
    print(len(div_tag.find_all("li")))
    # print(div_tag.find_all("li"))

    for a_text in div_tag.find_all("li"):
        try:
            print(a_text.a.text)
        except AttributeError:
            pass  # skip <li> items that have no <a> tag

    Output:

    $ python webscraping.py 
    High level introduction to Kubernetes
    Docker vs VM
    Kubernetes - Features
    Quick Introduction to AWS
    Definition of various storage services
    How to add buckets in S3
    AWS Database services (Intro and how to create a DB Instance)
    Statistical learning Index Page
    Annotate - Lets you create personalized messages over images (meme)
    Link Library - A library for all your links
    Raspberry Pi Index Page
    R programming Index Page
    Python Index Page
    Alert pop up dialog in android 
    Rename package in Android
    How to change version of apk for redeployment in Android Studio 
    How to change an app's icon in Android Studio 
    http://mylearningcafe.blogspot.in/2015/05/garbage-collection-gc.html
    http://mylearningcafe.blogspot.in/2014/02/sorting-algorithms-in-java.html
    http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-2.html
    http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-1.html
    http://mylearningcafe.blogspot.in/2014/02/automatic-resource-management-in-java-7.html