Tuesday, December 10, 2019

Web Scraping Introduction

What is Web Scraping?
In the simplest terms, web scraping means programmatically extracting data from web pages.

If the web page is correctly marked up, one can extract data by locating a specific element, for example the <p> element whose id is "subject", and returning the text it contains.
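As a minimal sketch of that idea, the snippet below parses a small, made-up HTML string (not taken from any real page) and reads the text of the <p> element whose id is "subject". It assumes Python's built-in html.parser rather than html5lib, purely so the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, invented for illustration only.
html = """
<html>
  <body>
    <p id="author">Jane Doe</p>
    <p id="subject">Web scraping basics</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <p> element whose id attribute is "subject" and read its text.
subject = soup.find("p", id="subject")
subject_text = subject.text if subject is not None else None
print(subject_text)
```

The None check matters because find() returns None when nothing matches.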

To get data out of HTML, we can use the BeautifulSoup library, which builds a tree out of the various elements in a page and provides an interface to access them.
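To make the "tree" idea concrete, here is a small sketch that walks a parsed document by attribute access. The markup is invented for illustration, and html.parser is assumed instead of html5lib to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1><p>One</p><p>Two</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Elements can be reached by walking the tree with attribute access.
title_text = soup.title.text   # text of <title>
first_para = soup.body.p.text  # first <p> under <body>

# Names of the direct children of <body>.
child_names = [child.name for child in soup.body.children]

# Each node also knows its parent.
parent_name = soup.h1.parent.name

print(title_text, first_para, child_names, parent_name)
```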

Prerequisites:


pip install beautifulsoup4 requests html5lib



We shall use requests to fetch the HTML page (via its URL) and then use BeautifulSoup to access the first paragraph.

Let's try this with our own website:


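from bs4 import BeautifulSoup
import requests

# Download the page and keep its HTML as a string
webhtmlpage = requests.get("https://mylearningcafe.blogspot.com/p/welcome_9.html").text
# Parse the HTML with the html5lib parser
bsoup = BeautifulSoup(webhtmlpage, 'html5lib')

# find() returns the first matching <p> element
first = bsoup.find('p')
print(first)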




$ python webscraping.py

<p class="description"><span>The cafe (of learning) never closes <br/><br/> For finance related posts, go to <a href="http://mymoneyrules.blogspot.in/">http://mymoneyrules.blogspot.in/</a></span></p>
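The script above calls requests.get without a timeout and does not check the HTTP status. A slightly more defensive sketch is shown below; the helper names fetch_html and first_paragraph are hypothetical, the 10-second timeout is an arbitrary choice, and html.parser is assumed in place of html5lib.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def first_paragraph(html):
    """Return the first <p> element in the HTML, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("p")


# Usage (needs network access, so it is left commented out here):
# page = fetch_html("https://mylearningcafe.blogspot.com/p/welcome_9.html")
# print(first_paragraph(page))
```

Keeping the download and the parsing in separate functions makes the parsing step easy to try on a saved or hand-written HTML string.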

To extract the text, if I add:

first_text = bsoup.find('p').text
print(first_text)

I will get:

The cafe (of learning) never closes  For finance related posts, go to http://mymoneyrules.blogspot.in/
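The .text attribute simply concatenates the text fragments, which is why a double space remains where the <br/> tags were. As a sketch of one way to tidy this up, get_text() accepts a separator and a strip flag, and the link target can be read from the <a> tag's href attribute. The paragraph is inlined here so no network access is needed, and html.parser is assumed instead of html5lib.

```python
from bs4 import BeautifulSoup

# The paragraph shown in the output above, inlined so no network is needed.
html = (
    '<p class="description"><span>The cafe (of learning) never closes '
    '<br/><br/> For finance related posts, go to '
    '<a href="http://mymoneyrules.blogspot.in/">http://mymoneyrules.blogspot.in/</a>'
    '</span></p>'
)

soup = BeautifulSoup(html, "html.parser")
para = soup.find("p")

# get_text() with a separator and strip=True trims each text fragment
# and joins the fragments with a single space.
clean_text = para.get_text(" ", strip=True)

# Attributes of a tag are read like dictionary keys.
link_url = para.a["href"]

print(clean_text)
print(link_url)
```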

This is what the data on my website looks like:



Let's count the <li> tags first:

# find count of <li> tags
li_tag = bsoup('li')
print(len(li_tag))
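Calling the soup object directly, as in bsoup('li'), is shorthand for find_all('li'). The sketch below shows both forms on a small invented list, plus the limit argument of find_all; html.parser is assumed so the example runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
  <li>No link here</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all().
via_call = soup("li")
via_find_all = soup.find_all("li")
count = len(via_find_all)

# limit= caps how many matches are returned.
first_two = soup.find_all("li", limit=2)

print(count, len(via_call), len(first_two))
```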

How do we extract the text of the links (href)?

If we look closely, the main data is in "<div id='adsmiddle24552235005691491924'>"

Thus, we search for that specific div id and loop through its <li> elements to find the link text.
Some <li> elements contain no <a> tag, so a_text.a is None for them, and accessing .text on None raises an AttributeError. Hence, we skip those elements with a try/except block.

# find count of <li> tags
li_tag = bsoup('li')
print(len(li_tag))

div_tag = bsoup.find("div", {"id": "adsmiddle24552235005691491924"})
print(len(div_tag.find_all("li")))
# print(div_tag.find_all("li"))

for a_text in div_tag.find_all("li"):
    try:
        print(a_text.a.text)
    except AttributeError:
        # this <li> has no <a> tag; skip it
        pass

Output:

$ python webscraping.py 
High level introduction to Kubernetes
Docker vs VM
Kubernetes - Features
Quick Introduction to AWS
Definition of various storage services
How to add buckets in S3
AWS Database services (Intro and how to create a DB Instance)
Statistical learning Index Page
Annotate - Lets you create personalized messages over images (meme)
Link Library - A library for all your links
Raspberry Pi Index Page
R programming Index Page
Python Index Page
Alert pop up dialog in android 
Rename package in Android
How to change version of apk for redeployment in Android Studio 
How to change an app's icon in Android Studio 
http://mylearningcafe.blogspot.in/2015/05/garbage-collection-gc.html
http://mylearningcafe.blogspot.in/2014/02/sorting-algorithms-in-java.html
http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-2.html
http://mylearningcafe.blogspot.in/2014/02/some-new-features-in-java-7-part-1.html
http://mylearningcafe.blogspot.in/2014/02/automatic-resource-management-in-java-7.html
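As an alternative to catching AttributeError, the missing <a> can be tested for explicitly, or a CSS selector can be used so that only <a> tags are matched in the first place. This is a sketch on invented markup: the div id "links" and the list items are made up, and html.parser is assumed instead of html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the div id is made up too.
html = """
<div id="links">
  <ul>
    <li><a href="/k8s">Kubernetes intro</a></li>
    <li>Plain item without a link</li>
    <li><a href="/docker">Docker vs VM</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div_tag = soup.find("div", {"id": "links"})

# Option 1: test for the missing <a> explicitly instead of catching AttributeError.
titles = [li.a.text for li in div_tag.find_all("li") if li.a is not None]

# Option 2: a CSS selector that only matches <a> tags inside <li> inside the div.
links = [(a.text, a["href"]) for a in soup.select("div#links li a")]

print(titles)
print(links)
```

The selector version also returns the href values, which is handy when the link text alone is not enough.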






