
How to Scrape Data from a Website using Python Beautifulsoup



Data mining, or web scraping, is the technique of downloading the data present on a specific web page. There are hundreds of tutorials on “how to scrape data from a website using python” on the web, but I remember that the first time I searched for a good one, none of them really helped me understand the simple concepts behind scraping.

So here I will try to explain scraping for absolute beginners. First and foremost, this article is for educational purposes only: scrape data without slamming the servers, and note that I cannot help you scrape content that you have to pay for.
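One way to avoid slamming a server is to check the site’s robots.txt and pause between requests. The sketch below is only an illustration using Python’s standard library. The robots.txt content, the example.com URLs, the 1-second fallback delay, and the helper names `allowed_urls` and `fetch_politely` are all made up for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs robots.txt permits, plus the delay it asks for."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Fall back to a 1-second pause when no Crawl-delay is given (an assumption).
    delay = parser.crawl_delay(user_agent) or 1
    permitted = [u for u in urls if parser.can_fetch(user_agent, u)]
    return permitted, delay


def fetch_politely(urls, delay, fetch):
    """Call fetch(url) for each URL, sleeping `delay` seconds in between."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)
        results.append(fetch(url))
    return results


urls = [
    "http://example.com/movies/page1",
    "http://example.com/private/admin",
]
permitted, delay = allowed_urls(SAMPLE_ROBOTS_TXT, urls)
print(permitted, delay)
```

In a real script, `fetch` would be whatever function downloads a page, and `delay` would come from the site’s own robots.txt.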

Tools I’m using:

  1. Python (I am currently using Python 3, but have also worked with 2.7)
  2. A few libraries have to be installed; I will go through them step by step
  3. Linux (Ubuntu) at the moment, but both Windows and Mac will do just fine

First, find a website you want to scrape data from. Here, let’s take IMDb as an example.

For the test scenario, let’s say we want to grab the names of the most-voted movies released in 2018, along with their ratings, so that we can create a watch list. The libraries required are ‘requests’ and ‘beautifulsoup’.

To install these libraries we need ‘pip’, which is a package manager for Python. If you have not installed Python yet, I don’t think you’d have read this far, so let’s move on: there are a couple of ways to install pip on your system, depending on the OS you are currently on.

  • Linux Users : Go to this Link
  • Windows Users : Go to this Link
  • Mac Users : Go to this Link

While testing Python code, use IPython, which makes learning easier and also offers auto-completion on the Tab key. Install it by typing “pip install ipython” in your terminal.

 

Now install the required libraries, starting with requests: “pip install requests”.

Then “pip install beautifulsoup4” (the package that provides the bs4 module imported below) and “pip install html5lib”.

Now open the IPython shell by typing “ipython” in your terminal, and try importing the installed libraries:

[1] import requests

[2] from bs4 import BeautifulSoup

Once both are imported we are halfway there: we have the tools required to scrape data. First, navigate to the web page we want to scrape (here we are taking an IMDb link as reference).

Looking at the page as a whole, the first thing to understand is that no matter how a website looks, every website follows the same basic structure:

<html>

   <head>

   </head>

   <body>

All Contents

   </body>

</html>
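To see how this structure maps onto BeautifulSoup, here is a small sketch that parses a skeleton like the one above and reads the head and body separately. The page title and the `<p>` tag are additions made for illustration, and the built-in 'html.parser' is used instead of html5lib so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup

# A tiny page following the html/head/body structure (contents are made up).
SKELETON = """
<html>
   <head><title>Example page</title></head>
   <body><p>All Contents</p></body>
</html>
"""

soup = BeautifulSoup(SKELETON, "html.parser")

# Each tag is reachable as an attribute of its parent.
head_title = soup.head.title.string
body_text = soup.body.get_text(strip=True)

print(head_title)
print(body_text)
```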

What we need is the content inside body, so first load the website link in the terminal:

[1] import requests

[2] from bs4 import BeautifulSoup

[3] url = 'http://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

[4] html = requests.get(url).content

In the third line, url is a variable that holds the link. In the fourth, requests.get(url) sends an HTTP GET request to that link, and .content gives us the raw body of the response, which we store in the variable html.
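As a slightly more defensive take on lines [3] and [4], the sketch below wraps the request in a function that sets a timeout and raises an exception on HTTP error codes. The function name `fetch_html`, the 10-second timeout, and the `session` parameter (which exists only so the function can be tried without touching the network) are assumptions for illustration, not part of the walkthrough itself.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Download a page and return its raw body as bytes.

    `session` may be any object with a requests-style get() method;
    by default the requests module itself is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Turn 4xx/5xx answers into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this in place, `html = fetch_html(url)` could stand in for line [4].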

[5] soup = BeautifulSoup(html, 'html5lib')

The soup variable now holds the parsed contents of the page fetched by requests, that is, all the data present in the displayed web page. The only thing left to do is extract the data we need.

Now open the link in the browser again to locate the required elements, and press F12, Ctrl+Shift+I, or Cmd+Shift+I to open the developer tools. Select the required elements and note down their tag names and classes; if an id is present, that makes things even easier.

[6] mlist = soup.find('div', class_='lister-list')

[7] mov_container = mlist.find_all('div', class_='lister-item mode-advanced')

Here we first separated the div we wanted from the soup, which contained the whole page, and then, within that div, narrowed things down to the divs that hold only the movie details.
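The difference between the two calls is easy to try on a small piece of HTML: find returns the first matching tag (or None when nothing matches), while find_all returns a list of every match. The markup below is a hypothetical stand-in that only borrows the class names used above, and 'html.parser' is used so the snippet runs without html5lib.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names from the walkthrough.
SAMPLE_HTML = """
<div class="lister-list">
  <div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
  <div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find: the first matching tag, or None if there is no match.
mlist = soup.find("div", class_="lister-list")
missing = soup.find("div", class_="no-such-class")

# find_all: a list with every matching tag inside mlist.
mov_container = mlist.find_all("div", class_="lister-item mode-advanced")

print(len(mov_container))
print(missing)
```

Because find can return None, it is worth checking its result before calling find_all on it when working with a live page.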

Since there are multiple entries (that is, multiple movies) to extract from, we have to use a for loop. We also declare an empty list in which we can save the details.

[8] arr = []

[9] for mov in mov_container:
    arr.append(mov.h3.a.text + ' -IMDB Rating ' + mov.strong.text)

Now when we display arr, each entry should be a movie title followed by “ -IMDB Rating ” and the movie’s rating.
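The loop can be tried offline on a small sample. The sketch below uses made-up movie entries (“Movie A”, “Movie B”) rather than real IMDb data, and adds one guard that the loop above does not have: when an entry has no strong tag, mov.strong is None, so the rating falls back to a placeholder. The function name `extract_movies` and the “n/a” placeholder are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up entries; the second one deliberately has no rating.
SAMPLE_HTML = """
<div class="lister-list">
  <div class="lister-item mode-advanced">
    <h3><a>Movie A</a></h3><strong>8.1</strong>
  </div>
  <div class="lister-item mode-advanced">
    <h3><a>Movie B</a></h3>
  </div>
</div>
"""


def extract_movies(html):
    """Return one 'title -IMDB Rating rating' string per movie div."""
    soup = BeautifulSoup(html, "html.parser")
    arr = []
    for mov in soup.find_all("div", class_="lister-item mode-advanced"):
        title = mov.h3.a.text.strip()
        # mov.strong is None when the entry has no <strong> tag.
        rating = mov.strong.text.strip() if mov.strong else "n/a"
        arr.append(title + " -IMDB Rating " + rating)
    return arr


movies = extract_movies(SAMPLE_HTML)
for line in movies:
    print(line)
```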

Since mov_container is built from divs with a unique class name, we could also have separated them directly from soup. Some websites, however, use the same class for various elements, so it is better to call find step by step, walking down the tree from the outer container.
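Here is a small sketch of why walking down the tree matters. In the made-up markup below, the class “item” is used both in a sidebar and in the results list, so searching the whole soup picks up an unwanted entry while searching inside the container does not. All class names and contents here are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up page where the same class appears in two different places.
SAMPLE_HTML = """
<div class="sidebar">
  <div class="item"><a>Sponsored link</a></div>
</div>
<div class="results">
  <div class="item"><a>Movie A</a></div>
  <div class="item"><a>Movie B</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching the whole soup also picks up the sidebar entry.
everywhere = [d.a.text for d in soup.find_all("div", class_="item")]

# Narrowing to the container first keeps only the entries we want.
results = soup.find("div", class_="results")
scoped = [d.a.text for d in results.find_all("div", class_="item")]

print(everywhere)
print(scoped)
```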

The code:

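import requests
from bs4 import BeautifulSoup

# Search results for 2018 releases, sorted by number of votes
url = 'http://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

# Download the page and parse it
html = requests.get(url).content
soup = BeautifulSoup(html, 'html5lib')

# Narrow down to the list of movies, then to each movie's div
mlist = soup.find('div', class_='lister-list')
mov_container = mlist.find_all('div', class_='lister-item mode-advanced')

# Collect "title -IMDB Rating rating" for every movie
arr = []
for mov in mov_container:
    arr.append(mov.h3.a.text + ' -IMDB Rating ' + mov.strong.text)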

This is just a very basic scraping tutorial. From here we can save the results in spreadsheets. Some websites will not allow plain requests; for those we have to use a web driver like Selenium to open an automated web browser. But the basis of scraping any website is the same: get the page source, parse it with BeautifulSoup, and then separate and extract the required fields.
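For the spreadsheet step, one simple option is a CSV file, which any spreadsheet program can open. The sketch below uses Python’s built-in csv module. The function name `write_movies_csv`, the column names, and the sample rows are made up for illustration, and an in-memory buffer stands in for a real file so the snippet leaves nothing on disk.

```python
import csv
import io


def write_movies_csv(rows, out_file):
    """Write (title, rating) pairs to an open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["title", "rating"])  # header row
    writer.writerows(rows)


# Made-up rows standing in for scraped results.
rows = [("Movie A", "8.1"), ("Movie B", "7.4")]

buffer = io.StringIO()
write_movies_csv(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, pass something like `open("movies.csv", "w", newline="")` in place of the buffer.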

Author: Aditya Raj

Technology is one of my greatest crushes and always has been. Blogging is my hobby, with the aim of sharing knowledge about technology and how things are done and how they work. I believe that technology has advanced well beyond where we are now, and it’s time we get ourselves updated and contribute a little of our own to the ever-growing world.
