Web Scraping Using Python BeautifulSoup

If you are a Python developer planning to get started with web scraping, chances are you will begin with BeautifulSoup, as most beginners do. There is a good reason for this: scraping with BeautifulSoup is considerably easier than most other approaches.

In this article, we will learn how to scrape a website using Python and the BeautifulSoup package.

[Image: web scraping with Python BeautifulSoup]

What is Python BeautifulSoup?

BeautifulSoup, or BS4, is a Python library that helps you scrape data from a web page. It parses the page into a tree and lets you search, iterate over, and modify the HTML DOM structure to extract the information you need.
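
To make the three operations concrete, here is a minimal sketch of searching, iterating over, and modifying a parsed tree. The HTML snippet is a made-up example, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<html><body>
  <h1 class="title">Hello</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find the first <h1> tag in the document.
heading = soup.find("h1")

# Iterate: collect the href attribute of every <a> tag.
links = [a["href"] for a in soup.find_all("a")]

# Modify: replace the heading text in the parsed tree.
heading.string = "Hello, BeautifulSoup"

print(heading.get_text())
print(links)
```

The changes apply only to the in-memory tree; the original page is not touched.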

How does it work?

Consider the website shown in the image. Let’s say we want to extract the main title “Delivering Data with Managed Web Scraping Expertise” from the page. In this example, the workflow is as follows:

Get the web page.
Inspect the HTML DOM structure to work out how to extract the title. As shown in the image, the text we want is inside an <h1> tag that has several classes assigned to it.
There are multiple ways to extract it: if the page has a single <h1> element, we can select it by tag name; if not, we can use its class name or ID. The choice depends on how the element is placed and structured.
Fetch the desired data using the corresponding BeautifulSoup call.
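
The selection options from the workflow above can be sketched as follows. The title text comes from the example page, but the id "main-title" and the class names are assumptions made for illustration; the real page may use different ones.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id and class names below are assumed, not
# copied from the real page.
html = """
<h1 id="main-title" class="hero-title large">Delivering Data with Managed Web Scraping Expertise</h1>
<h1 class="secondary">Another heading</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# By tag name: returns the first <h1>, enough when there is only one.
by_tag = soup.find("h1")

# By class name: matches if any of the element's classes equals the value.
by_class = soup.find("h1", class_="hero-title")

# By ID: IDs are meant to be unique within a page.
by_id = soup.find(id="main-title")

# By CSS selector: combines tag and class in a single expression.
by_css = soup.select_one("h1.hero-title")

for tag in (by_tag, by_class, by_id, by_css):
    print(tag.get_text())
```

All four lookups return the same element here; on a real page, the class or ID lookup is usually the safer choice when several <h1> tags are present.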

Code example: how to use BeautifulSoup

Create a folder for the project

mkdir beautifulsoup-test

Set up a virtualenv

python3 -m venv myvenv

Activate the virtualenv

source myvenv/bin/activate

Install the requests package

pip3 install requests

Install the BeautifulSoup package (published on PyPI as beautifulsoup4)

pip3 install beautifulsoup4

Full code to extract the title from the web page using the Python requests and BeautifulSoup packages

import requests
from bs4 import BeautifulSoup

def extract_h1_title(url):
    try:
        # Send a GET request to the URL; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the first <h1> tag on the page
        h1_tag = soup.find('h1')

        # Return the text content of the <h1> tag or a message if not found
        return h1_tag.text.strip() if h1_tag else "No <h1> tag found"
    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"

# Example usage
url = "https://getdataforme.com/"
h1_title = extract_h1_title(url)
print(f"The <h1> title is: {h1_title}")

Save the code as main.py and run it with python3 main.py

Output:

[Image: final output after running the Python BeautifulSoup code example]
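
One possible refinement, sketched here under assumptions rather than taken from the script above: separating the parsing step from the network request lets you check the extraction logic against a fixed HTML string without touching the network. The helper name extract_h1_from_html and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_h1_from_html(html):
    """Return the stripped text of the first <h1> in html, or None if absent.

    Hypothetical helper, not part of the original script.
    """
    soup = BeautifulSoup(html, "html.parser")
    h1_tag = soup.find("h1")
    return h1_tag.get_text(strip=True) if h1_tag else None

# Fixed sample input, so the function can be checked offline.
sample_html = "<html><body><h1>  Sample title  </h1><p>Body text</p></body></html>"
print(extract_h1_from_html(sample_html))
print(extract_h1_from_html("<p>No heading here</p>"))
```

Returning None instead of a message string lets the caller tell a missing heading apart from a heading whose text happens to match the message.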