-->
If you are a Python developer planning to start with web scraping, chances are you will begin with BeautifulSoup, as most beginners do. There is a good reason for this: web scraping with BeautifulSoup is considerably easier than most other approaches.
In this article, we are going to learn how to scrape a website using Python and the BeautifulSoup package.
BeautifulSoup, or BS4, is a Python library that helps scrape data from a web page. It extracts the information you need by letting you search, iterate over, and modify the HTML DOM tree as required.
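To make the three operations mentioned above (search, iterate, modify) concrete, here is a minimal sketch. The HTML snippet and its contents are made up for illustration and do not come from any real page; it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration
html = """
<html><body>
  <h1 class="title">Sample heading</h1>
  <ul>
    <li>first</li>
    <li>second</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search: find the first <h1> tag in the tree
heading = soup.find("h1")
heading_text = heading.get_text(strip=True)

# Iterate: loop over every <li> element and collect its text
items = [li.get_text(strip=True) for li in soup.find_all("li")]

# Modify: replace the heading text inside the parsed tree
heading.string = "Updated heading"
updated_text = soup.find("h1").get_text(strip=True)

print(heading_text)
print(items)
print(updated_text)
```

Because the snippet parses an inline string, it runs without any network access, which makes it a convenient way to experiment before scraping a live site.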
Consider the website shown in the image. Let’s say we want to extract the main title “Delivering Data with Managed Web Scraping Expertise” from the page. In this example, the workflow is as follows:
The title sits inside an <h1> tag, and it has different classes assigned to it.
If the page has only one <h1> element, we can refer to it with <h1>; if not, we can use either a class name or an ID name. It depends on how the element is placed and structured.
Create a folder for the project
mkdir beautifulsoup-test
Set up a virtualenv
python3 -m venv myvenv
Activate the virtualenv
source myvenv/bin/activate
Install the requests package
pip3 install requests
Install the BeautifulSoup package
pip3 install bs4
import requests
from bs4 import BeautifulSoup


def extract_h1_title(url):
    try:
        # Send a GET request to the URL
        response = requests.get(url)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the first <h1> tag on the page
        h1_tag = soup.find('h1')

        # Return the text content of the <h1> tag or a message if not found
        return h1_tag.text.strip() if h1_tag else "No <h1> tag found"
    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"


# Example usage
url = "https://getdataforme.com/"
h1_title = extract_h1_title(url)
print(f"The <h1> title is: {h1_title}")
Save the code as main.py, then run it
python3 main.py
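The script above relies on the page having a single <h1>. As noted earlier, when a page has more than one, a class name or an ID can be used instead. The sketch below illustrates that fallback on a made-up HTML string; the class names (`site-name`, `hero-title`) and the ID (`main-title`) are hypothetical and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two <h1> elements; names are invented for illustration
html = """
<h1 class="site-name">Example Site</h1>
<h1 id="main-title" class="hero-title large">Hypothetical main title</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# find('h1') returns only the first <h1> in document order
first_h1 = soup.find("h1")

# Select by class name (matches even when the tag has several classes)
by_class = soup.find("h1", class_="hero-title")

# Select by ID
by_id = soup.find(id="main-title")

# The same lookup written as a CSS selector
by_css = soup.select_one("h1#main-title")

print(first_h1.get_text(strip=True))
print(by_class.get_text(strip=True))
print(by_id.get_text(strip=True))
print(by_css.get_text(strip=True))
```

Which lookup to prefer depends on the page: an ID is usually the most specific, while class names may be shared by several elements.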