Get Data For Me

Website Sitemap, How to find it

Web Scraping Team
#sitemap #web-scraping

What is a Sitemap?

A sitemap is a file that lists the web pages of a site to tell search engines and web crawlers how its content is organized. It acts as a roadmap that search engines use to crawl the site more effectively. Not every website has a sitemap, and those that do may use different formats or require authentication to access it.

How to find the website sitemap

There are several ways to find the sitemap a website uses. Let's look at a few methods we can apply.

Method-1: Direct URL Access

Most websites follow a common pattern when choosing the sitemap URL. The most common sitemap URLs to try are:

https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/sitemap/
https://example.com/sitemaps.xml
https://example.com/sitemap.php
https://example.com/sitemap.html
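
Trying each URL by hand gets tedious, so here is a minimal Python sketch that builds the candidate list above for any site and checks which ones respond. The function names (`candidate_sitemap_urls`, `url_exists`, `find_sitemaps`) are our own illustrative choices, and treating only HTTP 200 as "found" is an assumption; some servers reject HEAD requests or return a 200 error page, so treat a hit as a hint and confirm the content.

```python
import urllib.request
from urllib.parse import urljoin

# Paths taken from the list of common sitemap URLs above.
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap/",
    "/sitemaps.xml",
    "/sitemap.php",
    "/sitemap.html",
]


def candidate_sitemap_urls(base_url):
    """Build the common sitemap URLs for a site root (no network access)."""
    return [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]


def url_exists(url, timeout=10):
    """Return True if a HEAD request to the URL ends in an HTTP 200 response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers HTTP errors, DNS failures, timeouts and malformed URLs.
        return False


def find_sitemaps(base_url):
    """Return the candidate URLs that actually respond (performs network calls)."""
    return [url for url in candidate_sitemap_urls(base_url) if url_exists(url)]
```

Calling `find_sitemaps("https://example.com")` returns whichever of the six candidates respond; `candidate_sitemap_urls` can be used on its own when you only want the list.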

Method-2: Check Robots.txt

The robots.txt file is a text file used by websites to instruct web crawlers and bots about which parts of the site they are allowed or disallowed to access. It is part of the Robots Exclusion Protocol (REP) and serves as a guide for search engines, web scrapers, and other automated tools. These rules are only guidelines for crawlers to follow and are not enforced. At Getdataforme, we make sure we respect the guidelines in robots.txt.

Although it is optional, many websites put a sitemap reference in robots.txt to help crawlers:

1. Go to: https://example.com/robots.txt
2. Look for lines starting with "Sitemap:"
   Example:
   Sitemap: https://example.com/sitemap.xml
   Sitemap: https://example.com/post-sitemap.xml
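
The same lookup can be sketched in Python with the standard library's `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or newer). The sample robots.txt text below is hypothetical: the `Disallow: /private/` rule is added purely for illustration, and `parse_robots` is our own helper name.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, used here only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/post-sitemap.xml
"""

parser = parse_robots(SAMPLE_ROBOTS_TXT)

# site_maps() returns None when no "Sitemap:" lines exist, so fall back to [].
sitemaps = parser.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)

# The same parser also answers whether a crawler may fetch a given URL.
may_fetch_private = parser.can_fetch("*", "https://example.com/private/page")
may_fetch_blog = parser.can_fetch("*", "https://example.com/blog/")
```

To work against a live site, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before parsing it.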

Method-3: CMS-Specific Locations

The sitemap location can vary depending on the CMS used to build the website. Let's look at some common CMSs and their typical sitemap locations.

WordPress Sites sitemap locations

/wp-sitemap.xml
/wp-sitemap-posts-post-1.xml
/wp-sitemap-taxonomies-category-1.xml
/wp-sitemap-users-1.xml
/news-sitemap.xml

Shopify Sites sitemap locations

/sitemap.xml
/sitemap_products_1.xml
/sitemap_collections_1.xml
/sitemap_pages_1.xml
/sitemap_blogs_1.xml

Magento Sites sitemap locations

/sitemap.xml
/pub/media/sitemap.xml
/media/sitemap.xml
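
If you already know (or suspect) which CMS a site runs, the lists above can be kept in a small lookup table. This is only a sketch: the dictionary keys and the `cms_sitemap_urls` helper are hypothetical names, and the paths are the ones listed above, which may differ by CMS version, theme, or plugin.

```python
from urllib.parse import urljoin

# Typical sitemap paths per CMS, copied from the lists above.
CMS_SITEMAP_PATHS = {
    "wordpress": [
        "/wp-sitemap.xml",
        "/wp-sitemap-posts-post-1.xml",
        "/wp-sitemap-taxonomies-category-1.xml",
        "/wp-sitemap-users-1.xml",
        "/news-sitemap.xml",
    ],
    "shopify": [
        "/sitemap.xml",
        "/sitemap_products_1.xml",
        "/sitemap_collections_1.xml",
        "/sitemap_pages_1.xml",
        "/sitemap_blogs_1.xml",
    ],
    "magento": [
        "/sitemap.xml",
        "/pub/media/sitemap.xml",
        "/media/sitemap.xml",
    ],
}


def cms_sitemap_urls(base_url, cms):
    """Return the typical sitemap URLs for a CMS, or [] if the CMS is unknown."""
    paths = CMS_SITEMAP_PATHS.get(cms.lower(), [])
    return [urljoin(base_url, path) for path in paths]
```

For example, `cms_sitemap_urls("https://example.com", "Shopify")` yields the five Shopify candidates as full URLs.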

Method-4: Google Search Tricks

Another way to find the sitemap is to use Google search. Try these search operators in Google:

site:example.com filetype:xml
site:example.com inurl:sitemap
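
To avoid retyping the operators for every domain, a short Python sketch can build the search URLs for you to open in a browser. The helper name `google_sitemap_queries` is hypothetical, and the sketch only constructs URLs; it does not send any requests to Google.

```python
from urllib.parse import urlencode


def google_sitemap_queries(domain):
    """Build Google search URLs for the two sitemap operators shown above."""
    queries = [
        f"site:{domain} filetype:xml",
        f"site:{domain} inurl:sitemap",
    ]
    # urlencode percent-encodes ":" and turns spaces into "+".
    return ["https://www.google.com/search?" + urlencode({"q": q}) for q in queries]


for search_url in google_sitemap_queries("example.com"):
    print(search_url)
```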

Method-5: Check Page Source

A simpler way is to look in the website's HTML source code. Some sites declare their sitemap there with a link like the following:

<link rel="sitemap" type="application/xml" title="Sitemap" href="/sitemap.xml">
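
The same check can be done on downloaded HTML. Below is a minimal sketch using Python's built-in `html.parser`; the class and function names are our own, and it assumes the page declares the sitemap with `rel="sitemap"` as in the example above, which many sites do not.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SitemapLinkParser(HTMLParser):
    """Collect href values of <link rel="sitemap"> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sitemap_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "sitemap" in rel_values and href:
            # Resolve relative hrefs such as "/sitemap.xml" against the page URL.
            self.sitemap_urls.append(urljoin(self.base_url, href))


def find_sitemap_links(html, base_url):
    """Return all sitemap URLs declared in the given HTML text."""
    parser = SitemapLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.sitemap_urls
```

Pass the page's HTML and its URL, for example `find_sitemap_links(page_html, "https://example.com/")`, where `page_html` is whatever you downloaded or copied from "View Source".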

Method-6: Developer Tools

Another approach is to use the browser's developer tools. Follow these steps:

  1. Open Browser Dev Tools (F12)
  2. Go to Network tab
  3. Filter searches:
    • Type “sitemap”
    • Extension “.xml”
  4. Refresh page to see requests

Finding the sitemap is an integral part of web scraping. At Getdataforme, we prioritize it and make the best use of it to provide accurate and efficient data solutions.
