From Sitemaps to Rendered Links: The 6-Method Stack for Full-Site URL Discovery
Expert in Web Scraping Technologies
Key Takeaways:
- There is no single command that lists every page. A complete URL inventory comes from layering methods: the site: search operator for a quick estimate, sitemap.xml and the sitemap index for what the site publishes, robots.txt Sitemap directives for the entry points, an SEO crawler or a Python crawler for what's actually linked, and a cloud browser for client-side links that only appear after JavaScript runs.
- Sitemaps are the fastest authoritative source when they exist and stay current. A single requests.get("/sitemap.xml") plus a recursive walk of any sitemap index can return hundreds of URLs in one pass.
- A breadth-first crawler finds what sitemaps omit. Sitemaps are author-curated and frequently stale; a BFS walk of internal <a href> links discovers orphan pages, deep-linked content, and anything the sitemap forgot. The crawler in this guide honors robots.txt Disallow rules on every URL before fetching it.
- JavaScript-rendered links need a real browser. Single-page apps and infinite-scroll catalogs paint their internal links client-side, so a plain HTTP fetch returns a near-empty shell. Scrapeless Scraping Browser renders the page in a cloud browser, then you collect the anchors from the hydrated DOM, with US residential egress pinned on the session.
- Free to start. New Scrapeless accounts include free Scraping Browser runtime; sign up at app.scrapeless.com.
Introduction: why a full URL inventory is harder than it looks
Knowing every page on a website is the foundation for a lot of work: a technical SEO audit, a content migration, a broken-link sweep, a price-monitoring pipeline that needs every product URL, or an LLM ingestion job that wants the whole text corpus. The problem is that no site hands you a guaranteed-complete list. The home page links to most sections, the sitemap publishes some pages, and orphan pages (reachable by direct link but not linked from navigation) slip through both.
Existing options each cover one slice. The Google site: operator gives a fast public estimate but caps results and reflects only what's indexed. A sitemap.xml is authoritative for what the publisher chose to declare, but it goes stale and omits pages the CMS never registered. A crawler that follows links finds the linked graph, but a plain HTTP crawler returns an empty shell on JavaScript-heavy pages where the navigation is rendered client-side.
This guide walks through six methods in order of cost and completeness: cheap-and-fast first, thorough-and-rendered last. The Python examples use requests and the standard library for the static tiers, and a cloud browser through Scrapeless Scraping Browser for the JavaScript tier, so client-side links become discoverable. Pin US egress, render the page, collect the anchors. Cross-links to sibling guides are at the end.
What You Can Do With It
- Technical SEO audits. Enumerate every indexable URL, then diff the crawl against the sitemap to surface orphan pages and pages the sitemap forgot.
- Content migrations. Build the complete source-URL list before a replatform so nothing 404s after the cutover, and map old paths to new ones.
- Broken-link sweeps. Walk the internal link graph, record every destination, and flag the ones that return a non-200 status.
- Price and catalog monitoring. Discover every product URL on a retailer, including the JavaScript-rendered ones, and feed them into a downstream extraction pipeline.
- LLM corpus ingestion. Produce the full set of content URLs so a text-extraction job can pull the entire public corpus without missing deep-linked articles.
- Competitive content mapping. Inventory a competitor's public section structure from sitemaps and link graphs to size their content footprint.
Why Scrapeless Scraping Browser
Most of this guide runs on the Python standard library and requests: sitemaps and robots.txt are plain text and XML, and a static link crawler needs nothing more. Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents; it is the tier you reach for when the links you need only exist after JavaScript runs. For full-site URL discovery specifically, it brings:
- Cloud-side JavaScript rendering, so the internal links on a single-page app, an infinite-scroll catalog, or a React/Vue/Next.js navigation appear in the DOM you read, not as an empty <div id="root">.
- US residential proxy egress, pinned per session, so geo-gated sites serve the same page structure they serve a US visitor.
- Anti-detection fingerprinting on every session, so the rendered page matches what organic traffic sees rather than a flagged automation profile.
- Session continuity across a warm-up navigation and the target page, so a homepage visit that sets cookies carries into the page you actually want to enumerate.
- One endpoint, standard CDP, so Playwright (or any Chrome DevTools Protocol client) connects over WebSocket without a local browser doing the rendering.
Get your API key on the free plan at app.scrapeless.com.
Prerequisites
- Python 3.10 or newer.
- A Scrapeless account and API key; sign up at app.scrapeless.com.
- pip install requests playwright and playwright install chromium (the local Chromium is only the CDP protocol client; the rendering runs in Scrapeless's cloud).
- Basic familiarity with the terminal and HTTP.
Method 1: The Google site: search operator
The fastest first estimate needs no code. Type this into Google:
site:example.com
Google returns the pages it has indexed for that host, and the results header shows an approximate count. Narrow it to map the section structure:
- site:example.com/blog returns only URLs under /blog.
- site:example.com inurl:product returns indexed URLs whose path contains product.
- site:example.com -inurl:tag excludes a path segment you don't care about.
What this method is good for: a thirty-second sense of how big the site is and which sections exist. What it is not good for: completeness. The site: operator reflects only pages Google has chosen to index, the result count is an estimate rather than an exact figure, and the operator limits how many results you can page through. Treat it as a sanity check against the more thorough methods below: if your sitemap declares 5,000 URLs and site: shows roughly 200, that gap is itself a finding.
Method 2: Parse sitemap.xml and the sitemap index
A sitemap is the publisher's own declaration of its URLs, served as XML at a conventional path such as /sitemap.xml. It is the single fastest authoritative source when it exists and stays current. Two shapes matter:
- A <urlset> sitemap lists page URLs directly, one <url><loc>…</loc></url> per page.
- A <sitemapindex> lists other sitemaps: large sites split their URLs across child files (pages_sitemap.xml, blog_sitemap.xml, and so on) and point to them from one index. You walk the index, then walk each child.
This requests script handles both shapes with a single recursive function. It detects the root tag, recurses into a sitemap index, and collects page URLs from each <urlset>:
python
import requests
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
HEADERS = {"User-Agent": "Mozilla/5.0 (sitemap-discovery)"}


def walk_sitemap(url, seen=None):
    """Return every page URL reachable from a sitemap or sitemap index."""
    seen = seen if seen is not None else set()
    if url in seen:  # guard against a sitemap that references itself
        return []
    seen.add(url)

    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    tag = root.tag.split("}")[-1]  # strip the namespace, keep "sitemapindex" or "urlset"

    urls = []
    if tag == "sitemapindex":
        # An index points to child sitemaps, so recurse into each one.
        for sm in root.findall(f"{SM_NS}sitemap"):
            loc = sm.findtext(f"{SM_NS}loc")
            if loc:
                urls.extend(walk_sitemap(loc.strip(), seen))
    else:
        # A urlset lists page URLs directly.
        for u in root.findall(f"{SM_NS}url"):
            loc = u.findtext(f"{SM_NS}loc")
            if loc:
                urls.append(loc.strip())
    return urls


if __name__ == "__main__":
    pages = walk_sitemap("https://example.com/sitemap.xml")
    print(f"Discovered {len(pages):,} URLs from the sitemap tree")
    for u in pages[:10]:
        print("  ", u)
When a site's /sitemap.xml is a sitemap index pointing at child sitemaps, the recursive walk returns the page URLs of every child sitemap in a single pass.
A few notes on sitemaps in the wild:
- The path is a convention, not a guarantee. /sitemap.xml is the common location, but a site can name it anything and declare the real path in robots.txt (Method 3). Always check robots.txt before assuming the file doesn't exist.
- Compressed sitemaps exist. Some sites serve sitemap.xml.gz; requests does not auto-decompress a .gz body, so decompress it with the gzip module before parsing if you hit one.
- Sitemaps go stale. They reflect what the CMS registered at generation time. Pages added since the last build, and orphan pages the CMS never registered, will be missing, which is exactly why Methods 5 and 6 exist.
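The compressed-sitemap note above can be sketched as a small helper. This is a minimal illustration, not part of the original script: the function name sitemap_locs is hypothetical, and it assumes a gzipped body can be recognized by the two-byte gzip magic number rather than by the URL suffix.

```python
import gzip
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sitemap_locs(raw: bytes) -> list[str]:
    """Return every <loc> value from a sitemap body, gzipped or plain.

    Hypothetical helper: pass it resp.content from a sitemap request.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    # iter() finds sitemap-namespace <loc> elements under both
    # <urlset> and <sitemapindex> roots.
    return [el.text.strip() for el in root.iter(f"{SM_NS}loc") if el.text]
```

Checking the magic bytes instead of the .gz extension also covers servers that deliver a compressed file from a path without the suffix.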
Method 3 β Read robots.txt Sitemap directives
Before crawling anything, fetch /robots.txt. It serves two purposes for URL discovery: it often declares the sitemap location(s) with one or more Sitemap: lines, and it tells you which paths the site asks crawlers to leave alone (Disallow:). Both matter: the first feeds Method 2, the second is a compliance obligation you carry into Method 5.
python
import requests
from urllib.parse import urljoin

HEADERS = {"User-Agent": "Mozilla/5.0 (sitemap-discovery)"}


def sitemaps_from_robots(base_url):
    """Extract every Sitemap: directive declared in robots.txt."""
    resp = requests.get(urljoin(base_url, "/robots.txt"), headers=HEADERS, timeout=30)
    resp.raise_for_status()
    sitemaps = []
    for line in resp.text.splitlines():
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps


if __name__ == "__main__":
    for sm in sitemaps_from_robots("https://example.com"):
        print("Sitemap declared:", sm)
A site's robots.txt often declares one or more Sitemap: lines, for example Sitemap: https://example.com/sitemap.xml. Feed those straight into walk_sitemap from Method 2 and you have the publisher's complete declared URL set without guessing the path.
The combined pattern is the backbone of static discovery: read robots.txt to find the sitemap(s) and the disallowed paths, then walk every declared sitemap. Whatever those two return is your authoritative starting inventory. Everything after this is about finding the pages they miss.
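As a rough sketch of that combined pattern, one robots.txt body can yield both the sitemap list and a rule checker without a second fetch. The function name parse_robots is hypothetical; it assumes the robots.txt text has already been downloaded, and it relies on the standard library's RobotFileParser.parse to evaluate the rules.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str, base_url: str):
    """Return (sitemap_urls, rules) from one robots.txt body.

    Hypothetical helper: `text` is the already-fetched robots.txt content.
    """
    lines = text.splitlines()
    sitemaps = []
    for line in lines:
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    rules = RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    rules.parse(lines)  # evaluates Allow/Disallow from the same text
    return sitemaps, rules
```

The returned sitemap URLs can go to walk_sitemap, and rules.can_fetch(user_agent, url) can gate every URL before it is requested.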
Method 4: SEO crawlers (the no-code option)
If you'd rather not write Python, a desktop or cloud SEO crawler does discovery, link-graph mapping, and reporting in one tool. The common ones (Screaming Frog SEO Spider, Sitebulb, and the site-audit crawlers built into Ahrefs and Semrush) all do the same core job: seed a start URL, follow internal links breadth-first, and produce a table of every URL found along with status code, title, depth, and inbound link count.
These tools are the right call when:
- You want a visual report and a CSV export without maintaining code.
- You need the standard SEO columns (status, canonical, indexability, redirect chains) computed for you.
- The site is mostly server-rendered HTML, which the desktop crawlers handle natively.
Their limits are worth knowing: free tiers cap the URL count (Screaming Frog's free edition stops at 500 URLs), JavaScript rendering is an optional, slower mode that not every tier enables, and the cloud audit tools price by project size. For a one-off audit of a small site they're hard to beat; for a repeatable pipeline that feeds another system, the programmatic methods below give you the URLs as data rather than a report. The next two methods are that programmatic path.
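To make the depth and inbound-link columns concrete, here is a minimal sketch of how such values could be derived from a link graph. It is an illustration only, not how any of the named tools is implemented; the function name depth_and_inlinks and the in-memory {url: [linked urls]} input are assumptions.

```python
from collections import deque


def depth_and_inlinks(start, links):
    """BFS over a {url: [linked urls]} map.

    Returns {url: (click depth from start, inbound link count)}.
    Repeated links from the same page are each counted.
    """
    depth = {start: 0}
    inbound = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            inbound[target] = inbound.get(target, 0) + 1
            if target not in depth:  # first visit fixes the shortest depth
                depth[target] = depth[url] + 1
                queue.append(target)
    return {u: (depth[u], inbound[u]) for u in depth}
```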
Method 5: A Python BFS internal-link crawler
When the sitemap is stale or absent, you discover pages the way a search engine does: start at the home page, parse out every internal <a href>, queue the ones you haven't seen, and repeat breadth-first until the frontier is empty or you hit a page cap. This finds orphan and deep-linked pages that no sitemap declares.
Two responsibilities are non-negotiable in a link crawler, and both are built into the code below:
- Honor robots.txt. Check can_fetch for every URL before requesting it, and skip anything the site disallows. The standard-library urllib.robotparser reads and evaluates the rules for you.
- Stay on-host and de-duplicate. Only queue links whose host matches the start host, strip URL fragments so /page and /page#section count once, and keep a seen set so a cycle in the link graph doesn't loop forever.
python
import requests
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.robotparser import RobotFileParser

HEADERS = {"User-Agent": "Mozilla/5.0 (link-discovery)"}


class LinkParser(HTMLParser):
    """Collect every href from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def load_robots(base_url):
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()  # parses Disallow rules and any crawl-delay
    return rp


def crawl(start_url, max_pages=200, user_agent="*"):
    """Breadth-first walk of internal links, honoring robots.txt."""
    host = urlparse(start_url).netloc
    robots = load_robots(start_url)
    seen = {start_url}
    queue = deque([start_url])
    found, skipped = set(), []

    while queue and len(found) < max_pages:
        url = queue.popleft()

        # Compliance gate: never fetch a path the site disallows.
        if not robots.can_fetch(user_agent, url):
            skipped.append(url)
            continue

        try:
            resp = requests.get(url, headers=HEADERS, timeout=30)
            resp.raise_for_status()
        except requests.RequestException:
            # A single bad URL is dropped from this pass so the crawl keeps going.
            continue

        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue

        found.add(url)
        parser = LinkParser()
        parser.feed(resp.text)
        for href in parser.links:
            absolute = urldefrag(urljoin(url, href))[0]  # resolve + drop #fragment
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return found, skipped


if __name__ == "__main__":
    pages, disallowed = crawl("https://example.com/", max_pages=200)
    print(f"Discovered {len(pages):,} pages; skipped {len(disallowed)} disallowed by robots.txt")
A live run of this crawler against a static demo catalog discovered 40 pages with the cap set to 40 and zero URLs skipped, because that site's robots.txt disallows nothing. Pointed at a site whose robots.txt disallows a path, the same crawler correctly declined the disallowed URL and recorded it in the skipped list rather than fetching it: compliance enforced on every URL, not as an afterthought.
How this composes with the earlier methods:
- The crawler finds what the sitemap omits; the sitemap finds what the crawler can't reach by link. Run both and take the union for the most complete inventory.
- Failure stays out-of-band. A URL that errors is dropped from this pass and the crawl keeps going β one bad page never stalls the whole walk. Collect the dropped URLs separately for review.
- Cap concurrency for politeness. A single-threaded crawler like the one above is already gentle. If you parallelize, keep it to no more than 3 workers per host, and respect any Crawl-delay the robots.txt declares.
- The crawler is HTML-only. Links painted by JavaScript after load are invisible to requests. That gap is exactly what Method 6 closes.
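The Crawl-delay point can be sketched with the same RobotFileParser object the crawler already loads. This is a minimal illustration: the function name polite_delay is hypothetical, and the one-second fallback is an assumed default, not a value from the original crawler.

```python
from urllib.robotparser import RobotFileParser


def polite_delay(robots: RobotFileParser, user_agent: str = "*",
                 default: float = 1.0) -> float:
    """Seconds to wait between requests.

    Uses the Crawl-delay declared for this user agent, else `default`
    (an assumed fallback chosen for this sketch).
    """
    declared = robots.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(declared) if declared is not None else default
```

Inside the Method 5 loop, calling time.sleep(polite_delay(robots, user_agent)) before each requests.get would apply the pause.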
Method 6: Render JavaScript-heavy pages with Scrapeless Scraping Browser
A requests-based crawler reads whatever bytes the server sends. For a single-page app, an infinite-scroll catalog, or a React/Vue/Next.js navigation, those bytes are an app shell (<div id="root"></div> plus a script tag) and the internal links paint client-side once the bundle runs. Plain HTTP can't see them; a real browser can.
Scrapeless Scraping Browser renders the page in a cloud browser and exposes it over the Chrome DevTools Protocol. You connect with Playwright over a WebSocket endpoint, warm the homepage so the session carries cookies, navigate to the target, then collect the anchors from the hydrated DOM, the rendered-page analog of the link-parsing step in Method 5. US egress is pinned on the session so geo-gated sites serve their standard structure.
The connection is a single WebSocket URL built from your API key. This is the exact connection shape:
python
import os
from urllib.parse import urlencode
from playwright.sync_api import sync_playwright


def scraping_browser_url(proxy_country="US", session_ttl=240):
    params = urlencode({
        "token": os.environ["SCRAPELESS_API_KEY"],
        "sessionTTL": session_ttl,
        "proxyCountry": proxy_country,
    })
    return f"wss://browser.scrapeless.com/api/v2/browser?{params}"
With the endpoint in hand, render the target and harvest its internal links:
python
from urllib.parse import urljoin, urldefrag, urlparse


def discover_rendered_links(start_url, proxy_country="US"):
    """Render a JS-heavy page in the cloud browser and collect same-host links."""
    host = urlparse(start_url).netloc
    homepage = f"{urlparse(start_url).scheme}://{host}/"

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(scraping_browser_url(proxy_country))
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.pages[0] if context.pages else context.new_page()

        # Warm the homepage first so the session carries cookies, then go to the target.
        page.goto(homepage, wait_until="domcontentloaded", timeout=60_000)
        page.goto(start_url, wait_until="networkidle", timeout=60_000)

        # Read hrefs from the hydrated DOM, after client-side rendering has run.
        hrefs = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.getAttribute('href'))"
        )
        browser.close()

    links = set()
    for href in hrefs:
        if not href:
            continue
        absolute = urldefrag(urljoin(start_url, href))[0]
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return links


if __name__ == "__main__":
    found = discover_rendered_links("https://example.com/app/catalog")
    print(f"Discovered {len(found):,} client-side links after render")
    for u in sorted(found)[:10]:
        print("  ", u)
The pattern that makes this reliable:
- Warm the homepage, then navigate. The first goto to / lets the session pick up cookies and pass the site's first-load checks; the second goto lands on the page you actually want to enumerate. Going straight to a deep URL on a cold session is more likely to draw a challenge.
- wait_until="networkidle" on the target gives the client-side router time to mount its links before you read the DOM. For infinite-scroll pages, scroll to the bottom in a loop (page.mouse.wheel) until the link count stops growing, then collect.
- The rendering runs in Scrapeless's cloud, not on your machine. The local playwright install chromium is only the CDP client that speaks to the wss://browser.scrapeless.com/... endpoint.
- Feed the result back into the union. The rendered links join the sitemap set and the static-crawl set; de-dupe the combined collection by normalized URL for the final inventory.
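The infinite-scroll loop mentioned above might look like the following sketch. The function name scroll_until_stable, the 10,000-pixel scroll step, the one-second pause, and the 20-round cap are all assumptions chosen for illustration; page is expected to be a Playwright sync Page such as the one created in discover_rendered_links.

```python
def scroll_until_stable(page, selector="a[href]", max_rounds=20, pause_ms=1000):
    """Scroll down until the number of matching elements stops growing.

    Hypothetical helper; returns the final element count.
    """
    count = page.locator(selector).count()
    for _ in range(max_rounds):
        page.mouse.wheel(0, 10_000)      # scroll down by 10,000 px (assumed step)
        page.wait_for_timeout(pause_ms)  # give lazy-loaded items time to render
        new_count = page.locator(selector).count()
        if new_count == count:           # nothing new appeared, so stop
            break
        count = new_count
    return count
```

Called after the second page.goto and before reading the hrefs, it would let lazily loaded links mount first. The round cap keeps a truly endless feed from scrolling forever.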
To wire this into a full crawler, swap the requests.get body in Method 5 for discover_rendered_links on the hosts you know are JavaScript-heavy, and keep the cheap requests path for the server-rendered majority. That HTTP-first, browser-second split keeps cloud-browser usage to the pages that actually need it.
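When you don't know in advance which hosts are JavaScript-heavy, one possible heuristic is to count the anchors in the static response and escalate when there are almost none. The sketch below is an assumption-laden illustration: the names AnchorCounter and needs_rendering are hypothetical, and the threshold of three links is an arbitrary starting point to tune per site.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Count <a> tags that carry a non-empty href."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.count += 1


def needs_rendering(html: str, min_links: int = 3) -> bool:
    """Heuristic: a page with almost no anchors is probably an app shell.

    `min_links` is an assumed threshold, not a value from the guide.
    """
    counter = AnchorCounter()
    counter.feed(html)
    return counter.count < min_links
```

In the Method 5 loop, a True result on resp.text would be the signal to call discover_rendered_links for that URL instead of parsing the static HTML.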
What You Get Back
Each method emits a set of absolute URLs; the final inventory is the de-duplicated union across all of them. A merged record for one host looks like this:
json
// Schema reflects the union of the four programmatic methods.
// Counts are illustrative samples, not a frozen snapshot of any site today.
{
  "host": "example.com",
  "discovered_at_methods": ["sitemap", "robots", "bfs_crawl", "rendered"],
  "counts": {
    "from_sitemap": 226,
    "from_bfs_crawl": 40,
    "from_rendered": 18,
    "union_unique": 248
  },
  "sample_urls": [
    "https://example.com/",
    "https://example.com/blog",
    "https://example.com/blog/how-to-scrape-bbb-business-listings",
    "https://example.com/app/catalog?page=2"
  ],
  "skipped_by_robots": [
    "https://example.com/private/"
  ]
}
A few honest observations about full-site URL discovery, worth knowing before running at scale:
- The union beats any single source. Sitemaps declare what the publisher registered; the BFS crawl finds linked orphans; the rendered pass finds client-side links. Coverage is the union of all three minus the disallowed paths.
- De-dupe on normalized URLs. Strip fragments, decide whether trailing slashes and ?utm_* query params are significant for your use case, and normalize before counting; otherwise /page and /page/ inflate the total.
- Sitemaps lag content. A page published after the last sitemap build only shows up in the crawl tiers. If a complete inventory matters, always run a crawl alongside the sitemap read.
- Client-side links are invisible to HTTP. If a requests crawl of a known-large site returns only a handful of URLs, the navigation is almost certainly rendered client-side; escalate that host to Method 6.
- Respect the disallow list end to end. The skipped_by_robots array is not a TODO list. Those paths stay out of the inventory.
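The normalize-then-union step can be sketched as follows. The policy encoded here is one assumption among several reasonable ones: it drops fragments and utm_* parameters, lowercases the scheme and host, and treats /page and /page/ as the same URL. The names normalize and merge_inventories are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize(url: str, strip_trailing_slash: bool = True) -> str:
    """Normalize a URL for de-duplication (policy choices are assumptions)."""
    parts = urlsplit(url)
    # Drop utm_* tracking parameters; keep the rest in their original order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith("utm_")]
    path = parts.path or "/"
    if strip_trailing_slash and len(path) > 1:
        path = path.rstrip("/") or "/"
    # Lowercase scheme and host; the empty last field drops the #fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))


def merge_inventories(*sources):
    """Union several iterables of URLs on their normalized form."""
    return {normalize(u) for source in sources for u in source}
```

Passing the sitemap list, the crawl set, and the rendered set to merge_inventories gives the de-duplicated inventory. If trailing slashes are significant on the target site, set strip_trailing_slash=False.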
Conclusion: build a complete URL inventory
Finding every page reduces to four programmatic moves layered on top of one manual check: estimate the size with site:, read robots.txt for the sitemap location and the disallowed paths, walk the sitemap tree for the declared URLs, BFS-crawl the internal link graph for the orphans, and render the JavaScript-heavy hosts in a cloud browser for the client-side links. Take the union, de-dupe on normalized URLs, and that is the inventory.
For the proxy layer that routes the rendered tier, see "What Is an SSL Proxy?" The Scraping Browser product page and the pricing page cover the cloud-browser tier; the full SDK reference is at docs.scrapeless.com. Pin US egress on the rendered tier, warm the homepage before the target page, honor robots.txt on every URL, and treat the final list as the union of every method.
Ready to Build Your AI-Powered Data Pipeline?
Join our community to claim a free plan and connect with developers building URL-discovery and crawling pipelines: Discord · Telegram.
Sign up at app.scrapeless.com for free Scraping Browser runtime and adapt the patterns above to the sites, sitemaps, and regions the pipeline needs.
FAQ
Q: Is crawling a website to find all its pages legal?
Discovery itself reads publicly visible URLs, but legality depends on what you access, from where, and under what terms. Honor the site's robots.txt, review its terms of service, avoid private or authenticated areas, and consult counsel for high-stakes use cases. Scrapeless accesses publicly available data only.
Q: Sitemap or crawl: which gives the complete list?
Neither alone. A sitemap is the publisher's declaration and is often stale or partial; a crawl finds the linked graph but misses pages no page links to. The complete inventory is the union of the sitemap walk (Method 2) and the BFS crawl (Method 5), with the rendered tier (Method 6) added for client-side links.
Q: Why does my crawler find only a handful of pages on a large site?
The site almost certainly renders its navigation client-side. A plain requests fetch returns the app shell before JavaScript runs, so there are no links to follow. Render those hosts with Scrapeless Scraping Browser (Method 6) and collect the anchors from the hydrated DOM instead.
Q: Do I need a proxy for URL discovery?
For a single-host sitemap read and a polite static crawl, often not. A proxy earns its place when the site geo-gates content (you need US egress to see the US structure), when your IP is rate-limited, or when the rendered tier needs residential egress to match organic traffic. The Scrapeless connection in Method 6 pins egress with proxy_country="US".
Q: How do I get a clean render when a page serves an access challenge?
Pin US residential egress on the session and warm the session: navigate to the site's homepage first in the same browser session before the target page, as the Method 6 code does, so cookies are set and the deep page loads in an already-trusted context. A cold jump straight to a deep URL is more likely to draw a challenge.
Q: What happens when the site changes its HTML or link structure?
The static crawler keys on the generic <a href> element, so markup changes rarely break it. If you've tightened the rendered-tier selector to a specific container, re-check it when the markup shifts and widen it back to a[href] if the site reorganizes its navigation.
Q: How do I avoid hammering the site while crawling?
The single-threaded crawler in Method 5 is already gentle. If you parallelize, keep concurrency to no more than 3 workers per host, honor any Crawl-delay directive in robots.txt, and never queue a path the disallow rules cover.
Q: How do I de-duplicate the final URL set?
Normalize before counting: strip #fragments, decide whether a trailing slash and tracking query params (?utm_*) are significant for your use case, and store URLs in a set keyed on the normalized form. Each method already returns a set; take the union of all of them and the duplicates collapse.
Q: Can I discover URLs without an AI agent or any SDK?
Yes. Methods 1 through 5 need nothing beyond the Python standard library and requests (Methods 1 and 4 need no code at all). Method 6 adds Playwright connecting to the Scrapeless Scraping Browser endpoint over CDP; no agent framework is required, just the WebSocket URL built from your API key.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



