How to Scrape Home Depot Product Data With an Agent Browser
Expert in Web Scraping Technologies
Key Takeaways:
- One PDP fetch, full product schema. Home Depot's PDP HTML embeds a JSON-LD `Product` block server-side: name, brand, sku, model, gtin, image, description, offers (price, currency, availability), aggregate rating, and the top 10 reviews. That's the fast path: no hydration wait, no React shell to wrestle.
- Hydrated fields fill in the rest. Sale flags, fulfillment options (ship / pickup / delivery), per-store availability text, and the on-PDP review carousel arrive after JavaScript runs. Scrapeless Scraping Browser, the agent-ready cloud browser this guide uses, renders them in a real cloud browser, so a single CLI call returns the populated DOM.
- Per-store stock is the differentiator. Setting a Home Depot store via the location-selector modal returns store-specific availability and pickup ETAs alongside the standard PDP fields, a signal not exposed through generic search APIs.
- Search and category at scale. `/s/<query>` and `/b/<category>` listing pages paginate via the `Nao` URL offset. The same Discover → Extract pattern that powers PDP scraping returns paginated cards with productId, title, brand, price, rating, reviewCount, image, and canonical PDP URL.
- Clean review schema. The per-review extractor emits a flat, semantic shape (`id`, `title`, `text`, `rating`, `badges`, `reviewer.name`, `time`, `original_source.name`, `images[].link`, `total_positive_feedback`, `total_negative_feedback`) that maps directly to typical review-intelligence pipelines without renaming.
- US proxy egress is required, 100%. Home Depot is a US-only retail site. `--proxy-country US` is mandatory on every session.
- Use canonical product URLs. The reliable PDP target is `/p/<slug>/<productId>` (with the slug); bare ID-only URLs like `/p/<productId>` returned Home Depot's generic error page in verification. The reviews-page target is `/p/reviews/<slug>/<productId>/<page>`.
- Discover → Extract. Read the live DOM with `get html` first against semantic anchors (`data-testid`, `aria-label`, `[itemprop]`, semantic ids), then write `eval` selectors against what's actually rendered. Class names rotate; semantic anchors don't.
Introduction: scraping Home Depot product data with an agent browser
Home Depot is the largest US home-improvement retailer by revenue, with a public catalog spanning appliances, tools, building materials, and services. For competitive-pricing teams, MAP-compliance monitors, brand owners selling through Home Depot, inventory pipelines, and review-driven product researchers, the PDP, search/category, and per-store inventory surfaces are the data that drives the pipeline.
The catalog is hydrated client-side. A typical product detail page arrives as a thin HTML shell with a server-rendered JSON-LD Product block, and React fills the rest in once the bundle loads: sale price, current promotion, fulfillment options, store availability, the rating histogram, and the on-PDP review carousel are all populated after rendering. Pure-HTTP scraping returns the shell; the data lives one render cycle later.
This post walks through a terminal-first workflow on top of Scrapeless Scraping Browser, an agent-ready cloud browser that handles JavaScript rendering, residential-proxy egress, and session-bound state for per-store stock checks. Steps 1-8 below cover the full PDP extraction (JSON-LD fast path + hydrated fields), search/category pagination, the location-selector flow that unlocks store-specific availability, and the review pipeline (top-10 from JSON-LD plus rendered-DOM pagination, sort, and filter).
Every step runs through the scrapeless-scraping-browser CLI documented in skill-dev/SKILL.md. The emphasis is the agent-browser experience: describe the product-data shape you need in natural language, and let the skill drive the CLI under the hood.
For the same Discover → Extract pattern on another retailer, see the Amazon scraper post.
What You Can Do With It
- Competitive pricing. Track price, sale flag, and promotion text across competing SKUs on a rolling cadence; alert on MAP violations.
- MAP compliance monitoring. Brand owners track third-party seller pricing across the catalog and flag below-MAP listings for enforcement.
- Inventory & fulfillment intelligence. Per-store stock and pickup-ETA signals via the location-selector flow expose regional availability that generic search APIs do not.
- Catalog ingestion. Search and category listings feed downstream pipelines with a normalized product schema (productId, title, brand, price, rating, reviewCount, image, canonical URL).
- Review intelligence. Sentiment analysis, complaint clustering, verified-purchaser ratio tracking, photo-evidence collection: the flat per-review schema slots into existing review pipelines.
- Brand monitoring. Track first-party and competing-brand reviews across categories to detect launch-time review bombing or sustained quality regressions.
- Product development. Turn repeated negative themes from reviews into roadmap inputs, support docs, replacement-part changes, or listing-page improvements.
Why Scrapeless Scraping Browser
Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI agents. For Home Depot specifically, it brings:
- US residential proxies via `--proxy-country US`, which is required for Home Depot.
- Cloud-side JavaScript rendering, so price, fulfillment, store availability, the review carousel, sort dropdowns, photo filters, and the pagination bar arrive populated rather than as the React shell.
- Session persistence via `--session-id` across the snapshot → click → fill choreography for the store-locator, sort/filter, and pagination flows, so cookies and applied state stay consistent across follow-up commands within one chained shell invocation.
- Anti-detection fingerprinting on every session, so PDPs and listing pages render identically to organic traffic.
- A single CLI surface: every operation needed for the discover, extract, and pagination steps (`open`, `wait`, `snapshot`, `eval`, `get`, `click`, `fill`, `cookies`) is one CLI call away.
Get your API key on the free plan by signing up on Scrapeless and join our official community. The full CLI surface is documented in skill-dev/SKILL.md; the Proxy Solutions page covers the residential-proxy plan that backs the cloud browser.
Scrapeless Official Discord Community
Scrapeless Official Telegram Community
Prerequisites
- Node.js 18 or newer.
- A Scrapeless account and API key β sign up at app.scrapeless.com.
- `jq` (optional, for JSON parsing in shell scripts; a portable `grep` fallback is shown below).
- Basic familiarity with the terminal.
Install
The recipes below run on the scrapeless-scraping-browser CLI. Setup is three steps plus a smoke test (#4): both CLI users and AI-agent users need #1 and #2; AI-agent users do #3 too.
1. Install the CLI package
```bash
npm install -g scrapeless-scraping-browser
```
This provides the scrapeless-scraping-browser binary that every step of this post calls. The skill does not bring its own runtime: it loads command patterns into your AI agent, but the CLI itself must be installed first.
2. Configure your API key
Get your token from app.scrapeless.com, then store it where the CLI can read it:
```bash
scrapeless-scraping-browser config set apiKey your_api_token_here
scrapeless-scraping-browser config get apiKey # verify
```
Using an AI agent? The skill's instructions tell the agent that authentication is required before any session call. If the API key isn't set when the agent first reaches for the CLI, the agent will prompt you and run the `config set apiKey …` command for you.
The config file lives at ~/.scrapeless/config.json with access restricted to the current user, takes priority over the environment variable, and is portable across agents and CI runners. For CI pipelines, prefer:
```bash
export SCRAPELESS_API_KEY=your_api_token_here
```
3. Install the Scrapeless skill in your AI agent
This is a separate step from step 1. Step 1 installed the CLI binary, the runtime your agent invokes. The skill is what teaches your agent how to invoke it correctly (selectors, waits, retry patterns, the discover → extract workflow). They're two different things, and you need both.
The skill is a folder containing SKILL.md + skill.json + references/. The canonical source is skills/scraping-browser-skill in the scrapeless-ai/scrapeless-agent-browser repo on GitHub.
To install it in Claude Code, Cursor, VS Code + GitHub Copilot, OpenAI Codex CLI, or Gemini CLI, follow the Scrapeless AI Agent install guide; it has the per-agent copy-paste commands (bash and Windows PowerShell). Reload your agent after install so the skill becomes active.
What the skill loads into your agent's operating context up-front:
- Authentication: check for `~/.scrapeless/config.json` or `SCRAPELESS_API_KEY` and prompt you to set it if missing.
- Discover → Extract workflow: read the live DOM with `get html "<region>"` first, identify stable anchors (`data-testid`, `aria-label`, semantic ids, `[itemprop='review']`), then write `eval` selectors based on what's actually rendered.
- Wait gotchas for Home Depot: `wait 1500` between `open` and any selector wait to dodge the cold-session `chrome://new-tab-page/` race; `wait '<review-anchor>'` against a review-card element instead of blanket `networkidle`, because Home Depot keeps firing lazy beacons that never settle.
- Selector syntax: when to use CSS selectors vs accessibility refs (`@e1` from `snapshot -i`).
- Parallel CLI workers: single-shell `&&` chaining, unique session names, ≤3 concurrent workers per host.
- Common pitfalls: `eval` returns JSON-quoted values, `open` exits non-zero on successful navigation, sessions terminate when the connection closes.
- Full command reference: every flag for `new-session`, `open`, `wait`, `eval`, `get`, `click`, `fill`, `snapshot`, `cookies`, `recording`, `stop`.
4. Verify the skill is wired up
Before the first real Home Depot scrape, smoke-test the install with one safe prompt:
"Using the Scrapeless skill, open https://example.com and tell me the page title."
The agent should mint a Scrapeless session, navigate, and reply with "Example Domain". If those two words come back, the skill is loaded, the API key is set, and the cloud browser is reachable.
If it fails:
| Symptom | Likely cause | Fix |
|---|---|---|
| "I don't have a tool/skill to do that" | Skill not loaded in this agent session | Reinstall via the skill install guide and reload the agent |
| `Authentication failed` / 401 | API key not set | Re-run `scrapeless-scraping-browser config set apiKey <token>` (Install step 2) |
| `command not found` | CLI binary missing on PATH | Re-run Install step 1 |
| `ERR_TUNNEL_CONNECTION_FAILED` on Home Depot | Proxy pool had no available residential IP at allocation time | Mint a fresh session, keep `--proxy-country US`, and retry shortly |
| Hangs / lands on `chrome://new-tab-page/` | Cold-session wait race | Ask the agent to retry; the skill knows to insert `wait 1500` between `open` and the next `wait` |
| Home Depot returns the generic error page | Non-US egress, ID-only URL, or transient session-fingerprint flag | Confirm `--proxy-country US`, use a canonical `/p/<slug>/<productId>` URL, mint a fresh session, retry |
| `Scrapeless session has been terminated and cannot be reconnected` on every `new-session` call | The local daemon cached a now-terminated session id and keeps trying to reconnect to it | Kill the daemon and clear its pidfile, then mint a fresh session: `Stop-Process -Id (Get-Content "$env:USERPROFILE\.scrapeless-scraping-browser\default.pid") -Force; Remove-Item "$env:USERPROFILE\.scrapeless-scraping-browser\default.pid","$env:USERPROFILE\.scrapeless-scraping-browser\default.port" -Force` (PowerShell on Windows). On Linux/macOS the path is `~/.scrapeless-scraping-browser/` |
How you actually use this: prompt your agent
After install, you scrape Home Depot product data by talking to your agent, not by copy-pasting bash. The skill loads selectors, waits, retry classifiers, and the discover → extract pattern into the agent's context, so a one-line natural-language prompt is enough to get structured JSON back.
Prompts you can paste
| You say to your agent | What you get back |
|---|---|
| "Get the full product schema for Home Depot product 204279858 (price, brand, model, image, availability)." | Step 2 JSON-LD product schema + Step 3 hydrated price/fulfillment payload |
| "Track the price + sale flag for this Home Depot PDP." | price, wasPrice, onSale, promotion, currency |
| "Search Home Depot for 'cordless drill' and return the first 5 pages of results." | Paginated listing cards: productId, title, brand, price, rating, reviewCount, image, productUrl |
| "Check stock for product 204279858 at ZIP 90015." | store, availabilityText, pickupEta, stockCount (per-store payload) |
| "For these 20 product IDs, get price + per-store availability at ZIP 33101." | One product-data JSON per ID with the store-set fulfillment payload |
| "Resolve Home Depot product ID 326716329 to its canonical reviews URL, then return review JSON." | Canonical URL plus product-level summary and review array |
"Get the first 100 Home Depot reviews from this /p/reviews/... URL, newest first." |
Paginated reviews with page and per-page index metadata |
| "Get only the photo reviews for this Home Depot product." | Reviews filtered by images.length > 0 |
| "For product 204279858 list verified-purchaser reviews only." | Reviews filtered by the verified-purchaser badge |
| "Get newest reviews then sort by helpful count, return top 30." | Reviews after the UI sort flow, ranked by helpful votes |
| "Open the product page, set store ZIP 90015, then scrape store-context reviews." | Reviews collected from a store-set browser session (see Step 5) |
Worked example: get reviews for product 204279858 (DEWALT drill)
You type:
"Get title, text, rating, and reviewer name for the top reviews of Home Depot product 204279858. Return JSON."
The agent's plan (in plain English):
- Resolve `204279858` to the canonical PDP URL `/p/<slug>/204279858`.
- Mint a US-egress session (`--proxy-country US`); retry once on transient `os error 10054` / 503.
- Open the PDP URL, `wait 1500`, then `wait 'h1'` to dodge the cold-session race.
- `eval` against `script[type="application/ld+json"]` to parse the embedded `Product` schema and return the top 10 reviews + aggregate.
- If more than 10 reviews are needed, navigate to `/p/reviews/<slug>/204279858/<page>` and run the rendered-DOM extractor (Step 6) per page.
What you get back (illustrative output, body excerpts trimmed for length):
```json
{
"productId": "204279858",
"productUrl": "https://www.homedepot.com/p/DEWALT-20V-MAX-Cordless-1-2-in-Drill-Driver-2-20V-1-3Ah-Batteries-Charger-and-Bag-DCD771C2/204279858",
"productName": "DEWALT 20V MAX Cordless 1/2 in. Drill/Driver, (2) 20V 1.3Ah Batteries, Charger and Bag DCD771C2",
"brand": "DEWALT",
"sku": "1000014677",
"modelNumber": "DCD771C2",
"overallRating": 4.7,
"totalReviews": 11168,
"reviewsReturned": 10,
"reviews": [
{
"title": "Great tool!",
"text": "I've been using this for two weeks now, and they are worth every penny. The quality is exceptionally crisp...",
"rating": 5,
"reviewer": { "name": "kevein" },
"time": null,
"original_source": { "name": "homedepot.com" }
},
{
"title": null,
"text": "I purchased the Dewalt 20v max Drill/Driver to replace an ancient Craftsman cordless drill...",
"rating": 5,
"reviewer": { "name": "DIYer_Bill" },
"time": null,
"original_source": { "name": "homedepot.com" }
}
// ... 8 more reviews with the same shape
]
}
```
`time` comes back null because Home Depot does not include `datePublished` in the JSON-LD review objects; the timestamps are present on the rendered review-page DOM, so retrieve them via the Step 6 path if review timestamps are required.
That's the entire user-facing surface for this scrape. The bash, selectors, and waits in Steps 1-8 below are what the skill makes the agent run; you don't have to type any of them.
Shaping prompts: how to control what comes back
| Phrasing | Effect |
|---|---|
| "β¦return JSON" / "β¦as CSV" | Output format |
| "β¦fields: title, text, rating, time only" | Restricts the fields the agent extracts |
| "β¦top 25" / "β¦across pages 1β10" | Pagination depth |
| "β¦newest first" / "β¦lowest rating first" | Triggers the UI sort flow |
| "β¦photo reviews only" | Triggers the photo-review filter |
| "β¦verified purchasers only" | Filters extracted reviews by badge |
| "β¦save to reviews.jsonl" | Writes one review per line to file |
| "β¦then summarize the top 5 complaints" | Chains a post-extraction analysis pass |
| "β¦set store ZIP 90015 first" | Triggers the store-locator flow before scraping |
That's the workflow. Steps 1-8 below are the under-the-hood reference; read them once to see how the discover → extract pattern composes, then trust your agent to apply it.
Step 1 - Connect to Scrapeless Scraping Browser
Create a US-egress session.
```bash
SESSION=$(scrapeless-scraping-browser new-session \
--name "homedepot-product-data" \
--ttl 1800 \
--recording true \
--proxy-country US \
--json | jq -r '.data.taskId')
echo "Session: $SESSION"
```
Portable fallback without jq:
```bash
SESSION=$(scrapeless-scraping-browser new-session \
--name "homedepot-product-data" --ttl 1800 --recording true \
--proxy-country US --json \
| grep -oE '"taskId":"[^"]*"' | cut -d'"' -f4)
```
Residential-proxy allocations occasionally return a transient 503 on the first attempt; retry once. If `ERR_TUNNEL_CONNECTION_FAILED` appears, the proxy pool has no available IP at allocation time; mint a fresh session and retry shortly.
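Here's a minimal retry sketch for that transient-failure case. It assumes a failed allocation surfaces as a missing or empty `taskId` in the `--json` output; adjust the failure test to whatever your CLI version actually prints.

```bash
# Retry session allocation up to 3 times with a short backoff.
for attempt in 1 2 3; do
  SESSION=$(scrapeless-scraping-browser new-session \
    --name "homedepot-product-data" --ttl 1800 --recording true \
    --proxy-country US --json | jq -r '.data.taskId // empty')
  [ -n "$SESSION" ] && break
  echo "allocation attempt $attempt failed, retrying..." >&2
  sleep $((attempt * 5))
done
[ -n "$SESSION" ] || { echo "no session after 3 attempts" >&2; exit 1; }
```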
Step 2 - Fast path: extract product schema + top-10 reviews from JSON-LD
The Home Depot PDP HTML embeds a server-side JSON-LD `Product` block: name, brand, sku, model, gtin, image, description, offers (price, currency, availability), aggregate rating, and the top 10 reviews, all without waiting for hydration. Schema.org-conformant fields the product doesn't ship (e.g., availability, seller, priceValidUntil, mpn, color) come back null and should be treated as nullable.
Open the canonical PDP (with the slug; bare ID-only URLs return a generic error page) and `eval` the JSON-LD parser:
```bash
PRODUCT_ID="204279858"
PRODUCT_SLUG="DEWALT-20V-MAX-Cordless-1-2-in-Drill-Driver-2-20V-1-3Ah-Batteries-Charger-and-Bag-DCD771C2"
PDP_URL="https://www.homedepot.com/p/$PRODUCT_SLUG/$PRODUCT_ID"
scrapeless-scraping-browser --session-id $SESSION open "$PDP_URL"
# Brief pause so the next wait doesn't resolve on the pre-warm
# chrome://new-tab-page/ before navigation commits.
scrapeless-scraping-browser --session-id $SESSION wait 1500
# Wait for the H1 product title: the JSON-LD blocks are server-rendered into
# the same static HTML and are guaranteed present once the H1 is visible.
# (Don't wait on `script[type="application/ld+json"]` directly; `wait` defaults
# to the "visible" state, and `<script>` elements are never visible.)
scrapeless-scraping-browser --session-id $SESSION wait 'h1'
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const ld = [...document.querySelectorAll("script[type=\"application/ld+json\"]")]
.map((s) => { try { return JSON.parse(s.textContent); } catch { return null; } })
.filter(Boolean);
const product = ld.find((o) => o["@type"] === "Product");
if (!product) return JSON.stringify({ error: "no Product JSON-LD on this page" });
const productId =
product.productID || product.sku ||
location.pathname.match(/\/(\d+)(?:\/\d+)?(?:[/?#]|$)/)?.[1] || null;
const offers = Array.isArray(product.offers) ? product.offers[0] : product.offers || null;
const offerPrice = offers
? (offers.price ?? offers.priceSpecification?.price ?? offers.lowPrice ?? null)
: null;
const availability = offers?.availability
? String(offers.availability).replace(/^https?:\/\/schema\.org\//, "")
: null;
const images = []
.concat(product.image || [])
.flat()
.filter(Boolean);
const reviews = (Array.isArray(product.review) ? product.review : product.review ? [product.review] : [])
.map((r) => ({
title: r.headline || null,
text: r.reviewBody || null,
rating: Number(r.reviewRating?.ratingValue) || null,
reviewer: { name: r.author?.name || null },
time: r.datePublished || null,
original_source: { name: "homedepot.com" },
}));
return JSON.stringify({
productId,
productUrl: location.href,
productName: product.name,
brand: product.brand?.name || product.brand || null,
sku: product.sku || null,
modelNumber: product.model || null,
gtin: product.gtin13 || product.gtin14 || product.gtin12 || product.gtin8 || product.gtin || null,
mpn: product.mpn || null,
category: product.category || null,
description: product.description || null,
color: product.color || null,
image: images[0] || null,
images,
offers: {
price: offerPrice != null ? Number(offerPrice) : null,
currency: offers?.priceCurrency || null,
availability,
priceValidUntil: offers?.priceValidUntil || null,
itemCondition: offers?.itemCondition
? String(offers.itemCondition).replace(/^https?:\/\/schema\.org\//, "")
: null,
seller: offers?.seller?.name || null,
},
overallRating: Number(product.aggregateRating?.ratingValue) || null,
totalReviews: Number(product.aggregateRating?.reviewCount) || null,
reviewsReturned: reviews.length,
reviews,
});
})()
'
```
This single PDP fetch returns the full server-rendered product schema: productId, name, brand, sku, model, gtin, image, description, offers (price, currency, availability), and aggregate rating, plus the top 10 reviews in a flat, semantic shape (title, text, rating, reviewer.name, time, original_source.name). It returns reliably without depending on the React shell to hydrate. The richer per-review fields (id, badges, images[].link, helpful-vote counts) come from the rendered-DOM extractor in Step 6.
Some fields are conditional. JSON-LD ships what schema.org marks the product with: gtin, mpn, color, priceValidUntil, seller are present on most catalog items but not all (treat them as nullable). Sale-flag detection, current promotion text, and per-store availability live in the rendered DOM rather than JSON-LD; Step 3 covers that surface. The rating histogram per star (5★/4★/3★/2★/1★ counts) is rendered inline on the reviews page covered in Step 6.
Step 3 - PDP hydrated fields (price, fulfillment, image, specs)
The Step 2 JSON-LD pass returns the static product schema. Sale-price overrides, current promotion text, fulfillment options (ship to home, store pickup, scheduled delivery), the in-stock badge, the hero-image gallery, and the spec/feature tables all live in the rendered DOM and arrive after React hydrates. Drive the discover → extract flow on the same PDP, against semantic anchors that survive class-name rotation.
Run a `get html` discover pass against the PDP's price region first, confirm the active anchor names, then tighten the `eval` selectors to whatever the page actually ships. Class names rotate; `data-testid` and `aria-label` patterns are the stable surface.
```bash
# The session is still on the PDP from Step 2. Wait for the price region to render,
# then peek at the surrounding HTML to confirm anchors.
scrapeless-scraping-browser --session-id $SESSION wait '[data-testid*="price" i], [class*="price" i], [data-component*="Price"]'
scrapeless-scraping-browser --session-id $SESSION get html '[data-testid*="price" i], [data-testid*="fulfillment" i]'
```
From the discovered HTML, the longest-lived anchors are [data-testid*="price"], [data-testid*="fulfillment"], [data-testid*="availability"], and aria-labels on fulfillment buttons (aria-label*="Ship to Home", aria-label*="Store Pickup", aria-label*="Scheduled Delivery"). Then eval the extractor:
```bash
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const text = (el) => (el ? el.textContent.replace(/\s+/g, " ").trim() : null);
const number = (s) => {
if (!s) return null;
const m = String(s).replace(/,/g, "").match(/-?\d+(?:\.\d+)?/);
return m ? Number(m[0]) : null;
};
const priceText =
text(document.querySelector("[data-testid*=\"price\" i]")) ||
text(document.querySelector("[class*=\"price\" i]"));
const wasPriceText = text(document.querySelector(
"[data-testid*=\"was-price\" i], [class*=\"was-price\" i], [data-testid*=\"strike\" i]"
));
const onSale = !!wasPriceText && priceText !== wasPriceText;
const promotion = text(document.querySelector("[data-testid*=\"promo\" i], [data-testid*=\"saving\" i]"));
const fulfillment = [...document.querySelectorAll(
"[data-testid*=\"fulfillment\" i] button, [aria-label*=\"Ship\" i], [aria-label*=\"Pickup\" i], [aria-label*=\"Delivery\" i]"
)].map((btn) => {
const label = btn.getAttribute("aria-label") || text(btn);
return label ? { option: label, available: !btn.matches("[disabled], [aria-disabled=\"true\"]") } : null;
}).filter(Boolean);
const availabilityText = text(document.querySelector(
"[data-testid*=\"availability\" i], [data-testid*=\"in-stock\" i]"
));
const heroImg = document.querySelector(
"img[data-testid*=\"main-image\" i], [data-testid*=\"image-gallery\" i] img, picture img"
);
const features = [...document.querySelectorAll(
"[data-testid*=\"feature\" i] li, [class*=\"feature\" i] li, [data-testid*=\"highlights\" i] li"
)].map(text).filter(Boolean).slice(0, 20);
const specs = Object.fromEntries(
[...document.querySelectorAll("[data-testid*=\"specs\" i] tr, [class*=\"specs\" i] tr")]
.map((row) => {
const cells = [...row.querySelectorAll("th, td")].map(text);
return cells.length >= 2 ? [cells[0], cells.slice(1).join(" ")] : null;
})
.filter(Boolean)
);
return JSON.stringify({
productUrl: location.href,
price: number(priceText),
priceText,
wasPrice: number(wasPriceText),
onSale,
promotion,
currency: "USD",
fulfillment,
availabilityText,
image: heroImg?.currentSrc || heroImg?.src || null,
features,
specs,
});
})()
'
```
For most pricing-intelligence pipelines, run both Step 2 and Step 3 against the same session: Step 2 captures the canonical JSON-LD schema (productId, brand, sku, model, gtin, image, offers), Step 3 captures the hydrated overrides (sale flag, promotion, fulfillment options, in-stock badge, specs/features). The two payloads merge cleanly on productUrl.
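The merge itself is one jq pass, assuming each step's eval output was unwrapped to plain JSON and saved to the file names used here (both names are illustrative):

```bash
# Recursive merge; keys from the hydrated Step 3 payload win on conflict.
jq -s '.[0] * .[1]' step2-jsonld.json step3-hydrated.json > product.json
```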
Step 4 - Search and category listings
Home Depot exposes search at /s/<query> and category landing pages at /b/<category-slug>. Both paginate through the Nao URL offset (?Nao=24, ?Nao=48, …). Each page surfaces ~24 product cards.
```bash
QUERY="cordless drill"
# Encode spaces in the query (Home Depot uses standard URL encoding)
SEARCH_URL="https://www.homedepot.com/s/${QUERY// /%20}"
scrapeless-scraping-browser --session-id $SESSION open "$SEARCH_URL"
# Search pages have a two-phase render: a React placeholder skeleton with a
# single card mounts first, then the real ~24-card grid populates. A bare
# `wait '[data-testid*="product-pod" i]'` can resolve on the placeholder, so
# the extractor runs against an empty grid. Wait for the real grid
# (≥ 5 cards) instead of any single match.
scrapeless-scraping-browser --session-id $SESSION wait 3000
scrapeless-scraping-browser --session-id $SESSION eval \
'document.querySelectorAll("[data-testid*=\"product-pod\" i], [data-pod-position]").length >= 5'
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const text = (el) => (el ? el.textContent.replace(/\s+/g, " ").trim() : null);
const number = (s) => {
if (!s) return null;
const m = String(s).replace(/,/g, "").match(/-?\d+(?:\.\d+)?/);
return m ? Number(m[0]) : null;
};
const cards = [...document.querySelectorAll("[data-testid*=\"product-pod\" i], [data-pod-position]")];
const results = cards.map((c) => {
const link = c.querySelector("a[href*=\"/p/\"]");
const href = link?.getAttribute("href");
const productId = href?.match(/\/(\d{6,})(?:[/?#]|$)/)?.[1] || null;
const titleEl = c.querySelector("[data-testid*=\"product-title\" i], [class*=\"product-title\" i], h2, h3");
const ratingEl = c.querySelector("[aria-label*=\"out of\" i]");
const reviewCountEl = c.querySelector("[data-testid*=\"review-count\" i], [class*=\"review-count\" i]");
const priceEl = c.querySelector("[data-testid*=\"price\" i], [class*=\"price\" i]");
const brandEl = c.querySelector("[data-testid*=\"brand\" i], [class*=\"brand\" i]");
const img = c.querySelector("img");
return {
productId,
productUrl: href ? new URL(href, location.origin).href : null,
title: text(titleEl),
brand: text(brandEl),
price: number(text(priceEl)),
priceText: text(priceEl),
rating: number(ratingEl?.getAttribute("aria-label")),
reviewCount: number(text(reviewCountEl)),
image: img?.currentSrc || img?.src || null,
};
}).filter((r) => r.productId || r.productUrl);
const offset = Number(new URLSearchParams(location.search).get("Nao") || 0);
return JSON.stringify({
query: location.href,
page: Math.floor(offset / 24) + 1,
pageSize: results.length,
results,
});
})()
'
```
Pagination loop for the first five pages:
```bash
for offset in 0 24 48 72 96; do
scrapeless-scraping-browser --session-id $SESSION open "$SEARCH_URL?Nao=$offset"
scrapeless-scraping-browser --session-id $SESSION wait 1500
scrapeless-scraping-browser --session-id $SESSION wait '[data-testid*="product-pod" i], [data-pod-position]'
scrapeless-scraping-browser --session-id $SESSION eval '/* same listings extractor */' \
> "search-page-$((offset / 24 + 1)).json"
done
```
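The `/* same listings extractor */` placeholder above stands in for the full eval body shown earlier in this step. In practice, keep each extractor IIFE in a local file and splice it in with command substitution; `listings-extractor.js` below is a file you create yourself, not something the CLI ships:

```bash
# Save the listings IIFE from this step to listings-extractor.js once, then reuse it per page.
scrapeless-scraping-browser --session-id $SESSION eval "$(cat listings-extractor.js)" \
  > "search-page-$((offset / 24 + 1)).json"
```

The same trick keeps the Step 6 and Step 8 loops readable.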
For higher fan-out across dozens of queries, mint fresh sessions per query and cap concurrency at ≤3 workers per host (see the Step 8 parallel-workers note). Category pages (/b/<category-slug>) accept the same Nao offset and the same extractor.
Step 5 - Per-store stock check (ZIP code)
Home Depot's distinctive feature is per-store availability: setting a Home Depot store via the location-selector modal returns store-specific stock and pickup-ETA text on the same PDP, alongside national fulfillment options. The same flow can also shift the "Most Helpful" review ordering toward regional helpful votes, so a single store-set session covers both inventory and review-context use cases.
The `@e15`/`@e22`/`@e25` accessibility refs in the snippet are placeholders; Scrapeless `snapshot -i` assigns refs dynamically per page render. Capture the snapshot, find the rows labelled "Change Store", the ZIP input, and the Search / Confirm button, and substitute the real `@e<n>` numbers before running the click sequence. Confirm the `[data-testid*="store-locator" i]` / `[data-testid*="store-name" i]` selectors against the live DOM with a `get html` discover pass; they match Home Depot's general `data-testid` conventions but should be tightened to whatever the page renders.
```bash
ZIP="90015"
# 1. Open the PDP and surface accessibility refs for the location selector.
scrapeless-scraping-browser --session-id $SESSION open \
"https://www.homedepot.com/p/$PRODUCT_SLUG/$PRODUCT_ID"
scrapeless-scraping-browser --session-id $SESSION wait 1500
scrapeless-scraping-browser --session-id $SESSION wait '[data-testid*="store-locator" i], [aria-label*="store" i]'
scrapeless-scraping-browser --session-id $SESSION snapshot -i > /tmp/pdp-refs.txt
# 2. Click "Change Store" / "My Store" β fill ZIP β confirm.
# Adjust the @e<n> refs below to whatever the snapshot returns.
scrapeless-scraping-browser --session-id $SESSION click @e15 # Change Store
scrapeless-scraping-browser --session-id $SESSION wait 800
scrapeless-scraping-browser --session-id $SESSION fill @e22 "$ZIP" # ZIP input
scrapeless-scraping-browser --session-id $SESSION click @e25 # Search / Confirm
scrapeless-scraping-browser --session-id $SESSION wait 1500
scrapeless-scraping-browser --session-id $SESSION wait '[data-testid*="fulfillment" i], [data-testid*="availability" i]'
# 3. Extract per-store availability + the store badge.
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const text = (el) => (el ? el.textContent.replace(/\s+/g, " ").trim() : null);
return JSON.stringify({
productUrl: location.href,
store: text(document.querySelector("[data-testid*=\"store-name\" i], [data-testid*=\"my-store\" i]")),
zip: "'$ZIP'",
availabilityText: text(document.querySelector("[data-testid*=\"availability\" i], [data-testid*=\"in-stock\" i]")),
pickupEta: text(document.querySelector("[data-testid*=\"pickup\" i], [aria-label*=\"Pickup\" i]")),
stockCount: text(document.querySelector("[data-testid*=\"stock-count\" i], [class*=\"stock-count\" i]")),
});
})()
'
```
The store-set state is preserved in cookies for the rest of the session, so subsequent calls (the Step 3 hydrated-fields eval, the Step 6 rendered-review extractor, the Step 7 sort flow) return the regional view without re-driving the modal.
Cookie-set fallback. When the location-selector modal isn't reachable (popup blocker, A/B variant, transient WAF), the same outcome is available via cookies: `THD_LOCSTORE` / `THD_PERSIST` carry the store id and ZIP. Set them via `scrapeless-scraping-browser cookies set` and the next `open` returns the store-context PDP without the click flow.
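A sketch of that fallback. The cookie names come from the paragraph above, but the value encoding and the exact `cookies` argument shapes are assumptions; verify both against skill-dev/SKILL.md and against a cookie dump from a session where the modal flow succeeded:

```bash
# 1. After one successful modal run (Step 5), dump the store cookies to capture real values.
#    (Listing invocation is assumed - check `scrapeless-scraping-browser cookies --help`.)
scrapeless-scraping-browser --session-id $SESSION cookies | grep -E 'THD_LOCSTORE|THD_PERSIST'
# 2. Replay them into a fresh session before the first open.
#    (Argument shape is hypothetical - substitute the captured values, don't guess them.)
scrapeless-scraping-browser --session-id $NEW_SESSION cookies set 'THD_LOCSTORE=<value captured above>'
scrapeless-scraping-browser --session-id $NEW_SESSION cookies set 'THD_PERSIST=<value captured above>'
```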
Step 6 - Reviews 11+ via the rendered carousel
When the workflow needs more than the JSON-LD top 10 (full review corpus, applied sort/filter, custom date range), the rendered review widget is the surface. The reviews live at /p/reviews/<slug>/<productId>/<page> and inside the on-PDP carousel; both render through the same paginated-product-reviews micro-frontend at assets.thdstatic.com/experiences/paginated-product-reviews/*.
Drive the discover → extract flow:
```bash
REVIEWS_URL="https://www.homedepot.com/p/reviews/$PRODUCT_SLUG/$PRODUCT_ID/1"
scrapeless-scraping-browser --session-id $SESSION open "$REVIEWS_URL"
scrapeless-scraping-browser --session-id $SESSION wait 1500
# networkidle does NOT settle on Home Depot; wait for review-card anchors instead.
scrapeless-scraping-browser --session-id $SESSION wait "#reviews, [itemprop='review'], [data-testid*='review' i]"
# Peek at the review region HTML to confirm anchors
scrapeless-scraping-browser --session-id $SESSION get html "#reviews, [data-component*='Review' i]"
```
From the returned HTML, identify the stable anchors: review cards usually live under [itemprop='review'], [data-testid*='review' i], or repeated <article>/<li> structure; star widgets carry an aria-label like "5 out of 5"; reviewer names sit near repeated heading patterns; verified-purchaser badges expose stable text content; pagination buttons carry accessible labels.
Then eval the extraction:
```bash
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const text = (el) => (el ? el.textContent.replace(/\s+/g, " ").trim() : null);
const number = (v) => {
if (v == null) return null;
const m = String(v).replace(/,/g, "").match(/-?\d+(?:\.\d+)?/);
return m ? Number(m[0]) : null;
};
const unique = (xs) => [...new Set(xs.filter(Boolean))];
const productId =
location.pathname.match(/\/(\d+)(?:\/\d+)?(?:[/?#]|$)/)?.[1] || null;
const root =
document.querySelector("#reviews") ||
document.querySelector("[data-component*=\"Review\" i]") ||
document.querySelector("[data-testid*=\"review\" i]") ||
document.body;
const cards = [...root.querySelectorAll(
"[itemprop=\"review\"], [data-testid*=\"review\" i], article, li"
)].filter((c) => {
const bodyStr = text(c) || "";
const hasRating = !!c.querySelector("[aria-label*=\"out of\" i]");
const looksReviewish = /(verified|recommend|helpful|review|purchased)/i.test(bodyStr);
return bodyStr.length > 40 && (hasRating || looksReviewish);
});
const reviews = cards.map((c) => {
const bodyStr = text(c) || "";
const ratingLabel =
c.querySelector("[aria-label*=\"out of\" i]")?.getAttribute("aria-label") ||
text(c.querySelector("[data-testid*=\"rating\" i]"));
const dateEl = c.querySelector("time, [datetime], [data-testid*=\"date\" i]");
const helpfulText = unique(
[...c.querySelectorAll("button, [aria-label*=\"helpful\" i]")]
.map((el) => el.getAttribute("aria-label") || text(el))
).join(" ");
return {
id: c.getAttribute("id") || c.getAttribute("data-review-id") || null,
title: text(c.querySelector("[data-testid*=\"title\" i], h3, h4")),
text: text(c.querySelector("[data-testid*=\"body\" i], p")),
rating: number(ratingLabel),
isRecommended: /recommend/i.test(bodyStr)
? !/not recommend|would not recommend/i.test(bodyStr) : null,
badges: unique(
[...c.querySelectorAll("[data-testid*=\"badge\" i], [class*=\"badge\" i]")].map(text)
.concat(bodyStr.match(/Verified Purchaser|Verified Buyer|Pro|Staff Pick|Incentivized/gi) || [])
),
reviewer: { name: text(c.querySelector("[data-testid*=\"author\" i], [class*=\"author\" i]")) },
time: dateEl?.getAttribute("datetime") || text(dateEl),
original_source: { name: "homedepot.com" },
images: unique([...c.querySelectorAll("img")].map((i) => i.currentSrc || i.src)).map((link) => ({ link })),
total_positive_feedback: number(helpfulText.match(/(\d+)\s*(?:people\s*)?(?:found\s*)?(?:helpful|yes|up)/i)?.[1]),
total_negative_feedback: number(helpfulText.match(/(\d+)\s*(?:not helpful|no|down)/i)?.[1]),
verified: /verified/i.test(bodyStr),
};
}).filter((r) => r.text || r.title);
return JSON.stringify({ productId, productUrl: location.href, reviewsReturned: reviews.length, reviews });
})()
'
```
The rendered-DOM extractor is the path for review pagination (Step 8), applied sort/filter (Step 7), and any time the workflow needs reviews 11+. Treat absent fields as null or [] rather than dropping the key; that keeps downstream pipelines stable when Home Depot rotates a review-card template.
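To stream results into the reviews.jsonl shape from the prompt table, pipe the eval output through jq. The `fromjson` guard handles the CLI pitfall noted earlier (eval can return a JSON-quoted string); `review-extractor.js` is a local file holding this step's IIFE:

```bash
scrapeless-scraping-browser --session-id $SESSION eval "$(cat review-extractor.js)" \
  | jq -c 'if type == "string" then fromjson else . end | .reviews[]' \
  >> reviews.jsonl
```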
Per-review schema: `id`, `title`, `text`, `rating`, `badges`, `reviewer.name` (nested), `time`, `original_source.name` (nested), `images[].link` (array of objects), `total_positive_feedback`, `total_negative_feedback`, plus the Scrapeless extras `isRecommended` (boolean derived from review text) and `verified` (boolean derived from badge text). Keep absent fields as `null` or `[]` so downstream consumers stay stable across DOM rotations.
If the rendered-review pass returns zero cards even after the wait, the micro-frontend hasn't fully hydrated; retry the wait, retry with a fresh session if Access Denied was returned, or fall back to the JSON-LD top 10 from Step 2 plus the GraphQL pagination path (/federation-gateway/graphql?opname=reviewSentiments) for deeper queries.
Step 7 - Sort and filter reviews through the UI
Home Depot exposes sort options (newest, highest rating, lowest rating, most helpful, photo reviews only) on the reviews page. Drive those controls inside the same persistent session; the snapshot → click choreography preserves cookies and applied state.
The `@e34`/`@e35`/`@e36`/`@e41` refs in the snippet are placeholders; Scrapeless `snapshot -i` assigns refs dynamically per page render. Capture the snapshot first, find the rows for the sort dropdown and filter buttons, and substitute the real numbers before running the click sequence.
```bash
# Surface accessibility refs for interactive controls
scrapeless-scraping-browser --session-id $SESSION snapshot -i > /tmp/reviews-refs.txt
# Identify the sort dropdown / filter button refs from the snapshot, then click.
# On a typical Home Depot reviews page these surface as something like:
# @e34 [button] "Sort By: Most Helpful"
# @e35 [button] "Photos Only"
# @e36 [button] "Verified Purchaser"
# Adjust the @ref numbers below to whatever the snapshot returns.
scrapeless-scraping-browser --session-id $SESSION click @e34 # open sort dropdown
scrapeless-scraping-browser --session-id $SESSION wait 800
scrapeless-scraping-browser --session-id $SESSION snapshot -i > /tmp/sort-options.txt
scrapeless-scraping-browser --session-id $SESSION click @e41 # "Newest" option
scrapeless-scraping-browser --session-id $SESSION wait 1500
scrapeless-scraping-browser --session-id $SESSION wait "#reviews, [data-component*='Review' i]"
# Re-run the extraction after the review region rerenders
scrapeless-scraping-browser --session-id $SESSION eval '/* same review-extraction body as Step 6 */'
```
For a photo-only pass, click the visible photo-review filter; the review region rerenders with the filtered set, and the same extractor in Step 6 returns photo reviews only. If the page does not expose a photo filter for a given product, extract all visible reviews and post-filter on images.length > 0.
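That post-filter is a one-line jq pass over a saved page payload (assuming the eval output was unwrapped to plain JSON before saving):

```bash
jq '.reviews | map(select((.images | length) > 0))' reviews-page-1.json > photo-reviews.json
```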
Step 8 - Paginate through reviews
Home Depot paginates reviews through /p/reviews/<slug>/<productId>/<page>. Two patterns work:
Pattern A: drive the visible "Next" control inside the same session. Useful when sort/filter state has been applied via Step 7 and must persist across pages:
```bash
for page in $(seq 1 5); do
scrapeless-scraping-browser --session-id $SESSION eval '/* extract review schema */' \
> "reviews-page-$page.json"
# Click visible "Next" pagination control
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
const next = [...document.querySelectorAll("a, button")]
.find((el) => /next/i.test(el.getAttribute("aria-label") || el.textContent || ""));
if (!next) return false;
next.scrollIntoView({ block: "center" });
next.click();
return true;
})()
'
scrapeless-scraping-browser --session-id $SESSION wait 1500
scrapeless-scraping-browser --session-id $SESSION wait "#reviews, [data-component*='Review' i]"
done
```
Pattern B: direct URL pagination with a fresh session per page. Useful for parallel extraction at scale:
```bash
PRODUCT_ID="326716329"
SLUG="NextWall-31-35-sq-ft-Off-White-Faux-Shiplap-Vinyl-Paintable-Peel-and-Stick-Wallpaper-Roll-PP10000"
for page in $(seq 1 10); do
SID=$(scrapeless-scraping-browser new-session \
--name "hd-reviews-p$page" --ttl 300 --proxy-country US --json \
| jq -r '.data.taskId')
URL="https://www.homedepot.com/p/reviews/$SLUG/$PRODUCT_ID/$page"
scrapeless-scraping-browser --session-id $SID open "$URL"
scrapeless-scraping-browser --session-id $SID wait 1500
scrapeless-scraping-browser --session-id $SID wait "#reviews, [data-component*='Review' i]"
scrapeless-scraping-browser --session-id $SID eval '/* review-extraction body */' \
> "reviews-page-$page.json"
scrapeless-scraping-browser stop $SID
done
```
One fresh session per page, not one long session. A reused `--session-id` is in theory more efficient, but in practice Scrapeless sessions begin returning blank `chrome://new-tab-page/` or terminating after ~3 hydrated-page requests. Short-TTL sessions per page are more reliable, easy to parallelize, and trivial to reason about.
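A cheap health check before each extraction catches the degraded-session case early. A sketch, assuming `eval 'location.href'` returns the current URL (possibly JSON-quoted, hence the wildcard match):

```bash
# Re-mint if the session has drifted back to the pre-warm tab.
LOC=$(scrapeless-scraping-browser --session-id $SID eval 'location.href')
case "$LOC" in
  *chrome://new-tab-page*) echo "stale session $SID - mint a fresh one before extracting" >&2 ;;
esac
```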
Running parallel workers? The default `~/.scrapeless-scraping-browser/` daemon is shared across processes on the same host, so multiple workers can hijack each other's sessions even when each passes its own `--session-id`. What works: chain every CLI call for one job as a single-shell `&&` invocation (atomic from the daemon's POV; other workers can't interleave between your steps), use unique session names (the port is hashed from the name), and cap concurrency around 3 workers per host. For higher fan-out, shard across hosts. See skill-dev/SKILL.md "Parallel CLI Agents" for the full pattern.
Dedupe across pages by `id` when present; fall back to `reviewer.name` + `time` + `title` + `text` when no ID is exposed. Pinned reviews and sponsored widgets occasionally repeat across pages.
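One jq pass collapses the per-page files into a deduped corpus; the composite fallback key mirrors the rule above:

```bash
# Collapse all page files into one array, dropping duplicates by id or composite key.
jq -s '
  [.[].reviews[]]
  | unique_by(.id // ((.reviewer.name // "") + "|" + (.time // "") + "|" + (.title // "") + "|" + (.text // "")))
' reviews-page-*.json > reviews-deduped.json
```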
What You Get Back
The Scraping Browser returns a live DOM; the extraction schema is whatever the caller writes into the eval step. For a single reviews-page pass with the discover → extract template from Step 6, the fields that come back today look like this:
```json
// Schema reflects exactly what the Step 6 eval emits.
// Field values are illustrative samples, not a frozen snapshot of any product today.
{
"productId": "326716329",
"productUrl": "https://www.homedepot.com/p/reviews/<slug>/326716329/1",
"reviewsReturned": 1,
"reviews": [
{
"id": "abc123",
"title": "Easy install and solid quality",
"text": "The product installed cleanly and worked as expected.",
"rating": 5,
"isRecommended": true,
"badges": ["Verified Purchaser"],
"reviewer": { "name": "HomeDepotCustomer" },
"time": "2024-04-04",
"original_source": { "name": "homedepot.com" },
"images": [],
"total_positive_feedback": 8,
"total_negative_feedback": 0,
"verified": true
}
]
}
```
The Step 6 eval emits productId, productUrl, reviewsReturned, and reviews[]. The rating histogram (overallRating, ratings[] per star), aggregate totalReviews, and product-level metadata are emitted by the Step 2 JSON-LD eval; pull both Step 2 and Step 6 into the same session if a workflow needs aggregate + corpus together. The Step 6 extractor can be extended with a histogram pass against [data-testid*="histogram" i] / [aria-label*="stars" i] rows, but that is not in the eval body shown above; add it explicitly when you need it (a sketch follows).
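A minimal histogram pass in the same eval style. The two selectors are the anchors named above; treat them as starting points for a `get html` discover pass, not verified selectors, and expect the label regexes to need per-category adjustment:

```bash
scrapeless-scraping-browser --session-id $SESSION eval '
(() => {
  // Candidate histogram rows - selectors are assumptions from the anchors named above.
  const rows = [...document.querySelectorAll(
    "[data-testid*=\"histogram\" i] [role=\"row\"], [aria-label*=\"stars\" i]"
  )];
  const histogram = rows.map((r) => {
    const label = r.getAttribute("aria-label") || r.textContent.replace(/\s+/g, " ").trim();
    const star = label.match(/(\d)\s*star/i);                                // which bucket
    const count = label.match(/(\d[\d,]*)\s*(?:reviews?|ratings?)?\s*$/i);   // trailing count
    return star
      ? { stars: Number(star[1]), count: count ? Number(count[1].replace(/,/g, "")) : null }
      : null;
  }).filter(Boolean);
  return JSON.stringify({ histogram });
})()
'
```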
A few honest observations about this output, worth knowing before running at scale:
- Hydration timing. The reviews micro-frontend at `assets.thdstatic.com/experiences/paginated-product-reviews/*` paints in waves: page chrome and the H1 first, then the histogram and the review cards. A `wait` against a review-card anchor (Step 6) is what gates extraction. If the discover pass returns the page chrome only, wait a beat longer and re-run `get html` before tightening selectors; the cards are usually one render cycle away.
- Selector stability. `[itemprop='review']`, `[data-testid*='review' i]`, and `aria-label*='out of'` star widgets are the longest-lived anchors. Class names beginning with hashed prefixes rotate across deploys.
- Histogram presence. The rating histogram is rendered inline on the reviews page but uses different DOM shapes across product categories. Treat its absence as nullable rather than retrying; for products with very few reviews, it may not be rendered at all.
- Verified-purchaser badge. Two stable surfaces: a span with badge styling carrying the literal text "Verified Purchaser", and a `data-testid` on the badge wrapper. Match either.
- Photos. Reviews with attached photos expose `<img>` elements inside the review card; non-photo reviews simply have no `<img>` children, so `images.length > 0` is a clean filter.
- WAF interstitials. Some allocations land on Home Depot's `Access Denied` page (bodyLen ≪ 1000, no review markers). The script in Step 6 should detect that page (e.g., `if (/Access Denied|Error Page/i.test(document.title)) throw`) and the caller retries with a fresh session; the guard is spelled out below.
Claim your free plan and start scraping:
Join Scrapeless's vibrant community to claim a $5-10 free plan and connect with fellow innovators:
Scrapeless Official Discord Community
Scrapeless Official Telegram Community
FAQ
Q1: Do I need a proxy to scrape Home Depot?
Yes, 100%. Home Depot is a US retail site and serves a generic error page on non-US egress. `--proxy-country US` is required on every session.
Q2: Where do price and fulfillment options live β JSON-LD or rendered DOM?
Both. JSON-LD ships the canonical offers.price, offers.priceCurrency, and offers.availability (Step 2). Sale-price overrides, current promotion text, the per-store availability badge, and the fulfillment buttons (Ship to Home / Pickup / Delivery) are populated after React hydrates and are extracted from the rendered DOM (Step 3).
Q3: How do I scrape per-store stock?
Drive the location-selector modal: open the PDP, click "Change Store", fill the ZIP, confirm. Subsequent eval calls in the same session return the regional view. See Step 5 for the full snapshot → click → fill choreography. The cookie-set fallback (THD_LOCSTORE / THD_PERSIST) is documented at the end of Step 5 for environments where the modal isn't reachable.
Q4: Why does my session sometimes return ERR_TUNNEL_CONNECTION_FAILED?
The proxy pool had no available residential IP at allocation time. Mint a fresh session and retry shortly.
Q5: Why does the page sometimes return Access Denied?
Home Depot's WAF occasionally challenges a fresh allocation. Mint a new session and retry. The Step 6 extractor should detect the error-page title and throw so the caller can retry with a fresh session.
Q6: How do search and category pages paginate?
Through the Nao URL offset (?Nao=24, ?Nao=48, …) at 24 cards per page. Step 4 covers the listing extractor and the pagination loop. Category landing pages (/b/<slug>) accept the same offset.
Q7: Can I scrape only photo reviews?
Yes. First try the visible photo-review filter (Step 7). If that control is absent for a given product, extract all visible reviews and post-filter on images.length > 0.
Q8: How do I get newest or lowest-rated reviews?
Use the UI sort controls inside the same persistent session via the snapshot → click choreography in Step 7. Click the relevant control, wait for the review region to rerender, then rerun the extraction eval.
Q9: What happens when Home Depot changes the DOM?
Rerun the discover pass: get html "<region>", identify the current stable anchors (data-testid, aria-label, [itemprop], semantic ids), and adjust the eval selectors. Don't ship hashed class names as permanent selectors.
Q10: How many products can I collect in one session?
For reliability, keep one Scrapeless session to a small logical scrape (a few PDPs or a few search pages). For larger jobs, mint fresh sessions per task and keep concurrency ≤3 per host (Step 8 parallel-workers note). Shard across hosts for higher fan-out.
Q11: Can this run without an AI agent?
Yes. The CLI commands in Steps 1-8 work end-to-end as bash. The agent-driven workflow (skill + natural-language prompts) is the recommended path because the skill carries the discover → extract pattern, the wait gotchas, and the parallel-worker rules so the prompt can stay one line.
Q12: Why use a canonical /p/<slug>/<productId> URL instead of a bare product ID?
Bare ID-only URLs like /p/<productId> returned Home Depot's generic error page in verification. The canonical PDP URL /p/<slug>/<productId> and the canonical reviews URL /p/reviews/<slug>/<productId>/<page> are the reliable browser targets; resolve product IDs to those shapes before opening the page.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



