Track Your Brand Across 6 AI Answer Engines with One Pipeline
Lead Scraping Automation Engineer
Key Takeaways:
- Six AI answer engines, one pipeline. ChatGPT, Grok, Gemini, Perplexity, Copilot, and Google's AI Overview each answer buying questions with citations, and all six are capturable through one endpoint, one x-api-token header, and one { status, task_id, task_result } envelope.
- The platforms differ only at the field level. Citations sit under engine-specific keys (content_references, web_search_results, citations, web_results, source), with Gemini and Copilot sharing citations; a six-line field map normalizes them into one citation stream.
- Share of citation is the output metric. Group the normalized citations by domain per prompt and platform, and the count over time is your brand's AI-answer visibility.
- Three stages, three short scripts. Capture the answers, normalize the citations, report the counts — each stage is a runnable Python file you can put on a schedule.
- Pin the variables that move. Country, Grok's reasoning mode, and the prompt set stay fixed per series; the answers vary run to run, and that variance is the signal you chart.
- Free to start. New Scrapeless accounts include free trial credits — sign up at app.scrapeless.com.
Pipeline at a glance
A buyer asks an AI assistant which tool to choose, and the assistant names someone — backed by a short list of cited sources. Whether that someone is you differs per platform: the engine that cites you may not be the engine your buyers use. Tracking one platform tells you about that platform; the visibility picture is the six of them side by side.
The pipeline below produces that picture end to end:
- Capture: run a fixed prompt against all six engines through their Scrapeless actors; store the raw answers as JSONL.
- Normalize: map each platform's citation field into one unified {platform, prompt, domain, url, title} stream.
- Report: count citations by domain per platform, and check where your own domain appears.
Stage 1 is the only stage that touches the network. Stages 2 and 3 are pure transforms, so re-running analysis is free. For the conceptual background on why AI-answer citations became a visibility metric, the GEO and brand-AI-visibility piece covers the discipline; this guide builds the instrument.
Prerequisites
- A Scrapeless account and API key — sign up at app.scrapeless.com.
- Python 3.10+ with requests.
- A fixed prompt your buyers might actually ask (the worked example uses one; production runs use a set).
Store your key in the environment so it never lands in code:
bash
export SCRAPELESS_API_KEY=your_api_token_here
Stage 1 — Capture the answers
One function covers all six engines, because the actors share an endpoint and an envelope. The per-engine differences are confined to the input map — Grok requires a reasoning mode, Perplexity wants the web_search flag, Copilot takes its own mode:
| Platform | Actor | Extra input | Citations live in |
|---|---|---|---|
| ChatGPT | scraper.chatgpt | none | content_references[] |
| Grok | scraper.grok | mode (required) | web_search_results[] + x_search_results[] |
| Gemini | scraper.gemini | none | citations[] |
| Perplexity | scraper.perplexity | web_search: true | web_results[] |
| Copilot | scraper.copilot | mode: "smart" | citations[] |
| Google AI Overview | scraper.overview | none | source[] |
python
# capture.py: run one prompt across six AI answer engines, store raw answers
import json
import os
import time

import requests

ENDPOINT = "https://api.scrapeless.com/api/v2/scraper/execute"
HEADERS = {
    "Content-Type": "application/json",
    "x-api-token": os.environ["SCRAPELESS_API_KEY"],
}
PROMPT = "What is the best web scraping API for JavaScript-heavy sites?"
COUNTRY = "US"

# Per-engine actor name plus any extra input that actor needs.
ENGINES = {
    "chatgpt": {"actor": "scraper.chatgpt", "extra": {}},
    "grok": {"actor": "scraper.grok", "extra": {"mode": "MODEL_MODE_FAST"}},
    "gemini": {"actor": "scraper.gemini", "extra": {}},
    "perplexity": {"actor": "scraper.perplexity", "extra": {"web_search": True}},
    "copilot": {"actor": "scraper.copilot", "extra": {"mode": "smart"}},
    "google-ai-overview": {"actor": "scraper.overview", "extra": {}},
}

with open("answers.jsonl", "w", encoding="utf-8") as out:
    for platform, spec in ENGINES.items():
        payload = {
            "actor": spec["actor"],
            "input": {"prompt": PROMPT, "country": COUNTRY, **spec["extra"]},
        }
        resp = requests.post(ENDPOINT, headers=HEADERS, json=payload, timeout=300)
        resp.raise_for_status()
        data = resp.json()
        # One JSONL row per platform: run metadata plus the raw envelope.
        out.write(json.dumps({
            "platform": platform,
            "prompt": PROMPT,
            "country": COUNTRY,
            "captured_at": int(time.time()),
            "status": data.get("status"),
            "task_id": data.get("task_id"),
            "task_result": data.get("task_result"),
        }) + "\n")
        print(f"{platform}: {data.get('status')}")
Each line of answers.jsonl is one platform's complete capture — answer, citations, and run metadata — keyed by task_id for the audit trail.
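As written, capture.py stops at the first failed request, because raise_for_status() raises inside the loop and the remaining engines are never captured. One possible way to keep a run going is a small retry helper around each request. The sketch below is an assumption layered on top of the script, not part of the Scrapeless API: capture_with_retry, its parameters, and the "capture_error" status label are hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable


def capture_with_retry(
    send: Callable[[], dict[str, Any]],
    attempts: int = 3,
    base_delay: float = 2.0,
    retry_on: tuple[type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call send() up to `attempts` times with exponential backoff.

    Returns whatever send() returns on success. After the last failed
    attempt it returns a placeholder envelope instead of raising, so the
    caller can still write one row for this platform.
    """
    last_error = ""
    for attempt in range(attempts):
        try:
            return send()
        except retry_on as exc:
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < attempts - 1:
                # Wait 2s, 4s, 8s, ... between attempts (with the defaults).
                sleep(base_delay * 2 ** attempt)
    # "capture_error" is a local, hypothetical label, not a Scrapeless status.
    return {
        "status": "capture_error",
        "task_id": None,
        "task_result": None,
        "error": last_error,
    }
```

In capture.py the send callable could be a small function that posts the payload, calls raise_for_status(), and returns resp.json(). The default retry_on=(OSError,) covers network and HTTP failures from requests, since requests.RequestException derives from OSError. Writing the placeholder row keeps a failed capture visible in answers.jsonl rather than silently missing.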
Stage 2 — Normalize the citations
The field map is the whole trick: each platform names its citation array differently and shapes the entries differently, but every entry carries a URL. Six mappings turn six schemas into one stream:
python
# normalize.py: answers.jsonl -> citations.jsonl (one row per cited source)
import json
from urllib.parse import urlparse

# platform -> list of (array_field, url_key) pairs inside task_result
CITATION_FIELDS = {
    "chatgpt": [("content_references", "url")],
    "grok": [("web_search_results", "url"), ("x_search_results", "url")],
    "gemini": [("citations", "url")],
    "perplexity": [("web_results", "url")],
    "copilot": [("citations", "url")],
    "google-ai-overview": [("source", "url")],
}

count = 0
with open("answers.jsonl", encoding="utf-8") as inp, \
        open("citations.jsonl", "w", encoding="utf-8") as out:
    for line in inp:
        row = json.loads(line)
        result = row.get("task_result") or {}
        for field, url_key in CITATION_FIELDS[row["platform"]]:
            for entry in result.get(field) or []:
                url = entry.get(url_key) or ""
                # Skip entries that carry no absolute http(s) URL.
                if not url.startswith("http"):
                    continue
                out.write(json.dumps({
                    "platform": row["platform"],
                    "prompt": row["prompt"],
                    "country": row["country"],
                    "captured_at": row["captured_at"],
                    "panel": field,
                    "domain": urlparse(url).netloc.removeprefix("www."),
                    "url": url,
                    "title": entry.get("title") or entry.get("name") or "",
                }) + "\n")
                count += 1

print(count, "citations normalized")
Grok contributes two panels — open-web pages and X posts — and the panel field keeps them distinguishable downstream.
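To make the panel distinction concrete, here is a minimal sketch that applies the Grok part of the field map to a single result and counts citations per panel. The SAMPLE payload is invented for illustration; real Grok results may carry additional keys, and grok_citations is a hypothetical helper name, not something the pipeline defines.

```python
from collections import Counter
from urllib.parse import urlparse

# Grok's entry from the normalize.py field map: (array_field, url_key) pairs.
GROK_FIELDS = [("web_search_results", "url"), ("x_search_results", "url")]


def grok_citations(task_result):
    """Yield one row per Grok citation, tagged with the panel it came from."""
    for field, url_key in GROK_FIELDS:
        for entry in (task_result or {}).get(field) or []:
            url = entry.get(url_key) or ""
            if url.startswith("http"):
                yield {
                    "panel": field,
                    "domain": urlparse(url).netloc.removeprefix("www."),
                    "url": url,
                }


# Invented sample payload, for illustration only.
SAMPLE = {
    "web_search_results": [
        {"url": "https://www.example.com/guide", "title": "Guide"},
        {"url": "https://example.org/compare", "title": "Compare"},
    ],
    "x_search_results": [{"url": "https://x.com/someone/status/1"}],
}

rows = list(grok_citations(SAMPLE))
by_panel = Counter(row["panel"] for row in rows)
web_domains = sorted(
    {row["domain"] for row in rows if row["panel"] == "web_search_results"}
)
print(dict(by_panel))
print(web_domains)
```

Filtering on panel before counting domains is one way to keep X posts from inflating x.com in an open-web share-of-citation table.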
Stage 3 — Report share of citation
With one citation stream, the report is a group-by. Per platform: which domains the engine credits, and whether yours is among them:
python
# report.py: citations.jsonl -> share-of-citation table per platform
import json
import os
from collections import Counter, defaultdict

BRAND = os.environ.get("BRAND_DOMAIN", "scrapeless.com")

# platform -> Counter of cited domains
per_platform = defaultdict(Counter)
with open("citations.jsonl", encoding="utf-8") as inp:
    for line in inp:
        row = json.loads(line)
        per_platform[row["platform"]][row["domain"]] += 1

for platform, counts in per_platform.items():
    total = sum(counts.values())
    brand_hits = counts.get(BRAND, 0)
    print(f"\n{platform}: {total} citations · {BRAND}: {brand_hits}")
    # Top five cited domains, with the brand's row marked.
    for domain, n in counts.most_common(5):
        marker = " ←" if domain == BRAND else ""
        print(f"  {n:>3}  {domain}{marker}")
Run on a schedule, this table becomes a time series: per platform, per prompt, per market — the count of answers that cite you, and who gets cited instead. That series is the deliverable a GEO program reports on.
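One way to turn scheduled runs into that series is to bucket citation rows by capture day and platform and compute the fraction that credit your domain. The sketch below assumes the row shape written by normalize.py (captured_at, platform, domain); share_series and is_brand are hypothetical helper names, bucketing by UTC day is a choice made here, and the sample rows (including the docs subdomain) are invented.

```python
from collections import defaultdict
from datetime import datetime, timezone


def is_brand(domain: str, brand: str) -> bool:
    """True for the brand domain itself or any of its subdomains."""
    return domain == brand or domain.endswith("." + brand)


def share_series(rows, brand):
    """Map (UTC day, platform) -> fraction of that day's citations crediting brand."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        day = datetime.fromtimestamp(
            row["captured_at"], tz=timezone.utc
        ).date().isoformat()
        key = (day, row["platform"])
        totals[key] += 1
        hits[key] += is_brand(row["domain"], brand)
    return {key: hits[key] / totals[key] for key in sorted(totals)}


# Invented rows for illustration only.
rows = [
    {"captured_at": 1700000000, "platform": "gemini", "domain": "scrapeless.com"},
    {"captured_at": 1700000000, "platform": "gemini", "domain": "example.com"},
    {"captured_at": 1700086400, "platform": "gemini", "domain": "docs.scrapeless.com"},
]
series = share_series(rows, "scrapeless.com")
for (day, platform), share in series.items():
    print(day, platform, f"{share:.0%}")
```

Two caveats on this sketch. It counts subdomains as brand hits, whereas report.py matches the domain exactly, so pick one rule and keep it fixed for the whole series. A platform and day with no citations at all produces no key here, so an empty capture has to be read from answers.jsonl rather than from this table.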
Scheduling and scaling the series
- Hold the variables. Same prompts, same country, same Grok mode every run: a series is only readable when the process is constant. Capture daily or weekly; AI answers move on both timescales.
- Scale by multiplication, not new code. More prompts is a loop around Stage 1; more markets is a second COUNTRY value. Both multiply run counts, so budget accordingly: the actors are billed by usage, with current tiers on the pricing page.
- Keep the raw captures. answers.jsonl is the evidence behind every number in the report; normalization choices change, raw answers don't.
- Expect empty panels. Some prompts yield no citations on some engines (Grok's X panel in particular is prompt-dependent). An empty array is a data point, not a failure.
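The multiplication in the second point can be sketched as a run matrix built before any request is sent, which also gives the request count to budget for. This is an illustrative sketch: build_runs is a hypothetical helper, the second prompt is invented, and "GB" is an assumed second market code that should be checked against what the actors accept.

```python
from itertools import product

# Same engine map as capture.py.
ENGINES = {
    "chatgpt": {"actor": "scraper.chatgpt", "extra": {}},
    "grok": {"actor": "scraper.grok", "extra": {"mode": "MODEL_MODE_FAST"}},
    "gemini": {"actor": "scraper.gemini", "extra": {}},
    "perplexity": {"actor": "scraper.perplexity", "extra": {"web_search": True}},
    "copilot": {"actor": "scraper.copilot", "extra": {"mode": "smart"}},
    "google-ai-overview": {"actor": "scraper.overview", "extra": {}},
}


def build_runs(prompts, countries, engines=ENGINES):
    """Yield one (platform, payload) pair per prompt x country x engine."""
    for prompt, country, (platform, spec) in product(
        prompts, countries, engines.items()
    ):
        yield platform, {
            "actor": spec["actor"],
            "input": {"prompt": prompt, "country": country, **spec["extra"]},
        }


PROMPTS = [
    "What is the best web scraping API for JavaScript-heavy sites?",
    "Which scraping API handles CAPTCHAs?",  # hypothetical second prompt
]
COUNTRIES = ["US", "GB"]  # "GB" is an assumed market code

runs = list(build_runs(PROMPTS, COUNTRIES))
print(len(runs), "requests per capture run")
```

Two prompts across two markets and six engines already means 24 requests per capture, so a daily schedule is roughly 30 times that per month.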
The actors live in the Universal Scraping API line; the best LLM scrapers guide ranks the category if you're comparing tooling.
FAQ
Q: Is it legal to capture AI answers this way?
The actors capture publicly rendered answer content. Rules vary by jurisdiction and each platform's terms — review the relevant ToS and consult counsel for your use case. Never collect personal data protected under GDPR or CCPA.
Q: Why one prompt in the example instead of a set?
Clarity. Production runs loop a prompt set around Stage 1; everything downstream already handles multiple prompts because every row carries its prompt.
Q: How many runs make a usable series?
Single captures of a non-deterministic surface prove little. Daily captures for two to three weeks give enough points to separate trend from noise on most prompt sets.
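One simple way to separate trend from noise in a daily series is a trailing average. The sketch below is a generic smoothing technique, not something the pipeline prescribes; the 7-day window and the daily counts are assumptions for illustration, and trailing_mean is a hypothetical helper name.

```python
def trailing_mean(values, window=7):
    """Average each point with up to `window - 1` points before it."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Invented daily brand-citation counts for one platform and prompt.
daily_hits = [2, 0, 3, 1, 0, 2, 4, 3, 5, 2, 4, 6, 3, 5]
for day, (raw, avg) in enumerate(zip(daily_hits, trailing_mean(daily_hits)), 1):
    print(f"day {day:>2}: raw={raw} trailing_avg={avg:.2f}")
```

The early values average over fewer than seven days, so the first week of any series should be read with more caution than the rest.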
Q: What about Google's AI Mode tab?
It has its own actor (scraper.aimode) under the same envelope — add a seventh entry to the engine map. The AI Overview guide covers Google's answer surfaces in depth.
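Adding the seventh engine takes two entries: one in the capture map and one in the citation field map. The actor name comes from the answer above; the platform label "google-ai-mode" and the empty extra input are assumptions, and which field holds AI Mode citations is not stated here. A hedged way to find it is to capture one answer and list the top-level arrays that contain URL-bearing entries. candidate_citation_fields is a hypothetical helper and the sample payload is invented.

```python
# Assumed engine-map entry: the label and the empty "extra" are guesses.
AI_MODE_ENGINE = {"google-ai-mode": {"actor": "scraper.aimode", "extra": {}}}


def candidate_citation_fields(task_result, url_keys=("url",)):
    """List top-level keys whose value is a list containing URL-bearing dicts."""
    found = []
    for field, value in (task_result or {}).items():
        if not isinstance(value, list):
            continue
        for entry in value:
            if isinstance(entry, dict) and any(
                str(entry.get(key) or "").startswith("http") for key in url_keys
            ):
                found.append(field)
                break
    return found


# Invented task_result shape, for illustration only.
sample_result = {
    "answer": "Some answer text",
    "references": [{"url": "https://example.com/a", "title": "A"}],
    "related": ["follow-up question 1", "follow-up question 2"],
}
print(candidate_citation_fields(sample_result))
```

Once a real capture confirms the field, it goes into CITATION_FIELDS in normalize.py as a (field, url_key) pair like the other six.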
Q: Do I need a proxy?
No. Residential egress and geo-routing are built into the actors; the country input is the whole configuration.
Q: Can this run without an AI agent or SDK?
Yes — the three stages are plain Python over HTTP. Any scheduler (cron, CI, a workflow runner) can drive them.
Conclusion: one envelope, six engines, one number
The pipeline reduces to three files: capture answers through six actors that share an endpoint and an envelope, normalize six citation schemas with a six-line field map, and count domains. The output is the number AI-era visibility work was missing — how often each answer engine credits you, tracked over time, per market. Schedule it and the chart draws itself.
Ready to Build Your AI-Answer Data Pipeline?
Join our community to claim a free plan and connect with developers building AI-answer pipelines: Discord · Telegram.
Sign up at app.scrapeless.com for free trial credits, and point the pipeline at the prompts, engines, and markets your brand answers to.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.



