Replace the current crawl method with Firecrawl's agent endpoint; it can handle pagination and extract more data. https://docs.firecrawl.dev/features/agent
Do additional web searches if these docs are too sparse.
{
"subagent_type": "Explore",
"prompt": "Explore the current Firecrawl scraping implementation in the ReviewLens project at /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py. Read the full file. I need to understand exactly how scrape_url works currently — the API endpoint it hits, the payload, and how it parses the response.\n\nAlso fetch these URLs to understand the Firecrawl agent endpoint:\n- https://docs.firecrawl.dev/features/agent\n\nI need to understand:\n1. The current implementation in ingest.py (read the file)\n2. The Firecrawl agent API — endpoint URL, request format, response format, how pagination/multi-step extraction works\n3. Any differences from the regular /scrape endpoint we currently use"
}
{
"subagent_type": "Explore",
"prompt": "Search the web for detailed information about Firecrawl's agent endpoint/feature. I need:\n\n1. The exact API endpoint URL and HTTP method\n2. Request payload schema (all parameters)\n3. Response format — what does it return?\n4. How does it handle pagination automatically?\n5. How does it extract structured data?\n6. Any Python SDK usage examples\n7. Pricing/rate limits for the agent endpoint\n8. Any differences from the regular scrape/crawl endpoints\n\nSearch queries to try:\n- \"firecrawl agent endpoint API documentation\"\n- \"firecrawl extract agent pagination\"\n- \"firecrawl agent python SDK example\"\n- \"site:docs.firecrawl.dev agent\"\n\nFetch these URLs if you find them:\n- https://docs.firecrawl.dev/features/agent\n- https://docs.firecrawl.dev/api-reference/endpoint/agent\n- Any GitHub examples or blog posts about the agent feature"
}
Perfect! I've successfully read the ingest.py file and fetched the Firecrawl agent documentation. Here's a comprehensive breakdown of what I found:
## Current Implementation in ingest.py
**File location:** `/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py`
### API Configuration
- **Base URL:** `https://api.firecrawl.dev/v1`
- **Endpoint:** `/scrape` (line 144)
- **Authentication:** Bearer token via `FIRECRAWL_API_KEY` environment variable
### Request Format (lines 143-153)
```text
POST /scrape
Headers:
  Authorization: Bearer {FIRECRAWL_API_KEY}
  Content-Type: application/json
Payload:
{
  "url": url,
  "formats": ["markdown"]
}
```
### Response Parsing (lines 155-164)
- Expects JSON response with structure: `response.data.markdown`
- Extracts the markdown content from the response
- Falls back to empty list if no markdown is found
### Review Extraction Strategy (lines 167-212)
After getting markdown, the code uses `_extract_reviews_from_markdown()` to:
1. Split markdown on patterns: horizontal rules (`---`), asterisks (`***`), or headers (`#`)
2. Extract rating from star symbols (`★`) or patterns like "X/5" or "X out of 5"
3. Extract dates using multiple formats
4. Cap each review to 2000 characters
5. Guess platform from URL or use provided platform parameter
---
## Firecrawl Agent API
**Endpoint:** `https://api.firecrawl.dev/v2/agent`
### Key Differences from Current `/scrape` Endpoint
| Aspect | Current `/scrape` | Agent Endpoint |
|--------|-------------------|----------------|
| **API Version** | v1 | v2 |
| **Input** | URL + format selection | Natural language prompt (up to 10,000 chars) |
| **Autonomy** | Passive—scrapes single URL as-is | Active—searches, navigates, and gathers data autonomously |
| **Multi-page** | No built-in support | Built-in—processes multiple sources in parallel |
| **Output** | Raw markdown | Structured data matching optional JSON schema |
| **Models** | N/A | Two options: `spark-1-mini` (default, cheaper) or `spark-1-pro` |
### Request Format for Agent
```json
{
  "prompt": "natural language description",
  "urls": ["optional", "specific", "urls"],
  "schema": {JSON schema for structured output},
  "model": "spark-1-mini" or "spark-1-pro",
  "maxCredits": 2500
}
```
### Response Format for Agent
```json
{
  "success": boolean,
  "status": "processing|completed|failed|cancelled",
  "data": {extracted information matching schema},
  "creditsUsed": number,
  "expiresAt": "24-hour availability window"
}
```
### Important Notes
- **Pagination/Multi-step:** The agent documentation emphasizes it "autonomously searches and navigates deep into sites" and "processes multiple sources in parallel," but specific mechanics for handling pagination aren't detailed in the docs
- **Cost:** The agent has a credit-based model (tracked via `maxCredits` and `creditsUsed`)
- **Timeout:** Current implementation uses 60-second timeout; agent behavior under this constraint isn't specified
- **Schema Support:** Supports Pydantic or Zod schema definitions for structured extraction
---
## Summary of Key Insights
1. **Current approach** is simple and stateless—fetch URL, convert to markdown, regex-parse reviews
2. **Agent approach** is much more sophisticated—natural language driven, can navigate multi-page sites, returns structured data
3. **No pagination handling** in current code; agent claims to handle it autonomously
4. The agent would likely be better for complex review sites with JavaScript rendering, multi-page navigation, or varied HTML structures
Excellent! I have comprehensive information. Let me compile a detailed report for you.
## Firecrawl Agent Endpoint - Comprehensive Documentation
Based on my research, here's detailed information about Firecrawl's agent endpoint:
### 1. **API Endpoint URL & HTTP Method**
- **Endpoint:** `https://api.firecrawl.dev/v2/agent`
- **HTTP Method:** POST
- **Authentication:** Bearer token in Authorization header
### 2. **Request Payload Schema (All Parameters)**
| Parameter | Type | Required | Description | Notes |
|-----------|------|----------|-------------|-------|
| `prompt` | string | **Yes** | Describes the data you want to extract | Max 10,000 characters |
| `model` | string | No | Which AI model to use | `spark-1-mini` (default, 60% cheaper) or `spark-1-pro` (higher accuracy) |
| `urls` | array | No | Optional list of target pages to focus on | If omitted, agent discovers URLs via web search |
| `schema` | object | No | JSON schema for structured output | Define using Pydantic (Python) or Zod (Node.js) |
| `maxCredits` | number | No | Spending limit for the job | Default 2,500 credits |
| `strictConstrainToURLs` | boolean | No | Force agent to only visit provided URLs | Restricts autonomous navigation |
### 3. **Response Format**
**Successful Response (HTTP 200):**
```json
{
  "success": boolean,
  "id": "uuid",
  "status": "processing|completed|failed|cancelled",
  "data": { /* extracted results matching your schema */ },
  "creditsUsed": number,
  "expiresAt": "ISO 8601 timestamp"
}
```
**Key response characteristics:**
- Jobs run **asynchronously** — returns immediately with a Job ID
- Results are available for **24 hours** after completion
- Can be retrieved via the Job ID using polling
- Status can be: `processing`, `completed`, `failed`, or `cancelled`
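A minimal polling sketch, assuming the job-status shape above. The status-fetching call is injected so the loop is testable without the network; in real use it would wrap `GET /v2/agent/{job_id}` with an httpx client and the Bearer header:

```python
import asyncio
from typing import Awaitable, Callable

# Terminal states per the response characteristics above.
_TERMINAL = {"completed", "failed", "cancelled"}


async def wait_for_agent_job(
    fetch_status: Callable[[], Awaitable[dict]],
    interval: float = 2.0,
    max_polls: int = 60,
) -> dict:
    """Poll a status-fetching coroutine until the job leaves 'processing'.

    In real use, fetch_status would wrap GET /v2/agent/{job_id} (e.g. an
    httpx call returning resp.json()); injecting it keeps the loop pure.
    """
    body: dict = {}
    for _ in range(max_polls):
        body = await fetch_status()
        if body.get("status") in _TERMINAL:
            return body
        await asyncio.sleep(interval)
    # Still processing after max_polls; surface a synthetic timeout status.
    return {**body, "status": "timeout"}
```

The injected callable is an illustration pattern, not part of the documented API; only the status values and the poll-by-job-ID flow come from the docs above.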
**Error Responses:**
- **402:** Payment required (insufficient credits)
- **429:** Rate limit exceeded
### 4. **Pagination Handling**
The agent endpoint handles pagination **automatically and autonomously**:
- The agent itself navigates through paginated content by clicking buttons, following "next" links, and exploring multiple pages
- No explicit pagination parameters needed in the request
- The agent "clicks through authentication flows, navigates nested menus, handles dropdowns, and explores multi-step processes"
- This differs from the earlier `/extract` endpoint which required manual pagination configuration
- All results from multiple pages are automatically aggregated and structured according to your schema
### 5. **Structured Data Extraction**
The agent extracts structured data through:
- **Schema-based extraction:** You provide a JSON schema (Pydantic for Python, Zod for Node.js) defining the desired output structure
- **Natural language prompts:** Simply describe what data you need; the AI understands context and intent
- **Autonomous processing:** The agent "plans and takes actions to uncover data" by navigating websites intelligently
- **Format:** Results conform strictly to your specified schema, delivered as clean JSON ready for downstream processing
- **No pre-specified URLs required:** The agent can search the web to discover relevant websites before extracting data
### 6. **Python SDK Usage Examples**
**Basic Agent Usage:**
```python
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")

# Simple autonomous data gathering
result = firecrawl.agent(
    prompt="Find the founders of Stripe"
)
print(result.data)

# With structured output schema
result = firecrawl.agent(
    prompt="Find all product names and prices from an e-commerce site",
    schema={
        "products": [
            {
                "name": "string",
                "price": "number"
            }
        ]
    }
)
print(result.data)

# With specific URLs to focus on
result = firecrawl.agent(
    prompt="Extract all job postings",
    urls=["https://example.com/careers"],
    model="spark-1-pro"  # Higher accuracy for complex extraction
)
```
**Installation:**
```bash
pip install firecrawl-py
```
### 7. **Pricing & Rate Limits**
**Pricing Model (Research Preview):**
- **Credit-based billing** with dynamic costs based on task complexity
- **Simple single-page tasks:** Fewer credits consumed
- **Complex multi-domain research:** Higher credit consumption
- **Typical runs:** Consume "a few hundred credits"
- **Free daily runs:** 5 per day included
- **Parallel processing (Spark-1 Fast):** 10 credits per cell
**Rate Limits:**
- **Free tier:** 5 runs per day
- **Paid tier:** 10 requests per minute maximum (from `/extract` endpoint which uses similar infrastructure)
- **402 errors** returned when credit limit exceeded
**Note:** Firecrawl provides a credit calculator on their website to estimate costs for specific tasks.
### 8. **Differences from Regular Scrape/Crawl Endpoints**
| Feature | `/scrape` | `/crawl` | `/agent` |
|---------|-----------|----------|---------|
| **URLs Required** | Yes (single) | Yes (starting URL) | No — uses web search |
| **Navigation** | None | Follows links within domain | Autonomous multi-site navigation |
| **Pagination** | Manual | Manual config required | Automatic autonomous |
| **Form Interaction** | No | No | Yes — clicks, fills forms |
| **Authentication Flows** | No | No | Yes — handles login flows |
| **Web Search** | No | No | Yes — discovers relevant sites |
| **Output Format** | Markdown/HTML | Markdown/HTML | Structured JSON (schema-based) |
| **Schema Support** | No | No | Yes (optional, for structured output) |
| **Model Options** | N/A | N/A | spark-1-mini or spark-1-pro |
| **Job Processing** | Synchronous | Asynchronous | Asynchronous with 24h availability |
| **Primary Use Case** | Extract from known page | Crawl known site structure | Autonomous research & extraction |
### Key Advantages of `/agent`
- **Successor to `/extract`:** Faster, more reliable, no URLs required
- **Autonomous discovery:** Searches for and finds data automatically
- **Complex interaction support:** Navigates authentication, forms, dynamic content
- **Multi-page intelligence:** Aggregates data across paginated results
- **Natural language:** Just describe what you need in plain language
- **Type-safe output:** Strict schema compliance for downstream processing
---
## Sources:
- [Agent | Firecrawl Features](https://docs.firecrawl.dev/features/agent)
- [Agent API Reference](https://docs.firecrawl.dev/api-reference/endpoint/agent)
- [FIRE-1 AI Agent Documentation](https://docs.firecrawl.dev/agents/fire-1)
- [Python SDK Documentation](https://docs.firecrawl.dev/sdks/python)
- [Introducing /agent Blog Post](https://www.firecrawl.dev/blog/introducing-agent)
- [How to Build AI Agents with Claude and Firecrawl](https://www.firecrawl.dev/blog/claude-agent-sdk-firecrawl)
- [Firecrawl Python SDK Repository](https://github.com/firecrawl/firecrawl-py)
- [Firecrawl Main Repository](https://github.com/firecrawl/firecrawl)
{
"subagent_type": "Plan",
"prompt": "I need to plan replacing the current Firecrawl `/v1/scrape` call with the `/v2/agent` endpoint in a ReviewLens project.\n\n## Current Implementation (app/ingest.py)\n\nThe current `scrape_url()` function:\n1. Calls `POST https://api.firecrawl.dev/v1/scrape` with `{\"url\": url, \"formats\": [\"markdown\"]}`\n2. Gets back markdown\n3. Uses regex-based `_extract_reviews_from_markdown()` to split the markdown into review blocks and extract ratings/dates\n4. Returns `list[Review]`\n\nThis is fragile — the regex parsing is best-effort and review platforms have varied structures.\n\n## Firecrawl Agent Endpoint\n\n**Endpoint:** `POST https://api.firecrawl.dev/v2/agent`\n\n**Key characteristics:**\n- Accepts a natural language `prompt` (up to 10,000 chars) describing what to extract\n- Accepts optional `urls` array to focus on specific pages\n- Accepts optional `schema` (JSON schema) for structured output\n- Has `model` param: `spark-1-mini` (default, cheaper) or `spark-1-pro`\n- Has `maxCredits` param (default 2500)\n- **Asynchronous** — returns a job ID, need to poll for results\n- Handles pagination autonomously\n- Can navigate multi-page content, click through pages\n- Results available for 24 hours\n\n**Response:**\n```json\n{\n \"success\": true,\n \"id\": \"job-uuid\",\n \"status\": \"processing|completed|failed|cancelled\", \n \"data\": { extracted data matching schema },\n \"creditsUsed\": number,\n \"expiresAt\": \"ISO timestamp\"\n}\n```\n\n**Python SDK available:** `firecrawl-py` package with `firecrawl.agent()` method.\n\n## Requirements\n\n1. Replace the current `scrape_url()` in `app/ingest.py` to use the agent endpoint\n2. Use a JSON schema that maps to our Review model (text, rating, date, author, platform)\n3. Handle the async nature — poll for completion\n4. Keep `scrape_url()` as an async function returning `list[Review]`\n5. This is best-effort secondary path (CSV is primary), so don't over-engineer error handling\n6. 
The assignment says zero financial cost — we should use `spark-1-mini` (cheaper) and set a reasonable `maxCredits` cap\n7. Consider whether to use raw httpx calls or the `firecrawl-py` SDK\n\n## Constraints\n- The function is called from FastAPI async route handlers\n- Keep the same function signature: `async def scrape_url(url: str, platform: str = \"\") -> list[Review]`\n- The `_guess_platform()` helper can be reused\n- The `_parse_date()` and `_parse_rating()` helpers may still be useful for normalizing agent output\n\n## Questions to address in the plan\n1. Raw httpx vs firecrawl-py SDK — which is better here?\n2. What should the prompt look like to get good review extraction?\n3. What JSON schema should we send?\n4. How to handle polling — what timeout, what interval?\n5. Should we keep the old markdown-based fallback or remove it entirely?"
}
Now I have a complete picture. Let me formulate the plan.
---
## Implementation Plan: Replace Firecrawl v1/scrape with v2/agent
### Answers to Design Questions
**1. Raw httpx vs firecrawl-py SDK**
Use raw httpx. Reasons:
- The project already depends on httpx and uses it for the current Firecrawl call.
- Adding `firecrawl-py` introduces a new dependency for exactly two HTTP calls (submit job + poll).
- The SDK's `firecrawl.agent()` method is synchronous (or carries its own async machinery), which may not integrate cleanly with the existing `async with httpx.AsyncClient` pattern.
- The v2/agent API surface is simple: one POST to submit, one GET to poll. Two requests do not justify an SDK.
**2. Prompt Design**
The prompt should be specific about what constitutes a "review" and what fields to extract. Something like:
```
Extract all customer/user reviews from this page. For each review, extract:
- The full review text (the main body of what the reviewer wrote)
- The star rating as a number from 1 to 5 (if present)
- The date the review was posted (in YYYY-MM-DD format if possible)
- The author/reviewer name (if present)
Only extract actual user-submitted reviews. Do not extract editorial content, product descriptions, or page navigation text. If reviews span multiple pages, navigate through all available pages.
```
This is under 600 characters, well within the 10,000 char limit. The platform hint from `_guess_platform()` can be prepended (e.g., "This is an Amazon product page.") to help the agent understand context.
**3. JSON Schema**
The schema sent to the agent endpoint should match this structure:
```json
{
  "type": "object",
  "properties": {
    "reviews": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "text": {"type": "string", "description": "Full review text"},
          "rating": {"type": ["number", "null"], "description": "Star rating 1-5"},
          "date": {"type": ["string", "null"], "description": "Date in YYYY-MM-DD format"},
          "author": {"type": ["string", "null"], "description": "Reviewer name"}
        },
        "required": ["text"]
      }
    }
  },
  "required": ["reviews"]
}
```
This keeps it simple: an object with a `reviews` array. Each item has `text` (required) plus optional `rating`, `date`, `author`.
**4. Polling Strategy**
- **Interval:** 2 seconds between polls. The agent needs time to navigate pages, and more frequent polling wastes API calls.
- **Timeout:** 120 seconds total (double the current `httpx.AsyncClient(timeout=60.0)` setting, since the agent does more work). Use `asyncio.sleep(2)` between polls.
- **Max polls:** 60 (120s / 2s). After that, raise or return empty.
- Use the same job URL: `GET https://api.firecrawl.dev/v2/agent/{job_id}` with the auth header.
**5. Keep the old fallback or remove it?**
Remove it entirely. The old `_extract_reviews_from_markdown()` function and the regex-based parsing are the whole reason for this change -- they are fragile. The agent endpoint replaces both the scraping and the parsing. Delete `_extract_reviews_from_markdown()`. Keep `_guess_platform()`, `_parse_date()`, and `_parse_rating()` as they are useful for normalizing agent output.
### Step-by-Step Implementation
**Step 1: Update constants**
In `app/ingest.py`, change:
- `FIRECRAWL_API_URL` from `"https://api.firecrawl.dev/v1"` to `"https://api.firecrawl.dev/v2"`
Add new constants:
- `_AGENT_POLL_INTERVAL = 2` (seconds)
- `_AGENT_TIMEOUT = 120` (seconds)
- `_AGENT_MAX_CREDITS = 500` (conservative cap; `spark-1-mini` default is 2500 but we want to stay cheap)
Add import: `import asyncio` (for `asyncio.sleep`).
**Step 2: Define the extraction schema as a module-level dict**
Create `_REVIEW_SCHEMA` as the JSON schema dict described above.
**Step 3: Define the prompt builder**
A small helper `_build_agent_prompt(url: str, platform: str) -> str` that constructs the extraction prompt, optionally including the platform name for context.
**Step 4: Rewrite `scrape_url()`**
The new flow:
1. Check `FIRECRAWL_API_KEY` (same as now).
2. POST to `{FIRECRAWL_API_URL}/agent` with body:
```json
{
  "urls": [url],
  "prompt": _build_agent_prompt(url, platform),
  "schema": _REVIEW_SCHEMA,
  "model": "spark-1-mini",
  "maxCredits": _AGENT_MAX_CREDITS
}
```
3. Extract `job_id` from the response.
4. Poll `GET {FIRECRAWL_API_URL}/agent/{job_id}` every `_AGENT_POLL_INTERVAL` seconds.
5. On `status == "completed"`, extract `data["reviews"]` from the response.
6. On `status == "failed"` or timeout, return `[]`.
7. Convert each raw review dict into a `Review` object, using `_parse_date()` and `_parse_rating()` to normalize values. Use `_guess_platform(url)` if platform not provided.
8. Return the list.
The httpx client timeout should be set to 30s per individual request (not the total polling duration). Use a single `httpx.AsyncClient` context for both the submit and all poll requests.
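The submission half of that flow can be sketched as a pure request builder. Returning kwargs keeps it testable apart from the network call; `_REVIEW_SCHEMA`, `_build_agent_prompt`, and `_AGENT_MAX_CREDITS` named in the docstring are the planned module-level definitions, not existing code:

```python
FIRECRAWL_API_URL = "https://api.firecrawl.dev/v2"


def build_agent_request(
    url: str,
    prompt: str,
    schema: dict,
    api_key: str,
    max_credits: int = 500,
) -> dict:
    """Assemble the POST /v2/agent submission from steps 1-2 of the flow.

    Returns kwargs for httpx.AsyncClient.post(**kwargs); in ingest.py the
    prompt/schema arguments would be _build_agent_prompt(...) and
    _REVIEW_SCHEMA, and max_credits would be _AGENT_MAX_CREDITS.
    """
    return {
        "url": f"{FIRECRAWL_API_URL}/agent",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "urls": [url],
            "prompt": prompt,
            "schema": schema,
            "model": "spark-1-mini",
            "maxCredits": max_credits,
        },
    }
```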
**Step 5: Remove dead code**
Delete `_extract_reviews_from_markdown()` entirely. It is only called from the old `scrape_url()`.
**Step 6: No changes needed to `app/main.py`**
The call site at line 156 (`reviews = await scrape_url(url, platform=platform)`) remains identical. The function signature is preserved.
**Step 7: No changes to `requirements.txt`**
No new dependencies. `httpx` is already present. We are not using `firecrawl-py`.
### Potential Challenges
- **Agent job duration:** The agent may take longer than 120 seconds for sites with many review pages. The 120-second timeout is a reasonable starting point; if it proves insufficient, increase it. The FastAPI `/scrape` endpoint holds the connection open during polling; the frontend uses HTMX with a loading state (based on the patterns visible in main.py), so the user will see the loading spinner during the wait.
- **Schema compliance:** The agent may not perfectly follow the schema. Defensive code is needed when iterating `data["reviews"]` -- skip entries without `text`, handle missing fields gracefully.
- **Credit exhaustion:** With `maxCredits: 500` and `spark-1-mini`, a single extraction should cost well under that. But if the free tier has a low monthly cap, multiple scrapes could exhaust it. This is acceptable since URL scraping is the secondary path.
- **Rate limits:** The polling loop makes ~60 GET requests over 120 seconds. This should be fine for Firecrawl's API.
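The defensive parsing called for in the schema-compliance bullet above can be sketched as a small normalizer (a hypothetical helper, not existing project code; field names follow the planned schema):

```python
def normalize_agent_reviews(data: dict) -> list[dict]:
    """Keep only well-formed review entries from an agent response.

    Skips non-dict entries and entries without text, and coerces rating to
    float where possible. Date and author pass through unchanged for the
    existing _parse_date()/_parse_rating()-style normalization downstream.
    """
    out: list[dict] = []
    for raw in data.get("reviews") or []:
        if not isinstance(raw, dict):
            continue
        text = (raw.get("text") or "").strip()
        if not text:
            continue  # schema marks text as required; drop anything without it
        rating = raw.get("rating")
        try:
            rating = float(rating) if rating is not None else None
        except (TypeError, ValueError):
            rating = None
        out.append(
            {"text": text, "rating": rating, "date": raw.get("date"), "author": raw.get("author")}
        )
    return out
```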
### Critical Files for Implementation
- `/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py` - Primary file to modify: rewrite `scrape_url()`, add schema/prompt constants, remove `_extract_reviews_from_markdown()`
- `/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/models.py` - Review model definition that the JSON schema must align with
- `/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/main.py` - Call site for `scrape_url()` at line 156; verify no changes needed
- `/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/requirements.txt` - Confirm no new dependencies are needed
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py",
"offset": 120,
"limit": 95
}
 120→                rating=_parse_rating(rating_raw),
 121→                date=_parse_date(date_raw),
 122→                author=author,
 123→                platform=plat,
 124→                metadata=metadata,
 125→            )
 126→        )
 127→
 128→    return reviews
 129→
 130→
 131→# ── Firecrawl URL Scraping (secondary, best-effort) ─────────────────
 132→
 133→FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
 134→FIRECRAWL_API_URL = "https://api.firecrawl.dev/v1"
 135→
 136→
 137→async def scrape_url(url: str, platform: str = "") -> list[Review]:
 138→    """Best-effort URL scraping via Firecrawl. Returns whatever we can get."""
 139→    if not FIRECRAWL_API_KEY:
 140→        raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
 141→
 142→    async with httpx.AsyncClient(timeout=60.0) as client:
 143→        resp = await client.post(
 144→            f"{FIRECRAWL_API_URL}/scrape",
 145→            headers={
 146→                "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
 147→                "Content-Type": "application/json",
 148→            },
 149→            json={
 150→                "url": url,
 151→                "formats": ["markdown"],
 152→            },
 153→        )
 154→        resp.raise_for_status()
 155→        data = resp.json()
 156→
 157→    markdown = data.get("data", {}).get("markdown", "")
 158→    if not markdown:
 159→        return []
 160→
 161→    # Try to extract individual reviews from the markdown.
 162→    # This is best-effort — review platforms have varied structures.
 163→    reviews = _extract_reviews_from_markdown(markdown, platform, url)
 164→    return reviews
 165→
 166→
 167→def _extract_reviews_from_markdown(
 168→    markdown: str, platform: str, url: str
 169→) -> list[Review]:
 170→    """Attempt to parse review blocks from scraped markdown."""
 171→    reviews: list[Review] = []
 172→
 173→    # Strategy: split on patterns that look like review boundaries.
 174→    # Common patterns: "★★★★☆", "Rating: X/5", numbered reviews, horizontal rules
 175→    blocks = re.split(r"\n(?:---+|\*\*\*+|#{1,3}\s)", markdown)
 176→
 177→    for i, block in enumerate(blocks):
 178→        block = block.strip()
 179→        if len(block) < 20:
 180→            continue
 181→
 182→        # Try to extract rating
 183→        rating = None
 184→        star_match = re.search(r"([★]{1,5})", block)
 185→        if star_match:
 186→            rating = float(len(star_match.group(1)))
 187→        else:
 188→            rating_match = re.search(r"(\d(?:\.\d)?)\s*(?:/\s*5|out of 5|stars?)", block, re.I)
 189→            if rating_match:
 190→                rating = float(rating_match.group(1))
 191→
 192→        # Try to extract date
 193→        date = None
 194→        date_match = re.search(
 195→            r"(\w+ \d{1,2},? \d{4}|\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})",
 196→            block,
 197→        )
 198→        if date_match:
 199→            date = _parse_date(date_match.group(1))
 200→
 201→        reviews.append(
 202→            Review(
 203→                id=f"scraped_{i}",
 204→                text=block[:2000], # Cap individual review length
 205→                rating=rating,
 206→                date=date,
 207→                platform=platform or _guess_platform(url),
 208→                metadata={"source_url": url},
 209→            )
 210→        )
 211→
 212→    return reviews
 213→
 214→
# Replace Firecrawl /scrape with /agent endpoint
## Context
The current URL scraping in `app/ingest.py` uses Firecrawl's `/v1/scrape` endpoint which returns raw markdown, then tries to regex-parse review blocks from it. This is fragile — review platforms have varied HTML structures and the regex splitting misses most of them. Firecrawl's newer `/v2/agent` endpoint is purpose-built for this: it accepts a natural language prompt + JSON schema, autonomously navigates paginated content, and returns structured data directly. This eliminates the brittle markdown-to-reviews parsing entirely.
## Changes
**Single file modified:** `app/ingest.py`
### 1. Update constants
- `FIRECRAWL_API_URL` → `https://api.firecrawl.dev/v2`
- Add `_AGENT_POLL_INTERVAL = 2` (seconds)
- Add `_AGENT_TIMEOUT = 120` (seconds)
- Add `_AGENT_MAX_CREDITS = 500` (conservative cap for zero-cost constraint)
- Add `import asyncio`
### 2. Add review extraction schema
Module-level `_REVIEW_SCHEMA` dict — an object with a `reviews` array, each item having `text` (required), `rating`, `date`, `author` (optional).
### 3. Add prompt builder
`_build_agent_prompt(url, platform)` — constructs a focused extraction prompt telling the agent to find all user reviews with text/rating/date/author fields and navigate pagination. Prepends platform context if known.
### 4. Rewrite `scrape_url()`
New flow (same signature: `async def scrape_url(url, platform) -> list[Review]`):
1. POST to `/agent` with `urls=[url]`, prompt, schema, `model="spark-1-mini"`, `maxCredits=500`
2. Extract job ID from response
3. Poll `GET /agent/{job_id}` every 2s, timeout at 120s
4. On `completed` → parse `data["reviews"]` into `Review` objects using existing `_parse_date()` / `_parse_rating()` for normalization
5. On `failed` / timeout → return `[]`
Uses raw `httpx` (already a dependency) — no need for `firecrawl-py` SDK for two HTTP calls.
### 5. Delete `_extract_reviews_from_markdown()`
No longer needed. The agent returns structured data directly.
### 6. Keep unchanged
- `_guess_platform()`, `_parse_date()`, `_parse_rating()` — reused for normalizing agent output
- `parse_csv()`, `build_summary()` — untouched
- `app/main.py` call site — `scrape_url()` signature unchanged
- `requirements.txt` — no new deps
## Verification
1. Set `FIRECRAWL_API_KEY` in `.env`
2. Run the app: `uvicorn app.main:app --reload --reload-exclude '.venv'`
3. Open modal → URL Scrape tab → paste a review page URL → submit
4. Confirm it either returns reviews (redirects to chat with data) or shows a graceful error
5. Verify CSV upload still works unchanged
File created successfully at: /Users/shaynelarocque/.claude/plans/splendid-discovering-bentley.md
{
"plan": "# Replace Firecrawl /scrape with /agent endpoint\n\n## Context\n\nThe current URL scraping in `app/ingest.py` uses Firecrawl's `/v1/scrape` endpoint which returns raw markdown, then tries to regex-parse review blocks from it. This is fragile — review platforms have varied HTML structures and the regex splitting misses most of them. Firecrawl's newer `/v2/agent` endpoint is purpose-built for this: it accepts a natural language prompt + JSON schema, autonomously navigates paginated content, and returns structured data directly. This eliminates the brittle markdown-to-reviews parsing entirely.\n\n## Changes\n\n**Single file modified:** `app/ingest.py`\n\n### 1. Update constants\n\n- `FIRECRAWL_API_URL` → `https://api.firecrawl.dev/v2`\n- Add `_AGENT_POLL_INTERVAL = 2` (seconds)\n- Add `_AGENT_TIMEOUT = 120` (seconds)\n- Add `_AGENT_MAX_CREDITS = 500` (conservative cap for zero-cost constraint)\n- Add `import asyncio`\n\n### 2. Add review extraction schema\n\nModule-level `_REVIEW_SCHEMA` dict — an object with a `reviews` array, each item having `text` (required), `rating`, `date`, `author` (optional).\n\n### 3. Add prompt builder\n\n`_build_agent_prompt(url, platform)` — constructs a focused extraction prompt telling the agent to find all user reviews with text/rating/date/author fields and navigate pagination. Prepends platform context if known.\n\n### 4. Rewrite `scrape_url()`\n\nNew flow (same signature: `async def scrape_url(url, platform) -> list[Review]`):\n\n1. POST to `/agent` with `urls=[url]`, prompt, schema, `model=\"spark-1-mini\"`, `maxCredits=500`\n2. Extract job ID from response\n3. Poll `GET /agent/{job_id}` every 2s, timeout at 120s\n4. On `completed` → parse `data[\"reviews\"]` into `Review` objects using existing `_parse_date()` / `_parse_rating()` for normalization\n5. On `failed` / timeout → return `[]`\n\nUses raw `httpx` (already a dependency) — no need for `firecrawl-py` SDK for two HTTP calls.\n\n### 5. 
Delete `_extract_reviews_from_markdown()`\n\nNo longer needed. The agent returns structured data directly.\n\n### 6. Keep unchanged\n\n- `_guess_platform()`, `_parse_date()`, `_parse_rating()` — reused for normalizing agent output\n- `parse_csv()`, `build_summary()` — untouched\n- `app/main.py` call site — `scrape_url()` signature unchanged\n- `requirements.txt` — no new deps\n\n## Verification\n\n1. Set `FIRECRAWL_API_KEY` in `.env`\n2. Run the app: `uvicorn app.main:app --reload --reload-exclude '.venv'`\n3. Open modal → URL Scrape tab → paste a review page URL → submit\n4. Confirm it either returns reviews (redirects to chat with data) or shows a graceful error\n5. Verify CSV upload still works unchanged\n",
"planFilePath": "/Users/shaynelarocque/.claude/plans/splendid-discovering-bentley.md"
}
User has approved your plan. You can now start coding. Start with updating your todo list if applicable.
Your plan has been saved to: /Users/shaynelarocque/.claude/plans/splendid-discovering-bentley.md
You can refer back to it if needed during implementation.
## Approved Plan:
# Replace Firecrawl /scrape with /agent endpoint
## Context
The current URL scraping in `app/ingest.py` uses Firecrawl's `/v1/scrape` endpoint which returns raw markdown, then tries to regex-parse review blocks from it. This is fragile — review platforms have varied HTML structures and the regex splitting misses most of them. Firecrawl's newer `/v2/agent` endpoint is purpose-built for this: it accepts a natural language prompt + JSON schema, autonomously navigates paginated content, and returns structured data directly. This eliminates the brittle markdown-to-reviews parsing entirely.
## Changes
**Single file modified:** `app/ingest.py`
### 1. Update constants
- `FIRECRAWL_API_URL` → `https://api.firecrawl.dev/v2`
- Add `_AGENT_POLL_INTERVAL = 2` (seconds)
- Add `_AGENT_TIMEOUT = 120` (seconds)
- Add `_AGENT_MAX_CREDITS = 500` (conservative cap for zero-cost constraint)
- Add `import asyncio`
### 2. Add review extraction schema
Module-level `_REVIEW_SCHEMA` dict — an object with a `reviews` array, each item having `text` (required), `rating`, `date`, `author` (optional).
### 3. Add prompt builder
`_build_agent_prompt(url, platform)` — constructs a focused extraction prompt telling the agent to find all user reviews with text/rating/date/author fields and navigate pagination. Prepends platform context if known.
### 4. Rewrite `scrape_url()`
New flow (same signature: `async def scrape_url(url, platform) -> list[Review]`):
1. POST to `/agent` with `urls=[url]`, prompt, schema, `model="spark-1-mini"`, `maxCredits=500`
2. Extract job ID from response
3. Poll `GET /agent/{job_id}` every 2s, timeout at 120s
4. On `completed` → parse `data["reviews"]` into `Review` objects using existing `_parse_date()` / `_parse_rating()` for normalization
5. On `failed` / timeout → return `[]`
Uses raw `httpx` (already a dependency) — no need for `firecrawl-py` SDK for two HTTP calls.
### 5. Delete `_extract_reviews_from_markdown()`
No longer needed. The agent returns structured data directly.
### 6. Keep unchanged
- `_guess_platform()`, `_parse_date()`, `_parse_rating()` — reused for normalizing agent output
- `parse_csv()`, `build_summary()` — untouched
- `app/main.py` call site — `scrape_url()` signature unchanged
- `requirements.txt` — no new deps
## Verification
1. Set `FIRECRAWL_API_KEY` in `.env`
2. Run the app: `uvicorn app.main:app --reload --reload-exclude '.venv'`
3. Open modal → URL Scrape tab → paste a review page URL → submit
4. Confirm it either returns reviews (redirects to chat with data) or shows a graceful error
5. Verify CSV upload still works unchanged
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py"
}
 1→"""Ingestion module: CSV parsing and Firecrawl URL scraping."""
2→
3→from __future__ import annotations
4→
5→import csv
6→import io
7→import os
8→import re
9→import uuid
10→from collections import Counter
11→from datetime import datetime
12→from typing import Any
13→
14→import httpx
15→
16→from .models import IngestionSummary, Review
17→
18→
19→# ── CSV Ingestion (primary path) ────────────────────────────────────
20→
21→# Common column name variants we normalise to our schema.
22→_COL_MAP: dict[str, list[str]] = {
23→ "text": ["text", "review", "review_text", "content", "body", "comment", "review_body", "reviews", "feedback"],
24→ "rating": ["rating", "score", "stars", "star_rating", "review_rating", "overall_rating"],
25→ "date": ["date", "review_date", "created_at", "timestamp", "time", "posted_date", "review_time"],
26→ "author": ["author", "reviewer", "user", "username", "reviewer_name", "name", "user_name"],
27→ "platform": ["platform", "source", "site", "channel"],
28→}
29→
30→
31→def _normalise_col(name: str) -> str:
32→ return re.sub(r"[^a-z0-9]", "", name.lower().strip())
33→
34→
35→def _map_columns(headers: list[str]) -> dict[str, str]:
36→ """Map CSV column names → our field names. Returns {our_field: csv_col}."""
37→ mapping: dict[str, str] = {}
38→ normalised = {_normalise_col(h): h for h in headers}
39→
40→ for field, aliases in _COL_MAP.items():
41→ for alias in aliases:
42→ norm = _normalise_col(alias)
43→ if norm in normalised:
44→ mapping[field] = normalised[norm]
45→ break
46→ return mapping
47→
48→
49→def _parse_date(val: str) -> datetime | None:
50→ for fmt in (
51→ "%Y-%m-%d",
52→ "%Y-%m-%dT%H:%M:%S",
53→ "%Y-%m-%dT%H:%M:%SZ",
54→ "%Y-%m-%d %H:%M:%S",
55→ "%m/%d/%Y",
56→ "%d/%m/%Y",
57→ "%B %d, %Y",
58→ "%b %d, %Y",
59→ "%Y/%m/%d",
60→ ):
61→ try:
62→ return datetime.strptime(val.strip(), fmt)
63→ except (ValueError, AttributeError):
64→ continue
65→ return None
66→
67→
68→def _parse_rating(val: str) -> float | None:
69→ try:
70→ # Handle "4/5", "4 out of 5", or plain "4.5"
71→ val = val.strip()
72→ match = re.match(r"([\d.]+)\s*(?:/|out of)\s*\d+", val)
73→ if match:
74→ return float(match.group(1))
75→ return float(val)
76→ except (ValueError, TypeError):
77→ return None
78→
79→
80→def parse_csv(content: str | bytes, platform: str = "", product_name: str = "") -> list[Review]:
81→ """Parse CSV content into Review objects. Handles flexible column names."""
82→ if isinstance(content, bytes):
83→ content = content.decode("utf-8-sig") # Handle BOM
84→
85→ reader = csv.DictReader(io.StringIO(content))
86→ if not reader.fieldnames:
87→ return []
88→
89→ col_map = _map_columns(list(reader.fieldnames))
90→
91→ if "text" not in col_map:
92→ # Try: if there's only one text-like long column, use it
93→ for h in reader.fieldnames:
94→ if h not in col_map.values():
95→ col_map["text"] = h
96→ break
97→
98→ if "text" not in col_map:
99→ return []
100→
101→ reviews: list[Review] = []
102→ for i, row in enumerate(reader):
103→ text = row.get(col_map.get("text", ""), "").strip()
104→ if not text:
105→ continue
106→
107→ rating_raw = row.get(col_map.get("rating", ""), "")
108→ date_raw = row.get(col_map.get("date", ""), "")
109→ author = row.get(col_map.get("author", ""), "").strip()
110→ plat = row.get(col_map.get("platform", ""), "").strip() or platform
111→
112→ # Collect unmapped columns as metadata
113→ mapped_cols = set(col_map.values())
114→ metadata = {k: v for k, v in row.items() if k not in mapped_cols and v}
115→
116→ reviews.append(
117→ Review(
118→ id=f"review_{i}",
119→ text=text,
120→ rating=_parse_rating(rating_raw),
121→ date=_parse_date(date_raw),
122→ author=author,
123→ platform=plat,
124→ metadata=metadata,
125→ )
126→ )
127→
128→ return reviews
129→
130→
131→# ── Firecrawl URL Scraping (secondary, best-effort) ─────────────────
132→
133→FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
134→FIRECRAWL_API_URL = "https://api.firecrawl.dev/v1"
135→
136→
137→async def scrape_url(url: str, platform: str = "") -> list[Review]:
138→ """Best-effort URL scraping via Firecrawl. Returns whatever we can get."""
139→ if not FIRECRAWL_API_KEY:
140→ raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
141→
142→ async with httpx.AsyncClient(timeout=60.0) as client:
143→ resp = await client.post(
144→ f"{FIRECRAWL_API_URL}/scrape",
145→ headers={
146→ "Authorization": f"Bearer {FIRECRAWL_API_KEY}",
147→ "Content-Type": "application/json",
148→ },
149→ json={
150→ "url": url,
151→ "formats": ["markdown"],
152→ },
153→ )
154→ resp.raise_for_status()
155→ data = resp.json()
156→
157→ markdown = data.get("data", {}).get("markdown", "")
158→ if not markdown:
159→ return []
160→
161→ # Try to extract individual reviews from the markdown.
162→ # This is best-effort — review platforms have varied structures.
163→ reviews = _extract_reviews_from_markdown(markdown, platform, url)
164→ return reviews
165→
166→
167→def _extract_reviews_from_markdown(
168→ markdown: str, platform: str, url: str
169→) -> list[Review]:
170→ """Attempt to parse review blocks from scraped markdown."""
171→ reviews: list[Review] = []
172→
173→ # Strategy: split on patterns that look like review boundaries.
174→ # Common patterns: "★★★★☆", "Rating: X/5", numbered reviews, horizontal rules
175→ blocks = re.split(r"\n(?:---+|\*\*\*+|#{1,3}\s)", markdown)
176→
177→ for i, block in enumerate(blocks):
178→ block = block.strip()
179→ if len(block) < 20:
180→ continue
181→
182→ # Try to extract rating
183→ rating = None
184→ star_match = re.search(r"([★]{1,5})", block)
185→ if star_match:
186→ rating = float(len(star_match.group(1)))
187→ else:
188→ rating_match = re.search(r"(\d(?:\.\d)?)\s*(?:/\s*5|out of 5|stars?)", block, re.I)
189→ if rating_match:
190→ rating = float(rating_match.group(1))
191→
192→ # Try to extract date
193→ date = None
194→ date_match = re.search(
195→ r"(\w+ \d{1,2},? \d{4}|\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})",
196→ block,
197→ )
198→ if date_match:
199→ date = _parse_date(date_match.group(1))
200→
201→ reviews.append(
202→ Review(
203→ id=f"scraped_{i}",
204→ text=block[:2000], # Cap individual review length
205→ rating=rating,
206→ date=date,
207→ platform=platform or _guess_platform(url),
208→ metadata={"source_url": url},
209→ )
210→ )
211→
212→ return reviews
213→
214→
215→def _guess_platform(url: str) -> str:
216→ url_lower = url.lower()
217→ if "amazon" in url_lower:
218→ return "Amazon"
219→ if "google" in url_lower and "maps" in url_lower:
220→ return "Google Maps"
221→ if "g2" in url_lower:
222→ return "G2"
223→ if "capterra" in url_lower:
224→ return "Capterra"
225→ if "yelp" in url_lower:
226→ return "Yelp"
227→ if "trustpilot" in url_lower:
228→ return "Trustpilot"
229→ return "Web"
230→
231→
232→# ── Summary builder ──────────────────────────────────────────────────
233→
234→def build_summary(
235→ reviews: list[Review],
236→ source_type: str,
237→ product_name: str = "",
238→ platform: str = "",
239→) -> IngestionSummary:
240→ """Build an ingestion summary from parsed reviews."""
241→ if not reviews:
242→ return IngestionSummary(source_type=source_type)
243→
244→ ratings = [r.rating for r in reviews if r.rating is not None]
245→ dates = [r.date for r in reviews if r.date is not None]
246→
247→ # Rating distribution (bucket by integer star)
248→ dist: dict[str, int] = {}
249→ for r in ratings:
250→ bucket = str(int(round(r)))
251→ dist[bucket] = dist.get(bucket, 0) + 1
252→
253→ # Date range
254→ date_range = ""
255→ if dates:
256→ earliest = min(dates).strftime("%Y-%m-%d")
257→ latest = max(dates).strftime("%Y-%m-%d")
258→ date_range = f"{earliest} to {latest}" if earliest != latest else earliest
259→
260→ # Platform detection
261→ platforms = [r.platform for r in reviews if r.platform]
262→ detected_platform = platform or (Counter(platforms).most_common(1)[0][0] if platforms else "Unknown")
263→
264→ return IngestionSummary(
265→ total_reviews=len(reviews),
266→ date_range=date_range,
267→ rating_distribution=dist,
268→ average_rating=round(sum(ratings) / len(ratings), 2) if ratings else None,
269→ platform=detected_platform,
270→ product_name=product_name or "Unknown Product",
271→ source_type=source_type,
272→ )
 273→
Now I'll replace the Firecrawl section. I need to add asyncio import, rewrite constants + scrape_url(), add schema/prompt, and delete _extract_reviews_from_markdown().
import csv
import io
import os
import re
import uuid
from collections import Counter
from datetime import datetime
from typing import Any

import httpx
import asyncio
import csv
import io
import os
import re
import uuid
from collections import Counter
from datetime import datetime
from typing import Any

import httpx
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
# ── Firecrawl URL Scraping (secondary, best-effort) ─────────────────
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_API_URL = "https://api.firecrawl.dev/v1"
async def scrape_url(url: str, platform: str = "") -> list[Review]:
"""Best-effort URL scraping via Firecrawl. Returns whatever we can get."""
if not FIRECRAWL_API_KEY:
raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
async with httpx.AsyncClient(timeout=60.0) as client:
resp = await client.post(
f"{FIRECRAWL_API_URL}/scrape",
headers={
"Authorization": f"Bearer {FIRECRAWL_API_KEY}",
"Content-Type": "application/json",
},
json={
"url": url,
"formats": ["markdown"],
},
)
resp.raise_for_status()
data = resp.json()
markdown = data.get("data", {}).get("markdown", "")
if not markdown:
return []
# Try to extract individual reviews from the markdown.
# This is best-effort — review platforms have varied structures.
reviews = _extract_reviews_from_markdown(markdown, platform, url)
return reviews
def _extract_reviews_from_markdown(
markdown: str, platform: str, url: str
) -> list[Review]:
"""Attempt to parse review blocks from scraped markdown."""
reviews: list[Review] = []
# Strategy: split on patterns that look like review boundaries.
# Common patterns: "★★★★☆", "Rating: X/5", numbered reviews, horizontal rules
blocks = re.split(r"\n(?:---+|\*\*\*+|#{1,3}\s)", markdown)
for i, block in enumerate(blocks):
block = block.strip()
if len(block) < 20:
continue
# Try to extract rating
rating = None
star_match = re.search(r"([★]{1,5})", block)
if star_match:
rating = float(len(star_match.group(1)))
else:
rating_match = re.search(r"(\d(?:\.\d)?)\s*(?:/\s*5|out of 5|stars?)", block, re.I)
if rating_match:
rating = float(rating_match.group(1))
# Try to extract date
date = None
date_match = re.search(
r"(\w+ \d{1,2},? \d{4}|\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})",
block,
)
if date_match:
date = _parse_date(date_match.group(1))
reviews.append(
Review(
id=f"scraped_{i}",
text=block[:2000], # Cap individual review length
rating=rating,
date=date,
platform=platform or _guess_platform(url),
metadata={"source_url": url},
)
)
return reviews
# ── Firecrawl URL Scraping (secondary, best-effort) ─────────────────
# Uses the /v2/agent endpoint which autonomously navigates pagination
# and returns structured data — no brittle markdown regex parsing.
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_API_URL = "https://api.firecrawl.dev/v2"
_AGENT_POLL_INTERVAL = 2 # seconds between status checks
_AGENT_TIMEOUT = 120 # total seconds before giving up
_AGENT_MAX_CREDITS = 500 # conservative cap for zero-cost constraint
_REVIEW_SCHEMA = {
"type": "object",
"properties": {
"reviews": {
"type": "array",
"items": {
"type": "object",
"properties": {
"text": {"type": "string", "description": "Full review text written by the customer"},
"rating": {"type": ["number", "null"], "description": "Star rating from 1 to 5"},
"date": {"type": ["string", "null"], "description": "Date posted in YYYY-MM-DD format"},
"author": {"type": ["string", "null"], "description": "Reviewer's display name"},
},
"required": ["text"],
},
}
},
"required": ["reviews"],
}
def _build_agent_prompt(url: str, platform: str) -> str:
platform_hint = f"This is a {platform} review page. " if platform else ""
return (
f"{platform_hint}"
f"Extract all customer/user reviews from {url}. "
"For each review, extract: the full review text, the star rating as a number "
"from 1 to 5 (if shown), the date it was posted in YYYY-MM-DD format (if shown), "
"and the reviewer's name (if shown). "
"Only extract actual user-submitted reviews — not editorial content, product "
"descriptions, or page navigation text. If reviews span multiple pages, "
"navigate through all available pages to collect them."
)
async def scrape_url(url: str, platform: str = "") -> list[Review]:
"""Best-effort URL scraping via Firecrawl agent. Returns structured reviews."""
if not FIRECRAWL_API_KEY:
raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
detected_platform = platform or _guess_platform(url)
headers = {
"Authorization": f"Bearer {FIRECRAWL_API_KEY}",
"Content-Type": "application/json",
}
async with httpx.AsyncClient(timeout=30.0) as client:
# Submit agent job
resp = await client.post(
f"{FIRECRAWL_API_URL}/agent",
headers=headers,
json={
"urls": [url],
"prompt": _build_agent_prompt(url, detected_platform),
"schema": _REVIEW_SCHEMA,
"model": "spark-1-mini",
"maxCredits": _AGENT_MAX_CREDITS,
},
)
resp.raise_for_status()
job = resp.json()
job_id = job.get("id")
if not job_id:
return []
# Poll for completion
elapsed = 0.0
while elapsed < _AGENT_TIMEOUT:
await asyncio.sleep(_AGENT_POLL_INTERVAL)
elapsed += _AGENT_POLL_INTERVAL
poll = await client.get(
f"{FIRECRAWL_API_URL}/agent/{job_id}",
headers=headers,
)
poll.raise_for_status()
result = poll.json()
status = result.get("status", "")
if status == "completed":
return _parse_agent_reviews(result.get("data", {}), detected_platform, url)
if status in ("failed", "cancelled"):
return []
# Timed out
return []
def _parse_agent_reviews(data: dict, platform: str, url: str) -> list[Review]:
"""Convert the agent's structured output into Review objects."""
raw_reviews = data.get("reviews", [])
if not isinstance(raw_reviews, list):
return []
reviews: list[Review] = []
for i, item in enumerate(raw_reviews):
if not isinstance(item, dict):
continue
text = str(item.get("text", "")).strip()
if not text:
continue
# Normalise rating — agent may return int, float, string, or null
rating = None
raw_rating = item.get("rating")
if raw_rating is not None:
rating = _parse_rating(str(raw_rating))
# Normalise date — agent should return YYYY-MM-DD but be defensive
date = None
raw_date = item.get("date")
if raw_date:
date = _parse_date(str(raw_date))
author = str(item.get("author", "") or "").strip()
reviews.append(
Review(
id=f"scraped_{i}",
text=text[:2000],
rating=rating,
date=date,
author=author,
platform=platform,
metadata={"source_url": url},
)
)
return reviews
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
source .venv/bin/activate && python3 -c "import ast; ast.parse(open('app/ingest.py').read()); print('Syntax OK')"
source .venv/bin/activate && python3 -c "
from app.ingest import parse_csv, scrape_url, build_summary, _build_agent_prompt, _REVIEW_SCHEMA
print('All imports OK')
print('Prompt:', _build_agent_prompt('https://amazon.com/reviews', 'Amazon')[:80], '...')
print('Schema keys:', list(_REVIEW_SCHEMA['properties']['reviews']['items']['properties'].keys()))
"
All imports OK
Prompt: This is a Amazon review page. Extract all customer/user reviews from https://ama ...
Schema keys: ['text', 'rating', 'date', 'author']
curl -s -X POST http://localhost:8000/upload \
  -F "file=@/tmp/test_reviews.csv" \
  -F "product_name=RegressionTest" \
  -F "platform=Amazon" \
  -D - | head -5
HTTP/1.1 200 OK
date: Sat, 21 Mar 2026 22:23:30 GMT
server: uvicorn
hx-redirect: /chat/1e3dcd60-5164-419c-8150-7cf854180802
content-length: 83
CSV path unaffected. Done. Here's what changed:
app/ingest.py — single file, replaced the Firecrawl section:
- `/v1/scrape` → `/v2/agent`
- `urls` array instead of raw URL + markdown format
- Returns `{reviews: [{text, rating, date, author}]}` directly — no more regex splitting of markdown blocks
- Deleted `_extract_reviews_from_markdown()` — the brittle regex parser is gone
- Added `_REVIEW_SCHEMA`, `_build_agent_prompt()`, `_parse_agent_reviews()` — clean structured extraction
- `spark-1-mini` model, `maxCredits=500`
- `_guess_platform()`, `_parse_date()`, `_parse_rating()` — reused for normalizing agent output
Just attempted a test w/ https://www.g2.com/products/notion/reviews#reviews and clicking Scrape & Analyse netted nothing - the modal didn't dismiss.
grep FIRECRAWL /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/.env
Key is commented out. That means scrape_url() raised "FIRECRAWL_API_KEY not set" and the error HTML went into #scrape-result inside the modal — but since it's a small div at the bottom of the tab, the user may not have noticed it, and the modal stayed open because there's no redirect.
Two issues to fix:
1. Uncomment the key so the actual Firecrawl agent runs
2. Make errors more visible in the modal
# FIRECRAWL_API_KEY=fc-c1a794a831494cbdab6308fe1aebd585
FIRECRAWL_API_KEY=fc-c1a794a831494cbdab6308fe1aebd585
<tool_use_error>File has not been read yet. Read it first before writing to it.</tool_use_error>
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/.env"
}
 1→# Required
2→ANTHROPIC_API_KEY=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
3→
4→# Optional: model override (default: claude-sonnet-4-6-20250514)
5→# CLAUDE_MODEL=claude-opus-4-6
6→
7→# Optional: Firecrawl for URL scraping (secondary path)
8→# FIRECRAWL_API_KEY=fc-c1a794a831494cbdab6308fe1aebd585
9→
10→# Optional: data directory (default: data)
11→# DATA_DIR=data
12→
13→# Optional: ChromaDB directory (default: data/chroma)
14→# CHROMA_DIR=data/chroma
15→The API key is commented out. Uncomment it, and also the env var is read at import time so we need to make it read lazily. But first — the real issue is the key was commented out, so scrape_url raised immediately and the error rendered silently inside the modal. Let me fix both.
# FIRECRAWL_API_KEY=fc-c1a794a831494cbdab6308fe1aebd585
FIRECRAWL_API_KEY=fc-c1a794a831494cbdab6308fe1aebd585
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/.env has been updated successfully.
Now the env var is read at module load time before load_dotenv() runs in main.py. Let me fix that — make it lazy:
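The import-time read pitfall described here can be reproduced in miniature. `DEMO_FIRECRAWL_KEY` is a made-up variable name used only for illustration:

```python
import os

# A module-level read is evaluated once, at import time. If load_dotenv()
# runs afterwards, this constant never sees what the .env file provided.
KEY_AT_IMPORT = os.getenv("DEMO_FIRECRAWL_KEY", "")

def get_key_lazily() -> str:
    # A lazy read happens on every call, so late environment changes are picked up.
    return os.getenv("DEMO_FIRECRAWL_KEY", "")
```

Setting the variable after import leaves `KEY_AT_IMPORT` empty while `get_key_lazily()` returns the new value — exactly why `scrape_url()` should read the key inside the function.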
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py",
"offset": 131,
"limit": 15
}
 131→
132→# ── Firecrawl URL Scraping (secondary, best-effort) ─────────────────
133→# Uses the /v2/agent endpoint which autonomously navigates pagination
134→# and returns structured data — no brittle markdown regex parsing.
135→
136→FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
137→FIRECRAWL_API_URL = "https://api.firecrawl.dev/v2"
138→
139→_AGENT_POLL_INTERVAL = 2 # seconds between status checks
140→_AGENT_TIMEOUT = 120 # total seconds before giving up
141→_AGENT_MAX_CREDITS = 500 # conservative cap for zero-cost constraint
142→
143→_REVIEW_SCHEMA = {
144→ "type": "object",
145→    "properties": {
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY", "")
FIRECRAWL_API_URL = "https://api.firecrawl.dev/v2"
FIRECRAWL_API_URL = "https://api.firecrawl.dev/v2"
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py",
"offset": 173,
"limit": 10
}
 173→        "navigate through all available pages to collect them."
174→ )
175→
176→
177→async def scrape_url(url: str, platform: str = "") -> list[Review]:
178→ """Best-effort URL scraping via Firecrawl agent. Returns structured reviews."""
179→ if not FIRECRAWL_API_KEY:
180→ raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
181→
182→    detected_platform = platform or _guess_platform(url)
"""Best-effort URL scraping via Firecrawl agent. Returns structured reviews."""
if not FIRECRAWL_API_KEY:
raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
"""Best-effort URL scraping via Firecrawl agent. Returns structured reviews."""
api_key = os.getenv("FIRECRAWL_API_KEY", "")
if not api_key:
raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
"Authorization": f"Bearer {FIRECRAWL_API_KEY}",
"Authorization": f"Bearer {api_key}",
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated. All occurrences were successfully replaced.
source .venv/bin/activate && python3 -c "
from dotenv import load_dotenv
load_dotenv()
import os
key = os.getenv('FIRECRAWL_API_KEY', '')
print(f'Key loaded: {bool(key)} ({key[:8]}...)')
"
Good. Now let me test the actual scrape endpoint — this will take a bit since the agent polls:
curl -s -X POST http://localhost:8000/scrape \
  -F "url=https://www.g2.com/products/notion/reviews" \
  -F "product_name=Notion" \
  -F "platform=G2" \
  -D - 2>&1
HTTP/1.1 400 Bad Request
date: Sat, 21 Mar 2026 22:27:18 GMT
server: uvicorn
content-length: 118
content-type: text/html; charset=utf-8
<div class="error-banner">
<p>No reviews could be extracted from that URL. Try uploading a CSV instead.</p>
</div>
Got a 400 with empty results. The agent ran but returned nothing. Let me add some debug logging to see what happened:
source .venv/bin/activate && python3 -c "
import asyncio, httpx, os, json
from dotenv import load_dotenv
load_dotenv()
async def test():
api_key = os.getenv('FIRECRAWL_API_KEY')
headers = {'Authorization': f'Bearer {api_key}', 'Content-Type': 'application/json'}
async with httpx.AsyncClient(timeout=30.0) as client:
resp = await client.post(
'https://api.firecrawl.dev/v2/agent',
headers=headers,
json={
'urls': ['https://www.g2.com/products/notion/reviews'],
'prompt': 'Extract all customer reviews from this G2 page. For each review get the full text, star rating 1-5, date in YYYY-MM-DD, and reviewer name.',
'schema': {
'type': 'object',
'properties': {
'reviews': {
'type': 'array',
'items': {
'type': 'object',
'properties': {
'text': {'type': 'string'},
'rating': {'type': ['number', 'null']},
'date': {'type': ['string', 'null']},
'author': {'type': ['string', 'null']},
},
'required': ['text']
}
}
},
'required': ['reviews']
},
'model': 'spark-1-mini',
'maxCredits': 500,
},
)
print(f'Submit status: {resp.status_code}')
job = resp.json()
print(f'Job response: {json.dumps(job, indent=2)[:500]}')
job_id = job.get('id')
if not job_id:
print('No job ID!')
return
# Poll a few times
for i in range(30):
await asyncio.sleep(3)
poll = await client.get(f'https://api.firecrawl.dev/v2/agent/{job_id}', headers=headers)
result = poll.json()
status = result.get('status')
print(f'Poll {i+1}: status={status}')
if status == 'completed':
data = result.get('data', {})
reviews = data.get('reviews', [])
print(f'Got {len(reviews)} reviews')
if reviews:
print(f'First: {json.dumps(reviews[0], indent=2)[:300]}')
else:
print(f'Data keys: {list(data.keys()) if isinstance(data, dict) else type(data)}')
print(f'Raw data: {json.dumps(data)[:500]}')
return
if status in ('failed', 'cancelled'):
print(f'Failed: {json.dumps(result)[:500]}')
return
asyncio.run(test())
" 2>&1
Submit status: 200
Job response: {
"success": true,
"id": "019d1284-c3d1-7641-9dbb-7b0050037eee"
}
Poll 1: status=processing
Poll 2: status=processing
Poll 3: status=processing
Poll 4: status=processing
Poll 5: status=processing
Poll 6: status=processing
Poll 7: status=processing
Poll 8: status=processing
Poll 9: status=processing
Poll 10: status=processing
Poll 11: status=processing
Poll 12: status=processing
Poll 13: status=processing
Poll 14: status=processing
Poll 15: status=processing
Poll 16: status=processing
Poll 17: status=processing
Poll 18: status=processing
Poll 19: status=processing
Poll 20: status=processing
Poll 21: status=processing
Poll 22: status=processing
Poll 23: status=processing
Poll 24: status=processing
Poll 25: status=processing
Poll 26: status=processing
Poll 27: status=processing
Poll 28: status=processing
Poll 29: status=processing
Poll 30: status=processing
OK — the agent job works but takes longer than 120s to complete. G2 is a heavy JS-rendered site with anti-bot measures, so the agent needs time. Two things to address:
Let me fix both. The scrape should redirect to the chat page immediately with a "processing" status, and the page should poll for completion — same pattern as briefbot's SSE streaming.
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/main.py",
"offset": 120,
"limit": 65
}
 120→    summary = build_summary(reviews, source_type="csv", product_name=product_name, platform=platform)
121→
122→ session = Session(
123→ session_id=session_id,
124→ product_name=summary.product_name,
125→ platform=summary.platform,
126→ summary=summary,
127→ status="indexing",
128→ )
129→ store.save_session(session)
130→ store.save_reviews_json(session_id, [r.model_dump(mode="json") for r in reviews])
131→
132→ indexed = vectordb.index_reviews(session_id, reviews)
133→ summary.total_reviews = indexed
134→ store.update_summary(session_id, summary)
135→ store.set_status(session_id, "ready")
136→
137→ return HTMLResponse(
138→ status_code=200,
139→ content=f'<script>window.location.href="/chat/{session_id}";</script>',
140→ headers={"HX-Redirect": f"/chat/{session_id}"},
141→ )
142→
143→
144→# ── URL Scrape ───────────────────────────────────────────────────────
145→
146→@app.post("/scrape")
147→async def scrape(
148→ request: Request,
149→ url: str = Form(...),
150→ product_name: str = Form(""),
151→ platform: str = Form(""),
152→):
153→ session_id = str(uuid.uuid4())
154→
155→ try:
156→ reviews = await scrape_url(url, platform=platform)
157→ except Exception as e:
158→ return templates.TemplateResponse(
159→ "partials/error.html",
160→ {"request": request, "error": f"Scraping failed: {e}. Try uploading a CSV instead."},
161→ status_code=400,
162→ )
163→
164→ if not reviews:
165→ return templates.TemplateResponse(
166→ "partials/error.html",
167→ {"request": request, "error": "No reviews could be extracted from that URL. Try uploading a CSV instead."},
168→ status_code=400,
169→ )
170→
171→ summary = build_summary(reviews, source_type="url", product_name=product_name, platform=platform)
172→
173→ session = Session(
174→ session_id=session_id,
175→ product_name=summary.product_name,
176→ platform=summary.platform,
177→ summary=summary,
178→ status="indexing",
179→ )
180→ store.save_session(session)
181→ store.save_reviews_json(session_id, [r.model_dump(mode="json") for r in reviews])
182→
183→ indexed = vectordb.index_reviews(session_id, reviews)
184→    summary.total_reviews = indexed
Here's the plan: The /scrape route should create the session immediately with status="scraping", redirect to /chat/{id}, and run the Firecrawl agent in the background. The chat page will show a scraping progress view when status == "scraping", polling via SSE or a simple HTMX poll until the session flips to "ready".
# ── URL Scrape ───────────────────────────────────────────────────────
@app.post("/scrape")
async def scrape(
request: Request,
url: str = Form(...),
product_name: str = Form(""),
platform: str = Form(""),
):
session_id = str(uuid.uuid4())
try:
reviews = await scrape_url(url, platform=platform)
except Exception as e:
return templates.TemplateResponse(
"partials/error.html",
{"request": request, "error": f"Scraping failed: {e}. Try uploading a CSV instead."},
status_code=400,
)
if not reviews:
return templates.TemplateResponse(
"partials/error.html",
{"request": request, "error": "No reviews could be extracted from that URL. Try uploading a CSV instead."},
status_code=400,
)
summary = build_summary(reviews, source_type="url", product_name=product_name, platform=platform)
session = Session(
session_id=session_id,
product_name=summary.product_name,
platform=summary.platform,
summary=summary,
status="indexing",
)
store.save_session(session)
store.save_reviews_json(session_id, [r.model_dump(mode="json") for r in reviews])
indexed = vectordb.index_reviews(session_id, reviews)
summary.total_reviews = indexed
store.update_summary(session_id, summary)
store.set_status(session_id, "ready")
return HTMLResponse(
status_code=200,
content=f'<script>window.location.href="/chat/{session_id}";</script>',
headers={"HX-Redirect": f"/chat/{session_id}"},
)
# ── URL Scrape ───────────────────────────────────────────────────────
@app.post("/scrape")
async def scrape(
request: Request,
url: str = Form(...),
product_name: str = Form(""),
platform: str = Form(""),
):
session_id = str(uuid.uuid4())
# Create session immediately with "scraping" status so user gets feedback
session = Session(
session_id=session_id,
product_name=product_name or "Unknown Product",
platform=platform or "Web",
status="scraping",
)
store.save_session(session)
# Run Firecrawl agent in background — it can take minutes
asyncio.create_task(_run_scrape(session_id, url, product_name, platform))
# Redirect to chat page which shows a progress view
return HTMLResponse(
status_code=200,
content=f'<script>window.location.href="/chat/{session_id}";</script>',
headers={"HX-Redirect": f"/chat/{session_id}"},
)
async def _run_scrape(session_id: str, url: str, product_name: str, platform: str):
"""Background task: scrape URL, index reviews, update session status."""
try:
reviews = await scrape_url(url, platform=platform)
if not reviews:
store.set_status(session_id, "error")
await _emit(session_id, "No reviews could be extracted from that URL.", "error")
return
summary = build_summary(reviews, source_type="url", product_name=product_name, platform=platform)
store.save_reviews_json(session_id, [r.model_dump(mode="json") for r in reviews])
indexed = vectordb.index_reviews(session_id, reviews)
summary.total_reviews = indexed
store.update_summary(session_id, summary)
store.set_status(session_id, "ready")
await _emit(session_id, f"Scraping complete — {indexed} reviews indexed.", "info")
except Exception as e:
store.set_status(session_id, "error")
await _emit(session_id, f"Scraping failed: {e}", "error")
# ── Session status polling (for scraping progress) ───────────────────
@app.get("/api/status/{session_id}")
async def get_status(session_id: str):
session = store.load_session(session_id)
if not session:
return JSONResponse({"status": "not_found"}, status_code=404)
return JSONResponse({"status": session.status})The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/main.py has been updated successfully.
_AGENT_TIMEOUT = 120 # total seconds before giving up
_AGENT_TIMEOUT = 300 # total seconds before giving up
The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/templates/app.html",
"offset": 86,
"limit": 55
}

 86→        <span class="session-detail">{{ s.summary.total_reviews }} reviews · {{ s.platform }}</span>
87→ </a>
88→ {% else %}
89→ <p class="empty-history">No analyses yet</p>
90→ {% endfor %}
91→ </nav>
92→ </div>
93→ </aside>
94→
95→ <!-- ── Main Area ────────────────────────────────────────── -->
96→ <main class="main-area">
97→ {% if session %}
98→ <!-- Chat view -->
99→ <div class="chat-messages" id="chat-messages">
100→ {% for msg in messages %}
101→ {% if msg.role == 'assistant' %}
102→ <div class="message assistant-message">
103→ <div class="message-content">{{ msg.content }}</div>
104→ </div>
105→ {% else %}
106→ <div class="message user-message">
107→ <div class="message-content"><p>{{ msg.content }}</p></div>
108→ </div>
109→ {% endif %}
110→ {% endfor %}
111→
112→ {% if not messages %}
113→ <div class="message assistant-message welcome">
114→ <div class="message-content">
115→ <p>I have <strong>{{ summary.total_reviews }} reviews</strong> loaded for <strong>{{ summary.product_name }}</strong> from <strong>{{ summary.platform }}</strong>{% if summary.date_range %}, spanning <strong>{{ summary.date_range }}</strong>{% endif %}.</p>
116→ <p>I can only answer questions grounded in this dataset. What would you like to explore?</p>
117→ </div>
118→ <div class="follow-ups">
119→ <button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What's the overall sentiment breakdown?">Sentiment breakdown</button>
120→ <button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What are the most common complaints?">Top complaints</button>
121→ <button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="Show me the rating distribution over time">Ratings over time</button>
122→ <button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What do people praise the most?">Most praised aspects</button>
123→ </div>
124→ </div>
125→ {% endif %}
126→ </div>
127→
128→ <div class="chat-input-area">
129→ <form id="chat-form" onsubmit="sendMessage(event)">
130→ <div class="input-row">
131→ <textarea id="chat-input" name="message" placeholder="Ask about the reviews..." rows="1" autofocus></textarea>
132→ <button type="submit" class="btn-send" id="send-btn">
133→ <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
134→ <line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/>
135→ </svg>
136→ </button>
137→ </div>
138→ </form>
139→ <p class="input-hint">Analyses only the uploaded dataset. No external knowledge.</p>
140→      </div>

<main class="main-area">
{% if session and session.status == 'scraping' %}
<!-- Scraping in progress -->
<div class="scraping-view" id="scraping-view">
<div class="scraping-content">
<div class="scraping-animation">
<svg class="scraping-icon" width="40" height="40" viewBox="0 0 32 32" fill="none">
<circle cx="16" cy="16" r="14" stroke="var(--accent)" stroke-width="1.5" opacity="0.2"/>
<circle cx="16" cy="16" r="14" stroke="var(--accent)" stroke-width="1.5" stroke-dasharray="20 68" class="scraping-spinner"/>
</svg>
</div>
<h3>Scraping reviews...</h3>
<p class="scraping-detail">Firecrawl is navigating <strong>{{ session.platform }}</strong> to extract reviews for <strong>{{ session.product_name }}</strong>. This may take a few minutes for sites with pagination or anti-bot protection.</p>
<div class="scraping-steps">
<div class="step active">
<span class="step-dot"></span>
<span>Submitting to Firecrawl agent</span>
</div>
<div class="step" id="step-navigating">
<span class="step-dot"></span>
<span>Navigating pages & extracting reviews</span>
</div>
<div class="step" id="step-indexing">
<span class="step-dot"></span>
<span>Indexing into vector database</span>
</div>
</div>
</div>
</div>
{% elif session and session.status == 'error' %}
<!-- Scraping failed -->
<div class="empty-state">
<svg width="40" height="40" viewBox="0 0 24 24" fill="none" stroke="var(--error)" stroke-width="1.5" stroke-linecap="round">
<circle cx="12" cy="12" r="10"/><line x1="15" y1="9" x2="9" y2="15"/><line x1="9" y1="9" x2="15" y2="15"/>
</svg>
<h2>Scraping failed</h2>
<p>Could not extract reviews from this URL. The site may be blocking automated access.</p>
<button class="btn btn-primary" onclick="openModal()" style="width:auto">Try Again</button>
</div>
{% elif session %}
<!-- Chat view -->
<div class="chat-messages" id="chat-messages">
{% for msg in messages %}
{% if msg.role == 'assistant' %}
<div class="message assistant-message">
<div class="message-content">{{ msg.content }}</div>
</div>
{% else %}
<div class="message user-message">
<div class="message-content"><p>{{ msg.content }}</p></div>
</div>
{% endif %}
{% endfor %}
{% if not messages %}
<div class="message assistant-message welcome">
<div class="message-content">
<p>I have <strong>{{ summary.total_reviews }} reviews</strong> loaded for <strong>{{ summary.product_name }}</strong> from <strong>{{ summary.platform }}</strong>{% if summary.date_range %}, spanning <strong>{{ summary.date_range }}</strong>{% endif %}.</p>
<p>I can only answer questions grounded in this dataset. What would you like to explore?</p>
</div>
<div class="follow-ups">
<button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What's the overall sentiment breakdown?">Sentiment breakdown</button>
<button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What are the most common complaints?">Top complaints</button>
<button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="Show me the rating distribution over time">Ratings over time</button>
<button class="follow-up-btn" onclick="sendFollowUp(this)" data-question="What do people praise the most?">Most praised aspects</button>
</div>
</div>
{% endif %}
</div>
<div class="chat-input-area">
<form id="chat-form" onsubmit="sendMessage(event)">
<div class="input-row">
<textarea id="chat-input" name="message" placeholder="Ask about the reviews..." rows="1" autofocus></textarea>
<button type="submit" class="btn-send" id="send-btn">
<svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<line x1="5" y1="12" x2="19" y2="12"/><polyline points="12 5 19 12 12 19"/>
</svg>
</button>
</div>
</form>
<p class="input-hint">Analyses only the uploaded dataset. No external knowledge.</p>
</div>

The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/templates/app.html has been updated successfully.
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/static/style.css",
"offset": 370,
"limit": 10
}

370→  color: var(--text-3);
371→ padding: 0.5rem 0;
372→}
373→
374→/* ═══════════════════════════════════════════════════════════════════
375→ MAIN AREA
376→ ═══════════════════════════════════════════════════════════════════ */
377→
378→.main-area {
379→  display: flex;

/* ═══════════════════════════════════════════════════════════════════
   MAIN AREA
   ═══════════════════════════════════════════════════════════════════ */
/* ── Scraping Progress ─────────────────────────────────────────── */
.scraping-view {
flex: 1;
display: flex;
align-items: center;
justify-content: center;
padding: 2rem;
}
.scraping-content {
text-align: center;
max-width: 420px;
display: flex;
flex-direction: column;
align-items: center;
gap: 1rem;
}
.scraping-content h3 {
font-size: 1.1rem;
font-weight: 500;
}
.scraping-detail {
font-size: 0.85rem;
color: var(--text-2);
line-height: 1.6;
}
@keyframes scrape-spin {
to { transform: rotate(360deg); }
}
.scraping-spinner {
animation: scrape-spin 1.2s linear infinite;
transform-origin: center;
}
.scraping-animation {
margin-bottom: 0.25rem;
}
.scraping-steps {
display: flex;
flex-direction: column;
gap: 0.5rem;
align-items: flex-start;
margin-top: 0.5rem;
text-align: left;
}
.step {
display: flex;
align-items: center;
gap: 0.6rem;
font-size: 0.8rem;
color: var(--text-3);
transition: color 0.2s;
}
.step.active { color: var(--text-1); }
.step.done { color: var(--accent); }
.step-dot {
width: 7px;
height: 7px;
border-radius: 50%;
background: var(--bg-4);
flex-shrink: 0;
transition: background 0.2s;
}
.step.active .step-dot { background: var(--accent); }
.step.done .step-dot { background: var(--accent); }
.main-area {
  display: flex;

The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/static/style.css has been updated successfully.
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/static/app.js",
"offset": 250,
"limit": 15
}

250→
251→function removeThinking() {
252→ var el = document.getElementById("thinking-indicator");
253→ if (el) el.remove();
254→}
255→
256→function appendError(msg) {
257→ var div = document.createElement("div");
258→ div.className = "message assistant-message error";
259→ div.innerHTML = '<div class="message-content"><p>' + escapeHtml(msg) + "</p></div>";
260→ chatMessages.appendChild(div);
261→ scrollToBottom();
262→}
263→
264→ function sendFollowUp(btn) {

{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/static/app.js",
"offset": 338
}

338→        color: textColor,
339→ padding: { bottom: 12 },
340→ },
341→ legend: {
342→ labels: { color: textColor, font: { family: "'DM Sans', sans-serif", size: 11 } },
343→ },
344→ },
345→ scales:
346→ config.type !== "pie" && config.type !== "doughnut"
347→ ? {
348→ x: {
349→ ticks: { color: textColor, font: { family: "'IBM Plex Mono', monospace", size: 10 } },
350→ grid: { color: gridColor },
351→ },
352→ y: {
353→ ticks: { color: textColor, font: { family: "'IBM Plex Mono', monospace", size: 10 } },
354→ grid: { color: gridColor },
355→ beginAtZero: true,
356→ },
357→ }
358→ : undefined,
359→ },
360→ });
361→}
362→
363→function runChartScripts() {
364→ if (!chatMessages) return;
365→ var scripts = chatMessages.querySelectorAll("script");
366→ scripts.forEach(function (script) {
367→ if (script.textContent.indexOf("renderChart") !== -1 && !script.dataset.executed) {
368→ script.dataset.executed = "true";
369→ eval(script.textContent);
370→ }
371→ });
372→}
373→
374→// ── Utilities ────────────────────────────────────────────────────────
375→
376→function escapeHtml(str) {
377→ var div = document.createElement("div");
378→ div.appendChild(document.createTextNode(str));
379→ return div.innerHTML;
380→}
381→
382→if (chatMessages) {
383→ var observer = new MutationObserver(scrollToBottom);
384→ observer.observe(chatMessages, { childList: true, subtree: true });
385→}
386→
387→scrollToBottom();
if (chatMessages) {
  var observer = new MutationObserver(scrollToBottom);
  observer.observe(chatMessages, { childList: true, subtree: true });
}

scrollToBottom();
// ── Scraping status poll ────────────────────────────────────────────
(function () {
  var scrapingView = document.getElementById("scraping-view");
  if (!scrapingView || !SESSION_ID) return;

  var stepNav = document.getElementById("step-navigating");
  var stepIdx = document.getElementById("step-indexing");
  var pollCount = 0;

  // Mark "navigating" step active after 3s
  setTimeout(function () {
    if (stepNav) stepNav.classList.add("active");
  }, 3000);

  var poller = setInterval(function () {
    pollCount++;
    fetch("/api/status/" + SESSION_ID)
      .then(function (r) { return r.json(); })
      .then(function (data) {
        if (data.status === "ready") {
          if (stepNav) stepNav.classList.add("done");
          if (stepIdx) { stepIdx.classList.add("active"); stepIdx.classList.add("done"); }
          clearInterval(poller);
          // Brief pause so user sees "done" state, then reload
          setTimeout(function () { window.location.reload(); }, 600);
        } else if (data.status === "error") {
          clearInterval(poller);
          window.location.reload();
        }
      })
      .catch(function () {}); // Ignore network blips
  }, 3000);

  // Stop polling after 6 minutes
  setTimeout(function () { clearInterval(poller); }, 360000);
})();

The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/static/app.js has been updated successfully.
source .venv/bin/activate && python3 -c "import ast; ast.parse(open('app/ingest.py').read()); ast.parse(open('app/main.py').read()); print('All OK')"

# Quick test: submit a scrape and check the redirect + progress page
curl -s -X POST http://localhost:8000/scrape \
  -F "url=https://www.g2.com/products/notion/reviews" \
  -F "product_name=Notion" \
  -F "platform=G2" \
  -D - 2>&1 | head -8
HTTP/1.1 200 OK
date: Sat, 21 Mar 2026 22:33:09 GMT
server: uvicorn
hx-redirect: /chat/0c1d8495-38eb-4cdb-bd9a-bb4d87bf4966
content-length: 83
content-type: text/html; charset=utf-8

<script>window.location.href="/chat/0c1d8495-38eb-4cdb-bd9a-bb4d87bf4966";</script>
curl -s http://localhost:8000/chat/0c1d8495-38eb-4cdb-bd9a-bb4d87bf4966 | grep -o 'scraping-view\|Scraping reviews\|scraping-steps' | head -3
curl -s http://localhost:8000/api/status/0c1d8495-38eb-4cdb-bd9a-bb4d87bf4966
Working. Here's what was fixed:
- `.env` — uncommented it so `scrape_url()` picks it up via `load_dotenv()`
- `/scrape` now creates the session immediately with `status="scraping"`, redirects to the chat page, and runs Firecrawl in a background task
- While `session.status == "scraping"`, the chat page shows an animated spinner with step indicators (submitting → navigating → indexing)
- The page polls `/api/status/{id}` every 3s and reloads when the status flips to `ready` or `error`

Open http://localhost:8000/chat/0c1d8495-38eb-4cdb-bd9a-bb4d87bf4966 to see the scraping progress view live.
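The fire-and-forget flow described above reduces to a small pattern. This is a minimal sketch, not the project code: `STATUS` stands in for the session store and `run_scrape` stands in for the real Firecrawl call.

```python
import asyncio

STATUS = {}  # stand-in for the session store

async def run_scrape(session_id):
    # Stand-in for the real Firecrawl call, which can take minutes.
    await asyncio.sleep(0)
    STATUS[session_id] = "ready"

async def handle_scrape():
    sid = "demo-session"
    STATUS[sid] = "scraping"                # pollers see this immediately
    task = asyncio.create_task(run_scrape(sid))
    assert STATUS[sid] == "scraping"        # handler returns before the task flips it
    await task                              # awaited here only so the demo finishes
    return STATUS[sid]

print(asyncio.run(handle_scrape()))  # prints "ready"
```

The key point is that the request handler writes the "scraping" status and returns at once; the status only flips to "ready" or "error" when the background task completes, which is what the client-side poller watches for.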
Says: Scraping failed
Could not extract reviews from this URL. The site may be blocking automated access.
What do logs say?
pkill -f "uvicorn app.main" 2>/dev/null; sleep 1
source .venv/bin/activate && uvicorn app.main:app --reload --reload-exclude '.venv' --port 8000 2>&1 | tail -30 &
sleep 2
echo "--- Server restarted ---"
# Test scrape with a simpler URL first and watch logs
curl -s -X POST http://localhost:8000/scrape \
  -F "url=https://www.g2.com/products/notion/reviews" \
  -F "product_name=Notion" \
  -F "platform=G2" > /dev/null 2>&1
sleep 5

# Check what the server logged
cat /dev/stderr 2>/dev/null || true
The logs went to the background process. Let me add explicit error logging to the scrape background task so we can see what happened:
source .venv/bin/activate && python3 -c "
import asyncio, os, json
from dotenv import load_dotenv
load_dotenv()
from app.ingest import scrape_url
async def test():
    try:
        reviews = await scrape_url('https://www.g2.com/products/notion/reviews', platform='G2')
        print(f'Got {len(reviews)} reviews')
        if reviews:
            print(json.dumps(reviews[0].model_dump(mode='json'), indent=2, default=str)[:400])
    except Exception as e:
        print(f'ERROR: {type(e).__name__}: {e}')
asyncio.run(test())
" 2>&1

ERROR: HTTPStatusError: Server error '502 Bad Gateway' for url 'https://api.firecrawl.dev/v2/agent/019d128d-5f0a-708d-9f67-73e2568e6b38'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502
502 on the poll request. The agent job exists but the poll endpoint is returning a transient server error. We need to handle this gracefully — retry on 5xx instead of crashing.
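The retry-on-5xx idea can be sketched independently of httpx. Here `fetch` is a hypothetical stand-in for the real poll request, so the loop can be exercised without a network:

```python
import asyncio

async def poll_until_done(fetch, interval=0.01, timeout=1.0):
    """Poll `fetch` until a terminal status, tolerating transient 5xx responses."""
    elapsed = 0.0
    while elapsed < timeout:
        await asyncio.sleep(interval)
        elapsed += interval
        status_code, payload = await fetch()
        if status_code >= 500:
            continue  # transient server error, keep polling
        if payload.get("status") in ("completed", "failed", "cancelled"):
            return payload
    return None  # timed out

# Simulated endpoint: a 502 first, then a completed job.
responses = iter([(502, {}), (200, {"status": "completed", "data": {"reviews": 3}})])

async def fake_fetch():
    return next(responses)

result = asyncio.run(poll_until_done(fake_fetch))
print(result["status"])  # prints "completed"
```

The 502 is swallowed and the next poll succeeds, instead of the exception propagating out of `scrape_url` and failing the whole session.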
{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py",
"offset": 177,
"limit": 30
}

177→ async def scrape_url(url: str, platform: str = "") -> list[Review]:
178→ """Best-effort URL scraping via Firecrawl agent. Returns structured reviews."""
179→ api_key = os.getenv("FIRECRAWL_API_KEY", "")
180→ if not api_key:
181→ raise ValueError("FIRECRAWL_API_KEY not set. Use CSV upload instead.")
182→
183→ detected_platform = platform or _guess_platform(url)
184→ headers = {
185→ "Authorization": f"Bearer {api_key}",
186→ "Content-Type": "application/json",
187→ }
188→
189→ async with httpx.AsyncClient(timeout=30.0) as client:
190→ # Submit agent job
191→ resp = await client.post(
192→ f"{FIRECRAWL_API_URL}/agent",
193→ headers=headers,
194→ json={
195→ "urls": [url],
196→ "prompt": _build_agent_prompt(url, detected_platform),
197→ "schema": _REVIEW_SCHEMA,
198→ "model": "spark-1-mini",
199→ "maxCredits": _AGENT_MAX_CREDITS,
200→ },
201→ )
202→ resp.raise_for_status()
203→ job = resp.json()
204→
205→ job_id = job.get("id")
206→     if not job_id:

{
"file_path": "/Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py",
"offset": 206,
"limit": 25
}

206→     if not job_id:
207→ return []
208→
209→ # Poll for completion
210→ elapsed = 0.0
211→ while elapsed < _AGENT_TIMEOUT:
212→ await asyncio.sleep(_AGENT_POLL_INTERVAL)
213→ elapsed += _AGENT_POLL_INTERVAL
214→
215→ poll = await client.get(
216→ f"{FIRECRAWL_API_URL}/agent/{job_id}",
217→ headers=headers,
218→ )
219→ poll.raise_for_status()
220→ result = poll.json()
221→
222→ status = result.get("status", "")
223→ if status == "completed":
224→ return _parse_agent_reviews(result.get("data", {}), detected_platform, url)
225→ if status in ("failed", "cancelled"):
226→ return []
227→
228→ # Timed out
229→ return []
230→

        # Poll for completion — tolerate transient 5xx errors
        elapsed = 0.0
        while elapsed < _AGENT_TIMEOUT:
            await asyncio.sleep(_AGENT_POLL_INTERVAL)
            elapsed += _AGENT_POLL_INTERVAL

            try:
                poll = await client.get(
                    f"{FIRECRAWL_API_URL}/agent/{job_id}",
                    headers=headers,
                )
                if poll.status_code >= 500:
                    continue  # Retry on transient server errors
                poll.raise_for_status()
                result = poll.json()
            except httpx.HTTPStatusError:
                continue  # Retry
            except httpx.RequestError:
                continue  # Network blip, retry

            status = result.get("status", "")
            if status == "completed":
                return _parse_agent_reviews(result.get("data", {}), detected_platform, url)
            if status in ("failed", "cancelled"):
                return []

        # Timed out
        return []

The file /Users/shaynelarocque/Documents/GitHub/futuresight/reviewlens/app/ingest.py has been updated successfully.
source .venv/bin/activate && python3 -c "
import asyncio, os, json
from dotenv import load_dotenv
load_dotenv()
from app.ingest import scrape_url
async def test():
    try:
        reviews = await scrape_url('https://www.g2.com/products/notion/reviews', platform='G2')
        print(f'Got {len(reviews)} reviews')
        if reviews:
            for r in reviews[:3]:
                print(f'  [{r.rating}] {r.author}: {r.text[:80]}...')
        else:
            print('Empty result (agent may have timed out or returned no reviews)')
    except Exception as e:
        print(f'ERROR: {type(e).__name__}: {e}')
asyncio.run(test())
" 2>&1

The user doesn't want to proceed with this tool use. The tool use was rejected (eg. if it was a file edit, the new_string was NOT written to the file). STOP what you are doing and wait for the user to tell you how to proceed.