Conversation


@dan-and commented Sep 27, 2025

Hi,

I know this pull request is quite unorthodox, as it bundles many different features.

Please hear me out: I tried to bring firecrawl-simple up to date with fixes that have been applied to the original firecrawl repository. This is quite a journey, as there have been nearly 2,400 commits since the fork was made.

I reduced that number by filtering out:

  • Rust code (a migration that firecrawl-simple missed)
  • all billing/authentication-related code
  • everything that was v2-API-only

I got down to about 900 commits, which I then split into several categories: mainly performance, HTML conversion, bug fixes, library updates, and security fixes.

This is the first set of commits I have migrated. Most prominent are the Redis updates, which switch to lazy connection loading, as most modern implementations do.

I will be fine if this PR is not merged, but please take a look at my commits.

Git log:
commit 66a869c000ea2c8a27b2fbeb2c2f3c1c7b2f2205 (HEAD -> main)
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 20:29:08 2025 +0200

feat: add URL deduplication and redis connection optimization

Add advanced Redis memory management features:

URL Deduplication:
- Implement generateURLPermutations() for detecting similar URLs
- Support www/http/https variations and common path permutations
- Enable deduplicateSimilarURLs crawler option
- Achieve (very optimistic) ~16x memory reduction for crawled URL tracking

Connection Management:
- Add getRedisConnection() with lazy loading pattern
- Replace direct Redis instantiation across all services
- Include connection monitoring and error handling
- Optimize resource usage in concurrent environments

Updated components:
- Queue service, crawl Redis logic, job priority system
- All crawl controllers (v0/v1) with enhanced URL locking
- Worker processes with optimized Redis usage

Based on original firecrawl commits 308e4f43 (connection opt) + 7611f819 (URL dedup)
Impact: Significant memory savings + improved reliability
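
For context, here is a minimal sketch of the two patterns this commit describes. The function names (generateURLPermutations, getRedisConnection) come from the commit message; the bodies below are illustrative assumptions, not the actual patch:

```typescript
import { Redis } from "ioredis";

let redisConnection: Redis | null = null;

// Lazy-loading pattern: create one shared connection on first use
// instead of instantiating Redis at module load in every service.
export function getRedisConnection(): Redis {
  if (!redisConnection) {
    redisConnection = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379", {
      lazyConnect: true, // defer the TCP connect until the first command
      maxRetriesPerRequest: null,
    });
    redisConnection.on("error", (err) =>
      console.error("Redis connection error", err),
    );
  }
  return redisConnection;
}

// URL deduplication: generate the www/http/https variations of a URL so
// near-identical URLs can be tracked under a single set of keys.
export function generateURLPermutations(url: string): string[] {
  const u = new URL(url);
  const host = u.hostname.replace(/^www\./, "");
  const rest = u.pathname + u.search;
  return [
    `http://${host}${rest}`,
    `https://${host}${rest}`,
    `http://www.${host}${rest}`,
    `https://www.${host}${rest}`,
  ];
}
```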

commit 9381721 (origin/main, origin/HEAD)
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:42:02 2025 +0200

fix: increase Docker ulimit -n for better performance

- Increase file descriptor limit from 65535 to 1048576
- Apply ulimit to API, Worker, and Puppeteer services
- Improve support for high-concurrency web scraping scenarios
- Prevent "Too Many Open Files" errors under load

Based on the original firecrawl repository commit f0a1a2e4
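
For reference, the docker-compose stanza this describes looks roughly like the following (service names abbreviated; the exact layout in the PR may differ):

```yaml
services:
  api:
    ulimits:
      nofile:
        soft: 1048576
        hard: 1048576
  # the same ulimits block is repeated for the worker and puppeteer services
```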

commit fa316b2
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:33:28 2025 +0200

refactor: remove unnecessary logs and clean up console statements

- Replace console.log with Logger calls across all files
- Remove commented out console.log statements
- Enhance error logging with structured metadata
- Improve logging consistency and professionalism

Based on the original firecrawl commit 4c49bb9f

commit c000568
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:24:38 2025 +0200

feat: implement structured logging with Winston

- Add Winston dependency for structured logging
- Update TypeScript config to support ES2022 features
- Replace basic logging with JSON format and metadata
- Add zero data retention and error serialization
- Update controllers and workers with contextual logging

Based on original firecrawl repository commit 4a6b46d0
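
A minimal sketch of the Winston setup this describes, assuming a console transport and a LOGGING_LEVEL environment variable; the actual configuration in the commit may differ:

```typescript
import winston from "winston";

const logger = winston.createLogger({
  level: process.env.LOGGING_LEVEL ?? "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }), // serialize Error objects with stacks
    winston.format.json(),                  // structured JSON output
  ),
  transports: [new winston.transports.Console()],
});

// Contextual metadata travels alongside the message as JSON fields.
logger.info("scrape completed", { jobId: "abc123", module: "queue-worker" });
```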

commit d238fdd
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:51:22 2025 +0200

fix: status handling improvements on V0 and V1 api

- Update V0 and V1 status controllers
- Fix status logic and type definitions
- Improve status handling reliability

Based on: original firecrawl repository commit 6637dce6

commit 4480a2e
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:46:48 2025 +0200

fix: implement visited_unique tracking for limit enforcement

- Add visited_unique Redis set for proper limit counting
- Update lockURL and lockURLs functions to track unique URLs
- Match original firecrawl architecture for limit handling
- Ensure limit parameter is properly enforced during crawling

Based on: firecrawl-original commit c6ebbc6f
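
A sketch of the tracking logic, assuming ioredis and hypothetical key names; the real lockURL in the patch carries more state than shown here:

```typescript
import { Redis } from "ioredis";

// Count unique URLs in a Redis set so the crawl limit is enforced on
// distinct pages rather than on raw lock attempts.
async function lockURL(
  redis: Redis,
  crawlId: string,
  url: string,
  limit: number,
): Promise<boolean> {
  const added = await redis.sadd(`crawl:${crawlId}:visited_unique`, url);
  if (added === 0) return false; // URL already visited
  const count = await redis.scard(`crawl:${crawlId}:visited_unique`);
  return count <= limit; // reject once the unique-URL budget is spent
}
```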

commit fe075a4
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:46:32 2025 +0200

fix: respect limit parameter in sitemap processing

- Slice sitemap URLs to respect the specified limit
- Add debug logging for sitemap URL limiting
- Prevent creating excessive jobs from large sitemaps

Based on: firecrawl-original commit c6ebbc6f
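
The core of the fix is a simple slice before jobs are created; the variable names below are assumptions:

```typescript
// Cap sitemap-derived URLs at the crawl limit before enqueueing jobs.
const limited = limit !== undefined ? sitemapUrls.slice(0, limit) : sitemapUrls;
if (limited.length < sitemapUrls.length) {
  logger.debug("Sitemap URL list truncated to crawl limit", {
    total: sitemapUrls.length,
    kept: limited.length,
  });
}
```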

commit c4aa67f
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 12:51:48 2025 +0200

feat: add Docker ulimit configuration for better performance

- Add ulimits configuration to docker-compose.yaml
- Set nofile limits to 65535 (soft and hard)
- Improves performance for high-concurrency web scraping
- Prevents 'too many open files' errors
- Synchronized with docker-compose.dev.yaml

Based on original commit f0a1a2e4 from the original firecrawl repository

commit 9da8569
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 12:41:34 2025 +0200

feat: add scrapeId to document metadata for traceability

- Add scrapeId field to DocumentMetadata interface
- Update legacyDocumentConverter to include scrapeId in API response
- Modify scrape controller to pass crawl_id to job data for scrape jobs
- Update single_url scraper to include scrapeId in document metadata
- Fix queue worker to properly handle scrape vs crawl job distinction
- Fix Docker build issues and align pnpm lockfile for reproducible builds

Based on original commit d1f3b963 from the original firecrawl repository
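
The metadata change amounts to one optional field; the surrounding fields here are placeholders, not the full interface:

```typescript
interface DocumentMetadata {
  sourceURL?: string;
  statusCode?: number;
  // ...existing fields elided...
  scrapeId?: string; // id of the scrape job that produced this document
}
```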

commit 853dfee
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 10:43:10 2025 +0200

feat: improve robots.txt filtering and URL validation

- Add content-type filtering to robots.txt parsing to prevent HTML error pages from being treated as robots.txt rules (ba3e4cd3)
- Fix URL validation regex to allow query parameters and fragments (cfd776a5)

Resolves issues where sites like JPMorgan Chase had robots.txt rules ignored and URLs with params were failing validation.
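
A sketch of the content-type guard, assuming axios; the real fetch path in the patch may differ:

```typescript
import axios from "axios";

// Only treat the response as robots.txt when the server says it is plain
// text; otherwise an HTML error page would be parsed as crawl rules.
async function fetchRobotsTxt(robotsUrl: string): Promise<string> {
  const res = await axios.get(robotsUrl, { timeout: 10000 });
  const contentType: string = res.headers["content-type"] ?? "";
  if (!contentType.includes("text/plain")) {
    return ""; // no usable rules
  }
  return res.data;
}
```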

commit 7796f32
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 00:53:01 2025 +0200

fix: update axios to 1.12.2 for security vulnerability

- Update axios from ^1.3.4 to ^1.12.2
- Fixes CVE-2024-24691 and other security vulnerabilities
- Based on commit 50343bc9 from firecrawl-original
- No breaking changes, all functionality preserved

commit af8f411
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 00:30:42 2025 +0200

feat: add bulk scrape functionality

- Add bulk scrape controller for processing multiple URLs
- Add bulkScrapeRequestSchema and BulkScrapeRequest types
- Add /v1/bulk/scrape POST and GET endpoints
- Make originUrl optional in StoredCrawl type
- Update queue worker to handle bulk scrape jobs
- Add null safety for crawlerOptions in runWebScraper
- Based on commit 03b37998 from firecrawl-original
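
A hypothetical shape for the new endpoint, assuming zod validation and an Express router; field names and queue wiring in the actual PR may differ:

```typescript
import { z } from "zod";
import { Router } from "express";

const bulkScrapeRequestSchema = z.object({
  urls: z.array(z.string().url()).min(1),
  scrapeOptions: z.object({}).passthrough().optional(),
});

const router = Router();

router.post("/v1/bulk/scrape", (req, res) => {
  const parsed = bulkScrapeRequestSchema.safeParse(req.body);
  if (!parsed.success) {
    return res.status(400).json({ error: parsed.error.message });
  }
  // Enqueue one scrape job per URL here (queue wiring omitted).
  return res.status(200).json({ success: true, count: parsed.data.urls.length });
});
```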

@ifx-querido

Hi! I'm also awaiting some updates on firecrawl-simple... Does your pull request add either the Delete or Map endpoints? Or add the crawl options ignoreQueryParameters or maxDiscoveryDepth?
