forked from firecrawl/firecrawl
feat: Migrating updates from the original firecrawl repository #41
Open
dan-and wants to merge 11 commits into devflowinc:main from dan-and:main
+10,011 −925
Conversation
- Add bulk scrape controller for processing multiple URLs
- Add bulkScrapeRequestSchema and BulkScrapeRequest types
- Add /v1/bulk/scrape POST and GET endpoints
- Make originUrl optional in StoredCrawl type
- Update queue worker to handle bulk scrape jobs
- Add null safety for crawlerOptions in runWebScraper

Based on commit 03b3799 from firecrawl-original
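For illustration, a minimal sketch of what the schema and route wiring might look like, assuming zod and Express; field names beyond `urls`, and the queue helper, are hypothetical, not necessarily the code in this PR:

```typescript
import { z } from "zod";
import { Router, Request, Response } from "express";

// Hypothetical request shape: a list of URLs plus optional shared options.
const bulkScrapeRequestSchema = z.object({
  urls: z.array(z.string().url()).min(1),
  timeout: z.number().int().positive().optional(),
});

type BulkScrapeRequest = z.infer<typeof bulkScrapeRequestSchema>;

// Stub standing in for the real queue integration.
async function enqueueScrapeJobs(urls: string[]): Promise<string[]> {
  return urls.map((_, i) => `job-${i}`);
}

export const bulkScrapeRouter = Router();

// POST /v1/bulk/scrape — validate the body, then enqueue one job per URL.
bulkScrapeRouter.post("/v1/bulk/scrape", async (req: Request, res: Response) => {
  const parsed = bulkScrapeRequestSchema.safeParse(req.body);
  if (!parsed.success) {
    return res.status(400).json({ error: parsed.error.message });
  }
  const { urls }: BulkScrapeRequest = parsed.data;
  const ids = await enqueueScrapeJobs(urls);
  return res.status(200).json({ success: true, ids });
});
```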
- Update axios from ^1.3.4 to ^1.12.2
- Fixes CVE-2024-24691 and other security vulnerabilities
- No breaking changes; all functionality preserved

Based on commit 50343bc from firecrawl-original
- Add content-type filtering to robots.txt parsing to prevent HTML error pages from being treated as robots.txt rules (ba3e4cd)
- Fix URL validation regex to allow query parameters and fragments (cfd776a)

Resolves issues where sites like JPMorgan Chase had robots.txt rules ignored and URLs with query parameters failed validation.
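A sketch of the content-type guard, assuming axios for the fetch; function and variable names are illustrative:

```typescript
import axios from "axios";

// Fetch robots.txt, but only accept the body as rules when the server
// actually returned plain text. Some sites answer with an HTML error
// page and a 200 status, which would otherwise be parsed as bogus rules.
async function fetchRobotsTxt(baseUrl: string): Promise<string> {
  const response = await axios.get(new URL("/robots.txt", baseUrl).href, {
    timeout: 5000,
    validateStatus: (status) => status < 500,
  });
  const contentType = String(response.headers["content-type"] ?? "");
  if (response.status !== 200 || !contentType.includes("text/plain")) {
    return ""; // no usable rules; behave as if robots.txt were empty
  }
  return String(response.data);
}
```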
- Add scrapeId field to DocumentMetadata interface
- Update legacyDocumentConverter to include scrapeId in API response
- Modify scrape controller to pass crawl_id to job data for scrape jobs
- Update single_url scraper to include scrapeId in document metadata
- Fix queue worker to properly distinguish scrape vs crawl jobs
- Fix Docker build issues and align pnpm lockfile for reproducible builds

Based on commit d1f3b96 from the original firecrawl repository
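Conceptually, the metadata change amounts to something like this (surrounding fields elided; shapes are illustrative):

```typescript
// Metadata carried on every scraped document (other fields elided).
interface DocumentMetadata {
  sourceURL?: string;
  statusCode?: number;
  // New: the id of the job that produced this document, so a single
  // scrape can be correlated across queue, logs, and API response.
  scrapeId?: string;
}

// The legacy converter then only needs to pass the id through.
function legacyDocumentConverter(doc: { metadata: DocumentMetadata }) {
  return {
    metadata: { ...doc.metadata, scrapeId: doc.metadata.scrapeId },
  };
}
```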
- Add ulimits configuration to docker-compose.yaml
- Set nofile limits to 65535 (soft and hard)
- Improves performance for high-concurrency web scraping
- Prevents 'too many open files' errors
- Synchronized with docker-compose.dev.yaml

Based on commit f0a1a2e from the original firecrawl repository
- Slice sitemap URLs to respect the specified limit
- Add debug logging for sitemap URL limiting
- Prevent creating excessive jobs from large sitemaps

Based on commit c6ebbc6 from firecrawl-original
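The core of the change is a slice before job creation; a minimal sketch, with console.debug standing in for the project's logger:

```typescript
// Cap the URLs taken from a sitemap so a huge sitemap cannot fan out
// into more scrape jobs than the caller's limit allows.
function limitSitemapUrls(urls: string[], limit?: number): string[] {
  if (limit !== undefined && urls.length > limit) {
    console.debug(
      `Sitemap yielded ${urls.length} URLs; slicing to limit ${limit}`
    );
    return urls.slice(0, limit);
  }
  return urls;
}
```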
- Add visited_unique Redis set for proper limit counting
- Update lockURL and lockURLs functions to track unique URLs
- Match original firecrawl architecture for limit handling
- Ensure limit parameter is properly enforced during crawling

Based on commit c6ebbc6 from firecrawl-original
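A sketch of how such unique-URL tracking can work with ioredis; the client setup and key names are assumptions, not necessarily the PR's exact code:

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Record a URL for a crawl and report whether it was new. SADD returns
// the number of elements actually added, so 1 means first sighting.
async function lockURL(crawlId: string, url: string): Promise<boolean> {
  const added = await redis.sadd(`crawl:${crawlId}:visited_unique`, url);
  return added === 1;
}

// Count of distinct URLs seen so far, used to enforce the crawl limit.
async function uniqueURLCount(crawlId: string): Promise<number> {
  return redis.scard(`crawl:${crawlId}:visited_unique`);
}
```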
- Update V0 and V1 status controllers
- Fix status logic and type definitions
- Improve status handling reliability

Based on commit 6637dce from the original firecrawl repository
- Add Winston dependency for structured logging
- Update TypeScript config to support ES2022 features
- Replace basic logging with JSON format and metadata
- Add zero data retention and error serialization
- Update controllers and workers with contextual logging

Based on commit 4a6b46d from the original firecrawl repository
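For reference, a minimal winston setup along these lines produces JSON logs with timestamps and serialized errors; the exact fields and transports in this PR may differ:

```typescript
import winston from "winston";

// JSON-formatted logger with timestamps and service metadata, so log
// lines can be ingested by structured log pipelines.
export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL ?? "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }), // serialize Error objects
    winston.format.json()
  ),
  defaultMeta: { service: "firecrawl-api" },
  transports: [new winston.transports.Console()],
});
```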
- Replace console.log with Logger calls across all files
- Remove commented out console.log statements
- Enhance error logging with structured metadata
- Improve logging consistency and professionalism

Based on commit 4c49bb9 from the original firecrawl repository
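The pattern of that replacement, roughly (logger from the winston sketch above; field names illustrative):

```typescript
import { logger } from "./logger"; // the winston logger sketched above

// Before: console.log("Failed to scrape " + url + ": " + err);
// After: a structured message plus machine-readable metadata.
function reportScrapeFailure(url: string, jobId: string, err: unknown) {
  logger.error("Failed to scrape URL", {
    url,
    jobId,
    error: err instanceof Error ? err.message : String(err),
  });
}
```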
- Increase file descriptor limit from 65535 to 1048576
- Apply ulimit to API, Worker, and Puppeteer services
- Improve support for high-concurrency web scraping scenarios
- Prevent "Too Many Open Files" errors under load

Based on commit f0a1a2e from the original firecrawl repository
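In docker-compose terms, the resulting configuration looks roughly like this (service names assumed; the same block would be repeated for the Worker and Puppeteer services):

```yaml
services:
  api:
    ulimits:
      nofile:
        soft: 1048576   # raised from 65535
        hard: 1048576
  # worker and the puppeteer service get the same ulimits block
```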
Hi! I'm also awaiting some updates on firecrawl-simple... Does your pull request add either the Delete or Map endpoints, or the crawl options ignoreQueryParameters or maxDiscoveryDepth?
Hi,
I know this pull request is quite unorthodox, as it includes too many different features.
Please hear me out: I tried to get firecrawl-simple updated with the fixes that have been applied to the original firecrawl repository. This is quite a journey, as there have been nearly 2,400 commits since this fork was made.
I reduced the number by filtering out:
I got down to about 900 commits, which I then split into several categories: mainly performance, HTML conversion, bug fixes, library updates, and security fixes.
This is the first set of commits I have migrated. Most prominent are the Redis updates to use lazy loading, as most modern implementations do.
I will be fine if this PR is not merged, but please take a look at my commits.
Git log:
commit 66a869c000ea2c8a27b2fbeb2c2f3c1c7b2f2205 (HEAD -> main)
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 20:29:08 2025 +0200
commit 9381721 (origin/main, origin/HEAD)
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:42:02 2025 +0200
commit fa316b2
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:33:28 2025 +0200
commit c000568
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 15:24:38 2025 +0200
commit d238fdd
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:51:22 2025 +0200
commit 4480a2e
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:46:48 2025 +0200
commit fe075a4
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 14:46:32 2025 +0200
commit c4aa67f
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 12:51:48 2025 +0200
commit 9da8569
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 12:41:34 2025 +0200
commit 853dfee
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 10:43:10 2025 +0200
commit 7796f32
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 00:53:01 2025 +0200
commit af8f411
Author: Daniel Andersen [email protected]
Date: Wed Sep 24 00:30:42 2025 +0200