Feat/bright data tool #21
base: main
Conversation
This commit adds a new Bright Data tool for scraping web content, taking screenshots, performing search queries, and extracting structured data from various websites. The implementation follows the same pattern as other tools in the strands framework.
Key features:
- Web scraping with Markdown output
- Screenshot capture
- Search engine queries with advanced parameters
- Structured data extraction from various sources
Tests are included for all functionality.
Hi @ryanycoleman, can you please review this PR? Thanks!
@cagataycali can you please conduct a last review? We want to move forward with posting some content around this integration.
We definitely would love to have the tool implementation merged soon; I left two small comments on the implementation ^^
…tools into feat/bright-data-tool
Head branch was pushed to by a user without write access
@cagataycali - Thank you for clarifying, all comments are addressed in the above commit 9d747b3. Let me know if there's anything to add or revise in order to merge.
I've tested locally and it looks amazing, but during testing I hit some errors; I left them as comments on the code with suggestions.
Pasted full version here as well: https://gist.github.com/cagataycali/83b166f8f31624bfbce22bea6d37e0b6
Hit two critical blockers:
- Zone parameter is not passed to the constructor
- Available zones are not visible to the agent
After applying the changes I started seeing success:
# can you try bright data to search strands agents
The user wants me to use the bright_data tool to search for "strands agents". I should use the search_engine action with one of the working zones we
discovered earlier (web_unlocker1, datacenter_proxy1, or residential_proxy1). I'll use web_unlocker1 since that worked well in our previous tests.
ℹ 🔧 Starting bright_data...
[Bright Data] Request: https://www.google.com/search?q=strands%20agents&hl=en&gl=us&num=10
✔ bright_data completed in 2.93s
The bright_data tool worked perfectly! I successfully searched for "strands agents" using the web_unlocker1 zone and got comprehensive Google search results. The search returned great results showing:
1. **Official Strands Agents website** - strandsagents.com
2. **AWS blog post** introducing Strands Agents SDK
3. **GitHub repository** - strands-agents/sdk-python
4. **AWS documentation** on Strands Agents
5. **YouTube videos** from AWS Developers about Strands Agents
6. **Community posts** on Reddit and DEV.to
7. **Social media mentions** from AWS AI teams
The search results show that Strands Agents has good visibility and presence across multiple platforms - official sites, GitHub, AWS documentation, YouTube tutorials, community discussions, and social media. This is exactly what you'd expect for a successful open-source AI agent SDK.
The bright_data tool is now working perfectly with proper zone configuration!🎉 **Perfect! Bright Data search working flawlessly**
✅ **Search Results for "strands agents":**
**🏠 Official Presence:**
- **StrandsAgents.com** - Main website with documentation
- **AWS Blog** - "Introducing Strands Agents" official announcement
- **GitHub repo** - strands-agents/sdk-python with full source code
- **AWS Documentation** - Prescriptive Guidance integration
**📺 Video Content:**
- **AWS Developers YouTube** - Multiple tutorial videos
- **Model Driven Agents** - 10+ minute deep dive
- **Hands-On Guide** - Building intelligent agents walkthrough
**🌐 Community Engagement:**
- **Reddit r/aws** - Beginner-friendly guide discussions
- **DEV.to** - First impressions and technical reviews
- **Medium articles** - Technical implementation guides
- **X/Twitter** - AWS AI team official updates
**📊 Search Performance:**
- **127M results** in 0.20 seconds ⚡
- **Multiple content types** - docs, videos, social, tutorials
- **Strong SEO presence** - top results for brand queries
- **Active community** - recent posts and engagement
**🚀 Bright Data delivered exactly what we needed** - comprehensive web search with proper zone configuration using `web_unlocker1`. The tool is production-ready! 🌐
Hi @cagataycali,
Thanks for your thorough testing and feedback! I've addressed the core issues you identified:
Issues Fixed:
1. Zone Parameter Bug
You were right - there was a critical bug where the zone wasn't being passed correctly to `BrightDataClient`.
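For illustration, a minimal sketch of what the fix looks like, assuming the client accepts a `zone` keyword; the constructor and the `make_client` helper shown here are placeholders rather than the actual implementation:
```python
class BrightDataClient:
    """Stand-in for the real client; the actual constructor may differ."""

    def __init__(self, api_key: str, zone: str):
        self.api_key = api_key
        # Previously the zone never reached the client; now it is stored and
        # used for every request the client issues.
        self.zone = zone


def make_client(api_key: str, zone: str) -> BrightDataClient:
    # The configured zone is forwarded to the constructor instead of being dropped.
    return BrightDataClient(api_key=api_key, zone=zone)
```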
2. Zone Configuration Approach
Instead of exposing zone discovery (`get_active_zones`), I've implemented an approach that sticks with our best practices:
Solution: Zone configuration via environment variable:
BRIGHTDATA_API_KEY=your_api_key
BRIGHTDATA_ZONE=web_unlocker_12345 # User's specific zone
The tool now:
- Checks for the `BRIGHTDATA_ZONE` environment variable first
- Falls back to "web_unlocker1" if not set (if you are a new customer and this is your first time opening a zone, this is the name you will get) - see the sketch below
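A minimal sketch of this resolution order, assuming a small helper inside the tool (the helper name is hypothetical; only the variable name and the default come from the description above):
```python
import os

# Default name Bright Data assigns to a new customer's first Web Unlocker zone.
DEFAULT_ZONE = "web_unlocker1"


def resolve_zone() -> str:
    # Prefer the user's explicitly configured zone, otherwise fall back to the default.
    return os.environ.get("BRIGHTDATA_ZONE", DEFAULT_ZONE)
```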
Why NOT `get_active_zones`:
Per best practices:
- Agents shouldn't make infrastructure decisions - Zone selection is infrastructure config, not agent logic
- Security principle - Don't expose enumeration of infrastructure resources
- Prevents misuse - Stops agents from accidentally using datacenter/residential zones that will fail (in most cases here, they will just get blocked)
The main idea here is that ONLY zones of type web_unlocker will be used - not residential, datacenter, or any other zones - since those are not intended for use with agents. The goal is to make the web unlockable using our Web Unlocker and data feeds; with datacenter or other proxies this is not an option, since they lack the underlying unlocking mechanism that the Unlocker has.
How This Solves Your "zone not found" Error:
Your error happened because "unlocker" wasn't a valid zone name in your account. Now users:
- Create their Web Unlocker zone with ANY name (e.g., "web_unlocker_12345")
- Set `BRIGHTDATA_ZONE=web_unlocker_12345` in .env
- Tool automatically uses their configured unlocker zone
No agent decision-making, just clean environment-based configuration.
Testing Your Scenario:
With these changes, your test case now works perfectly:
# My zone "web_unlocker1" is configured in .env
BRIGHTDATA_ZONE=web_unlocker1
BRIGHTDATA_API_KEY=<my api key>
```python
from strands import Agent
from strands.models.litellm import LiteLLMModel
from strands_tools.bright_data import bright_data
import os
from dotenv import load_dotenv

load_dotenv()

# Configure model
model = LiteLLMModel(
    client_args={"api_key": os.getenv("OPENAI_API_KEY")},
    model_id="openai/gpt-4o",
)

# Create agent with bright_data tool
agent = Agent(
    model=model,
    tools=[bright_data],
    system_prompt="You are a web assistant. For any bright_data tool usage, if you fail, return the exact error message.",
)

# Agent automatically determines when and how to use the tool
print(agent("What's the weather in San Francisco? Search the web for current conditions."))
print("\n" + "=" * 50 + "\n")
print(agent("Get me the main content from https://www.python.org"))
print("\n" + "=" * 50 + "\n")
print(agent("Find recent news about artificial intelligence"))
```
The tool is now ready with proper zone handling, no issues with client creation, and clear configuration through environment variables.
I've added the additional environment variable to the README file as well to make sure this is clear.
Thank you for collaborating! I noticed some of the tests are failing: https://github.com/strands-agents/tools/actions/runs/17356167237/job/49422649370?pr=21
And there's a small lint issue: https://github.com/strands-agents/tools/actions/runs/17356167237/job/49422649190?pr=21#step:5:18
After these fixes we're ready to ship 🚀
Description
This commit adds a new Bright Data tool for scraping web content, taking screenshots, performing search queries, and extracting structured data from various websites and data feeds. The implementation follows the same pattern as other tools in the strands framework.
Key features:
- Web scraping with Markdown output
- Screenshot capture
- Search engine queries with advanced parameters
- Structured data extraction from various sources
Tests are included for all functionality.
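As a reference for reviewers, a hedged usage sketch of the features above via direct tool calls. The tool name and the `search_engine` action appear earlier in this thread; the other action and parameter names (`scrape_as_markdown`, `url`, `query`) are assumptions about the tool's schema, not confirmed by this PR. It assumes `BRIGHTDATA_API_KEY` (and optionally `BRIGHTDATA_ZONE`) are set in the environment.
```python
from strands import Agent
from strands_tools.bright_data import bright_data

agent = Agent(tools=[bright_data])

# Run a search query through the configured Web Unlocker zone.
results = agent.tool.bright_data(action="search_engine", query="strands agents")

# Scrape a page as Markdown (action name assumed).
page = agent.tool.bright_data(action="scrape_as_markdown", url="https://www.python.org")

print(results)
print(page)
```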
Type of Change
Testing
hatch fmt --linter
hatch fmt --formatter
hatch test --all
All of the above tests passed locally.
Checklist
I have read the CONTRIBUTING document
I have added tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature
My changes generate no new warnings
Any dependent changes have been merged and published
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.