A high-performance web scraper that counts words within HTML tags using FastAPI and async operations.
- Asynchronous HTTP requests with connection pooling
- HTTP client-side caching with SQLite or memory
- Automatic retries with exponential backoff and jitter
- Parallelism automatically adjusted to the number of available CPUs
- Optimized TCP connections with DNS caching
- Custom request headers for better compatibility
- Efficient regex pattern matching with pre-compilation
- Comprehensive unit testing with mocking
- Proper error handling and timeouts
- Semaphore-based concurrency control
- Improved FastAPI caching with customizable expiration
- Clone the repository
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
You can also run the application using Docker:
# Build the Docker image
docker build -t fastapi-counter-href .
# Run the container
docker run -d -p 8080:80 --name fastapi-counter fastapi-counter-href
# Check the health status
docker exec fastapi-counter curl -f http://localhost/health || exit 1
The application provides the following endpoints:
- GET / - Welcome message and basic information
- GET /health - Health check endpoint
- GET /href/{url} - Count href tags in the specified URL
  - Example: GET /href/https://www.python.org
  - Returns: JSON with the count of href tags found
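For a quick manual check, assuming the server is listening on localhost:8080 (as in the run examples below), the endpoint can also be called from a short script. This snippet is purely illustrative and is not part of the repository:

```python
# Illustrative client - assumes the API is reachable on localhost:8080.
import asyncio

import aiohttp


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # The page to analyze is passed as the path portion of the /href endpoint.
        url = "http://localhost:8080/href/https://www.python.org"
        async with session.get(url) as resp:
            print(resp.status, await resp.json())


asyncio.run(main())
```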
Start the FastAPI server:
# Development mode
uvicorn main:app --reload --host 0.0.0.0 --port 8080
# Production mode
uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4
Run the tests with coverage:
# Run all tests with coverage report
pytest --cov=. --cov-report=term --cov-report=html
# Run specific tests
pytest tests/test_helpers.py
pytest tests/test_helpers_advanced.py # Advanced tests for optimizations
pytest tests/test_api_advanced.py # API performance tests
This will generate coverage reports in both terminal and HTML formats.
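The repository's actual tests live under tests/. As a rough illustration of the testing-with-mocking approach mentioned in the feature list, a test module might exercise the API through FastAPI's TestClient while patching the helper layer. Everything below (module paths, the patched name, the return value) is an assumption, not a copy of the real test suite:

```python
# Hypothetical sketch only. It assumes main.py exposes `app`, imports `results`
# from helpers, and awaits it inside the /href route - adjust names as needed.
from unittest.mock import AsyncMock, patch

from fastapi.testclient import TestClient

from main import app


def test_health_endpoint():
    # Use the client as a context manager so startup events run.
    with TestClient(app) as client:
        assert client.get("/health").status_code == 200


def test_href_endpoint_with_mocked_results():
    # Patch the helper so no real HTTP request is made; the mocked return
    # value is a placeholder and may not match the real response schema.
    with patch("main.results", new=AsyncMock(return_value=2)):
        with TestClient(app) as client:
            response = client.get("/href/https://www.python.org")
    assert response.status_code == 200
```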
To evaluate the impact of the optimizations, use the performance testing script:
# Make sure the application is running on localhost:8080
python performance_test.py
This script will generate comparative graphs showing performance improvements.
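The script itself is not reproduced in this README. As a rough, hypothetical illustration of the approach (timing repeated requests against the running API and plotting the latencies), a stripped-down version could look like the following; the real script and its output differ:

```python
# Simplified, hypothetical stand-in for performance_test.py: time repeated
# requests against the running API and plot the per-request latency.
import asyncio
import time

import aiohttp
import matplotlib.pyplot as plt

URL = "http://localhost:8080/href/https://www.python.org"


async def measure(n: int = 20) -> list[float]:
    latencies = []
    async with aiohttp.ClientSession() as session:
        for _ in range(n):
            start = time.perf_counter()
            async with session.get(URL) as resp:
                await resp.read()
            latencies.append(time.perf_counter() - start)
    return latencies


latencies = asyncio.run(measure())
plt.plot(latencies, marker="o")
plt.xlabel("request #")
plt.ylabel("latency (s)")
plt.title("Per-request latency (caching should flatten the tail)")
plt.savefig("latency.png")
```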
The main functionality is provided through the helpers.py module which offers:
- get_session(): Returns an optimized HTTP session with connection pooling and caching
- fetch(session, url, max_retries=3): Fetches content from a URL with error handling and retries
- search_tag(data, pattern): Searches for words within HTML tags using optimized regex
- results(url): Combines the above to analyze a URL with caching
- cleanup(): Properly closes resources when done
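As an illustration of the kind of session configuration get_session() provides, a minimal sketch using plain aiohttp is shown below. It mirrors the session-level optimizations listed next (pool size, DNS cache TTL, custom headers, timeout); the real function additionally wires in the SQLite/in-memory HTTP cache, which is omitted here:

```python
# Sketch of an optimized aiohttp session. The pool size, DNS TTL, header and
# timeout values mirror the optimizations listed below; the real get_session()
# also enables the SQLite/in-memory HTTP cache, omitted in this sketch.
import aiohttp


def make_session(timeout_s: float = 10.0) -> aiohttp.ClientSession:
    connector = aiohttp.TCPConnector(
        limit=100,          # connection pool limit
        ttl_dns_cache=300,  # cache DNS lookups for 5 minutes
    )
    return aiohttp.ClientSession(
        connector=connector,
        timeout=aiohttp.ClientTimeout(total=timeout_s),
        headers={"User-Agent": "fastapi-counter-href"},  # placeholder header value
    )
```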
- Connection pooling (100 connections limit)
- DNS caching (5 minutes TTL)
- Custom optimized headers
- Configurable timeouts
- Proper resource cleanup
- Automatic retries with exponential backoff and jitter (see the sketch after this list)
- HTTP client with cache using SQLite or memory
- Pre-compiled regex patterns with IGNORECASE and DOTALL flags
- Efficient list comprehensions and direct sum
- Early returns for empty data
- Semaphore with dynamic adjustment based on available CPUs
- Proper error handling and exception management
- FastAPI cache with customizable expiration
- LRU-cached settings
- Pre-defined response objects
- Input validation with Path parameters
- HTTP client-side caching to reduce repetitive requests
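Several of the items above (pre-compiled regex with IGNORECASE and DOTALL, early returns, retries with exponential backoff and jitter, a CPU-sized semaphore) are standard Python/asyncio patterns. The sketch below shows their general shape; names and constants are illustrative rather than copied from helpers.py:

```python
# Generic sketch of the retry / concurrency / regex techniques listed above;
# constants and names are illustrative, not taken from helpers.py.
import asyncio
import os
import random
import re

import aiohttp

# Pattern compiled once at import time, with the flags mentioned above.
HREF_PATTERN = re.compile(r"href", re.IGNORECASE | re.DOTALL)

# Semaphore sized from the number of available CPUs (scaling factor is arbitrary).
SEMAPHORE = asyncio.Semaphore((os.cpu_count() or 1) * 2)


def count_matches(data: str) -> int:
    if not data:  # early return for empty data
        return 0
    return sum(1 for _ in HREF_PATTERN.finditer(data))


async def fetch_with_retries(
    session: aiohttp.ClientSession, url: str, max_retries: int = 3
) -> str:
    for attempt in range(max_retries):
        try:
            async with SEMAPHORE:  # bound the number of concurrent requests
                async with session.get(url) as resp:
                    resp.raise_for_status()
                    return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus random jitter.
            await asyncio.sleep(2 ** attempt + random.uniform(0, 1))
    return ""  # not reached when max_retries >= 1
```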
The behavior can be modified by adjusting the settings in conf/settings.py:
- cache_expire: Cache expiration time in seconds
- timeout: HTTP request timeout in seconds
- pattern: Regex pattern for searching
- cache_backend: Cache backend type ("sqlite", or any other value for in-memory)
- cache_db_path: SQLite database path when the sqlite backend is used
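The exact contents of conf/settings.py are not reproduced here. As a plain-Python illustration of the knobs listed above (the real module may use a different mechanism, e.g. pydantic settings, and different default values):

```python
# Illustrative stand-in for conf/settings.py - field names follow the list
# above, default values are examples only.
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cache_expire: int = 60           # cache expiration time in seconds
    timeout: float = 10.0            # HTTP request timeout in seconds
    pattern: str = r"href"           # regex pattern to search for
    cache_backend: str = "sqlite"    # "sqlite", or any other value for in-memory
    cache_db_path: str = "cache.db"  # SQLite path when the sqlite backend is used


settings = Settings()
```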
The code includes proper error handling for:
- Network timeouts
- Invalid URLs
- HTTP errors
- Connection limits
- Malformed HTML
- Invalid request parameters
All resources are properly managed:
- Sessions are reused and properly closed
- Connection pools are efficiently utilized
- Memory usage is optimized with generators and comprehensions
- DNS cache reduces lookup overhead
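One common way to guarantee that the shared session is closed when the service stops is to call the cleanup() helper from FastAPI's lifespan hook. The snippet below is a sketch of that pattern, not a copy of main.py, and it assumes cleanup() is a coroutine:

```python
# Sketch: tie helpers.cleanup() to application shutdown via FastAPI's lifespan.
# Assumes cleanup() is async; if it is synchronous in helpers.py, drop the await.
from contextlib import asynccontextmanager

from fastapi import FastAPI

import helpers


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield                    # application runs here
    await helpers.cleanup()  # close the shared session / connection pool


app = FastAPI(lifespan=lifespan)
```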
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with FastAPI
- Uses aiohttp for async HTTP requests
- Performance testing with matplotlib
- Testing framework: pytest