This project provides a simple API, packaged as a Docker container, that uses Mozilla's Readability.js
library (the engine behind Firefox Reader View) to extract clean, article-like content from raw HTML. It's perfect for:
- LLM (Large Language Model) Analysis: Get clean text for summarization, sentiment analysis, or knowledge base building.
- Content Display: Strip away distractions and present main article content.
- Data Processing: Extract key article components (title, author, content).
Features:
- HTML to Article Extraction: Uses @mozilla/readability to parse HTML and extract the main article content.
- HTML Sanitization: Employs DOMPurify to ensure the output HTML content is safe and free from malicious scripts (XSS).
- Flexible Text Output: Get either the full structured JSON output or a clean, paragraph-separated plain text output (ideal for LLMs).
- Token-Based Authentication: Protect your API with a simple secret token.
- Containerized: Easy to deploy anywhere Docker runs, including platforms like Coolify or DockPloy.
- Multi-Architecture Image: The official image (ghcr.io/imad07mos/readabilityjs-api:latest) is built to work on both linux/amd64 (Intel/AMD) and linux/arm64 (ARM, e.g., M1/M2 Mac, AWS Graviton) hosts, ensuring broad compatibility.
This section guides you through getting the API up and running on your local machine for testing.
Before you begin, ensure you have:
- Docker Desktop: (Recommended) Installed and running on your machine.
- curl: A command-line tool for making HTTP requests (usually pre-installed on Linux/macOS, available for Windows).
Clone this repository to your local machine:
git clone https://github.com/imad07mos/readabilityjs-api.git
cd readabilityjs-api
For local development, it's best practice to keep your secret token in a .env file. Docker Compose automatically loads variables from this file.
- In the root of your readabilityjs-api directory (where docker-compose.yml is), create a new file named .env.
- Add the following line to the .env file. Choose a strong, unique string for your token!
SECRET_TOKEN="your_super_secret_local_dev_token_123"
Important: Always add .env to your .gitignore file to prevent accidentally committing your secrets to version control!
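One way to produce a strong token for the .env file is to read random bytes from the operating system. The sketch below assumes a Unix-like shell where /dev/urandom, head, od, and tr are available; the function name generate_token is illustrative and not part of this project.

```shell
# Hypothetical helper (not part of this project): print a random hex token.
# Assumes /dev/urandom, head, od and tr are available (Linux/macOS).
generate_token() {
  gt_bytes="${1:-32}"   # number of random bytes; 32 bytes -> 64 hex characters
  # od prints the bytes as space-separated hex; tr removes spaces and newlines.
  head -c "$gt_bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Example (overwrites any existing .env in the current directory):
#   printf 'SECRET_TOKEN="%s"\n' "$(generate_token)" > .env
```

A hex-only token also avoids characters that would need URL-encoding when the token is passed as a query parameter.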
Now, start the API using Docker Compose. This command will build the Docker image (if you have local code changes) or pull it from a registry, and then run your API service.
docker-compose up --build -d
The --build flag ensures any local changes to your app.js or Dockerfile are included in the image, and -d runs the containers in "detached" mode (in the background).
You can check if the API container is up and healthy:
docker ps
You should see a container named readabilityjs-api-readability-api-1 (or similar) with status Up.
You can also check the health endpoint:
curl http://localhost:3000/health
Expected Output: OK
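Right after startup the container may need a moment before the health endpoint answers. A small retry loop can help in scripts; the following is a minimal sketch, and the helper name wait_until is an assumption made for illustration, not something this project ships.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_i=1
  while [ "$wu_i" -le "$wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0                      # the command succeeded
    fi
    wu_i=$((wu_i + 1))
    if [ "$wu_i" -le "$wu_attempts" ]; then
      sleep "$wu_delay"             # pause before the next attempt
    fi
  done
  return 1                          # every attempt failed
}

# Example (polls the health endpoint up to 10 times, 1 second apart):
#   wait_until 10 1 curl -fsS http://localhost:3000/health
```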
Now, let's send a sample HTML request to your running API. Remember to replace your_super_secret_local_dev_token_123 with the token you set in your .env file.
curl -X POST "http://localhost:3000/readability?token=your_super_secret_local_dev_token_123" \
-H "Content-Type: application/json" \
-d '{
"html": "<!DOCTYPE html><html><head><title>My Awesome Article</title></head><body><h1>Main Title</h1><p>This is some content.</p><h2>Another Heading</h2><p>More paragraphs here, trying to make it somewhat interesting.</p><script>alert(\"evil script!\")</script></body></html>",
"url": "http://example.com/blog/sample-article"
}'
Expected Output (JSON): You should receive a JSON response containing the extracted article details, including content (sanitized HTML, with the script tag removed!) and improvedTextContent (the clean, LLM-friendly text).
Let's understand the important parameters that control this API: the token for security and the strip parameter for output format.
- What it is: The token is a security measure, acting like a secret password that grants you permission to use the API.
- How to use it: You must include your token in the URL of your API requests as a query parameter (e.g., ?token=YOUR_SECRET_TOKEN). The API will check if this token matches the one it expects.
- Why it's important:
  - Security: Protects your API from unauthorized access and misuse.
  - Resource Control: Helps prevent unintended or malicious actors from consuming your server's resources.
- Error Responses:
  - If you don't provide a token: You'll receive a 401 Unauthorized error.
  - If you provide a token that doesn't match: You'll receive a 403 Forbidden error.
- Best Practice: Always use a very long, random, and cryptographically strong token, especially in production environments.
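The error responses above can be told apart in a script by looking at the HTTP status code. The sketch below maps the two documented codes to short messages; the function name, the messages, and the assumption that a successful request returns 200 are illustrative and not taken from the API itself.

```shell
# Hypothetical helper: translate a status code into a short message.
# 401 and 403 follow the behaviour described above; 200 for success is assumed.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "missing token" ;;
    403) echo "token does not match" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Example (assumes the API is running locally and curl is installed).
# No token is sent here, so the API would be expected to answer 401:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
#     -H 'Content-Type: application/json' -d '{"html":"<p>hi</p>"}' \
#     "http://localhost:3000/readability")
#   explain_status "$code"
```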
- What it is: The strip parameter is an optional setting you can add to your request URL (e.g., ?token=YOUR_SECRET&strip=true). It controls the format of the API's response.
- How to use it:
  - Append &strip=true to your URL: The API will return only the clean, plain text of the extracted article content. This text is specifically formatted with newlines for paragraphs and headings, making it ideal for direct input into LLMs. The Content-Type of the response will be text/plain.
  - If you omit &strip=true (or use &strip=false): The API will return a full JSON object. This JSON includes the article's title, excerpt, the sanitized HTML content, the original rawTextContent (from Readability, often concatenated), and the improved, LLM-friendly text in improvedTextContent. The Content-Type will be application/json.
- Why it's useful for LLMs: Large Language Models generally prefer clean, structured plain text without HTML tags. The strip=true option provides this exact format, simplifying your LLM input pipeline.
Example curl for plain text output:
curl -X POST "http://localhost:3000/readability?token=your_super_secret_local_dev_token_123&strip=true" \
-H "Content-Type: application/json" \
-d '{
"html": "<!DOCTYPE html><html><head><title>My Awesome Article</title></head><body><h1>Main Title</h1><p>This is some content.</p><h2>Another Heading</h2><p>More paragraphs here, trying to make it somewhat interesting.</p></body></html>",
"url": "http://example.com/blog/sample-article"
}'
Expected Plain Text Output:
Main Title
This is some content.
Another Heading
More paragraphs here, trying to make it somewhat interesting.
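When both output formats are used from the same script, the request URL can be assembled by a small helper. This is only a sketch: the function name readability_url is hypothetical, and it assumes the token contains only URL-safe characters (for example hex digits), since no URL-encoding is applied.

```shell
# Hypothetical helper: build the request URL from a base URL, a token,
# and an optional third argument ("true" requests plain text output).
# Assumes the token needs no URL-encoding.
readability_url() {
  ru_url="$1/readability?token=$2"
  if [ "${3:-false}" = "true" ]; then
    ru_url="$ru_url&strip=true"     # plain text response (text/plain)
  fi
  echo "$ru_url"                    # without strip: JSON response
}

# Example:
#   curl -X POST "$(readability_url http://localhost:3000 "$SECRET_TOKEN" true)" \
#     -H "Content-Type: application/json" \
#     -d '{"html": "<p>Hello</p>", "url": "http://example.com/"}'
```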
This section details how to deploy your API using self-hosting platforms that leverage Docker Compose, such as Coolify or DockPloy.
Your Docker image is already built for multiple architectures (linux/amd64 and linux/arm64) and available on GitHub Container Registry.
Here's the docker-compose.yml content you will use on your deployment platform:
version: '3.8'
services:
  readability-api:
    # This specifies the Docker image to pull from GitHub Container Registry.
    # It's multi-architecture, so Coolify will automatically pick the right version for its server.
    image: ghcr.io/imad07mos/readabilityjs-api:latest
    # The internal port your Node.js application listens on (port 3000).
    # Coolify will map this to an external port and set up routing/SSL automatically.
    ports:
      - "3000"
    # Environment variables passed to your container.
    # IMPORTANT: The SECRET_TOKEN must be set securely in Coolify/DockPloy's UI!
    environment:
      SECRET_TOKEN: ${SECRET_TOKEN} # This tells Docker Compose to get the value from the environment.
      PORT: 3000
    # Ensures your container restarts automatically if it crashes or the Docker daemon restarts.
    restart: unless-stopped
It is critical not to hardcode your actual SECRET_TOKEN value directly into the docker-compose.yml file if you commit this file to a public Git repository!
Instead, you should manage this secret securely within your deployment platform's UI:
- Go to your Coolify (or DockPloy) instance.
- Create a New Application/Service.
- Choose the "Docker Compose" deployment method.
- Paste the docker-compose.yml content provided above into the configuration area.
- Find the "Environment Variables" or "Secrets" section within your application's settings in the Coolify/DockPloy UI.
- Add a new environment variable:
  - Name: SECRET_TOKEN
  - Value: Paste your strong, unique production secret token here.
- Complete any remaining deployment steps (e.g., adding a custom domain for easy access).
Coolify/DockPloy will securely inject this SECRET_TOKEN into your running container's environment, keeping it safe from being exposed in your source code.
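To guard against a hardcoded secret slipping into the compose file before it is committed, a simple text check can be run locally. The following is a rough sketch under stated assumptions: the function name compose_secret_is_safe is hypothetical, and the check only looks for SECRET_TOKEN entries that do not reference the ${SECRET_TOKEN} placeholder; it does not parse YAML.

```shell
# Hypothetical check: succeed only if every SECRET_TOKEN entry in the given
# compose file uses the ${SECRET_TOKEN} placeholder instead of a literal value.
compose_secret_is_safe() {
  # 1) find lines that assign SECRET_TOKEN (YAML "KEY: value" or "- KEY=value")
  # 2) drop the ones that reference the placeholder
  # 3) anything left over is treated as a hardcoded value
  ! grep -E '^[[:space:]-]*SECRET_TOKEN[[:space:]]*[:=]' "$1" \
      | grep -v -F '${SECRET_TOKEN}' \
      | grep -q .
}

# Example:
#   compose_secret_is_safe docker-compose.yml || echo "hardcoded SECRET_TOKEN found"
```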
If you modify the source code (e.g., app.js) or want to update Node.js dependencies (package.json) to get the very latest versions, you'll need to rebuild your own Docker image and push it to your own registry.
- Ensure docker buildx is set up (run this once if you haven't already):
  docker buildx create --name mybuilder --use
- Log in to your Docker registry:
  docker login ghcr.io # Or `docker login` for Docker Hub
- Build and push the multi-architecture image: Navigate to your project's root directory (readabilityjs-api) and run the following single command:
  docker buildx build --platform linux/amd64,linux/arm64 -t ghcr.io/yourusername/readabilityjs-api:latest . --push
  Remember to replace yourusername with your Docker Hub or GitHub username!
You can pass configuration options for Readability.js and DOMPurify within your JSON request body. The API includes internal filtering to ensure only safe and expected options are processed.
Example curl with options:
curl -X POST "http://localhost:3000/readability?token=your_super_secret_local_dev_token_123" \
-H "Content-Type: application/json" \
-d '{
"html": "<p class=\"important\">A short article with an <script>alert(\"XSS\")</script> evil script and an important class.</p>",
"url": "http://example.com/custom-options",
"readabilityOptions": {
"charThreshold": 1000,
"debug": true,
"classesToPreserve": ["important"]
},
"domPurifyOptions": {
"FORBID_TAGS": ["p"],
"USE_PROFILES": { "html": true }
}
}'
To stop and remove your locally running API containers and the associated network:
docker-compose down
This project was built with the aid of Gemini 2.5 Flash Preview 05-20. It is licensed under the MIT License; see the LICENSE file for details.