
URL Detector


A URL detection tool that scans files using Tree-sitter parsers for accurate URL discovery across 20+ programming languages. Instead of simple regex matching, this tool performs AST (Abstract Syntax Tree) parsing to precisely locate URLs in strings, comments, and other appropriate contexts.

The SBOM Gap

Software Bill of Materials (SBOM) generation has become critical for security and compliance, but traditional SBOM tools miss a significant category of external dependencies: URLs embedded directly in source code.

Modern package managers and dependency scanners excel at tracking managed dependencies (npm packages, Maven artifacts, etc.), but they can't detect legacy patterns like:

<script src="https://cdn.jsdelivr.net/npm/[email protected]/lodash.min.js"></script>
<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Roboto">
const API_ENDPOINT = "https://api.thirdparty.com/v1";
fetch("https://analytics.example.com/track", { ... });

These URLs represent real external dependencies that can impact security, availability, and compliance - but they won't appear in any SBOM generated from package metadata. URL Detector fills this gap by providing comprehensive URL inventory that complements traditional dependency tracking tools.

Features

  • 🌐 20+ Language Support: JavaScript, TypeScript, Java, C/C++, C#, HTML, CSS, Python, PHP, Ruby, Go, Scala, JSON, XML, TOML, Bash, Swift, Kotlin, and more
  • 🌳 AST-Based Parsing: Uses Tree-sitter for accurate tokenization and context-aware URL detection
  • 🚀 High Performance: Concurrent file processing with configurable concurrency limits
  • 📊 Multiple Output Formats: Table, JSON, and CSV output with customizable formatting
  • 🎯 Advanced Filtering: Domain allowlists/blocklists with wildcard support, protocol filtering, and regex fallback
  • 📍 Precise Location Tracking: Line numbers, columns, and character positions for each URL
  • 🔍 Context Detection: Finds URLs in string literals, comments, and appropriate language constructs
  • 🛡️ False Positive Filtering: Automatically excludes common DTD/schema identifier patterns (//W3C//DTD, //EN, etc.) that look like protocol-relative URLs
  • ⚙️ Highly Configurable: Extensive CLI options and programmatic API
  • 📦 Zero Config: Easy setup without complex configuration

Installation

To suppress warnings from tree-sitter's transitive dependencies, any of these commands can optionally be run with the --loglevel=error flag.

Global Installation (Recommended)

npm install -g @morgan-stanley/url-detector

Local Installation

npm install @morgan-stanley/url-detector

NPX Usage (No Installation)

npx @morgan-stanley/url-detector --scan "src/**/*.js" --format table

Quick Start

Command Line Interface

# Scan all files in current directory
url-detector

# Scan specific files/patterns
url-detector --scan "src/**/*.{js,ts}" --format table

# Exclude directories and ignore domains
url-detector --scan "**/*" --exclude "**/node_modules" --ignore-domains "*example.com"

# Export results to CSV
url-detector --scan "src/**/*" --format csv --output urls.csv

# Run in CI/CD (fail if URLs found)
url-detector --scan "**/*.js" --fail-on-error --results-only

Programmatic Usage

import { URLDetector, LanguageManager, ConsoleLogger } from '@morgan-stanley/url-detector';

// Basic usage
const detector = new URLDetector();
const sourceCode = `
const apiUrl = "https://api.example.com/v1/users";
// Documentation: https://docs.example.com
`;

const urls = detector.detectURLs(sourceCode, 'javascript', 'app.js');
console.log(urls);

// Advanced usage with custom options
const customDetector = new URLDetector({
    includeComments: true,
    ignoreDomains: ['*.example.com', 'localhost'],
    protocol: ['https'],
    unique: true
}, new ConsoleLogger());

// Custom language configurations
const customLanguageManager = new LanguageManager(undefined, [
    { name: 'mylang', module: 'tree-sitter-mylang', extensions: ['.ml'] }
]);

CLI Options

Option Description Default
-s, --scan <patterns...> Glob patterns for files to scan ["**/*"]
-e, --exclude <patterns...> Glob patterns for files to exclude []
-i, --ignore-domains <domains...> Additional domains to ignore (supports wildcards, always includes www.w3.org) []
--include-comments Also scan commented-out lines for URLs false
--include-non-fqdn Include non-fully qualified domain names like "localhost" false
-f, --format <format> Output format: table, json, or csv "table"
-o, --output <file> Output file path (stdout if not specified) null
-q, --quiet Run in quiet mode with no console output false
--results-only Show only results, suppressing progress and info messages false
--fail-on-error Exit with non-zero code if any URLs are found false
--concurrency <number> Maximum number of files to scan concurrently 10
--scan-file <file> File containing glob patterns to scan (one per line) null
--exclude-file <file> File containing glob patterns to exclude (one per line) null
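
Both --scan-file and --exclude-file read one glob pattern per line, which keeps long pattern lists out of the command line. A minimal sketch, with hypothetical file names:

# scan-patterns.txt might contain, one pattern per line:
#   src/**/*.{js,ts}
#   docs/**/*.html
# exclude-patterns.txt might contain:
#   **/node_modules
#   **/dist
url-detector --scan-file scan-patterns.txt --exclude-file exclude-patterns.txt --format table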

Supported Languages

Language Extensions Tree-sitter Parser
JavaScript .js, .mjs tree-sitter-javascript
TypeScript .ts, .tsx tree-sitter-typescript
Java .java tree-sitter-java
C .c, .h tree-sitter-c
C++ .cpp, .cc, .cxx, .hpp, .hh, .hxx tree-sitter-cpp
C# .cs tree-sitter-c-sharp
Python .py, .pyw tree-sitter-python
PHP .php, .phtml tree-sitter-php
Ruby .rb, .rake, .gemspec tree-sitter-ruby
Go .go tree-sitter-go
Swift .swift tree-sitter-swift
Kotlin .kt, .kts @tree-sitter-grammars/tree-sitter-kotlin
Scala .scala, .sc tree-sitter-scala
HTML .html, .htm tree-sitter-html
CSS .css tree-sitter-css
JSON .json, .jsonc tree-sitter-json
XML .xml, .xsd, .xsl, .xslt @tree-sitter-grammars/tree-sitter-xml
TOML .toml @tree-sitter-grammars/tree-sitter-toml
Bash .sh, .bash, .zsh, .fish tree-sitter-bash
YAML .yaml, .yml @tree-sitter-grammars/tree-sitter-yaml

Note: For unsupported file types, the tool automatically falls back to regex-based detection.
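
For example, plain-text files have no Tree-sitter parser in the table above, so a scan like the following would be handled by the regex fallback:

# .txt has no Tree-sitter grammar; these files are scanned via the regex fallback
url-detector --scan "docs/**/*.txt" --format table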

Examples

Basic File Scanning

# Scan all JavaScript and TypeScript files
url-detector --scan "**/*.{js,ts}" --format table

# Scan source code only, exclude build artifacts
url-detector --scan "src/**/*" --exclude "build/**" "dist/**" "**/node_modules"

Domain Filtering

The tool automatically ignores common non-meaningful domains found in code (like www.w3.org in XML namespaces). You can add additional domains to ignore:

# Ignore all example.com subdomains
url-detector --ignore-domains "*.example.com"

# Ignore multiple domain patterns
url-detector --ignore-domains "*.example.com" "localhost" "*.local"

Output Formats

# Table output (default)
url-detector --scan "src/**/*" --format table

# JSON output for programmatic processing
url-detector --scan "src/**/*" --format json --output results.json

# CSV output for spreadsheet analysis
url-detector --scan "src/**/*" --format csv --output urls.csv

CI/CD Integration

# Fail build if any URLs are found
url-detector --scan "**/*" --exclude "**/node_modules" --fail-on-error

# Quiet mode for CI logs
url-detector --scan "src/**/*" --quiet --format json --output scan-results.json

# Results-only mode (no progress messages)
url-detector --scan "**/*" --results-only --format table

API Reference

URLDetector Class

class URLDetector {
    constructor(options?: DetectorOptionsConfig, logger?: Logger);
    detectURLs(sourceCode: string, language: string, filePath?: string): URLMatch[];
    process(): Promise<FileResult[]>;
}
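
As a sketch of the file-scanning path (assuming process() honors the scan and exclude patterns passed via the constructor options documented below):

import { URLDetector } from '@morgan-stanley/url-detector';

// Scan a project programmatically instead of via the CLI
const detector = new URLDetector({
    scan: ['src/**/*.{js,ts}'],
    exclude: ['**/node_modules'],
    ignoreDomains: ['*.example.com']
});

detector.process().then((results) => {
    // The FileResult shape is not reproduced here; inspect the results as needed
    console.log(`${results.length} file result(s)`);
});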

DetectorOptionsConfig Interface

interface DetectorOptionsConfig {
    // File scanning options
    scan?: string[];                  // Glob patterns for files to scan (default: ["**/*"])
    exclude?: string[];               // Glob patterns to exclude (default: [])
    
    // Filtering options
    ignoreDomains?: string[];         // Additional domains to ignore (default: [], always includes `www.w3.org`)
    includeComments?: boolean;        // Include URLs from comments (default: false)
    includeNonFqdn?: boolean;         // Include non-FQDN domains like "localhost" (default: false)
    
    // Output options  
    format?: 'table' | 'json' | 'csv'; // Output format (default: "table")
    output?: string | null;           // Output file path (default: null)
    
    // Control options
    resultsOnly?: boolean;            // Results only mode (default: false)
    failOnError?: boolean;            // Exit with error if URLs found (default: false)
    
    // Performance options
    concurrency?: number;             // Max concurrent files (default: 10)
    
    // Advanced options (programmatic only)
    fallbackRegex?: boolean;          // Use regex fallback when tree-sitter fails (default: true)
    context?: number;                 // Lines of context to include (default: 0)
    maxDepth?: number;                // Max directory depth (default: Infinity)
    quiet?: boolean;                  // Suppress informational output (default: false)
}
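
A hedged example combining the advanced, programmatic-only options (the values are illustrative, not recommendations):

import { URLDetector } from '@morgan-stanley/url-detector';

const detector = new URLDetector({
    scan: ['src/**/*'],
    includeComments: true,
    fallbackRegex: true,   // keep the regex fallback when Tree-sitter cannot parse a file
    context: 2,            // capture two surrounding lines in each match's context field
    concurrency: 20,
    quiet: true
});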

URLMatch Structure

interface URLMatch {
    url: string;                      // The detected URL
    start: number;                    // Start character position
    end: number;                      // End character position
    line: number;                     // Line number (1-based)
    column: number;                   // Column number (1-based)
    sourceType: 'string' | 'comment' | 'unknown';  // Context type
    context?: string[];               // Surrounding lines (if requested)
}
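
For instance, these fields are enough to build a simple location report from detectURLs():

import { URLDetector } from '@morgan-stanley/url-detector';

const detector = new URLDetector({ includeComments: true });
const matches = detector.detectURLs(
    'fetch("https://api.example.com/v1"); // see https://docs.example.com',
    'javascript',
    'app.js'
);

for (const match of matches) {
    // Prints the file location and context type for each detected URL
    console.log(`app.js:${match.line}:${match.column} ${match.url} (${match.sourceType})`);
}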

Language Customization

import { LanguageManager, LanguageConfig } from '@morgan-stanley/url-detector';

// Add custom language support
const customLanguages: LanguageConfig[] = [
    {
        name: 'mylang',
        module: 'tree-sitter-mylang',
        extensions: ['.ml', '.mylang'],
        filenames: ['Mylangfile'] 
    }
];

const languageManager = new LanguageManager(undefined, customLanguages);

How It Works

  1. Language Detection: Automatically detects programming language from file extension or filename
  2. AST Parsing: Uses Tree-sitter to parse source code into an Abstract Syntax Tree
  3. Node Traversal: Recursively walks the AST to find string literal and comment nodes
  4. URL Extraction: Applies URL regex patterns to the content of the relevant nodes
  5. Context Analysis: Determines if URLs are in strings, comments, or other contexts
  6. Filtering: Applies domain filters and other criteria
  7. Position Tracking: Calculates precise line/column positions for each URL
  8. Fallback Support: Falls back to regex scanning for unsupported languages
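
The same pipeline can be sketched directly against the tree-sitter API. This is a simplified illustration (node type names are from the tree-sitter-javascript grammar, and default-style imports assume esModuleInterop), not the library's actual implementation:

import Parser from 'tree-sitter';
import JavaScript from 'tree-sitter-javascript';

const URL_PATTERN = /https?:\/\/[^\s"'`<>)]+/g;

// Steps 1-2: pick a grammar and parse the source into an AST
const parser = new Parser();
parser.setLanguage(JavaScript);
const tree = parser.parse('const api = "https://api.example.com"; // see https://docs.example.com');

// Steps 3-5: walk the tree and pull URLs out of string and comment nodes
function walk(node: Parser.SyntaxNode): void {
    if (node.type === 'string_fragment' || node.type === 'comment') {
        for (const match of node.text.matchAll(URL_PATTERN)) {
            console.log(`${node.type}: ${match[0]} (line ${node.startPosition.row + 1})`);
        }
    }
    for (const child of node.children) {
        walk(child);
    }
}

walk(tree.rootNode);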

Performance

  • Concurrent Processing: Processes multiple files simultaneously (configurable concurrency)
  • Memory Efficient: Streams large files and processes incrementally
  • Fast Parsing: Tree-sitter provides high-performance parsing
  • Smart Caching: Reuses parser instances where possible
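
For example, the concurrency limit can be raised for large repositories on fast storage:

# Scan more files in parallel (the default is 10)
url-detector --scan "**/*" --exclude "**/node_modules" --concurrency 20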

Testing

# Run all tests
npm test

# Run tests in watch mode
npm run test:watch

# Run with coverage
npm test -- --coverage

Development

Project Structure

src/
├── index.ts              # Main library entry point
├── cli.ts               # Command-line interface
├── urlDetector.ts       # Core URL detection logic
├── languageManager.ts   # Language/parser management
├── urlFilter.ts         # URL filtering and validation
├── outputFormatter.ts   # Output formatting (table/json/csv)
├── options.ts          # Configuration options
└── logger.ts           # Logging interfaces

tests/
├── urlDetector.test.ts
├── languageManager.test.ts
└── integration.test.ts

Local Development Setup

When cloning this project for local development, you'll need to use the --legacy-peer-deps flag due to complex peer dependencies across Tree-sitter packages:

# Clone the repository
git clone https://github.com/morgan-stanley/url-detector.git
cd url-detector

# Install dependencies with legacy peer deps support
npm install --legacy-peer-deps

Building

# Build TypeScript to JavaScript
npm run build

# Build and watch for changes
npm run dev

# Clean build artifacts
npm run clean

Linting

# Check code style
npm run lint

# Fix auto-fixable issues
npm run lint:fix

License

Apache License 2.0 - see LICENSE file for details.
