SemiCART is a semi-supervised decision tree algorithm that enhances the traditional Classification and Regression Tree (CART) algorithm by incorporating semi-supervised learning principles. Published in the International Journal of Machine Learning and Cybernetics, our approach addresses a critical limitation of standard CART algorithms by leveraging unlabeled data in the training process.
- Installation - Get started with SemiCART
- Quick Start - Simple example to get you started
- Benchmark Results - See performance comparisons
- How It Works - Learn about the algorithm
- Examples - Detailed usage examples
- Overview
- Key Features
- When to Use SemiCART
- Installation
- Quick Start
- How It Works
- Performance Visualization
- Benchmark Results
- Parameters
- Command-Line Interface
- Benchmarking
- Examples
- Advantages
- Requirements
- Troubleshooting
- Citation
- Contributing
- Community
- License
Decision trees like CART form the foundation of modern boosting methodologies such as GBM, XGBoost, and LightGBM. However, standard CART algorithms can't learn from unlabeled data. SemiCART introduces "Distance-based Weighting," which leverages principles from graph-based semi-supervised learning to:
- Calculate relevance of training records relative to test data
- Remove irrelevant records to accelerate training
- Improve overall performance through modified Gini index calculations
Our comprehensive evaluations across thirteen datasets from various domains demonstrate that SemiCART consistently outperforms standard CART methods, offering a significant contribution to statistical learning.
- Distance-based Weighting: Assigns weights to training instances based on their similarity to test instances, focusing the model on more relevant training data.
- Modified Gini Index: Incorporates instance weights into the splitting criteria, improving the decision tree's structure.
- scikit-learn Compatible: Fully compatible with the scikit-learn API, making it easy to integrate into existing ML pipelines.
- Multiple Distance Metrics: Supports a wide range of distance metrics (euclidean, manhattan, cosine, etc.).
- Comprehensive Benchmarking: Includes a benchmarking module for performance evaluation.
- Cost-Effective Learning: Efficiently utilizes both labeled and unlabeled data, reducing the need for expensive data labeling.
SemiCART is particularly effective in scenarios where:
- You have limited labeled data but abundant unlabeled data
- There's a significant cost associated with data labeling
- You're working with datasets where traditional decision trees show high variance
- Your data comes from domains like medical diagnostics, fraud detection, or customer segmentation
- You need interpretable models rather than black-box alternatives
- You want to incorporate the structure of unlabeled data into your classification model
SemiCART's advantage increases with higher ratios of unlabeled to labeled data, making it ideal for semi-supervised learning tasks.
Install from PyPI:

pip install semicart

Or install from source:

git clone https://github.com/WeightedAI/semicart.git
cd semicart
pip install -e .
from semicart import SemiCART
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load and prepare data
X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the Semi-CART model
model = SemiCART(k_neighbors=3, distance_metric='euclidean')
model.fit(X_train, y_train, X_test)
# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")
# Output: Accuracy: 0.9778
SemiCART introduces a novel approach to incorporate test data into the training phase, inspired by graph-based semi-supervised learning techniques:
- For each test instance, distances to all training instances are calculated
- The k-nearest training instances are identified for each test instance
- Weights of these nearest training instances are incremented
- Training instances with zero weight (not selected as neighbors) are removed
- This focuses the model on the most relevant training data relative to the test set
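To make these steps concrete, here is a minimal NumPy/scikit-learn sketch of the weighting procedure. It illustrates the idea above and is not the package's internal implementation; `distance_based_weights` is a hypothetical helper, and it assumes weights start at zero so that never-selected instances can be dropped (the library itself exposes `initial_weight` and `weight_increment` for tuning this behavior):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def distance_based_weights(X_train, X_test, k_neighbors=1,
                           weight_increment=1.0, metric='euclidean'):
    # Start every training instance at weight zero; instances never chosen
    # as a nearest neighbor keep weight zero and are removed afterwards.
    weights = np.zeros(len(X_train))
    # Distances from each test instance to every training instance.
    D = pairwise_distances(X_test, X_train, metric=metric)
    for row in D:
        # Increment the weights of the k nearest training instances.
        nearest = np.argsort(row)[:k_neighbors]
        weights[nearest] += weight_increment
    keep = weights > 0  # mask of training instances retained for tree building
    return weights[keep], keep

# Example usage with random data
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(20, 4)), rng.normal(size=(5, 4))
w, keep = distance_based_weights(X_tr, X_te, k_neighbors=3)
```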
SemiCART replaces the standard class proportions in the Gini index with weight-based proportions:
Modified Gini = 1 - Σ(w_i/S)²
Where:
- w_i = sum of weights of instances in class i
- S = total sum of weights in the subset
This modified splitting criterion ensures that the resulting decision tree better captures the underlying relationships between labeled and unlabeled data.
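As a quick illustration, the modified criterion can be computed from per-instance weights as follows (a minimal sketch with a hypothetical `modified_gini` helper, not the package's internal code):

```python
import numpy as np

def modified_gini(weights, labels):
    # Modified Gini = 1 - sum_i (w_i / S)^2, where w_i is the total weight
    # of instances in class i and S is the total weight in the subset.
    S = weights.sum()
    if S == 0:
        return 0.0
    gini = 1.0
    for c in np.unique(labels):
        w_c = weights[labels == c].sum()
        gini -= (w_c / S) ** 2
    return gini

# Example: class 0 has total weight 4, class 1 has total weight 4 (S = 8)
weights = np.array([2.0, 1.0, 1.0, 4.0])
labels = np.array([0, 0, 0, 1])
print(modified_gini(weights, labels))  # 1 - (4/8)^2 - (4/8)^2 = 0.5
```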
SemiCART consistently outperforms traditional CART across various evaluation metrics:
The performance visualizations demonstrate SemiCART's advantage across multiple datasets, particularly when unlabeled data is leveraged effectively.
Our extensive benchmarking across multiple datasets shows that SemiCART consistently outperforms traditional CART in classification tasks:
Accuracy comparison:

| Dataset | Test Size | k | Best Distance Metric | CART | SemiCART | Improvement |
|---|---|---|---|---|---|---|
| banknote | 0.1 | 2 | hamming | 0.9928 | 1.0000 | +0.0072 |
| banknote | 0.7 | 3 | yule | 0.9594 | 0.9875 | +0.0281 |
| fertility | 0.1 | 1 | jaccard | 0.7000 | 0.9000 | +0.2000 |
| fertility | 0.3 | 5 | jensenshannon | 0.6333 | 0.8333 | +0.2000 |
| wdbc | 0.1 | 3 | sqeuclidean | 0.9298 | 1.0000 | +0.0702 |
| wdbc | 0.3 | 7 | cosine | 0.9006 | 0.9825 | +0.0819 |
| glass | 0.1 | 18 | yule | 0.6364 | 0.8636 | +0.2273 |
| glass | 0.2 | 18 | sqeuclidean | 0.7209 | 0.8837 | +0.1628 |
| transfusion | 0.1 | 6 | chebyshev | 0.7067 | 0.7733 | +0.0667 |
AUC comparison:

| Dataset | Test Size | k | Best Distance Metric | CART | SemiCART | Improvement |
|---|---|---|---|---|---|---|
| fertility | 0.2 | 13 | jaccard | 0.4722 | 0.9444 | +0.4722 |
| fertility | 0.5 | 11 | jensenshannon | 0.4545 | 0.7273 | +0.2727 |
| wdbc | 0.1 | 3 | sqeuclidean | 0.9147 | 1.0000 | +0.0853 |
| wdbc | 0.3 | 7 | cosine | 0.8892 | 0.9797 | +0.0905 |
| glass | 0.1 | 3 | yule | 0.8137 | 0.9386 | +0.1249 |
| glass | 0.7 | 12 | hamming | 0.7189 | 0.8346 | +0.1157 |
Key observations:
- SemiCART shows its greatest improvements at smaller test sizes
- Different distance metrics work best for different datasets
- Significant improvements even on datasets with complex decision boundaries
- Some datasets show dramatic improvements in AUC (up to +0.4722)
- `max_depth`: Maximum depth of the tree (default=None)
- `min_samples_split`: Minimum number of samples required to split a node (default=2)
- `k_neighbors`: Number of nearest neighbors to consider for weight assignment (default=1)
- `distance_metric`: Distance metric for similarity calculation (default='euclidean')
  - Supported values: 'euclidean', 'manhattan', 'cosine', 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'dice', 'hamming', 'jaccard', 'jensenshannon', 'minkowski', 'sqeuclidean', 'yule'
- `initial_weight`: Initial weight for each training instance (default=1.0)
- `weight_increment`: Weight increment for nearest neighbors (default=1.0)
- `random_state`: Random seed for reproducibility (default=None)
- `log_level`: Logging level (default=logging.INFO)
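For example, a non-default configuration using these parameters might look like this (values chosen purely for illustration):

```python
from semicart import SemiCART

# Shallower tree, cosine distance, and heavier neighbor weighting
model = SemiCART(
    max_depth=5,
    min_samples_split=4,
    k_neighbors=5,
    distance_metric='cosine',
    initial_weight=1.0,
    weight_increment=2.0,
    random_state=42,
)
```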
SemiCART includes a convenient command-line interface for quick experimentation:
# Run with default parameters on Iris dataset
semicart
# Run with custom parameters
semicart --dataset wine --test-size 0.4 --k-neighbors 5 --distance-metric manhattan
# Get help on available options
semicart --help
SemiCART includes a comprehensive benchmarking module for evaluating performance:
from semicart.benchmark import run_default_benchmark
# Run a default benchmark on common datasets
runner = run_default_benchmark()
# Or create a custom benchmark
from semicart.benchmark import BenchmarkRunner
runner = BenchmarkRunner(output_dir='my_results')
runner.run_comparison(
dataset_names=['iris', 'wine'],
test_sizes=[0.3, 0.5],
k_neighbors_values=[1, 3, 5],
distance_metrics=['euclidean', 'manhattan']
)
Check out the `examples` directory for more detailed usage examples:
- `simple_example.py`: Basic comparison with standard CART
- `distance_metrics_comparison.py`: Comparing different distance metrics
- `advanced_usage.py`: More advanced options and configurations
from semicart import SemiCART
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load and prepare data
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train standard CART
cart = DecisionTreeClassifier(random_state=42)
cart.fit(X_train, y_train)
cart_pred = cart.predict(X_test)
cart_acc = accuracy_score(y_test, cart_pred)
# Train SemiCART
semicart = SemiCART(k_neighbors=5, distance_metric='euclidean', random_state=42)
semicart.fit(X_train, y_train, X_test)
semicart_pred = semicart.predict(X_test)
semicart_acc = accuracy_score(y_test, semicart_pred)
print(f"CART Accuracy: {cart_acc:.4f}")
print(f"SemiCART Accuracy: {semicart_acc:.4f}")
print(f"Improvement: {semicart_acc - cart_acc:.4f}")
- Improved Accuracy: SemiCART consistently outperforms CART on a wide range of datasets
- Utilizes Unlabeled Data: Leverages unlabeled instances to enhance the learning process
- Cost-Effective: Reduces the need for extensive data labeling
- Flexibility: Works with various distance metrics to adapt to different data distributions
- Interpretability: Maintains the interpretability of decision trees
- Integration: Easily integrates into existing ML pipelines through scikit-learn compatibility
- Domain-Agnostic: Performs well across various domains and data types
SemiCART requires the following dependencies:
- Python ≥ 3.7
- NumPy ≥ 1.19.0
- scikit-learn ≥ 0.24.0
- SciPy ≥ 1.6.0
- pandas ≥ 1.0.0
Compatible with all major operating systems (Windows, macOS, Linux).
ImportError: No module named 'semicart'
- Make sure you've installed the package with `pip install semicart`
- Verify your Python environment is activated if using virtual environments
AttributeError when using SemiCART with custom datasets
- Ensure your data is properly formatted (numerical, no NaN values)
- Check that feature scaling is applied for distance-based metrics
Poor performance on specific datasets
- Try different distance metrics (results vary by dataset characteristics)
- Adjust the `k_neighbors` parameter (values of 3-7 work well for many datasets)
- Ensure proper feature scaling is applied
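For instance, the scaling step recommended above can be applied like this (a minimal sketch using scikit-learn's StandardScaler):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```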
For more help, please open an issue on our GitHub repository.
If you use SemiCART in your research, please cite the following paper:
@article{abedinia2024semicart,
  title={Building Semi-Supervised Decision Trees with Semi-CART Algorithm},
  author={Abedinia, Aydin and Seydi, Vahid},
  journal={International Journal of Machine Learning and Cybernetics},
  volume={15},
  pages={4493--4510},
  year={2024},
  publisher={Springer},
  doi={10.1007/s13042-024-02161-z}
}
Contributions to SemiCART are welcome! Please check our contributing guidelines for more details.
- Fork the repository on GitHub
- Clone your fork locally
- Create a virtual environment and install development dependencies: `pip install -e ".[dev]"`
- Create a branch for your changes
- Make your changes and add tests
- Run tests locally: `pytest`
- Submit a pull request
- GitHub Issues: For bug reports and feature requests
- Discussions: For usage questions and discussions
- Pull Requests: For contributing code and documentation
Join our community of data scientists and machine learning practitioners to improve SemiCART and expand its capabilities!
SemiCART is released under the MIT License. See the LICENSE file for details.