scikit-learn_bench benchmarks various implementations of machine learning algorithms across data analytics frameworks. scikit-learn_bench can be extended to add new frameworks and algorithms. It currently supports the scikit-learn, daal4py, cuML, and XGBoost frameworks for commonly used machine learning algorithms.
See benchmark results here.
- Prerequisites
- How to create conda environment for benchmarking
- Running Python benchmarks with runner script
- Supported algorithms
- Algorithms parameters
- Legacy automatic building and running
- `python` and `scikit-learn` to run Python versions
- `pandas` when using its DataFrame as the input data format
- `icc`, `ifort`, `mkl`, and `daal` to compile and run native benchmarks
- the machine learning frameworks that you want to test. Check the next section for additional information on how to set up each environment.
Create a suitable conda environment for each framework you want to test. Each item in the list below links to instructions for creating an appropriate conda environment for that framework.
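As a minimal sketch (the environment name, channel, Python version, and package list here are illustrative assumptions rather than the project's official instructions), a scikit-learn-only environment could be created like this:

```bash
# Illustrative only: adjust the Python version and packages to the framework you want to test.
conda create -n skl_bench -c conda-forge python=3.9 scikit-learn pandas
conda activate skl_bench
```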
Run `python runner.py --config configs/config_example.json [--output-format json --verbose]` to launch benchmarks.
Runner options:
- `config` : the path to the configuration file
- `dummy-run` : run the configuration parser and dataset generation without running the benchmarks
- `verbose` : print additional information while the benchmarks are running
- `output-format` : `json` or `csv`; the output format of the benchmarks to use with their runner
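For example, assuming these options map to `--dummy-run`, `--verbose`, and `--output-format` command-line flags (a convention inferred from the command above, not confirmed here), a dry run followed by a full run with CSV output could look like this:

```bash
# Parse the config and generate datasets without running any benchmarks.
python runner.py --config configs/config_example.json --dummy-run --verbose

# Run the benchmarks and emit results in CSV format.
python runner.py --config configs/config_example.json --output-format csv
```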
Benchmarks currently support the following frameworks:
- scikit-learn
- daal4py
- cuml
- xgboost
The benchmark configuration lets you select which frameworks to run, which datasets to measure, and how to set the parameters of the algorithms.
You can configure benchmarks by editing a config file. Check config.json schema for more details.
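As a rough sketch of what such a file can look like (the key names and values below are assumptions modeled on the bundled configs/config_example.json and may not match your version; treat the config.json schema as authoritative):

```json
{
    "common": {
        "lib": ["sklearn"],
        "data-format": ["pandas"],
        "dtype": ["float64"]
    },
    "cases": [
        {
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": { "n_samples": 100000 }
                }
            ],
            "n-clusters": [10]
        }
    ]
}
```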
| algorithm | benchmark name | sklearn | daal4py | cuml | xgboost |
|---|---|---|---|---|---|
| DBSCAN | dbscan | ✅ | ✅ | ✅ | ❌ |
| RandomForestClassifier | df_clfs | ✅ | ✅ | ✅ | ❌ |
| RandomForestRegressor | df_regr | ✅ | ✅ | ✅ | ❌ |
| pairwise_distances | distances | ✅ | ✅ | ❌ | ❌ |
| KMeans | kmeans | ✅ | ✅ | ✅ | ❌ |
| KNeighborsClassifier | knn_clsf | ✅ | ❌ | ✅ | ❌ |
| LinearRegression | linear | ✅ | ✅ | ✅ | ❌ |
| LogisticRegression | log_reg | ✅ | ✅ | ✅ | ❌ |
| PCA | pca | ✅ | ✅ | ✅ | ❌ |
| Ridge | ridge | ✅ | ✅ | ✅ | ❌ |
| SVM | svm | ✅ | ✅ | ✅ | ❌ |
| train_test_split | train_test_split | ✅ | ❌ | ✅ | ❌ |
| GradientBoostingClassifier | gbt | ❌ | ❌ | ❌ | ✅ |
| GradientBoostingRegressor | gbt | ❌ | ❌ | ❌ | ✅ |
You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark:
cd <framework>
Run the following command:
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
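For instance, a scikit-learn KMeans run might look like the following (the benchmark file name, dataset path, and extra parameter are illustrative assumptions; consult the per-algorithm parameter list referenced below for the actual options):

```bash
# Hypothetical example: benchmark scikit-learn KMeans on a local dataset.
cd sklearn
python kmeans.py --dataset-name data/blobs_50x100000.npy --n-clusters 10
```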
You can find the list of supported parameters for each algorithm here:
- Run `make`. This will generate data, compile benchmarks, and run them.
  - To run only scikit-learn benchmarks, use `make sklearn`.
  - To run only native benchmarks, use `make native`.
  - To run only daal4py benchmarks, use `make daal4py`.
  - To run a specific implementation of a specific benchmark, directly request the corresponding file: `make output/<impl>/<bench>.out`.
- If you have activated a conda environment, the build will use daal from the conda environment, if available.