A collection of tools, visualizations, snippets, and thoughts for implementing and explaining data science concepts.
-
Introductory Statistics
- Quantiles / Percent Rank calculation
- Measures of Central Tendency
- Mean (arithmetic, geometric, harmonic)
- Median
- Mode
- Measures of Dispersion:
- Range
- IQR
- Variance
- Standard Deviation
- Skew
- Kurtosis
- Z scores & Confidence Intervals
- T tests
- Effect sizes
- Cohen's d
- Cramer's V
- Tests of Normality
- Non-Parametric Tests
- Chi-Square Test
- Heteroskedasticity
-
Probability
- Basics
- Unions, intersections, conditional probability
- PDF, CDF
- Marginal / joint / conditional distributions
-
Linear Algebra & Calculus
- Matrix 101: determinant, rank, etc.
- Eigenvectors & Eigenvalues (Eigendecomp)
- Hessian, Jacobian, Laplacian matrix
- Properties of matrices (rank, determinant, etc)
-
Programming
- Python
- R
- Java
- Linux / Bash
- Slurm
-
Software & Tools
- Conda / Anaconda
- Virtual Environments
- Docker / Singularity
- Git / GitHub
-
EDA, Data Cleaning, & Feature Engineering
- Scaling: log, sqrt, minmax
- one-hot encoding
-
Classic Statistical Tools (maybe combine with Intro)
- Correlation: Pearson, Spearman, Kendall's tau, distance
- Confusion matrix
- Power, Type 1 error, type 2 error, alpha
- Metrics:
- MSE, MAE, R2, Adjusted R2, Pseudo R2, AUC, BIC
- Distributions & Their attributes:
- Gamma
- Cauchy
- Normal, Standard Normal
- Uniform
- Binomial
- Poisson
- Bernoulli
- Kernel density estimation
- QQ plot
-
Advanced Statistics (??)
- Linear Mixed Modeling
- Mediation vs Moderation; Interactions, Effects
- Tukey's Hinges, Ladder of Power
- Family-Wise Error & Bonferroni correction
-
Data Mining
- Clustering Methods
- Spectral
- DBSCAN
- Hierarchical Clustering
- Kmeans / Kmedoid
- Elbow curve vs Silhouette score
- Affinity propagation
- Regression
- Linear
- Logistic
- Gamma
- Poisson
- Quantile
- Decision Trees / Ensemble Methods
- Random Forest
- AdaBoost / XGBoost
- Bagging / Boosting / Bootstrapping
- Gridsearch vs random search
- Entropy, Gini index
- CART
- Support Vector Machine
- k Nearest Neighbors
-
Machine Learning
- Maximum Likelihood Estimation
- Taylor Series / Newton-Raphson
- Expectation Maximization (EM)
- Find maximum likelihood solutions for models with latent variables
- Log-Likelihood
- Survival Analysis
- ICA / PCA / t-SNE / UMAP
- Non-negative matrix factorization
-
Time Series
- White noise / random walk, etc.
- Stationarity
- ACF vs PACF
- ARIMA Suite
- Test of Causality (Granger, etc.)
-
MLOps
-
Bayesian Statistics
-
Deep Learning
- Activation functions
- Loss functions
- Glorot Uniform Initializer
- Mixture of Experts
- Attention
- Architectures (CNN, RNN, UNet, Transformer, diffusion, autoencoder, adversarial)
- feed forward
- backpropagation
- optimizers (Adam, etc.)
-
NLP
- Similarity Metrics: Jaccard, cosine, Manhattan, Euclidean, Hamming, Minkowski, etc.
- Kullback-Leibler divergence
- Latent Dirichlet allocation
- Sentiment analysis
- Named Entity Recognition (NER)
-
Computer Vision
- Image preprocessing
- Image registration
- segmentation
-
Practice Sets
-
Data Visualization
- Voronoi diagrams
- QQ Plots
-
Other (dump):
- SHAP values
- Monte Carlo // Markov chain
- Monty Hall problem
- Kolmogorov-Smirnov test
- Wilcoxon
- Cronbach's alpha
- ROC curves
- precision / recall curves
- OpenCV
- internal covariate shift
- causality analysis
-
Mixture Models:
- Bernoulli Mixture Model:
- Applications: demixing data samples from mixture models, "model-based clustering"
- Takes discrete binary variables
- Uninformative prior:
- e.g., Dirichlet
- Parameters are set such that all outcomes are equally likely
- Bernoulli Mixture Model:
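
The notes above on the Bernoulli mixture model (binary inputs, model-based clustering) and on an uninformative Dirichlet prior can be illustrated with a small EM sketch. Everything here is an assumption made for illustration: the function name `fit_bernoulli_mixture`, the toy data, the random initialization, and the choice of a symmetric Dirichlet prior on the mixing weights with concentration `alpha >= 1`. With `alpha = 1.0` the prior is flat, so the MAP update for the weights is the same as the maximum likelihood update.

```python
import math
import random


def fit_bernoulli_mixture(data, n_components, n_iter=50, alpha=1.0, seed=0):
    """EM for a mixture of independent Bernoullis (illustrative sketch).

    data: list of equal-length lists of 0/1 values.
    alpha: concentration of a symmetric Dirichlet prior on the mixing
        weights (assumed >= 1); alpha = 1.0 is the flat, uninformative case.
    Returns (weights, probs, resp), where resp holds the responsibilities
    from the final E-step.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    eps = 1e-9  # guards against log(0) and division by zero
    weights = [1.0 / n_components] * n_components
    # Random starting probabilities, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)]
             for _ in range(n_components)]
    resp = []

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample,
        # computed in log space and normalized for numerical stability.
        resp = []
        for x in data:
            log_p = []
            for k in range(n_components):
                ll = math.log(max(weights[k], eps))
                for j in range(d):
                    p = min(max(probs[k][j], eps), 1.0 - eps)
                    ll += math.log(p if x[j] else 1.0 - p)
                log_p.append(ll)
            top = max(log_p)
            unnorm = [math.exp(v - top) for v in log_p]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: re-estimate mixing weights and per-feature probabilities.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            # MAP update under a symmetric Dirichlet(alpha) prior.
            weights[k] = (nk + alpha - 1.0) / (n + n_components * (alpha - 1.0))
            for j in range(d):
                weighted_ones = sum(r[k] * x[j] for r, x in zip(resp, data))
                probs[k][j] = weighted_ones / max(nk, eps)

    return weights, probs, resp


# Toy data (hypothetical): two groups of binary vectors with opposite patterns.
data = [[1, 1, 1, 0, 0, 0]] * 20 + [[0, 0, 0, 1, 1, 1]] * 20
weights, probs, resp = fit_bernoulli_mixture(data, n_components=2)

# Hard cluster assignment: the component with the largest responsibility.
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(weights)
print(labels)
```

Which component index ends up attached to which group depends on the random initialization, so only the grouping itself is meaningful, not the label values.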