by Oscar Amarilla, 2023
ML_TDA-ENSO stands for Machine Learning, Topological Data Analysis and El Niño-Southern Oscillation, the topics covered in this academic work. This project will be presented by the main author as an undergraduate thesis to obtain a degree in Atmospheric Sciences at the National University of Asunción (Paraguay). It has been developed at the Scientific Computing and Applied Mathematics group of the NIDTEC research center of the same university.
- Oscar Amarilla¹ (Main author)
- Cristhian Schaerer¹ (Advisor and contributor)
- Inocencio Ortiz² (Advisor and contributor)
¹National University of Asunción, Polytechnic School, San Lorenzo, Paraguay.
²National University of Asunción, Faculty of Engineering, San Lorenzo, Paraguay.
The user is free to copy, modify, and distribute this project, even for commercial purposes, under the Creative Commons Attribution 4.0 International Public License. See the License section for details.
- About
- El Niño-Southern Oscillation
- Topological Data Analysis
- Support Vector Machine
- Dependencies
- Installation
- Structure of the Project
- Results
- Future Work
- References
- Contributing to ML_TDA-ENSO
- License
In this project, the monthly mean sea surface temperature fields of the tropical Pacific region defined by (10°N-10°S, 160°E-90°W) for each month in the period 1950-2021 are taken and topological data analysis is applied to them. This process consists of computing the sublevel set filtration and the Euler characteristic curve of each field. These curves are then given as input data to a support vector machine algorithm to verify whether the data set is linearly separable.
The aim is to develop a classifier of ENSO phases based on topological data.
El Niño-Southern Oscillation (ENSO) is an irregular oscillation with periods in the range of two to seven years that alters the normal conditions of the atmosphere over the central tropical Pacific. It consists of a warm phase (El Niño), a cold phase (La Niña) and a neutral phase (normal conditions). Depending on which of the extreme phases is active, its manifestation implies anomalies in the amount of rainfall in the regions bordering the Pacific, opening the possibility of periods of drought or flooding[1][2].
The National Oceanic and Atmospheric Administration (NOAA) monitors the current state of the phenomenon with an index developed exclusively for this purpose, called the Oceanic Niño Index (ONI). The ONI establishes that normal conditions are broken if the absolute value of the anomaly of the three-month running mean of the sea surface temperature (SST) in the Niño 3.4 region (5°N-5°S, 120°W-170°W), with respect to a 30-year average updated every 5 years, is greater than 0.5°C for 5 consecutive overlapping three-month periods. If the anomaly is negative, a La Niña phase is under way; if it is positive, an El Niño phase is[4].
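The ONI rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used in this project: the function names `running_mean_3` and `classify_phases` are hypothetical, the monthly anomalies are made-up numbers, and the anomalies are assumed to be already computed with respect to the 30-year average.

```python
def running_mean_3(values):
    """One mean per overlapping three-month window of the monthly series."""
    return [sum(values[i:i + 3]) / 3.0 for i in range(len(values) - 2)]


def classify_phases(oni, threshold=0.5, min_run=5):
    """Label each three-month period as 'nino', 'nina' or 'neutral'.

    A period belongs to an extreme phase only if it is part of a run of at
    least `min_run` consecutive periods beyond the threshold with one sign.
    """
    signs = [1 if v > threshold else -1 if v < -threshold else 0 for v in oni]
    labels = ["neutral"] * len(oni)
    start = 0
    while start < len(signs):
        # Find the end of the current run of equal signs.
        end = start
        while end < len(signs) and signs[end] == signs[start]:
            end += 1
        if signs[start] != 0 and end - start >= min_run:
            name = "nino" if signs[start] > 0 else "nina"
            for i in range(start, end):
                labels[i] = name
        start = end
    return labels


# Hypothetical monthly SST anomalies (°C) for the Niño 3.4 region.
monthly_anomalies = [0.1, 0.2, 0.6, 0.9, 1.1, 1.2, 1.0, 0.8, 0.7, 0.3, 0.0, -0.2]
oni = running_mean_3(monthly_anomalies)
phases = classify_phases(oni)
print(phases)
```

With these made-up values, the second to eighth three-month periods stay above 0.5°C, so that run of seven periods is labelled as El Niño and the rest as neutral.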
Topological Data Analysis (TDA) is a branch of applied mathematics based on algebraic topology that looks for trends and structures in the underlying topology of a particular dataset. The core of TDA is the "simplex", which is a geometric element that connects data points in their ambient space. Simplices are ordered by dimension:
- 0-dimensional: a vertex (point),
- 1-dimensional: a line segment,
- 2-dimensional: a triangle (surface),
- 3-dimensional: a tetrahedron,
and so on. Every
Let
A very interesting output from
Some of this joined simplicials can configure an
Finally, the formal sums that strictly are
called homology group. Then, the dimension of
Filtration is a technique used to compute persistent homology, which is a way to track topological features at different scales. There are different approaches to this technique; in this work a sublevel set filtration is applied.
In a more formal way, a filtration is a sequence of nested subcomplexes
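As a rough illustration of the idea, the sketch below builds the sublevel sets of a small grid for increasing thresholds and counts their connected components at each step. It is not the implementation used in this project (which relies on the R TDA package). The field values, the function names and the choice of 4-connectivity between grid cells are all assumptions made for the example.

```python
def sublevel_set(field, t):
    """Return the set of (row, col) cells whose value is <= t."""
    return {(r, c) for r, row in enumerate(field) for c, v in enumerate(row) if v <= t}


def count_components(cells):
    """Count 4-connected components of a set of grid cells (flood fill)."""
    remaining = set(cells)
    components = 0
    while remaining:
        stack = [remaining.pop()]
        components += 1
        while stack:
            r, c = stack.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    stack.append(nb)
    return components


# Hypothetical 3x4 temperature field (°C), not real SST data.
field = [
    [26.0, 28.0, 27.0, 26.5],
    [27.5, 29.0, 28.5, 27.0],
    [26.0, 28.0, 29.5, 26.0],
]
thresholds = [26.0, 27.0, 28.0, 29.5]

filtration = [sublevel_set(field, t) for t in thresholds]
# Each sublevel set is contained in the next one: the sequence is nested.
nested = all(a <= b for a, b in zip(filtration, filtration[1:]))
components = [count_components(s) for s in filtration]
print(nested, components)  # True [3, 3, 1, 1]
```

The three cold patches present at the lowest thresholds merge into a single component once the threshold reaches 28.0, which is the kind of feature that persistent homology tracks across scales.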
where the homology group is computed in each
In a sublevel set filtration, the simplicial complex is already given. Here, one goes from
The Euler characteristic is a concept from the study of polyhedra that was brought to topology and adapted, ending up as a topological invariant. In topology, it is an integer that results from the alternating sum of the Betti numbers of a simplicial complex
being
The Euler characteristic curve induced by the height function is a continuous map
that computes the Euler characteristics of each subcomplex⁷.
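A minimal sketch of an Euler characteristic curve for a gridded field is given below. It assumes that every grid cell is treated as a closed unit square, so that each sublevel set is a union of squares, and it uses the vertex, edge and face count V - E + F, which agrees with the alternating sum of Betti numbers. The function names and the field values are hypothetical, and this is not the routine used by the project.

```python
def euler_characteristic(cells):
    """Euler characteristic V - E + F of a union of closed unit squares."""
    vertices, edges = set(), set()
    for r, c in cells:
        vertices.update([(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)])
        edges.update({
            ((r, c), (r, c + 1)),          # top edge
            ((r + 1, c), (r + 1, c + 1)),  # bottom edge
            ((r, c), (r + 1, c)),          # left edge
            ((r, c + 1), (r + 1, c + 1)),  # right edge
        })
    return len(vertices) - len(edges) + len(cells)


def euler_characteristic_curve(field, thresholds):
    """Euler characteristic of the sublevel set {value <= t} for each t."""
    curve = []
    for t in thresholds:
        cells = {(r, c) for r, row in enumerate(field)
                 for c, v in enumerate(row) if v <= t}
        curve.append(euler_characteristic(cells))
    return curve


# Hypothetical 3x3 field: cold corners, a warmer ring and a hot centre.
field = [
    [1.0, 3.0, 1.0],
    [3.0, 4.0, 3.0],
    [1.0, 3.0, 2.0],
]
thresholds = [1.0, 2.0, 3.0, 4.0]
ecc = euler_characteristic_curve(field, thresholds)
print(ecc)  # [3, 4, 0, 1]
```

In this toy field the curve starts at 3 and 4 (isolated squares), drops to 0 when the squares join into a ring around the centre (one component and one hole), and ends at 1 when the hole is filled.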
Support vector machine (SVM) is a machine learning algorithm that takes a set of data
Sometimes the data is not linearly separable. For these cases a technique was developed that involves a special type of non-linear functions called kernels. These functions map the data to another space of different dimension, typically higher, where the data can be separated by a hyperplane⁸.
This project was built upon a series of libraries for the Python and R programming languages.
- netCDF4: A Python interface for NetCDF, a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
- rpy2: An interface that allows Python users to work with R language objects.
- TDA: An R language package that provides statistical tools for topological data analysis.
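The effect of mapping data to a higher dimension can be seen in a toy example. The sketch below is not an SVM and does not train anything: it uses an explicit, hand-picked feature map and hyperplane (both assumptions made for illustration), whereas a kernel SVM performs an analogous mapping implicitly through the kernel function. All names and values are hypothetical.

```python
def separable_1d(class_a, class_b):
    """Two sets of numbers are linearly separable on the line only if one
    lies entirely to the left of the other."""
    return max(class_a) < min(class_b) or max(class_b) < min(class_a)


def feature_map(x):
    """Map a 1-D point to 2-D by appending its square."""
    return (x, x * x)


def decision(point, w, b):
    """Signed value of the hyperplane w . point + b at a point."""
    return sum(wi * pi for wi, pi in zip(w, point)) + b


# Hypothetical 1-D data: one class sits between the two halves of the other.
inner = [-1.0, -0.5, 0.0, 0.5, 1.0]
outer = [-3.0, -2.5, 2.5, 3.0]

# No single threshold on the line splits the two classes.
print(separable_1d(inner, outer))  # False

# After the map x -> (x, x^2), the hand-picked line x^2 = 4 separates them.
w, b = (0.0, 1.0), -4.0
inner_side = [decision(feature_map(x), w, b) < 0 for x in inner]
outer_side = [decision(feature_map(x), w, b) > 0 for x in outer]
print(all(inner_side), all(outer_side))  # True True
```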
- NumPy: A package for scientific computing in Python.
- pandas: A Python library that provides high-performance, easy-to-use data structures and data analysis tools.
- scikit-learn: A Python library built on NumPy, SciPy, and matplotlib for data analysis and machine learning.
- matplotlib: A library for creating static, animated, and interactive visualizations in Python.
- seaborn: A data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics in Python.
The details of each Python library are specified in the requirements.txt file.
The user is advised to install the dependencies listed in the requirements.txt file in a virtual environment, in order to avoid inconsistencies with the user's native environment. To do so, follow these instructions:
After cloning the repository, create a virtual environment in the same folder
python -m venv name_of_the_venv # The name of the venv is up to the user.
Activate the virtual environment
source name_of_the_venv/bin/activate
Then install the requirements listed in the requirements.txt using pip.
pip install -r requirements.txt
Finally, the notebook can be executed.
jupyter notebook
The structure of the project is the following:
|--input/ (Necessary data for the project.)
|
|--src/
| |--config.py (File and directory names are specified.)
| |
| |--extract.py (Extract the information from the netCDF and csv files.)
| |
| |--TDA_extractor.R (Performs the sublevel filtration.)
| |
| |--transform_and_load.py (Performs the Extract-Load-Transform process.)
| |
| |--plots.py (Plot some graphs.)
|
|--outputs/
|
|--SVM_ECC_ENSO_v3.0.ipynb (A Jupyter notebook that applies machine learning to the TDA outputs.)
|
|--requirements.txt (List of libraries used to develop this project and the versions of each one)
|
|--LICENSE.txt (Creative Commons 4.0 License specifications.)
|
|__ README.md (Classified document, those who read it are in danger. Hint: your mother-in-law is involved.)
The best model has an RBF kernel with
There are some further steps that could be applied to improve the results of this project, such as:
- adding more features as inputs to help the classifier reach a linear separation of the data,
- a deeper study of the geometry and topology of the SST fields to understand how the three phases of ENSO are similar, applying transformations around those similarities and checking whether the new configuration improves the performance of the classifier,
- applying another machine learning method, such as a neural network.
- Nobre, G. G. et al. Achieving the reduction of disaster risk by better predicting impacts of El Niño and La Niña. Progress in Disaster Science, Volume 2, 2019.
- Diaz, H. F., Markgraf, V. El Niño and the Southern Oscillation: multiscale variability and global and regional impacts. Cambridge University Press, 2000.
- Changnon, S. A. El Niño 1997-1998: the climate event of the century. Oxford University Press, Inc., 2000.
- Lindsey, R. Climate Variability: Oceanic Niño Index. In: Climate | NOAA [online]. National Oceanic and Atmospheric Administration, 2009 [viewed: 31 May 2023].
- Edelsbrunner, H. A short course in computational geometry and topology. Springer, 2014.
- Edelsbrunner, H., Harer, J. Computational topology: an introduction. American Mathematical Society, 2010.
- Beltramo, G. et al. Euler characteristic surfaces. Foundations of Data Science, Volume 4, No. 1, 2022.
- Cristianini, N., Shawe-Taylor, J. Support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
- OneVsRestClassifier. In: Scikit-learn: Machine Learning in Python [online] [viewed: 31 May 2023].
Contributing to ML_TDA-ENSO
Every comment or suggestion for improving ML_TDA-ENSO is very welcome, so every user is cordially invited to open an issue or pull request on GitHub.
This work is licensed under the Creative Commons Attribution 4.0 International Public License (CC BY 4.0). See the LICENSE file for all the legalese.