Adding purged cross-validation for time series datasets #115

Open
wants to merge 1 commit into base: master

Conversation

@mmerce (Member) commented May 23, 2025

No description provided.

@mmerce requested a review from jaor May 23, 2025 19:56
@jaor (Member) left a comment

looks good to me, just a few cosmetic nits below.

- The second, third and fourth steps are repeated with each of the k parts,
so that k evaluations are generated
- Finally, the evaluation metrics are averaged to provide the cross-validation
metrics.
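For illustration only (a sketch, not code from the PR): the final averaging step for a single metric could be written in WhizzML roughly like this, assuming metric-values holds the k per-evaluation values of that metric.

;; Sketch: average one evaluation metric (e.g. accuracy) over the k folds.
(define (average-metric metric-values)
  (/ (reduce (lambda (acc v) (+ acc v)) 0 metric-values)
     (count metric-values)))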

just a note outside the scope of this PR: it'd be nice if in the metadata's description we could use a pointer to a file, something like:

"description": {"file": "./readme.md"}

@@ -0,0 +1,18 @@
# Script for purged k-fold cross-validation

The objective of this script is create a purged k-fold cross validation

to create

The objective of this script is create a purged k-fold cross validation
starting form any classification model

typo: from

" predict " (if regression? "regressions"
"classifications")
".")
"code" 106}))))

indentation looks off

(and (= model-type "linearregression") (not regression?))))
(when error
(raise {"message" (str "The " model-type " cannot be used to"
" predict " (if regression? "regressions"

indentation looks off
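For reference, a possible re-indentation of this form, reconstructed only from the two hunks quoted above, might be:

(when error
  (raise {"message" (str "The " model-type " cannot be used to"
                         " predict " (if regression? "regressions"
                                                     "classifications")
                         ".")
          "code" 106}))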

batch (round (/ rows k-folds))
k-fold-fn (lambda (x)
(log-info "range" (str (+ 1 (* x batch))) (str (+ 1 (* (+ x 1) batch))))
(create-dataset {"origin_dataset" dataset-id

I would define a variable for range in the let, instead of computing it twice
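One possible shape for that suggestion (a sketch only; dataset-id, rows and k-folds are assumed to be bound by the surrounding script, as in the hunk above, and the final map line is just illustrative):

;; Sketch: bind the fold's row range once and reuse it for logging and for
;; the create-dataset call, instead of computing it twice.
(let (batch (round (/ rows k-folds))
      k-fold-fn (lambda (x)
                  (let (fold-range [(+ 1 (* x batch))
                                    (+ 1 (* (+ x 1) batch))])
                    (log-info "range" (str (nth fold-range 0))
                              (str (nth fold-range 1)))
                    (create-dataset {"origin_dataset" dataset-id
                                     "range" fold-range}))))
  (map k-fold-fn (range k-folds)))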

pruning-rows (round (* (/ rows 100) 7.5)))
(log-info "range" (str (+ 1 pruning-rows)) (str (- rows pruning-rows 1)))
(create-dataset {"origin_dataset" ds-id
"range" [(+ 1 pruning-rows) (- rows pruning-rows 1)]})))

same thing about defining a variable for range (there could even be a function for computing it)
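A sketch of that idea (the helper name is hypothetical; ds-id and rows come from the surrounding script, as in the hunk above):

;; Sketch: compute the purged range once, via a small helper, and reuse it.
(define (purged-range rows pruning-rows)
  [(+ 1 pruning-rows) (- rows pruning-rows 1)])

(let (pruning-rows (round (* (/ rows 100) 7.5))
      ds-range (purged-range rows pruning-rows))
  (log-info "range" (str (nth ds-range 0)) (str (nth ds-range 1)))
  (create-dataset {"origin_dataset" ds-id
                   "range" ds-range}))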


The following script performs a k-fold cross-validation compatible with time
series datasets. Test datasets are created by sampling linearly the original
dataset and some data is removed from the test dataset edges to avoid leakage.

maybe it is worth mentioning what a dataset "edge" is, unless it's pretty common terminology in this context (I for one don't know what it is :))
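For what it's worth, a tiny sketch of the idea (hypothetical helper, not the PR's code; the numbers are only an example):

;; Sketch: an "edge" here means the first and last rows of a test slice.
;; Purging drops pruning-rows rows from each end of the slice's [start end]
;; row range, so test rows adjacent in time to the training data are removed
;; and leakage between folds is avoided.
(define (purge-edges start end pruning-rows)
  [(+ start pruning-rows) (- end pruning-rows)])

;; e.g. (purge-edges 1 1000 75) => [76 925]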
