Skip to content

Conversation

kchanqvq
Copy link

Hi! When using MAGICL, I find that OLS is missing, a quite common use case, and I take a shot implementing it. This is my first PR and I'm happy to discuss improvements so that it meets the standard for inclusion!

BTW, when working with the source, I find some other things which can be improved, e.g.:

  • info from LAPACK is ignored, which could be used for better error reporting.

Is there interest working on these? Where is the right place to discuss?

@kchanqvq kchanqvq force-pushed the master branch 3 times, most recently from 7ace529 to f04c9a0 Compare June 16, 2025 13:12
@stylewarning
Copy link
Member

Lots of interest! You could make an issue with the feature you're interested in implementing.

@stylewarning
Copy link
Member

I will review this soon. Thanks!

@kchanqvq
Copy link
Author

kchanqvq commented Jun 17, 2025

RFC: should second value of ols be r^2 (correlation coefficient) instead? This should be more useful in practice.

Maybe it should append an all-1 column to input as well and return 3 values A, b, r^2.

Copy link
Contributor

@YarinHeffes YarinHeffes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great addition! Most of my comments are nits.

Two more nits:

  1. Replace usage of "least square" with "least squares" in docstrings.
  2. Consider renaming ols to be more readable, e.g. least-squares-solve to match linear-solve.

(b (make-array (* ldb nrhs) :element-type ',type))
(s (make-array (min m n) :element-type ',type))
(rcond (or rcond
(* (max m n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we multiply by (max m n) here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cargo-culted from NumPy. I assume this is an agreed sane default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, it just wasn't immediately obvious to me why not to use -1 as a default. Looks good.

Comment on lines 684 to 685
(defun ols (a b &optional rcond)
"Attempt to solve the ordinary least square problem argmin_X ||B-AX||^2. Returns X and sum of squared residuals."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend explaining rcond in the docstring here or removing rcond from the arglist (just for this function, keep it in lsd regardless)

@YarinHeffes
Copy link
Contributor

  • info from LAPACK is ignored, which could be used for better error reporting.

I agree, I encourage you to open an issue to continue the discussion!

@YarinHeffes
Copy link
Contributor

RFC: should second value of ols be r^2 (correlation coefficient) instead? This should be more useful in practice.

Maybe it should append an all-1 column to input as well and return 3 values A, b, r^2.

This would be useful. At minimum I think it should be documented that r is only returned when b has extra rows, or we can add a keyword arg &key get-residual-squares-p. What do you think?

@kchanqvq
Copy link
Author

This would be useful. At minimum I think it should be documented that r is only returned when b has extra rows, or we can add a keyword arg &key get-residual-squares-p. What do you think?

Currently r is always returned, when b doesn't have extra rows it is 0.0... I just realized this is wrong if A is rank deficient, I will handle and test this in the next version! We might have to return NIL in these cases, as gelsd require both m>=n and rank=n for sum of residuals to be available.

2. Consider renaming ols to be more readable, e.g. least-squares-solve to match linear-solve.

Originally I want to have an ordinary linear regression routine, but I guess this should live in a higher level library than magicl itself, and magicl just provides the lower level least square routine?

IMO least-squares-solve is a bit wordy. How about least-squares?

@stylewarning
Copy link
Member

least-squares is fine by me

@kchanqvq
Copy link
Author

I just realized this is wrong if A is rank deficient, I will handle and test this in the next version! We might have to return NIL in these cases, as gelsd require both m>=n and rank=n for sum of residuals to be available.

Hmm, it seems that I need to get the OUT argument for rank of *gelsd for this to work, but the generated CFFI only allow passing in a value. Is this problem similar to why we currently don't get info out? Any idea how to work around this? Help wanted!

@kchanqvq
Copy link
Author

Is this problem similar to why we currently don't get info out

How about change generate-interface.lisp so that the generated functions returns the values of all non-array arguments as multiple values? All Fortran functions return :void so this shouldn't conflict with anything existing. E.g. %gelsd would return (values m n nrhs lda ldb rcond rank lwork info) at the end.

@YarinHeffes
Copy link
Contributor

Regarding the name, it was just a suggestion for my part, least-squares seems good to me.

The only change request that I think needs to be addressed for this PR is the documentation for that function, which is meant to be the high-level "user-friendly" function. Specifically, it should at least be more clear under what circumstances residual squares are returned. I think it's even more user friendly to have a boolean keyword flag for the user to request the residual squares.

Regarding handling errors and edge cases, I was tinkering with this earlier today and made a draft of how this could be addressed based on your suggestion and I also described two alternatives in the PR description. I'm not sure whether that needs to be incorporated into this PR or provided in a follow-up PR.

@kchanqvq
Copy link
Author

The only change request that I think needs to be addressed for this PR is the documentation for that function, which is meant to be the high-level "user-friendly" function. Specifically, it should at least be more clear under what circumstances residual squares are returned. I think it's even more user friendly to have a boolean keyword flag for the user to request the residual squares.

Regarding handling errors and edge cases, I was tinkering with this earlier today and made a draft of how this could be addressed based on your suggestion and I also described two alternatives in the PR description. I'm not sure whether that needs to be incorporated into this PR or provided in a follow-up PR.

I think we have to get the rank out argument to do any of these reliably. If A is rank deficient but M>N, there are still values in higher rows of b, they're just gibberish value.

Or let's just not return sum of residuals in this PR for now?

@stylewarning
Copy link
Member

@kchanqvq For a specific function I would just write the interface manually. Solving the problem more generally in the generated CFFI code would be interesting though.

@kchanqvq
Copy link
Author

I remove the second residual value for now, we can later add it after we figure out OUT arguments (the user can compute it themselves for now). @stylewarning @YarinHeffes I think the PR is ready for another review!

@kchanqvq
Copy link
Author

This seems to be hanging for a while, can someone help get it going? ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants