Weird Side Effects of loadparams!
#1979
Comments
I guess this is a good example of why parameters of a model should not be implicit. Normalizing them etc. was also a pain in the ass, as you cannot use
Not at the computer, but a quick suggestion would be to use
Both `fmap` and `loadmodel` force a one-to-one relation between models and params. But I want to generate a grid of params and then try them out, plugging them into the model. So after I generated two directions of parameter vectors (by creating the model multiple times, which is already somewhat confusing), I use those directions to define a grid of parameters. Let us for example consider the point
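For context, here is a minimal sketch of that setup with plain vectors (the sizes and names are hypothetical, not from this thread): two roughly orthonormal direction vectors in parameter space, and a Cartesian grid of coefficients that selects linear combinations of them.

```julia
using LinearAlgebra

# two (approximately orthogonal) unit "directions" in a high-dimensional parameter space
d1, d2 = normalize(randn(10_000)), normalize(randn(10_000))

# a Cartesian grid of coefficient pairs, one parameter vector per grid point
coeffs = Iterators.product(range(-1, 1; length=5), range(-1, 1; length=5))
grid_points = [c1 .* d1 .+ c2 .* d2 for (c1, c2) in coeffs]
```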
First things first:

You're in luck, because it does :)

```julia
model1, model2 = ...
i = Ref(1) # avoids boxing
newmodel = fmap(model1, model2) do p1, p2
    c1, c2 = grid[i[]]
    i[] += 1
    return @. c1 * p1 + c2 * p2
end
```
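For a concrete sanity check of this pattern, here is a hedged variant on a toy layer (assuming Flux ≥ 0.13's `Dense(in => out)` syntax; the guard is my addition so that non-array leaves such as the activation function pass through unchanged):

```julia
using Flux
using Flux: fmap

m1, m2 = Dense(2 => 3, relu), Dense(2 => 3, relu)  # two independent initializations
c1, c2 = 0.25, 0.75                                # coefficients for one grid point
mixed = fmap(m1, m2) do p1, p2
    # combine only array parameters; leave other leaves (e.g. relu) untouched
    p1 isa AbstractArray ? c1 .* p1 .+ c2 .* p2 : p1
end
```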
Looks like Brian beat me to the punch, but I already typed this out, so here it is.

Okay, let me see if I understand the problem statement correctly. You have a "model distribution" induced by the initialization functions. You want to take two samples from this distribution (or do you want to sample two parameters from a single initialization? I was confused on this point). Then we think of these two model initializations as flat parameter vectors in parameter space. We normalize them to make them unit vectors. They are also extremely likely to be orthogonal, giving me an orthonormal basis. Then I define a Cartesian grid of coordinates, allowing me to explore linear combinations of my basis.

If the above is correct, then you could try the following.

**Explicit option**

This one is admittedly more complex, mostly because we haven't exposed using Functors.jl except in the most basic cases. There's work to be done to make this more friendly for users.

```julia
using Flux
using Flux: fmap, destructure, trainable, functor
using LinearAlgebra: norm

# this would eventually be something a user does not define
function walk_trainable(f, x, ys...)
    func, re = functor(x)
    ts = trainable(func)
    yfuncs = map(y -> functor(typeof(x), y)[1], ys)
    result = map(func, yfuncs...) do zs...
        (zs[1] in ts) ? f(zs...) : zs[1]
    end
    return re(result)
end

function normalize_model(m)
    flat, rebuild = destructure(m)
    return rebuild(flat / norm(flat))
end

function combine_models(coefficients, models)
    # scale every (trainable) parameter of each model by its coefficient
    scaled_models = map(coefficients, models) do scale, m
        fmap(m; walk = walk_trainable) do p
            scale .* p
        end
    end
    # add the corresponding parameters of the scaled models leaf by leaf
    combined_model = fmap(+, scaled_models...; walk = walk_trainable)
    return combined_model
end

model_samples = [normalize_model(toy_model()) for _ in 1:grid_cardinality]
grid = ... # make the grid
results = map(grid) do coefficients
    combine_models(coefficients, model_samples)
end
```

**Another option using `destructure`**

```julia
using Flux
using Flux: destructure
using LinearAlgebra: norm

function normalize_model(m)
    flat, rebuild = destructure(m)
    return rebuild(flat / norm(flat))
end

function combine_models(coefficients, models)
    flat, rebuild = destructure(models[1])
    flats = [destructure(m)[1] for m in models[2:end]]
    pushfirst!(flats, flat)
    # weighted sum of the flat parameter vectors
    combined_flats = mapreduce(+, coefficients, flats) do c, p
        c .* p
    end
    return rebuild(combined_flats)
end

model_samples = [normalize_model(toy_model()) for _ in 1:grid_cardinality]
grid = ... # make the grid
results = map(grid) do coefficients
    combine_models(coefficients, model_samples)
end
```

**Implicit option**

```julia
using Flux
using Flux: Params, params
using LinearAlgebra: norm

# even here it might be better to use destructure
# but I am sticking with implicit on purpose
function normalize_params!(model)
    ps = params(model)
    flat = mapreduce(vec, vcat, ps)
    flat .= flat ./ norm(flat)
    copy!(ps, flat)
    return model
end

function combine_models!(dst, coefficients, models)
    ps = params(dst)
    ps .= coefficients[1] .* params(models[1])
    for (c, m) in zip(coefficients[2:end], models[2:end])
        ps .= ps .+ c .* params(m)
    end
    return dst
end

model_samples = [normalize_params!(toy_model()) for _ in 1:grid_cardinality]
grid = ... # make the grid
dst = toy_model()
results = map(grid) do coefficients
    combine_models!(dst, coefficients, model_samples)
end
```
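As a usage sketch, here is one hypothetical end-to-end run of the `destructure`-based option above; the `toy_model` definition and grid sizes are made up for illustration, and it reuses `normalize_model`/`combine_models` as defined there.

```julia
using Flux

toy_model() = Chain(Dense(4 => 8, relu), Dense(8 => 1))  # hypothetical stand-in

model_samples = [normalize_model(toy_model()) for _ in 1:2]  # two basis directions
grid = collect(Iterators.product(range(0, 1; length=5), range(0, 1; length=5)))
results = map(grid) do (c1, c2)
    combine_models((c1, c2), model_samples)  # one combined model per grid point
end
```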
Thank you for all the suggestions - I'll give it another shot tomorrow :)
Okay I had some time to properly read your comments:
That sounds like a bug. I mean, that sounds very much like a shallow copy: https://github.com/JuliaLang/julia/blob/master/base/deepcopy.jl

I still would expect

Your first interpretation is correct, I think. I did not completely understand the

I guess the main takeaway from this is that you do intend to identify parameters and models. And I am starting to understand that philosophy, reading about Functors. What I am a bit confused about is this description of

That sounds on the surface identical to the (currently undocumented)

EDIT: it also makes much more sense to "destructure a Functor" than to "functor a Functor"
You can think of a functor as a tree of nodes representing the model, with the array parameters as the leaves (as well as things like activations, convolution kernel parameters, etc.).
Yes, they are very similar. In terms of naming, we have #1946 for exactly this reason, because we agree that
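For concreteness, a small illustration of the tree-of-array-leaves view versus the flat-vector view (assuming Flux ≥ 0.13; the layer choice is arbitrary):

```julia
using Flux
using Functors: functor

d = Dense(2 => 3, relu)
children, rebuild = functor(d)  # named tuple of fields: (weight = ..., bias = ..., σ = relu)
flat, re = Flux.destructure(d)  # all array leaves flattened into a single vector
d2 = re(flat)                   # rebuild a structurally identical layer from the flat vector
```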
Something which made me uncomfortable using this destructure technique: we are assuming that the

I mean, if the model were a class, this could be a class method. But they are in general anonymous structs generated by `Chain`ing other model structs together. Throwing away a result all the time also feels wasteful performance-wise.
Most of the cost is walking the model, which you have to do no matter what. The

But the larger problem of assuming the models are the same structure... if they weren't, then the concept of linear combinations of them wouldn't make sense. You could add checks to
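One way such a check might look (my own sketch, not something proposed in the thread, and assuming that comparing flat parameter lengths is a good-enough guard; `combine_models` refers to the `destructure`-based version above):

```julia
using Flux

# crude guard: two models "match" if they flatten to the same number of parameters
same_shape(m1, m2) = length(Flux.destructure(m1)[1]) == length(Flux.destructure(m2)[1])

function checked_combine(coefficients, models)
    @assert all(same_shape(models[1], m) for m in models) "models do not share a parameter layout"
    return combine_models(coefficients, models)
end
```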
I agree, it's really designed for bitstypes, where a bitwise copy is a deep copy. There are many threads out there where core devs have explicitly discouraged people from using it, but I'm not sure if forbidding its use at the language level is possible without making breaking changes. For our part we could override

On

But that's not really important here, since you need not touch
@ToucheSir I like the destructure/restructure mechanic, actually. It is quite readable and side-effect free, judging by the lack of a `!`
It's exactly the opposite:
I am posting a picture of the Pluto notebook for better understanding/readability (what the outputs are, etc.). Below you can find copyable code.
Copyable Code
Okay, so what is happening? I define a generator of a model, `toy_model`, to essentially generate different parameter initializations, because apparently that is faster than using `loadparams!` (#1764), and I can use the default initializer to get a more realistic distribution over initializations. Then I sample two parameter vectors from this distribution and normalize them. Since they are random and high-dimensional, they are very likely to be almost orthogonal, so after normalization we get (nearly) orthonormal vectors. So when I create the grid using the coordinates from a Cartesian grid, I am essentially doing an orthonormal basis change.

Okay, so far so good. Now when I determine the distances between all the points on the new grid, the end result should be the same as before the basis change. The longest distance is between the points `(0,0)` and `(1,1)`, which is the square root of 2, so ~1.4. And that is in fact the largest value of the output. Nice! Everything works as intended.

But if I comment the `Flux.loadparams!` back in, my new output makes no sense. This implies that `loadparams!` somehow modifies `ps` even through the `deepcopy`?
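For what it's worth, the geometric expectation here is easy to check numerically with plain vectors (a standalone sketch, independent of Flux):

```julia
using LinearAlgebra

# two random high-dimensional unit vectors are nearly orthogonal...
v1, v2 = normalize(randn(10_000)), normalize(randn(10_000))
dot(v1, v2)        # ≈ 0

# ...so the grid points (0,0) and (1,1) in this basis are ≈ sqrt(2) apart
p00 = 0 .* v1 .+ 0 .* v2
p11 = 1 .* v1 .+ 1 .* v2
norm(p11 .- p00)   # ≈ 1.414
```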