
Implementing multistart version of theta_est using multiple sampling methods #3575


Draft · wants to merge 23 commits into main

Conversation

@sscini commented Apr 23, 2025

Fixes # .

Summary/Motivation:

Currently, the optimization is performed from only a single initial value. This implementation adds the ability to specify multiple initial values generated with a selected sampling technique: random uniform sampling, Latin hypercube sampling, or Sobol quasi-Monte Carlo sampling.
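
For context, a minimal sketch of how such starting points can be drawn and scaled to the parameter bounds with numpy and scipy.stats.qmc (the function name, arguments, and method strings here are illustrative, not the exact API added in this PR):

import numpy as np
from scipy.stats import qmc

def generate_start_points(lower, upper, n_restarts, method="latin_hypercube", seed=None):
    """Return an (n_restarts, n_params) array of starting points within the bounds."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    d = len(lower)
    if method == "uniform":
        # Independent draws from a uniform distribution on the unit hypercube
        unit = np.random.default_rng(seed).uniform(size=(n_restarts, d))
    elif method == "latin_hypercube":
        unit = qmc.LatinHypercube(d=d, seed=seed).random(n=n_restarts)
    elif method == "sobol":
        # The unscrambled Sobol sequence starts at the origin, so draw one extra
        # point and drop it (scipy warns if n is not a power of 2)
        unit = qmc.Sobol(d=d, scramble=False).random(n=n_restarts + 1)[1:]
    else:
        raise ValueError(f"Unknown sampling method: {method}")
    # Scale the unit-hypercube samples to the [lower, upper] box
    return qmc.scale(unit, lower, upper)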

Changes proposed in this PR:

  • All changes were made by adding pseudocode in comments
  • Added the inputs needed for multistart simulation
  • Added a function to generate points using the selected method
  • Added theta_est_multistart to drive the multistart process

TODO before converting from draft:

  • Receive feedback from collaborators on logical setup
  • Convert the finalized pseudocode into working code
  • Test and debug
  • Confirm functionality with examples

Legal Acknowledgement

By contributing to this software project, I have read the contribution guide and agree to the following terms and conditions for my contribution:

  1. I agree my contributions are submitted under the BSD license.
  2. I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

@sscini (Author) commented Apr 23, 2025

@djlaky @adowling2 Please provide early feedback.

@sscini (Author) commented Apr 30, 2025

To do: add dynamic saving using flush.

@adowling2 (Member) left a comment

Notes from our in-person discussion/informal code review

# # If only one restart, return an empty list
# return []

# return {theta_names[i]: initial_theta[i] for i in range(len(theta_names))}

Member:

We discussed adding a "dataframe" sampling method that uses multistart points defined by the user. This is helpful if we want to try the same set of multistart points for multiple experiments.
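
A minimal sketch of that idea, assuming the user supplies a pandas DataFrame whose columns match the theta names (the helper name is illustrative):

import pandas as pd

def starts_from_dataframe(df, theta_names):
    """Validate user-supplied multistart points and return them in theta_names order."""
    missing = [name for name in theta_names if name not in df.columns]
    if missing:
        raise ValueError(f"Multistart DataFrame is missing columns: {missing}")
    # Reorder the columns so each row is one starting point in theta_names order
    return df[theta_names].copy()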

"Multistart is not supported in the deprecated parmest interface")
)

assert isinstance(n_restarts, int)

Member:

Also check that this is > 1

Member:

Please look at other Pyomo code for examples of throwing exceptions.

Contributor:

Agree with @adowling2 here, you need to throw an exception so you can test the exception is caught.
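
For example, the assert could become an informative exception, which a unit test can then exercise (a sketch; the exact message and call signature in the PR may differ):

def validate_n_restarts(n_restarts):
    # Raising (rather than asserting) gives users a clear message and lets tests
    # check the error path explicitly
    if not isinstance(n_restarts, int) or n_restarts < 1:
        raise ValueError(
            f"n_restarts must be a positive integer, received {n_restarts!r}"
        )

# In the test suite, e.g. with Pyomo's unittest wrapper:
#     with self.assertRaises(ValueError):
#         pest.theta_est_multistart(n_restarts=0)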

)


results = []

Member:

It might make more sense to create a dataframe and then add rows as you go. Or you could preallocate the dataframe size because you know how many restarts.

Member:

You could even have your generate_samples function generate this empty dataframe.
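
A sketch of preallocating the results frame up front (column names are illustrative; declaring the termination column as object dtype would also avoid the pandas dtype FutureWarning seen in the example output later in this thread):

import numpy as np
import pandas as pd

def make_results_frame(theta_names, n_restarts):
    """Preallocate one row per restart, to be filled in as each solve finishes."""
    columns = (
        [f"initial_{name}" for name in theta_names]
        + [f"converged_{name}" for name in theta_names]
        + ["objective", "solver termination"]
    )
    df = pd.DataFrame(np.nan, index=range(n_restarts), columns=columns)
    # Store the solver termination condition as a string without dtype warnings
    df["solver termination"] = df["solver termination"].astype(object)
    return df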

@sscini (Author) commented Apr 30, 2025

To do: extend the existing parmest tests to include multistart.

@sscini (Author) commented Apr 30, 2025

To do: the models provided need to include bounds; add an exception for when they do not.

@adowling2 (Member) left a comment

Here are some more comments for you to consider as you continue to refine this.

upper_bound = np.array([parmest_model.find_component(name).ub for name in theta_names])
# Check if the lower and upper bounds are defined
if np.any(np.isnan(lower_bound)) or np.any(np.isnan(upper_bound)):
raise ValueError(

Member:

You probably already know this, but you will need to check all the errors are raised when expected.

)

if self.method == "random":
np.random.seed(seed)

Member:

Do you want to skip setting the random seed if seed=None (default)?

@sscini (Author) commented May 1, 2025

The default is None for all the functions I use that set the seed, so if it receives seed=None, it works as expected. Would skipping it still be best practice?
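
One option that sidesteps the question is numpy's Generator interface, which accepts seed=None without touching global state (a sketch, not the code in this PR):

import numpy as np

def draw_uniform_starts(n_restarts, n_params, seed=None):
    # default_rng(None) draws fresh OS entropy; an int seed gives reproducible draws,
    # and neither case modifies numpy's global random state
    rng = np.random.default_rng(seed)
    return rng.uniform(size=(n_restarts, n_params))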

elif self.method == "latin_hypercube":
# Generate theta values using Latin hypercube sampling
sampler = scipy.stats.qmc.LatinHypercube(d=len(theta_names), seed=seed)
samples = sampler.random(n=self.n_restarts+1)[1:] # Skip the first sample

Member:

Why are you skipping the first sample? Please explain in the comments.

@sscini (Author):

I will add a comment in the code to explain as well. The first sample generated using qmc.Sobol is always the origin (the zero vector). I thought this logic applied to all QMC methods, but it only applies to Sobol. So to get nonzero points, you need to skip the first sample.
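
For reference, a small demonstration of the behavior being described (scipy scrambles Sobol points by default, which is another way to avoid the zero point):

from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=False)
print(sampler.random(4)[0])  # -> [0. 0.]; the unscrambled sequence starts at the origin
# With scramble=True (scipy's default) the first point is randomized instead, but
# scipy warns unless the number of points drawn is a power of 2.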

@@ -921,6 +1020,116 @@ def theta_est(
cov_n=cov_n,
)

def theta_est_multistart(
self,
buffer=10,

Member:

Need to explain the buffer in the doc string.
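
Something along these lines could go in the docstring (a sketch of the intent, based on how buffer is used later in this PR):

buffer : int, optional
    Number of restarts accumulated between intermediate saves when
    save_results is True; results are appended to the CSV every
    ``buffer`` iterations (default 10).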

"Multistart is not supported in the deprecated parmest interface"
)

assert isinstance(self.n_restarts, int)

Member:

Replace all of these with more descriptive error messages. Remember that we need tests for each error message.

# optimization. It will take the theta names and the initial theta values
# and return a dictionary of theta names and their corresponding values.
def _generate_initial_theta(self, parmest_model, seed=None, n_restarts=None, multistart_sampling_method=None, user_provided=None):
if n_restarts == 1:

Contributor:

I like just sending a warning, and not returning. For example, n_restarts might be 1 by default. You should check if n_restarts is an int as well. Then, if n_restarts is 1, you should send a warning that the tool is intended for this number to be greater than one and solve as normal.

@sscini (Author):

Alex had initially said to just return a message, but I will ask him again.


# Get the theta names and initial theta values
theta_names = self._return_theta_names()
initial_theta = [parmest_model.find_component(name)() for name in theta_names]

Contributor:

Is it better to use the suffix for this? The suffix value shouldn't change, but the theta value may if the model has been solved for some reason. I don't know if this is a potential issue but I think that grabbing these values from the suffixes would be more dummy-proof.

@adowling2 (Member):
@sscini Tag me when this is ready for the next review

@sscini (Author) commented Jun 3, 2025

@adowling2 @djlaky Ready for the next round of review (parmest.py file only; the example is still being developed). Currently functional, with a few minor issues; the following is the output of a separate MM example:

Setting theta_1 to 124.2278222925961
Setting theta_2 to 0.7336450144648552
Restart 1/5: Objective Value = 41.49573453661323, Theta = theta_1 212.683743
theta_2 0.064121
dtype: float64
Setting theta_1 to 201.92126417532563
Setting theta_2 to 0.18795708194375038
Restart 2/5: Objective Value = 41.49573453661323, Theta = theta_1 212.683743
theta_2 0.064121
dtype: float64
Setting theta_1 to 438.33825131878257
Setting theta_2 to 0.8983656093478203
Restart 3/5: Objective Value = 41.49573453661323, Theta = theta_1 212.683743
theta_2 0.064121
dtype: float64
Setting theta_1 to 401.7291865311563
Setting theta_2 to 0.01574030425399542
Restart 4/5: Objective Value = 41.49573453661323, Theta = theta_1 212.683743
theta_2 0.064121
dtype: float64
Setting theta_1 to 145.7563922740519
Setting theta_2 to 0.8198425201699138
/Applications/anaconda3/envs/parmest-dev-mac/lib/python3.13/site-packages/scipy/stats/_qmc.py:993: UserWarning: The balance properties of Sobol' points require n to be a power of 2.
sample = self.random(n, workers=workers)
/Users/scini/Documents/GitHub/pyomo/pyomo/contrib/parmest/parmest.py:1205: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
results_df.at[i, f'converged_{name}'] = converged_theta[j] if not np.isnan(converged_theta_vals[i, j]) else np.nan
/Users/scini/Documents/GitHub/pyomo/pyomo/contrib/parmest/parmest.py:1209: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'optimal' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
results_df.at[i, "solver termination"] = solver_termination if 'solver_termination' in locals() else np.nan
/Users/scini/Documents/GitHub/pyomo/pyomo/contrib/parmest/parmest.py:1205: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
results_df.at[i, f'converged_{name}'] = converged_theta[j] if not np.isnan(converged_theta_vals[i, j]) else np.nan
Restart 5/5: Objective Value = 41.49573453661323, Theta = theta_1 212.683743
theta_2 0.064121
dtype: float64

Thanks!

@sscini (Author) commented Jun 4, 2025

Note for reviewers: The current form of the code attempts to rewrite the parameters for each iteration but cannot do so successfully, and gives the same result each time. Advice on how to set this value from within theta_est_multistart() would be appreciated.

Update: It looks like something else. In the model pprint of the integrated ef_instance, the parameters are stale, so they are being set but something else is happening. Looking into a fix now.

@sscini (Author) commented Jun 4, 2025

I added my current example, which uses a simple multimodal model (described in the file), as a temporary .ipynb containing the data generation, model formation, a manual for-loop version of multistart, and the currently implemented version, to show the output differences.

@djlaky (Contributor) left a comment

Finished up my initial review. Mostly comments about logging, shifting to Enum instead of strings, wondering why we aren't leveraging the suffixes from the model, and concerns about the buffered saving.
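
On the Enum point, a minimal sketch of replacing the method strings (member names are illustrative):

from enum import Enum

class MultistartSampling(Enum):
    UNIFORM_RANDOM = "uniform_random"
    LATIN_HYPERCUBE = "latin_hypercube"
    SOBOL = "sobol"
    USER_PROVIDED_VALUES = "user_provided_values"

# User-facing strings can still be accepted and normalized to the Enum,
# so a typo fails loudly with a ValueError instead of silently falling through
method = MultistartSampling("sobol")
assert method is MultistartSampling.SOBOL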

"""
if n_restarts == 1:
# If only one restart, return an empty list
return print("No multistart optimization needed. Please use normal theta_est()")

Contributor:

You should raise a warning/log something here instead of using a print statement. That way you can use a debugger to control whether the message is displayed.
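
A sketch of the logging approach (a module-level logger via the standard logging module, the usual Pyomo pattern; the message wording is illustrative):

import logging

logger = logging.getLogger(__name__)

def warn_if_single_restart(n_restarts):
    if n_restarts == 1:
        # A warning can be filtered or silenced by the caller, unlike a print
        logger.warning(
            "Multistart is intended for n_restarts > 1; "
            "solving once from the current initial point."
        )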


# Get the theta names and initial theta values
theta_names = self._return_theta_names()
initial_theta = [parmest_model.find_component(name)() for name in theta_names]

Contributor:

Are we sure we can't do this with suffixes? It feels like you should be able to do all three of these (also for upper and lower bounds) using the suffixes on parmest_model?
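
If the labeled model follows the Experiment convention where unknown_parameters is a Suffix keyed by the parameter components, something like this might work (a sketch based on that assumption, not verified against the parmest internals touched here):

import numpy as np
import pyomo.environ as pyo

def theta_info_from_suffix(parmest_model):
    """Pull names, initial values, and bounds from the unknown_parameters Suffix."""
    theta_vars = list(parmest_model.unknown_parameters.keys())
    theta_names = [v.name for v in theta_vars]
    initial_theta = np.array([pyo.value(v) for v in theta_vars])
    # Missing bounds (None) become NaN, matching the isnan check above
    lower = np.array([v.lb for v in theta_vars], dtype=float)
    upper = np.array([v.ub for v in theta_vars], dtype=float)
    return theta_names, initial_theta, lower, upper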

)

# Check the length of theta_names and initial_theta, and make sure bounds are defined
if len(theta_names) != len(initial_theta):

Contributor:

I think this is unnecessary. You populate initial_theta using theta_names so I would hope it's not possible to even design a test that could get here.

"The length of theta_names and initial_theta must be the same."
)

if multistart_sampling_method == "uniform":

Contributor:

Maybe uniform_random would be more descriptive?

# The first value of the Sobol sequence is 0, so we skip it
samples = sampler.random(n=n_restarts+1)[1:]

elif multistart_sampling_method == "user_provided":

Contributor:

I think maybe user_provided does not imply values. For instance, what if a user wants to provide their own random sample generator?

Probably should use user_provided_values to be more explicit.

Also, should have some comments at the beginning of this describing what the method does, just like the other options.

results_df.to_csv(
file_name, mode=mode, header=header, index=False
)
print(f"Intermediate results saved after {i + 1} iterations.")

Contributor:

Same comment about logging.

# Final save after all iterations
if save_results:
results_df.to_csv(file_name, mode='a', header=False, index=False)
print("Final results saved.")

Contributor:

Same comment about logging.

print(f"Intermediate results saved after {i + 1} iterations.")

# Final save after all iterations
if save_results:

Contributor:

If this uses a buffer, will this not add the whole data to the intermediately saved data?

file_name, mode=mode, header=header, index=False
)
print(f"Intermediate results saved after {i + 1} iterations.")

Contributor:

You need to append the remaining data with the buffer-based saving system when n_restarts is not a multiple of 10.


# Add buffer to save the dataframe dynamically, if save_results is True
if save_results and (i + 1) % buffer == 0:
mode = 'w' if i + 1 == buffer else 'a'

Contributor:

Does this keep appending the whole data frame to the CSV each time? It's not clear that this is actually efficient. I'm concerned this will make files with an enormous number of entries.
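
One way to address both concerns, writing only the rows produced since the last save and flushing any remainder after the loop, might look like this (a sketch with illustrative names):

import pandas as pd

def save_chunk(results_df, file_name, last_saved, upto):
    """Append rows [last_saved, upto) to the CSV and return the new last_saved index."""
    chunk = results_df.iloc[last_saved:upto]
    # Write the header only on the first chunk; append thereafter
    chunk.to_csv(file_name, mode="w" if last_saved == 0 else "a",
                 header=last_saved == 0, index=False)
    return upto

# Inside the restart loop:
#     if save_results and (i + 1) % buffer == 0:
#         last_saved = save_chunk(results_df, file_name, last_saved, i + 1)
# After the loop, flush the tail when n_restarts is not a multiple of buffer:
#     if save_results and last_saved < n_restarts:
#         last_saved = save_chunk(results_df, file_name, last_saved, n_restarts)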

@@ -91,7 +91,7 @@ def ef_nonants(ef):


def _experiment_instance_creation_callback(
scenario_name, node_names=None, cb_data=None
scenario_name, node_names=None, cb_data=None, fix_vars=False,

Member:

I would support having utility methods _fix_unknown_parameters and _fix_experiment_inputs live within the Experiment class. These methods could have fix=True as a default, and fix=False would unfix.
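
A sketch of what those utilities could look like (assuming the labeled model carries unknown_parameters and experiment_inputs Suffixes keyed by the components; names follow the suggestion above but are not final):

# Methods that could live on the existing Experiment class
def _fix_unknown_parameters(self, fix=True):
    """Fix (or, with fix=False, unfix) every component labeled as an unknown parameter."""
    model = self.get_labeled_model()
    for comp in model.unknown_parameters:
        comp.fix() if fix else comp.unfix()

def _fix_experiment_inputs(self, fix=True):
    """Fix (or, with fix=False, unfix) every component labeled as an experiment input."""
    model = self.get_labeled_model()
    for comp in model.experiment_inputs:
        comp.fix() if fix else comp.unfix()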

@sscini (Author) commented Jun 10, 2025

Plans this week:

  • Address above comments
  • Write a function to reinitialize the model using the suffix values to remove the need for strings
  • Finish the "user_provided_samples" method
  • Finish new examples to confirm functionality
  • Bring new questions to 6/17 dev meeting

@sscini (Author) commented Jun 17, 2025

Notes for Pyomo Dev Meeting 6/17/25:

  • Plan to complete the plan from the previous message

  Context:
  • Currently, Q_opt creates the parmest model using experiment_creation_callback (line 89) and then compares it to data.
  • The only existing option is: if ThetaVals are provided, fix them.
  • I added an option to provide them but leave them unfixed, which is needed for my multistart, but this requires strings, and I have heard a lot of work was spent moving away from strings.

  Problem:
  • There is no way to access that model or to insert a prebuilt Pyomo model into Q_opt.

  Proposed solution:
  • I would like to make a function to refactor or change the initialized values of the model and then send that model to Q_opt. This would avoid the need for strings to provide ThetaVals to Q_opt. Does this make sense?
