feat: Fast model loading for inference #125
base: main
Conversation
I added a new feature that allows fast model loading for inference.
fms_mo/prep.py
@@ -535,6 +535,30 @@ def has_quantized_module(model):
    """Check if model is already quantized - do not want to quantize twice if so"""
    return any(isinstance(m, quantized_modules) for m in model.modules())

def swap_qbmm(model, qcfg):
Need to add a docstring and add data types to the function args.
done
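For context, a minimal sketch of what the requested signature could look like with a docstring and type hints added. This is illustrative only, not the actual implementation in fms_mo/prep.py; the qcfg key and the replacement step are assumptions.

```python
# Illustrative sketch only - the real swap_qbmm may differ; the qcfg key
# "qbmm_modules" and the replacement logic are assumptions.
import torch.nn as nn


def swap_qbmm(model: nn.Module, qcfg: dict) -> nn.Module:
    """Swap eligible submodules for quantized bmm (QBmm) replacements.

    Args:
        model: Model being prepared for quantized inference.
        qcfg:  Quantization config dict (e.g. loaded via qconfig_load).

    Returns:
        The model with QBmm replacements attached, ready for inference.
    """
    for name, _module in model.named_modules():
        # Hypothetical predicate: decide from qcfg whether this submodule
        # should receive a quantized bmm replacement.
        if name in qcfg.get("qbmm_modules", []):
            ...  # attach/replace with a QBmm instance here
    return model
```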
fms_mo/utils/qconfig_utils.py
@@ -623,7 +623,7 @@ def qconfig_save(
def qconfig_load(fname: str = "qcfg.json") -> dict:
    """Read config in json format, work together with qconfig_save"""
    config = get_recipe(fname)
Dead spacing here. Delete it.
corrected
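For reference, a hedged usage sketch of the load side of this save/load pair, based only on the signature shown in the diff. It assumes a qcfg.json previously written by qconfig_save exists in the working directory.

```python
# Assumed round-trip based on the signature shown above; not taken verbatim
# from the repo. Requires a qcfg.json produced earlier by qconfig_save.
from fms_mo.utils.qconfig_utils import qconfig_load

qcfg = qconfig_load("qcfg.json")  # reads the JSON recipe via get_recipe
print(type(qcfg))                 # -> dict, per the annotated return type
```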
A few more nitpicks.
Also, please run the following and fix anything that lint or spellcheck flags. "tox -e fix" will automatically change files; you just have to add and commit them. If multiple changes are needed, package them into one commit if possible.
tox -e fix
tox -e lint
tox -e spellcheck
Signed-off-by: omobayode.fagbohungbe <[email protected]>
Description of the change
This PR enables faster loading of a quantized model by calling only the functions/sub-functions needed to load the model, while skipping the functions needed to quantize it. An inference argument was added to the fms_mo arguments to activate this path.
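As an illustration, the intended call pattern might look like the sketch below. The entry point qmodel_prep and the exact keyword spelling are assumptions; the placeholder model and input exist only to make the snippet self-contained.

```python
# Hypothetical call pattern - entry point and keyword spelling are assumptions.
import torch
from fms_mo import qmodel_prep
from fms_mo.utils.qconfig_utils import qconfig_load

model = torch.nn.Linear(16, 16)      # placeholder model for illustration
sample_input = torch.randn(1, 16)    # placeholder example input
qcfg = qconfig_load("qcfg.json")     # previously saved quantization recipe

# With the new inference flag enabled, only the loading path runs; the
# quantization-preparation steps are skipped, so the model loads faster.
model = qmodel_prep(model, sample_input, qcfg, inference=True)
```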
Related issue number
None
How to verify the PR
The PR was validated by performing Direct Quantization with SmoothQuant and passing the inference argument along with the rest of the arguments. The validation was done with and without QBmm.
Was the PR tested