I trained a custom model for some commands in Slovak and wanted to improve it with a user-specific verifier, but I ran into something that was unintuitive to me.
The model is trained by default on 3-second recordings. At a sample rate of 16000 Hz that means 48000 samples in the time dimension, which is converted into a feature of shape (28, 96). But the `predict_clip` function in the "Evaluate model" section, and later `train_custom_verifier`, work on just 80 ms recordings (chunks), and it actually works quite well.
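For reference, this is roughly how I am calling it. A minimal sketch assuming openWakeWord's `Model` API; the model filename is a placeholder:

```python
import numpy as np
from openwakeword.model import Model

# Load the custom Slovak command model (placeholder path)
model = Model(wakeword_models=["my_slovak_command.onnx"])

# predict_clip takes a whole recording (16-bit, 16 kHz WAV path or array)
# but internally steps through it in 1280-sample (80 ms) chunks.
predictions = model.predict_clip("positive_example.wav")
for frame_scores in predictions:
    print(frame_scores)  # one score dict per 80 ms frame
```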
How can this work technically if the model is trained on 3 s inputs (is there some padding?), and how can it work practically if most commands are much longer than 80 ms?
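The only way I can picture this working is if each 80 ms chunk only appends one new feature frame to an internal rolling buffer, while the classifier always scores the full (28, 96) window, zero-padded until 3 s of audio have streamed in. A hypothetical sketch of that pattern (the buffer size and frame shape are my guesses from the feature shape above, and `embed`/`score` stand in for the real feature extractor and classifier head):

```python
from collections import deque
import numpy as np

N_FRAMES, FEAT_DIM = 28, 96   # feature shape from training
CHUNK = 1280                  # 80 ms at 16 kHz

# Zero "padding" until 3 s of audio have streamed in
buffer = deque([np.zeros(FEAT_DIM, dtype=np.float32)] * N_FRAMES,
               maxlen=N_FRAMES)

def embed(chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the melspectrogram + embedding stage that maps
    one 80 ms chunk to a single 96-dim feature frame."""
    return np.zeros(FEAT_DIM, dtype=np.float32)

def score(features: np.ndarray) -> float:
    """Placeholder for the trained classifier head; expects (28, 96)."""
    return 0.0

def process_chunk(chunk: np.ndarray) -> float:
    buffer.append(embed(chunk))      # only one new frame per 80 ms call
    return score(np.stack(buffer))   # full 3 s window scored every call
```

Is this roughly what happens under the hood?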