This looks absolutely fantastic, please accept my meagre professional jealousy. I have long bemoaned manual hyperparam fiddling. I have on occasion dabbled with nonparametric ("genetic") methods of hyperparam tuning inspired by AutoML... but then you still have to manually tune the evolutionary hyperparams.
Finding a way to derive this from the gradients is amazing.
It’s an interesting idea; I have two questions.
- Surprise is detected by the norm of the gradients. So, doesn’t this suggest that the model already has a way of adjusting to surprise?
- Is there a danger of model instability when the gradients become larger and the learning rate is also increased?
1. An overly strong surprise is like PTSD in humans: it permanently rewrites the model's previously learned experience, which is exactly what we want to avoid.
2. Some instability is bound to happen, but our PILR-S is designed to keep the learning rate within the bell curve, decreasing it as the surprise decreases (less new information, less learning); see the rough sketch below.
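For readers who want something more concrete, here is a minimal sketch of what a bell-curve-gated learning rate could look like. The thread doesn't give PILR-S's actual formula, so everything here is an assumption chosen to match the description above: the function name `surprise_gated_lr`, the EMA statistics over the gradient norm, and the Gaussian gate with its `target_z`/`gate_width` parameters are illustrative guesses, not the authors' implementation.

```python
import math

def surprise_gated_lr(grad_norm, stats, base_lr=1e-3, momentum=0.99,
                      target_z=1.0, gate_width=1.0, eps=1e-8):
    """Return a learning rate gated by how surprising the current gradient is.

    `stats` carries an exponential moving average of the gradient norm and its
    variance. All names and defaults are illustrative guesses, not PILR-S itself.
    """
    # Update running mean/variance of the gradient norm (simple EMA).
    mean = momentum * stats["mean"] + (1 - momentum) * grad_norm
    var = momentum * stats["var"] + (1 - momentum) * (grad_norm - mean) ** 2
    stats["mean"], stats["var"] = mean, var

    # Surprise as a z-score: how far the current gradient norm is from "normal".
    z = (grad_norm - mean) / (math.sqrt(var) + eps)

    # Bell-curve gate: largest near a moderate surprise (target_z), shrinking
    # both for tiny surprises (little new information, little learning) and
    # for extreme ones, so a huge gradient never turns into a huge step.
    gate = math.exp(-0.5 * ((z - target_z) / gate_width) ** 2)

    return base_lr * gate

# Toy usage: a sudden spike in gradient norm (5.0) does not spike the LR.
stats = {"mean": 1.0, "var": 0.1}
for step, grad_norm in enumerate([0.9, 1.1, 5.0, 1.0]):
    print(step, round(surprise_gated_lr(grad_norm, stats), 6))
```

The point of the Gaussian gate is exactly the two answers above: small surprises earn small steps, and extreme surprises are damped rather than amplified, so a single shocking batch cannot overwrite what the model already knows.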
Parameters I'd Like to Fiddle