There is no reason to start caching previous frames until AFTER the user has paused and pressed the "back frame" key. Only THEN does it need to rewind to the previous i-frame and re-render and cache frames. And there is no measurable cost to remembering the timestamp of the last i-frame, so you know where to rewind to.
You wouldn't have to any more than you need 0 through N frames in memory to calculate frame N+1. Whatever your decoding state completes at frame N can be considered a key frame.
It doesn’t have to be every frame though. Pre calculate to 25%/50%/75%, then as I’m playing, save key frames for more incremental points, and if I start scrubbing, calculate more around that region.
Edit: this doesn’t have to happen synchronously either, it can occur in a background thread or passively.