Description
PR #224 adds a "ratchet" to the prompt progress callback. This is needed so that prompt progress does not appear to go backwards when we perform prefill in the CacheWrapper (the component that enables cache re-use for text-only prompts).
When prefill is done by the CacheWrapper, the progress callback is called as expected. The callback and a single input token are then passed to the mlx_lm.stream_generate method, which calls the progress callback again with "0 out of 1" tokens processed. This causes progress to appear to regress briefly. The ratchet patches this by dropping any progress update that is lower than one already reported, so the value the caller sees never decreases.
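For illustration, the ratchet idea can be sketched as a small wrapper around the callback. This is not the code from PR #224: the name `make_ratchet` is hypothetical, and the sketch assumes the callback takes a single progress percentage, which may not match the real signature.

```python
from typing import Callable

# Assumed callback shape: one float, the percent of the prompt processed.
ProgressCallback = Callable[[float], None]


def make_ratchet(callback: ProgressCallback) -> ProgressCallback:
    """Wrap `callback` so that reported progress never goes backwards.

    Hypothetical helper, not part of mlx_lm or this repository.
    """
    highest = float("-inf")  # highest progress value forwarded so far

    def ratcheted(percent: float) -> None:
        nonlocal highest
        if percent < highest:
            # A later stage restarted its own count (e.g. "0 out of 1"),
            # so drop the update instead of showing a regression.
            return
        highest = percent
        callback(percent)

    return ratcheted


# Usage: simulate CacheWrapper prefill followed by the single-token call.
reported = []
on_progress = make_ratchet(reported.append)
for value in [0.0, 50.0, 100.0, 0.0, 100.0]:
    on_progress(value)
print(reported)
```

In this sketch the second `0.0` is dropped, while the repeated `100.0` is still forwarded because only strictly lower values are filtered. Whether equal values should also be suppressed is a design choice left open here.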
This is more of a band-aid than a proper fix. A proper solution would be to unify the prefill logic so that we don't need to "trick" mlx_lm.stream_generate into thinking there is only a single prompt token to process.