I would like to request 2-3 modes when answering a voice (or video) message:
- transcribe message and reply
- transcribe message only (don't reply; go straight to Whisper, which saves tokens)
- reply only (current implementation, could be kept as default or replaced by transcribe + reply)
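
To make the request concrete, here is a rough sketch of how the three modes could be dispatched. Everything in it is hypothetical: `VoiceMode`, `handle_voice_message`, and the `transcribe` / `generate_reply` callables are names made up for illustration, not the project's actual API. It also assumes the current reply-only mode transcribes internally and just doesn't show the transcript.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class VoiceMode(Enum):
    """Hypothetical mode names for the three behaviours requested above."""
    TRANSCRIBE_AND_REPLY = "transcribe_and_reply"
    TRANSCRIBE_ONLY = "transcribe_only"
    REPLY_ONLY = "reply_only"  # assumed to match the current behaviour


def handle_voice_message(
    audio: bytes,
    mode: VoiceMode,
    transcribe: Callable[[bytes], str],      # placeholder, e.g. a Whisper call
    generate_reply: Callable[[str], str],    # placeholder, e.g. the LLM call
) -> Dict[str, Optional[str]]:
    """Return the transcript and/or reply depending on the selected mode."""
    # Assumption: every mode needs the transcript internally,
    # even when it is not shown to the user.
    transcript = transcribe(audio)

    result: Dict[str, Optional[str]] = {"transcript": None, "reply": None}

    # Show the transcript in both transcribe modes.
    if mode is not VoiceMode.REPLY_ONLY:
        result["transcript"] = transcript

    # Skip the reply step entirely in transcribe-only mode,
    # which is where the token savings would come from.
    if mode is not VoiceMode.TRANSCRIBE_ONLY:
        result["reply"] = generate_reply(transcript)

    return result
```

With this shape, the default could stay `REPLY_ONLY` or switch to `TRANSCRIBE_AND_REPLY` without touching the handler itself.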