How should "high-level semantic planning" and "low-level acoustic rendering" be understood?

Despite these advances, continuous models often entangle high-level semantic planning with low-level acoustic rendering, leading to instability in long sequences without explicit separation.
How should the two terms in the title be understood, and how did you discover, or rather demonstrate, the "instability in long sequences without explicit separation"?