Commit ff86d68
Replace invalid utf-8 sequences with the replacement character
Before, when parsing message content, if we came across an invalid utf-8 sequence, we'd accumulate it in `undecoded_tokens` forever, and any subsequent content would get dropped when we eventually found the next stop token. Now we detect invalid utf-8 sequences, replace each with the utf-8 replacement character '\uFFFD', and continue parsing further content.

In real-world scenarios, gpt-oss models sometimes generate invalid utf-8 sequences. This can be caused by temperature settings that are too high, by prompts that use utf-8 characters in unexpected ways outside the training data, or by some combination of the two.

The net effect is that parsing keeps making forward progress after hitting an invalid utf-8 sequence. That matters when inference servers stream long message contents and users expect tokens to be streamed back as they're generated rather than buffered for long stretches inside our `StreamableParser`. See vllm-project/vllm#26480 for one such real-world scenario encountered in vLLM.
1 parent: 508cbaa
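For intuition, here is a minimal, self-contained sketch of the decode-or-replace behavior described above, using only `std`. The byte buffer `pending` and the function name `next_content_delta` are illustrative assumptions; the real parser buffers token ids in `undecoded_tokens` and decodes them through its encoding:

```rust
// Sketch only: `pending` stands in for the bytes behind `undecoded_tokens`.
fn next_content_delta(pending: &mut Vec<u8>) -> Option<String> {
    let delta = match std::str::from_utf8(pending.as_slice()) {
        // Everything buffered decodes cleanly: flush it as one delta.
        Ok(s) => Some(s.to_string()),
        Err(e) => match e.error_len() {
            // Invalid sequence: substitute U+FFFD so parsing keeps making
            // forward progress instead of buffering forever.
            Some(_) => Some('\u{FFFD}'.to_string()),
            // Incomplete sequence: wait for the next byte.
            None => None,
        },
    };
    if delta.is_some() {
        pending.clear();
    }
    delta
}

fn main() {
    let mut pending = vec![0xE2, 0x82]; // truncated '€' (0xE2 0x82 0xAC)
    assert_eq!(next_content_delta(&mut pending), None); // keep waiting
    pending.push(0xAC); // the final byte arrives
    assert_eq!(next_content_delta(&mut pending), Some("€".to_string()));
}
```

Replacing the whole pending run with a single U+FFFD (rather than one per byte) mirrors the commit's choice: the goal is forward progress while streaming, not byte-exact lossy decoding.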

3 files changed: +35 −2 lines changed

src/encoding.rs

Lines changed: 16 additions & 2 deletions
```diff
@@ -1155,8 +1155,22 @@ impl StreamableParser {
                     self.last_content_delta = Some(decoded);
                     self.undecoded_tokens.clear();
                 }
-                Err(_) => {
-                    self.last_content_delta = None;
+                Err(e) => {
+                    match e.error_len {
+                        // We're trying to decode a sequence of tokens that is not valid utf-8.
+                        // Replace the invalid sequence with a single utf-8 replacement character.
+                        Some(_) => {
+                            let replacement = '\u{FFFD}'.to_string();
+                            self.encoding
+                                .render_text_into(&replacement, content_tokens)?;
+                            self.last_content_delta = Some(replacement);
+                            self.undecoded_tokens.clear();
+                        }
+                        None => {
+                            // waiting on next byte in our utf-8 sequence
+                            self.last_content_delta = None;
+                        }
+                    }
                 }
             }
             // this was not an EOS
```
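The `Some(_)`/`None` match above leans on the semantics of `std`'s `Utf8Error::error_len()`: `Some(_)` marks bytes that can never begin a valid sequence, while `None` means the input merely ended mid-sequence. A quick demonstration with `std::str::from_utf8` (byte values chosen for illustration):

```rust
fn main() {
    // 0xE2 0x82 is a truncated prefix of '€' (0xE2 0x82 0xAC): more bytes
    // may still arrive, so error_len() is None and the parser keeps waiting.
    let truncated = std::str::from_utf8(&[0xE2, 0x82]).unwrap_err();
    assert_eq!(truncated.error_len(), None);

    // 0xFF can never occur in valid utf-8, so error_len() is Some(1) and
    // the parser emits U+FFFD and clears its buffer.
    let invalid = std::str::from_utf8(&[0xFF, b'a']).unwrap_err();
    assert_eq!(invalid.error_len(), Some(1));
}
```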

src/tiktoken.rs

Lines changed: 3 additions & 0 deletions
```diff
@@ -148,6 +148,7 @@ impl std::error::Error for DecodeKeyError {}
 #[derive(Debug, Clone)]
 pub struct DecodeError {
     pub message: String,
+    pub error_len: Option<usize>,
 }
 
 impl std::fmt::Display for DecodeError {
@@ -212,9 +213,11 @@ impl CoreBPE {
     {
         let bytes = self.decode_bytes(tokens).map_err(|e| DecodeError {
            message: format!("Invalid token error: {e}"),
+            error_len: None,
        })?;
        String::from_utf8(bytes).map_err(|e| DecodeError {
            message: format!("Invalid utf-8 sequence: {e}"),
+            error_len: e.utf8_error().error_len(),
        })
    }
```
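The new `error_len` field is populated straight from `std`: `String::from_utf8` fails with a `FromUtf8Error`, and `utf8_error().error_len()` yields the `Option<usize>` that `encoding.rs` matches on. A small illustration of that chain (byte values are arbitrary examples):

```rust
fn main() {
    // A truncated 4-byte sequence (e.g. the start of an emoji) could still
    // be completed by later bytes, so error_len() is None.
    let err = String::from_utf8(vec![0xF0, 0x9F]).unwrap_err();
    assert_eq!(err.utf8_error().error_len(), None);

    // A lone continuation byte can never become valid: error_len() is Some(1).
    let err = String::from_utf8(vec![0x80]).unwrap_err();
    assert_eq!(err.utf8_error().error_len(), Some(1));
}
```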

tests/test_harmony.py

Lines changed: 16 additions & 0 deletions
```diff
@@ -981,3 +981,19 @@ def test_streamable_parser_tool_call_with_constrain_adjacent():
     ]
 
     assert parser.messages == expected
+
+
+def test_streamable_parser_invalid_utf8_decoding():
+    encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
+
+    # The sequence of two "9552" tokens in a row is not valid utf-8.
+    # Each pair of those will give us one utf-8 replacement character.
+    tokens = [200006, 173781, 200008, 9552, 9552, 9552, 9552, 135596, 200007]
+    parser = StreamableParser(encoding, None)
+    for token in tokens:
+        parser.process(token)
+
+    expected = [
+        Message.from_role_and_content(Role.ASSISTANT, "\uFFFD\uFFFDworked"),
+    ]
+    assert parser.messages == expected
```
