Commit ff86d68
Replace invalid utf-8 sequences with the replacement character
Before, when parsing message content, if we came across any invalid
utf-8 sequences, we'd forever accumulate them in `undecoded_tokens` and
any subsequent content would get dropped when we eventually found our
next stop token.
Now, we detect invalid utf-8 sequences, replace each one with the Unicode
replacement character '\uFFFD', and continue parsing further content. In
real-world scenarios, gpt-oss models sometimes generate invalid utf-8
sequences. This could be caused by temperature settings that are too
high, prompts that use utf-8 characters extensively in unexpected ways
outside the training datasets, or some combination of both.
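
The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parser's actual code: it works on raw byte chunks rather than tokens, the function names `decode_stream_strict` and `decode_stream_lossy` are hypothetical, and it relies on the standard library's incremental decoder to tell an incomplete trailing sequence (keep buffering) apart from an invalid one (replace and move on).

```python
import codecs


def decode_stream_strict(chunks):
    """Sketch of the old behavior: buffer until the whole buffer decodes.

    Hypothetical helper, not the parser's code. Returns (text, pending).
    """
    pending = b""
    out = []
    for chunk in chunks:
        pending += chunk
        try:
            out.append(pending.decode("utf-8"))
            pending = b""
        except UnicodeDecodeError:
            # An invalid byte never becomes valid, so once one is buffered
            # every later chunk is stuck behind it.
            continue
    return "".join(out), pending


def decode_stream_lossy(chunks):
    """Sketch of the new behavior: replace invalid bytes with U+FFFD.

    The incremental decoder still holds back an incomplete multi-byte
    sequence at the end of a chunk until the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(chunk) for chunk in chunks]
    out.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(out)


# b"\xff" can never start a valid sequence; b"\xe2\x82" + b"\xac" is a
# euro sign split across two chunks.
chunks = [b"Hi ", b"\xff", b"there", b" \xe2\x82", b"\xac"]

strict_text, strict_pending = decode_stream_strict(chunks)
lossy_text = decode_stream_lossy(chunks)
```

With these chunks the strict variant emits only `"Hi "` and leaves everything from the bad byte onward in its buffer, while the lossy variant yields `"Hi \ufffdthere €"`: the bad byte becomes one replacement character and the split euro sign is still decoded correctly.
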
The net effect is that parsing keeps making forward progress after we
hit an invalid utf-8 sequence. This matters when inference servers are
streaming long message contents and users expect those tokens to be
streamed back as they're generated instead of being buffered for long
periods of time in our `StreamableParser`.
See vllm-project/vllm#26480 for one such
real-world scenario encountered in vLLM.
1 parent 508cbaa commit ff86d68
3 files changed: +35 -2 lines changed

| File | Original lines removed | Diff lines added | Change |
|---|---|---|---|
| 1 | 1158-1159 | 1158-1173 | 2 lines removed, 16 lines added |
| 2 | | 151, 216, 220 | 3 lines added |
| 3 | | 984-999 | 16 lines added |