The score calculation for the foundation benchmark has two errors:
- In the non-generation case, the total count is not updated.
- For predictions that do not yield one of the letters [A, B, C, D] in either predict[0] or predict[-2], the total count is not updated.
Therefore, the denominator used to calculate the % accuracy is much smaller than the number of samples evaluated. This inflates the score and makes it unrepresentative of the actual model performance.
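
To illustrate the intended behavior, here is a minimal sketch and not the benchmark's actual code. The names `VALID_CHOICES`, `extract_choice`, and `accuracy` are hypothetical, and each prediction is assumed to be a plain string. The point is only that the total is incremented for every sample, whether or not a letter can be parsed from it.

```python
# Hypothetical sketch: names and structure are assumptions, not the benchmark's code.
VALID_CHOICES = {"A", "B", "C", "D"}


def extract_choice(predict):
    """Return the letter found at predict[0] or predict[-2], or None if neither matches."""
    for idx in (0, -2):
        try:
            letter = predict[idx]
        except IndexError:
            # Prediction is too short to have this position.
            continue
        if letter in VALID_CHOICES:
            return letter
    return None


def accuracy(predictions, answers):
    """Return % accuracy with every sample counted in the denominator."""
    correct = 0
    total = 0
    for predict, answer in zip(predictions, answers):
        total += 1  # always count the sample, even if no letter can be parsed
        if extract_choice(predict) == answer:
            correct += 1
    return 100.0 * correct / total if total else 0.0


predictions = ["A", "B.", "I am not sure", "(C)"]
answers = ["A", "C", "D", "C"]
print(accuracy(predictions, answers))  # 50.0
```

In this example, two of four predictions are correct, so the score is 50%. If the unparseable prediction "I am not sure" were dropped from the total, the same data would report 2 of 3, or about 66.7%.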