[Bug]: validator field-type check reads field_types from the YAML as str, causing isinstance in the field-type check to raise an exception #796

@kongzhinvwang2

Description

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Linux

Installation Method

pip

Data-Juicer Version

latest

Python Version

3.10

Describe the bug

Configure the validator's field_types in the YAML file (taken from the official script):
`validators:  # validators are a list of validators to be applied when loading a dataset
  # it checks a sample of the dataset for each validator
  # check data_juicer/core/data/data_validator.py for more validator options
  - type: 'required_fields'  # required_fields is a validator to check the required fields in the dataset.
    required_fields:  # required_fields is a list of required fields.
      - "text"
    field_types:  # field_types is a dictionary of field types.
      text: 'str'`

In data_juicer/core/data/data_validator.py, the expected type for each field is read with expected_type = self.field_types.get(field).
Because the value comes straight from the YAML, expected_type is a string such as 'str' or 'list', not a type object.
During validation, invalid_types = [type(v) for v in sample_values if v is not None and not isinstance(v, expected_type)] uses expected_type without converting it to a type, which raises:
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
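
The failure can be reproduced without Data-Juicer. The following is a minimal standalone sketch, not the project's code: `find_invalid_types` is a hypothetical helper that only mirrors the list comprehension quoted above, and the `field_types` dict stands in for what the YAML parser yields.

```python
# Standalone illustration; it does not import Data-Juicer.
# `field_types` imitates the parsed YAML: the value is the string 'str',
# not the builtin type `str`.
field_types = {"text": "str"}
sample_values = ["hello", None, "world"]

expected_type = field_types.get("text")  # a string, not a type


def find_invalid_types(values, expected):
    """Hypothetical helper mirroring the comprehension quoted in this report."""
    return [
        type(v)
        for v in values
        if v is not None and not isinstance(v, expected)
    ]


try:
    find_invalid_types(sample_values, expected_type)
    error = None
except TypeError as exc:
    # isinstance() rejects a string as its second argument.
    error = exc

print(repr(error))

# With a real type object the same check runs and finds nothing invalid.
print(find_invalid_types(sample_values, str))
```

The second call shows that the check itself is fine once it receives a type object, so the problem is limited to the missing conversion.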

To Reproduce

Setting field_types for the validator in the YAML file is enough to reproduce the bug.

Configs

No response

Logs

TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

Screenshots

No response

Additional

Converting expected_type to an actual type before the check should be enough to fix this.
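
One possible shape for that conversion is sketched below. This is an assumption about how a fix could look, not the maintainers' implementation: `TYPE_NAME_MAP`, `resolve_expected_type`, and `find_invalid_types` are hypothetical names, and the set of supported type names is only a guess at what a config might contain.

```python
# Hypothetical mapping from type names as written in the YAML to Python types.
# The supported names are an assumption for this sketch.
TYPE_NAME_MAP = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "list": list,
    "dict": dict,
}


def resolve_expected_type(expected):
    """Return a type object for a type or a type name given as a string."""
    if isinstance(expected, type):
        return expected  # already usable by isinstance()
    if isinstance(expected, str):
        try:
            return TYPE_NAME_MAP[expected]
        except KeyError:
            raise ValueError(f"Unsupported field type name: {expected!r}") from None
    raise TypeError(f"Expected a type or a type name, got {type(expected).__name__}")


def find_invalid_types(values, expected):
    """Same check as in the report, with the conversion applied first."""
    expected_type = resolve_expected_type(expected)
    return [
        type(v)
        for v in values
        if v is not None and not isinstance(v, expected_type)
    ]


print(find_invalid_types(["hello", 42, None], "str"))
```

An explicit lookup table is used here instead of `eval` or a `builtins` lookup so that an arbitrary string from a config file cannot resolve to an unintended object, and unknown names fail with a clear message.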

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
