Commit c587172

first commit
0 parents  commit c587172

6 files changed

+259
-0
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
/.DS_Store
/.vscode
/__pycache__

README.md

Lines changed: 38 additions & 0 deletions
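@@ -0,0 +1,38 @@
# Machine-Generated Text Detector

![screenshot](screenshot.png)

## Introduction

This application uses a BERT model together with SHAP explainability analysis to help users judge whether a text is likely to be machine-generated. The user enters a text, a pre-trained BERT model scores it, and SHAP then provides an explainability analysis of the text to help interpret the model's prediction.

## Features

- **Text input**: choose one of the preset sample texts, or enter custom text to check.
- **Machine-generated probability estimate**: the app displays the probability that the text is machine-generated.
- **SHAP sentence-level explainability analysis**: for a given text, the app shows which parts contributed most to the model's decision.

## Installation

1. Clone the repository or download the code to your machine.
2. This project uses the following dependencies (also pinned in `requirements.txt`):
   ```
   matplotlib==3.8.3
   shap==0.44.1
   streamlit==1.31.1
   torch==2.2.0
   transformers==4.38.1
   ```
3. Run `streamlit run app.py` on the command line to start the app.

## Notes

- The app needs some time to load the model and analyze the text; please be patient.
- The SHAP explainability analysis requires at least 2 sentences (split on periods, question marks, and exclamation marks); texts that are too short may not be analyzable.

## Acknowledgements

- [shap](https://github.com/shap/shap)
- [streamlit](https://streamlit.io/)
- [streamlit-shap](https://github.com/snehankekre/streamlit-shap)
- [huggingface](https://huggingface.co/)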
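
The two-sentence requirement in the README notes can be approximated before SHAP is run. The following is a minimal sketch, assuming the delimiter set the notes describe (newline, period, question mark, exclamation mark, in half-width and full-width forms). `split_sentences` and `has_enough_sentences` are hypothetical helper names, not part of this repository.

```python
import re

# Assumed delimiter set: newline plus half- and full-width sentence enders.
SENTENCE_DELIMITERS = re.compile(r"[\n。.?？!！]")


def split_sentences(text: str) -> list[str]:
    """Split text on the delimiters and drop empty or whitespace-only pieces."""
    return [part.strip() for part in SENTENCE_DELIMITERS.split(text) if part.strip()]


def has_enough_sentences(text: str, minimum: int = 2) -> bool:
    """Return True when the text contains at least `minimum` sentences."""
    return len(split_sentences(text)) >= minimum
```

Because a half-width period is also a delimiter, a decimal such as `3.5` counts as a sentence boundary here, so this check is only a rough guide.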

app.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
import shap
import streamlit as st
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

from streamlit_shap import st_shap

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizerFast.from_pretrained(
    "JeremyFeng/machine-generated-text-detection"
)
model = BertForSequenceClassification.from_pretrained(
    "JeremyFeng/machine-generated-text-detection"
).to(device)


def pred(x):  # x: iterable of texts; returns P(machine-generated) per text
    predlist = []
    for text in x:
        encodings = tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
            return_token_type_ids=False,
            return_attention_mask=True,
        ).to(device)
        input_ids, attention_mask = encodings["input_ids"], encodings["attention_mask"]

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            y_score = torch.nn.functional.softmax(logits, dim=1)[:, 1].cpu().detach()  # class-1 probability
        predlist.append(y_score)

    predtensor = torch.cat(predlist)
    return predtensor.numpy()


st.title("Machine-Generated Text Detector")

default_texts = [  # Chinese sample passages, kept verbatim as detector input
"图像识别,是指利用计算机对图像进行处理、分析和理解,以识别各种不同模式的目标和对象的技术,是应用深度学习算法的一种实践应用。现阶段图像识别技术一般分为人脸识别与商品识别,人脸识别主要运用在安全检查、身份核验与移动支付中;商品识别主要运用在商品流通过程中,特别是无人货架、智能零售柜等无人零售领域。图像的传统识别流程分为四个步骤:图像采集→图像预处理→特征提取→图像识别。随着科技的不断进步,图像识别技术也越来越成熟,现阶段已经能够高效准确地处理各种复杂场景。特别是卷积神经网络(CNN)等深度学习模型的运用,使得图像识别的精度大大提升。而随着 5G、云计算和人工智能等新一代信息技术的快速发展,图像识别将有可能在更多领域得到广泛应用,如医疗诊断、自动驾驶、无人机等。而且,有了大数据的支持,我们可以通过更多的样本来训练模型,提高模型的性能。",
"我们在学习中有学习的环境,你是一个很优秀并且得到过奖学金的人,可见你对学习环境的适应力和掌控力是很强的。现在的状态需要你放下以前的心态,重新来过。从零开始,你要到哪个公司里,现在不招人,那就从侧面能多了解就多了解这个公司的状况和要求,让自己在同行业的可以进入的其他公司里磨练,时刻注视着你要去的地方,按那里的要求来要求自己的日常工作。然后再积累经验,提高自我,逐渐向你理想的公司接近。以上所述,无论你任何时候走入新的工作环境,都需要以谦虚的态度学习,以毅力和耐心去适应。但同时,也要积极向前看,进行自我提升,为未来的职业生涯铺路。通过自我磨练和不断学习,你可以获得新的技巧和知识,进一步理解你想要去的公司的工作方式和要求。在获得这些经验之后,你会发现自己的专业素质和适应能力有了显著的提升,也更接近你的职业目标了。",
"我今年也大一,处境和你很相似。表面是过得去就行,大学里面还是要保持精神上的独立,如果还未遇到志同道合的同学,建议多和导员还有各科老师沟通,他们都是过来人,会理解你的处境。不要忘记,大学也是锻炼人的社交技巧和团队合作能力的地方。参加一些兴趣社团也是好的选择,可以让你结交到来自不同专业,但有着相同兴趣的人。这样你会发现,原来和你一样迷茫的人其实并不少。这种经历,会让你更加坚定,更懂得如何处理人际关系,如何在艰难困苦中找到自己的方向。同时,也一定要注意自我调节,以保持良好的精神和身体健康。这是你走向成功的重要因素。总的来说,通过这样的方式来发现和解决问题,并随着时间的推移,你会发现自己在很大程度上都有所改变和成长,这是最宝贵的。",
"本文首先基于 Guo et al. (2023) 整理的中文人类-ChatGPT 问答对比语料集(HC3-Chinese),提取其中的人类生成文本。这些人类生成文本主要有两个来源:一是公开可用的问答数据集,这些数据集中的答案由特定领域的专家给出,或是网络用户投票选出的高质量答案;二是从维基百科和百度百科等资料中构造的“概念 - 解释”问答语句对。",
]

selected_text = st.selectbox(
    "Choose a sample text or enter the text to check",
    options=["Please select..."] + default_texts,
)

if selected_text != "Please select...":
    text_area_value = selected_text
else:
    text_area_value = ""

user_input = st.text_area(
    "Text to check",
    value=text_area_value,
    height=300,
)

if user_input == "":
    st.stop()

y_score = pred([user_input])
if y_score[0] < 0.5:
    st.success(f"Probability that this text is machine-generated: {y_score[0]*100:.2f}%", icon="🧑🏻‍💻")
else:
    st.error(f"Probability that this text is machine-generated: {y_score[0]*100:.2f}%", icon="🤖")

st.subheader("SHAP sentence-level explainability analysis")
try:
    masker = shap.maskers.Text(tokenizer=r"[\n。.?？!！]")
    explainer = shap.Explainer(pred, masker)
    shap_values = explainer([user_input], fixed_context=1)

    st_shap(shap.plots.text(shap_values, grouping_threshold=0.8), height=300)
except Exception as e:
    if "zero-size array to reduction operation maximum which has no identity" in str(e):
        st.error("❗️The text is too short for SHAP sentence-level explainability analysis")
        st.stop()
    st.exception(e)
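
To make the scoring step in `app.py` concrete: `pred` applies a softmax to the two logits and reports the class-1 probability, which the app then compares against 0.5 to choose between `st.success` and `st.error`. Below is a minimal pure-Python sketch of that arithmetic with made-up logits. `machine_probability` and `verdict` are hypothetical names used only for illustration.

```python
import math


def machine_probability(logits: tuple[float, float]) -> float:
    """Softmax over two logits; return the probability of class 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


def verdict(probability: float, threshold: float = 0.5) -> str:
    """Mirror the app's branch: st.success below the threshold, st.error otherwise."""
    return "success" if probability < threshold else "error"


# Illustrative logits only, not real model output.
p = machine_probability((0.0, 0.0))
print(f"{p * 100:.2f}% -> {verdict(p)}")  # prints "50.00% -> error"
```

Since the comparison is a strict `<`, a score of exactly 0.5 lands in the `st.error` branch.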

requirements.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
matplotlib==3.8.3
shap==0.44.1
streamlit==1.31.1
torch==2.2.0
transformers==4.38.1
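
One way to confirm that an environment matches these pins is to compare them with the versions reported by `importlib.metadata`. This is a sketch that assumes plain `name==version` lines only. `parse_pins` and `find_mismatches` are hypothetical helpers, not part of this repository.

```python
from importlib import metadata
from typing import Optional


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Parse 'name==version' lines, skipping blanks, comments, and other forms."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, Optional[str]]:
    """Map each mismatched package to its installed version (None if not installed)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` returns an empty dict when every pinned package is installed at the pinned version.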

screenshot.png

1.07 MB

streamlit_shap.py

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
import base64

# Shap plots internally call plt.show()
# On Linux, prevent plt.show() from emitting a non-GUI backend warning.
import os
from io import BytesIO

import matplotlib.pyplot as plt
import shap
import streamlit.components.v1 as components
from matplotlib.figure import Figure

os.environ.pop("DISPLAY", None)
# Text plots return an IPython.core.display.HTML object
# Set display=False to return HTML string instead
shap.plots.text.__defaults__ = (0, 0.01, "", None, None, None, False)
# Prevent clipping of the ticks and axis labels
plt.rcParams["figure.autolayout"] = True

# Note: Colorbar changes (introduced bugs) in matplotlib>3.4.3
# cause the colorbar of certain shap plots (e.g. beeswarm) to not display properly
# See: https://github.com/matplotlib/matplotlib/issues/22625 and
# https://github.com/matplotlib/matplotlib/issues/22087
# If colorbars are not displayed properly, try downgrading matplotlib to 3.4.3


def st_shap(plot, height=None, width=None):
    """Takes a SHAP plot as input, and returns a streamlit.delta_generator.DeltaGenerator as output.

    It is recommended to set the height and width
    parameter to have the plot fit to the window.

    Parameters
    ----------
    plot : None or matplotlib.figure.Figure or SHAP plot object
        The SHAP plot object.
    height: int or None
        The height of the plot in pixels.
    width: int or None
        The width of the plot in pixels.

    Returns
    -------
    streamlit.delta_generator.DeltaGenerator
        A SHAP plot as a streamlit.delta_generator.DeltaGenerator object.
    """

    # Plots such as waterfall and bar have no return value
    # They create a new figure and call plt.show()
    if plot is None:
        # Test whether there is currently a Figure on the pyplot figure stack
        # A Figure exists if the shap plot called plt.show()
        if plt.get_fignums():
            fig = plt.gcf()
            ax = plt.gca()

            # Save it to a temporary buffer
            buf = BytesIO()

            if height is None:
                _, height = fig.get_size_inches() * fig.dpi

            if width is None:
                width, _ = fig.get_size_inches() * fig.dpi

            fig.set_size_inches(width / fig.dpi, height / fig.dpi, forward=True)
            fig.savefig(buf, format="png")

            # Embed the result in the HTML output
            data = base64.b64encode(buf.getbuffer()).decode("ascii")
            html_str = f"<img src='data:image/png;base64,{data}'/>"

            # Enable pyplot to properly clean up the memory
            plt.cla()
            plt.close(fig)

            fig = components.html(html_str, height=height, width=width)
        else:
            fig = components.html(
                "<p>[Error] No plot to display. Received object of type &lt;class 'NoneType'&gt;.</p>"
            )

    # SHAP plots return a matplotlib.figure.Figure object when passed show=False as an argument
    elif isinstance(plot, Figure):
        fig = plot

        # Save it to a temporary buffer
        buf = BytesIO()

        if height is None:
            _, height = fig.get_size_inches() * fig.dpi

        if width is None:
            width, _ = fig.get_size_inches() * fig.dpi

        fig.set_size_inches(width / fig.dpi, height / fig.dpi, forward=True)
        fig.savefig(buf, format="png")

        # Embed the result in the HTML output
        data = base64.b64encode(buf.getbuffer()).decode("ascii")
        html_str = f"<img src='data:image/png;base64,{data}'/>"

        # Enable pyplot to properly clean up the memory
        plt.cla()
        plt.close(fig)

        fig = components.html(html_str, height=height, width=width)

    # SHAP plots containing JS/HTML have one or more of the following callable attributes
    elif hasattr(plot, "html") or hasattr(plot, "data") or hasattr(plot, "matplotlib"):
        shap_js = f"{shap.getjs()}".replace("height=350", f"height={height}").replace(
            "width=100", f"width={width}"
        )
        shap_html = f"<head>{shap_js}</head><body>{plot.html()}</body>"
        fig = components.html(shap_html, height=height, width=width)

    # shap.plots.text plots have been overridden to return a string
    elif isinstance(plot, str):
        fig = components.html(plot, height=height, width=width, scrolling=True)

    else:
        fig = components.html(
            "<p>[Error] No plot to display. Unable to understand input.</p>"
        )

    return fig
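
Both matplotlib branches of `st_shap` embed the saved PNG in an `<img>` tag as a base64 data URI. That encoding step can be sketched on its own with the standard library. `png_bytes_to_img_tag` is a hypothetical helper, and the bytes used below are only the 8-byte PNG signature, not a complete image.

```python
import base64


def png_bytes_to_img_tag(png_bytes: bytes) -> str:
    """Base64-encode raw PNG bytes and wrap them in an <img> data-URI tag."""
    data = base64.b64encode(png_bytes).decode("ascii")
    return f"<img src='data:image/png;base64,{data}'/>"


# Placeholder payload: the PNG file signature, used here instead of a real figure.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
tag = png_bytes_to_img_tag(PNG_SIGNATURE)
print(tag)
```

In `st_shap` the bytes come from `fig.savefig(buf, format="png")`, and the resulting string is passed to `components.html`.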
