import json

# Prompt text (in Chinese) for medical knowledge-graph construction: extract
# deduplicated medical entities and relation triples, returned as pure JSON.
UNIFIED_PROMPT_TEMPLATE = (
    "你是一个医疗知识图谱构建专家。请从以下文本中:\n"
    "1. 提取所有医学实体(去重),仅返回名称列表;\n"
    "2. 在这些实体之间抽取高质量、术语化的语义关系三元组。\n\n"
    "### 输出规则\n"
    "- 实体类型无需标注,只输出名称字符串(如 \"慢性淋巴细胞白血病\")。\n"
    "- 关系谓词必须是专业术语(2~6字),如:临床表现、诊断、相关疾病、禁忌症、治疗药物等。\n"
    "- e1 和 e2 必须来自提取出的实体列表,且 e1 ≠ e2。\n"
    "- 输出必须是纯 JSON,仅包含两个字段:\"entities\"(字符串列表)和 \"relations\"(对象列表,每个含 e1/r/e2)。\n"
    "- 不要任何额外文本、解释或 Markdown。\n\n"
    "文本:{input}\n\n输出:"
)

with open("test_data.jsonl", "r", encoding="utf-8") as fin, \
        open("sft_messages_format.jsonl", "w", encoding="utf-8") as fout:

    for line in fin:
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
            input_text = item["input"]
            output_obj = item["output"]

            # Substituting the {input} placeholder in the system prompt is optional; it can also be left as-is.
            # As requested: the system message keeps the template unchanged, and only the user message carries the real input.
            system_content = UNIFIED_PROMPT_TEMPLATE  # {input} is not substituted

            user_content = input_text
            # The assistant content must be a JSON string (with escaping).
            assistant_content = json.dumps(output_obj, ensure_ascii=False)

            messages = [
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": assistant_content}
            ]

            fout.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")

        except Exception as e:
            print(f"Error while processing line: {e}")
            continue

print("✅ Conversion complete! File saved as sft_messages_format.jsonl")