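import json

# Unified prompt for medical knowledge-graph extraction. The text is kept in
# Chinese because it is written verbatim into the training data. It asks the
# model to (1) extract deduplicated medical entity names and (2) extract
# relation triples between them, returning pure JSON with the two fields
# "entities" and "relations" (each relation has e1/r/e2).
UNIFIED_PROMPT_TEMPLATE = (
    "你是一个医疗知识图谱构建专家。请从以下文本中:\n"
    "1. 提取所有医学实体(去重),仅返回名称列表;\n"
    "2. 在这些实体之间抽取高质量、术语化的语义关系三元组。\n\n"
    "### 输出规则\n"
    "- 实体类型无需标注,只输出名称字符串(如 \"慢性淋巴细胞白血病\")。\n"
    "- 关系谓词必须是专业术语(2~6字),如:临床表现、诊断、相关疾病、禁忌症、治疗药物等。\n"
    "- e1 和 e2 必须来自提取出的实体列表,且 e1 ≠ e2。\n"
    "- 输出必须是纯 JSON,仅包含两个字段:\"entities\"(字符串列表)和 \"relations\"(对象列表,每个含 e1/r/e2)。\n"
    "- 不要任何额外文本、解释或 Markdown。\n\n"
    "文本:{input}\n\n输出:"
)
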
with open("test_data.jsonl", "r", encoding="utf-8") as fin, \
        open("sft_messages_format.jsonl", "w", encoding="utf-8") as fout:
    for line_no, line in enumerate(fin, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            item = json.loads(line)
            input_text = item["input"]
            output_obj = item["output"]

            # Replacing the {input} placeholder in the system prompt is optional.
            # As requested, the system message keeps the template unchanged and
            # the real input goes only into the user message.
            system_content = UNIFIED_PROMPT_TEMPLATE  # {input} is not substituted
            user_content = input_text

            # The assistant content must be a JSON string; it is escaped again
            # when the whole record is serialized below.
            assistant_content = json.dumps(output_obj, ensure_ascii=False)

            messages = [
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": assistant_content},
            ]
            fout.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
        except Exception as e:
            # Report the failing line and keep converting the rest.
            print(f"Error processing line {line_no}: {e}")
            continue

print("✅ Conversion complete. Output saved to sft_messages_format.jsonl")
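

# --- Optional sanity check (illustrative sketch, not part of the original script) ---
# The conversion above writes one {"messages": [...]} record per line. A quick
# way to confirm the file has the intended shape is to re-read it and count the
# records that look right. The helper name `count_valid_records` is hypothetical,
# and the checks encode assumptions taken from the prompt: roles appear in the
# order system/user/assistant, and the assistant content parses to a JSON object
# with "entities" and "relations" keys.
import json


def count_valid_records(path):
    """Return (valid, total) counts over the non-empty lines of a JSONL file."""
    valid = total = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += 1
            try:
                messages = json.loads(line)["messages"]
                roles = [m["role"] for m in messages]
                # The assistant content is itself a JSON string, so parse it again.
                answer = json.loads(messages[-1]["content"])
            except (json.JSONDecodeError, KeyError, TypeError, IndexError):
                continue  # malformed record: counted in total, not in valid
            if (roles == ["system", "user", "assistant"]
                    and isinstance(answer, dict)
                    and "entities" in answer
                    and "relations" in answer):
                valid += 1
    return valid, total


# Example usage, assuming the conversion above has already run:
# valid, total = count_valid_records("sft_messages_format.jsonl")
# print(f"{valid}/{total} records passed the structural check")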