OpenAI Assistants with citations like【4:2†source】and citeturnXfileY

When streaming a response from the OpenAI Assistants API with file search enabled, using code like this:

openai.beta.threads.messages.create(
    thread_id=thread_id,
    role="user",
    content=payload.question
)

run = openai.beta.threads.runs.create(
    thread_id=thread_id,
    assistant_id=assistant_id,
    stream=True,
    tool_choice={"type": "file_search"},
)

streamed_text = ""
for event in run:
  if event.event == "thread.message.delta":
    delta_content = event.data.delta.content
    if delta_content and delta_content[0].type == "text":
      text_fragment = delta_content[0].text.value
      streamed_text += text_fragment
      yield {"data": text_fragment}
  if event.event == "thread.run.completed":
    break

The citations arrive in the streamed text as raw placeholders such as 【4:2†source】 or citeturnXfileY.

How to fix it?

Answer

The approach I used was to fetch the final message once streaming has finished:

messages = openai.beta.threads.messages.list(thread_id=thread_id)
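
The call above returns a list of messages rather than the reply text, so `raw_text` still has to be pulled out of it. Below is a minimal sketch of that step. It assumes the list is ordered newest first (the default ordering, as far as I know) and that the reply is made up of text content blocks. `latest_assistant_text` is a hypothetical helper name, not part of the SDK.

```python
def latest_assistant_text(messages):
    """Return the text of the most recent assistant message, or "" if none."""
    for message in messages.data:  # assumption: newest message comes first
        if message.role != "assistant":
            continue
        # A message can hold several content blocks; keep only the text ones.
        return "".join(
            block.text.value for block in message.content if block.type == "text"
        )
    return ""
```

With this in place, `raw_text = latest_assistant_text(messages)` gives the string that the regex is applied to.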

Then apply the following regex substitution to the text of that message (raw_text):

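import re

pattern = r"(citeturn\d+file\d+|【\d+:\d+†source】)"

def clean_citations(raw_text):
    # Number the placeholders in order of appearance: [1], [2], ...
    citation_index = 0

    def replace_placeholder(match):
        nonlocal citation_index  # valid here: bound in the enclosing function
        citation_index += 1
        return f"[{citation_index}]"

    return re.sub(pattern, replace_placeholder, raw_text)

assistant_reply_cleaned = clean_citations(raw_text)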

This replaces the placeholders (like 【4:2†source】 or citeturnXfileY) with [1], [2], etc.
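
One side effect of sequential numbering is that a source cited twice gets two different numbers. If that matters, a possible variant is sketched below. It assumes that identical placeholder text refers to the same source, which I have not verified against the API. `number_citations` and `PLACEHOLDER` are hypothetical names introduced for this example.

```python
import re

PLACEHOLDER = re.compile(r"citeturn\d+file\d+|【\d+:\d+†source】")

def number_citations(raw_text):
    """Replace placeholders with [n], reusing n for repeated placeholders."""
    numbers = {}  # placeholder text -> number assigned on first appearance

    def replace(match):
        key = match.group(0)
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        return f"[{numbers[key]}]"

    cleaned = PLACEHOLDER.sub(replace, raw_text)
    return cleaned, numbers

sample = "A【4:2†source】 B citeturn0file1 C【4:2†source】"
cleaned, numbers = number_citations(sample)
print(cleaned)  # A[1] B [2] C[1]
```

The returned `numbers` mapping can be kept around to build a reference list under the answer.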
