openAi broken JSON, the rescue #139

Open
TiagoGouvea opened this issue Jan 27, 2025 · 5 comments
Comments

@TiagoGouvea

Your tool is amazing and helps a lot.

When dealing with LLMs like OpenAI, Gemini, and DeepSeek, they very often return wildly invalid JSON, where a single jsonrepair call is not enough.

For example, here is a response (the JSON is inside message.content) with two JSON objects together:

 {
      index: 0,
      message: {
        role: 'assistant',
        content: '{"response":"Agora, vou acessar cada uma das notícias que encontrei e extrair informações detalhadas para compor a lista final sobre as novidades na Inteligência Artificial.","goto":"leitor"}\n' +
          '{"response":"Acessando e extraindo informações das notícias sobre Inteligência Artificial encontradas anteriormente.","goto":"leitor"}',
        refusal: null
      },
      logprobs: null,
      finish_reason: 'stop'
    }

Sometimes it returns the JSON wrapped in a markdown ```json block.
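
As a rough illustration of the fenced-reply case, here is a minimal sketch of unwrapping the first fenced block from a model reply before parsing. The name `unwrapMarkdownFence` is hypothetical and not part of jsonrepair, and the sketch assumes the reply contains at most one fenced block whose opening marker is followed by a newline.

```typescript
// Hypothetical helper, not part of jsonrepair.
// The fence marker is assembled from single backticks so that this snippet
// can itself be pasted inside a fenced block.
const FENCE = '`' + '`' + '`';

// Opening fence, optional language tag, newline, lazy body, closing fence.
const FENCED_BLOCK = new RegExp(
  FENCE + '[a-zA-Z]*[ \\t]*\\r?\\n([\\s\\S]*?)' + FENCE
);

function unwrapMarkdownFence(reply: string): string {
  const match = reply.match(FENCED_BLOCK);
  // Without a fence, return the trimmed reply unchanged.
  return (match ? match[1] : reply).trim();
}
```

The unwrapped text can then be handed to whatever repair or parse step follows; text outside the fence is discarded, which may or may not be what a given application wants.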

I would like to share two methods I wrote that help me handle a bit more of that broken JSON; maybe they could give you ideas to improve your lib.

import { jsonrepair } from 'jsonrepair';

export function extractJsonArrayFromString(inputString: string) {
  let repaired, match;
  try {
    const cleanedInputString = inputString
      .replace(/```json\s*|\s*```/g, '')
      .replace(/```\s*|\s*```/g, '')
      .trim();

    repaired = jsonrepair(cleanedInputString);

    // Match the outermost {...} or [...] span.
    const regex = /({[\s\S]*}|\[[\s\S]*\])/;
    match = repaired.match(regex);

    if (match) {
      const jsonArray = JSON.parse(match[0]);
      return jsonArray;
    } else {
      throw new Error('No JSON found.');
    }
  } catch (error: any) {
   // ...
  }
}


export function extractJsonObjectFromString(inputString: string) {
  let cleanedInputString, match, repaired;
  try {
    cleanedInputString = extractOnlyJson(inputString); 

    const regex = /{[\s\S]*}/;
    match = cleanedInputString.match(regex);

    if (match) {
      repaired = jsonrepair(match[0]);
      return JSON.parse(repaired);
    } else {
      throw new Error('No JSON found.');
    }
  } catch (error: any) {
    // ...
  }
}

const extractOnlyJson = (str: string) => {
  const start = str.indexOf('{');
  const end = str.lastIndexOf('}');
  // Both braces must exist, and the closing one must come after the opening one.
  if (start !== -1 && end > start) {
    return str.slice(start, end + 1); // Include the closing brace
  } else {
    return 'Invalid input: no braces found.';
  }
};
@TiagoGouvea
Author

TiagoGouvea commented Jan 30, 2025

Another example, received from gpt-4o-mini:

{
"assistant": "Seu time ficaria assim:\n\n1. Chikorita (Grama) \n - Movimentos: Giga Drain, Reflect, Leech Seed, Solar Beam \n\n2. Houndoom (Fogo/Sombrio) \n - Movimentos: Flamethrower, Crunch, Solar Beam, Will-O-Wisp \n\n3. Espeon (Psíquico) \n - Movimentos: Psychic, Shadow Ball, Morning Sun \n\n4. Lanturn (Água/Elétrico) \n - Movimentos: Surf, Thunderbolt, Ice Beam, Heal Bell \n\n5. Steelix (Aço/Terra) \n - Movimentos: Earthquake, Iron Tail, Rock Slide, Crunch \n\n6. Noctowl (Normal/Voador) \n - Movimentos: Hypnosis,

  • Missing closing brace
  • Missing closing " at the end of the string
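
To make the two missing pieces concrete, here is a minimal sketch of closing a truncated document by tracking open strings and brackets. The function `closeTruncatedJson` is hypothetical and is not how jsonrepair works internally; it assumes the text is valid JSON up to the cut-off point and that the cut does not land right after a key or a colon.

```typescript
// Hypothetical helper, not part of jsonrepair: append whatever closing
// quote and brackets a truncated JSON text is still missing.
function closeTruncatedJson(text: string): string {
  const closers: string[] = []; // stack of closing brackets still owed
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{') closers.push('}');
    else if (ch === '[') closers.push(']');
    else if (ch === '}' || ch === ']') closers.pop();
  }

  let result = text;
  if (inString) {
    // Drop a dangling backslash, then close the string.
    if (escaped) result = result.slice(0, -1);
    result += '"';
  } else {
    // Outside a string, a trailing comma would make the result invalid.
    result = result.replace(/,\s*$/, '');
  }
  // Close the innermost structure first.
  return result + closers.reverse().join('');
}
```

Note the trade-off this makes: a comma at the end of an unfinished string is kept as string content, which is a guess about what the model intended.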

@josdejong
Owner

Thanks for your input, Tiago.

Some thoughts:

  1. Removing markdown code blocks is being worked on, see feat: add support for skipping markdown JSON wrappers in jsonrepair f… #138
  2. In your first example, I can't think of a way to automatically detect that a string containing two non-escaped JSON objects should be changed into an array with the two JSON objects. I do not think that jsonrepair should change strings into objects or arrays on its own, since that changes the contents of the JSON. So I guess this will require manual fixing. Note that in https://jsoneditoronline.org/ you can paste the invalid JSON and it will repair it. And in case of a JSON object/array inside a string, you can convert it to an object/array by right-clicking the string and selecting "Convert to object" or "Convert to array". Note that this will not work in case of two JSON objects in a single string, since that is not valid JSON on its own.
  3. About your "Another example": this is an object containing a long string which contains colons and commas, and newlines. In that case, jsonrepair is a bit conservative in interpreting the trailing comma as part of the string or as part of the root object, since repairing wrongly can mess up the JSON document badly (like trying to interpret the string contents as a series of key/values that should be part of the root object, where the keys and values are missing quotes). We may be able to improve on this but I'm not sure.
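
For the case in point 2, where a single string holds two JSON objects back to back, one option is to split the string in application code after extracting message.content. The following is a minimal sketch under that assumption; `splitConcatenatedJson` is a hypothetical name, not a jsonrepair function, and it only recognizes top-level objects and arrays that are individually valid JSON.

```typescript
// Hypothetical helper, not part of jsonrepair: split a string that holds
// several top-level JSON objects/arrays and parse each one separately.
function splitConcatenatedJson(text: string): unknown[] {
  const values: unknown[] = [];
  let depth = 0;    // current bracket nesting level
  let start = -1;   // index where the current top-level value began
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{' || ch === '[') {
      if (depth === 0) start = i;
      depth++;
    } else if ((ch === '}' || ch === ']') && depth > 0) {
      depth--;
      if (depth === 0) {
        // JSON.parse throws if this chunk is not valid JSON on its own.
        values.push(JSON.parse(text.slice(start, i + 1)));
      }
    }
  }
  return values;
}
```

Brackets inside string values are ignored because the scanner tracks quoting, so a response text containing `{` or `]` does not confuse the split.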

@TiagoGouvea
Author

Got it @josdejong!

Thank you for your feedback, it totally makes sense.

Maybe there could be a parameter to define a "conservative" level, so that in cases where it's better to have something than nothing, we could set a low conservative level.

Something the user would use "at their own risk".

It's just an idea.

@josdejong
Owner

Yes, that is an interesting idea, a "conservative" option. In the long run, I would love to let jsonrepair return a list with applied fixes, and we could attach a flag "safe" or "unsafe" to every fix. See #79.
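
To picture what a list of applied fixes with a "safe"/"unsafe" flag might look like to a caller, here is a purely illustrative sketch. None of these types or functions exist in jsonrepair as far as this thread indicates; `AppliedFix`, `RepairReport`, and `acceptRepair` are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes only: jsonrepair does not expose this API in the thread.
type Safety = 'safe' | 'unsafe';

interface AppliedFix {
  description: string; // e.g. "added missing closing brace"
  safety: Safety;
}

interface RepairReport {
  output: string;       // the repaired JSON text
  fixes: AppliedFix[];  // every fix that was applied to produce it
}

// Accept the repaired output only if the caller tolerates the fixes made.
function acceptRepair(report: RepairReport, allowUnsafe: boolean): string {
  const unsafe = report.fixes.filter((fix) => fix.safety === 'unsafe');
  if (unsafe.length > 0 && !allowUnsafe) {
    const reasons = unsafe.map((fix) => fix.description).join('; ');
    throw new Error('Unsafe fixes were required: ' + reasons);
  }
  return report.output;
}
```

With a shape like this, the "at your own risk" mode from the previous comment would simply be `allowUnsafe: true`, while the default stays conservative.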

@TiagoGouvea
Author

I will keep posting some cases here, so if a conservative option exists in the future, you will have some real test cases.

Another broken JSON from OpenAI. In this case it visibly failed to close the objects in the array (the same attributes appear 3 times), and didn't finish the string either.

"insights": [
{
"id": 1,
"description": "Falta de clareza nas informações sobre as vagas de estágio limitadas a pessoas com mais de 30 anos pode desmotivar candidatos mais jovens.",
"action": "Revisar a comunicação sobre as oportunidades de estágio e torná-las mais inclusivas.",
"id": 2,
"description": "Candidatos demonstram interesse em saber mais sobre a empresa antes da entrevista, sugerindo que oferecemos informações detalhadas sobre a App Masters e a Agento.",
"action": "Criar um guia ou FAQ direcionado para candidatos, que inclua informações sobre os serviços e cultura da empresa.",
"id": 3,
"description": "Usuários que esquecem de verificar emails para entrevistas"

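One heuristic suggested by the repeated attributes in the sample above is to start a new object whenever a key shows up a second time. The sketch below assumes the key/value pairs have already been extracted in order by some tolerant tokenizer that keeps duplicates (plain JSON.parse would keep only the last value per key); `splitOnRepeatedKey` is a hypothetical name, and the rule is a guess that would misfire on objects that legitimately omit a key.

```typescript
// Hypothetical heuristic, not part of jsonrepair: regroup a flat, ordered
// list of key/value pairs into objects, cutting whenever a key repeats.
type Entry = [string, unknown];

function splitOnRepeatedKey(entries: Entry[]): Array<Record<string, unknown>> {
  const objects: Array<Record<string, unknown>> = [];
  let current: Record<string, unknown> = {};

  for (const [key, value] of entries) {
    // A key seen twice means the previous object was never closed.
    if (Object.prototype.hasOwnProperty.call(current, key)) {
      objects.push(current);
      current = {};
    }
    current[key] = value;
  }
  if (Object.keys(current).length > 0) objects.push(current);
  return objects;
}
```

Applied to the sample, the three `id` keys would yield three objects, the last of which has no `action` because the output was cut off.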