
We asked 5 LLMs to build us a winning March Madness bracket — here’s what we learned


March Madness season is here, and in the spirit of competition, we challenged five leading large language models (LLMs) to generate a winning NCAA bracket. We set out to see which LLM was best at predicting the real results, but we ended up learning far more about LLM usability, safety settings, and creativity.

More language models should be trained on style guides

Response style and formatting varied widely across the models we tested. Some outputs were easy to follow, while others were riddled with inconsistent formatting, unclear explanations, or overwhelming amounts of data.

This variability highlights a key point for AI developers and businesses: the accuracy of a response is irrelevant if the content is too difficult to consume. Models should be trained to follow consistent style guides to ensure outputs are user-friendly, especially in real-world applications where clarity can make or break usability.

This is especially true for AI being implemented for specific business use cases. The more work an end user has to do to make AI-generated content readable, the less efficient it becomes to use AI for that purpose.
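One lightweight way to approach this consistency problem, short of retraining a model on a style guide, is to attach a house style guide to every request as a system instruction. The sketch below is illustrative only: it assumes the OpenAI Python SDK, and the style rules, model name, and prompt are our own examples rather than part of the original experiment.

```python
# A minimal sketch, assuming the OpenAI Python SDK (pip install openai)
# and an OPENAI_API_KEY set in the environment. The style guide, model
# name, and prompt below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# A compact "style guide" applied as a system instruction so every
# response follows the same structure and tone.
STYLE_GUIDE = """
Follow this style guide in every answer:
- Lead with a one-sentence summary.
- Use short bullet points; no tables unless asked.
- Give the reasoning behind each pick in a single clause.
- Keep the full answer under 250 words.
"""

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": STYLE_GUIDE},
        {"role": "user", "content": "Predict the Final Four for this year's NCAA tournament."},
    ],
)

print(response.choices[0].message.content)
```

A system-level style guide like this will not fix every formatting quirk, but it is a cheap way to make outputs from different prompts (or different models) easier to compare and consume.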

AI safety guardrails can vary

Not all models were willing to play ball. Mistral and Gemini both refused to generate a bracket, and GPT-4o only generated predictions for the Final Four rather than a full bracket. These refusals were most likely because the task was flagged as too close to gambling.

While that might seem frustrating, it’s actually a sign of progress in LLM development. These models are increasingly being trained to avoid generating content that could be unethical or harmful — a key consideration in use cases where user safety and regulatory compliance are paramount.

Gemini and Mistral get extra points for creativity

Gemini and Mistral both declined to create a bracket as prompted, but instead of giving us a flat "no," they went in a slightly different direction: each responded with a detailed guide on how to research and build your own bracket, complete with sources to aid the user.

When building or fine-tuning a model for your business, it's important to consider what alternative responses would be helpful when the original prompt can't be answered directly. What would you prefer to see in a response if you were the end user? What would your risk management team prefer to see? Taking these perspectives into account can help you optimize your model's design.

Gemini took an additional step and created an entirely fictional bracket with made-up schools, details about various teams' strengths, and strategies. While this may not be useful for your office pool, it did showcase the model's capacity for imaginative generation, a capability that can be useful for requests that involve storytelling or simulation.

Level up your AI game with Invisible

Is your AI initiative stuck in the pilot phase? Invisible is your AI partner, helping you research and craft the right voice for your model based on your business use case, fine-tune your model, and deploy an AI application that delivers ROI.
