LMArena CTO Discusses AI Models and Google’s Nano Banana

An AI war is raging as tech companies race to build models — and sometimes, the best way to determine which model is the best is to have them battle it out.

A site called LMArena allows users to do just that. In 2023, a group of researchers from the University of California, Berkeley, started Chatbot Arena, now called LMArena. It allows people to compare different AI models with prompts and determine which is better. Users can vote for how well models perform and compare them on a leaderboard.

LMArena saw a tenfold traffic spike in August when a mysterious new AI text-to-image and image-editing model, Nano Banana, went viral for churning out impressive images and photo edits. Based on user votes, Nano Banana ranked #1 on LMArena’s image generation leaderboard. As many users guessed, Google was behind Nano Banana, which turned out to be its Gemini 2.5 Flash Image model.

Now, LMArena has over 3 million monthly users, says Wei-Lin Chiang, its CTO. Chiang cofounded LMArena along with Berkeley researchers Anastasios Angelopoulos, the CEO, and Ion Stoica, also a cofounder of $62 billion Databricks and $1 billion Anyscale.

“We’re continuing to build a platform that’s open and accessible to anyone,” Chiang said. “We want people to test these models and express their opinions and preferences to help the community — including providers — evaluate AI grounded in real-world use cases.”

Business Insider caught up with Chiang on how LMArena started, the top AI models people are using, and his best guess on what Meta is building at its new Superintelligence Labs.

The interview has been edited for clarity and concision.

Why did you start LMArena?

LMArena started as a research project at UC Berkeley. ChatGPT had come out shortly before, and Meta had just released Llama 1. People were trying to figure out which model was the best.

We wondered what the difference was between all these models. Traditional benchmarks didn’t tell us much, so we launched this project.

Initially, we called it Chatbot Arena. We wanted to build a community-focused evaluation to invite everyone to come and participate. It got quite a bit of attention.

In the first few weeks, tens of thousands of people voted, meaning they asked a question and indicated which model answered it better. We used those votes to compile our first leaderboard, which mostly featured open-source models. At that time, the only proprietary chatbots were Claude and GPT. Over time, we added more models and got even more attention.
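Chiang doesn’t spell out the math here, but as a rough, hypothetical illustration of how pairwise “which answer was better” votes can be turned into a ranking, the sketch below applies a simple Elo-style update in Python. LMArena publishes its own statistical methodology, which differs from this; the model names and votes are made up.

```python
# Illustrative sketch only: turning pairwise votes into a ranking with a
# simple Elo-style update. This is NOT LMArena's actual pipeline; the
# model names and votes below are hypothetical.
from collections import defaultdict

K = 32                                   # update step size
ratings = defaultdict(lambda: 1000.0)    # every model starts at 1000

def record_vote(winner: str, loser: str) -> None:
    """Update ratings after a user prefers `winner`'s answer over `loser`'s."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Hypothetical votes: (preferred model, other model)
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
for winner, loser in votes:
    record_vote(winner, loser)

# Sort by rating to produce the leaderboard
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"#{rank} {model}: {score:.1f}")
```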

What are the top models on your platform, and which are growing fastest?

It depends on the use cases. People come here and can ask any question. Some ask coding questions, and some ask open-ended questions, like creative writing prompts.

Claude is ranked the best in coding. In terms of creativity, I think Gemini is also at the top.

Beyond text, we also have different modalities. For example, on the vision leaderboard, people upload an image and ask questions about it. Gemini is doing very well there, and so is the GPT series. Text-to-image and image editing is the leaderboard where we tested the latest Nano Banana models.

Following the lackluster response to Llama 4 this year, how are developers using Llama? Are there any updates you expect from Llama?

We haven’t heard from them much lately, likely because they are internally figuring out how they’ll structure the new lab and team. We’ve been chatting with their Reality Labs team to work on potentially benchmarking multimodal models and products. We are looking forward to partnering with them to evaluate text and coding models.

Meta’s superintelligence team is building an “omni model.” Do you have guesses on what it might be?

A model consolidating modalities into one. That’s one of the trends we’re observing in the industry.

What do Google, Meta, and other Big Tech companies get out of putting their models on LMArena? Is it just building exposure, or do they get feedback to improve their models?

The main goal here is to build an open space where anyone can come and participate in evaluating all kinds of models. It’s community-driven and reflects how people think about all these different models by encouraging them to ask questions and vote for their preference.

When OpenAI, Google, or Meta come here to test their models, they are giving us a few variants of the model.

Basically, the same public leaderboard you’re seeing will tell them that their model ranks #5 overall, #10 in coding, #4 in creative writing, and so forth. We give them a detailed report and analysis of how their model is doing based on community-driven feedback. We are also open-sourcing some of the data we collect, as well as the code and pipeline.

When all these models are benchmarking so close to one another, do we need new benchmarks?

Building more benchmarks would definitely benefit us. One core thing we want to ensure is that these benchmarks are grounded in real-world use cases.

If AI can save a doctor or a lawyer two hours a day, that will be a huge value add to society.

We want to ensure that we go beyond traditional benchmarks to benchmarks driven by real users, and especially by professionals who use AI tools to get these jobs done.

Recently, we launched a benchmark called WebDev, where you can prompt a model to build a website. These are tools that can help people in tech build prototypes and get something done fast.

What do you think of that MIT report that said most companies that invested in AI aren’t seeing a return on their investments?

It’s an interesting study for sure. That’s why grounding AI in real-world use cases is particularly important.

That’s exactly why we want to build this and expand it to more industries. We started from the tech community. We believe in the tech, and people are getting a lot of value from AI. With Cursor and the Copilots of the world, people are obviously paying for these tools and leveraging them to build better and faster.

We would love to see this applied broadly to more industries. With the data we’re collecting, we want to help bridge that gap and help measure that.

Are there particular fields of query, like law, medicine, or education, where LLMs especially struggle to perform or answer appropriately?

We want to understand what percentage of queries come from these industries, such as legal and finance. We would definitely love to share when we get more insights and results.

The goal is to use the data we have to understand model limitations, be transparent about how we conduct the study, and release the data for the community to build upon.




