
VideoAutoArena
An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

¹Rhymes AI, ²Hong Kong Baptist University, ³National University of Singapore

Leaderboards


🚀 Join the VideoAutoArena and VideoAutoBench! Share your scores and compete for the top spot! 🏆


Please remember to include the number of frames with each submission.

Email us at or .


We present the ELO ratings and win rates across 4 video length categories for 11 SOTA LMMs in video analysis. Our evaluation involves a total of 12,479 head-to-head battles.
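
To make the rating arithmetic concrete, here is a minimal sketch of how ratings and win rates could be derived from a list of head-to-head battles. It uses the textbook Elo update, not the modified ELO Rating System the project describes, whose details are not given on this page. The K-factor of 32, the initial rating of 1000, the tie value of 0.5, and the function names are all assumptions made for illustration. The two sample battles mirror the outcomes of the two examples shown on this page.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Textbook Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_battles(battles, k=32.0, initial=1000.0):
    """Process battles in order and return (ratings, win_rates).

    Each battle is (model_a, model_b, outcome), where outcome is
    1.0 if A wins, 0.0 if B wins, and 0.5 for a tie (assumed convention).
    k and initial are illustrative defaults, not the project's settings.
    """
    ratings = defaultdict(lambda: initial)
    wins = defaultdict(float)
    games = defaultdict(int)

    for model_a, model_b, outcome in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] += k * (outcome - e_a)
        ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

        wins[model_a] += outcome
        wins[model_b] += 1.0 - outcome
        games[model_a] += 1
        games[model_b] += 1

    win_rates = {model: wins[model] / games[model] for model in games}
    return dict(ratings), win_rates


# Outcomes taken from the two examples on this page (Aria wins both).
battles = [
    ("Aria", "LLaVA-Video-72B", 1.0),  # Model A wins
    ("Qwen2VL-72B", "Aria", 0.0),      # Model B wins
]
ratings, win_rates = run_battles(battles)
```

Because this sequential update depends on the order of the battles, a real leaderboard would typically shuffle or bootstrap the battle list, or fit all battles jointly; which of these the project uses is not stated here.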

VideoAutoArena Examples


User Simulation (Role Play): A person with a background in traditional Japanese arts and crafts, particularly skilled in amezaiku or Japanese candy artistry, who is passionate about preserving and sharing cultural heritage through interactive demonstrations and workshops.

User Simulation (Question Asking): As someone passionate about preserving and sharing the art of amezaiku, I'm interested in the tools and techniques highlighted in the video. Could you explain what tools are mentioned for making Japanese candy art and how they contribute to the candy's delicate nature? Please format the response as a list.

Model A's Response (Aria):
Certainly! Here are the tools mentioned in the video for making Japanese candy art, along with their contributions to the candy's delicate nature:

- **Japanese Scissors**:
  - **Contribution**: Used for precise cutting and shaping of the candy.

- **Crock-Pot**:
  - **Contribution**: Helps in melting and adjusting the consistency of the candy, ensuring it maintains its delicate form.

- **Food Colors**:
  - **Contribution**: Allows for the addition of vibrant colors to the candy, enhancing its visual appeal without compromising its delicacy.

These tools are crucial in creating intricate and delicate candy art, as they enable the artist to manipulate the candy's texture and appearance with high precision and care.

Model B's Response (LLaVA-Video-72B):
- Japanese scissors: Essential for cutting and shaping the candy into precise figures, contributing to the art's delicate and intricate designs.
- Mold power: Likely a typo referring to 'mold powder' or 'molding power,' which is used to shape and hold the candy in place, ensuring the art maintains its form despite the candy's soft and pliable nature.
- Food colors: Vital for adding vibrant hues to the candy art, allowing for a wide range of colors that enhance the visual appeal and thematic representation of the figures.

Automatic Judge:
1. Instruction Following: Model A follows the instructions more closely by explicitly stating the contributions of each tool to the candy's delicate nature, addressing the user's question directly. Model B introduces an element not mentioned in the video, 'mold power,' which diverges from the user's instructions.
2. Accuracy: Model A accurately reflects the content of the video, listing Japanese scissors, a Crock-Pot, and food colors. Model B mentions 'mold power,' which is not referenced in the video, leading to an inaccurate response.
3. Relevance: Model A provides information relevant to the user's interest in amezaiku by focusing on the precise tools and their roles. Model B includes an irrelevant term ('mold power'), reducing its relevance.
4. Helpfulness: Model A offers a detailed explanation of how each tool contributes to the candy's delicate nature, aiding user understanding. Model B introduces a possible misunderstanding with 'mold power,' which detracts from helpfulness.

Overall: Model A Wins.

User Simulation (Role Play): A person who enjoys recreational cooking and frequently hosts parties for friends and family. They are curious about incorporating more tropical flavors into their gatherings and are interested in easy-to-make, crowd-pleasing beverages.

User Simulation (Question Asking): I'm planning to host a tropical-themed party and want to offer a unique twist on classic beverages. The video showed a process involving pineapples and some refreshing ingredients. Can you explain how I can make a creative pineapple drink as shown in the video to impress my guests?

Model A's Response (Qwen2VL-72B):
Absolutely! To create a unique pineapple beverage for your tropical-themed party, you'll need to start by making a pineapple juice base. Blend fresh pineapple chunks with coconut milk until smooth. Pour this mixture into a glass and add a scoop of whipped cream on top. Garnish it with a pineapple slice and a cherry for an extra touch. This pineapple drink is sure to be a hit at your party!

Model B's Response (Aria):
To create a unique pineapple drink for your tropical-themed party, follow these steps:

1. **Preparation of Pineapple Juice Mixture:**
   - In a pot on a stovetop, combine pineapple juice (3 1/2 - 4 cups) and sugar (3/4 cup).
   - Stir the mixture until the sugar is fully dissolved.
   - Allow the mixture to cool completely.

2. **Making the Cream Topping:**
   - In a blender, add frozen pineapple chunks (4 cups), coconut milk (1 3/4 cups), and blend until smooth.
   - Pour the mixture into a container and freeze until solid.

3. **Serving the Drink:**
   - Once the mixture is frozen, scoop it out and place it into a piping bag fitted with a decorative tip.
   - Pipe the creamy mixture onto a glass filled with the pineapple juice mixture.
   - Garnish with a pineapple-shaped straw and a cherry on top.

This will give your guests a refreshing and visually appealing pineapple drink that's perfect for a tropical party.

Automatic Judge:
1. Instruction Following: Model B provides a more detailed response that closely follows the user's request to explain the process shown in the video for creating a unique pineapple drink. Model A offers a simpler explanation that is less representative of the video content.
2. Accuracy: Model B accurately describes the process seen in the video, including the preparation of a pineapple juice mixture and a creamy topping. Model A's description lacks specifics and misses key steps seen in the video.
3. Relevance: Model B considers the user's interest in a unique and impressive drink for their party by providing a detailed recipe. Model A presents a simpler version that doesn't align as well with the user's request for uniqueness.
4. Helpfulness: Model B is more helpful by giving a step-by-step guide that the user can follow to recreate the drink from the video, including refrigeration and presentation tips. Model A's response lacks these details.

Overall: Model B Wins.

Abstract

VideoAutoArena addresses a critical gap in the evaluation of large multimodal models (LMMs) for video analysis. Traditional benchmarks such as VideoMME and LongVideoBench rely on multiple-choice question answering, which provides limited insight and fails to capture the complex, open-ended demands of real-world users. With the rapid growth of LMMs, there is a need for a more dynamic, user-centric evaluation method that reflects the variety and depth of real-world video understanding tasks. To meet this need, VideoAutoArena introduces an automated pipeline that assesses LMMs' capabilities through user simulation, peer battles, and fault-driven evolution. This approach enables continuous, scalable comparisons of model performance and captures nuances in video comprehension that traditional methods overlook. A modified ELO Rating System provides fair, dynamic assessments, while fault-driven evolution progressively raises the difficulty of the questions, pushing models toward better performance in complex, real-world scenarios.

Furthermore, VideoAutoBench simplifies the evaluation process by integrating human-annotated outcomes with GPT-4o's automated judgments. This combination yields a quicker, more accessible assessment framework without sacrificing the depth and user-centric approach of VideoAutoArena. Together, these benchmarks offer a cost-effective, scalable solution for evaluating LMMs in video understanding, pushing the field toward more robust and user-relevant video analysis models.

VideoAutoArena Diagram

BibTeX

@article{luo2024videoautoarena,
    title={VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation}, 
    author={Ziyang Luo and Haoning Wu and Dongxu Li and Jing Ma and Mohan Kankanhalli and Junnan Li},
    year={2024},
    eprint={2411.13281},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.13281}, 
}