Kagi, the search engine company, has been independently evaluating major large language models (LLMs).
Its LLM Benchmarking Project measures model capabilities across several domains, including reasoning, coding, and instruction following.
Unlike typical benchmarks, Kagi uses novel challenges that change frequently, ensuring models are tested on material they have not seen before. This approach gives a truer picture of what the models can actually do, rather than what they can recall from their training data. The project covers a variety of tasks, from text-based questions to image-related challenges, across multiple difficulty levels, to see how models like GPT-4 handle different situations.
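To make the evaluation concrete, here is a minimal sketch of what such a harness might look like. This is not Kagi's actual code: `model.complete`, `task.check`, and the `price_per_output_token` field are hypothetical stand-ins for whatever client and grading logic a real harness would use.

```python
import statistics
import time

def run_benchmark(model, tasks):
    """Run a model over a set of novel tasks and collect the kinds of
    metrics reported in the table below. All interfaces are assumed."""
    correct = 0
    total_tokens = 0
    total_cost = 0.0
    latencies = []
    start = time.monotonic()
    for task in tasks:
        t0 = time.monotonic()
        # Assumed client method: returns the answer text and tokens generated.
        answer, tokens = model.complete(task.prompt)
        latencies.append(time.monotonic() - t0)
        total_tokens += tokens
        total_cost += tokens * model.price_per_output_token  # assumed pricing field
        if task.check(answer):  # grade against the expected answer
            correct += 1
    elapsed = time.monotonic() - start
    return {
        "accuracy_pct": 100.0 * correct / len(tasks),
        "tokens": total_tokens,
        "total_cost_usd": total_cost,
        "median_latency_s": statistics.median(latencies),
        "tokens_per_sec": total_tokens / elapsed,
    }
```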
Here are the current results:
Model | Accuracy (%) | Output Tokens | Total Cost ($) | Median Latency (s) | Speed (tokens/sec) |
---|---|---|---|---|---|
OpenAI gpt-4o | 52.00 | 7482 | 0.14310 | 1.60 | 48.00 |
Together meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 50.00 | 7767 | 0.07136 | 2.00 | 46.49 |
Anthropic claude-3.5-sonnet-20240620 | 46.00 | 6595 | 0.12018 | 2.54 | 48.90 |
Mistral large-latest | 44.00 | 5097 | 0.06787 | 3.08 | 18.03 |
Groq llama-3.1-70b-versatile | 40.00 | 5190 | 0.00781 | 0.71 | 81.62 |
Reka reka-core | 36.00 | 6966 | 0.12401 | 6.21 | 17.56 |
OpenAI gpt-4o-mini | 34.00 | 6029 | 0.00451 | 1.64 | 36.92 |
DeepSeek deepseek-chat | 32.00 | 7310 | 0.00304 | 4.81 | 17.20 |
Anthropic claude-3-haiku-20240307 | 28.00 | 5642 | 0.00881 | 1.33 | 55.46 |
Groq llama-3.1-8b-instant | 28.00 | 6628 | 0.00085 | 2.26 | 82.02 |
DeepSeek deepseek-coder | 28.00 | 8079 | 0.00327 | 4.13 | 16.72 |
OpenAI gpt-4 | 26.00 | 2477 | 0.33408 | 1.32 | 16.68 |
Mistral open-mistral-nemo | 22.00 | 4135 | 0.00323 | 0.65 | 82.65 |
Groq gemma2-9b-it | 22.00 | 4889 | 0.00249 | 1.69 | 54.39 |
OpenAI gpt-3.5-turbo | 22.00 | 1569 | 0.01552 | 0.51 | 45.03 |
Reka reka-edge | 20.00 | 5377 | 0.00798 | 2.02 | 46.87 |
Reka reka-flash | 16.00 | 5738 | 0.01668 | 3.28 | 28.75 |
GoogleGenAI gemini-1.5-pro-exp-0801 | 14.00 | 4942 | 0.26325 | 1.82 | 28.19 |
GoogleGenAI gemini-1.5-flash | 14.00 | 5287 | 0.02777 | 3.02 | 21.16 |
The table reports overall model quality (measured as the percentage of correct responses), total tokens output (some models are less verbose by default, which affects both cost and speed), the total cost of running the test, median response latency, and average speed in tokens per second at the time of testing.
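Since accuracy and cost pull in opposite directions, one useful way to read the table is dollars spent per percentage point of accuracy. This is a derived figure, not one Kagi reports; the sketch below computes it for a few rows of the table above:

```python
# Cost efficiency: dollars spent per percentage point of accuracy,
# using (accuracy, total cost) pairs from the table. Lower is better.
results = {
    "OpenAI gpt-4o": (52.00, 0.14310),
    "Meta-Llama-3.1-405B-Instruct-Turbo": (50.00, 0.07136),
    "Groq llama-3.1-70b-versatile": (40.00, 0.00781),
    "OpenAI gpt-4o-mini": (34.00, 0.00451),
}

for name, (accuracy, cost) in results.items():
    print(f"{name}: ${cost / accuracy:.5f} per accuracy point")
```

By this measure, the cheaper models such as Groq's llama-3.1-70b-versatile cost an order of magnitude less per accuracy point than gpt-4o, though absolute accuracy still matters for harder tasks.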
This approach measures the models’ potential and adaptability, with some bias toward the capabilities most important for LLM features in Kagi Search.