ChatGPT vs Open Source LLMs: Who Has the Edge?
A comprehensive research report on large language models
This report provides a detailed review of the performance of the various open source large language models that have claimed, in the year since ChatGPT's release, to match or surpass it on a range of tasks.
The report aggregates a range of evaluation benchmarks and compares open source LLMs against ChatGPT across task categories, including general ability, agent ability, logical reasoning, long-text modeling, application-specific ability (such as question answering and summarization), and trustworthiness (such as hallucination and safety).
The conclusion: in terms of overall capability, ChatGPT is still far ahead!
The following is a brief summary of the report:
1. General ability:
Benchmarks: MT-Bench (multi-turn dialogue and instruction following), AlpacaEval (following general user instructions, scored as a win rate against a reference model; a minimal win-rate sketch follows below), and the Open LLM Leaderboard (a broad suite of reasoning and general-knowledge tasks).
Model performance:
• Llama-2-70B-chat achieved a 92.66% win rate on AlpacaEval, surpassing GPT-3.5-turbo.
• WizardLM-70B scores 7.71 on MT-Bench, lower than GPT-4 (8.99) and GPT-3.5-turbo (7.94).
• Zephyr-7B has a win rate of 90.60% on AlpacaEval and a score of 7.34 on MT-Bench.
• GodziLLa2-70B scored 67.01% on the Open LLM Leaderboard, while Yi-34B scored 68.68%.
• GPT-4 maintains the highest performance, with a 95.28% win rate on AlpacaEval.
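For readers unfamiliar with how AlpacaEval-style win rates are produced, the sketch below shows the core idea: a judge compares each candidate answer against a reference answer (e.g., one from GPT-3.5-turbo), and the win rate is the fraction of comparisons the candidate wins. This is a minimal illustration, not the official AlpacaEval code; the `judge_prefers_candidate` callable is a hypothetical stand-in for a GPT-4-based judge.

```python
# Minimal sketch of an AlpacaEval-style win-rate computation.
# `judge_prefers_candidate` is a hypothetical callable (e.g., backed by a
# GPT-4 judge prompt); this is illustrative, not the official AlpacaEval code.
from typing import Callable, Iterable

def win_rate(
    instructions: Iterable[str],
    candidate_outputs: Iterable[str],
    reference_outputs: Iterable[str],
    judge_prefers_candidate: Callable[[str, str, str], bool],
) -> float:
    """Fraction of instructions on which the judge prefers the candidate."""
    wins, total = 0, 0
    for instr, cand, ref in zip(instructions, candidate_outputs, reference_outputs):
        total += 1
        if judge_prefers_candidate(instr, cand, ref):
            wins += 1
    return wins / total if total else 0.0

# Toy usage with a stub judge that simply prefers the longer answer.
if __name__ == "__main__":
    stub_judge = lambda instr, cand, ref: len(cand) > len(ref)
    print(win_rate(["Explain TCP."], ["a long, detailed answer"], ["short"], stub_judge))
```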
2. Agent capabilities:
Benchmarks: tool use (API-Bank, ToolBench), self-debugging (InterCode-Bash, MINT-HumanEval), following natural-language feedback (MINT), and environment exploration (ALFWorld, WebArena); a minimal self-debugging loop sketch follows below.
Model performance: Lemur-70B-chat outperforms GPT-3.5-turbo and GPT-4 in the ALFWorld, IC-CTF, and WebArena environment tests.
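To make the self-debugging setting concrete, here is a minimal sketch of the kind of feedback loop that benchmarks such as InterCode-Bash and MINT evaluate: the model proposes a command, the harness executes it, and any error output is fed back for the next attempt. The `propose_code` callable is a hypothetical LLM call; this is an illustrative loop, not the benchmarks' actual harness.

```python
# Minimal sketch of a self-debugging agent loop in the spirit of
# InterCode-Bash / MINT: the model proposes a bash snippet, the harness runs
# it, and stderr is fed back so the model can try again.
# `propose_code` is a hypothetical LLM call; illustrative only.
import subprocess
from typing import Callable

def self_debug_loop(task: str, propose_code: Callable[[str, str], str], max_turns: int = 3) -> str:
    feedback = ""
    for _ in range(max_turns):
        code = propose_code(task, feedback)          # model proposes a bash snippet
        result = subprocess.run(
            ["bash", "-c", code], capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:                   # success: return the working snippet
            return code
        feedback = result.stderr                     # feed the error back to the model
    return code                                      # last attempt once the turn budget is spent
```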
3. Logical reasoning ability:
Benchmarks: GSM8K (grade-school math word problems), MATH (competition mathematics), TheoremQA (applying theorems to solve science problems), HumanEval (code generation), etc.; a minimal answer-scoring sketch follows below.
Model performance:
• WizardCoder achieves a 19.1% absolute improvement over GPT-3.5-turbo on HumanEval.
• WizardMath achieves a 42.9% absolute improvement over GPT-3.5-turbo on GSM8K.
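As a concrete example of how such benchmarks are typically scored, the sketch below implements a GSM8K-style check: pull the final number out of the model's free-form answer and compare it to the gold answer. The extraction regex is an assumption for illustration, not the official evaluator.

```python
# Minimal sketch of GSM8K-style scoring: extract the final number from a
# model's chain-of-thought answer and compare it with the gold answer.
# The regex convention here is an assumption, not the official evaluator.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in the model's answer, if any."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_accuracy(predictions: list[str], gold_answers: list[str]) -> float:
    correct = sum(
        extract_final_number(pred) == gold
        for pred, gold in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)

# Toy usage: one correct and one wrong prediction -> 0.5 accuracy.
print(gsm8k_accuracy(["... so the answer is 42", "I think it is 7"], ["42", "8"]))
```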
4. Application-specific abilities:
Benchmarks: query-focused summarization (AQuaMuSe, QMSum, etc.) and open-ended question answering (SQuAD, NewsQA, etc.); a minimal EM/F1 scoring sketch follows below.
Model performance: InstructRetro shows a 7–10% improvement over GPT-3 on NQ, TriviaQA, SQuAD 2.0, and DROP.
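Open-ended QA benchmarks such as SQuAD are usually scored with exact match and token-level F1 between the predicted and gold answers. The sketch below shows a simplified version of that scoring; the text normalization is deliberately lighter than the official SQuAD evaluation script.

```python
# Minimal sketch of SQuAD-style scoring: exact match and token-level F1
# between a predicted answer span and the gold answer. Normalization here
# is simplified relative to the official SQuAD evaluation script.
from collections import Counter

def normalize(text: str) -> list[str]:
    return text.lower().strip().split()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy usage: partial overlap yields an F1 between 0 and 1.
print(exact_match("Paris", "paris"), token_f1("in Paris France", "Paris"))
```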
5. Applications in the medical field:
Benchmarks: mental health analysis (IMHI) and radiology report generation (OpenI, MIMIC-CXR).
Model performance:
• MentaLLaMA-chat-13B, after fine-tuning on the IMHI training set, outperforms ChatGPT on 9 out of 9 tasks.
• Radiology-Llama-2 significantly surpasses ChatGPT and GPT-4 on the MIMIC-CXR and OpenI datasets.
6. Trustworthiness:
Benchmarks: TruthfulQA, FactualityPrompts, HaluEval, etc., used to evaluate the truthfulness and safety of LLMs.
Model performance:
• Several methods and models (such as Platypus, Chain-of-Verification, Chain-of-Knowledge) have made progress in reducing hallucinations and improving safety; a sketch of the Chain-of-Verification idea follows below.
• For example, Platypus shows roughly a 20% improvement over GPT-3.5-turbo on TruthfulQA.
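To illustrate the Chain-of-Verification idea mentioned above, here is a minimal sketch of the draft-verify-revise loop: draft an answer, generate verification questions, answer them independently, then revise the draft against those answers. The `llm` callable and the prompts are hypothetical placeholders, not the original paper's prompts.

```python
# Minimal sketch of a Chain-of-Verification-style pipeline: draft an answer,
# generate verification questions, answer them independently, then revise.
# `llm` is a hypothetical text-completion callable; prompts are illustrative
# and not taken from the original paper.
from typing import Callable

def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    draft = llm(f"Answer the question:\n{question}")
    checks = llm(
        f"List short verification questions that would fact-check this answer:\n{draft}"
    )
    verified = [
        f"Q: {q}\nA: {llm(q)}"                      # answer each check independently
        for q in checks.splitlines() if q.strip()
    ]
    return llm(
        "Revise the draft answer so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n" + "\n".join(verified)
    )
```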