GPT-4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) systems, which may include benchmark-specific crafting or additional training protocols.
These include HellaSwag (commonsense reasoning about everyday events), on which GPT-4 scored 95.3%, and the AI2 Reasoning Challenge (ARC), on which it scored 96.3%.
Many existing ML benchmarks are written in English. To get an initial sense of capability in other languages, the MMLU benchmark, a suite of 14,000 multiple-choice problems spanning 57 subjects, was translated into a variety of languages. In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs, including for low-resource languages such as Latvian, Welsh, and Swahili.
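To make the translated-benchmark setup concrete, here is a minimal sketch of how a single translated multiple-choice MMLU item could be scored; `translate` and `ask_model` are hypothetical stand-ins for a machine-translation service and a model API, not the actual pipeline behind the reported numbers.

```python
# Minimal sketch of scoring one translated multiple-choice MMLU item.
# `translate` and `ask_model` are hypothetical stand-ins, not the actual
# evaluation pipeline behind the reported numbers.

def translate(text: str, target_lang: str) -> str:
    """Hypothetical machine-translation call (e.g. a commercial MT API)."""
    raise NotImplementedError

def ask_model(prompt: str) -> str:
    """Hypothetical model call expected to return a single answer letter."""
    raise NotImplementedError

def score_item(question: str, choices: list[str], answer_idx: int, lang: str) -> bool:
    """Translate one item into `lang`, query the model, and check its answer letter."""
    letters = "ABCD"
    lines = [translate(question, lang)]
    for i, choice in enumerate(choices):
        lines.append(f"{letters[i]}. {translate(choice, lang)}")
    lines.append("Answer with a single letter.")
    reply = ask_model("\n".join(lines))
    return reply.strip().upper().startswith(letters[answer_idx])

# Accuracy on a translated benchmark is then the fraction of items scored True.
```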
GPT-4 was also evaluated on a range of traditional benchmarks: on WinoGrande (commonsense reasoning around pronoun resolution) it scored 87.5%, on HumanEval (Python coding tasks) it scored 67.0%, and on DROP (reading comprehension and arithmetic, reported as F1) it scored 80.9.
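For context on the DROP number, the F1 there is a token-overlap score between the model's answer and the reference answer. Below is a simplified sketch of that metric; the official DROP evaluator additionally normalizes numbers, articles, and multi-span answers.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer.

    Simplified DROP-style scoring; the official evaluator also handles
    number normalization, article stripping, and multi-span answers.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("35 points", "35"))  # ~0.667: partial credit for extra tokens
```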
GPT-4 substantially improves on previous models in its ability to follow user intent: on a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.
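The 70.2% figure is a simple win rate: the fraction of prompts on which raters preferred GPT-4's response. A small illustrative sketch follows, with made-up labels rather than OpenAI's data.

```python
# Illustrative win-rate calculation for a pairwise preference comparison.
# The labels below are made up for demonstration, not OpenAI's ratings.

def win_rate(preferences: list[str], model: str = "gpt-4") -> float:
    """Fraction of prompts on which `model` was the preferred response."""
    return sum(p == model for p in preferences) / len(preferences)

# Example: raters preferred GPT-4 on 7 of 10 prompts -> 70% win rate.
labels = ["gpt-4"] * 7 + ["gpt-3.5"] * 3
print(f"{win_rate(labels):.1%}")  # 70.0%
```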
The accompanying bar chart shows GPT-4's performance on the translated MMLU benchmark across a variety of languages, compared with prior models' performance in English; in most languages the results come close to GPT-4's own English-language performance.
GPT-4 has also been used internally, with great impact on functions like support, sales, content moderation, and programming. It is likewise helpful in assisting humans in evaluating AI outputs, starting the second phase of the alignment strategy.
Even for difficult or low-resource languages, GPT-4 handles the task well, producing output with comparatively few grammatical and other errors.