News
As per the official statement, both models also perform strongly on tool use, few-shot function calling, CoT reasoning (as seen in results on the Tau-Bench agentic evaluation suite) and HealthBench ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results