Benchmarking Floworks against OpenAI & Anthropic
Abstract
A key reason for the underwhelming economic impact of LLMs has been their lack of proficiency with tool use / function calling. Much-anticipated AI gadgets like the Rabbit R1 and Humane AI Pin have been criticized for poor performance in fulfilling tasks.
In this white paper, we develop a novel “ThorV2” architecture which allows LLMs to perform function calling accurately and reliably. We also develop a new benchmark to evaluate LLM performance on HubSpot, one of the most popular CRM applications, as a prime example of tool use. Our benchmark consists of 142 atomized HubSpot queries that test Create, Search, Update, Delete (CRUD), and Associate operations.
We find that our ThorV2 system significantly outperforms traditional function-calling approaches built on the gpt-4o and claude-opus models, achieving 75% and 15.2% higher accuracy respectively. At the same time, ThorV2 uses only a fraction of the cost (<40%) and time (<78%) of the other models. We also propose a new metric, Reliability, which measures consistent performance across repeated tests, and show that our system aces it.
Lastly, we show that ThorV2 easily generalizes to complex tasks involving multiple API calls, with hardly any degradation in accuracy or speed.
Model Performance at a glance
We compare our function-calling model, Floworks-ThorV2, against three other SOTA closed-source models: Claude-3 Opus, GPT-4o, and GPT-4-turbo. We compare these models across four metrics: Accuracy (percentage of tasks fulfilled correctly), Reliability (maintaining correctness across repeated runs), Latency, and Cost per query. The numbers are shown in the table below:
Table 1: Accuracy, reliability, latency, and cost for the four models: 1) ThorV2 by Floworks, 2) Claude-3 Opus, 3) GPT-4o, 4) GPT-4-turbo
Our ThorV2 model outperforms the other three models across all four metrics. Notably, Claude-3 Opus is the only model that comes close in accuracy (78% versus 90%), yet it lags significantly behind on the other three metrics. OpenAI’s latest model, GPT-4o, is both fast and relatively cheap, but achieves only ~51% accuracy compared to ThorV2’s 90%, and also lags behind in reliability (84% versus 100%).
These metrics are visualized below:
Figure 1: Visualizing accuracy and reliability of the four models. ThorV2 dominates the field, strongly outperforming the OpenAI models on accuracy and beating Claude-3 Opus handily on reliability.
Figure 2: Visualizing cost and latency of the four models. ThorV2 is the fastest and cheapest of the four. Claude-3 Opus is notably expensive in terms of both time and monetary cost.
Current State of Function Calling in LLMs
Why haven’t LLM assistants really taken off?
Large Language Models (LLMs) have undoubtedly changed our world in the last three years. Since its release in November 2022, ChatGPT has grown to 100 million weekly users. In just the last two years, massive amounts of funding have poured into AI startups, and AI news has dominated the airwaves. And yet, in spite of all the hype and anticipation, we haven’t seen the economic tidal wave of AI startups that was hoped for. So what gives?
Well, having a chatbot for answering questions is one thing, but it’s another thing entirely to build an “AI assistant” that can perform sophisticated tasks, which require interfacing with many different software tools across complex APIs. We believe the technical challenges involved in using LLMs to interact with software - a process known as “function calling” - are what have kept LLM Assistants from becoming widespread.
Function calling has proven to be a challenge that even SOTA foundation models have struggled to meet. For example, GPT-4 obtained a disappointing average task accuracy of 55% on this benchmark. In fact, its inability to perform function calling is the primary reason why OpenAI’s GPT Store, which generated a lot of hype at launch, has been a bust. It is also partly why recently launched AI assistant products like the Rabbit R1 and Humane AI Pin have not been well received by the market. So why is function calling so hard?
Problems with the traditional approach to Function Calling
Traditional AI systems have treated function calling as a monolith: the model accepts a task and a set of relevant function schemas, and outputs the proper function call with all the appropriate arguments filled in. This traditional approach suffers from a few key disadvantages:
Retrieving the correct functions: Firstly, retrieving the appropriate functions for the call is itself an error-prone task. Retrieval is usually performed by computing vector similarity between the task and a function key. Vector similarity is a heuristic approach known to suffer from issues with accuracy, scalability, domain specificity, and a general lack of intelligence.
Huge token lengths: Function schemas are often long, so these models end up with a huge number of tokens in their prompt. This greatly increases the deployment cost and latency of traditional function calling relative to other LLM tasks. Moreover, large prompt lengths are also associated with sharp declines in accuracy on reasoning tasks.
High sensitivity of output: LLMs are trained on large bodies of mostly free-flowing text. They are trained to think creatively, not deterministically. Unfortunately, the nature of function calling is very rigid: variable names must be exact, the JSON structures precise, and the arguments correct to the last decimal. Two variables with similar but non-identical names have completely different effects; for example, using “revenue” instead of “amount”, or “id” instead of “ids” when searching for multiple objects, produces an error in the HubSpot API. Dates and timezones must appear in exactly the right format. These characteristics confuse most LLMs, which are not trained to think in such inflexible terms (see the validation sketch after Figure 3).
Figure 3: Performance of various LLMs averaged over 600 reasoning tasks. As the input token length increases, all models show declining performance well inside the limits of their context window. For the smaller models, the decline is especially sharp. (Graph taken from this paper.)
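To make this rigidity concrete, here is a minimal validation sketch (our illustration, not part of any vendor's tooling) that checks a generated payload against a hypothetical, heavily simplified search schema using the jsonschema Python library. A single misnamed key, such as "id" instead of "ids", is enough to invalidate the call.

    from jsonschema import Draft7Validator

    # Hypothetical, heavily simplified schema for a "search multiple objects" endpoint.
    # Real HubSpot schemas are far larger; this is only for illustration.
    SEARCH_SCHEMA = {
        "type": "object",
        "properties": {
            "ids": {"type": "array", "items": {"type": "string"}},
            "limit": {"type": "integer"},
        },
        "required": ["ids"],
        "additionalProperties": False,
    }

    def check_payload(payload: dict) -> list[str]:
        """Return human-readable validation errors (empty list if the payload is valid)."""
        return [e.message for e in Draft7Validator(SEARCH_SCHEMA).iter_errors(payload)]

    # Using "id" instead of "ids" makes the call invalid.
    print(check_payload({"id": ["15860461964"], "limit": 10}))   # two errors reported
    print(check_payload({"ids": ["15860461964"], "limit": 10}))  # []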
In fact, the problem of function calling using LLMs has proven to be so challenging that even the best closed-source LLMs (GPT-4o, Claude-3 Opus) can’t solve it. So what makes us at Floworks different?
What is Floworks doing behind the scenes?
At Floworks, we have spent many months working on the problem of tool use in a valuable domain like CRM. With our ThorV2 architecture, we have finally achieved a respectable level of accuracy (90% on our benchmark dataset). So why do we succeed where so many other attempts at solving this problem have failed?
When we started out, we tried many of the approaches that were popular in this space. We gathered a large number of schemas for the HubSpot API and used Retrieval-Augmented Generation (RAG) to fetch the right functions for a particular task. What we found, however, was that schemas alone are woefully inadequate for function calling. The API documentation of most software is not self-contained or self-explanatory, and current LLMs struggle to interpret it.
Figure 4: Traditional Function calling Diagram
These LLMs thus require a framework, or a Cognitive Enhancement Architecture (CEA), to simplify and guide the model along its task. ThorV2 is precisely such a cognitive architecture. Think of it as a high-tech power suit like the one worn by Iron Man, providing greater strength, flight capabilities, augmented-reality vision, and so on. There are two key differences, however: this power suit is worn by an LLM rather than a human, and it enhances cognitive capabilities rather than physical ones. You can also think of a CEA as augmented-reality glasses for LLMs, increasing their effective intelligence.
But we haven’t addressed the real challenge yet: what is the best way to provide guidance that helps an LLM understand function calling? Again, we started with the traditional strategies – instruction prompting, example prompting, chain-of-thought prompting. None of these approaches worked adequately. So where did we go from there? This is where the magic happened: we discovered that our approach had been flawed the entire time.
The reason instruction prompting and example prompting fail is that the domain of all the various functions on HubSpot is simply too large. It cannot be effectively captured through examples without the number of tokens exploding; our initial attempts found that HubSpot requires 400,000 tokens to fully capture all the variants of function calling! As for instruction prompting, generating a comprehensive yet digestible set of instructions for function calling is an extremely hard problem. Not to mention that LLMs tend to struggle with instruction-following at large input token lengths!
References:
- RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
- Evaluating Large Language Models at Evaluating Instruction Following
So what’s our secret sauce? We model the domain in a new way that takes advantage of the LLM’s innate knowledge and enhances it only where required. This is done by giving feedback at the output stage, after an initial API call is generated – fixing mistakes rather than giving a surplus of instructions at the start. This is similar to the concept of agentic workflows that has gained a lot of attention in AI over the last few months, and which has led to agents such as Devin, an autonomous software engineer.
In our framework, a Naive Assistant solves the user’s query using its innate intelligence plus a short system prompt, which provides broad instructions about its task and a general domain overview of HubSpot. The Assistant generates an initial API call, which is frequently wrong on the first try; this is forwarded to a Domain-specific Expert Validator (DEV), which inspects the API call for errors. If there is no error, the API call is sent to the output. The API call can cycle between the Assistant and the Validator any number of times, though in practice it rarely exceeds three attempts.
Figure 5: Our ThorV2 Function Calling Diagram
A notable distinction of our architecture from commonly discussed agentic workflows is that our DEV is a static agent, written entirely in code. We are able to achieve this because we found the errors made by the LLM tend to be highly repetitive and predictable.
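As a rough illustration of this Assistant-Validator flow, here is a minimal sketch. The function names (generate_api_call), the feedback keyword, and the specific rules inside validate are our own placeholders modeled on the rules listed in the Appendix, not the actual ThorV2 implementation.

    MAX_ATTEMPTS = 3  # in practice the loop rarely exceeds three attempts

    def validate(call: dict) -> list[str]:
        """Static, rule-based Domain-specific Expert Validator: returns error messages."""
        errors = []
        # Example rules of the kind a domain expert might encode for HubSpot:
        for group in call.get("filterGroups", []):
            if len(group.get("filters", [])) > 3:
                errors.append("max filters per filterGroup allowed is 3")
        ts = call.get("timestamp")
        if ts is not None and not ts.endswith("Z"):
            errors.append("timestamp must be in the format yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
        return errors

    def solve(query: str, assistant) -> dict:
        """Cycle the Assistant's draft API call through the Validator until it is clean."""
        call = assistant.generate_api_call(query)  # initial draft, frequently wrong
        for _ in range(MAX_ATTEMPTS):
            errors = validate(call)
            if not errors:
                return call  # no errors: forward the call to the output
            # Feed only the specific mistakes back to the Assistant ("edge of domain").
            call = assistant.generate_api_call(query, feedback=errors)
        return call  # still imperfect after MAX_ATTEMPTS; the caller decides what to do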
The key insight here is that by only fixing mistakes, rather than dumping a massive set of instructions from the get-go, we avoid information overload on the LLM. This massively reduces the number of tokens, boosting accuracy and reducing cost and latency. It also greatly simplifies the job of building the Validator, reducing the turnaround time of these agentic systems from months to weeks. We have visually captured the gains of this approach in the diagram below. We call it “edge of domain” modeling, as opposed to “whole of domain” modeling, which is the traditional approach to function calling.
Figure 6: Whole-domain modeling versus edge-of-domain modeling
Of course, if the Naive Assistant is too weak, the number of cycles needed increases drastically, eroding our cost and speed advantage. As it turns out, we were able to engineer our system so that its zero-shot capabilities are good enough to answer ~70% of simple queries correctly. However, without exceeding 90% performance on simple queries, it is not feasible to generate enterprise value from these agentic systems. With our Assistant-Validator framework, we comfortably hit this benchmark and can release a viable product into the market. The diagram below shows the typical number of Assistant-Validator cycles that occur for an average HubSpot query.
Figure 7: Percentage of test queries versus fixes needed (numbers are illustrative)
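As a back-of-the-envelope illustration (assumed numbers, not measurements): if the first draft is correct with probability p ≈ 0.7 and each feedback round repairs a faulty draft with some probability q, the expected number of Assistant-Validator cycles stays close to one, which is why the extra rounds barely dent cost and latency.

    def expected_cycles(p: float, q: float, max_attempts: int = 3) -> float:
        """Expected Assistant-Validator cycles under an assumed geometric repair model."""
        expected, prob_still_wrong = 0.0, 1.0
        for attempt in range(1, max_attempts + 1):
            success = p if attempt == 1 else q
            expected += attempt * prob_still_wrong * success
            prob_still_wrong *= 1 - success
        # Drafts still wrong after the final attempt also consumed max_attempts cycles.
        return expected + max_attempts * prob_still_wrong

    print(expected_cycles(0.7, 0.8))  # ~1.36 cycles on average under these assumptions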
Evaluation standard
Overview
We have compiled a systematic benchmark of ~150 selected queries for HubSpot, covering various aspects of everyday work. Our “Thor” system is compared with the strongest commercially available models: OpenAI’s GPT-4-turbo and GPT-4o, and Anthropic’s Claude-3 Opus. Thor beats these models significantly across multiple dimensions – accuracy (task completion rate), latency (milliseconds to task completion), and cost (dollars spent per query).
Hubspot Overview
HubSpot is a popular CRM (Customer Relationship Management) software that salespeople use to keep track of and manage their sales workload [7]. It allows the user to create various objects like Companies, Contacts, Deals, Notes, and Tasks, and to manage relationships between them. Each of these objects also has several properties that can be set during creation and modified later. For example, a deal has properties like closing date, amount, deal stage, and win probability.
There are four primary operations that can be performed on the objects in a CRM – Create, Read, Update, Delete (CRUD). We further support an Associate (A) operation – e.g. associating a deal with a company, or a deal with a contact. This makes a total of 5 operations – Create, Read, Update, Delete, Associate (CRUDA). These cover the most common ways HubSpot is used in practice. Note that we exclude certain other uses of HubSpot, such as sending emails or creating events, because per user feedback, people prefer other apps (e.g. Gmail, calendar apps) for these tasks.
Evaluation Dataset
We have constructed a careful test set of 142 queries, covering every variety of task a HubSpot user would need to accomplish. For simplicity and a fair comparison, each query is designed to require exactly one API call.
Evaluation Bench Setup
We only compare ourselves against models that explicitly support function calling. Most top models, such as GPT-4 and Claude-3, meet this requirement.
Schemas
We provide these models with the test queries one at a time, along with 5 relevant schemas (in JSON format) for fulfilling the query. The schemas are chosen by cosine-similarity matching with a set of reference queries. Our bench is set up so that at least one of the 5 schemas provided to the model is always capable of fulfilling the task.
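A minimal sketch of this selection step, assuming some embedding function embed() (any sentence-embedding model would do; the function and field names here are placeholders):

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def top_k_schemas(query: str, schema_index: list[dict], embed, k: int = 5) -> list[dict]:
        """Pick the k schemas whose reference queries are most similar to the test query.

        Each schema_index entry looks like {"schema": {...}, "reference_query": "..."}.
        """
        q_vec = embed(query)
        scored = [
            (cosine(q_vec, embed(entry["reference_query"])), entry["schema"])
            for entry in schema_index
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [schema for _, schema in scored[:k]]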
Function-calling API
We use the function-calling APIs explicitly provided by OpenAI and Anthropic. These APIs accept function schemas along with the input query. The model selects one of the functions suitable for solving the query and emits an API call in the “tool_calls” field in the case of OpenAI’s API, or a “tool_use” content block in the case of Anthropic’s.
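The sketch below shows, in simplified form, how the reference models can be queried through the official openai and anthropic Python SDKs. The exact parameters used in our bench may differ slightly, and the two providers expect slightly different schema formats.

    import json
    from openai import OpenAI
    from anthropic import Anthropic

    def call_openai(query: str, schemas: list[dict], system_prompt: str):
        # schemas: objects with "name", "description", "parameters", wrapped as tools
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-2024-05-13",
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": query}],
            tools=[{"type": "function", "function": s} for s in schemas],
            tool_choice="auto",
        )
        call = resp.choices[0].message.tool_calls[0]  # the "tool_calls" field
        return call.function.name, json.loads(call.function.arguments)

    def call_anthropic(query: str, schemas: list[dict], system_prompt: str):
        # schemas: objects with "name", "description", "input_schema"
        client = Anthropic()
        msg = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            system=system_prompt,
            tools=schemas,
            messages=[{"role": "user", "content": query}],
        )
        block = next(b for b in msg.content if b.type == "tool_use")  # the "tool_use" block
        return block.name, block.input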
Evaluation Metrics
Measuring Accuracy
The output of the model is a single API call, which is assessed for correctness using a combination of software evaluation (does it run on the HubSpot API without error?) and human evaluation (does it correctly satisfy the user’s query?). Human evaluation consists of 5 steps:
Only if an API call satisfies all 5 of the above criteria, and also runs without error on the HubSpot API, is it considered correct. All four models being tested (ThorV2, Claude-3 Opus, GPT-4-turbo, and GPT-4o) are subject to the same evaluation criteria in a blind evaluation.
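A minimal sketch of how per-query verdicts roll up into the accuracy number (the five human criteria are represented here as an opaque list of booleans):

    def query_passes(runs_without_error: bool, human_criteria: list[bool]) -> bool:
        """Correct only if the call executes on HubSpot and all 5 human criteria hold."""
        return runs_without_error and len(human_criteria) == 5 and all(human_criteria)

    def accuracy(verdicts: list[bool]) -> float:
        """Fraction of benchmark queries judged correct, e.g. 128/142 ≈ 0.901."""
        return sum(verdicts) / len(verdicts)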
Measuring Reliability
Large Language Models were originally designed to mimic the way that a human talks or writes in natural language. A key component of generating realistic writing was the presence of a sampling mechanism, along with a temperature parameter, which allows the model to be “creative” in its writing style.
However, the same sampling mechanism runs into problems when we use these models for function calling. Their non-deterministic nature becomes a hindrance rather than a benefit in such situations (Reference: LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation). Fortunately, our ThorV2 model employs several agentic workflows to combat these non-deterministic issues, leading to a highly reliable and capable model. Besides enhancing the user experience, another benefit of ThorV2’s high reliability is that the system is much easier to build, maintain, and improve.
We can measure reliability using the following method:
We run the test suite 10 times and count the number of queries with a fluctuating response (i.e. at least one Pass and one Fail) among the 10 attempts. We call these queries “fluctuating” and all other queries “non-fluctuating” or “consistent”. The reliability metric is the fraction of consistent queries: Reliability = (number of consistent queries) / (total number of queries).
Observe that reliability is different from accuracy: a 100% reliable model does not necessarily get every query right; it just gives the same answer every time it is run.
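A minimal sketch of this computation, where results[q] holds the 10 pass/fail outcomes for query q:

    def reliability(results: dict[str, list[bool]]) -> float:
        """Fraction of queries whose outcome never fluctuates across the 10 runs."""
        consistent = sum(
            1 for outcomes in results.values()
            if all(outcomes) or not any(outcomes)  # all Pass or all Fail
        )
        return consistent / len(results)

    # Example: q3 flips between Pass and Fail, so only 2 of 3 queries are consistent.
    runs = {
        "q1": [True] * 10,
        "q2": [False] * 10,
        "q3": [True] * 9 + [False],
    }
    print(reliability(runs))  # 2/3 ≈ 0.67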
Evaluation Results
Models
- ThorV2: Our latest AI model which powers the current version of Flowy Assistant
- Reference model - Gpt-4o-2024-05-13
- Reference model - Gpt-4-turbo-2024-04-09
- Reference model - Claude-3 Opus
Table 1: Accuracy, reliability, latency, and cost for the four models: 1) ThorV2 by Floworks, 2) GPT-4o, 3) GPT-4-turbo, 4) Claude-3 Opus
As Figure 1 shows, our ThorV2 model, clocking in at 90.1% accuracy, is ahead of Claude-3 Opus (78%), GPT-4o (51%), and GPT-4-turbo (48.6%). Note that these models, released in March, May, and April 2024 respectively, are considered SOTA in most domains among current AI models. ThorV2 delivers these results consistently – the ~10% of errors it encounters always fall on the same set of queries.
Our model also manages this feat at an impressive pace: it is 28% faster than GPT-4o, 98% faster than GPT-4-turbo, and a whopping 570% faster than Claude-3 Opus. In terms of cost, ThorV2 once again outshines all three models, costing only 40% as much as GPT-4o, 25% as much as GPT-4-turbo, and just 1/28th as much as Claude-3 Opus.
Category-wise accuracy distribution
Figure 8: Accuracy scores for the models across five categories: Create, Search, Update, Delete, and Associate.
We can also split the accuracy score by category. All four models are quite capable on Delete and Associate queries. On Search and Create queries, ThorV2 establishes a sizable advantage over all other models. On Update queries, Claude-3 Opus matches ThorV2’s performance, but the OpenAI models are nowhere near the same accuracy.
Conclusion
The demonstrated superiority of our ThorV2 system over leading commercial alternatives has important business implications. The accuracy and latency gains translate directly into an improved user experience for salespeople using the Flowy assistant. Higher accuracy means less frustration from failed tasks or mis-transformation of HubSpot data. Faster speed means near-instantaneous execution of tasks, without the user having to passively wait between requests. Reliability lets the user learn the capabilities of the system over time, without being frustrated by the inherent randomness so characteristic of LLMs. It also makes the system as a whole much easier to engineer and improve.
As the AI industry seeks to transition from mere chatbots to agents, i.e. AIs that take actions in the real world, the ability to perform API calls will take on universal importance. Every piece of software used today has an API with its own schema, opening up many directions for our company to expand beyond sales.
Appendix
A. System Prompt for Claude-Opus:
- I am hubspot owner id <owner_id>.\n
- You are a smart function calling agent, You map all the information present in the input query to the output API call using tools provided.
- You are amazingly smart and you will keep generating the output until the 'stop_reason' is 'tool_use'.
- You are amazingly smart and you will keep generating the output until the 'stop_reason' is 'tool_use' and do not end the output generation when the 'stop_reason' is 'end_turn'.\n
- You must always return the name of the tool you used to generate the function call in the output.\n
- Do not assume any fields as required because they are present from the example in the schema.
- In the input_schema provided in tools, pay attention to the required key as they are the compulsory fields and others are optional.\n
- If the user does not provide any information, you can consider the current user (me) as the associated person.\n
# Rules :
- max filters per filterGroup allowed is 3.
- any timestamp should always be in the format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
- current time is "2024-05-05T00:00:00.000Z"
B. System Prompt for GPT-4-Turbo / GPT-4o
- I am hubspot owner id <owner_id>.\n
# Rules :
- max filters per filterGroup allowed is 3.
- any timestamp should always be in the format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
- current time is "2024-05-05T00:00:00.000Z"
- If the timestamp is not provided, you can consider the current time as the timestamp.\n
- If the user is not provided, you can consider the current user (me) as the associated person.\n
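Both prompts pin timestamps to the format yyyy-MM-dd'T'HH:mm:ss.SSS'Z'. A minimal sketch of producing that format in Python (our illustration, not part of the prompts):

    from datetime import datetime, timezone

    def hubspot_timestamp(dt: datetime) -> str:
        """Format a datetime as yyyy-MM-dd'T'HH:mm:ss.SSS'Z' (UTC, millisecond precision)."""
        dt = dt.astimezone(timezone.utc)
        return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"

    print(hubspot_timestamp(datetime(2024, 5, 5, tzinfo=timezone.utc)))
    # -> 2024-05-05T00:00:00.000Z  (the "current time" used in the prompts)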
C. Example query given to all reference models, and a sample tool_use response
Search all notes with associated deal 15860461964 (include note body, creation date, note title)
{
  "type": "tool_use",
  "id": "toolu_01BrH8mLDahkxdsplThTBwXA",
  "name": "crm_v3_objects_notes_search_post",
  "input": {
    "after": 0,
    "filterGroups": [
      {
        "filters": [
          {
            "operator": "EQ",
            "propertyName": "associations.deal",
            "value": "15860461964"
          }
        ]
      }
    ],
    "limit": 10,
    "properties": [
      "hs_note_body",
      "_hs_createdate"
    ],
    "sorts": []
  }
}