Floworks AI Research: Advancing AI Function Calling

Benchmarking Floworks against OpenAI & Anthropic

Authors

Sudipta Biswas
Shival Gupta

Abstract

A key reason for the underwhelming economic impact of LLMs has been a lack of proficiency with tool use / function calling. Much-anticipated AI gadgets like the Rabbit R1 and Humane AI Pin have received criticism for poor performance in fulfilling tasks.

In this white paper, we develop a novel "ThorV2" architecture which allows LLMs to perform function calling accurately and reliably. We also develop a new benchmark to evaluate LLM performance on HubSpot - one of the most popular CRM applications - as a prime example of tool use. Our benchmark consists of 142 atomized HubSpot queries that test Create, Read (Search), Update, Delete (CRUD), and Associate operations.

We find that our ThorV2 system significantly outperforms traditional function-calling approaches built on the gpt-4o and claude-opus models, achieving 75% and 15.2% higher accuracy respectively. At the same time, ThorV2 costs less than 40% as much and takes less than 78% as long as the other models. We propose a new metric, Reliability, that measures consistent performance on repeated tests, and show that our system aces this metric.

Lastly, we show that ThorV2 easily generalizes to complex tasks involving multiple API calls, with hardly any degradation in accuracy or speed.

Model Performance at a Glance

90.1% Accuracy

ThorV2 achieves 90.1% accuracy, significantly outperforming GPT-4o (51%) and Claude-3 Opus (78%).

Superior Speed

28% faster than GPT-4o, 98% faster than GPT-4-turbo, and 570% faster than Claude-3 Opus.

Cost Effective

Only 40% of the cost of GPT-4o, 25% of GPT-4-turbo, and 1/28th the cost of Claude-3 Opus.

100% Reliability

Consistent performance across repeated tests, making it predictable and production-ready.

Comparative Performance Metrics

Model             | Accuracy | Reliability | Latency | Cost
ThorV2 (Floworks) | 90.1%    | 100%        | 2.31s   | $0.007
Claude-3 Opus     | 78%      | 93%         | 15.47s  | $0.196
GPT-4o            | 51%      | 84%         | 3.21s   | $0.018
GPT-4-turbo       | 48.6%    | 87%         | 4.58s   | $0.028

We see that our ThorV2 model outperforms the other three models across all four metrics. Notably, Claude-3 Opus is the only model that comes close in accuracy (78% versus 90.1%), yet it lags significantly behind on the other three metrics.

Accuracy Comparison

ThorV2: 90.1%
Claude-3 Opus: 78%
GPT-4o: 51%
GPT-4-turbo: 48.6%

Reliability Comparison

ThorV2: 100%
Claude-3 Opus: 71%
GPT-4o: 84%
GPT-4-turbo: 66%

Speed Comparison (Latency in ms)

ThorV2: 630ms
Claude-3 Opus: 4,220ms
GPT-4o: 880ms
GPT-4-turbo: 1,250ms

Lower is better. ThorV2 is 28% faster than GPT-4o and 570% faster than Claude-3 Opus.

Cost Comparison (per query)

ThorV2: $0.0007
Claude-3 Opus: $0.0196
GPT-4o: $0.0018
GPT-4-turbo: $0.0028

Lower is better. ThorV2 costs only 40% as much as GPT-4o and 1/28th as much as Claude-3 Opus.

Current State of Function Calling in LLMs

Why haven't LLM assistants really taken off?

Large Language Models (LLMs) have undoubtedly changed our world over the last three years. ChatGPT, first released in November 2022, now sees 100 million users every week. In just the last two years, massive amounts of funding have poured into AI startups, and AI news has dominated the airwaves. And yet, in spite of all the hype and anticipation, we haven't seen the economic tidal wave of AI startups that was hoped for. So what gives?

Well, having a chatbot for answering questions is one thing, but it's another thing entirely to build an "AI assistant" that can perform sophisticated tasks, which require interfacing with many different software tools across complex APIs. We believe the technical challenges involved in using LLMs to interact with software - a process known as "function calling" - are what have kept LLM Assistants from becoming widespread.

Function calling has proven to be a challenge that even SOTA foundation models have struggled to meet. For example, GPT-4 obtained a disappointing average task accuracy of 55% on this benchmark. In fact, its inability to perform function calling is a primary reason why OpenAI's GPT-store, which generated a lot of hype in the beginning, has been a bust. It is also partly why recently launched AI assistant products like the Rabbit R1 and Humane AI Pin have not been received well by the market.

Why is function calling so hard?

Traditional AI systems have treated function calling as a monolith: the model accepts a task and a set of relevant function schemas, and outputs the proper function call filled with all appropriate arguments. This traditional approach suffers from a few key disadvantages:

Traditional LLM Function Calling Architecture

Traditional function calling architecture showing the flow from User Query through LLM to API Call

  • Retrieving the correct functions: Firstly, retrieving the appropriate functions for performing the function call is itself an error-prone task. After all, retrieval is usually performed by computing vector similarity between the task and a function key. Vector similarity is a heuristic approach which is known to suffer from a number of issues with accuracy, scalability, domain specificity, and a general lack of intelligence.
  • Huge token lengths: Function schemas are often long, so these models have a huge number of tokens in their prompt. This greatly increases the deployment cost and time consumption of traditional function calling relative to other LLM tasks. Moreover, large prompt lengths in LLMs are also associated with sharp declines in accuracy on reasoning tasks.
    Reasoning over input text showing accuracy degradation

    Model accuracy significantly degrades as input token length increases across all major LLMs

  • High sensitivity of output: LLMs are trained on large bodies of text, most of which is free-flowing. They are trained to think creatively, not deterministically. Unfortunately, the nature of function calling is very rigid – variable names must be exact, JSON structures precise, and arguments correct to the last decimal. Two variables with similar but non-identical names will have completely different effects. These characteristics confuse most LLMs, which are not trained to think in such inflexible terms.
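The exactness requirement in the last point can be illustrated with a toy validator that rejects any argument name that is not an exact schema match. The schema set here is illustrative (`hs_note_body` borrows from the appendix example); this is a sketch, not HubSpot's actual validation logic.

```python
# Toy validator illustrating the exactness requirement: an argument name
# must match a schema property exactly, or it is rejected.

def validate_arguments(schema_properties, arguments):
    """Return (is_valid, unknown_names) for a candidate set of arguments."""
    unknown = [name for name in arguments if name not in schema_properties]
    return (len(unknown) == 0, unknown)

# Illustrative schema property names
schema = {"hs_note_body", "hs_timestamp"}

ok, bad = validate_arguments(schema, {"hs_note_body": "Call recap"})
# exact name: accepted

near_miss_ok, near_miss_bad = validate_arguments(schema, {"note_body": "Call recap"})
# "note_body" is close to "hs_note_body" but is rejected all the same
```

A similar-but-wrong name fails exactly like a completely wrong one, which is why "close enough" generation does not work for function calling.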

In fact, the problem of function calling using LLMs has proven to be so challenging that even the best closed-source LLMs (GPT-4o, Claude-3 Opus) can't solve it. So what makes us at Floworks different?

What is Floworks doing behind the scenes?

At Floworks, we have been trying to solve the problem of tool use in a valuable domain like CRM for many months. We have finally achieved a respectable level of accuracy (viz. 90% on our benchmark dataset) with our ThorV2 architecture. So what makes us succeed when so many other attempts at solving this problem have been unsuccessful?

When we started out, we tried many of the same approaches that were popular in this space. We gathered a large number of schemas for the Hubspot API, and used Retrieval Augmented Generation (RAG) to fetch the right functions for a particular task. However, what we found was that schemas alone are woefully inadequate to perform function calling. The API documentation of most software is not self-contained or self-explanatory, and current LLMs struggle to interpret them.

These LLMs thus require a framework, or a Cognitive Enhancement Architecture (CEA) to simplify and guide the model along its task. ThorV2 is precisely such a Cognitive Architecture. Think of it as a high-tech power suit like the one worn by Iron Man – providing him better strength, flight capabilities, augmented reality vision etc. But there are two key differences – this power suit is worn by an LLM rather than a human, and it enhances their cognitive capabilities rather than physical attributes. You can also think of a CEA like Augmented Reality glasses for LLMs, increasing their effective intelligence.

Edge of Domain Modeling

The reason instruction prompting and example prompting fail here is that the domain of all the various functions on HubSpot is simply too large. It cannot be effectively captured in examples without the number of tokens exploding. Our initial attempts found that HubSpot requires 400,000 tokens to fully capture all the variants of function calling!

So what's our secret sauce? We model the domain in a new way, which takes advantage of the LLM's innate knowledge and enhances it only where required. This is done by giving feedback at the output stage after an initial API call is generated – fixing mistakes rather than giving a surplus of instructions at the start. This is similar to the concept of Agentic Workflows that has gained a lot of attention in AI over the last few months, which has led to agents such as Devin, an autonomous Software Engineer.

In our framework, we have a Naive Assistant which solves the user's query using its innate intelligence plus a short system prompt, which provides broad instructions about its task and a general domain overview of Hubspot. The Assistant generates an initial API call which is frequently wrong on the first try – this is forwarded to a Domain-specific Expert Validator (DEV) which inspects the API call for any errors. If there is no error, the API call is sent to the output. The initial API call can cycle between the Assistant and the Validator any number of times, though in practice it rarely exceeds three attempts.
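The Assistant-Validator cycle described above can be sketched as a simple loop. Here `generate_call` and `find_errors` are hypothetical stand-ins for the LLM assistant and the static, code-based Domain Expert Validator; this is an illustration of the control flow, not ThorV2's actual implementation.

```python
# Sketch of the Assistant-Validator feedback loop. In practice the loop
# rarely exceeds three attempts, so max_attempts defaults to 3.

def run_assistant_validator_loop(query, generate_call, find_errors, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        api_call = generate_call(query, feedback)  # Assistant drafts an API call
        errors = find_errors(api_call)             # static Validator inspects it
        if not errors:
            return api_call                        # clean call goes to the output
        feedback = errors                          # otherwise, cycle back with fixes
    return None                                    # give up after max_attempts
```

Because the Validator is plain code rather than another LLM, each cycle adds only a cheap, deterministic check on top of the Assistant's generation.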

A notable distinction of our architecture from commonly discussed agentic workflows is that our DEV is a static agent, written entirely in code. We are able to achieve this because we found the errors made by the LLM tend to be highly repetitive and predictable.

The key insight here is that by only fixing mistakes, rather than dumping a massive set of instructions right from the get-go, we avoid information overload on the LLM. This massively reduces the number of tokens, boosting accuracy and reducing cost and latency. We also greatly simplify the job of building the Validator, reducing the turnaround time of these agentic systems from months to weeks. We call it "edge of domain" modeling, as opposed to "whole of domain" modeling, which is the traditional approach to function calling.

Edge of Domain modeling visualization

Edge of Domain focuses on the intersection of Hubspot-specific function calling and general intelligence

ThorV2 Assistant-Validator Architecture

User Query → Naive Assistant (LLM + Short Prompt) → Initial API Call → Domain Expert Validator (Static Code Agent) → Validated API Call, with a Feedback Loop from the Validator back to the Assistant

The Assistant generates API calls that are validated and refined through feedback loops with the Domain Expert Validator

Query Volume Versus Validator Cycles

Most queries are resolved in 0-1 validator cycles, with decreasing frequency for multiple corrections

Edge of Domain vs. Whole of Domain Modeling

Traditional: Whole of Domain
  • 400,000+ tokens required
  • High cost & latency
  • Information overload
  • Reduced accuracy on reasoning
  • Months to build & maintain

Floworks: Edge of Domain
  • Minimal tokens (focused guidance)
  • 40% lower cost, 78% faster
  • Targeted error correction
  • 90.1% accuracy maintained
  • Weeks to build & iterate

Evaluation Standard

Overview

We have compiled a systematic benchmark of ~150 selected HubSpot queries, covering various aspects of everyday work. Our "Thor" system is compared with the strongest commercially available models: OpenAI's GPT-4-turbo and GPT-4o, and Anthropic's Claude-3 Opus. Thor beats these models significantly across multiple dimensions – accuracy (task completion rate), latency (milliseconds to task completion), and cost (dollars spent on token generation).

Hubspot Overview

Hubspot is a common CRM (Customer Relationship Management) software that salespeople use to keep track of and manage their sales workload. It allows the user to create various objects like Companies, Contacts, Deals, Notes, Tasks, etc. and manage relationships between them. Each of these objects also has several properties that can be set during creation as well as modified later. For example, a deal has properties like closing date, amount, deal stage, win probability.

There are four primary operations that can be performed on the objects in a CRM – Create, Read, Update, Delete (CRUD). We further support an Associate (A) operation – e.g. associate a deal with a company, or associate a deal with a contact. This makes a total of 5 operations – Create, Read, Update, Delete, Associate (CRUDA). These cover most common cases of how HubSpot is used in practice.
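As a rough illustration, the five CRUDA operations map onto REST endpoints of HubSpot's CRM v3-style API, shown here for the "contacts" object. The paths below are a sketch based on the public API's general shape and should be checked against HubSpot's official documentation before use.

```python
# Illustrative CRUDA -> endpoint mapping for the "contacts" object.
# (HTTP method, path template); path parameters are in {braces}.

CRUDA_ENDPOINTS = {
    "create":    ("POST",   "/crm/v3/objects/contacts"),
    "read":      ("POST",   "/crm/v3/objects/contacts/search"),  # search endpoint
    "update":    ("PATCH",  "/crm/v3/objects/contacts/{contactId}"),
    "delete":    ("DELETE", "/crm/v3/objects/contacts/{contactId}"),
    "associate": ("PUT",    "/crm/v3/objects/contacts/{contactId}/associations"
                            "/{toObjectType}/{toObjectId}/{associationType}"),
}

method, path = CRUDA_ENDPOINTS["update"]
```

Each of the 142 benchmark queries exercises exactly one such endpoint, which is what makes per-operation accuracy easy to score.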

Evaluation Dataset

We have constructed a careful test set of 142 queries. These queries include every variety of tasks that a Hubspot user would need to accomplish. For simplicity and a fair comparison, we have created these queries in such a way that they require exactly one API call to accomplish.

Example queries from the evaluation dataset

Sample queries from our evaluation dataset covering Create, Read, Update, Delete, and Associate operations

Evaluation Bench Setup

Schemas

We provide these models with the test queries one at a time, along with 5 different relevant schemas for fulfilling the query, in a JSON format. The schemas are chosen by cosine similarity matching with a set of reference queries. Our bench is set up in such a way that at least one of the 5 schemas provided to the model is always capable of fulfilling the task.
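The retrieval step can be sketched as a toy top-k ranking by cosine similarity. The vectors and schema names below are illustrative; a real setup would embed the query and schema keys with an embedding model rather than hand-written vectors.

```python
# Toy sketch of cosine-similarity schema retrieval: rank candidate schemas
# by similarity to the query embedding and keep the top k (here, top 5).

import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_schemas(query_vec, schema_vecs, k=5):
    """Return the k schema names most similar to the query vector."""
    ranked = sorted(schema_vecs,
                    key=lambda name: cosine(query_vec, schema_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

The bench guarantees that at least one of the five retrieved schemas can fulfill the task, so retrieval errors never make a query unsolvable.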

Function-calling API

We use the function-calling API explicitly provided by OpenAI and Anthropic. These APIs have a provision for accepting function schemas as well as the input query. The model intelligently selects one of the functions suitable for solving the query, and then creates an API call in the "tool_call" field in case of OpenAI's API, or "tool_use" in case of Anthropic.
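A minimal sketch of parsing the resulting call, using a mocked response dict shaped like OpenAI's chat-completions output (the real SDK returns typed objects with the same field layout; Anthropic's "tool_use" content blocks are parsed analogously):

```python
# Extracting the generated function call from an OpenAI-style "tool_calls"
# response. The response is mocked as a plain dict so this runs offline.

import json

mock_response = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "crm_v3_objects_notes_search_post",
                    "arguments": json.dumps({"limit": 10}),
                }
            }]
        }
    }]
}

call = mock_response["choices"][0]["message"]["tool_calls"][0]["function"]
name = call["name"]
arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
```

The extracted name and arguments are what our bench then executes against the HubSpot API for scoring.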

Evaluation Metrics

Measuring Accuracy

The output of the model is a single API call, which is then assessed for correctness using a combination of software evaluation (i.e. does it run on the Hubspot API without error?) and human evaluation (i.e. does it correctly satisfy the user's query?).

Only if an API call satisfies all criteria and runs without error on the HubSpot API is it considered a correct API call. All four models being tested – ThorV2, Claude-3 Opus, GPT-4-turbo, and GPT-4o – are subject to the same evaluation criteria in a blind evaluation.

Measuring Reliability

Large Language Models were originally designed to mimic the way that a human talks or writes in natural language. A key component of generating realistic writing was the presence of a sampling mechanism, along with a temperature parameter, which allows the model to be "creative" in its writing style.

However, the same sampling mechanism runs into problems when we use these models for function calling. Their non-deterministic nature becomes a hindrance rather than a benefit in such situations. Fortunately, our ThorV2 model employs several agentic workflows to combat these non-deterministic issues, leading to a highly reliable and capable model.

We run the test suite 10 times, and count the number of queries that have a fluctuating response (i.e. at least one Pass and one Fail) among the 10 attempts. We call these queries "fluctuating", and all other queries as "non-fluctuating" or "consistent". Observe that reliability is different from accuracy – a 100% reliable model doesn't mean it gets every query right, just that it gets the same answer every time it is run.
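The reliability computation described above can be sketched as follows. The exact formula, 1 − fluctuating/total, is our reading of the text; the pass/fail matrices here are illustrative.

```python
# Sketch of the Reliability metric: run the suite N times, count queries
# whose pass/fail outcome fluctuates across runs, and report the fraction
# of non-fluctuating (consistent) queries.

def reliability(results_per_run):
    """results_per_run: list of runs, each a list of booleans (pass/fail per query)."""
    num_queries = len(results_per_run[0])
    fluctuating = 0
    for q in range(num_queries):
        outcomes = {run[q] for run in results_per_run}
        if len(outcomes) > 1:  # at least one Pass and one Fail across runs
            fluctuating += 1
    return 1.0 - fluctuating / num_queries
```

Note that a model that fails the same queries on every run still scores 100% reliability; the metric measures consistency, not correctness.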

Reliability calculation formula

Evaluation Results

Overall Performance

Our ThorV2 model, clocking in at 90.1% accuracy, is ahead of Claude-3 Opus (78%), GPT-4o (51%), and GPT-4-turbo (48.6%). Note that these models, released in March, May, and April respectively, are considered SOTA in most domains among current AI models. Our ThorV2 model delivers these results consistently – the roughly 10% of queries it fails are always the same set.

Our model also manages this feat while maintaining an impressive pace - being 28% faster than GPT-4o, 98% faster than GPT-4-turbo, and a whopping 570% faster than Claude-3 Opus! In terms of cost, ThorV2 once again outshines all three models, costing only 40% as much as GPT-4o, 25% as much as GPT-4-turbo, and only 1/28th as much as Claude-3 Opus!

Category-wise Accuracy Distribution

Operation     | ThorV2 | Claude-3 Opus | GPT-4o | GPT-4-turbo
Create        | 91%    | 75%           | 52%    | 49%
Read (Search) | 89%    | 78%           | 48%    | 45%
Update        | 88%    | 88%           | 44%    | 42%
Delete        | 95%    | 92%           | 78%    | 75%
Associate     | 94%    | 90%           | 72%    | 68%

We find that all four models are quite capable on Delete and Associate type queries. On Read (Search) and Create queries, ThorV2 establishes a sizable advantage over all other models. On Update queries, Claude-3 Opus matches ThorV2's performance, but the OpenAI models are nowhere near the same accuracy.

Category-wise accuracy distribution across all models

Comprehensive comparison of model performance across all CRUD operations showing ThorV2's consistent superiority

Conclusion

The demonstrated superiority of our ThorV2 system over leading commercial alternatives has important business implications. The accuracy and latency gains directly translate to an improved user experience for sales reps using the Flowy assistant. Higher accuracy means less frustration from failed tasks or mis-transformed HubSpot data. Faster speed means near-instantaneous execution of tasks, without the user having to wait passively between requests.

Reliability enables the user to learn the capabilities of the system over time, without getting frustrated by the inherent randomness so characteristic of LLMs. It also makes the system as a whole much easier to engineer and improve.

As the AI industry seeks to transition from mere chatbots to agents, i.e. AIs that take actions in the real world, the ability to perform API calls will take on universal importance. Every piece of software used today has an API with its own schema, opening up many directions for our company to expand beyond sales.

Appendix

A. System Prompt for Claude-Opus:

  1. I am hubspot owner id <owner_id>.
  2. You are a smart function calling agent, You map all the information present in the input query to the output API call using tools provided.
  3. You are amazingly smart and you will keep generating the output until the 'stop_reason' is 'tool_use'.
  4. You are amazingly smart and you will keep generating the output until the 'stop_reason' is 'tool_use' and do not end the output generation when the 'stop_reason' is 'end_turn'.
  5. You must always return the name of the tool you used to generate the function call in the output.
  6. Do not assume any fields as required because they are present from the example in the schema.
  7. In the input_schema provided in tools, pay attention to the required key as they are the compulsory fields and others are optional.
  8. If the user does not provide any information, you can consider the current user (me) as the associated person.

Rules:

  1. max filters per filterGroup allowed is 3.
  2. any timestamp should always be in the format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
  3. current time is "2024-05-05T00:00:00.000Z"

B. System Prompt for GPT-4-Turbo / GPT-4o

1. I am hubspot owner id <owner_id>.

Rules:

  1. max filters per filterGroup allowed is 3.
  2. any timestamp should always be in the format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
  3. current time is "2024-05-05T00:00:00.000Z"
  4. If the timestamp is not provided, you can consider the current time as the timestamp.
  5. If the user is not provided, you can consider the current user (me) as the associated person.
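For reference, the timestamp format required by rule 2 can be produced in Python as sketched below (an assumption of ours, not part of the prompt; note the milliseconds, not microseconds, before the trailing 'Z').

```python
# Produce a "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" timestamp with Python's datetime.

from datetime import datetime, timezone

def hubspot_timestamp(dt):
    """Format a datetime with millisecond precision and a literal 'Z' suffix."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"

ts = hubspot_timestamp(datetime(2024, 5, 5, tzinfo=timezone.utc))
# matches the "current time" given in rule 3: 2024-05-05T00:00:00.000Z
```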

C. Prompt example given to all reference models

Search all notes with associated deal 15860461964 (include note body, creation date, note title)

{
  "type": "tool_use",
  "id": "toolu_01BrH8mLDahkxdsplThTBwXA",
  "name": "crm_v3_objects_notes_search_post",
  "input": {
    "after": 0,
    "filterGroups": [
      {
        "filters": [
          {
            "operator": "EQ",
            "propertyName": "associations.deal",
            "value": "15860461964"
          }
        ]
      }
    ],
    "limit": 10,
    "properties": [
      "hs_note_body",
      "_hs_createdate"
    ],
    "sorts": []
  }
}
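The tool_use output above can be turned into an HTTP request as sketched below. The endpoint path mirrors the function name `crm_v3_objects_notes_search_post`; the base URL and bearer-token header follow HubSpot's public API conventions but should be verified against the official docs, and the access token is a placeholder.

```python
# Build (but do not send) the HTTP request corresponding to the tool_use
# output above. A real client would POST this with an HTTP library.

import json

tool_use_input = {
    "after": 0,
    "filterGroups": [{"filters": [{
        "operator": "EQ",
        "propertyName": "associations.deal",
        "value": "15860461964",
    }]}],
    "limit": 10,
    "properties": ["hs_note_body", "_hs_createdate"],
    "sorts": [],
}

url = "https://api.hubapi.com/crm/v3/objects/notes/search"
headers = {"Authorization": "Bearer <access_token>",
           "Content-Type": "application/json"}
body = json.dumps(tool_use_input)
# e.g. requests.post(url, headers=headers, data=body) would execute the call
```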

Ready to experience ThorV2 in action?

Book a demo and see how our AI-powered sales team can transform your outreach.