Evaluation Standard
Overview
We have compiled a systematic benchmark of ~150 selected Hubspot queries covering various aspects of everyday sales work. Our “Thor” system is compared with the strongest commercially available models, namely OpenAI’s GPT-4-turbo and GPT-4o and Anthropic’s Claude-3 Opus. Thor beats these models significantly across multiple dimensions – accuracy (task completion rate), latency (milliseconds taken to complete a task), and cost (dollars spent on output token generation).
Hubspot Overview
Hubspot is a popular CRM (Customer Relationship Management) software that salespeople use to keep track of and manage their sales workload [7]. It allows the user to create various objects such as Companies, Contacts, Deals, Notes, and Tasks, and to manage relationships between them. Each of these objects also has several properties that can be set during creation and modified later. For example, a deal has properties like closing date, amount, deal stage, and win probability.
There are four primary operations that can be performed on the objects in a CRM – Create, Read, Update, Delete (CRUD). We further support an Associate (A) operation – e.g. associating a deal with a company, or associating a deal with a contact. This makes a total of 5 operations – Create, Read, Update, Delete, Associate (CRUDA). These cover the most common ways Hubspot is used in practice. Note that we exclude certain other uses of Hubspot, such as sending emails or creating events, because users have told us they prefer other apps for these tasks (e.g. Gmail, calendar apps).
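For illustration, the sketch below maps the five CRUDA operations onto Hubspot-style CRM REST endpoints for the Deals object. The paths and the association route are indicative only (they assume a v3-style API), so the official Hubspot API documentation remains the authority.

# Illustrative mapping of the five CRUDA operations onto Hubspot-style CRM
# endpoints for the Deals object. Paths are indicative only; consult the
# official Hubspot API docs for the authoritative routes and versions.
CRUDA_ENDPOINTS = {
    "Create":    ("POST",   "/crm/v3/objects/deals"),
    "Read":      ("GET",    "/crm/v3/objects/deals/{dealId}"),
    "Update":    ("PATCH",  "/crm/v3/objects/deals/{dealId}"),
    "Delete":    ("DELETE", "/crm/v3/objects/deals/{dealId}"),
    # Association of a deal with a company; the association-type segment
    # varies by API version, so treat this route as a placeholder.
    "Associate": ("PUT",    "/crm/v3/objects/deals/{dealId}/associations/companies/{companyId}/{associationType}"),
}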
Evaluation Dataset
We have constructed a careful test set of 142 queries. These queries span the variety of tasks that a Hubspot user needs to accomplish. For simplicity and a fair comparison, we have created these queries in such a way that each requires exactly one API call to accomplish.
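For concreteness, one benchmark entry might pair a natural-language query with the single API call expected to fulfill it. The structure below is a hypothetical Python sketch – the field names, property names, and endpoint are illustrative, not the actual dataset format.

# Hypothetical example of one benchmark entry (field names are illustrative,
# not the actual dataset schema).
example_entry = {
    "query": "Change the close date of the 'Acme renewal' deal to 2024-10-15.",
    "expected_call": {
        "method": "PATCH",
        "endpoint": "/crm/v3/objects/deals/{dealId}",
        "body": {"properties": {"closedate": "2024-10-15"}},
    },
}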

Evaluation Bench Setup
We only compare ourselves against models that explicitly support function calling. Most top models, such as GPT-4 and Claude-3, meet this requirement.
Schemas
We provide these models with the test queries one at a time, along with 5 different relevant schemas for fulfilling the query, in a JSON format. The schemas are chosen by cosine similarity matching with a set of reference queries. Our bench is set up in such a way that at least one of the 5 schemas provided to the model is always capable of fulfilling the task.
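A minimal sketch of this retrieval step is given below, assuming pre-computed embedding vectors for the reference queries; the embedding model and the exact matching pipeline used in our bench are not specified here.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_schemas(query_embedding, reference_embeddings, schemas, k=5):
    # reference_embeddings[i] is the embedding of the reference query
    # associated with schemas[i]. Rank schemas by similarity to the test
    # query and return the k best matches.
    scores = [cosine_similarity(query_embedding, ref) for ref in reference_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [schemas[i] for i in ranked[:k]]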
Function-calling API
We use the function-calling APIs explicitly provided by OpenAI and Anthropic. These APIs accept function schemas along with the input query. The model selects one of the functions suitable for solving the query, and then creates an API call in the “tool_calls” field in the case of OpenAI’s API, or a “tool_use” block in the case of Anthropic’s.
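As a sketch of how a query and its candidate schemas are submitted, the snippet below uses OpenAI’s Python SDK (openai >= 1.0) with the chat-completions tools interface; the create_deal schema is a placeholder, and the Anthropic path is analogous via its messages API and “tool_use” blocks.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of the 5 candidate schemas passed to the model (placeholder example).
tools = [{
    "type": "function",
    "function": {
        "name": "create_deal",
        "description": "Create a new deal in the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "dealname": {"type": "string"},
                "amount": {"type": "number"},
            },
            "required": ["dealname"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Create a deal called 'Acme renewal' for $12,000."}],
    tools=tools,
)

# The selected function and its arguments appear in the tool_calls field
# of the response message (None if the model chose not to call a tool).
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)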
Evaluation Metrics

Measuring Accuracy
The output of the model is a single API call, which is then assessed for correctness using a combination of software evaluation (i.e. does it run on the Hubspot API without error?) and human evaluation (i.e. does it correctly satisfy the user’s query?). Human evaluation consists of 5 steps:

Only if an API call satisfies all 5 of the above criteria, and also runs without error on the Hubspot API, is it considered a correct API call. All four models being tested – ThorV2, Claude-3 Opus, GPT-4-turbo, and GPT-4o – are subject to the same evaluation criteria in a blind evaluation.
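A simplified sketch of this grading logic is shown below; execute_on_hubspot and human_review are hypothetical callables standing in for the software check and the 5-step human evaluation, respectively.

from typing import Callable, Sequence

def is_correct(
    api_call: dict,
    query: str,
    execute_on_hubspot: Callable[[dict], bool],
    human_review: Callable[[dict, str], Sequence[bool]],
) -> bool:
    # Software evaluation: the call must run on the Hubspot API without error.
    # execute_on_hubspot is a hypothetical helper that issues the call and
    # returns True on a non-error response.
    if not execute_on_hubspot(api_call):
        return False
    # Human evaluation: graders check the call against the 5 criteria;
    # human_review is a hypothetical helper returning one boolean per criterion.
    return all(human_review(api_call, query))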
Measuring Reliability
Large Language Models were originally designed to mimic the way a human talks or writes in natural language. A key component of generating realistic writing is a sampling mechanism, controlled by a temperature parameter, which allows the model to be “creative” in its writing style.
However, the same sampling mechanism runs into problems when we use these models for function calling. Their non-deterministic nature becomes a hindrance rather than a benefit in such situations (see “LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation”). Fortunately, our ThorV2 model employs several agentic workflows to combat these non-deterministic issues, leading to a highly reliable and capable model. Besides enhancing the user experience, another benefit of ThorV2’s high reliability is that the model is much easier to build, maintain, and improve.
We can measure reliability using the following method:
We run the test suite 10 times and count the number of queries that have a fluctuating response (i.e. at least one Pass and one Fail) among the 10 attempts. We call these queries “fluctuating”, and all other queries “non-fluctuating” or “consistent”. The reliability metric is:

Reliability = (number of consistent queries / total number of queries) × 100%
Observe that reliability is different from accuracy – a 100% reliable model doesn’t mean it gets every query right, just that it gets the same answer every time it is run.
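The computation below sketches this metric, assuming a results table with one Pass/Fail verdict per query per run; the variable names are illustrative.

from typing import Dict, List

def reliability(results: Dict[str, List[bool]]) -> float:
    # results maps each query ID to its Pass/Fail outcomes (True = Pass)
    # over the 10 runs. A query is "fluctuating" if it has at least one Pass
    # and at least one Fail; all other queries are "consistent".
    total = len(results)
    consistent = sum(
        1 for outcomes in results.values()
        if all(outcomes) or not any(outcomes)  # same verdict on every run
    )
    return 100.0 * consistent / total

# Example: 10 runs of two queries, one consistent and one fluctuating.
example = {"q1": [True] * 10, "q2": [True] * 9 + [False]}
print(reliability(example))  # 50.0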