A Closer Look at Two Key Claude 3.7 Benchmarks: SWE-bench & TAU-bench

2025-2-27

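Claude 3.7 Sonnet has arrived amid high expectations. The reason: ever since its release, Claude 3.5 Sonnet has been the best model for AI coding agents, with no real rival in overall quality. The later o1, o3, and DeepSeek releases did not dislodge it, which made people all the more curious whether Claude 3.7 Sonnet could push AI coding further.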

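Among the benchmarks published with Claude 3.7, two are closely tied to AI coding agent performance:

  1. Agentic coding, SWE-bench: measures the ability to solve real software engineering problems.
  2. Agentic tool use, TAU-bench: measures the ability to understand user intent and call tools to carry out operations.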

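As the published chart shows, SWE-bench improves significantly: the resolution rate rises from 49% to as high as 70%. TAU-bench also gains a solid 10 percentage points or so in absolute terms. The release clearly focused on capabilities relevant to AI coding agents.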

Below we look at what these two benchmarks actually test, which gives a rough idea of where the upper limit of current model capability lies.

SWE-bench

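SWE-bench was developed by the Princeton NLP group and first proposed in October 2023. The goal was to evaluate how well large models solve real software engineering problems, rather than only algorithm puzzles as earlier benchmarks did. That was still the era of Claude 2 and GPT-4. As AI coding took off, OpenAI joined in refining the benchmark, and it has gradually become a mainstream one.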

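Data construction

There are three steps:

  1. Pick reliable repositories: 12 popular open-source Python repositories were chosen. The criteria: widely used, maintained over a long period, with many pull requests that fix issues, well-managed PRs, and good test coverage. The PRs that fix issues in these repositories are the test cases we want, but they go through some filtering first.
  2. Attribute filtering: 1) the PR clearly resolves a specific issue; 2) the PR includes test cases, so it is easy to judge from the tests whether a code change works. Both can be checked before anything is run.
  3. Runtime filtering: applying the PR must turn at least one test from failing to passing, and the program must run without errors, so that results are easy to evaluate.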
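The runtime filter in step 3 can be sketched in a few lines. This is only an illustration, not the SWE-bench pipeline itself: the function names and the "test name to pass/fail" dictionaries are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def split_tests(before: Dict[str, bool], after: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Classify tests by outcome before and after applying the PR.

    `before` and `after` map a test name to True (passed) or False (failed).
    A test missing from `before` (for example one added by the PR) counts as failing.
    """
    fail_to_pass = sorted(t for t, ok in after.items() if ok and not before.get(t, False))
    pass_to_pass = sorted(t for t, ok in after.items() if ok and before.get(t, False))
    return fail_to_pass, pass_to_pass


def keep_candidate(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Runtime filter: keep the PR only if at least one test flips from fail to pass."""
    fail_to_pass, _ = split_tests(before, after)
    return len(fail_to_pass) > 0


# Hypothetical test outcomes for one candidate PR.
before = {"test_a": True, "test_b": False}
after = {"test_a": True, "test_b": True, "test_c": True}
print(split_tests(before, after))
print(keep_candidate(before, after))
```

The two lists produced here correspond to the FAIL_TO_PASS and PASS_TO_PASS fields of the dataset.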

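Data is extracted from popular GitHub projects under these rules, and it can be refreshed continuously so that models cannot "cheat" by having already seen it.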

[Figure: the selected popular Python libraries and the number of dataset instances drawn from each]

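The steps above produce the SWE-bench dataset. OpenAI later filtered it by hand, dropping weaker cases such as issues with inaccurate descriptions or development environments that are hard to set up and test, and carefully validated each selected case manually. The result is 500 samples that make up the SWE-bench_Verified dataset, which is what is usually tested today.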

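Here is what each instance in the dataset contains:

instance_id: instance ID, usually in the form repo_owner__name-PR-number

// Basic code information
repo: repository name
base_commit: commit id of the code before the PR was submitted; pins the code baseline
version: version number used for running
environment_setup_commit: commit id that serves as the baseline for installing the runtime environment
created_at: PR creation time

// Basic PR information
problem_statement: content of the issue the PR addresses, i.e. the problem to solve
test_patch: the test-case patch submitted with this PR
FAIL_TO_PASS: test cases that fail before the fix and pass once the fixing PR is applied
PASS_TO_PASS: test cases that should pass both before and after the PR is applied
patch: the fix patch from this PR, effectively the reference answer
hints_text: comments discussing the issue on GitHub before the PR was submitted. Optional; using this data is prohibited for leaderboard submissions.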

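The code information, problem description, and test cases are the key parts; the rest is there to get the program running and to verify the fix.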
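As a small illustration of the instance_id convention, the sketch below splits an id into its parts. The helper name and the two example ids are hypothetical, chosen only to show the format, including a repository name that itself contains a hyphen.

```python
from typing import NamedTuple


class InstanceRef(NamedTuple):
    owner: str
    name: str
    pr_number: int


def parse_instance_id(instance_id: str) -> InstanceRef:
    """Split 'repo_owner__name-PR-number' into its three parts."""
    # Split on the LAST hyphen, because repository names may contain hyphens.
    repo_part, pr_part = instance_id.rsplit("-", 1)
    owner, name = repo_part.split("__", 1)
    return InstanceRef(owner, name, int(pr_part))


# Illustrative ids, not taken from the article.
print(parse_instance_id("django__django-11099"))
print(parse_instance_id("scikit-learn__scikit-learn-13142"))
```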

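Test execution

The overall flow (shown as a figure in the original post) is: the issue description and the codebase are the input; the model outputs the code changes; finally an environment applies the generated changes and runs the test cases. If the unit tests that failed before the change now pass, the task counts as done.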
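The pass/fail decision at the end of this flow can be sketched as follows, assuming the test run has already been reduced to a "test name to passed" dictionary. The function name is hypothetical; the rule it encodes is that every FAIL_TO_PASS test must now pass and no PASS_TO_PASS test may regress.

```python
from typing import Dict, Iterable


def is_resolved(fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str],
                results: Dict[str, bool]) -> bool:
    """Return True only if every required test passed after applying the model's patch.

    A test that is missing from `results` (for example because the run crashed)
    is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)


# Hypothetical instance: one test the fix must repair, one that must keep passing.
f2p = ["test_unicode_password"]
p2p = ["test_plain_password"]

print(is_resolved(f2p, p2p, {"test_unicode_password": True, "test_plain_password": True}))
print(is_resolved(f2p, p2p, {"test_unicode_password": True, "test_plain_password": False}))
```

The second call shows why PASS_TO_PASS matters: a patch that fixes the issue but breaks existing behavior does not count as a resolution.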

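The biggest problem in this process is how to give the model code context. These popular codebases average about 430,000 lines, so they cannot be passed in whole; some retrieval capability is needed.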

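The paper offers two approaches:

  1. "Cheating" (oracle retrieval): when constructing the data we already obtained the patch from the PR in which a human fixed the issue. The files that patch touches are the most important code context and can be handed to the model directly. This is close to the reference answer, except for problems that need context from more files. It is used only for experiments, or to measure the hit rate of other retrieval methods.
  2. Sparse retrieval: use BM25 to search for relevant code based on the issue description, capping the retrieved context at 13k to 50k tokens. In the paper's experiments, with a 27k-token context this method retrieves the files touched by the PR only about 40% of the time.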
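To make the sparse-retrieval idea concrete, here is a minimal pure-Python sketch of Okapi BM25 ranking followed by greedy packing into a context budget. It is not the paper's implementation: the tokenizer is a crude regex stand-in, the `k1` and `b` values are common defaults rather than the paper's settings, the budget is counted in these crude tokens, and the toy repository is invented. It assumes a non-empty set of files.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric/underscore runs (crude stand-in tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query: str, files: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return file paths sorted by descending Okapi BM25 score against the query."""
    docs = {path: tokenize(src) for path, src in files.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: the number of files in which each term appears.
    df = Counter(term for toks in docs.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for path, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return sorted(scores, key=scores.get, reverse=True)


def pack_context(ranked: List[str], files: Dict[str, str], budget: int) -> List[str]:
    """Greedily take top-ranked files until the next one would exceed the budget."""
    chosen, used = [], 0
    for path in ranked:
        cost = len(tokenize(files[path]))
        if used + cost > budget:
            break
        chosen.append(path)
        used += cost
    return chosen


# Toy repository and issue text, invented for illustration.
files = {
    "auth/login.py": "def login(user, password): check password hash and start session",
    "utils/dates.py": "def parse_date(text): return date from text",
    "README.md": "project readme",
}
issue = "login fails when password contains unicode"
ranked = bm25_rank(issue, files)
print(ranked[0])
print(pack_context(ranked, files, budget=12))
```

Comparing the packed file list against the files touched by the human patch is how a hit rate such as the one quoted above can be measured.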

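These two retrieval methods are only reference points from the paper's experiments. When actually running SWE-bench, each model uses its own approach. Because retrieval accuracy strongly affects the success rate, many leaderboard entries are results for an agent plus a model rather than for a bare LLM.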

The notes on Claude 3.7's scores say:

SWE-bench Verified: Claude 3.7 Sonnet scores 62.3% out of 500 problems using pass@1 with bash/editor tools plus a "thinking tool" for single-attempt patches — no additional test-time compute used. 70.3% score uses internal scoring and custom scaffold on a reduced subset of problems. Scaffold Deepseek R1 results use the "Agentless" framework. OpenAI results from o3-mini system card cover a different subset of problems with a custom compute.

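Anthropic does not spell out how code retrieval is done. The appendix of the official blog post mentions that the runtime setup is minimal, providing only the ability to run commands and edit files, so it appears the repository directory is simply handed to the LLM, which searches the files on its own. The notes also mention that the DeepSeek score was run with the Agentless framework. Agentless documents how it runs SWE-bench and how exactly it does code retrieval; see Agentless/README_swebench.md.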

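The SWE-bench test set is in fact fairly narrow: it is all Python, and almost all pure logic code, with no UI rendering and no other languages. Many real software engineering scenarios are not covered, so even a score of 100% would not mean a model can solve the vast majority of engineering problems.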

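Still, progress comes step by step. When SWE-bench first appeared the resolution rate was in the single digits; over a bit more than a year it climbed steadily, and Claude 3.7 has now reached 70%. Once this benchmark is solved, harder ones are waiting. SWE-bench Multimodal is one of them: it includes issue-fixing cases involving JS UI rendering, and Claude 3.5 resolves only about 12% of them, so there is still a long way to go.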

TAU-bench

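TAU-bench, also written τ-bench, is a benchmark for evaluating the performance and reliability of AI agents in real-world scenarios. It comes from Sierra, an AI startup co-founded by OpenAI board chair Bret Taylor and former Google executive Clay Bavor that builds AI agents for enterprises.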

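TAU-bench defines two domains:

  1. Airline: simulates users querying flights, booking, rebooking, requesting refunds, and using airport services. It tests the model's ability to use tools to understand what the user needs, provide accurate information, follow business rules and procedures, and carry out operations correctly.
  2. Retail: simulates shopping inquiries, product recommendations, order changes, and returns and exchanges. It likewise tests the ability to understand user needs and to handle order-related problems accurately in this setting.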

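Both domains contain many complex tasks involving multi-turn conversations between the agent and the user, as well as the agent's use of tools to obtain information. Together these tasks evaluate the reasoning, rule following, long-term memory, and tool calling that an agent needs. To run this flow, the TAU-bench project also implements a complete agent framework.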

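Data construction

Taking retail as the example, the data prepared for testing consists of:

  1. Database: JSON files simulating order information and product attributes in e-commerce retail.
  2. Tools: functions that operate on the database, such as fetching order or product information, modifying an order, or changing the shipping address. The tool descriptions are given to the LLM at the start so it knows which tools it can call; during the conversation the LLM calls the appropriate tools, based on the user's intent, to modify the database.
  3. Policy: the system prompt, which lays out the background of the simulated retail scenario, including order status definitions, return and exchange rules, payment method rules, and tool calling rules.
  4. Tasks: the predesigned test set. Each test case contains an instruction and actions. The instruction states what the user in this case wants. The actions list which tool calls should be made to change the database and satisfy that request; they are the reference answer, used only for final verification and never given as input during a run.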

The airline domain has the same kinds of data, except that the content becomes flights, reservations, and user information management.

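Test execution

How a test case runs: the inputs are the user instruction, the domain policy system prompt, and the tool descriptions. The LLM then loops, producing user messages, assistant replies, and tool calls step by step. The whole thing is a generic agent interaction loop, much like the OpenHands flow discussed in a previous post, except that here the user input is also generated by an LLM that role-plays the user based on the instruction.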
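The loop described above can be sketched with scripted stand-ins for both LLMs. Everything here is an assumption made for illustration (the message shapes, the `run_episode` and `set_cabin` names, the toy database); the real framework calls a model for the agent and another for the simulated user. The only detail taken from the benchmark's trajectories is the `###STOP###` marker that ends a conversation.

```python
from typing import Any, Callable, Dict, List

STOP = "###STOP###"
Message = Dict[str, Any]


def run_episode(agent: Callable[[List[Message]], Message],
                user: Callable[[List[Message]], str],
                tools: Dict[str, Callable[..., str]],
                db: Dict[str, Any],
                max_turns: int = 20) -> List[Message]:
    """Alternate between the agent and the simulated user until the user stops."""
    messages: List[Message] = [{"role": "user", "content": user([])}]
    for _ in range(max_turns):
        reply = agent(messages)
        messages.append(reply)
        if "tool_call" in reply:
            # Tool calls act on the database; the result goes back to the agent only.
            call = reply["tool_call"]
            result = tools[call["name"]](db, **call["arguments"])
            messages.append({"role": "tool", "content": result})
            continue
        # A plain assistant message is shown to the simulated user, who answers.
        user_text = user(messages)
        messages.append({"role": "user", "content": user_text})
        if STOP in user_text:
            break
    return messages


def set_cabin(db: Dict[str, Any], reservation_id: str, cabin: str) -> str:
    """Hypothetical tool: change the cabin class of a reservation."""
    db[reservation_id]["cabin"] = cabin
    return "ok"


def scripted(replies: list) -> Callable[[List[Message]], Any]:
    """Stand-in for an LLM: return the next canned reply on every call."""
    it = iter(replies)
    return lambda messages: next(it)


db = {"R1": {"cabin": "basic_economy"}}
user = scripted(["Please upgrade R1 to economy.", "Thanks, that's all. ###STOP###"])
agent = scripted([
    {"role": "assistant",
     "tool_call": {"name": "set_cabin", "arguments": {"reservation_id": "R1", "cabin": "economy"}}},
    {"role": "assistant", "content": "Done, R1 is now in economy."},
])
messages = run_episode(agent, user, {"set_cabin": set_cabin}, db)
print(len(messages), db)
```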

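During the multi-turn conversation the model calls tools that modify the database. After the conversation ends, the actions predefined in the test case are run as well, and the resulting database changes are compared with those made by the model's tool calls. If they match, the test case passes; otherwise it fails.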

Here is a concrete example. This test case contains the instruction, which describes what the user wants, and the reference-answer actions:

{
    "annotator": 0,
    "user_id": "omar_rossi_1241",
    "instruction": "Your user id is omar_rossi_1241. For your upcoming trip from New York to Chicago, you want to change the passenger to yourself, upgrade it to economy class, and have 3 checked bags. You prefer gift card payment. Your birthday is in your user profile so you do not prefer to provide it. You are reactive to the agent and will not say anything that is not asked.",
    "actions": [
        {
            "name": "update_reservation_flights",
            "arguments": {
                "reservation_id": "FQ8APE",
                "cabin": "economy",
                "flights": [
                    {
                        "flight_number": "HAT056",
                        "date": "2024-05-25",
                    },
                    {
                        "flight_number": "HAT138",
                        "date": "2024-05-25",
                    },
                ],
                "payment_id": "gift_card_8190333",
            },
        },
        {
            "name": "update_reservation_passengers",
            "arguments": {
                "reservation_id": "FQ8APE",
                "passengers": [
                    {
                        "first_name": "Omar",
                        "last_name": "Rossi",
                        "dob": "1970-06-06",
                    }
                ],
            },
        },
        {
            "name": "update_reservation_baggages",
            "arguments": {
                "reservation_id": "FQ8APE",
                "total_baggages": 3,
                "nonfree_baggages": 0,
                "payment_id": "gift_card_8190333",
            },
        },
    ],
},

After conversion, this is the actual interaction between this case and the model:

{"traj": [
      {
        "role": "system",
        "content": "# Airline Agent Policy\n\nThe current time is 2024-05-15 15:00:00 EST.\n\nAs an airline agent, you can help users book, modify, or cancel flight reservations.\n\n- Before taking any actions that update the booking database (booking, modifying flights, editing baggage, upgrading cabin class, or updating passenger information), you must list the action details and obtain explicit user confirmation (yes) to proceed.\n\n- You should not provide any information, knowledge, or procedures not provided by the user or available tools, or give subjective recommendations or comments.\n\n- You should only make one tool call at a time, and if you make a tool call, you should not respond to the user simultaneously. If you respond to the user, you should not make a tool call at the same time.\n\n- You should deny user requests that are against this policy.\n\n- You should transfer the user to a human agent if and only if the request cannot be handled within the scope of your actions.\n\n## Domain Basic\n\n- Each user has a profile containing user id, email, addresses, date of birth, payment methods, reservation numbers, and membership tier.\n\n- Each reservation has an reservation id, user id, trip type (one way, round trip), flights, passengers, payment methods, created time, baggages, and travel insurance information.\n\n- Each flight has a flight number, an origin, destination, scheduled departure and arrival time (local time), and for each date:\n  - If the status is "available", the flight has not taken off, available seats and prices are listed.\n  - If the status is "delayed" or "on time", the flight has not taken off, cannot be booked.\n  - If the status is "flying", the flight has taken off but not landed, cannot be booked.\n\n## Book flight\n\n- The agent must first obtain the user id, then ask for the trip type, origin, destination.\n\n- Passengers: Each reservation can have at most five passengers. 
The agent needs to collect the first name, last name, and date of birth for each passenger. All passengers must fly the same flights in the same cabin.\n\n- Payment: each reservation can use at most one travel certificate, at most one credit card, and at most three gift cards. The remaining amount of a travel certificate is not refundable. All payment methods must already be in user profile for safety reasons.\n\n- Checked bag allowance: If the booking user is a regular member, 0 free checked bag for each basic economy passenger, 1 free checked bag for each economy passenger, and 2 free checked bags for each business passenger. If the booking user is a silver member, 1 free checked bag for each basic economy passenger, 2 free checked bag for each economy passenger, and 3 free checked bags for each business passenger. If the booking user is a gold member, 2 free checked bag for each basic economy passenger, 3 free checked bag for each economy passenger, and 3 free checked bags for each business passenger. Each extra baggage is 50 dollars.\n\n- Travel insurance: the agent should ask if the user wants to buy the travel insurance, which is 30 dollars per passenger and enables full refund if the user needs to cancel the flight given health or weather reasons.\n\n## Modify flight\n\n- The agent must first obtain the user id and the reservation id.\n\n- Change flights: Basic economy flights cannot be modified. Other reservations can be modified without changing the origin, destination, and trip type. Some flight segments can be kept, but their prices will not be updated based on the current price. The API does not check these for the agent, so the agent must make sure the rules apply before calling the API!\n\n- Change cabin: all reservations, including basic economy, can change cabin without changing the flights. Cabin changes require the user to pay for the difference between their current cabin and the new cabin class. 
Cabin class must be the same across all the flights in the same reservation; changing cabin for just one flight segment is not possible.\n\n- Change baggage and insurance: The user can add but not remove checked bags. The user cannot add insurance after initial booking.\n\n- Change passengers: The user can modify passengers but cannot modify the number of passengers. This is something that even a human agent cannot assist with.\n\n- Payment: If the flights are changed, the user needs to provide one gift card or credit card for payment or refund method. The agent should ask for the payment or refund method instead.\n\n## Cancel flight\n\n- The agent must first obtain the user id, the reservation id, and the reason for cancellation (change of plan, airline cancelled flight, or other reasons)\n\n- All reservations can be cancelled within 24 hours of booking, or if the airline cancelled the flight. Otherwise, basic economy or economy flights can be cancelled only if travel insurance is bought and the condition is met, and business flights can always be cancelled. The rules are strict regardless of the membership status. The API does not check these for the agent, so the agent must make sure the rules apply before calling the API!\n\n- The agent can only cancel the whole trip that is not flown. 
If any of the segments are already used, the agent cannot help and transfer is needed.\n\n- The refund will go to original payment methods in 5 to 7 business days.\n\n## Refund\n\n- If the user is silver/gold member or has travel insurance or flies business, and complains about cancelled flights in a reservation, the agent can offer a certificate as a gesture after confirming the facts, with the amount being $100 times the number of passengers.\n\n- If the user is silver/gold member or has travel insurance or flies business, and complains about delayed flights in a reservation and wants to change or cancel the reservation, the agent can offer a certificate as a gesture after confirming the facts and changing or cancelling the reservation, with the amount being $50 times the number of passengers.\n\n- Do not proactively offer these unless the user complains about the situation and explicitly asks for some compensation. Do not compensate if the user is regular member and has no travel insurance and flies (basic) economy.\n"
      },
      {
        "role": "user",
        "content": "Hi! I'd like to make a few changes to my trip from New York to Chicago. Can you help me with that?"
      },
      {
        "role": "assistant",
        "content": "Of course! I can help you modify your trip. Could you please provide your user ID and the reservation ID for the trip you want to change?",
      },
      {
        "role": "user",
        "content": "Sure! My user ID is omar_rossi_1241. But I'm not sure about the reservation ID at the moment."
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"user_id\":\"omar_rossi_1241\"}",
              "name": "get_user_details"
            },
            "id": "call_FXi5dyufwOlkHksVgNwVhhVB",
            "type": "function"
          }
        ]
      },
      {
        "role": "assistant",
        "content": "I found your reservations. Could you please confirm which reservation ID you'd like to modify for your trip from New York to Chicago? Here are your reservation IDs:\n\n- UM3OG5\n- 5RJ7UH\n- FQ8APE\n- QKRY03\n\nLet me know which one you'd like to modify!",
      },
      {
        "role": "user",
        "content": "I'm not entirely certain, but I believe it might be the first one, UM3OG5."
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"reservation_id\":\"UM3OG5\"}",
              "name": "get_reservation_details"
            },
            "id": "call_FApEDaUHdL2hx8FNbu5UCMb8",
            "type": "function"
          }
        ]
      },
      {
        "role": "assistant",
        "content": "The reservation ID \"UM3OG5\" is for a round trip from Seattle (SEA) to Dallas/Fort Worth (DFW), not from New York to Chicago. \n\nCould you please check if any of the other reservation IDs match your trip from New York to Chicago? Here are the remaining reservation IDs:\n\n- 5RJ7UH\n- FQ8APE\n- QKRY03\n\nLet me know which one you'd like to modify!",
      },
      {
        "role": "user",
        "content": "In that case, I think it might be FQ8APE. Can you check that for me?"
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"reservation_id\":\"FQ8APE\"}",
              "name": "get_reservation_details"
            },
            "id": "call_W507pQxieFVdAxvqONuYWvN2",
            "type": "function"
          }
        ]
      },
      {
        "content": "The reservation ID \"FQ8APE\" is for a one-way trip from Newark (EWR) to Chicago (ORD) with a stop in Houston (IAH). Since this reservation is in basic economy, please note that the flights cannot be modified. However, I can assist with other changes like cabin upgrades or baggage additions.\n\nPlease let me know what specific changes you would like to make to this reservation!",
        "role": "assistant"
      },
      {
        "role": "user",
        "content": "Great! I'd like to add three checked bags, change the passenger to myself, and upgrade to economy class, please."
      },
      {
        "role": "assistant",
        "content": "Here are the changes you want to make to your reservation:\n\n1. **Add three checked bags**: \n   - Since you are a gold member and upgrading to economy class, you will have 3 free checked bags, so no additional charges for the bags.\n\n2. **Change the passenger to yourself**:\n   - Current passenger: Ivan Garcia\n   - New passenger: Omar Rossi (DOB: 1970-06-06)\n\n3. **Upgrade to economy class**:\n   - Upgrade from basic economy to economy class.\n\nPlease confirm if you would like to proceed with these changes.",
      },
      {
        "role": "user",
        "content": "Yes, I'd like to proceed with those changes. And please use a gift card for any payment needed."
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"reservation_id\": \"FQ8APE\", \"passengers\": [{\"first_name\": \"Omar\", \"last_name\": \"Rossi\", \"dob\": \"1970-06-06\"}]}",
              "name": "update_reservation_passengers"
            },
            "id": "call_jK7xz4ERk3csc9jBfroPobGs",
            "type": "function"
          }
        ]
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"reservation_id\":\"FQ8APE\",\"cabin\":\"economy\",\"flights\":[{\"origin\":\"EWR\",\"destination\":\"IAH\",\"flight_number\":\"HAT056\",\"date\":\"2024-05-25\"},{\"origin\":\"IAH\",\"destination\":\"ORD\",\"flight_number\":\"HAT138\",\"date\":\"2024-05-25\"}],\"payment_id\":\"gift_card_8190333\"}",
              "name": "update_reservation_flights"
            },
            "id": "call_zeyT5c2EYzRvfY42X7YOKOng",
            "type": "function"
          }
        ]
      },
      {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"reservation_id\":\"FQ8APE\",\"total_baggages\":3,\"nonfree_baggages\":0,\"payment_id\":\"gift_card_8190333\"}",
              "name": "update_reservation_baggages"
            },
            "id": "call_PA1XaKLPX8egjewaxIArCkRc",
            "type": "function"
          }
        ]
      },
      {
        "role": "assistant",
        "content": "Your reservation has been successfully updated with the following changes:\n\n1. **Passenger Change**: The passenger is now Omar Rossi.\n2. **Cabin Upgrade**: Upgraded to economy class.\n3. **Checked Baggage**: Added three checked bags, all free due to your gold membership.\n\nThe payment for the cabin upgrade was processed using your gift card. If you have any more questions or need further assistance, feel free to ask!",
        
      },
      {
        "role": "user",
        "content": "Thank you for helping with the changes! That's all for now. ###STOP###"
      }
    ],
}

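Over the course of the run the model works out the user's intent step by step, calls get_user_details and get_reservation_details to fetch information, and finally calls update_reservation_passengers, update_reservation_flights, and update_reservation_baggages in turn to modify the database.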

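After the run, the database is reloaded in its initial state and the actions in the task are executed one by one. The final database state is then compared with the one the model produced. If they match, the tools the model called and the data it changed are what we expected, and the test passes.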
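A minimal sketch of this state comparison is shown below. The tool functions, the toy database, and the `reward` helper are hypothetical simplifications: here both sides are replayed from lists of actions, whereas in a real run the agent's side is simply the database left behind by the live conversation.

```python
import copy
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]


def update_cabin(db: Dict[str, Any], reservation_id: str, cabin: str) -> None:
    """Hypothetical tool: change the cabin class of a reservation."""
    db[reservation_id]["cabin"] = cabin


def update_baggages(db: Dict[str, Any], reservation_id: str, total_baggages: int) -> None:
    """Hypothetical tool: set the number of checked bags on a reservation."""
    db[reservation_id]["total_baggages"] = total_baggages


TOOLS: Dict[str, Callable[..., None]] = {
    "update_cabin": update_cabin,
    "update_baggages": update_baggages,
}


def replay(initial_db: Dict[str, Any], actions: List[Action]) -> Dict[str, Any]:
    """Apply the actions to a fresh copy of the initial database and return it."""
    db = copy.deepcopy(initial_db)
    for action in actions:
        TOOLS[action["name"]](db, **action["arguments"])
    return db


def reward(initial_db: Dict[str, Any], agent_actions: List[Action], expected_actions: List[Action]) -> int:
    """Return 1 if both action sequences leave the database in the same state, else 0."""
    return 1 if replay(initial_db, agent_actions) == replay(initial_db, expected_actions) else 0


initial = {"FQ8APE": {"cabin": "basic_economy", "total_baggages": 0}}
expected = [
    {"name": "update_cabin", "arguments": {"reservation_id": "FQ8APE", "cabin": "economy"}},
    {"name": "update_baggages", "arguments": {"reservation_id": "FQ8APE", "total_baggages": 3}},
]
# The agent may call the tools in a different order; only the final state is compared.
agent_actions = list(reversed(expected))
print(reward(initial, agent_actions, expected))
print(reward(initial, expected[:1], expected))
```

Comparing final states rather than call sequences is what lets the example trajectory pass even though its tool calls are ordered differently from the reference actions.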

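Each test's trajectory and result can be inspected, with Reward=1/0 showing whether the case passed. Claude 3.7 resolves as much as 81% of the retail tasks but only 58% of the airline tasks. On a closer look, some airline cases involve a lot of querying and matching of flight information, price calculation, and multi-step baggage, payment, and refund or exchange operations, so they remain quite hard.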

Pass^k

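The benchmark also defines another metric: the probability of passing reliably across repeated runs. When an agent is used for real, for example to book a flight or handle a return, a failed run is deeply frustrating, so the consistency of its success rate matters a lot. The benchmark therefore defines pass^k: a test case is run k times, and the task counts as successful only if every run succeeds.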
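In practice pass^k is usually estimated from n ≥ k trials per task rather than from exactly k runs. As I understand the τ-bench paper, if a task succeeds in c of its n trials, the chance that k trials drawn from them all succeed is C(c, k) / C(n, k), and pass^k is the average of that value over tasks. The sketch below implements this estimator; the function name and the sample counts are made up.

```python
from math import comb
from typing import List


def pass_hat_k(successes: List[int], n: int, k: int) -> float:
    """Estimate pass^k from per-task success counts.

    successes[i] is the number of successful trials (out of n) for task i.
    For each task, C(c, k) / C(n, k) is the probability that k trials drawn
    without replacement from its n trials are all successes.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb(c, k) is 0 when k > c, so tasks with fewer than k successes contribute 0.
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)


# Three hypothetical tasks, 4 trials each: always passes, passes half the time, never passes.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n=4, k=k), 4))
```

With k = 1 this reduces to the ordinary average pass rate, and the value can only stay level or fall as k grows, which is why flaky tasks pull pass^k down so quickly.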

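The reported results show that no model is very consistent. In the airline domain, Claude 3.5 Sonnet drops from a 46% pass rate to 22.5% at pass^4, meaning that only 22.5% of the tasks succeed in all 4 of 4 runs.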

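This matches our current experience: agents are not that reliable yet, and we cannot expect them to understand intent and take the right actions consistently in complex scenarios and multi-turn interactions. Claude 3.7 has no published pass^k numbers, so it is unclear whether its consistency has improved.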

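Closing thoughts

Both benchmarks try hard to simulate real-world problem scenarios, and they do it comparatively well, but they are still far from how models are really used and from the variety of real use. Scores are only a reference: they tell you roughly where a model is still weak. Whether a model works well in a particular scenario still has to be tested hands-on. From what I have heard, the actual experience with Claude 3.7 is far better than with 3.5, and the score gains on these two benchmarks do not fully reflect the extent of the improvement.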
