QUICK REVIEW

[Paper Review] InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback

John Yang, Akshara Prabhakar | arXiv (Cornell University) | 2023-06-26

Software Engineering Research · Citations: 7

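One-line summary

InterCode provides a lightweight, Docker-based framework for interactive code generation with execution feedback, and uses it to benchmark state-of-the-art LLMs in Bash, SQL, and Python environments under a range of prompting strategies.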

ABSTRACT

Humans write code in a fundamentally interactive manner and rely on constant execution feedback to correct errors, resolve ambiguities, and decompose tasks. While LLMs have recently exhibited promising coding capabilities, current coding benchmarks mostly consider a static instruction-to-code sequence transduction process, which has the potential for error propagation and a disconnect between the generated code and its final execution environment. To address this gap, we introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning (RL) environment, with code as actions and execution feedback as observations. Our framework is language and platform agnostic, uses self-contained Docker environments to provide safe and reproducible execution, and is compatible out-of-the-box with traditional seq2seq coding methods, while enabling the development of new methods for interactive code generation. We use InterCode to create three interactive code environments with Bash, SQL, and Python as action spaces, leveraging data from the static NL2Bash, Spider, and MBPP datasets. We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies such as ReAct and Plan & Solve. Our results showcase the benefits of interactive code generation and demonstrate that InterCode can serve as a challenging benchmark for advancing code understanding and generation capabilities. InterCode is designed to be easily extensible and can even be used to create new tasks such as Capture the Flag, a popular coding puzzle that is inherently multi-step and involves multiple programming languages. Project site with code and data: https://intercode-benchmark.github.io

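Research Motivation and Goals

Mimic the human write-execute-test debugging loop to improve code generation through execution feedback.
Provide a general, environment-agnostic interactive coding framework that is scalable and safe through Docker containers.
Bridge existing static datasets to interactive tasks, enabling iterative refinement and evaluation.
Evaluate multiple LLMs and prompting strategies to quantify the benefit of interaction on coding tasks.
Propose an extensible task-construction pipeline for creating new interactive coding benchmarks and datasets.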

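Proposed Method

Formulate InterCode as a partially observable Markov decision process (POMDP) comprising an instruction space, states, actions, observations, and a reward signal.
Implement a Docker-based execution sandbox that hosts Bash, SQL, and Python environments as action spaces.
Ground the instructions and gold responses of static NL-to-code datasets (NL2Bash, Spider, MBPP) in executable environments, turning them into interactive tasks.
Use execution-based rewards with exact-match and IoU/Kendall-based variants, and expose a reward-function endpoint for custom signals.
Evaluate several models (OpenAI, PaLM-2, open source) under different prompting strategies (Single Turn, Try Again, ReAct, Plan & Solve).
Provide a modular data-collection and environment-construction pipeline for converting existing datasets into InterCode tasks and validating safety with unit tests.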
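
The interaction loop described above (code as the action, execution output as the observation, an execution-based reward) can be sketched in a few lines. This is a minimal illustration and not InterCode's actual API: `ToySQLEnv`, `iou`, `try_again`, and `scripted_policy` are hypothetical names, an in-memory SQLite database stands in for the Docker sandbox, and the reward uses only a set-based IoU over result rows (the Kendall-based ordering term is omitted).

```python
import sqlite3


def iou(predicted, gold):
    """Intersection over union of two sets of result rows."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)


class ToySQLEnv:
    """Hypothetical stand-in for an interactive SQL environment (not InterCode's API)."""

    def __init__(self, setup_sql, gold_query):
        # Assumption: in-memory SQLite replaces the Docker-hosted database.
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(setup_sql)
        # The gold rows are computed once, before the agent acts.
        self.gold_rows = set(self.conn.execute(gold_query).fetchall())

    def step(self, action):
        """Execute one SQL statement and return (observation, reward)."""
        try:
            rows = self.conn.execute(action).fetchall()
        except sqlite3.Error as exc:
            # Execution errors come back as observations instead of being raised.
            return f"Error: {exc}", 0.0
        return rows, iou(set(rows), self.gold_rows)


def try_again(env, policy, max_turns=5):
    """Feed each observation back to the policy until reward hits 1 or turns run out."""
    history = []
    reward = 0.0
    for _ in range(max_turns):
        action = policy(history)
        observation, reward = env.step(action)
        history.append((action, observation, reward))
        if reward == 1.0:
            break
    return history, reward


SETUP = """
CREATE TABLE singer (name TEXT, age INTEGER);
INSERT INTO singer VALUES ('Ana', 31), ('Bo', 24), ('Cy', 45);
"""
GOLD = "SELECT name FROM singer WHERE age > 30"


def scripted_policy(history):
    """Replays fixed attempts; a real agent would prompt an LLM with the history."""
    attempts = [
        "SELECT name FROM singers WHERE age > 30",  # wrong table name, so an error
        "SELECT name FROM singer",                  # runs, but returns extra rows
        "SELECT name FROM singer WHERE age > 30",   # matches the gold result
    ]
    return attempts[len(history)]


env = ToySQLEnv(SETUP, GOLD)
history, final_reward = try_again(env, scripted_policy)
for action, observation, reward in history:
    print(f"{action!r} -> {observation!r} (reward={reward:.2f})")
```

In this toy run the first attempt returns an error observation with reward 0, the second earns partial credit because its rows overlap the gold rows, and the third reaches reward 1 and ends the episode. A single-turn baseline corresponds to `max_turns=1`.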

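Experimental Results

Research Questions

RQ1: Can interactive coding with execution feedback improve code generation over static sequence-to-sequence benchmarks?
RQ2: How do different prompting strategies affect the effectiveness of interactive coding across Bash, SQL, and Python tasks?
RQ3: What are the challenges and limitations of current LLMs in long-horizon interactive coding, and what improvements can InterCode facilitate?
RQ4: How can existing NL-to-code datasets be converted into flexible interactive tasks with safe execution environments?
RQ5: Is InterCode extensible enough to support new tasks such as Capture the Flag (CTF), a multi-language puzzle?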

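Key Findings

Interactive coding with multi-turn interaction improves model performance across tasks and models.
GPT-4 reaches a success rate of up to 73.7% on InterCode-SQL with the Try Again prompt, showing a strong benefit from interaction.
Prompting strategies that encourage explicit reasoning (ReAct, Plan & Solve) generally yield higher success rates and better validity in fewer turns.
Models demonstrate planning and modular problem solving: they use observations to compose higher-level actions and reuse previously solved subproblems to complete multi-step tasks.
InterCode serves as a safe, extensible benchmark for evaluating interactive code generation, and its Docker-based environments can be used to ground new datasets and tasks.
The framework supports a variety of task settings (Bash, SQL, Python) and can be extended to additional languages and to more complex tasks such as CTF-style challenges.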


This review was generated by AI and reviewed by a human editor.