QUICK REVIEW

[Paper Review] DSL or Code? Evaluating the Quality of LLM-Generated Algebraic Specifications: A Case Study in Optimization at Kinaxis

Negin Ayoughi, David Dewar | arXiv (Cornell University) | 2026-01-01

Model-Driven Software Engineering Techniques | Citations: 0

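One-Line Summary

The paper introduces EXEOS, an LLM-based pipeline that derives AMPL or Python specifications from NL descriptions, and shows empirically that, on industrial optimization problems that are structured and iteratively refined, DSL (AMPL) specifications match or exceed Python code in quality.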

ABSTRACT

Model-driven engineering (MDE) provides abstraction and analytical rigour, but industrial adoption in many domains has been limited by the cost of developing and maintaining models. Large language models (LLMs) can help shift this cost balance by supporting direct generation of models from natural-language (NL) descriptions. For domain-specific languages (DSLs), however, LLM-generated models may be less accurate than LLM-generated code in mainstream languages such as Python, due to the latter's dominance in LLM training corpora. We investigate this issue in mathematical optimization, with AMPL, a DSL with established industrial use. We introduce EXEOS, an LLM-based approach that derives AMPL models and Python code from NL problem descriptions and iteratively refines them with solver feedback. Using a public optimization dataset and real-world supply-chain cases from our industrial partner Kinaxis, we evaluate generated AMPL models against Python code in terms of executability and correctness. An ablation study with two LLM families shows that AMPL is competitive with, and sometimes better than, Python, and that our design choices in EXEOS improve the quality of generated specifications.

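Motivation and Objectives

Evaluate whether LLM-generated DSL specifications (AMPL) can match or exceed LLM-generated code (Python) in executability and correctness on optimization problems.
Investigate how structuring the NL problem description affects the quality of the generated specifications.
Assess how an iterative refinement loop guided by solver feedback affects error handling.
Compare reasoning LLMs with instruction-following LLMs for generating executable specifications.
Examine the role of a separate data-handling step in improving specification quality and executability.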

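Proposed Method

Proposes EXEOS, an LLM-based pipeline that structures the NL problem description, transforms the associated data, generates a formal specification in AMPL or Python, and iteratively refines it using solver feedback.
Conducts a factorial ablation study over the structuring and refinement choices, using two datasets (Public and Kinaxis Industry) and two target languages (AMPL and Python).
Uses four LLMs (two reasoning and two instruction-following) and produces 10,560 specification instances across eight EXEOS variants.
Evaluates the generated specifications for executability (whether they compile and run successfully) and correctness (relative error against the ground-truth value).
Provides a replication package containing the code, datasets, and evaluation scripts.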
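
The solver-feedback refinement loop described above can be illustrated with a minimal sketch. This is not the EXEOS implementation: `generate`, `solve`, `SolveResult`, and the `max_rounds` budget are hypothetical names and values introduced here for illustration, and the LLM and solver are replaced by stubs so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SolveResult:
    ok: bool                           # True if the specification compiled and ran
    objective: Optional[float] = None  # objective value when ok is True
    error: str = ""                    # solver or interpreter message when ok is False


def refine_with_feedback(
    generate: Callable[[str, str], str],  # hypothetical LLM call: (description, feedback) -> spec
    solve: Callable[[str], SolveResult],  # hypothetical solver wrapper: spec -> result
    description: str,
    max_rounds: int = 3,                  # assumed budget, not taken from the paper
) -> Tuple[str, SolveResult]:
    """Generate a specification, then regenerate it with solver feedback until it runs."""
    spec = generate(description, "")
    result = solve(spec)
    rounds = 0
    while not result.ok and rounds < max_rounds:
        # Feed the error message back so the next attempt can correct it.
        spec = generate(description, result.error)
        result = solve(spec)
        rounds += 1
    return spec, result


# Stubs standing in for the LLM and the solver (illustration only).
def fake_generate(description: str, feedback: str) -> str:
    return "fixed spec" if feedback else "broken spec"


def fake_solve(spec: str) -> SolveResult:
    if spec == "fixed spec":
        return SolveResult(ok=True, objective=42.0)
    return SolveResult(ok=False, error="syntax error near ';'")


spec, result = refine_with_feedback(fake_generate, fake_solve, "maximize profit subject to capacity")
```

In this toy run the first attempt fails, the error message is passed back, and the second attempt succeeds. The same loop shape applies whether the target is AMPL or Python, since only the `solve` wrapper would differ.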

Figure 4. EXEOS – our approach for transforming NL descriptions of optimization problems into formal specifications.

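Experimental Results

Research Questions

RQ1: How do LLM-generated AMPL and Python specifications compare in executability and correctness?
RQ2: How does structuring the NL problem description affect executability and correctness?
RQ3: How does the refinement loop affect executability and correctness?
RQ4: How does the choice between reasoning LLMs and instruction-following LLMs affect the results?
RQ5: How does the data transformation step affect executability and correctness?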

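Key Findings

Structuring the description before generation consistently reduces compilation errors and improves alignment with the intended optimization objective.
Iterative refinement enables automatic repair of initial failures, which raises executability.
AMPL model generation is neither systematically worse nor systematically better than Python code generation; with reasoning LLMs in particular, AMPL often performs on par with or better than Python.
With structured NL descriptions and iterative refinement, AMPL models achieve higher executability on the Public dataset and perform on par with Python on the Kinaxis Industry dataset.
Compared with a baseline that omits the separate data-handling step, the EXEOS approach with explicit data processing is superior in both executability and correctness.
In total, 10,560 specification instances were generated across 66 problems, 8 variants, 4 LLMs, and 5 repetitions (66 × 8 × 4 × 5 = 10,560), requiring about 484 hours of computation.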
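
The instance count reported above can be checked with a short calculation. The decomposition of the eight variants into language × structuring × refinement is an assumption made here for illustration; the review states only that there are eight variants covering two target languages and the structuring and refinement choices.

```python
from itertools import product

# Assumed factorial layout of the eight variants (2 x 2 x 2); not confirmed by the review.
languages = ["AMPL", "Python"]
structuring = [False, True]  # unstructured vs. structured NL description
refinement = [False, True]   # without vs. with the solver-feedback loop
variants = list(product(languages, structuring, refinement))

# Figures taken from the review.
n_problems, n_llms, n_runs = 66, 4, 5
total_instances = n_problems * len(variants) * n_llms * n_runs

# Average wall-clock time per instance implied by the reported ~484 hours.
avg_seconds = 484 * 3600 / total_instances

print(len(variants), total_instances, avg_seconds)
```

Under these figures the total comes to 10,560 instances, and the reported compute budget works out to roughly 165 seconds per instance on average.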

Figure 5. Comparison of EXEOS variants that generate AMPL models and Python code from structured descriptions with refinement loops, showing average execution success rate, average number of zero-error solutions, and average relative error when applied with reasoning LLMs on the Public and Industry
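
The three quantities named in the Figure 5 caption can be sketched as follows. The relative-error formula, the zero-error tolerance `tol`, and the `summarize` helper are assumptions for illustration; the review does not give the paper's exact definitions, and the sample values are made up.

```python
from typing import Dict, List, Optional


def relative_error(value: float, truth: float) -> float:
    # Assumed definition: |value - truth| / |truth|, with a nonzero ground truth.
    return abs(value - truth) / abs(truth)


def summarize(objectives: List[Optional[float]], truth: float, tol: float = 1e-9) -> Dict[str, Optional[float]]:
    """Summarize repeated runs for one problem; None marks a run that failed to execute."""
    executed = [v for v in objectives if v is not None]
    errors = [relative_error(v, truth) for v in executed]
    return {
        "execution_success_rate": len(executed) / len(objectives),
        "zero_error_solutions": sum(1 for e in errors if e <= tol),
        "mean_relative_error": sum(errors) / len(errors) if errors else None,
    }


# Illustrative data only: five runs of one problem whose true optimum is 100.
runs = [100.0, 100.0, 90.0, None, 110.0]
stats = summarize(runs, truth=100.0)
print(stats)
```

Here the mean relative error is taken over executed runs only, which is one possible convention; treating failed runs differently would change the average.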


This review was generated by AI and reviewed by a human editor.