QUICK REVIEW

[Paper Review] ELLMPEG: An Edge-based Agentic LLM Video Processing Tool

Zoha Azimi, Reza Farahani|arXiv (Cornell University)|Jan 17, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

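tldr: ELLMPEG is an edge-deployable agentic LLM system that uses Retrieval-Augmented Generation and self-refinement to generate and locally verify FFmpeg and VVenC commands, eliminating reliance on cloud APIs. With open-source models, it achieves high command-generation accuracy at low runtime and energy cost.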

ABSTRACT

Large language models (LLMs), the foundation of generative AI systems like ChatGPT, are transforming many fields and applications, including multimedia, enabling more advanced content generation, analysis, and interaction. However, cloud-based LLM deployments face three key limitations: high computational and energy demands, privacy and reliability risks from remote processing, and recurring API costs. Recent advances in agentic AI, especially in structured reasoning and tool use, offer a better way to exploit open and locally deployed tools and LLMs. This paper presents ELLMPEG, an edge-enabled agentic LLM framework for the automated generation of video-processing commands. ELLMPEG integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at the edge, eliminating reliance on external cloud APIs. To evaluate ELLMPEG, we collect a dedicated prompt dataset comprising 480 diverse queries covering different categories of FFmpeg and the Versatile Video Codec (VVC) encoder (VVenC) commands. We validate command generation accuracy and evaluate four open-source LLMs based on command validity, tokens generated per second, inference time, and energy efficiency. We also execute the generated commands to assess their runtime correctness and practical applicability. Experimental results show that Qwen2.5, when augmented with the ELLMPEG framework, achieves an average command-generation accuracy of 78% with zero recurring API cost, outperforming all other open-source models across both the FFmpeg and VVenC datasets.

Research Motivation and Objectives

  • Motivate edge-based, privacy-preserving video processing by reducing reliance on cloud LLMs and APIs.
  • Design an architecture that combines RAG with self-reflection to generate executable multimedia processing commands at the edge.
  • Evaluate open-source LLMs on FFmpeg and VVenC command generation in terms of validity, speed, and energy efficiency.
  • Provide a dataset of FFmpeg and VVenC queries and benchmark the system’s accuracy and practicality for edge deployment.

Proposed Method

  • Propose an edge-deployable agentic LLM workflow with three phases: RAG setup, LLM reasoning, and command execution.
  • Maintain two tool-aware FAISS vector stores (FFmpeg and VVenC) and perform tool-specific retrieval for accurate command generation.
  • Use a dual-embedding approach to map chunks to the relevant tool documentation during retrieval.
  • Implement a self-reflection loop bounded by at most I_max iterations to correct errors and improve command correctness.
  • Extract executable commands from LLM outputs using a pattern-matching module before dispatching to FFmpeg or VVenC backends.
  • Evaluate on a dedicated 480-query dataset covering FFmpeg and VVenC commands, and measure accuracy, speed, and energy efficiency on edge CPUs and server-class hardware.
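
The bounded self-reflection loop and the pattern-matching extraction step listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_command`, `refine`, the regular expression, and the stub `fake_generate` / `fake_validate` functions are all hypothetical names introduced here, and `ffmpeg` and `vvencapp` are assumed to be the executable names that the extractor looks for. A real deployment would replace the stubs with a local LLM call and an actual tool invocation.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical pattern: a line that begins with an assumed tool binary name.
COMMAND_RE = re.compile(r"^\s*(?:\$\s*)?((?:ffmpeg|vvencapp)\b.*)$", re.MULTILINE)


def extract_command(llm_output: str) -> Optional[str]:
    """Return the first line that looks like a tool invocation, if any."""
    match = COMMAND_RE.search(llm_output)
    return match.group(1).strip() if match else None


def refine(
    query: str,
    generate: Callable[[str, str], str],
    validate: Callable[[str], Tuple[bool, str]],
    i_max: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a command, check it, and feed errors back at most i_max times."""
    feedback = ""
    for iteration in range(1, i_max + 1):
        output = generate(query, feedback)
        command = extract_command(output)
        if command is None:
            feedback = "No ffmpeg or vvencapp command was found in the answer."
            continue
        ok, error = validate(command)
        if ok:
            return command, iteration
        # The error message becomes context for the next attempt.
        feedback = f"Command `{command}` failed: {error}"
    return None, i_max


# Stubs standing in for the LLM and the local verifier (illustration only).
def fake_generate(query: str, feedback: str) -> str:
    if not feedback:
        return "Sure:\nffmpeg in.mp4 out.webm"
    return "Corrected:\n```bash\nffmpeg -i in.mp4 out.webm\n```"


def fake_validate(command: str) -> Tuple[bool, str]:
    if " -i " in command:
        return True, ""
    return False, "missing -i input flag"


result = refine("convert in.mp4 to webm", fake_generate, fake_validate)
print(result)
```

With these stubs, the first attempt is rejected by the verifier, the error text is passed back as feedback, and the second attempt is accepted. Capping the loop at `i_max` keeps worst-case latency predictable, which matters on edge CPUs.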
Figure 1. Comparison of responses to two queries: green borders indicate valid commands, red borders denote invalid ones.

Experimental Results

Research Questions

  • RQ1: Can edge-deployable LLMs with RAG and self-refinement generate correct FFmpeg and VVenC commands without cloud APIs?
  • RQ2: How do open-source 2–8B parameter models perform on domain-specific multimedia command generation when augmented with ELLMPEG?
  • RQ3: What is the trade-off between command-generation accuracy, inference time, and energy consumption in edge versus server environments?
  • RQ4: Does a tool-aware dual-vector-store RAG setup improve retrieval relevance and reduce cross-tool confusion in command generation?

Key Findings

  • Qwen2.5 augmented with ELLMPEG achieves an average command-generation accuracy of 78% with zero recurring API costs.
  • Qwen2.5 with ELLMPEG outperforms all other evaluated open-source models in command accuracy across both the FFmpeg and VVenC datasets.
  • The system runs on edge hardware (Intel i7-8700) and server hardware (Xeon Gold with GPUs), with the edge setup avoiding cloud APIs.
  • Two separate FAISS vector stores for FFmpeg and VVenC reduce retrieval noise and improve tool routing accuracy.
  • A self-reflection loop with a bounded iteration count improves command correctness while maintaining acceptable latency on edge devices.
  • The dataset comprises 480 diverse queries (380 FFmpeg, 100 VVenC) created from GPT-4o and real-world sources, and is publicly released for reproducibility.
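
The idea of keeping one vector store per tool and routing each query to exactly one of them can be illustrated with the sketch below. It deliberately swaps FAISS and learned embeddings for a toy bag-of-words store so that it runs with the standard library alone. The keyword-based `route` function, the `ToyStore` class, and the four documentation snippets are assumptions made for illustration (the snippets are paraphrases, not verbatim tool documentation), and the routing rule actually used in the paper may differ.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyStore:
    """In-memory stand-in for a per-tool vector index."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# One store per tool, so chunks from the other tool can never be retrieved.
STORES = {
    "ffmpeg": ToyStore([
        "ffmpeg -c:v selects the video codec for an output stream",
        "ffmpeg -vf scale resizes the video to a given width and height",
    ]),
    "vvenc": ToyStore([
        "vvencapp --preset selects the encoder speed preset",
        "vvencapp --qp sets the quantization parameter",
    ]),
}


def route(query: str) -> str:
    """Hypothetical router: keyword match for VVC terms, FFmpeg otherwise."""
    return "vvenc" if re.search(r"\b(vvenc|vvc|h\.?266)\b", query.lower()) else "ffmpeg"


def retrieve(query: str, k: int = 1) -> Tuple[str, List[str]]:
    tool = route(query)
    return tool, STORES[tool].search(query, k)


print(retrieve("Encode input.yuv with VVenC using the slow preset"))
print(retrieve("Resize video.mp4 to 1280x720"))
```

Because retrieval only ever searches the store chosen by the router, an FFmpeg question cannot pull in VVenC option text and vice versa, which is the kind of cross-tool noise the separate stores are meant to avoid.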
Figure 2. ELLMPEG architecture.


This review was generated by AI and reviewed by human editors.