QUICK REVIEW

[Paper Review] Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing

Zhe Liu, Chunyang Chen | arXiv (Cornell University) | May 16, 2023

Software Testing and Debugging Techniques | Cited by 14

One-Line Summary

GPTDroid frames mobile GUI testing as a Q&A task: it uses GPT-3 to generate test scripts from GUI descriptions, then decodes and executes them, improving coverage and bug detection over baselines on Google Play apps.

ABSTRACT

Mobile apps are indispensable for people's daily life, and automated GUI (Graphical User Interface) testing is widely used for app quality assurance. There is a growing interest in using learning-based techniques for automated GUI testing which aims at generating human-like actions and interactions. However, the limitations such as low testing coverage, weak generalization, and heavy reliance on training data, make an urgent need for a more effective approach to generate human-like actions to thoroughly test mobile apps. Inspired by the success of the Large Language Model (LLM), e.g., GPT-3 and ChatGPT, in natural language understanding and question answering, we formulate the mobile GUI testing problem as a Q&A task. We propose GPTDroid, asking LLM to chat with the mobile apps by passing the GUI page information to LLM to elicit testing scripts, and executing them to keep passing the app feedback to LLM, iterating the whole process. Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process, design prompts for inputting this information to LLM, and develop a neural matching network to decode the LLM's output into actionable steps to execute the app. We evaluate GPTDroid on 86 apps from Google Play, and its activity coverage is 71%, with 32% higher than the best baseline, and can detect 36% more bugs with faster speed than the best baseline. GPTDroid also detects 48 new bugs on the Google Play with 25 of them being confirmed/fixed. We further summarize the capabilities of GPTDroid behind the superior performance, including semantic text input, compound action, long meaningful test trace, and test case prioritization.
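The iterative loop described in the abstract (describe the page, ask the LLM, decode the answer into a widget, execute, feed the result back) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TOY_APP`, `fake_llm`, and the prompt wording are hypothetical, the LLM call is replaced by a simple rule, and the paper's neural matching network is replaced by plain string similarity from Python's `difflib`.

```python
import re
from difflib import get_close_matches

# Hypothetical toy app: each page maps a widget label to the page it opens.
TOY_APP = {
    "Home": {"Settings": "Settings", "New note": "Editor"},
    "Settings": {"Back": "Home"},
    "Editor": {"Save": "Home", "Back": "Home"},
}

def describe_page(page, history):
    """Turn the current page and the testing history into a question."""
    widgets = ", ".join(TOY_APP[page])
    return (f"The current page is '{page}' with widgets: {widgets}. "
            f"Pages tested so far: {', '.join(history)}. "
            "Which widget should be operated next?")

def fake_llm(prompt, page, history):
    """Stand-in for a real LLM call (assumption): prefer unvisited pages."""
    for label, target in TOY_APP[page].items():
        if target not in history:
            return f"Operation: click the '{label}' button."
    first = next(iter(TOY_APP[page]))
    return f"Operation: click the '{first}' button."

def match_widget(answer, page):
    """Decode free-text output into a widget label via string similarity.
    This replaces the paper's neural matching network for illustration only."""
    quoted = re.search(r"'([^']+)'", answer)
    target = quoted.group(1) if quoted else answer
    # cutoff=0.0 always returns the closest label, however weak the match.
    return get_close_matches(target, list(TOY_APP[page]), n=1, cutoff=0.0)[0]

def run_session(start="Home", max_steps=6):
    page, history, trace = start, [], []
    for _ in range(max_steps):
        if page not in history:
            history.append(page)           # dynamic context: what was tested
        prompt = describe_page(page, history)
        answer = fake_llm(prompt, page, history)
        widget = match_widget(answer, page)
        trace.append((page, widget))
        page = TOY_APP[page][widget]       # "execute" the action on the toy app
    return history, trace

if __name__ == "__main__":
    visited, steps = run_session()
    print("visited pages:", visited)
    print("executed steps:", steps)
```

In a real setup, `fake_llm` would be replaced by an actual model call and the toy transition table by actions executed on a device or emulator.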

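Motivation and Objectives

Formulate automated mobile GUI testing as a Q&A problem for a large language model (LLM).
Describe GUI pages to the LLM semantically through static and dynamic context extraction.
Decode LLM output into executable GUI actions using a neural matching network.
Evaluate GPTDroid on real-world Android apps, measuring coverage and bug detection.
Provide insight into why LLM-driven testing is effective, including semantic input and test trace quality.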

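Proposed Method

Extract the static context (app information, current GUI page, widgets) from the AndroidManifest and view hierarchy files.
Extract the dynamic context (the progress of the iterative testing) using an operation memorizer.
Design linguistic patterns from the static and dynamic context and generate prompts as LLM input.
Use a neural matching network to map the steps reported by the LLM onto actual GUI widgets.
Generate positive and negative training examples with heuristics to seed the neural matcher.
Implement GPTDroid on Android-x86 using UIAutomator and ADB, with GPT-3 as the LLM and PyTorch-based decoding and matching components.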
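To make the context-extraction and prompt-generation steps above more concrete, here is a minimal sketch, assuming a view hierarchy in the XML layout that UIAutomator dumps produce (`node` elements with `text`, `content-desc`, `resource-id`, `class`, and `clickable` attributes). The sample dump, the helper names `extract_widgets` and `build_prompt`, and the prompt wording are all hypothetical; the paper's actual linguistic patterns are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed view hierarchy for illustration only.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" resource-id="" content-desc="" clickable="false">
    <node class="android.widget.EditText" text="" resource-id="com.example.notes:id/title" content-desc="Note title" clickable="true"/>
    <node class="android.widget.Button" text="Save" resource-id="com.example.notes:id/save" content-desc="" clickable="true"/>
    <node class="android.widget.TextView" text="Draft" resource-id="" content-desc="" clickable="false"/>
  </node>
</hierarchy>"""

def extract_widgets(dump_xml):
    """Collect clickable widgets with a human-readable label and widget type."""
    root = ET.fromstring(dump_xml)
    widgets = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        # Fall back from visible text to content description to resource id.
        label = (node.get("text")
                 or node.get("content-desc")
                 or node.get("resource-id", "").split("/")[-1])
        kind = node.get("class", "").split(".")[-1]
        widgets.append({"label": label, "kind": kind})
    return widgets

def build_prompt(app_name, activity, widgets, tested):
    """Combine static context (app, page, widgets) and dynamic context (history)."""
    widget_text = ", ".join(f"'{w['label']}' ({w['kind']})" for w in widgets)
    tested_text = ", ".join(tested) if tested else "none"
    return (f"We are testing the app '{app_name}'. "
            f"The current page is '{activity}'. "
            f"Its operable widgets are: {widget_text}. "
            f"Widgets already tested: {tested_text}. "
            "Which operation should be performed next?")

if __name__ == "__main__":
    found = extract_widgets(SAMPLE_DUMP)
    print(build_prompt("Notes", "EditorActivity", found, tested=[]))
```

The fallback order for labels (text, then content description, then resource id) is a design choice of this sketch, made so that icon-only widgets still receive some textual description.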

Figure 1. Demonstrated example of how GPTDroid works.

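Experimental Results

Research Questions

RQ1: Does GPTDroid increase activity coverage in GUI testing?
RQ2: Can GPTDroid detect more bugs, and detect them faster, than baseline GUI testing tools?
RQ3: How accurately does the neural matching network translate the LLM's output into valid GUI actions?
RQ4: What factors explain the success of LLM-driven GUI testing in this setting?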

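Key Findings

GPTDroid achieves 71% activity coverage, 32% higher than the best baseline.
GPTDroid detects 36% more bugs than the best baseline, and does so faster.
GPTDroid detected 48 new crash bugs on Google Play, 25 of which were confirmed or fixed by developers.
The evaluation covers 86 apps and 129 bugs, with comparison against 9 baselines.
Qualitative analysis identifies the factors behind the performance: semantic text input, compound actions, long test traces, and test case prioritization.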

Figure 2. The model structure of GPT-3.


This review was generated by AI and checked by a human editor.