QUICK REVIEW

[Paper Review] AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models

Zhiqiang Tang, Haoyang Fang|arXiv (Cornell University)|Apr 24, 2024
Natural Language Processing Techniques | Citations: 6
One-Sentence Summary

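AutoGluon-Multimodal (AutoMM) is an open-source AutoML framework for multimodal learning that fine-tunes foundation models across image, text, and tabular data in three lines of code. It supports classification, regression, object detection, semantic matching, and semantic segmentation. On unimodal and multimodal tasks it benchmarks better than existing AutoML tools, and it supports advanced vision-language capabilities.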

ABSTRACT

AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundation models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcase AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.

Motivation and Objectives

  • Democratize multimodal AutoML by providing a unified, easy-to-use framework that handles diverse modalities and tasks.
  • Leverage foundation models with a late-fusion architecture to support image, text, and tabular inputs in a single pipeline.
  • Provide a practical benchmark suite and showcase AutoMM’s performance across basic and advanced multimodal tasks.
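
The late-fusion idea mentioned above can be illustrated with a toy sketch: each modality is encoded on its own, and the resulting feature vectors are only combined afterwards by a shared prediction head. Everything here is an assumption made for illustration. The three `encode_*` functions are hypothetical stand-ins for real foundation-model backbones, the fusion step is a plain concatenation followed by a linear layer, and the sample values and weights are made up. It is not AutoMM's actual implementation.

```python
import math
import random


def encode_image(pixels):
    """Hypothetical image backbone: mean and max intensity as a 2-d feature."""
    return [sum(pixels) / len(pixels), max(pixels)]


def encode_text(text):
    """Hypothetical text backbone: token count and mean token length."""
    tokens = text.split()
    return [float(len(tokens)), sum(len(t) for t in tokens) / len(tokens)]


def encode_tabular(row):
    """Hypothetical tabular backbone: numeric columns passed through as floats."""
    return [float(v) for v in row]


def late_fusion_logits(sample, weights, bias):
    """Encode each modality independently, concatenate, then apply a linear head."""
    features = (
        encode_image(sample["image"])
        + encode_text(sample["text"])
        + encode_tabular(sample["tabular"])
    )
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up sample with one value per modality.
sample = {
    "image": [0.1, 0.5, 0.9, 0.3],
    "text": "friendly cat needs home",
    "tabular": [3, 1],
}

# Randomly initialised fusion head: 6 fused features -> 3 classes.
rng = random.Random(0)
n_features, n_classes = 6, 3
weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_classes)]
bias = [0.0] * n_classes

probs = softmax(late_fusion_logits(sample, weights, bias))
print(probs)
```

Because the modalities only meet at the fusion step, a missing modality can in principle be handled by dropping its encoder and shrinking the head's input size, which is one reason late fusion suits a framework that must accept any mix of inputs.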

Proposed Method

  • Unified data format using Pandas DataFrame to accommodate images, text, and tabular modalities.
  • Three-line API (MultiModalPredictor) for rapid fine-tuning of foundation models on unimodal or multimodal data.
  • Late-fusion architecture with independent backbones per modality and a fusion module for multimodal integration.
  • Support for parameter-efficient fine-tuning (PEFT) like BitFit and LoRA to handle large foundation models.
  • Data processing pipeline powered by Lightning (PyTorch) with modality detectors, preprocessor, and per-modality data processors.
  • Deployment optimizations including offline artifact pre-saving, real-time inference path without Lightning, and NVIDIA TensorRT integration.
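
The three-line workflow listed above might look roughly like the sketch below. `MultiModalPredictor` is named in the source; the `label` argument and the `fit` / `predict` calls follow AutoGluon's public API as commonly documented, but the wrapper function `fit_and_predict` and its parameter names are hypothetical, and the assumption that image columns hold file paths should be checked against the installed version.

```python
def fit_and_predict(train_df, test_df, label_column):
    """Fine-tune a model on train_df and return predictions for test_df.

    Both arguments are assumed to be pandas DataFrames whose columns may mix
    image paths, free text, and numeric/categorical values. label_column is
    the name of the target column in train_df.
    """
    # Imported lazily so this sketch can be loaded without AutoGluon installed.
    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label=label_column)  # 1: declare the target
    predictor.fit(train_df)                              # 2: fine-tune
    return predictor.predict(test_df)                    # 3: run inference
```

A call such as `fit_and_predict(train_df, test_df, "label")` would cover the whole train-and-predict cycle; no model or modality is specified by the caller here.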

Experimental Results

Research Questions

  • RQ1: Can a unified AutoML framework effectively handle diverse multimodal data (image, text, tabular) across a range of tasks?
  • RQ2: How does AutoMM perform on basic unimodal/multimodal classification and regression compared to Auto-Keras and task-specific toolboxes?
  • RQ3: What is AutoMM’s effectiveness on advanced tasks like semantic matching, object detection, and semantic segmentation?
  • RQ4: How do PEFT techniques and offline hyperparameter selections impact training efficiency and performance for large foundation models?
  • RQ5: What benchmarks and model zoos enable robust, end-to-end multimodal AutoML experiments?
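
To make the PEFT question concrete, the arithmetic below compares trainable-parameter counts for a single linear layer under full fine-tuning, LoRA, and BitFit. The formulas are the standard ones (LoRA adds two low-rank matrices of shapes d_out × r and r × d_in next to a frozen weight; BitFit updates only bias terms). The layer size 768 and rank 8 are illustrative assumptions, not values reported in the paper.

```python
def full_finetune_params(d_in, d_out):
    """All weights and biases of one linear layer are trainable."""
    return d_in * d_out + d_out


def lora_params(d_in, d_out, rank):
    """Only the low-rank factors B (d_out x r) and A (r x d_in) are trainable."""
    return rank * (d_in + d_out)


def bitfit_params(d_out):
    """Only the bias vector is trainable."""
    return d_out


# Illustrative layer: a 768 x 768 projection with LoRA rank 8 (assumed values).
d_in = d_out = 768
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
bitfit = bitfit_params(d_out)

print(f"full fine-tuning: {full} trainable parameters")
print(f"LoRA (r={rank}):   {lora} ({lora / full:.2%} of full)")
print(f"BitFit:           {bitfit} ({bitfit / full:.2%} of full)")
```

For this assumed layer, both methods train only a few percent or less of the parameters, which is the mechanism that lets large backbones be fine-tuned under a small parameter budget.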

Key Findings

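| Dataset | Text | Image | Tabular | Problem Type | Metric | Auto-Keras | AutoMM |
|---|---|---|---|---|---|---|---|
| fashion_mnist | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.876 (0.020) | 0.953 (0.002) |
| food101 | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.024 (0.045) | 0.937 (0.001) |
| Stanford_cars | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.055 (0.079) | 0.892 (0.002) |
| magnetic_tile_defects | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.627 (0.171) | 0.956 (0.014) |
| European_flood_depth | ✗ | ✓ | ✗ | Binary | F1 ↑ | 0.750 (0.017) | 0.790 (0.008) |
| Oxford_flowers | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.123 (0.155) | 0.989 (0.003) |
| OxfordIIITPet | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.157 (0.283) | 0.958 (0.003) |
| CD18_cellphone | ✗ | ✓ | ✗ | Regression | R^2 ↑ | -18.390 (35.120) | -1.843 (4.477) |
| HAM10000 | ✗ | ✓ | ✗ | Multiclass | F1_weighted ↑ | 0.276 (0.211) | 0.608 (0.014) |
| hateful_meme | ✓ | ✓ | ✗ | Binary | F1 ↑ | 0.572 (0.099) | 0.596 (0.013) |
| petfinder | ✓ | ✓ | ✓ | Multiclass | F1_weighted ↑ | 0.243 (0.040) | 0.408 (0.006) |
| memotion | ✓ | ✓ | ✓ | Multiclass | F1_weighted ↑ | 0.297 (0.026) | 0.467 (0.013) |
| financial_news | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.678 (0.027) | 0.874 (0.010) |
| MLDoc-11000 | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.916 (0.006) | 0.978 (0.002) |
| gnad10 | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.521 (0.029) | 0.899 (0.006) |
| MultiATIS-5000 | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.864 (0.010) | 0.990 (0.003) |
| fb_dialog | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.982 (0.003) | 0.992 (0.001) |
| SNIPS | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.049 (0.018) | 0.990 (0.002) |
| ag_news | ✓ | ✗ | ✗ | Multiclass | F1_weighted ↑ | 0.887 (0.004) | 0.944 (0.001) |
| airbnb_melbourn | ✓ | ✗ | ✓ | Multiclass | F1_weighted ↑ | 0.198 (0.071) | 0.397 (0.011) |
| kick_start_funding | ✓ | ✗ | ✓ | Binary | F1 ↑ | 0.401 (0.151) | 0.609 (0.005) |
| cloth_review | ✓ | ✗ | ✓ | Regression | R^2 ↑ | 0.542 (0.053) | 0.735 (0.004) |
| news_popularity | ✓ | ✗ | ✓ | Regression | R^2 ↑ | -1.306 (1.863) | 0.014 (0.003) |
| California_house | ✓ | ✗ | ✓ | Regression | R^2 ↑ | -53757156.425 (55682587.109) | 0.944 (0.001) |
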
  • AutoMM outperforms Auto-Keras on 24 unimodal/multimodal classification/regression tasks with statistical significance.
  • AutoMM achieves competitive results with state-of-the-art toolboxes on semantic matching and semantic segmentation tasks.
  • In object detection, AutoMM surpasses baseline AutoML solutions in accuracy and speed while offering easier usage.
  • AutoMM semantic segmentation results show strong performance with low-parameter budgets due to parameter-efficient fine-tuning (e.g., SAM-based approaches).
  • The framework supports easy deployment with offline predictor loading, real-time inference path, and TensorRT integration for low-latency inference.
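
Some R^2 entries in the table are strongly negative, which can look like a reporting error. The worked example below shows that R^2 = 1 - SS_res / SS_tot has no lower bound: it is 0 for a model that always predicts the mean and becomes arbitrarily negative when predictions are far off in scale. The numbers are made up for illustration, and the source does not state why any particular baseline run scored as it did.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Close predictions give a score near 1.
r2_good = r2_score(y_true, [1.1, 1.9, 3.2, 3.8])

# Always predicting the mean of y_true gives exactly 0.
r2_mean = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions on the wrong scale give a large negative score.
r2_off_scale = r2_score(y_true, [1000.0, 1000.0, 1000.0, 1000.0])

print(r2_good, r2_mean, r2_off_scale)
```

Read this way, a negative R^2 means the model did worse than a constant mean predictor on that dataset, and the magnitude grows with the squared size of the errors relative to the spread of the targets.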


This review was generated by AI and checked by a human editor.