QUICK REVIEW

[Paper Review] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime

Haiyu Yang, Sumit Sharma | arXiv (Cornell University) | Mar 18, 2026
Smart Agriculture and AI | Citations: 0
One-Sentence Summary

The paper systematically compares training from scratch, frozen feature extraction, and PEFT (QLoRA and DoRA) of DINOv3 on nine dairy-cow behaviors, showing that PEFT reaches the highest test accuracy of the compared approaches (83.16%) while training only 2.72% of the parameters in about a third of the training time, under a 98:1 test-to-train regime.

ABSTRACT

Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.

Motivation and Objectives

  • Assess generalization of vision models to agricultural imagery with limited labeled data.
  • Systematically compare three learning paradigms: training from scratch, frozen feature extraction, and PEFT on a billion-parameter foundation model.
  • Evaluate PEFT hyperparameters (rank and target modules) for QLoRA and DoRA.
  • Provide practical deployment guidelines for industrial livestock monitoring using PEFT-based foundation models.

Proposed Method

  • Fine-tune DINOv3 (6.7B params) with PEFT using QLoRA and DoRA.
  • Quantize backbone to 4-bit and inject low-rank adapters; vary rank r in {8,16,64} and target modules {q_proj, all-linear}.
  • Train for 80 epochs with batch size 4 (effective batch size 32 via gradient accumulation), learning rate 1e-4, and warmup followed by cosine annealing; use mixed precision and gradient checkpointing.
  • Data: 2,160 training images (an 80% per-class split across 9 behaviors), with augmentation applied to the training set; 540 validation images (the remaining 20%); and 211,800 test samples from two sources (MMCows, PlayBehaviour).
  • Evaluation: accuracy, weighted F1-score, per-class metrics; latency and throughput for inference.
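
The training settings listed above imply a few derived quantities that are easy to get wrong, so here is a minimal sketch of the arithmetic and of a warmup-plus-cosine learning-rate schedule. The batch sizes, epoch count, image count, and peak learning rate come from the review; the warmup fraction, the decay-to-zero floor, and the assumption that one epoch covers exactly 2,160 images (augmentation may change this) are illustrative guesses, not values reported by the authors.

```python
import math

TRAIN_IMAGES = 2160     # from the review
MICRO_BATCH = 4         # from the review
EFFECTIVE_BATCH = 32    # from the review
EPOCHS = 80             # from the review
PEAK_LR = 1e-4          # from the review
WARMUP_FRACTION = 0.05  # ASSUMPTION: warmup length is not given in the source

# Micro-batches accumulated before each optimizer step.
accum_steps = EFFECTIVE_BATCH // MICRO_BATCH
# Optimizer steps per epoch (ASSUMPTION: one epoch = 2,160 images).
steps_per_epoch = math.ceil(TRAIN_IMAGES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(WARMUP_FRACTION * total_steps)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("accumulation steps:", accum_steps)
    print("optimizer steps per epoch:", steps_per_epoch)
    print("total optimizer steps:", total_steps)
    for s in (0, warmup_steps, total_steps // 2, total_steps - 1):
        print(f"step {s:5d}: lr = {lr_at(s):.2e}")
```

Under these assumptions, an effective batch of 32 from micro-batches of 4 means 8 accumulation steps, and 80 epochs correspond to 5,440 optimizer steps. A different warmup length only shifts where the cosine phase begins.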

Experimental Results

Research Questions

  • RQ1: Can PEFT enable competitive performance on billion-parameter vision models with very limited training data in agricultural image classification?
  • RQ2: How do QLoRA and DoRA compare in terms of accuracy, training efficiency, and stability under different adapter configurations?
  • RQ3: What is the effect of adapter capacity (rank) and scope (q_proj vs. all-linear) on generalization vs. overfitting in this domain?
  • RQ4: Do frozen pretrained feature extractors or models trained from scratch offer advantages over PEFT for livestock behavior classification?
  • RQ5: What practical guidelines emerge for deploying PEFT-based foundation models in precision livestock farming?

Key Findings

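| Method | Target | Rank | Trainable Params | Training Time | Test Acc | Test F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 (scratch) | - | - | 11.2M (100%) | 16h 45m | 72.87% | 0.7526 |
| ViT-Small (scratch) | - | - | 21.7M (100%) | 18h 39m | 61.91% | 0.6600 |
| DINOv3 (frozen) | - | - | 4.7M (0.07%) | 17h 27m | 76.56% | 0.7691 |
| QLoRA | q_proj | 8 | 2.6M (0.04%) | 6h 32m | 77.17% | 0.7646 |
| QLoRA | q_proj | 16 | 5.2M (0.08%) | 7h 16m | 78.38% | 0.7753 |
| QLoRA | all-linear | 16 | 46.8M (0.70%) | 4h 43m | 80.40% | 0.8069 |
| QLoRA | all-linear | 64 | 183.0M (2.72%) | 5h 46m | 83.16% | 0.8380 |
| DoRA | q_proj | 8 | 2.8M (0.04%) | 11h 31m | 81.53% | 0.8182 |
| DoRA | q_proj | 16 | 5.4M (0.08%) | 10h 27m | 81.03% | 0.8153 |
| DoRA | all-linear | 16 | 48.4M (0.72%) | 11h 51m | 81.23% | 0.8139 |
| DoRA | all-linear | 64 | 184.5M (2.75%) | 10h 59m | 83.14% | 0.8338 |
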
  • PEFT substantially outperforms training from scratch and frozen feature extraction on nine-class dairy-cow behavior classification; the best QLoRA configuration (all-linear, rank=64) reaches 83.16% test accuracy, versus 76.56% for frozen DINOv3, 72.87% for ResNet-18, and 61.91% for ViT-Small.
  • DoRA matches that accuracy (83.14% with all-linear, rank=64) but takes roughly twice as long to train (10h 59m versus 5h 46m).
  • For QLoRA, every increase in adapter capacity (higher rank or broader target modules) raised test accuracy (77.17% → 78.38% → 80.40% → 83.16%), which points to underfitting rather than overfitting as the main challenge. DoRA is less sensitive to capacity: its three smaller configurations fall between 81.03% and 81.53%, and only all-linear with rank=64 improves on them.
  • Both methods reach about 83% test accuracy while training under 3% of the model's parameters (2.72% for QLoRA, 2.75% for DoRA); the smallest adapters (0.04%) reach 77.17% with QLoRA and 81.53% with DoRA.
  • QLoRA runs finish in 4h 43m to 7h 16m, compared with 16h 45m to 18h 39m for the from-scratch and frozen-feature baselines; DoRA runs take 10h 27m to 11h 51m.
  • DoRA's accuracy is more stable across configurations (81.03% to 83.14%) than QLoRA's (77.17% to 83.16%).
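
The trainable-parameter counts for the q_proj rows can be roughly sanity-checked from the adapter shapes alone. The sketch below uses the standard LoRA count (two low-rank factors per adapted linear layer) and, for DoRA, adds one learned magnitude value per output feature, as common implementations do. The backbone depth (40 blocks) and projection width (4096) are assumptions, not values given in the review; they were chosen because they reproduce the reported figures, and the names `NUM_BLOCKS`, `HIDDEN_DIM`, `lora_params`, and `dora_params` are hypothetical.

```python
NUM_BLOCKS = 40    # ASSUMPTION: transformer depth, not stated in the review
HIDDEN_DIM = 4096  # ASSUMPTION: q_proj input/output width, not stated in the review


def lora_params(rank: int, d_in: int = HIDDEN_DIM, d_out: int = HIDDEN_DIM,
                layers: int = NUM_BLOCKS) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted layer."""
    return layers * rank * (d_in + d_out)


def dora_params(rank: int, d_in: int = HIDDEN_DIM, d_out: int = HIDDEN_DIM,
                layers: int = NUM_BLOCKS) -> int:
    """DoRA keeps the LoRA factors and adds a magnitude vector of size d_out."""
    return lora_params(rank, d_in, d_out, layers) + layers * d_out


if __name__ == "__main__":
    for rank in (8, 16):
        print(f"q_proj rank={rank}: "
              f"LoRA {lora_params(rank) / 1e6:.1f}M, "
              f"DoRA {dora_params(rank) / 1e6:.1f}M")
```

With these assumed dimensions the sketch gives 2.6M and 5.2M for QLoRA and 2.8M and 5.4M for DoRA, in line with the q_proj rows of the table. The all-linear rows depend on the shapes of every linear layer in the backbone, which the review does not list, so they are not reproduced here.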

This review was generated by AI and checked by a human editor.