[Paper Review] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime
The paper systematically compares training from scratch, frozen feature extraction, and PEFT (QLoRA and DoRA) of DINOv3 on nine dairy-cow behaviors, showing that PEFT reaches the highest test accuracy of the three approaches (83.16%) with under 3% of parameters trained and a shorter training time, under a 98:1 test-to-train regime.
Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
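The headline ratios in the abstract can be reproduced from the reported counts. The short Python check below uses only numbers stated above; the variable names are illustrative.

```python
# Counts and sizes as reported in the abstract.
train_images = 2_160
test_samples = 211_800
trainable_params = 183.0e6  # best QLoRA configuration (all-linear, rank 64)
total_params = 6.7e9        # DINOv3 backbone, as rounded in the abstract

# Test samples per training image.
test_to_train = test_samples / train_images
# Share of parameters that receive gradient updates, in percent.
trainable_share = 100 * trainable_params / total_params

print(f"test-to-train ratio: {test_to_train:.1f}:1")
print(f"trainable share: {trainable_share:.2f}%")
```

The computed share lands within a few hundredths of a percentage point of the reported 2.72%; the small gap is expected because the 6.7 billion total is itself a rounded figure.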
Motivation and Objectives
- Assess generalization of vision models to agricultural imagery with limited labeled data.
- Systematically compare three learning paradigms: training from scratch, frozen feature extraction, and PEFT on a billion-parameter foundation model.
- Evaluate PEFT hyperparameters (rank and target modules) for QLoRA and DoRA.
- Provide practical deployment guidelines for industrial livestock monitoring using PEFT-based foundation models.
Proposed Method
- Fine-tune DINOv3 (6.7B params) with PEFT using QLoRA and DoRA.
- Quantize backbone to 4-bit and inject low-rank adapters; vary rank r in {8,16,64} and target modules {q_proj, all-linear}.
- Train for 80 epochs with batch size 4 (effective batch size 32 via gradient accumulation), learning rate 1e-4, warmup, and cosine annealing; use mixed precision and gradient checkpointing.
- Data: 2,160 training images (an 80% split of each of the 9 behavior classes), with augmentation applied to the training set; 540 validation images and 211,800 test samples from two sources (MMCows, PlayBehaviour).
- Evaluation: accuracy, weighted F1-score, per-class metrics; latency and throughput for inference.
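
The optimization settings listed above imply a fixed number of optimizer updates. The following is a minimal sketch of that schedule, not the authors' training code. It assumes a linear warmup over 5% of the updates followed by cosine decay to zero (the review does not give the warmup length, so `warmup_fraction` is a hypothetical value), and it assumes augmentation is applied on the fly so that one epoch covers the 2,160 training images.

```python
import math

train_images = 2_160
micro_batch = 4
effective_batch = 32
epochs = 80
peak_lr = 1e-4
warmup_fraction = 0.05  # hypothetical: the warmup length is not stated in the review

# Gradient accumulation: micro-batches processed per optimizer update.
accum_steps = effective_batch // micro_batch
# Optimizer updates per epoch, counting a final partial batch.
steps_per_epoch = math.ceil(train_images / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(warmup_fraction * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to peak_lr, then cosine annealing toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(accum_steps, steps_per_epoch, total_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps - 1):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions each optimizer update accumulates 8 micro-batches, and the learning rate peaks at 1e-4 when warmup ends.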
Experimental Results
Research Questions
- RQ1: Can PEFT enable competitive performance on billion-parameter vision models with very limited training data in agricultural image classification?
- RQ2: How do QLoRA and DoRA compare in terms of accuracy, training efficiency, and stability under different adapter configurations?
- RQ3: What is the effect of adapter capacity (rank) and scope (q_proj vs. all-linear) on generalization vs. overfitting in this domain?
- RQ4: Do frozen pretrained feature extractors or models trained from scratch offer advantages over PEFT for livestock behavior classification?
- RQ5: What practical guidelines emerge for deploying PEFT-based foundation models in precision livestock farming?
Key Findings
| Method | Target | Rank | Trainable Params | Training Time | Test Acc | Test F1 |
|---|---|---|---|---|---|---|
| ResNet-18 (scratch) | — | — | 11.2M (100%) | 16h 45m | 72.87% | 0.7526 |
| ViT-Small (scratch) | — | — | 21.7M (100%) | 18h 39m | 61.91% | 0.6600 |
| DINOv3 (frozen) | — | — | 4.7M (0.07%) | 17h 27m | 76.56% | 0.7691 |
| QLoRA | q_proj | 8 | 2.6M (0.04%) | 6h 32m | 77.17% | 0.7646 |
| QLoRA | q_proj | 16 | 5.2M (0.08%) | 7h 16m | 78.38% | 0.7753 |
| QLoRA | all-linear | 16 | 46.8M (0.70%) | 4h 43m | 80.40% | 0.8069 |
| QLoRA | all-linear | 64 | 183.0M (2.72%) | 5h 46m | 83.16% | 0.8380 |
| DoRA | q_proj | 8 | 2.8M (0.04%) | 11h 31m | 81.53% | 0.8182 |
| DoRA | q_proj | 16 | 5.4M (0.08%) | 10h 27m | 81.03% | 0.8153 |
| DoRA | all-linear | 16 | 48.4M (0.72%) | 11h 51m | 81.23% | 0.8139 |
| DoRA | all-linear | 64 | 184.5M (2.75%) | 10h 59m | 83.14% | 0.8338 |
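
The "Trainable Params" figures in the q_proj rows are consistent with standard LoRA bookkeeping: a rank-r adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights, and DoRA adds one learned magnitude vector per adapted layer on top. The sketch below assumes the backbone has 40 transformer blocks with hidden size 4096 and a square q_proj in each block. Those dimensions are not given in this review and are assumptions, as are the function names.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA factors plus a learned magnitude vector with one value per output feature."""
    return lora_params(d_in, d_out, rank) + d_out


# Assumed backbone shape (not stated in the review).
HIDDEN = 4096
BLOCKS = 40

for rank in (8, 16):
    qlora = BLOCKS * lora_params(HIDDEN, HIDDEN, rank)
    dora = BLOCKS * dora_params(HIDDEN, HIDDEN, rank)
    print(f"rank={rank}: QLoRA {qlora / 1e6:.1f}M, DoRA {dora / 1e6:.1f}M")
```

Under these assumptions the counts come out at 2.6M and 2.8M for rank 8 and 5.2M and 5.4M for rank 16, matching the table to one decimal place. The all-linear rows depend on projection and MLP dimensions that the review does not report, so they are not reproduced here.
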
- PEFT substantially outperforms training from scratch and frozen feature extraction on nine-class dairy-cow behavior classification; the best QLoRA configuration (all-linear, rank=64) achieves 83.16% test accuracy.
- DoRA achieves comparable performance (83.14% test accuracy) but takes roughly twice as long to train as the best QLoRA setup (10h 59m versus 5h 46m).
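- For QLoRA, increasing adapter capacity (higher rank or broader target modules) improves test accuracy at every step (77.17% → 78.38% → 80.40% → 83.16%) with no sign of overfitting, indicating that underfitting is the main challenge. DoRA stays between 81.03% and 81.53% for its three smaller configurations and improves only at all-linear, rank=64 (83.14%).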
- With the largest adapters (all-linear, rank=64), QLoRA and DoRA reach about 83% test accuracy while training under 3% of the parameters (2.72% and 2.75%); the smallest q_proj adapters (0.04%) reach 77.17% and 81.53%, respectively.
- Training time for PEFT configurations is substantially reduced: the best QLoRA configuration trains in 5h 46m and DoRA configurations take 10h 27m to 11h 51m, versus 16h 45m to 18h 39m for the from-scratch and frozen-feature baselines.
- QLoRA and DoRA differ in their sensitivity to adapter choices: DoRA's test accuracy varies by about 2 points across configurations (81.03% to 83.14%), while QLoRA's varies by about 6 points (77.17% to 83.16%).
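
The sensitivity of each method to its adapter configuration can be checked directly against the results table. The snippet below re-enters the eight PEFT rows by hand and reports the accuracy range and mean training time per method; the data structure and function name are illustrative.

```python
# (method, target, rank, training hours, test accuracy %) copied from the results table.
ROWS = [
    ("QLoRA", "q_proj", 8, 6 + 32 / 60, 77.17),
    ("QLoRA", "q_proj", 16, 7 + 16 / 60, 78.38),
    ("QLoRA", "all-linear", 16, 4 + 43 / 60, 80.40),
    ("QLoRA", "all-linear", 64, 5 + 46 / 60, 83.16),
    ("DoRA", "q_proj", 8, 11 + 31 / 60, 81.53),
    ("DoRA", "q_proj", 16, 10 + 27 / 60, 81.03),
    ("DoRA", "all-linear", 16, 11 + 51 / 60, 81.23),
    ("DoRA", "all-linear", 64, 10 + 59 / 60, 83.14),
]


def summarize(method: str) -> dict:
    """Accuracy range and mean training time across one method's configurations."""
    rows = [r for r in ROWS if r[0] == method]
    accs = [r[4] for r in rows]
    hours = [r[3] for r in rows]
    return {
        "min_acc": min(accs),
        "max_acc": max(accs),
        "spread": round(max(accs) - min(accs), 2),
        "mean_hours": round(sum(hours) / len(hours), 2),
    }


for method in ("QLoRA", "DoRA"):
    print(method, summarize(method))
```

On these rows QLoRA spans 5.99 accuracy points against 2.11 for DoRA, while DoRA's mean training time is close to double QLoRA's. With four configurations per method and a single run each, this is a descriptive comparison, not a statistical one.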
This review was generated by AI and checked by a human editor.