Ferret: An End-to-End MLLM by Apple

Ferret: Refer and Ground Anything Anywhere at Any Granularity

An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response. [Paper]

Haoxuan You * Haotian Zhang * Zhe Gan Xianzhi Du Bowen Zhang Zirui Wang Liangliang Cao Shih-Fu Chang Yinfei Yang
[*: equal contribution]

Introduction

Diagram of Ferret Model.

Key Contributions:

  • Ferret Model – Hybrid Region Representation + Spatial-aware Visual Sampler enable fine-grained and open-vocabulary referring and grounding in MLLM.
  • GRIT Dataset (~1.1M) – A Large-scale, Hierarchical, Robust ground-and-refer instruction tuning dataset.
  • Ferret-Bench – A multimodal evaluation benchmark that jointly requires Referring/Grounding, Semantics, Knowledge, and Reasoning.

Release

Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Contents

Install

  1. Clone this repository and navigate to the FERRET folder

git clone https://github.com/apple/ml-ferret
cd ml-ferret
  2. Install Package

conda create -n ferret python=3.10 -y
conda activate ferret
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0
  3. Install additional packages for training cases

pip install ninja
pip install flash-attn --no-build-isolation

Train

FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
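For instance, halving the GPU count can be offset by doubling the accumulation steps. A quick sanity check of the arithmetic (the per-device batch size of 16 here is an illustrative assumption, not a value taken from the training scripts):

```shell
# Global batch size must stay constant at 128:
# per_device_train_batch_size x gradient_accumulation_steps x num_gpus
per_device_train_batch_size=16   # assumed value for illustration
gradient_accumulation_steps=2    # doubled to compensate for fewer GPUs
num_gpus=4                       # half of the default 8 A100s
echo $((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))  # prints 128
```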

Hyperparameters

We use a similar set of hyperparameters as LLaVA (Vicuna) in finetuning.

Hyperparameter   Global Batch Size   Learning rate   Epochs   Max length   Weight decay
FERRET-7B        128                 2e-5            3        2048         0
FERRET-13B       128                 2e-5            3        2048         0

Prepare Vicuna checkpoint and LLaVA’s projector

Before training, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions here. Vicuna v1.3 is used in FERRET.

Download LLaVA’s first-stage pre-trained projector weights (7B, 13B).

FERRET Training

The training scripts are provided (7B, 13B).

Evaluation

Please see this doc for the details.

Checkpoints

We extracted the delta between our pre-trained model and Vicuna. Please first download the weights of Vicuna following the previous instruction. Then download our prepared offsets of weights (7B, 13B) using wget or curl, and unzip the downloaded offsets. Finally, apply the offset to Vicuna’s weights by running the following script:

# 7B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-7b-v1-3 \
    --target ./model/ferret-7b-v1-3 \
    --delta path/to/ferret-7b-delta
# 13B
python3 -m ferret.model.apply_delta \
    --base ./model/vicuna-13b-v1-3 \
    --target ./model/ferret-13b-v1-3 \
    --delta path/to/ferret-13b-delta

Notices: Apple’s rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMA or any other third-party software, which are subject to their own terms.

Please refer to the next section on how to set up a local demo with pre-trained weights.

Demo

To run our demo, you need to train FERRET and use the checkpoints locally. A Gradio web UI is used. Please run the following commands one by one.

Launch a controller

python -m ferret.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server.

python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --add_region_feature

Launch a model worker

This is the worker that loads the checkpoint and performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/FERRET-13B-v0 --add_region_feature

Wait until the process finishes loading the model and you see “Uvicorn running on …”. Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.

Example of Ferret Interactive Demo.

Citation

If you find Ferret useful, please cite using this BibTeX:

@article{you2023ferret,
  title={Ferret: Refer and Ground Anything Anywhere at Any Granularity},
  author={You, Haoxuan and Zhang, Haotian and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zirui and Cao, Liangliang and Chang, Shih-Fu and Yang, Yinfei},
  journal={arXiv preprint arXiv:2310.07704},
  year={2023}
}

Acknowledgement

  • LLaVA: the codebase we built on.
  • Vicuna: the LLM codebase.
