Interested in Project 10 and have some clarifying questions (Fine-tuning Vision Language Models (VLMs) for Object Detection and Hierarchical Classification using the OpenVINO Ecosystem) #29355

Jyc323 · 2025-03-09T20:29:48Z

Jyc323
Mar 9, 2025

Dear Rajesh Gangireddy, Laurens Hogeweg, and Samet Akcay,

I’m Jiayu Li, and I have hands-on experience in deep learning, multimodal learning, computer vision, and object detection. I’m very interested in Project 10 – Fine-tuning Vision Language Models (VLMs) for Object Detection and Hierarchical Classification using the OpenVINO Ecosystem.

As I prepare my proposal, I’d love to clarify a few details to ensure I’m aligned with the project’s goals:

Should the focus be on fine-tuning efficiency (e.g., LoRA, QLoRA) or optimizing model accuracy?
What interpretability techniques are preferred for this project? (e.g., Grad-CAM, SHAP, attention visualization)
Are domain-specific datasets required, or would general-purpose object detection datasets like COCO, VOC, or ImageNet suffice? I personally prefer autonomous driving datasets, like KITTI.
Are there any baseline models or past experiments I should reference?
To make my proposal stand out, what key questions should I address, or would developing a prototype be beneficial?

I’d really appreciate any guidance you can provide, and I look forward to discussing how I can contribute effectively to this project.

Best regards,
Jiayu Li

@adrianboguszewski @mlukasze Could you please help connect me with the mentors?

adrianboguszewski · 2025-03-10T09:11:56Z

adrianboguszewski
Mar 10, 2025
Collaborator

@rajeshgangireddy @samet-akcay this is your potential contributor :)

rajeshgangireddy · 2025-03-17T13:55:18Z

rajeshgangireddy
Mar 17, 2025

Hi @Jyc323,
Thank you for the questions.
Sorry for the delayed response; I just returned from a short break.

Ideally both :) since there is a trade-off between the training cost (time and resources) and the accuracy. However, we will prioritise model accuracy over training efficiency in the initial stages. Do remember that the fine-tuning should preferably fit on consumer-grade GPUs (~24GB-40GB of VRAM).
Existing SOTA and widely used methods would be great. You might find some methods in our XAI repo here - https://github.com/openvinotoolkit/openvino_xai. It would be great if we can focus on these lightweight algorithms. Note that interpretability/explainability is a bit of an extra "nice-to-have" step in this project and can be pursued depending on the time available.
KITTI is a great dataset to start with. We also prefer experimenting with medical (e.g., KVASIR dataset, Chest-XRay, etc.), satellite, and biodiversity datasets (CUB200, mini versions of iNaturalist 2017, etc.).
Datasets like ImageNet are not interesting as most of the VLMs have been extensively pre-trained directly on such datasets.
We recently started this topic and hence do not have results to share yet.
For your proposal: please keep in mind the choice of VLMs (the smaller the better, while still achieving high accuracy), the choice of fine-tuning methods (must fit in a reasonable amount of GPU memory), and the choice of datasets for hierarchical classification and object detection (we are testing how well VLMs' capabilities can be used for tasks where classes are fine-grained and/or belong to a somewhat niche domain on which the VLMs might not have been pre-trained). Please also consider what evaluation criteria would be good for evaluating hierarchical classification.
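
As one possible starting point for the evaluation question, here is a minimal sketch of set-based hierarchical precision, recall, and F1: each label is expanded to itself plus its ancestors, and predicted and true sets are compared, so a near miss within the same branch gets partial credit. The taxonomy, the `PARENT` mapping, and the function names are hypothetical illustrations, not taken from any dataset or codebase mentioned in this thread.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); purely illustrative.
PARENT: Dict[str, Optional[str]] = {
    "animal": None,
    "bird": "animal",
    "sparrow": "bird",
    "warbler": "bird",
    "mammal": "animal",
    "dog": "mammal",
}


def ancestors_and_self(label: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the label plus all of its ancestors, excluding the root node."""
    nodes: Set[str] = set()
    node: Optional[str] = label
    while node is not None and parent[node] is not None:
        nodes.add(node)
        node = parent[node]
    return nodes


def hierarchical_prf(
    pairs: Iterable[Tuple[str, str]], parent: Dict[str, Optional[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged hierarchical precision, recall, F1 over (true, pred) pairs."""
    overlap = pred_total = true_total = 0
    for true_label, pred_label in pairs:
        true_set = ancestors_and_self(true_label, parent)
        pred_set = ancestors_and_self(pred_label, parent)
        overlap += len(true_set & pred_set)
        pred_total += len(pred_set)
        true_total += len(true_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # A sparrow predicted as a warbler still shares the "bird" ancestor.
    print(hierarchical_prf([("sparrow", "warbler"), ("dog", "dog")], PARENT))
```

Compared with flat top-1 accuracy, this kind of metric separates "wrong species, right family" from "wrong branch entirely", which may matter for fine-grained datasets; whether it is the right criterion for this project is an open choice.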

Let us know if you have more questions.
Looking forward to your proposal.
Thanks.

Jyc323 · 2025-03-30T20:00:42Z

Jyc323
Mar 30, 2025
Author

Hi @rajeshgangireddy,
Thanks a lot for your suggestions and comments. I just sent you an email ([email protected]) from [email protected] with the draft of my GSoC proposal for the Fine-tuning Vision-Language Models using OpenVINO project. I’d be very grateful if you could take a look and share any feedback when you have time. Thank you!

PraroopChanda · 2025-04-01T20:18:32Z

PraroopChanda
Apr 1, 2025

Hi @rajeshgangireddy @adrianboguszewski @mlukasze

Hope you're doing well!

I am Praroop, currently a master's student at Texas A&M, focused on computer vision, multimodal learning, and generative AI.
I am highly interested in Project 10 (Fine-tuning Vision Language Models (VLMs) for Object Detection and Hierarchical Classification using the OpenVINO Ecosystem).

I did some preliminary research and settled on GroundDINO. I set up the code base and ran a small fine-tune on the KITTI dataset, training only the decoder layer.

I used an NVIDIA A100 GPU; with a small batch size of 6, training used about 9-10 GB of VRAM.
You can find the GitHub repo with the setup and initial detection results here: - https://github.com/PraroopChanda/GroundDINO_FineTune

Next, I am planning to:

Integrate LoRA and QLoRA for fine-tuning,
Try fine-tuning on other datasets,
Explore other models such as OWL-ViT and DETR for fine-tuning and classification, and
Investigate the use of hyperbolic embeddings for hierarchical classification.
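
For readers less familiar with LoRA, the idea behind the first item in the plan above can be sketched in a few lines: the pretrained weight `W` stays frozen and only a low-rank update `B @ A` is trained, so the trainable parameter count drops from `d_out * d_in` to `rank * (d_in + d_out)`. The class below is a dependency-free illustration with hypothetical names and toy sizes; it is not the code in the linked repository and not the API of any fine-tuning library.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Toy linear layer y = W x + (alpha / rank) * B (A x), with W frozen."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 16.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        # A starts as small random values and B as zeros, so the low-rank
        # update contributes nothing before training begins.
        self.A: Matrix = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B: Matrix = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self) -> int:
        """Count only the entries of A and B; W is excluded because it is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    # Hypothetical 1024x1024 projection: full fine-tuning vs. rank-8 LoRA.
    d_in = d_out = 1024
    rank = 8
    print("full:", d_in * d_out, "lora:", rank * (d_in + d_out))
```

QLoRA applies the same low-rank update on top of a quantized frozen base model, which is what makes it attractive when VRAM is the limiting factor.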

Would really love to know your thoughts on this.
Attaching a couple of preliminary visual results below:

Best,
Praroop Chanda
https://praroopchanda.github.io/

2 replies

rajeshgangireddy Apr 8, 2025

Hi @PraroopChanda
I have replied to your email with my feedback.

PraroopChanda Apr 8, 2025

Thanks a lot Rajesh!
