Joint AI Seminar / Doctoral Speaking Skills Talk - Victor Akinwande
Time: 1:00pm
Location: In Person - Newell-Simon 3305
Speaker: Victor Akinwande, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://home.victorakinwande.com/
Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments.
This talk presents an alternative vision-language architecture, called HyperCLIP, that pairs a small image encoder with a hypernetwork that dynamically adapts the image encoder's weights to each new set of text inputs.
With a trained HyperCLIP model, we can generate new zero-shot, deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.
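To make the hypernetwork idea concrete, the following is a minimal, illustrative sketch of how text embeddings could be mapped to task-specific weights for a small image encoder. It is not the authors' implementation: the module names, embedding dimension, and the number of adapted parameters are assumptions chosen only for illustration.

```python
# Minimal sketch of a hypernetwork that turns text embeddings into
# image-encoder weights. All names and sizes below are illustrative
# assumptions, not details from the talk.
import torch
import torch.nn as nn

EMB_DIM = 256          # shared text/image embedding dimension (assumed)
ADAPTED_PARAMS = 128   # number of image-encoder parameters to generate (assumed)

class HyperNetwork(nn.Module):
    """Maps pooled text embeddings for a task to a vector of encoder weights."""
    def __init__(self, emb_dim: int, n_params: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_params),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Pool over the task's class prompts, then emit one weight vector.
        return self.mlp(text_emb.mean(dim=0, keepdim=True))

# A single forward pass through the (pretrained) text encoder and the
# hypernetwork yields task-specific weights; the small image encoder is
# then deployed with those weights for zero-shot classification.
text_embeddings = torch.randn(10, EMB_DIM)        # e.g., 10 class prompts (dummy data)
hypernet = HyperNetwork(EMB_DIM, ADAPTED_PARAMS)
task_weights = hypernet(text_embeddings)          # shape: (1, ADAPTED_PARAMS)
print(task_weights.shape)
```

In this sketch, only a small slice of the image encoder's parameters is generated per task, which is one plausible way to keep the deployed vision component compact while still conditioning it on the text inputs.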
Presented in Partial Fulfillment of the CSD Speaking Skills Requirement
Event Website: https://www.cs.cmu.edu/~aiseminar/