Thesis Defense: Visual Representation Learning from Synthetic Data

Speaker

Lijie Fan
MIT CSAIL

Host

Professor Dina Katabi
MIT CSAIL

Abstract: Representation learning is crucial for building robust vision systems, and its effectiveness depends largely on the quality and quantity of training data. Synthetic data offers unique advantages in flexibility, scalability, and controllability. Recent advances in generative modeling have made it possible to synthesize photorealistic images and high-quality text, greatly raising the profile of synthetic data. Despite these advances, the use of synthetic data for representation learning and visual recognition still lags behind, with a noticeable performance gap between models trained on synthetic data and those trained on real data. In this talk, I will present our recent efforts to close this gap and to train state-of-the-art representation models with synthetic data. I will begin with synthetic text from LLMs and how it enhances the training of vision-language models. Next, I will turn to synthetic images generated by text-to-image models, examining the scaling laws that emerge when these images are used for supervised training. I will also introduce a multi-positive contrastive loss designed specifically for synthetic images, showing that they can offer advantages over real images for representation learning. Finally, I will present a framework for training vision models exclusively on synthetic text and images; it surpasses state-of-the-art models trained on real images, such as DINOv2 and CLIP, on tasks including fine-grained classification and semantic segmentation. Together, these works lay a foundation for applying generative models to representation learning and key computer vision tasks, and mark a substantial step in using synthetic data to improve representation learning across the data-centric AI ecosystem.
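
For context, a multi-positive contrastive objective of the kind mentioned in the abstract can be sketched as follows, assuming that the several images generated from the same text prompt are treated as mutual positives; the function name, temperature value, and batching scheme here are illustrative assumptions rather than the speaker's actual implementation.

import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(features, prompt_ids, temperature=0.1):
    """Multi-positive contrastive objective over a batch of synthetic images.

    features:   (B, D) image embeddings, where B = num_prompts * images_per_prompt.
    prompt_ids: (B,)   index of the text prompt each image was generated from.
    Images generated from the same prompt are treated as mutual positives
    (an illustrative assumption, not necessarily the speaker's exact formulation).
    """
    z = F.normalize(features, dim=1)
    logits = z @ z.t() / temperature                       # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)           # exclude self-comparisons

    # Ground-truth assignment: uniform over the other images sharing the anchor's prompt.
    pos_mask = (prompt_ids.unsqueeze(0) == prompt_ids.unsqueeze(1)) & ~self_mask
    target = pos_mask.float()
    target = target / target.sum(dim=1, keepdim=True).clamp(min=1)

    # Cross-entropy between the assignment distribution and the softmax over similarities.
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Example: 4 prompts, 3 synthetic images per prompt, 128-dim embeddings.
feats = torch.randn(12, 128)
prompts = torch.arange(4).repeat_interleave(3)
loss = multi_positive_contrastive_loss(feats, prompts)

Relative to a standard single-positive InfoNCE loss, the only change is that the target distribution spreads probability mass uniformly over every image sharing the anchor's prompt, rather than over a single augmented view.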



Committee: Dina Katabi (Advisor, MIT CSAIL), Phillip Isola (MIT CSAIL), Dilip Krishnan (Google)



Short Bio: Lijie Fan is a final-year Computer Science PhD student at MIT CSAIL. He received his Bachelor's degree in Computer Science from Tsinghua University. His research focuses on machine learning and computer vision, with a particular interest in learning from synthetic data and large-scale training of vision-language models. He has published over 10 first-author papers at leading ML and CV conferences, including NeurIPS, CVPR, ICCV, and ECCV. His research has been covered by major media outlets such as MIT Technology Review, BBC, Engadget, and VentureBeat.



Zoom Link: https://mit.zoom.us/my/lijiefan



Location: 32-D463
Time: Wednesday, May 1, 3-4 PM