220 South 33rd Street, Towne 299, Philadelphia, PA 19104, USA
Email: yueyang1 [at] seas.upenn.edu
About me
Hi! My name is Yue Yang (杨樾). I am a final-year Ph.D. student in Computer and Information Science at the University of Pennsylvania, affiliated with Penn NLP. I am grateful to be advised by Prof. Chris Callison-Burch and Prof. Mark Yatskar. I am interested in the intersection of Natural Language Processing (NLP) and Computer Vision (CV).
My current research focuses on applying the knowledge priors of large language models (LLMs) to various domains (images, videos, healthcare, Embodied AI, etc.) to improve different aspects of AI systems, including:
Interpretability. LLMs aid in constructing human-readable intermediate representations, such as concept bottlenecks, enabling the design of inherently interpretable models, thereby mitigating the black-box nature of deep learning.
Robustness. By using sparse natural language representations as input, models are less prone to overfitting to spurious cues in the in-domain training data, which enhances their robustness and out-of-domain generalization.
Data Efficiency. Leveraging the world knowledge and coding ability of text-only LLMs to create synthetic data to improve embodied agents and multimodal language models.
I am looking for full-time positions starting in Summer 2025. Please reach out if you are interested in working with me!
Matt Deitke*, Christopher Clark*, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, et al. (51 authors in total)
TL;DR: We introduce Knowledge Bottlenecks (KnoBo) that incorporate priors from medical documents, such as PubMed, through inherently interpretable models. KnoBo is robust to domain shifts in medical images, e.g., data sampled from different hospitals or data confounded by demographic variables such as sex, race, etc. Overall, our work demonstrates that a key missing ingredient for robustness to distribution shift in medical imaging models is a prior rooted in knowledge.
TL;DR: CoMo is a controllable human motion generation model that encodes motion into interpretable pose codes representing body part semantics. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions.
Yue Yang*, Fan-Yun Sun*, Luca Weihs*, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
TL;DR: Holodeck is an automated system for generating diverse 3D environments in Embodied AI, using a large language model (GPT-4) and a vast collection of 3D assets from Objaverse. It can create complex scenes based on user prompts, adjusting for styles and specific details, like "a 1b1b apartment of a researcher who has a cat".
TL;DR: We generate visual metaphors from linguistic metaphors using a collaboration between LLMs and Diffusion Models. We use GPT-3 with Chain-of-Thought prompting to generate text that represents a visual elaboration of the linguistic metaphor, which is then used as input to the diffusion-based text-to-image models to create 6,476 visual metaphors.
TL;DR: Concept Bottleneck Models are interpretable models that factor in human-readable concepts to explain model decisions. However, CBMs often under-perform their black box counterparts and require manual specification of concepts. Our method, LaBo, leverages large language models (GPT-3) to automatically construct bottlenecks for any image classification tasks.
TL;DR: We introduce CREPE, a benchmark for causal reasoning about event plausibility based on entity states, and show that current large language models, including GPT-3, perform poorly on this task. To improve performance, we inject the causal relations between entities and events through structured representations such as programming languages, which results in a significant increase in performance.
TL;DR: We develop a novel approach to endow language models with visual imagination capabilities. We leverage two complementary types of "imaginations": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks.
TL;DR: We propose to extract properties of nouns from images, which can then be used to complement information from language models to mitigate the reporting bias problem. Results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful language models.
TL;DR: Procedures are inherently hierarchical. To "host a party", one may need to "clean the house", which in turn may require "putting away the clothes". We develop an efficient method that links steps (e.g., "clean the house") in an article to other articles with similar intents (e.g., "how to deep clean your house"), which proceeds recursively to form the KB.
TL;DR: This work proposes a novel system that induces schemata from web videos and generalizes them to capture unseen tasks with the goal of improving video retrieval performance, and demonstrates that the schemata induced by the system are better than those generated by other models.
TL;DR: This work proposes the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. We construct a VGSI dataset from wikiHow and show that SOTA multimodal models struggle on it.
Education
University of Pennsylvania, Philadelphia, PA, USA
Ph.D. in Computer and Information Science (2020 - present)
M.S. in Robotics (2018 - 2020)
Zhejiang University, Hangzhou, China
B.E. in Mechanical Engineering (2014 - 2018)
Experiences
Allen Institute for AI, Seattle, WA, USA. Research Intern (May 2023 to Sept. 2023; May 2024 to Sept. 2024). Outstanding Intern of the Year Award (2023).
Tencent AI Lab, Seattle, WA, USA. Research Scientist Intern (May 2022 to Sept. 2022).
Teaching
Head Teaching Assistant, CIS-521 Artificial Intelligence, University of Pennsylvania
Fall 2019; Fall 2020; Summer 2021; Fall 2021; Spring 2022
WPE-II Presentation, University of Pennsylvania, Philadelphia, PA, USA. Language Guided Concept Bottlenecks for Interpretable and Robust Image Classification, April 29, 2024. slides
CLUNCH, University of Pennsylvania, Philadelphia, PA, USA. Investigate Procedural Events in a Multimodal Fashion, November 22, 2021. slides