Yue Yang
pronounced as yoo-eh

220 South 33rd Street, Towne 299
Philadelphia, PA 19104, USA
Email: yueyang1 [at] seas.upenn.edu

Google Scholar · Semantic Scholar · GitHub · Twitter · LinkedIn · YouTube · Pixiv · CV
About me

Hi! My name is Yue Yang (杨樾). I am a fourth-year Ph.D. student in Computer and Information Science at the University of Pennsylvania, affiliated with Penn NLP. I am grateful to be advised by Prof. Chris Callison-Burch and Prof. Mark Yatskar. I am interested in the intersection of Natural Language Processing (NLP) and Computer Vision (CV).

My current research focuses on applying the knowledge priors of large language models (LLMs) to various domains (images, videos, healthcare, Embodied AI, etc.) to improve different aspects of AI systems, including:

Interpretability. LLMs aid in constructing human-readable intermediate representations, such as concept bottlenecks, enabling the design of inherently interpretable models, thereby mitigating the black-box nature of deep learning.

Robustness. By using sparse natural language representations as input, models are less prone to overfitting to spurious cues in the in-domain training data, enhancing their robustness and out-of-domain generalization.

Controllability & Creativity. Language interfaces in generative systems enable easier control over the generation process. Leveraging the extensive world knowledge of LLMs, these systems can produce customized and diverse outputs.

Recent Preprints
Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu
arXiv, 2024
TL;DR: CoMo is a controllable human motion generation model that encodes motion into interpretable pose codes representing body part semantics. Because the pose codes are human-readable, an LLM can directly intervene in motion editing by adjusting them according to editing instructions.
Yifan Wu, Yang Liu, Yue Yang, Michael S. Yao, Wenli Yang, Xuehui Shi, Lihong Yang, Dongjun Li, Yueming Liu, James C. Gee, Xuan Yang, Wen-bin Wei, Shi Gu
arXiv, 2024
Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, Chris Callison-Burch
arXiv, 2023
TL;DR: Text Bottleneck Model (TBM) is an inherently interpretable text classification framework that offers both global and local explanations by iteratively constructing concept bottlenecks. TBM can rival the performance of established black-box baselines such as few-shot GPT-4 and fine-tuned DeBERTa.
Selected Publications
Yue Yang*, Fan-Yun Sun*, Luca Weihs*, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
TL;DR: Holodeck is an automated system for generating diverse 3D environments in Embodied AI, using a large language model (GPT-4) and a vast collection of 3D assets from Objaverse. It can create complex scenes based on user prompts, adjusting for styles and specific details, like "a 1b1b apartment of a researcher who has a cat".
Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, Mark Yatskar
Conference on Computer Vision and Pattern Recognition (CVPR), 2023
TL;DR: Concept Bottleneck Models (CBMs) are interpretable models that factor model decisions into human-readable concepts. However, CBMs often under-perform their black-box counterparts and require manual specification of concepts. Our method, LaBo, leverages large language models (GPT-3) to automatically construct bottlenecks for any image classification task.
Yue Yang, Wenlin Yao, Hongming Zhang, Xiaoyang Wang, Dong Yu, Jianshu Chen
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
TL;DR: We endow language models with visual imagination capabilities via two complementary methods: (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks.
Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
TL;DR: This work proposes the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. We construct a VGSI dataset from wikiHow and show that SOTA multimodal models struggle on it.
Education
University of Pennsylvania, Philadelphia, PA, USA
  • Ph.D. in Computer and Information Science (2020 - present)
  • M.S. in Robotics (2018 - 2020)
Zhejiang University, Hangzhou, China
  • B.E. in Mechanical Engineering (2014 - 2018)
Experiences
Allen Institute for AI, Seattle, WA, USA
  • Research Intern (May 2023 to Sept. 2023, May 2024 to Sept. 2024)
  • Outstanding Intern of the Year Award (2023)
Tencent AI Lab, Seattle, WA, USA
  • Research Scientist Intern (May 2022 to Sept. 2022)
Teaching

Head Teaching Assistant, CIS-521 Artificial Intelligence, University of Pennsylvania
Fall 2019; Fall 2020; Summer 2021; Fall 2021; Spring 2022

Teaching Assistant, CIS-530 Computational Linguistics, University of Pennsylvania
Spring 2021
Academic Service

Reviewer: CVPR, ECCV, ACL, EMNLP, NAACL, EACL, COLM, TMLR.
Talks
CLUNCH, University of Pennsylvania, Philadelphia, PA, USA
Investigate Procedural Events in a Multimodal Fashion, November 22, 2021. slides

Website source from Jon Barron.