Yue Yang
pronounced as yoo-eh

220 South 33rd Street, Towne 299
Philadelphia, PA 19104, USA
Email: yueyang1 [at] seas.upenn.edu

Google Scholar · Semantic Scholar · GitHub · Twitter · LinkedIn · YouTube · Pixiv · CV
About me

Hi! My name is Yue Yang (杨樾). I am a fourth-year Ph.D. student in Computer and Information Science at the University of Pennsylvania, affiliated with Penn NLP. I am grateful to be advised by Prof. Chris Callison-Burch and Prof. Mark Yatskar. I am interested in the intersection of Natural Language Processing (NLP) and Computer Vision (CV).

My current research focuses on applying the knowledge priors of large language models (LLMs) to various domains (images, videos, healthcare, Embodied AI, etc.) to improve different aspects of AI systems, including:

Interpretability. LLMs aid in constructing human-readable intermediate representations, such as concept bottlenecks, enabling the design of inherently interpretable models, thereby mitigating the black-box nature of deep learning.

Robustness. By using sparse natural language representations as input, models are less prone to overfitting to spurious cues in the in-domain training data, enhancing their robustness and out-of-domain generalization.

Controllability & Creativity. Language interfaces in generative systems enable easier control over the generation process. Leveraging the extensive world knowledge of LLMs, these systems can produce customized and diverse outputs.

Recent Preprints
Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu
arXiv, 2024
TL;DR: CoMo is a controllable human motion generation model that encodes motion into interpretable pose codes representing body part semantics. Because the pose codes are human-readable, an LLM can directly intervene in motion editing by adjusting them according to editing instructions.
Yifan Wu, Yang Liu, Yue Yang, Michael S. Yao, Wenli Yang, Xuehui Shi, Lihong Yang, Dongjun Li, Yueming Liu, James C. Gee, Xuan Yang, Wen-bin Wei, Shi Gu
arXiv, 2024
Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, Chris Callison-Burch
arXiv, 2023
TL;DR: Text Bottleneck Model (TBM) is an inherently interpretable text classification framework that offers both global and local explanations by iteratively constructing concept bottlenecks. TBM can rival the performance of established black-box baselines such as few-shot GPT-4 and fine-tuned DeBERTa.
Selected Publications
Yue Yang*, Fan-Yun Sun*, Luca Weihs*, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
TL;DR: Holodeck is an automated system for generating diverse 3D environments in Embodied AI, using a large language model (GPT-4) and a vast collection of 3D assets from Objaverse. It can create complex scenes based on user prompts, adjusting for styles and specific details, like "a 1b1b apartment of a researcher who has a cat".
Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, Mark Yatskar
Conference on Computer Vision and Pattern Recognition (CVPR), 2023
TL;DR: Concept Bottleneck Models (CBMs) are interpretable models that factor model decisions into human-readable concepts. However, CBMs often under-perform their black-box counterparts and require manual specification of concepts. Our method, LaBo, leverages large language models (GPT-3) to automatically construct bottlenecks for any image classification task.
Yue Yang, Wenlin Yao, Hongming Zhang, Xiaoyang Wang, Dong Yu, Jianshu Chen
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
TL;DR: We endow language models with visual imagination capabilities via two complementary methods: (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks.
Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
TL;DR: This work proposes the Visual Goal-Step Inference (VGSI) task where a model is given a textual goal and must choose a plausible step towards that goal from among four candidate images. We construct a VGSI dataset from wikiHow and show that SOTA multimodal models struggle on it.
Education
University of Pennsylvania, Philadelphia, PA, USA
  • Ph.D. in Computer and Information Science (2020 - present)
  • M.S. in Robotics (2018 - 2020)
Zhejiang University, Hangzhou, China
  • B.E. in Mechanical Engineering (2014 - 2018)
Experiences
Allen Institute for AI, Seattle, WA, USA
  • Research Intern (May 2023 to Sept. 2023, May 2024 to Sept. 2024)
  • Outstanding Intern of the Year Award (2023)
Tencent AI Lab, Seattle, WA, USA
  • Research Scientist Intern (May 2022 to Sept. 2022)
Teaching

Head Teaching Assistant, CIS-521 Artificial Intelligence, University of Pennsylvania
Fall 2019; Fall 2020; Summer 2021; Fall 2021; Spring 2022

Teaching Assistant, CIS-530 Computational Linguistics, University of Pennsylvania
Spring 2021
Academic Service

Reviewer: CVPR, ECCV, ACL, EMNLP, NAACL, EACL, COLM, TMLR.
Talks
CLUNCH, University of Pennsylvania, Philadelphia, PA, USA
Investigate Procedural Events in a Multimodal Fashion, November 22, 2021. slides

Website source from Jon Barron.