KnoBo incorporates knowledge priors from medical documents via inherently interpretable models.





[In-domain (ID), out-of-domain (OOD), and average of ID and OOD (Avg) performance on confounded medical image datasets.]

KnoBo is more robust to domain shifts (e.g., race or hospital) than fine-tuned vision transformers.

Abstract

While deep networks have achieved broad success in analyzing natural images, they often fail in unexpected situations when applied to medical scans. We investigate this challenge, focusing on model sensitivity to domain shifts, such as data sampled from different hospitals or data confounded by demographic variables such as sex and race, in the context of chest X-rays and skin lesion images. A key finding we show empirically is that existing visual backbones lack an appropriate architectural prior for reliable generalization in these settings. Taking inspiration from medical training, we propose giving deep networks a prior grounded in explicit medical knowledge communicated in natural language. To this end, we introduce Knowledge-enhanced Bottlenecks (KnoBo), a class of concept bottleneck models that incorporates knowledge priors constraining them to reason with clinically relevant factors found in medical textbooks or PubMed. KnoBo uses retrieval-augmented language models to design an appropriate concept space, paired with an automatic training procedure for recognizing these concepts. We evaluate different knowledge resources and recognition architectures on a broad range of domain shifts across 20 datasets. In our comprehensive evaluation with two imaging modalities, KnoBo outperforms fine-tuned models on confounded datasets by 32.4% on average. Finally, evaluations reveal that PubMed is a promising resource for making medical models less sensitive to domain shift, outperforming other resources in both the diversity of information and final prediction performance.

Deep Networks Lack Priors for Medical Images

Priors are an important signal allowing models to adopt appropriate hypotheses in low or misleading data regimes. Vision backbones have an effective deep image prior on natural images even when entirely untrained. In contrast, across multiple medical modalities (X-rays and skin lesion images), the deep image prior in current major visual backbones is no more effective than using pixels (and often worse).
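The comparison above amounts to linear probing: fit a linear classifier on frozen features and compare it against the same classifier on raw pixels. Below is a minimal, self-contained sketch of that protocol, where a frozen random projection with a ReLU stands in for a randomly initialized backbone and synthetic vectors stand in for images; all data and dimensions are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in data: 200 "images" of 64 pixels with binary labels.
X = rng.normal(size=(200, 64))
w_true = rng.normal(size=64)
y = (X @ w_true > 0).astype(int)

# "Untrained backbone": a frozen random projection plus ReLU, a crude
# stand-in for features extracted from a randomly initialized network.
W = rng.normal(size=(64, 128)) / np.sqrt(64)
feats = np.maximum(X @ W, 0.0)

def linear_probe(inputs, labels):
    """Accuracy of a linear classifier fit on frozen inputs."""
    Xtr, Xte, ytr, yte = train_test_split(inputs, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)

acc_pixels = linear_probe(X, y)      # probe on raw pixels
acc_feats = linear_probe(feats, y)   # probe on untrained-network features
print(f"pixels: {acc_pixels:.2f}  untrained features: {acc_feats:.2f}")
```

The paper's finding is that on medical modalities the feature-based probe offers no advantage over the pixel probe, whereas on natural images it does.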


[Classification performance on natural and medical images through linear probing, using features extracted from untrained models versus pixel features.]

Knowledge-enhanced Bottlenecks

Our Knowledge-enhanced Bottleneck for medical image classification comprises three main components:
(1) Structure Prior: constructs a trustworthy knowledge bottleneck by leveraging medical documents;
(2) Bottleneck Predictor: grounds images onto concepts, which serve as input to the linear layer;
(3) Parameter Prior: constrains the learning of the linear layer with parameters predefined by doctors or LLMs.
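To make the pipeline concrete, here is a minimal sketch of the forward pass of a concept bottleneck with a prior-constrained linear head. The concept names, prior weights, and random concept scores are all illustrative assumptions, not values from the paper; in KnoBo the scores come from grounding images to concepts.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) A hypothetical bottleneck of clinically grounded concepts
# (names assumed for illustration).
concepts = ["pleural effusion present", "cardiomegaly", "diffuse lung opacity"]

# (2) Bottleneck predictor: per-image concept scores in [0, 1].
# Random scores stand in for image-to-concept grounding here.
n_images = 5
concept_scores = rng.uniform(size=(n_images, len(concepts)))

# (3) Linear layer over concepts; its weights can be fixed or regularized
# toward values suggested by doctors or an LLM (assumed values below).
prior_weights = np.array([1.0, 0.5, 1.0])
bias = -1.2

logits = concept_scores @ prior_weights + bias
preds = (logits > 0).astype(int)  # 1 = abnormal, 0 = normal
print(preds)
```

Because every input to the final layer is a named clinical concept, each prediction decomposes into human-readable contributions (score × weight per concept).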

KnoBo is More Robust to Domain Shifts

We evaluate 5 confounds per modality, covering race, sex, age, scan position, and hospital. Averaged over confounds, KnoBo improves OOD performance by 41.8% and 22.9% on X-rays and skin lesions, respectively.


[Comparison of KnoBo and fine-tuned ViT on confounded datasets of two modalities (X-ray and skin lesion). We report in-domain (ID), out-of-domain (OOD), and average of ID and OOD (Avg) accuracy. The number above the arrow indicates the OOD improvement of KnoBo over the fine-tuned ViT.]

KnoBo Performs the Best across Confounded and Unconfounded Data

Across 10 confounded and 10 unconfounded medical datasets, KnoBo achieves the best performance on average.


[Averaged results across all datasets, including in-domain (ID), out-of-domain (OOD), domain-gap (∆, lower is better), and mean of ID and OOD (Avg) accuracy for confounded datasets. For unconfounded datasets (Unconfd), we report test accuracy. Overall performance is the mean of Avg and Unconfd, capturing the tradeoff between the two data conditions.]

PubMed is a Promising Source for Medical Knowledge

We explore 5 different knowledge sources for retrieval-augmented generation and find that PubMed outperforms the others in both diversity of information and final prediction performance.


[Comparison of concept bottlenecks built from different knowledge sources. Prompt is our baseline without retrieving documents for concept generation. We report the accuracy of confounded (Confd, average over ID and OOD), unconfounded (Unconfd) datasets, and the overall performance of all datasets. Diversity measures the difference between the concepts in a bottleneck.]
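One plausible way to quantify the "diversity" of a bottleneck (an illustrative metric, not necessarily the paper's exact formulation) is the average pairwise cosine distance between concept embeddings: the less the concepts overlap semantically, the higher the score. The sketch below uses random unit vectors in place of real sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: embedding vectors for the concepts in a bottleneck
# (in practice these could come from a sentence encoder).
n_concepts, dim = 10, 32
E = rng.normal(size=(n_concepts, dim))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows

# Average pairwise cosine distance over distinct concept pairs.
cos = E @ E.T
iu = np.triu_indices(n_concepts, k=1)  # upper triangle, excluding diagonal
diversity = float(np.mean(1.0 - cos[iu]))
print(f"diversity: {diversity:.3f}")
```

A bottleneck of near-duplicate concepts would score near 0; mutually dissimilar concepts push the score higher.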

Example concept bottlenecks generated from PubMed:

X-ray Bottleneck (PubMed)
Skin Lesion Bottleneck (PubMed)

BibTeX

@article{yang2024textbook,
      title={A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis}, 
      author={Yue Yang and Mona Gandhi and Yufei Wang and Yifan Wu and Michael S. Yao and Chris Callison-Burch and James C. Gee and Mark Yatskar},
      journal={arXiv preprint arXiv:2405.14839},
      year={2024}
}