Microsoft Research · Healthcare

BiomedCLIP

A biomedical vision-language model trained on 15 million figure-caption pairs from PubMed Central for medical image-text understanding.

Overview

BiomedCLIP adapts the CLIP framework to the biomedical domain by training on PMC-15M, a dataset of 15 million figure-caption pairs extracted from PubMed Central articles. It achieves state-of-the-art performance across a broad range of biomedical vision-language tasks, including image classification, image-text retrieval, and visual question answering, bringing medical imaging and natural language understanding together in a single unified model.
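
At inference time the pairing works like general-domain CLIP: the image and each candidate caption are embedded into a shared space and scored by cosine similarity. The sketch below assumes the open_clip-compatible checkpoint published on the Hugging Face Hub (hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224); the image path and caption strings are illustrative placeholders rather than part of the release.

```python
# Minimal sketch: embed one image and a few candidate captions with BiomedCLIP
# and score them by cosine similarity (the CLIP objective at inference time).
# The hub id refers to the public open_clip-compatible release; the image path
# and caption strings below are illustrative placeholders.
import torch
import open_clip
from PIL import Image

model_id = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

image = preprocess(Image.open("example_figure.png")).unsqueeze(0)  # placeholder path
captions = ["chest X-ray", "brain MRI", "H&E stained histopathology slide"]
tokens = tokenizer(captions, context_length=256)  # 256 is the model's text context length

with torch.no_grad():
    img = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    txt = torch.nn.functional.normalize(model.encode_text(tokens), dim=-1)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)  # one score per caption

for caption, p in zip(captions, probs[0]):
    print(f"{caption}: {p:.3f}")
```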

Architecture: CLIP dual encoder (ViT image encoder + text encoder)

Training Data: PMC-15M (15 million figure-caption pairs)

Image Encoder: Vision Transformer (ViT-B/16)

Text Encoder: PubMedBERT

License: MIT

Capabilities

Biomedical image classification

Medical image-text retrieval

Visual question answering for medical images

Zero-shot medical image recognition (see the classification sketch after this list)

Cross-modal biomedical search
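
As a rough illustration of the zero-shot workflow, the sketch below wraps candidate class names in a short prompt template, encodes them once, and assigns each image the label with the highest cosine similarity. It assumes the same open_clip-compatible Hugging Face checkpoint as above; the label names and image paths are placeholders.

```python
# Hedged sketch of zero-shot classification: candidate labels are wrapped in a
# short prompt template, encoded once, and each image gets the most similar label.
# Labels and image paths are illustrative placeholders.
import torch
import open_clip
from PIL import Image

model_id = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

labels = ["adenocarcinoma histopathology", "chest X-ray", "brain MRI", "bar chart"]
image_paths = ["slide_01.png", "scan_02.png"]

# Encode all label prompts once; encode the images as a single batch.
prompts = tokenizer([f"this is a photo of {label}" for label in labels],
                    context_length=256)
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])

with torch.no_grad():
    img_emb = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
    txt_emb = torch.nn.functional.normalize(model.encode_text(prompts), dim=-1)
    best = (img_emb @ txt_emb.T).argmax(dim=-1)  # highest-similarity label per image

for path, idx in zip(image_paths, best):
    print(f"{path} -> {labels[idx]}")
```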

Use Cases

Searching medical image databases using natural language queries (see the retrieval sketch after this list)

Classifying medical images without task-specific fine-tuning

Building multimodal search engines for biomedical literature

Automating figure annotation in medical publications
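
Natural-language search over an image collection reduces to a two-step pipeline: embed every figure once into an index, then embed the free-text query and rank figures by cosine similarity. The snippet below is an illustrative sketch under the same open_clip checkpoint assumption; the file paths, query string, and top-k cutoff are placeholders.

```python
# Illustrative sketch of text-to-image retrieval: embed a small image collection
# once, then rank it against a free-text query by cosine similarity.
# Paths, the query string, and the top-k cutoff are placeholders.
import torch
import open_clip
from PIL import Image

model_id = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

image_paths = ["figs/fig_001.png", "figs/fig_002.png", "figs/fig_003.png"]

with torch.no_grad():
    # One-time indexing pass: one L2-normalized embedding row per figure.
    index = torch.cat([
        torch.nn.functional.normalize(
            model.encode_image(preprocess(Image.open(p)).unsqueeze(0)), dim=-1)
        for p in image_paths
    ])

    # Query time: embed the natural-language query and rank by cosine similarity.
    query = tokenizer(["hematoxylin and eosin stained tissue with tumor cells"],
                      context_length=256)
    q = torch.nn.functional.normalize(model.encode_text(query), dim=-1)
    scores = (index @ q.T).squeeze(1)
    top_k = scores.topk(k=min(3, len(image_paths)))

for score, idx in zip(top_k.values, top_k.indices):
    print(f"{image_paths[idx]}  similarity={score:.3f}")
```

In practice the image embeddings would be computed offline and stored (for example as a tensor on disk or in a vector database) rather than rebuilt for every query.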

Pros

  • State-of-the-art biomedical vision-language understanding
  • Zero-shot capability reduces need for labeled medical data
  • Open-source with permissive MIT license
  • Trained on the largest biomedical image-text dataset to date

Cons

  • Performance varies across medical imaging modalities
  • Primarily trained on published figures, not raw clinical imaging
  • Requires paired image-text data for best results
  • Not designed for diagnostic-grade clinical image analysis

Pricing

Free and open-source under the MIT license, with model weights available on Hugging Face. The ViT-B/16 backbone is lightweight enough for inference on a single standard GPU.
