Microsoft Research · Healthcare

PubMedBERT

A BERT model pre-trained from scratch on PubMed abstracts using a biomedical-specific vocabulary for superior biomedical NLP performance.

Overview

PubMedBERT distinguishes itself from other biomedical BERT variants by being pre-trained entirely from scratch on biomedical text rather than initialized from general-domain BERT. Combined with a vocabulary built directly from biomedical text, this approach yields substantial improvements on biomedical NLP benchmarks: PubMedBERT consistently outperforms BioBERT and other mixed-domain models on the Biomedical Language Understanding and Reasoning Benchmark (BLURB).
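The benefit of a domain-specific vocabulary can be seen in how WordPiece tokenization handles biomedical terms. The toy sketch below implements BERT-style greedy longest-match WordPiece with two hypothetical vocabularies (not the real PubMedBERT or BERT vocabularies) to show how a general-domain vocabulary fragments a term that a biomedical vocabulary keeps whole:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece, as used by BERT tokenizers.
    Continuation pieces carry the '##' prefix; unknown words map to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        matched_end = None
        for end in range(len(word), start, -1):
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                pieces.append(sub)
                matched_end = end
                break
        if matched_end is None:
            return ["[UNK]"]
        start = matched_end
    return pieces

# Hypothetical vocabularies for illustration only.
general_vocab = {"ly", "##mp", "##ho", "##ma"}
biomed_vocab = {"lymphoma"}

print(wordpiece("lymphoma", general_vocab))  # ['ly', '##mp', '##ho', '##ma']
print(wordpiece("lymphoma", biomed_vocab))   # ['lymphoma']
```

Fewer fragments per term means biomedical words keep a single, dedicated embedding instead of being composed from generic subwords, which is one reason the from-scratch vocabulary helps downstream tasks.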

Parameters

110M

Architecture

BERT-Base (from-scratch pretraining)

Training Data

PubMed abstracts (3.1B words)

Vocabulary

Custom biomedical WordPiece (30K tokens)

License

MIT

Capabilities

Biomedical named entity recognition

Biomedical relation extraction

Biomedical question answering

Sentence similarity in medical context

Document classification for medical literature

Use Cases

Extracting gene-disease associations from research papers

Classifying clinical trial eligibility criteria

Building biomedical knowledge graphs from literature

Semantic search across medical publication databases
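For the semantic-search use case, a common pattern is to mean-pool PubMedBERT's last hidden states into sentence embeddings and rank documents by cosine similarity. The sketch below assumes the Hugging Face checkpoint name `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`, requires `transformers` and `torch`, and downloads the model on first run:

```python
# Sketch: sentence embeddings from PubMedBERT via mean pooling.
# Checkpoint name and pooling choice are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(texts):
    """Return one mean-pooled 768-d vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (B, 768)

docs = ["BRCA1 mutations increase breast cancer risk.",
        "Metformin is a first-line therapy for type 2 diabetes."]
query = embed(["gene variants associated with breast cancer"])
scores = torch.nn.functional.cosine_similarity(query, embed(docs))
```

Mean pooling over non-padding tokens is a simple baseline; task-specific fine-tuning (see Cons below) typically improves retrieval quality further.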

Pros

  • Top performance on the BLURB benchmark for biomedical NLP
  • Domain-specific vocabulary captures biomedical terminology better
  • Lightweight and efficient for production deployment
  • Well-supported, with extensive documentation and benchmarks

Cons

  • Encoder-only model; cannot generate text
  • Limited 512-token context window
  • Focused on abstracts; may underperform on full-text clinical documents
  • Requires task-specific fine-tuning
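The 512-token limit above is commonly worked around by encoding long documents as overlapping windows and merging the per-window predictions. The pure-Python sketch below illustrates the windowing step; `token_ids` would come from the PubMedBERT tokenizer, and the window/stride values are assumptions, not official guidance:

```python
# Sketch: sliding window over token ids for documents longer than the
# 512-token context. window=510 leaves room for [CLS] and [SEP].
def sliding_windows(token_ids, window=510, stride=255):
    """Split token_ids into overlapping chunks of at most `window` tokens."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks

# Example: a 1200-token document becomes 4 overlapping chunks.
chunks = sliding_windows(list(range(1200)))
print([len(c) for c in chunks])  # [510, 510, 510, 435]
```

Each chunk is then encoded separately, and overlapping predictions (e.g. entity spans) are reconciled afterwards, for instance by averaging scores in the overlap region.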

Pricing

Free and open-source. Available on Hugging Face. Self-hosting costs depend on infrastructure; runs efficiently on a single GPU.
