¶ LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
LayoutLMv3 is a multimodal Transformer model for Document AI that jointly processes text, layout, and visual information from documents. It was introduced in the paper LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Huang, Lv, Cui, Lu, and Wei (2022).
Code, pretrained models, and practical usage examples are available through the original Microsoft UNILM repository and the Hugging Face Transformers implementation.
LayoutLMv3 is a general-purpose multimodal document model designed for tasks where understanding a document requires combining textual content, page layout, and visual appearance. Compared to earlier LayoutLM versions, it simplifies the architecture by replacing the CNN-based visual backbone of LayoutLMv2 with patch embeddings, similar to Vision Transformers.
A key contribution of LayoutLMv3 is its unified masking strategy across the text and image modalities. During pretraining, the model is trained with masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA). This makes the learning objectives consistent across modalities and improves cross-modal representation learning for document understanding.
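The unified masking idea can be sketched as follows. This is an illustrative simplification, not the paper's exact procedure: the ratios are assumptions, and the paper uses span masking for text and blockwise masking for image patches rather than uniform sampling.

```python
import random

def unified_masking(text_tokens, image_patches,
                    text_ratio=0.3, patch_ratio=0.4, seed=0):
    """Sketch of unified masking: hide a fraction of both text tokens
    and image patches, then train the model to reconstruct them
    (MLM for text, MIM for patches). Word-patch alignment (WPA)
    additionally predicts whether the patches covering a given word
    are masked, encouraging cross-modal alignment."""
    rng = random.Random(seed)
    n_text = int(len(text_tokens) * text_ratio)
    n_patch = int(len(image_patches) * patch_ratio)
    masked_text_idx = set(rng.sample(range(len(text_tokens)), n_text))
    masked_patch_idx = set(rng.sample(range(len(image_patches)), n_patch))
    # Replace masked positions with placeholders; originals become targets.
    masked_text = ["[MASK]" if i in masked_text_idx else t
                   for i, t in enumerate(text_tokens)]
    masked_patches = [None if i in masked_patch_idx else p
                      for i, p in enumerate(image_patches)]
    return masked_text, masked_patches, masked_text_idx, masked_patch_idx
```

Because both modalities are masked and reconstructed in the same way, the encoder receives consistent training signals for text and image tokens alike.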
Key traits of LayoutLMv3:
- Unified multimodal document model: Combines text, 2D layout, and image information in one Transformer.
- Patch-based visual encoder: Uses ViT-style patch embeddings instead of a CNN backbone.
- Unified pretraining objectives: Trained with MLM, MIM, and word-patch alignment.
- Cross-modal alignment: Learns direct relationships between document words and their corresponding visual regions.
- Broad Document AI applicability: Supports both text-centric and image-centric document tasks.

Figure 1 (from the paper) illustrates the architecture and workflow of LayoutLMv3:
- A document page is represented through text tokens, their bounding boxes, and the document image itself.
- The visual input is converted into image patches, following a ViT-style patch embedding approach.
- Textual and visual representations are processed jointly by a multimodal Transformer encoder.
- During pretraining, the model learns from masked language modeling, masked image modeling, and word-patch alignment objectives.
- This unified design enables the model to perform well on a wide range of Document AI tasks, including form understanding, receipt understanding, document question answering, layout analysis, and document image classification.
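The patch-embedding step above can be sketched in plain NumPy. The 224×224 input resolution and 16×16 patch size below follow the common ViT configuration and are assumptions about a typical setup; the learned linear projection that turns each flattened patch into an embedding vector is omitted.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    as in ViT-style patch embedding. A learned linear projection of each
    flattened patch (omitted here) would yield the visual token embeddings
    fed to the multimodal Transformer encoder."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    nh, nw = h // patch_size, w // patch_size
    # Rearrange so each row of the result is one patch in row-major order.
    patches = (image.reshape(nh, patch_size, nw, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(nh * nw, patch_size * patch_size * c))
    return patches

# A 224x224 RGB page image yields 14*14 = 196 patch tokens,
# each a flattened vector of length 16*16*3 = 768.
page = np.zeros((224, 224, 3), dtype=np.float32)
patches = image_to_patches(page)  # shape (196, 768)
```

Replacing LayoutLMv2's CNN backbone with this kind of direct patch embedding is what simplifies the visual pathway of the model.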
LayoutLMv3 is intended for:
- Document understanding tasks that require combining text, layout, and page appearance.
- Information extraction from forms, invoices, receipts, and structured business documents.
- Document question answering and token-level labeling tasks.
- Document image classification and layout-aware analysis.
Limitations:
- LayoutLMv3 is designed specifically for document-oriented multimodal inputs, not for general natural images or general-purpose vision-language tasks.
- Effective use typically depends on having OCR outputs or word-level text with normalized bounding boxes.
- Input preprocessing is more complex than standard NLP models because it requires coordinating text tokens, boxes, and images.
- Performance may depend on document quality, OCR accuracy, and layout consistency in the target domain.
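As a concrete illustration of the bounding-box preprocessing mentioned above: LayoutLM-family models expect word boxes scaled to a 0-1000 integer coordinate grid, regardless of the page's pixel dimensions. A minimal sketch of that normalization (the page size in the example is an arbitrary assumption):

```python
def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel bounding box to the 0-1000
    integer grid used as the 2D position input of LayoutLM-family
    models."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# An OCR word box on a hypothetical 850x1100-pixel page:
normalize_box((85, 110, 170, 132), 850, 1100)  # -> [100, 100, 200, 120]
```

Each word token is paired with one such normalized box, so OCR output must be aligned with the tokenized text before it is fed to the model.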
¶ BibTeX entry and citation info
@article{huang2022layoutlmv3,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
  journal={arXiv preprint arXiv:2204.08387},
  year={2022}
}