Multimodal Pre-training + Contrastive Learning
CLIP - Learning Transferable Visual Models From Natural Language Supervision
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training 2021-5
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning