

Hands-On Multimodal AI: Contrastive Pretraining and Projection
Join us at the Merantix AI Campus for a hands-on workshop led by our Hacker Room resident Antonio Rueda Toicen.
This session explores how to build models that understand and process information from multiple data types, such as images and text. You will learn two key techniques through direct coding experience: contrastive pretraining and cross-modal projection.
Specifically, we will go through the following topics:
Implement contrastive pretraining to teach a model the relationship between 3D LiDAR point clouds and corresponding RGB images of objects.
Implement contrastive pretraining to teach a model the relationship between images and their outlines.
Apply classical computer vision techniques, such as the Sobel filter, to improve the learned latent space.
Use cosine similarity to measure how close two vector embeddings are.
Build a vector database for image retrieval using a trained contrastive model.
Construct a cross-modal projector to map text embeddings into an image embedding space.
Integrate a projector with a pre-trained image model to create a Vision-Language Model.
Train and fine-tune a complete multimodal pipeline that classifies images from text descriptions.
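To give a feel for the two core techniques before the session, here is a minimal PyTorch sketch, not the workshop's actual code. It assumes a CLIP-style symmetric contrastive loss over cosine similarities and a small MLP projector; the names `contrastive_loss` and `Projector`, the embedding sizes, and the temperature of 0.07 are illustrative assumptions, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over pairwise cosine similarities.

    Row i of emb_a and row i of emb_b are treated as a matching pair;
    every other row in the batch acts as a negative. The temperature
    value is an assumption chosen for illustration.
    """
    a = F.normalize(emb_a, dim=-1)  # unit length, so dot product = cosine
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature  # (N, N) scaled cosine similarities
    targets = torch.arange(a.size(0), device=logits.device)
    loss_ab = F.cross_entropy(logits, targets)    # a -> b direction
    loss_ba = F.cross_entropy(logits.T, targets)  # b -> a direction
    return (loss_ab + loss_ba) / 2


class Projector(nn.Module):
    """Hypothetical two-layer MLP mapping text embeddings into image space."""

    def __init__(self, text_dim: int, image_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, image_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)


torch.manual_seed(0)
image_emb = torch.randn(4, 128)  # stand-in for image encoder outputs
text_emb = torch.randn(4, 64)    # stand-in for text encoder outputs

projector = Projector(text_dim=64, image_dim=128)
projected = projector(text_emb)  # text embeddings, now 128-dimensional

loss = contrastive_loss(projected, image_emb)        # untrained projector
aligned_loss = contrastive_loss(image_emb, image_emb)  # perfectly aligned pairs
```

Minimizing `loss` with an optimizer over the projector's parameters pulls each projected text embedding toward its paired image embedding and pushes it away from the others in the batch. `aligned_loss` shows the limiting case: when both sides are identical, the loss is close to zero.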
📅 Agenda
9:00–9:30 AM – Arrival
9:30–11:30 AM – Workshop
11:30 AM–12:00 PM – Wrap-up and networking
⚠️ Doors close at 9:30 sharp! Arrive before then.
💻 Bring your laptop.
Who should join?
Participants should have foundational knowledge of Python and PyTorch. A basic understanding of convolutional neural networks and vector embeddings is beneficial.
About the hosts
Merantix AI Campus is Europe's hub for AI. Launched in April 2021, the AIC plays a key role in shaping Europe’s AI community through countless initiatives that bring together the best start-ups and scale-ups, investors, policymakers, and industry leaders.