Modern AI Models for Vision and Multimodal Understanding

Instructor: Tom Yeh

Included with Coursera Plus

4 modules

Gain insight into a topic and learn the fundamentals.

Advanced level

Recommended experience

1 week to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Apply Nonlinear Support Vector Machines (NSVMs) and Fourier transforms to analyze and process visual data.
Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.
Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.
Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.

Skills you'll gain

Category: Artificial Intelligence and Machine Learning (AI/ML)

Details to know

Shareable certificate

Add to your LinkedIn profile

Recently updated!

August 2025

Assessments

18 assignments

Taught in English

There are 4 modules in this course

Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more—just like today’s leading AI models.

You'll begin by discovering how Nonlinear Support Vector Machines (NSVMs) and Fourier transforms lay the groundwork for signal processing and pattern recognition in visual data. You'll then build a strong foundation in probabilistic reasoning and temporal modeling with RNNs, enabling AI systems to understand sequences and context. Next, you'll learn how transformer architectures revolutionize both language and vision tasks. Finally, you'll dive into multimodal learning with CLIP, which connects images and text, and explore diffusion models that generate high-fidelity images through iterative refinement.

This course is ideal for learners who want to go beyond traditional deep learning and explore the models shaping the future of AI. With a blend of theory, code, and real-world applications, you'll be equipped to tackle cutting-edge challenges in computer vision and multimodal AI.

This course can be taken for academic credit as part of CU Boulder’s MS in Computer Science degree offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals. Learn more: https://coursera.org/degrees/ms-computer-science-boulder.

Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
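As a rough illustration of moving between the time and frequency domains, the sketch below implements the 1D DFT and its inverse directly from their textbook definitions using only the Python standard library. The toy cosine signal and the names dft, idft, and signal are assumptions made for this example and are not taken from the course materials.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier Transform: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_samples = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(signal))
        for k in range(n_samples)
    ]


def idft(spectrum):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+2j*pi*k*n/N)."""
    n_samples = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * n / n_samples)
            for k, c in enumerate(spectrum)) / n_samples
        for n in range(n_samples)
    ]


# Toy signal (an assumption for illustration): one cosine cycle over 8 samples.
N = 8
signal = [math.cos(2 * math.pi * n / N) for n in range(N)]

spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum]          # energy per frequency bin
reconstructed = [c.real for c in idft(spectrum)]  # back to the time domain
```

For this input the energy concentrates in bins 1 and N-1, which is how a single real cosine appears in the frequency domain, and the inverse transform recovers the original samples up to floating-point error. This direct form costs O(N^2) operations; practical code would normally use an FFT routine instead.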

What's included

14 videos, 5 readings, 4 assignments

14 videos (total 80 minutes)

Meet Your Instructor (2 minutes)
Linear SVM (11 minutes)
Visualize Linear (8 minutes)
Radial Basis Function (RBF) (6 minutes)
RBF Kernel (3 minutes)
Visualize a RBF SVM (10 minutes)
1D DFT (5 minutes)
1D Inverse DFT (7 minutes)
1D Basic Functions (5 minutes)
Frequency and Time (6 minutes)
2D DFT (2 minutes)
2D Inverse DFT (2 minutes)
2D Basic Functions (4 minutes)
Frequency and Spatial (3 minutes)

5 readings (total 29 minutes)

Earn Academic Credit for your Work! (10 minutes)
Course Support (10 minutes)
Inside the Course (5 minutes)
Get the Workbook: SVM (2 minutes)
Get the Workbook: Fourier 1D & 2D (2 minutes)

4 assignments (total 75 minutes)

SVM and Fourier (30 minutes)
Support Vector Machine (SVM) (15 minutes)
Fourier 1D (15 minutes)
Fourier 2D (15 minutes)

This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
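The chain rule mentioned above factors a joint probability into a product of conditionals, which is the same factorization an autoregressive model uses when it generates one token or pixel at a time. The sketch below shows that computation on a made-up table; the vocabulary, the probabilities, and the names COND and joint_probability are hypothetical and chosen only for illustration.

```python
# Hypothetical conditional probability table for a 3-token vocabulary.
# Keys are the context (tokens seen so far); values map next token -> probability.
COND = {
    (): {"the": 0.6, "cat": 0.3, "sat": 0.1},
    ("the",): {"the": 0.1, "cat": 0.7, "sat": 0.2},
    ("the", "cat"): {"the": 0.1, "cat": 0.1, "sat": 0.8},
}


def joint_probability(tokens, table):
    """Chain rule: P(x1..xn) = product over i of P(x_i | x_1..x_{i-1})."""
    prob = 1.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[:i])      # everything generated before step i
        prob *= table[context][token]    # multiply in the next conditional
    return prob


# P(the, cat, sat) = P(the) * P(cat | the) * P(sat | the, cat)
p = joint_probability(["the", "cat", "sat"], COND)
```

Here p is the product 0.6 * 0.7 * 0.8. In a trained model the lookup table would be replaced by a network, such as an RNN, that maps the context to a distribution over the next element.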

What's included

15 videos, 2 readings, 5 assignments

15 videos (total 122 minutes)

Probability in Language Models (10 minutes)
Conditional Probabilities (8 minutes)
The Chain Rule of Probabilities (10 minutes)
Calculating Joint Probabilities (12 minutes)
Pixel-Based Image Models (12 minutes)
Autoregressive Image Model (16 minutes)
Attention Mechanisms in Transformer Models (13 minutes)
Batch vs Recurrent (4 minutes)
MLP vs RNN (11 minutes)
Many to One (3 minutes)
One to Many (2 minutes)
One to One (5 minutes)
Sequence to Sequence (2 minutes)
Deep RNN (5 minutes)
Autoregressive RNN (3 minutes)

2 readings (total 4 minutes)

Get the Workbook: Probability (2 minutes)
Get the Workbook: RNN (2 minutes)

5 assignments (total 90 minutes)

Probability and RNN (30 minutes)
Probability Part One (15 minutes)
Probability Part Two (15 minutes)
RNN Part One (15 minutes)
RNN Part Two (15 minutes)

This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
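To make the self-attention step concrete, here is a minimal single-head sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, written with plain Python lists. The toy token vectors, the identity projection matrices, and the helper names are assumptions for this example; a real Transformer or ViT would use learned projections, multiple heads, and a tensor library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # similarity of every query to every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights  # weighted mix of value vectors


# Toy inputs (assumed for illustration): 3 tokens with 2 features each,
# and identity projections so that Q = K = V = x.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(tokens, identity, identity, identity)
```

Each row of weights sums to 1 and says how much one token draws from every other token. For a ViT, the rows of tokens would be embedded image patches rather than word embeddings, but the computation is the same.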

What's included

15 videos, 2 readings, 5 assignments

15 videos (total 80 minutes)

Batch vs Recurrent vs Attention (6 minutes)
Attention + MLP (4 minutes)
Dot-Product Self-Attention (4 minutes)
QKV Self-Attention (4 minutes)
Transformer Encoder (3 minutes)
Self vs Cross Attention (5 minutes)
Encoder and Decoder for Transformer (7 minutes)
Decoder Output Layer (3 minutes)
Image to Tokens (10 minutes)
Normalization for ViT (3 minutes)
Self-Attention for ViT (5 minutes)
Multi-Head Attention (8 minutes)
MLP Forward Feed (3 minutes)
ViT Output Layer (4 minutes)
Loss Gradient for ViT (3 minutes)

2 readings (total 4 minutes)

Get the Workbook: Transformer (2 minutes)
Get the Workbook: ViT (2 minutes)

5 assignments (total 90 minutes)

Transformer and ViT (30 minutes)
Transformer Part One (15 minutes)
Transformer Part Two (15 minutes)
ViT Part One (15 minutes)
ViT Part Two (15 minutes)

In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing task-specific labeled training data. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
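The zero-shot idea can be sketched in a few lines: embed the image and each candidate text prompt, compare them by cosine similarity, and apply a softmax over the similarities. The embeddings below are hand-written placeholders rather than outputs of a real encoder, and the function names and the temperature value of 0.07 are assumptions for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Return softmax probabilities over labels from cosine similarities."""
    image = normalize(image_embedding)
    labels = list(text_embeddings)
    logits = [
        sum(a * b for a, b in zip(image, normalize(text_embeddings[label])))
        / temperature
        for label in labels
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical embeddings; a real system would obtain these from trained
# image and text encoders that share one embedding space.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
prediction = max(probs, key=probs.get)
```

Because the class names enter only as text prompts, changing the label set means editing the dictionary of prompts, not retraining a classifier head.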

What's included

11 videos, 2 readings, 4 assignments

11 videos (total 75 minutes)

Batch of Pairs (5 minutes)
Image Encoder (Batch) (6 minutes)
Text Encoder (Batch) (10 minutes)
Joint Embedding (4 minutes)
Contrastive Pre-Training (12 minutes)
Zero-Shot Image Classifier (6 minutes)
Zero-Shot Image Prediction (6 minutes)
Diffusion Introduction (4 minutes)
Noise Prediction (5 minutes)
Time Conditioning and Parallel Training (4 minutes)
Reverse Diffusion (6 minutes)

2 readings (total 4 minutes)

Get the Workbook: CLIP (2 minutes)
Get the Workbook: Diffusion (2 minutes)

4 assignments (total 75 minutes)

CLIP and Diffusion (30 minutes)
CLIP Part One (15 minutes)
CLIP Part Two (15 minutes)
Diffusion (15 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Tom Yeh

University of Colorado Boulder

4 courses, 9,964 learners

Offered by

University of Colorado Boulder

Frequently Asked Questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.