How AI 'Understands' Images (CLIP) - Computerphile

185,743 views
Published 2024-04-25
With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.

www.facebook.com/computerphile
twitter.com/computer_phile

This video was filmed and edited by Sean Riley.

Computer Science at the University of Nottingham: bit.ly/nottscomputer

Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com/

Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com/

All Comments (21)
  • @michaelpound9891
As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me! [A sketch of that objective appears after the comments.]
  • @adfaklsdjf
    thank you for "if you want to unlock your face with a phone".. i needed that in my life
  • @pyajudeme9245
    This guy is one of the best teachers I have ever seen.
  • Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers
  • @orange-vlcybpd2
Legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
  • @aprilmeowmeow
    Thanks for taking us to Pound town. Great explanation!
  • @chloupichloupa
    That cat got progressively more turtle-like with each drawing.
  • @bluekeybo
    The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
  • @MichalKottman
    9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?
  • @skf957
    These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
  • @beardmonster8051
    The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
  • @TheRealWarrior0
A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)! After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image+prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM). [A sketch of such a projection appears after the comments.]
  • @Shabazza84
    Excellent. Could listen to him all day and even understand stuff.
  • @rigbyb
    6:09 "There isn't red cats" Mike is hilarious and a great teacher lol
  • @eholloway
    "There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
  • @wouldntyaliktono
I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured free-text queries. Embeddings are so cool. [A retrieval sketch appears after the comments.]
  • @uneasy_steps
    I'm a simple guy. I see a Mike Pound video, I click
  • @musikdoktor
    Love seeing AI problems explained on fanfold paper. Classy!
  • @lucianoag999
    So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption “blue dog”.
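
A minimal sketch of the contrastive objective @michaelpound9891's correction refers to, assuming PyTorch; the function name, variable names, and temperature value are illustrative, not CLIP's exact training code. Each batch pairs N images with their N captions, so matching pairs sit on the diagonal of an N-by-N similarity matrix, and the loss raises the diagonal while lowering everything else:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalise so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # logits[i, j] compares image i with caption j; true pairs are on the diagonal.
        logits = (image_emb @ text_emb.T) / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: classify each image against all captions,
        # and each caption against all images.
        loss_images = F.cross_entropy(logits, targets)
        loss_texts = F.cross_entropy(logits.T, targets)
        return (loss_images + loss_texts) / 2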
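On @TheRealWarrior0's point about multimodal LLMs: a hedged sketch of the projection-layer idea, with hypothetical dimensions (768 for the vision encoder, 4096 for the LLM). Systems such as LLaVA train a layer like this on image-caption pairs, typically keeping the vision encoder frozen and using the usual next-token prediction loss on the caption:

    import torch.nn as nn

    class VisionToLLMProjection(nn.Module):
        # Maps vision-encoder patch embeddings into the LLM's token-embedding
        # space, so projected image features can be fed to the LLM alongside
        # ordinary word embeddings.
        def __init__(self, vision_dim=768, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, image_features):
            # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(image_features)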
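And on @wouldntyaliktono's recommender-system use: a small illustration, with hypothetical names, of how such embeddings are typically used for free-text retrieval. Embed the query with the same text encoder used for the catalogue items, then rank items by cosine similarity:

    import numpy as np

    def top_k(query_emb, item_embs, k=5):
        # Rank catalogue items by cosine similarity to a single query embedding.
        q = query_emb / np.linalg.norm(query_emb)
        items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
        scores = items @ q
        best = np.argsort(-scores)[:k]
        return best, scores[best]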