How AI 'Understands' Images (CLIP) - Computerphile

185,743 views
Published 2024-04-25
With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to turn text strings into plausible images? Dr Mike Pound expands on his explanation of Diffusion models.

www.facebook.com/computerphile
twitter.com/computer_phile

This video was filmed and edited by Sean Riley.

Computer Science at the University of Nottingham: bit.ly/nottscomputer

Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com/

Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com/

All Comments (21)
  • @michaelpound9891
As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me! [A sketch of that objective appears after the comments.]
  • @adfaklsdjf
    thank you for "if you want to unlock your face with a phone".. i needed that in my life
  • @pyajudeme9245
    This guy is one of the best teachers I have ever seen.
  • Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers
  • @orange-vlcybpd2
Legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
  • @aprilmeowmeow
    Thanks for taking us to Pound town. Great explanation!
  • @chloupichloupa
    That cat got progressively more turtle-like with each drawing.
  • @bluekeybo
    The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
  • @MichalKottman
    9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?
  • @skf957
    These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
  • @beardmonster8051
    The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
  • @TheRealWarrior0
A very important bit that was skipped over is how you get an LLM to talk about an image (multimodal LLM)! After you've got your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the embedding from the vision encoder produces the desired text output describing the image (and/or executing the instructions in the image+prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM). [A sketch of such a projection appears after the comments.]
  • @Shabazza84
    Excellent. Could listen to him all day and even understand stuff.
  • @rigbyb
    6:09 "There isn't red cats" Mike is hilarious and a great teacher lol
  • @eholloway
    "There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
  • @wouldntyaliktono
I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured free-text queries. Embeddings are so cool. [A retrieval sketch appears after the comments.]
  • @uneasy_steps
    I'm a simple guy. I see a Mike Pound video, I click
  • @musikdoktor
    Love seeing AI problems explained on fanfold paper. Classy!
  • @lucianoag999
    So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption “blue dog”.
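
A minimal sketch of the contrastive objective @michaelpound9891's correction refers to, assuming PyTorch; the function name, variable names, and temperature value are illustrative, not CLIP's exact training code. Each batch pairs N images with their N captions, so matching pairs sit on the diagonal of an N-by-N similarity matrix, and the loss raises the diagonal while lowering everything else:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalise so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # logits[i, j] compares image i with caption j; true pairs are on the diagonal.
        logits = (image_emb @ text_emb.T) / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: classify each image against all captions,
        # and each caption against all images.
        loss_images = F.cross_entropy(logits, targets)
        loss_texts = F.cross_entropy(logits.T, targets)
        return (loss_images + loss_texts) / 2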
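On @TheRealWarrior0's point about multimodal LLMs: a hedged sketch of the projection-layer idea, with hypothetical dimensions (768 for the vision encoder, 4096 for the LLM). Systems such as LLaVA train a layer like this on image-caption pairs, typically keeping the vision encoder frozen and using the usual next-token prediction loss on the caption:

    import torch.nn as nn

    class VisionToLLMProjection(nn.Module):
        # Maps vision-encoder patch embeddings into the LLM's token-embedding
        # space, so projected image features can be fed to the LLM alongside
        # ordinary word embeddings.
        def __init__(self, vision_dim=768, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, image_features):
            # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(image_features)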
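And on @wouldntyaliktono's recommender-system use: a small illustration, with hypothetical names, of how such embeddings are typically used for free-text retrieval. Embed the query with the same text encoder used for the catalogue items, then rank items by cosine similarity:

    import numpy as np

    def top_k(query_emb, item_embs, k=5):
        # Rank catalogue items by cosine similarity to a single query embedding.
        q = query_emb / np.linalg.norm(query_emb)
        items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
        scores = items @ q
        best = np.argsort(-scores)[:k]
        return best, scores[best]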