How AI 'Understands' Images (CLIP) - Computerphile
185,743 views
Published 2024-04-25
www.facebook.com/computerphile
twitter.com/computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharanblog.com/
Thank you to Jane Street for their support of this channel. Learn more: www.janestreet.com/
All Comments (21)
-
As people have correctly noted: When I talk about the way we train at 9:50, I should say we maximise the similarity on the diagonal, not the distance :) Brain failed me!
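The corrected objective can be sketched in plain Python: build a cosine-similarity matrix between image and text embeddings, then apply cross-entropy so the diagonal (matching pairs) gets the highest similarity in each row and column. This is an illustrative toy, not CLIP itself; the real model uses learned encoders, batched tensors, and a learnable temperature.

```python
import math

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: maximise similarity on the diagonal
    (matching image-text pairs), push it down everywhere else."""
    n = len(image_embs)

    def norm(v):
        s = math.sqrt(sum(x * x for x in v))
        return [x / s for x in v]

    imgs = [norm(v) for v in image_embs]
    txts = [norm(v) for v in text_embs]

    # Cosine-similarity matrix, scaled by temperature.
    sims = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
             for j in range(n)] for i in range(n)]

    def cross_entropy(logits, target):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        return -math.log(exps[target] / sum(exps))

    # Image->text: the correct text for row i sits at column i (diagonal);
    # text->image: the correct image for column j sits at row j.
    loss_i2t = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Perfectly aligned pairs give a near-zero loss, while mismatched captions drive it up, which is exactly the "maximise the diagonal" behaviour the correction describes.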
-
thank you for "if you want to unlock your face with a phone".. i needed that in my life
-
This guy is one of the best teachers I have ever seen.
-
Dr. Pound's videos are on another level! He explains things with such passion and such clarity rarely found on the web! Cheers
-
The legend has it that the series will only end when the last sheet of continuous printing paper has been written on.
-
Thanks for taking us to Pound town. Great explanation!
-
That cat got progressively more turtle-like with each drawing.
-
The man, the myth, the legend, Dr. Pound. The best lecturer on Computerphile.
-
9:45 - wasn't it supposed to be "minimize the distance on diagonal, maximize elsewhere"?
-
These guys are so watchable, and somehow they make an inherently inaccessible subject interesting and easy to follow.
-
The biggest problem with unlocking a face with your phone is that you'll laugh too hard to hear the video for a minute or so.
-
A very important bit that was skipped over is how you get an LLM to talk about an image (a multimodal LLM)! After you get your embedding from the vision encoder, you train a simple projection layer that aligns the image embedding with the semantic space of the LLM. You train the projection layer so that the vision encoder's embedding produces the desired text output describing the image (and/or executing the instructions in the image+prompt). You basically project the "thoughts" of the part that sees (the vision encoder) into the part that speaks (the massive LLM).
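The projection the comment describes is, at its simplest, one linear layer mapping vision-encoder output into the LLM's embedding dimension. A minimal sketch with illustrative names (real systems use learned weights trained by backpropagating the LLM's captioning loss, often with an MLP rather than a single linear map):

```python
import random

class VisionProjection:
    """Hypothetical minimal projection layer: maps a vision-encoder
    embedding (dim d_vision) into the LLM's token-embedding space
    (dim d_llm) so image features can be fed to the LLM as if they
    were token embeddings."""

    def __init__(self, d_vision, d_llm, seed=0):
        rng = random.Random(seed)
        # Weight matrix and bias; in practice these are trained so the
        # LLM emits the desired caption for the projected embedding.
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_vision)]
                  for _ in range(d_llm)]
        self.b = [0.0] * d_llm

    def __call__(self, vision_emb):
        # y = W @ x + b
        return [sum(w * x for w, x in zip(row, vision_emb)) + bi
                for row, bi in zip(self.W, self.b)]
```

The output vector has the LLM's hidden dimension, so it can be prepended to the prompt's token embeddings like any other "word".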
-
Excellent. Could listen to him all day and even understand stuff.
-
6:09 "There isn't red cats" Mike is hilarious and a great teacher lol
-
"There's a lot of stuff on the internet, not all of it good, I should add" - Dr Mike Pound, 2024
-
I love these encoder models. And I have seen these methods implemented in practice, usually as part of a recommender system handling unstructured free-text queries. Embeddings are so cool.
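The recommender use-case mentioned here usually reduces to nearest-neighbour search in embedding space. A toy sketch, assuming the query and catalogue items have already been embedded by some text encoder (the vectors below are placeholders, not real encoder output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recommend(query_emb, item_embs, k=3):
    """Rank catalogue items by embedding similarity to a free-text
    query and return the indices of the top-k matches."""
    ranked = sorted(range(len(item_embs)),
                    key=lambda i: cosine(query_emb, item_embs[i]),
                    reverse=True)
    return ranked[:k]
```

At production scale the exhaustive sort is replaced by an approximate nearest-neighbour index, but the ranking criterion is the same.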
-
I'm a simple guy. I see a Mike Pound video, I click
-
Love seeing AI problems explained on fanfold paper. Classy!
-
So, if we want to break AI, we just have to pollute the internet with a couple billion pictures of red cats with the caption “blue dog”.
-
Love the genuine background.