Well, now I am tripping: GPT-4o combines text, image, audio, and video input. A must-watch:

Introducing GPT-4o
https://www.youtube.com/watch?v=DQacCB9tDaw