Multimodal AI built with Tar Heel research

A computer science professor and his student teamed with Microsoft Research to produce breakthrough technology.

Bansal continues to dive into AI at UNC-Chapel Hill, and he imagines a future where the technology could make a significant impact in the classroom. (Image courtesy of UNC Creative)

In the past year, researchers at UNC-Chapel Hill helped engineer one of the most significant breakthroughs in artificial intelligence.

Working with a team at Microsoft Research, UNC-Chapel Hill computer science professor Mohit Bansal and his student Zineng Tang, a Microsoft intern, created the CoDi AI system, a model capable of generating any combination of outputs (e.g., text, images, videos, audio) from any combination of inputs.

Microsoft introduced CoDi on its website last summer, and a few months later, the team presented the revamped CoDi-2 to much fanfare.

Why all the fuss? What makes CoDi such a big deal?

Previous generative AI systems performed one-to-one tasks. For instance, a user might type in “draw a picture of a frog” and get an image of a frog (text-to-image) or submit a photo and get a caption (image-to-text).

CoDi isn’t limited to one-to-one tasks. Short for “composable diffusion,” CoDi was the first AI model that could take any combination of inputs (text, audio, photo, video) and produce any combination of outputs using the idea of “bridge alignment,” giving the tool immense creative power. Most importantly, it can do so without relying on a prohibitively large number of training objectives (which is computationally infeasible) or training data for all these combinations (which is unavailable).
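The scaling argument here can be made concrete with a short counting sketch. This is purely illustrative: the modality list, function names, and the simplified alignment scheme below are assumptions for exposition, not CoDi’s actual training code.

```python
from itertools import combinations

# Illustrative sketch only: simplifies the scaling argument behind
# "bridge alignment"; this is not CoDi's actual training setup.

MODALITIES = ["text", "image", "video", "audio"]

def nonempty_subsets(items):
    """Return every non-empty subset of a list of modalities."""
    return [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Naive any-to-any training: one objective per (input set, output set)
# pair, which grows exponentially in the number of modalities.
combos = nonempty_subsets(MODALITIES)
naive_objectives = len(combos) * len(combos)  # (2**4 - 1)**2 == 225

# Bridge alignment: align each non-text modality to a shared latent
# space (with text as the "bridge"), then compose freely at inference.
bridge_objectives = len(MODALITIES) - 1  # 3 pairwise alignments

print(f"naive: {naive_objectives} objectives, "
      f"bridged: {bridge_objectives}")
```

With four modalities the naive scheme already needs 225 separate input-to-output objectives (most with no paired training data in existence), while aligning everything through a shared space needs only a handful of pairwise objectives.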

“CoDi is a very novel model in the AI community because it can effectively and efficiently handle unseen combinations of input/output modalities without relying on training the model on such expensive and hard-to-find data,” said Bansal, the computer science department’s John R. & Louise S. Parker Professor and the director of its MURGe-Lab. “This opens up a lot of exciting new applications.”

The CoDi project page includes several examples of this multimodal generative process:

  • User inputs a picture of Times Square, an audio clip of rain and the text “teddy bear on a skateboard,” and CoDi produces a video clip of a skating teddy bear on a rainy day in Times Square.
  • User inputs a picture of a forest and an audio clip of a piano, and CoDi produces a picture of a man playing piano in the forest with the text “playing piano in a forest.”
  • User types “train coming into station,” and CoDi produces a video, with audio, of a train pulling in.

The recently released CoDi-2 extends CoDi-1 using a large language model framework and is even more intuitive and interactive, handling more complex instructions that interleave multiple modalities.

AI technology is still developing, but there’s no doubt that the CoDi project has made massive waves. Bansal’s student, Tang, was named a recipient of a prestigious fellowship, one of only four winners across North America. Tang received several top offers and is continuing his education as a doctoral student at the University of California, Berkeley.

Meanwhile, Bansal continues to dive into AI at UNC-Chapel Hill, and he imagines a future where the technology could make a significant impact in the classroom. He is co-principal investigator and the core-AI lead for the National Science Foundation AI Institute for Engaged Learning. At the institute, researchers are using similar multimodal technology as AI assistants to improve the classroom experience for students and teachers, including Bansal’s newer work on video and diagram generation.

“Teachers and students will be able to create interesting, visual stories, especially with CoDi-2,” Bansal said. “They can even talk to it or interact with it, create complex videos, even trailers of complex concepts to visually explain them more easily and interactively build them.”