This February, OpenAI unveiled Sora[1], an AI model that can generate videos up to a minute long, while maintaining visual quality and adherence to a user’s prompt. While the world is buzzing with excitement over how Sora further advances AI applications, we invited SJTU Professor Zou Junni to provide a brief introduction to the technical principles of Sora and offer guidance on future research directions.
1. Can you briefly introduce the technical principles of Sora?
We estimate that Sora can be divided into three main modules: DALL-E 3 [2] + GPT-4 [3]; Spacetime Encoder & Decoder; and Diffusion Transformer (DiT) [4]. Specifically:
1) DALL-E 3 + GPT-4 enables the re-captioning of video data.
2) Spacetime Encoder & Decoder builds temporally consistent joint encoding and decoding for images and videos.
3) Diffusion Transformer achieves scalable long-sequence training and inference.
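To give a rough sense of why long-sequence scalability matters here, the sketch below counts how many tokens a transformer would see if a latent video were cut into non-overlapping spacetime patches. This is only an illustration: Sora's actual patching scheme is not public, and the function name, the latent grid sizes, and the patch sizes are all assumptions chosen for the example.

```python
def spacetime_token_count(frames, height, width, patch_t=1, patch_h=2, patch_w=2):
    """Count tokens when a (frames, height, width) latent grid is cut into
    non-overlapping spacetime patches of size (patch_t, patch_h, patch_w).

    All sizes here are hypothetical; they are not Sora's real settings.
    """
    for size, patch in ((frames, patch_t), (height, patch_h), (width, patch_w)):
        if patch <= 0 or size % patch != 0:
            raise ValueError(f"size {size} is not divisible by patch size {patch}")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# An image can be treated as a one-frame video, so both share one token format.
image_tokens = spacetime_token_count(frames=1, height=32, width=32)
video_tokens = spacetime_token_count(frames=16, height=32, width=32)
print(image_tokens, video_tokens)
```

Under these assumed sizes, the sequence length grows linearly with the number of latent frames, which is one reason a model that handles long sequences well is attractive for minute-long video.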
2. How much of a gap is there between current domestic research and Sora?
Judging from publicly available information and the models and tools known in China, domestic work still lags considerably behind Sora in all three modules. Domestic research appears to be at an early stage, focusing on single-image generation, single modalities, small-scale data, and short sequences. In addition, the scale and quality of training data, as well as AI systems engineering capabilities (AI frameworks, AI compilers, AI chips, large models), constrain the development of related research in China.
3. In your view, what should we as students focus on in the field of generative AI in the future? What can we do to better adapt to and promote its development?
With the rise of multi-modal large models such as Gemini, GPT-4V, and Sora, generative AI is gradually becoming large-scale, multi-modal, highly interactive, and high-fidelity. At the same time, because today's generative models are built on deep neural networks, which are fundamentally powerful function approximators, generative AI still faces the challenge of how to model probability densities. Sora itself also suffers from serious hallucinations in its understanding of the physical world.
Therefore, future research can focus on three directions:
1) Research on the fundamentals of generative theory. Investigating deeper and more powerful foundational theories remains crucial for generative AI.
2) Multi-modal 3D image generation and 4D video generation. The success of Sora indicates the emergent capabilities of large models in physical logic, providing a foundation for research in generating multi-modal 3D images and further 4D video generation.
3) AI for CG and physics-based generative models. Sora demonstrates the potential for generative AI to construct virtual physical worlds. Therefore, systematically understanding the physical world and generating data on top of that is an important direction for the future development of large models and artificial intelligence.
【References】
[1]. Openai.com/sora
[2]. Openai.com/dall-e-3
[3]. Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[4]. Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.
【Author Biography】
Zou Junni (zoujunni@sjtu.edu.cn) is currently a full Professor with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. She was a visiting professor with the Department of Electrical and Computer Engineering, University of California, San Diego (UCSD), USA, from 2011 to 2012.
Dr. Zou is now responsible for research and graduate education at the Institute of Media, Information, and Network, SJTU. Her research interests include multimedia communication, distributed network optimization, immersive visual processing, and reinforcement learning.
Long Kexin (kexinlong@sjtu.edu.cn) works with the Global Communications Office of the International Affairs Department at SJTU, serving the global promotion needs of faculty and students. She is engaged in content creation and editing to shape the international image and enhance the reputation of SJTU.