In this second part, we explain the new milestones that Artificial Intelligence (AI) brings to the industry.
By: Luis Fernando Gutiérrez Cano, PhD, and Luis Jorge Orcasitas Pacheco, MA*
Second milestone for the audiovisual sector: This development is a first for the audiovisual sector: by predicting clean patches from noisy ones, the model demonstrates remarkable scalability in video generation. Despite facing challenges in content interpretation and quality, its ability to improve sample quality as training increases makes it a powerful tool for producing high-quality visual content.
Variable durations, resolutions, and aspect ratios
Previous approaches to image and video generation typically resize, crop, or trim videos to a standard size. However, we found that training with data at its native size offers several benefits.
Flexibility in sampling
Sora can sample videos at different resolutions and aspect ratios, allowing content to be created for different devices in their native formats.
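As a minimal sketch of why native aspect ratios are possible, consider how a patch-based model tokenizes a clip: each frame is split into fixed-size patches, so any resolution whose sides divide by the patch size maps to a valid token grid without cropping. The function name and the patch size of 16 below are our own illustrative assumptions, not values published by OpenAI.

```python
# Conceptual sketch (not OpenAI's code): a patch-based model can accept
# clips at their native aspect ratio because every resolution simply
# maps to a different-shaped grid of fixed-size patches.

def patch_grid(frames: int, height: int, width: int, patch: int = 16):
    """Return the (time, rows, cols) token grid for a video, assuming
    its dimensions are divisible by the hypothetical patch size."""
    assert height % patch == 0 and width % patch == 0
    return frames, height // patch, width // patch

# Landscape, portrait, and square clips all yield valid grids.
print(patch_grid(30, 1088, 1920))  # widescreen -> (30, 68, 120)
print(patch_grid(30, 1920, 1088))  # vertical, e.g. for phones
print(patch_grid(30, 1024, 1024))  # square
```

Cropping everything to a square, by contrast, would force all three clips into the same grid shape and discard part of the frame.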
Improved composition and framing
Training with videos in their native aspect ratios improves composition and framing compared to models that crop all training videos to be square.
Third milestone for the audiovisual sector: This procedure represents a breakthrough for the audiovisual sector, with a series of benefits, challenges, and opportunities. Training models like Sora on variable durations, resolutions, and aspect ratios enables the creation of diverse content adaptable to different devices and viewing needs. This flexibility in sampling enhances Sora's ability to generate high-quality visual content in a variety of formats. Previous approaches that resize or crop videos, by contrast, may have hurt the quality and composition of the generated content.
Language Understanding
Training text-to-video generation systems with the re-captioning technique introduced in DALL-E 3 requires a large number of videos with corresponding text captions. This method involves first training a highly descriptive captioning model and then using it to generate text captions for all videos in our training set. Training with descriptive video captions improves both text fidelity and the overall quality of the videos. Following the DALL-E 3 approach, we leveraged GPT to convert short user prompts into detailed captions, which are sent to the video model. This allows Sora to generate high-quality videos that accurately follow the user's prompts (Figure 6).
Fourth milestone for the audiovisual sector: This approach represents a benchmark for the audiovisual sector, with a number of benefits, challenges and opportunities. The re-captioning technique introduced in DALL-E 3 benefits the industry by enabling the training of text-to-video generation systems with a large number of videos and their corresponding captions. Providing descriptive captions improves text fidelity as well as the overall quality of the videos. However, the need for large amounts of data and training with detailed captions can pose challenges in terms of resources and time. Despite this, leveraging GPT to convert user prompts into detailed captions enhances Sora's ability to generate high-quality videos that accurately follow the user's prompts, increasing its versatility and usefulness in the production of audiovisual content.
Sora's versatility
Sora's versatility goes beyond text prompts: it also accepts images and videos as input, enabling a wide variety of editing tasks such as creating looping videos, animating static images, and extending videos forward or backward in time. Sora can generate videos based on DALL-E images, extend existing videos, edit videos as prompted by text, and interpolate between two input videos seamlessly. Below are examples of these capabilities, including DALL-E image animation, video extension, video-to-video editing, and video connection.
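The interpolation capability can be illustrated with a toy sketch: blend two clips' latent representations with a weight that ramps from 0 to 1 over the frames, so the result starts as one clip and ends as the other. This is our own simplified illustration of the idea of interpolating between videos, not Sora's actual mechanism, and the array shapes are assumptions.

```python
import numpy as np

# Conceptual sketch (our illustration, not Sora's method): interpolate
# between two clips by blending their latents with a per-frame weight
# that ramps linearly from 0 (pure clip A) to 1 (pure clip B).

def interpolate_videos(latent_a: np.ndarray, latent_b: np.ndarray) -> np.ndarray:
    """Blend two latent videos of shape (T, H, W, C) so the result
    starts as clip A and ends as clip B."""
    t = latent_a.shape[0]
    w = np.linspace(0.0, 1.0, t).reshape(t, 1, 1, 1)  # weight per frame
    return (1.0 - w) * latent_a + w * latent_b

a = np.zeros((8, 4, 4, 3))  # stand-in latent for clip A
b = np.ones((8, 4, 4, 3))   # stand-in latent for clip B
mix = interpolate_videos(a, b)
print(mix[0].max(), mix[-1].min())  # first frame is all A, last is all B
```

A real system would of course blend learned representations and then decode them, but the ramped weighting captures the "seamless transition" intuition.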
Fifth milestone for the audiovisual sector: This achievement represents an essential advance for the audiovisual sector, with a series of benefits, challenges and opportunities. Sora's adaptability benefits the industry by offering a wide range of editing capabilities, such as looping video creation, still image animation, and extending videos over time. The complexity and amount of resources required to use these features could pose a challenge for some users. Even so, Sora's ability to generate videos based on DALL-E images, extend existing videos, edit videos according to text prompts, and interpolate between two input videos seamlessly enhances its usefulness and versatility in the production of audiovisual content, which could boost creativity and efficiency in the industry.
Imaging capabilities
Sora can also generate images: patches of Gaussian noise (small random variations) are arranged in a spatial grid with a temporal extent of one frame. The model can generate images of various sizes, up to a resolution of 2048x2048. By training large-scale video models, Sora exhibits a variety of emerging simulation capabilities, such as 3D consistency, long-term consistency, object permanence, interaction with the environment, and simulation of digital worlds. These abilities suggest that the continued scaling of video models is a promising path towards developing simulators capable of modeling the physical and digital worlds, as well as the objects, animals, and people that inhabit them.
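The "image as a one-frame video" idea can be sketched concretely: the initial latent is a grid of Gaussian noise patches whose temporal dimension is 1. The patch size of 16 and the flattened per-patch dimension below are our own illustrative assumptions; only the 2048x2048 maximum resolution comes from the source.

```python
import numpy as np

# Conceptual sketch (assumed shapes, not OpenAI's code): an image is
# treated as a video with a temporal extent of one frame, starting
# from a spatial grid of Gaussian noise patches.

rng = np.random.default_rng(0)
patch = 16                   # hypothetical patch size
height, width = 2048, 2048   # Sora's reported maximum image resolution

# One frame (t=1), one flattened noise vector per spatial patch position.
noise_grid = rng.standard_normal(
    (1, height // patch, width // patch, patch * patch)
)
print(noise_grid.shape)  # (1, 128, 128, 256)
```

Because the same grid with t > 1 describes a video, image generation falls out of the video model as a special case rather than requiring a separate architecture.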
Sixth milestone for the audiovisual sector: This advance benefits the audiovisual sector by providing a versatile tool to create high-quality images with different sizes and resolutions. However, the complexity and resources required to use these capabilities can be challenging for some users. Despite this, by training large-scale video models, Sora demonstrates emerging simulation capabilities that enhance its potential for the development of simulators of the physical and digital world, including objects, animals and people, which could boost innovation and creativity in the audiovisual industry.
Conclusions and recommendations
Sora is not available to the general public as it is in an evaluation phase where it is essential to ensure that its use is not diverted for improper purposes, guaranteeing the safety of future users. It has not yet been confirmed whether Sora will have a pricing policy or a free version.
The advancement of artificial intelligence, especially in the field of video generation, has generated new opportunities and challenges in the audiovisual sector. Sora, the model developed by OpenAI, represents a significant step towards creating simulations of the physical world, offering a wide range of capabilities for the production of high-quality visual content.
All in all, the future of AI-powered video is promising, and models like Sora are leading the way in creating simulators of the physical world. With a careful approach and responsible implementation, this technology has the potential to transform the audiovisual industry and open up new creative and commercial possibilities.
References
OpenAI (2024). Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators
*Luis Fernando Gutiérrez Cano and Luis Jorge Orcasitas Pacheco are professors and researchers at the Medellín campus of the Universidad Pontificia Bolivariana, in the undergraduate and postgraduate programs of the Faculty of Social Communication-Journalism. This edition was prepared with the support of students Laura Sofía Arboleda Ortega and Mariana Giraldo Correa.