Sora is a text-to-video generative model released by OpenAI that can generate high-quality video from descriptive text prompts. Its capabilities mark a major leap for artificial intelligence in the creative domain, promising to turn simple text descriptions into rich, dynamic video content.
The release of Sora has drawn broad attention and discussion in the tech community, but OpenAI currently has no plans to make it publicly available. Instead, it has granted limited access to a small number of researchers and creative professionals in order to gather usage feedback and evaluate the safety of the technology.
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.
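To make the idea of "spacetime patches of latent codes" concrete, here is a minimal sketch of how a latent video might be carved into patch tokens. The tensor layout, patch sizes, and function name are illustrative assumptions; the report does not disclose Sora's actual implementation.

```python
import numpy as np

def patchify_spacetime(latents: np.ndarray, pt: int = 2, ph: int = 4, pw: int = 4) -> np.ndarray:
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), one token per
    spacetime patch. Patch sizes here are illustrative assumptions.
    """
    T, H, W, C = latents.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly into patches"
    # Carve the video into a grid of (pt, ph, pw) blocks.
    x = latents.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the grid axes together, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten each patch into a single token vector.
    return x.reshape(-1, pt * ph * pw * C)

# Example: 16 latent frames at 32x32 with 4 channels -> 8*8*8 = 512 tokens.
tokens = patchify_spacetime(np.zeros((16, 32, 32, 4)))
print(tokens.shape)  # (512, 128)
```

Because every video, regardless of duration, resolution, or aspect ratio, reduces to a variable-length sequence of such tokens, a transformer can consume them all with one architecture.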
Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.
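As a rough illustration of that denoising objective, the sketch below corrupts clean patch tokens with Gaussian noise at a random timestep and trains a placeholder model to recover the original patches. The linear noise schedule, the clean-patch prediction target, and every name here are assumptions for illustration, not details from the report.

```python
import torch

def diffusion_loss(model, clean_patches, text_emb, num_steps: int = 1000):
    """One denoising training step: corrupt clean patch tokens with noise,
    then train the model to predict the original "clean" patches.

    `model` is any network (e.g. a transformer over patch tokens) taking
    (noisy_patches, t, text_emb); all names here are hypothetical.
    """
    b = clean_patches.shape[0]
    # Sample a random diffusion timestep per example.
    t = torch.randint(0, num_steps, (b,), device=clean_patches.device)
    # Simple linear signal schedule: 1 at t=0, near 0 at t=num_steps.
    # (A real schedule would differ; this is a stand-in.)
    alpha = 1.0 - t.float() / num_steps
    alpha = alpha.view(b, *([1] * (clean_patches.dim() - 1)))
    noise = torch.randn_like(clean_patches)
    # Mix signal and noise: more noise at larger t.
    noisy = alpha.sqrt() * clean_patches + (1.0 - alpha).sqrt() * noise
    # The model predicts the clean patches given noisy input + conditioning.
    pred = model(noisy, t, text_emb)
    return torch.nn.functional.mse_loss(pred, clean_patches)
```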
In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.
(Video comparison: base compute vs. 4x compute vs. 32x compute.)