SOTA stands for "State Of The Art" and refers to the best performance currently achieved on a given task.
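In AI research, for example in multimodal learning, class (c) methods are the current SOTA, but many of their ideas were inspired by ViLT, the representative work of class (d). ViLT replaces the VE (visual embedder) entirely with a simple patch projection module, borrowing the idea from ViT; its starting point is a new way of processing visual data.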
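Much prior work has modeled video data generatively with a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works are usually limited to a narrow category of visual data, to shorter videos, or to videos of a fixed size. Sora, by contrast, is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios, and resolutions, up to a full minute of high-definition video.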
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks, autoregressive transformers, and diffusion models. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.[heading2]Turning visual data into patches
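Class (c) methods are the SOTA, but many of their ideas were inspired by ViLT, the representative work of class (d), so before introducing the SOTA we first look in detail at how ViLT works. The VE (visual embedder), TE (textual embedder), and MI (modality interaction) taxonomy used at the start of this section comes from ViLT. ViLT's starting point is to replace the VE entirely with a simple patch projection module, borrowing the idea from ViT. The options for the visual embedder compare as follows:

Region Feature: the traditional CNN backbone + detection head approach. It is essentially object detection, after which RoI Align extracts the corresponding features as vision tokens. The computational cost is high.

Grid Feature: the image passes through the CNN backbone only, and the final feature map is used as the vision tokens. The computational cost is still high.

Patch Projection: inspired by ViT, the image goes straight through a simple conv that turns each $$32\times 32$$ pixel region into a patch, which is used directly as a vision token. Inference is extremely fast.

Network structure: the overall architecture is unmistakably a typical class (d) design. It closely resembles ViT and is an encoder structure. After embedding, the text is $$L\times H$$ and the image is $$N\times H$$. The text and the image are each preceded by a CLS token, so the total input size is $$(L+N+2)\times H$$. Note that the PE (Position Encoding) has two parts: first, 0 and 1 encode the text part and the image part respectively; second, the regular position encoding is applied within the text and within the image.

Loss design: the training loss is relatively complex and deserves a careful look. It has three components:

Image Text Matching (ITM): similar to a contrastive loss, taken from the text CLS token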
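The Patch Projection step and the $$(L+N+2)\times H$$ input layout described above can be sketched roughly as follows. This is a minimal NumPy illustration, not ViLT's actual implementation: the weights are random, the helper names (`patch_projection`, `build_input`, and the keys of `params`) are hypothetical, and the position and modality-type embeddings are assumed to be simple learned tables that are added to the tokens. A conv whose kernel size and stride both equal the patch size is equivalent to cutting the image into non-overlapping blocks and applying one linear projection per block, which is what the sketch does.

```python
import numpy as np

def patch_projection(image, weight, bias, patch=32):
    """Project non-overlapping patch x patch blocks of an (h, w, c) image
    to the hidden size. Equivalent to a conv with kernel = stride = patch."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    blocks = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    flat = blocks.reshape(gh * gw, patch * patch * c)  # (N, patch*patch*c)
    return flat @ weight + bias                        # (N, H)

def build_input(text_emb, image_emb, params):
    """Prepend one CLS token per modality, add position and modality-type
    embeddings, then concatenate text and image tokens."""
    text = np.vstack([params["text_cls"], text_emb])      # (L+1, H)
    image = np.vstack([params["image_cls"], image_emb])   # (N+1, H)
    # Modality type: row 0 marks text tokens, row 1 marks image tokens.
    text = text + params["text_pos"][: len(text)] + params["type_emb"][0]
    image = image + params["image_pos"][: len(image)] + params["type_emb"][1]
    return np.vstack([text, image])                       # (L+N+2, H)

# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
hidden, patch, text_len = 16, 32, 5
image = rng.normal(size=(64, 96, 3))          # 2 x 3 grid of patches, N = 6
weight = rng.normal(size=(patch * patch * 3, hidden))
bias = np.zeros(hidden)

image_tokens = patch_projection(image, weight, bias, patch)
text_tokens = rng.normal(size=(text_len, hidden))  # stand-in for word embeddings
num_patches = image_tokens.shape[0]

params = {
    "text_cls": rng.normal(size=(1, hidden)),
    "image_cls": rng.normal(size=(1, hidden)),
    "text_pos": rng.normal(size=(text_len + 1, hidden)),
    "image_pos": rng.normal(size=(num_patches + 1, hidden)),
    "type_emb": rng.normal(size=(2, hidden)),
}
sequence = build_input(text_tokens, image_tokens, params)
print(sequence.shape)  # (13, 16), i.e. (L + N + 2, H) with L = 5, N = 6
```

Because no CNN backbone or detection head runs before the transformer, the visual embedding costs one matrix multiplication per image, which is consistent with the speed advantage claimed for Patch Projection.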
August Kamp is a musician, researcher, creative activist and multidisciplinary artist. "Sora represents a real turning point for me as an artist whose scope has always been limited by imagination being at odds with means," she explains. "Being able to build and iterate on cinematic visuals this intuitively has opened up categorically new lanes of artistry to me... I truly cannot wait to see what other forms of storytelling will come into reach with the future of these tools."[heading2]Josephine Miller, Creative Director[content]Josephine Miller is the Co-Founder and Creative Director of London based Oraar Studio, specializing in the design of 3D visuals, augmented reality and digital fashion. "Sora has opened up the potential to bring to life ideas I've had for years, ideas that were previously technically impossible," she states. "The ability to rapidly conceptualize at such a high level of quality is not only challenging my creative process but also helping me evolve in storytelling. It's enabling me to translate my imagination with fewer technical constraints."