Below is an introduction to a high-quality project related to video lip sync: Google's "Generating audio for video" (V2A) project. Note that the project is still being actively refined and improved, and further research is underway. Its characteristics are as follows:
Our research stands out from existing video-to-audio solutions because it can understand raw pixels, and adding a text prompt is optional. The system also doesn't need manual alignment of the generated sound with the video, which would involve tediously adjusting different elements of sounds, visuals, and timings.

Still, there are a number of other limitations we're trying to address, and further research is underway. Since the quality of the audio output depends on the quality of the video input, artifacts or distortions in the video that fall outside the model's training distribution can lead to a noticeable drop in audio quality.

We're also improving lip synchronization for videos that involve speech. V2A attempts to generate speech from the input transcripts and synchronize it with characters' lip movements, but the paired video generation model may not be conditioned on those transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn't generate mouth movements that match the transcript.
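To make the two interface properties above concrete (an optional text prompt, and audio duration that aligns with the video automatically rather than by manual adjustment), here is a minimal, purely hypothetical sketch. Every name in it (`VideoClip`, `generate_audio`, the parameters) is an assumption for illustration, not the real V2A API; the stub emits silence only to show how output length falls out of the clip's frame count and frame rate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical V2A-style interface -- all names here are illustrative
# assumptions, not the actual system's API.

@dataclass
class VideoClip:
    frames: Sequence[bytes]  # raw pixel data per frame (assumed RGB)
    fps: float               # frames per second


def generate_audio(clip: VideoClip,
                   prompt: Optional[str] = None,
                   sample_rate: int = 16_000) -> list[float]:
    """Return an audio waveform whose duration matches the clip.

    The text prompt is optional: when omitted, generation would be
    conditioned on pixels alone. This stub just returns silence of
    the correct length, showing that sound/video alignment comes
    from the clip's frame count and fps -- no manual timing needed.
    """
    duration_s = len(clip.frames) / clip.fps
    n_samples = int(duration_s * sample_rate)
    return [0.0] * n_samples


# A 2-second, 24 fps clip yields 2 s of audio at 16 kHz (32,000 samples),
# with or without a text prompt.
clip = VideoClip(frames=[b""] * 48, fps=24.0)
audio = generate_audio(clip)                      # pixels only
audio_prompted = generate_audio(clip, "rainfall") # optional prompt
```

The point of the sketch is only the shape of the interface: the prompt defaults to `None`, and duration alignment is derived, not hand-tuned.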