How many tokens does Sora use to encode one second of high-resolution video (1920*1080)? (February version)
5,342 expected
Resolves when the answer becomes known.
If Sora does not use tokens, resolves N/A. This seems highly unlikely, since OpenAI has repeatedly stated that they use Diffusion Transformers.
Only the latent diffusion model part counts. If Transformers are also used for the VAE compression, that part is ignored.
For reference:
The original ViT splits an image into 16×16-pixel patches, so a 256×256-pixel picture becomes a 16 by 16 grid of tokens (256 tokens). This architecture did not use a VAE to compress to a latent space.
Gemini 1.5 Pro uses 300 tokens per second of video.
LLaVA-UHD uses up to 5k tokens for 4k resolution images.
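To make the reference arithmetic concrete, here is a minimal Python sketch of how a token count follows from resolution, compression, and patch size. The function name `tokens_per_second` and all of its parameters are hypothetical, introduced only for illustration. The video settings in the second call (8x spatial compression, 4x temporal compression, 2×2 latent patches, 24 fps) are assumptions chosen to show the calculation, not known facts about Sora.

```python
import math

def tokens_per_second(width, height, fps, spatial_down, temporal_down, patch):
    """Count transformer tokens for one second of video under assumed settings."""
    # Latent grid size after the (assumed) VAE spatial compression.
    latent_w = math.ceil(width / spatial_down)
    latent_h = math.ceil(height / spatial_down)
    # Latent frames per second after the (assumed) temporal compression.
    latent_frames = math.ceil(fps / temporal_down)
    # Each patch x patch block of latent positions becomes one token.
    tokens_per_frame = math.ceil(latent_w / patch) * math.ceil(latent_h / patch)
    return tokens_per_frame * latent_frames

# ViT reference: 256x256 image, no VAE, 16x16-pixel patches, a single frame.
vit_tokens = tokens_per_second(256, 256, fps=1, spatial_down=1, temporal_down=1, patch=16)
print(vit_tokens)  # 256

# Purely hypothetical settings for 1920x1080 video; NOT known Sora values.
hypothetical_tokens = tokens_per_second(
    1920, 1080, fps=24, spatial_down=8, temporal_down=4, patch=2
)
print(hypothetical_tokens)
```

Changing any one of the assumed factors shifts the result by a large multiple, which is why the answer depends so heavily on details that have not been published.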
This question is managed and resolved by Manifold.
Related questions
Was synthetic video data generated and used in training Sora?
30% chance
How many seconds will Sora take to generate 10 seconds of video?
87% chance
Will Sora be able to generate a 5 second video in less than 2 minutes when it first releases publicly?
90% chance
Will Sora (video model) be able to generate decent video of Sora (Kingdom Hearts character)?
81% chance
Does Sora use DPO?
50% chance