How many tokens does Sora use to encode one second of high-resolution video (1920*1080)? (February version) | Manifold

Basic · 2 · Ṁ41 · 2028 · 5,342 expected

Resolves when the number becomes publicly known.

If Sora does not use tokens, this resolves N/A. That seems highly unlikely, since OpenAI has repeatedly stated that Sora uses Diffusion Transformers.

Only the latent diffusion model counts. If Transformers are also used for the VAE compression, that part is ignored.

For reference:

The original ViT splits a picture into 16×16-pixel patches, so a 256×256-pixel picture becomes 16 by 16 = 256 tokens. This architecture does not use a VAE to compress the image into a latent space.
Gemini 1.5 Pro uses 300 tokens per second of video.
LLaVA-UHD uses up to 5k tokens for 4k resolution images.
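
To make the arithmetic behind these reference points concrete, here is a rough sketch of how a tokens-per-second figure could be computed for a latent diffusion transformer. Everything about Sora in it is an assumption: the function names, the frame rate, the VAE downsampling factors and the patch size are hypothetical placeholders, not published values. Only the ViT check (256×256 pixels, 16-pixel patches) comes from the list above.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division (partial patches count as one full patch)."""
    return -(-a // b)


def tokens_per_image(width: int, height: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tokens covering the image."""
    return ceil_div(width, patch) * ceil_div(height, patch)


def tokens_per_second(width: int, height: int, fps: int,
                      spatial_down: int, temporal_down: int,
                      patch: int, patch_t: int = 1) -> int:
    """Tokens for one second of video after a (hypothetical) VAE compression.

    spatial_down / temporal_down: assumed VAE downsampling factors.
    patch / patch_t: assumed spatial and temporal patch sizes in latent space.
    """
    latent_w = ceil_div(width, spatial_down)
    latent_h = ceil_div(height, spatial_down)
    latent_frames = ceil_div(fps, temporal_down)
    spatial_tokens = tokens_per_image(latent_w, latent_h, patch)
    return spatial_tokens * ceil_div(latent_frames, patch_t)


# ViT reference point from the list: 256x256 pixels, 16-pixel patches, no VAE.
print(tokens_per_image(256, 256, 16))  # 256

# Purely illustrative guess for 1920x1080: 24 fps, 8x spatial and 4x temporal
# VAE compression, 2x2 latent patches. None of these values are confirmed.
print(tokens_per_second(1920, 1080, fps=24, spatial_down=8,
                        temporal_down=4, patch=2))  # 48960
```

The second number is not a prediction. It only shows how strongly the answer depends on the assumed compression factors and patch size: doubling the patch size alone would cut the count by roughly a factor of four.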

This question is managed and resolved by Manifold.

#AI Video Generation

Related questions

Was synthetic video data generated and used in training Sora?

23% chance (-5% 1d)

Does Sora use DPO?

How many seconds will Sora take to generate 10 seconds of video?
