Storage devices with neural networks pre-loaded sold in major electronics retailers by end of 2035
💎 Premium · 17 · Ṁ6482 · 2035 · 30% chance

Will major electronics retailers (Best Buy, Amazon, etc. in the US; retailers in other countries also count) sell consumer storage devices (most likely SSDs) with LLM (or other neural network) files pre-loaded? That would provide a large boost for fast inference on users' devices and would resemble games/music/other content packaged on CDs. As of now, all AI models are accessed through cloud services or require separate downloads.

Resolution based on product listings and announcements from major retailers; please comment on this market if you see any.

  • Update 2025-05-01 (PST) (AI summary of creator comment):

    • Only storage devices with ≥2.5 GB of pre-loaded models qualify (any less is too small for users to do anything useful).

    • Must be specifically storage devices (several in a package is allowed) sold on their own, not as integral components of smartphones or similar devices.

The resolution criteria say "other countries also count". Not gonna lie, if I walk around Shenzhen, China long enough, I will find pre-loaded LLMs in some electronics retail chains. Not sure how major they are, though.

There are a couple of pain points that a storage company would have to be incentivized to address for this to happen. Many of these would require "breaking through" several different layers of abstraction for computer manufacturers. I could see tightly integrated systems (phone manufacturers, special-purpose device makers) doing this, but a hard drive company?

Here’s some reasons why I think this is unlikely:

  • Inconvenience of downloading LLMs isn’t likely to be significant: disk drives aren’t shipping with preinstalled files, even standard ones like operating systems. Consoles could save consumers a ton of time if their hard drives shipped with games, for example, so why would we expect parts suppliers to do this for LLMs?

  • LLMs aren’t standardized enough for every user to prefer the same 100GB model: why would it be cost effective to distribute a standard open-weight model? There are thousands to choose from, and that’s likely to become worse as training gets domain-specific. Right now, LLMs change so rapidly that preloaded drives sitting in store inventory would be hopelessly outdated by the time they’re sold.

  • The assumption that tighter SSD<->GPU communication can lead to better inference speed: This isn't a requirement for the market, but it was one of the things that started this market in the first place. Circa 2024, the major problem preventing models from running locally on consumer GPU devices is lack of GPU RAM. SSDs could perhaps help by acting as a slower backing store, but the bandwidth is probably too small to be useful. What I think is far more likely is for RAM to get larger, and for GPUs to use DMA to access system RAM in faster / higher-bandwidth ways than is currently possible.

    • Example: suppose I want to run a 64 GB language model at 20 tokens per second. If the entire thing is streamed from an SSD backing store for each token, the requirement would be something like 10.2 terabits per second of SSD-to-GPU bandwidth. We're currently a factor of about 500 away for SSDs (rough math below).
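
A quick sanity check of that estimate (the ~2.5 GB/s consumer-SSD figure is my own assumption, chosen to roughly match the "factor of about 500" claim):

```python
# Back-of-envelope check: bandwidth needed to stream all weights per token.
model_bytes = 64e9                 # 64 GB of weights
tokens_per_second = 20

required_bytes_per_s = model_bytes * tokens_per_second        # 1.28 TB/s
required_tbit_per_s = required_bytes_per_s * 8 / 1e12         # ~10.2 Tbit/s

assumed_ssd_bytes_per_s = 2.5e9    # assumption: ~2.5 GB/s consumer SSD
shortfall = required_bytes_per_s / assumed_ssd_bytes_per_s    # ~500x

print(f"needed: {required_tbit_per_s:.1f} Tbit/s, shortfall: ~{shortfall:.0f}x")
```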

Feel free to bet against me! The idea is interesting even though I don’t think it’s where the industry is headed.

This market only applies to devices primarily intended for storage containing a gallery of LLMs for users to use, is that right? Android phones with a small on-board Gemini and laptops with on-board Copilot would cause this to resolve NO, right?

@KimberlyWilberLIgt yes, primarily intended for storage, as per my previous comment ("and it must be specifically storage device..."): containing an LLM, an image-generating model (e.g. Stable Diffusion), something else recognizable as a neural network (audio-generating, perhaps?), or a collection of those.

And you're right: Android phones with a small on-board Gemini, iPhones with a small on-board LLM and laptops with on-board Copilot wouldn't cause YES resolution.

Why would an external SSD with a neural network on it lead to faster inference? The neural network weights are loaded into RAM before inference is done, so the inference speed depends on RAM speed, not SSD speed.

@ahalekelly if a complementary part of this innovation is also done, namely connecting the SSD directly to the GPU and laying out the weights sequentially so they are accessible through ordinary memory reads, that would remove the need to load weights into RAM first (and, presumably, reduce the RAM footprint).

Idea reflected at https://manifold.markets/Kearm20/will-i-be-able-to-run-deepseekv3-10#1e7my2kbsyx, but it came to me even earlier - at some point in 2024.

bought Ṁ3,000 NO

@AnT There are several ways for PC GPUs to directly access SSDs already. Windows has the DirectStorage API, Nvidia has the GPUDirect Storage API, and PCIe 5.0 brings Compute Express Link (CXL). These help reduce latency and CPU load, but they don't increase bandwidth.

But none of these help with the fact that SSD speeds are in the single-digit GB/s range, which is much slower than RAM: the iPhone 16's RAM is about 60 GB/s and a 4090 GPU's is about 1000 GB/s. And an external SSD wouldn't be any faster than an internal one, possibly slower, because USB4 currently maxes out at 40 Gbps, which is only 5 GB/s, and many computers only have 10 Gbps or 20 Gbps USB. SSD speeds and USB speeds will increase, but I think they will continue to lag RAM speeds for a while.
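
To make that concrete, here's a rough sketch of how long it takes just to move a model's weights once over each of those links (the 70 GB model size is an arbitrary assumption on my part):

```python
# Time to move ~70 GB of weights once over each link mentioned above.
model_gb = 70  # assumed model size (e.g. a 70B model at ~8 bits/weight)

links_gb_per_s = {
    "internal NVMe SSD (PCIe 4.0)": 7.0,
    "USB4 (40 Gbps)": 5.0,
    "USB 20 Gbps": 2.5,
    "USB 10 Gbps": 1.25,
    "iPhone 16 RAM": 60.0,
    "RTX 4090 VRAM": 1000.0,
}

for name, bw in links_gb_per_s.items():
    print(f"{name:>30}: {model_gb / bw:7.2f} s")
```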

I think the reasons media used to be distributed on CDs/DVDs were:
1) Optical storage is cheaper per GB than hard drive or SSD storage - doesn't apply
2) Movies and video games were large relative to hard drive size, so users couldn't store many games or movies on their hard drive - doesn't apply
3) Movies and video games were large relative to internet speeds, so downloading them would have taken a long time - could apply to LLMs
4) Movies and video games didn't constantly have new versions coming out, so it wasn't a problem if the disc sat on a shelf for a few months before selling - doesn't apply

So I think the most likely scenario for why this might happen is if computer hardware improves much faster than internet speeds. Say you can run a 1 TB LLM on your laptop but you only have 100 Mbps internet, so it would take roughly a day to download the weights; you might buy a hard drive with the latest model on it and then copy it to the laptop's SSD.
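
The download-time estimate spelled out (using the 1 TB and 100 Mbps figures above; protocol overhead ignored):

```python
# 1 TB of weights over a 100 Mbps connection, ignoring overhead.
model_bits = 1e12 * 8          # 1 TB
link_bits_per_s = 100e6        # 100 Mbps

hours = model_bits / link_bits_per_s / 3600
print(f"~{hours:.0f} hours")   # ~22 hours, i.e. roughly a day
```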

Also, anyone can list an item for sale on Amazon; it's quite easy and doesn't cost anything, so someone could do it just to resolve this market YES.

I think you should only count items Sold By Amazon and not items listed on Amazon and sold by third-party sellers.

@ahalekelly

2) Movies and video games were large relative to hard drive size, so users couldn't store many games or movies on their hard drive - doesn't apply

I think it does still apply. DeepSeek v3 in bf16 format is 1.3 TB, which is quite large (I doubt many users have drives larger than 2 TB).

bought Ṁ50 YES

@ahalekelly it is quite possible to make a GPU with its own SSD slot and (if needed) an internal PCI Express bus, and use a sane amount of (volatile) VRAM, say 24-32 GB, instead of trying to scale it up.

Do you think reading model weights on the fly would be the bottleneck, rather than all the matrix multiplications?

The inference performance you would get with a given memory bandwidth is easy to ballpark: the memory traffic is around 2x the model size per token generated. So if you have a 1 TB model and are running it off a cutting-edge SSD that can do 10 GB/s, that's >200 s per token. Ouch. This is why you need enough RAM to hold the entire model at once; streaming the model from the SSD is far too slow, at least for the foreseeable future.

On 80GB A100s you’d need at least 13 of them to fit the 1TB model in RAM, and at 2 TB/s each, that’s 13 tokens per second for $260k. Nobody is going to be running 1TB models at home anytime soon, and if you have 1 TB of RAM then the cost of a few TB of SSD for nonvolatile storage is a rounding error.
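
Both of those estimates fall out of the same one-line formula, using the ~2x-model-size-per-token rule of thumb from above (which is itself only an approximation):

```python
# Rule of thumb: ~2x the model size of memory traffic per generated token.
# Real numbers depend on precision, KV cache, batching, etc.
def seconds_per_token(model_bytes, bandwidth_bytes_per_s, traffic_factor=2.0):
    return traffic_factor * model_bytes / bandwidth_bytes_per_s

model = 1e12  # 1 TB of weights

# Streaming from a 10 GB/s SSD:
print(seconds_per_token(model, 10e9))            # 200.0 s per token

# Sharded across 13x A100 (80 GB, ~2 TB/s each), ~26 TB/s aggregate:
print(1 / seconds_per_token(model, 13 * 2e12))   # ~13 tokens per second
```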

A single GPU with 32 GB of VRAM can only run 32 GB models; sure, you could put an SSD on it, but what's the point?

Almost all hardware is memory-bandwidth limited when it's only running one chat at a time, but you can batch multiple chats on the same model at once: the compute needed (TFLOPS) scales approximately linearly with the number of chats, while the memory bandwidth needed (GB/s) barely increases, since they all share the same weights. That's why it's far more efficient to run a cluster that serves many users' chats in parallel than for each person to have their own compute.
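
A minimal roofline-style sketch of that batching argument, with illustrative roughly-A100-class numbers of my own (KV-cache traffic ignored):

```python
# Decode throughput vs. batch size when all sequences share one weight read
# per step. Illustrative numbers only; KV-cache traffic ignored.
bandwidth = 2e12            # bytes/s of HBM bandwidth
flops = 312e12              # dense bf16 FLOP/s
params = 70e9               # assumed 70B-parameter model
bytes_per_param = 2         # fp16/bf16 weights
flops_per_token = 2 * params

def tokens_per_second(batch):
    memory_bound = batch * bandwidth / (params * bytes_per_param)
    compute_bound = flops / flops_per_token      # total, shared by the batch
    return min(memory_bound, compute_bound)

for batch in (1, 8, 64, 256):
    print(batch, round(tokens_per_second(batch)))
# Grows ~linearly with batch size until compute becomes the limit (~batch 156).
```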

@ahalekelly ouch, indeed. And how much memory does an activation tensor take (actually, how many chats could I handle on one 32 GB GPU)?

It raises an interesting question of whether users have many tasks at once on which LLMs could be invoked, or whether a few invocations with different random seeds would help the model get to a correct solution.

To be clear, only storage devices with ≥2.5 GB of pre-loaded models qualify (any less is too small for users to do anything useful),

and it must be specifically a storage device (several in a package is allowed) that is sold, not an integral component of a smartphone or the like.
