[Ethereum] Is the size of the Ethereum blockchain a problem for full nodes in the coming months?

blockchain, go-ethereum

In August 2017 my chaindata directory was about 12GB. Now, in November, it has grown to 220GB – in line with expectations, given Ethereum's growing popularity.

I run my geth node on Windows 10 with this command line:

geth.exe --syncmode fast --cache 1024 --datadir D:\Ethereum\GethBlockchain

I had been running my node on a 2TB platter drive, but around late October geth simply couldn't keep up with new blocks: it was processing them more slowly than they were being produced. At first I assumed the bottleneck was my internet connection or something else.
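
(For anyone wanting to confirm this kind of stall: assuming the default IPC endpoint, you can attach a second console to the running node – eth.syncing reports currentBlock vs highestBlock during an active sync, and comparing eth.blockNumber against a public block explorer shows how far behind the node is:)

geth.exe attach
> eth.syncing
> eth.blockNumber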

However, Task Manager showed that the 2TB disk's random IO was pegged at a constant 100% usage. I swapped it out for a 500GB SSD (just SATA3, not NVMe/U.2/M.2), random IO dropped to about 20%, and geth was able to process blocks fast enough again. It seems that Ethereum (or at least geth) really needs an SSD's fast random IO to process new blocks (why? don't Merkle trees mean there's no need to query older blocks?).
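
My rough understanding of the "why" (happy to be corrected): processing a new block doesn't mean re-reading old blocks, but it does mean reading and updating the current state trie – account balances, nonces, contract storage – and geth stores each trie node in LevelDB keyed by the hash of its contents. Walking from the state root down to one account is therefore a series of point reads at effectively random keys. A toy Go sketch of that access pattern (not geth's real storage code; sha256 stands in for Keccak-256, and the toy "trie" is just a linked chain of hash-keyed nodes):

package main

// Toy sketch of why state lookups are random IO: each "trie node" lives in
// LevelDB under the hash of its contents, so resolving one value means a
// series of point reads at keys with no locality. (sha256 stands in for
// Keccak-256; real trie nodes branch rather than forming a simple chain.)

import (
    "crypto/sha256"
    "fmt"
    "log"

    "github.com/syndtr/goleveldb/leveldb"
)

func main() {
    db, err := leveldb.OpenFile("sketchdb", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Build a 6-hop chain: each node's value is the hash-key of the next,
    // mimicking how a branch node references its children by hash.
    keys := make([][32]byte, 7)
    for i := range keys {
        keys[i] = sha256.Sum256([]byte{byte(i)})
    }
    for i := 0; i < 6; i++ {
        if err := db.Put(keys[i][:], keys[i+1][:], nil); err != nil {
            log.Fatal(err)
        }
    }

    // "Resolve" the leaf: six point reads at unrelated 32-byte keys. On a
    // platter drive every cache miss is a head seek; on an SSD it's a cheap
    // random read – hence the difference observed above.
    cur := keys[0][:]
    for i := 0; i < 6; i++ {
        next, err := db.Get(cur, nil)
        if err != nil {
            log.Fatal(err)
        }
        cur = next
    }
    fmt.Printf("walked 6 hash-keyed hops, leaf key = %x\n", cur)
}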

Given that the blockchain is growing fast enough that it will probably hit 1TB within a few months (and likely 2TB and beyond before mid-2018), this is a problem: it is approaching the capacity limits of SSDs affordable by normal people. Beyond that you start needing special (read: expensive) setups – either super-expensive high-capacity SSDs or a massive array of them (presumably concatenation/JBOD; RAID isn't strictly necessary) just to store the blockchain. I don't like the idea of spending $200 every few months on another 500GB SSD – eventually I'll run out of SATA ports.
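
As a back-of-envelope check on the projection above (simple linear extrapolation from the two figures I gave – it undershoots if growth keeps accelerating, which is the scenario I'm worried about):

package main

import "fmt"

// Linear projection from the figures in the question: ~12GB in August 2017,
// ~220GB in November 2017 (roughly 3 months apart). Constant-rate only;
// the "few months to 1TB" worry assumes the acceleration continues.
func main() {
    perMonth := (220.0 - 12.0) / 3.0           // ~69 GB/month at the recent pace
    monthsTo1TB := (1000.0 - 220.0) / perMonth // ~11 months at a constant rate
    fmt.Printf("~%.0f GB/month; ~%.0f months to 1TB at a constant rate\n",
        perMonth, monthsTo1TB)
}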

If this holds true, even without the growth continuing to accelerate, eventually the Ethereum blockchain will be impossible to store on a geth full node running on commodity hardware – undermining its democratization and putting control only in the hands of those who can afford the hardware needed to run a full node.

I read that geth with --syncmode fast will prune the blockchain of unneeded extra data (e.g. intermediate state for smart contracts) during its initial sync, but that once the sync is finished, newly processed blocks retain that extra data, and it is not currently possible to prune them.

This has been raised in the Geth/Mist repos on GitHub, and the advice seems to be to delete the entire blockchain and run another --syncmode fast – but I'm finding that this becomes less and less feasible as time goes on. The inability to prune/clean up the blockchain store is described as a temporary inconvenience, so I hope that means pruning support will be added soon.
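
For reference, the delete-and-resync cycle currently looks something like this (geth's built-in removedb subcommand wipes the chain data while leaving the keystore alone – though do double-check before pointing it at your data directory):

geth.exe --datadir D:\Ethereum\GethBlockchain removedb
geth.exe --syncmode fast --cache 1024 --datadir D:\Ethereum\GethBlockchain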

  1. Why does geth perform so much random IO?
  2. Is there any way to compress the chaindata directory contents at all to minimize disk usage?
  3. What will happen when commodity hardware cannot be used to run a full-node? Do the Ethereum Project organizers have a plan for this eventuality?

Best Answer

Regarding 3, I don't think this will happen for a long time. There are other options for running a full node: I'd suggest looking at Parity – running a full node with that client's "warp" feature only takes 20-30 GB.
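
If you want to try it, a minimal invocation is something like the line below (in recent Parity releases warp sync should be on by default, so no extra flag should be needed; --base-path is Parity's rough equivalent of geth's --datadir, pointed at a hypothetical directory here):

parity --base-path D:\Ethereum\Parity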

For more detail as to why, I suggest reading this blog post:

The Ethereum Blockchain Size Will Not Exceed 1 TB Anytime Soon
