geth and parity use different methods to save the Ethereum blockchain in their internal formats. I ran several benchmarks because I found syncing far too slow just to use a wallet.
The pruning mode determines how the state data is saved. In archive mode, all states are saved, so you know the state at any moment without replaying the whole blockchain. With fast and light, we assume we don't need all that information, only the current state and a few states before it, so many intermediate states are removed.
On geth, --fast saves the state of the blockchain at block B[-1500] and all the states after that block (B[-1] is the last block, B[-2] the one before it, and so on). So it is possible to rewind to the state of any of the last 1500 blocks. With a full (archive) blockchain, you can do it for all blocks.
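To make the rewind window concrete, here is a minimal Go sketch of the idea (the names recentRoots and keepStates are mine, not geth's, and the loop is only a stand-in for importing blocks):

package main

import "fmt"

// keepStates is the rewind window: roughly how many recent block states
// a pruning node keeps (on the order of the last ~1500 blocks for --fast;
// the exact value here is illustrative).
const keepStates = 1500

func main() {
    // recentRoots maps block number -> state root for the blocks whose
    // full state we still keep.
    recentRoots := map[uint64]string{}

    for number := uint64(1); number <= 5000; number++ {
        recentRoots[number] = fmt.Sprintf("root-of-block-%d", number)

        // Prune: once a state falls out of the window, drop it.
        // An archive node would skip this step and keep everything.
        if number > keepStates {
            delete(recentRoots, number-keepStates)
        }
    }

    fmt.Printf("states kept: %d (blocks %d..%d)\n", len(recentRoots), 5000-keepStates+1, 5000)
}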
In Parity, there are four pruning modes (journalDB algorithms); a toy sketch of the overlay vs. differential idea follows the list:
- Archive (Archive): as with geth's archive mode, all states are kept
- Fast (OverlayRecent): as with geth's fast mode, the full states of the last B[-i] blocks are kept
- Light (EarlyMerge): the states of the last B[-i] blocks are kept, but as differentials (so it is smaller than fast, with slower access)
- Basic (RefCounted): the states of the last B[-i] blocks are kept as with OverlayRecent, but states are removed after x changes, so only the x most recent changes are available
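Here is a toy Go sketch, with invented names, of the difference between keeping full recent states (OverlayRecent) and keeping only per-block diffs (EarlyMerge). It is only an illustration of the storage trade-off, not Parity's actual journalDB code:

package main

import "fmt"

// A toy "state": account -> balance.
type state map[string]uint64

// OverlayRecent-style: keep a full copy of the state for each recent block.
// Fast to read back, but each entry stores the whole state.
type fullJournal []state

// EarlyMerge-style: keep only the per-block changes (diffs).
// Smaller on disk, but reading back means replaying diffs.
type diffJournal []map[string]uint64

func main() {
    genesis := state{"alice": 100, "bob": 50}

    full := fullJournal{}
    diffs := diffJournal{}

    // Block 1: alice pays bob 10.
    after1 := state{"alice": 90, "bob": 60}
    full = append(full, after1)
    diffs = append(diffs, map[string]uint64{"alice": 90, "bob": 60})

    // Reading the state at block 1:
    fmt.Println("overlay read:", full[len(full)-1]["bob"]) // direct lookup

    replayed := state{}
    for k, v := range genesis {
        replayed[k] = v
    }
    for _, d := range diffs { // replay every diff on top of genesis
        for k, v := range d {
            replayed[k] = v
        }
    }
    fmt.Println("diff read   :", replayed["bob"])
}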
Benchmarks done on an i7 3720QM with 16 GB RAM, Geth 1.4.4 (Homestead, 1.6M blocks):

| Option | Disk Used | Time | Disk Written |
|--------|-----------|------|--------------|
| none   | 19 GB     | 5h00 | 1 TB         |
| fast   | 3.7 GB    | 1h00 | 100 GB       |
Benchmarks done on an i7 3720QM with 16 GB RAM, Geth 1.5.0 unstable (Homestead, 1.6M blocks; see https://gitter.im/ethereum/go-ethereum?at=574d26c010f0fed86f49b32f):

| Command     | Disk Used | Time | Disk Written |
|-------------|-----------|------|--------------|
| geth        | 21 GB     | 5h00 | 150 GB       |
| geth --fast | 4.2 GB    | 21m  | 35 GB        |
| geth export | 1.5 GB    | 10m  |              |
| geth import | 21 GB     | 3h30 |              |
Benchmarks done on an i7 3720QM with 16 GB RAM, Parity 1.2 (Homestead, 1.6M blocks):

| Option  | Disk Used | Time | Disk Written |
|---------|-----------|------|--------------|
| archive | 19 GB     | 2h00 | 300 GB       |
| fast    | 3.7 GB    | 1h30 | 20 GB        |
| light   | 2.5 GB    | 2h00 | 130 GB       |
Note: when you have a node with a synced blockchain, you can copy the chaindata from the geth directory and use it on your other computers. I checked this on Linux, Windows and OS X.
Note: using --cache with 1024 could make it faster, but the difference was not significant on my system. The same goes for --jitvm.
Note: the Ethereum blockchain saves the final state after the transactions, but it is safer to replay the transactions to check them.
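As an illustration of what that check amounts to (the block type and the applyTxs helper are invented, not real geth code): a full node re-executes the block's transactions and compares the state root it computes with the one claimed in the header:

package main

import "fmt"

// block carries the state root claimed by the miner and the transactions
// to replay; both fields are simplified stand-ins for the real structures.
type block struct {
    stateRoot string
    txs       []string
}

// applyTxs is a stand-in for real EVM execution; here it just derives a
// fake root string from the transactions.
func applyTxs(parentRoot string, txs []string) string {
    root := parentRoot
    for _, tx := range txs {
        root = fmt.Sprintf("h(%s+%s)", root, tx)
    }
    return root
}

func main() {
    parentRoot := "root-0"
    b := block{
        stateRoot: "h(h(root-0+tx1)+tx2)",
        txs:       []string{"tx1", "tx2"},
    }

    computed := applyTxs(parentRoot, b.txs)
    if computed != b.stateRoot {
        fmt.Println("state root mismatch: block rejected")
        return
    }
    fmt.Println("transactions replayed, state root verified")
}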
Summary
I downloaded the geth source, modified the code to specify the fast sync pivot block, compiled it, removed the old chaindata and started the fast sync. Once this is complete, I'll go back to running the regular geth binaries.
UPDATE: This experiment failed. A few different errors in my hack prevented the blockchain from fast syncing to the specified block and then syncing normally after it. Back to a full archive node sync.
Does anyone have any suggestions?
Details
I downloaded the source code for geth and modified the section that calculates the fast sync pivot point, eth/downloader/downloader.go, lines 419-441:
case FastSync:
    // Calculate the new fast/slow sync pivot point
    if d.fsPivotLock == nil {
        pivotOffset, err := rand.Int(rand.Reader, big.NewInt(int64(fsPivotInterval)))
        if err != nil {
            panic(fmt.Sprintf("Failed to access crypto random source: %v", err))
        }
        if height > uint64(fsMinFullBlocks)+pivotOffset.Uint64() {
            pivot = height - uint64(fsMinFullBlocks) - pivotOffset.Uint64()
        }
    } else {
        // Pivot point locked in, use this and do not pick a new one!
        pivot = d.fsPivotLock.Number.Uint64()
    }
    // If the point is below the origin, move origin back to ensure state download
    if pivot < origin {
        if pivot > 0 {
            origin = pivot - 1
        } else {
            origin = 0
        }
    }
    glog.V(logger.Debug).Infof("Fast syncing until pivot block #%d", pivot)
I modified the last line above to change the Debug into Info, and added the if statement and the extra log line shown below:
glog.V(logger.Info).Infof("Fast syncing until pivot block #%d", pivot)
if (pivot >= 2394190) {
pivot = 2394190;
}
glog.V(logger.Info).Infof("Fast syncing until modified pivot block #%d", pivot)
I recompiled and started off the fast sync process using the modified binaries:
Iota:go-ethereum user$ make geth
...
Done building.
Run "build/bin/geth" to launch geth.
I checked the version of the modified geth:
Iota:go-ethereum user$ build/bin/geth version
Geth
Version: 1.5.3-unstable
I removed the old damaged chaindata:
Iota:go-ethereum user$ build/bin/geth removedb
/Users/bok/Library/Ethereum/chaindata
Remove this database? [y/N] y
Removing...
Removed in 35.242291ms
I started the fast sync:
Iota:go-ethereum user$ build/bin/geth --fast --cache=1024 console
I1120 23:44:44.870142 ethdb/database.go:83] Allotted 1024MB cache and 1024 file handles to /Users/user/Library/Ethereum/geth/chaindata
I1120 23:44:44.878926 ethdb/database.go:176] closed db:/Users/user/Library/Ethereum/geth/chaindata
...
I1121 08:33:51.340811 eth/downloader/downloader.go:441] Fast syncing until pivot block #2664150
I1121 08:33:51.340847 eth/downloader/downloader.go:445] Fast syncing until modified pivot block #2394190
After the fast syncing is complete, I'll go back to using the regular geth binaries.
Best Answer
Let's take it one step at a time.
Blockchains generally work by having an origin (genesis) state with a few accounts holding funds, and then every block that you place on top of the chain moves those origin funds around, also granting a bit extra to the miner. So whenever you import a new block into your existing chain, you take a look at what your view (state) of the world is, and transform that state according to the transactions contained in the block, arriving at a new view of what you believe the world looks like. You don't discard your past view of the world, because if there is a fork in the blockchain (e.g. a miner turns up with a better block, or maybe two better blocks), then you need to transform your view from that past state into the better version. This leads to all the past states you transitioned through being accumulated for eternity. This is an unpruned state/blockchain, and it currently stands at about 7GB for Ethereum.
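A rough Go sketch of that accumulation, with invented names and a toy state, showing why older views are kept around so a fork can be re-applied from an earlier state:

package main

import "fmt"

// A toy "state": account -> balance.
type state map[string]uint64

// applyBlock is a stand-in for executing a block's transactions on top of
// a parent state and returning the resulting view of the world.
func applyBlock(parent state, delta map[string]uint64) state {
    next := state{}
    for k, v := range parent {
        next[k] = v
    }
    for k, v := range delta {
        next[k] = v
    }
    return next
}

func main() {
    // Every state we ever transitioned through, kept so that a fork at an
    // older block can be re-executed from that point (archive behaviour).
    history := []state{{"alice": 100, "bob": 50}} // genesis view

    // Import two blocks on the canonical chain.
    history = append(history, applyBlock(history[0], map[string]uint64{"alice": 90, "bob": 60}))
    history = append(history, applyBlock(history[1], map[string]uint64{"bob": 70}))

    // A fork arrives that builds on block 1 instead of block 2: we still
    // have the state at block 1, so we can re-apply from there.
    forkState := applyBlock(history[1], map[string]uint64{"alice": 80})
    fmt.Println("canonical tip:", history[2], "fork tip:", forkState)
}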
The important thing to notice is that most of the time you don't care how much an account had 3 years ago; you only care about what the state currently is (maybe a few days ago too). So why keep all that extremely old transition state around? State pruning is essentially taking all that intermediate state and flushing it down the toilet. The important thing to realize is that you only throw away the intermediate world views, never the blocks themselves or any other data whose loss might be unhealthy for the network (i.e. data a joining node needs to sync). Thus by pruning your state trie you lose the ability to query the past balance of accounts, but at the benefit of reducing the amount of stored data to about 1/5-1/6 of its original size.
Ok, so what about fast sync? Well, following the previous thought pattern, if you don't care about the balance of a random account from 3 years ago, why would you want to replay the entire transaction history of the blockchain just to get to the current state? So what fast sync does is download the whole blockchain, but without executing the transactions to generate the world view one block at a time. Instead it only verifies the proof-of-works, and when the entire chain is downloaded it looks at the state root (the hash defining the current world view) and downloads the state database directly from the network, reconstructing the final state without needing the transient states along the way. This means that besides downloading the blocks, it needs to download additional data, the state trie itself, so it is exchanging bandwidth for processing power (i.e. I download the state, I don't generate it). The end result of fast sync is, for all intents and purposes, a pruned database, just arrived at by different means. The current size of such a database is 1.2-1.3GB.
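A very schematic Go sketch of that flow, with made-up helpers standing in for the real downloader: check the proof-of-work of the downloaded headers, then fetch the state trie for the head's state root instead of executing every transaction since genesis:

package main

import "fmt"

type header struct {
    number    uint64
    stateRoot string
}

// verifyPoW and fetchStateTrie are placeholders for the real work: fast
// sync checks the proof-of-work of the headers it downloads, then pulls
// the state trie for the head's state root from the network rather than
// re-executing every transaction.
func verifyPoW(h header) bool { return true }

func fetchStateTrie(root string) map[string]string {
    return map[string]string{"trie-root": root}
}

func main() {
    headers := []header{
        {number: 1, stateRoot: "root-1"},
        {number: 2, stateRoot: "root-2"},
        // ... up to the chain head
    }

    for _, h := range headers {
        if !verifyPoW(h) {
            fmt.Println("bad proof-of-work at block", h.number)
            return
        }
    }

    head := headers[len(headers)-1]
    state := fetchStateTrie(head.stateRoot) // bandwidth instead of CPU
    fmt.Printf("fast sync done: %d headers verified, state at %q downloaded (%d nodes)\n",
        len(headers), head.stateRoot, len(state))
}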