[Ethereum] Index and query blockchain data and levelDB

leveldbmysqlquery

Would it be more performant to query the blockchain data directly through the underlying database levelDB instead of using the provided geth api? How would it be possible? Can I just index the transactions by sender / receiver in mysql or elasticsearch? I saw vulcanizeDB doing this

Best Answer

Geth does three operations on the database that you should consider when trying to find alternatives.

  1. Geth loads the entire database in levelDB (what I believe you want to do).

  2. Geth then deciphers all (or required) key-value pairs using RLP.

  3. Geth then takes these key-value pairs (which are nodes on a merle patricia trie) and creates merkle patricia tries` to understand the state of the blockchain.

LevelDB was not a challenge to get working. You can just grab any repo implementing Google's levelDB in your preferred language.

If you were going to tackle this problem, you would have to accomplish all three of these tasks to access the data on the blockchain. I have tried creating this using my own programming knowledge, but fail to accurately decode complex RLP using java (I'd recommend using python) since RLP structures could be varying in complexity, so a fixed array of one type created in java is out of the question. There are a few packages out there on GitHub that tries to decode RLP to some degree.

Forming the merkle patricia tries is challenging as well. Preferably you would identify the root of the tree. Each node (unless a leaf node) will contain references to another location in the database. The roots of each tree are contained in the block header, and the block header's location start with a special sequence of characters (if you search all values in a small levelDB for ethereum, you will see these special cases). Then the traversal of the tree is each account's public address.

I don't know what efficiency boosts you would get from creating your own solution, this greatly depends on how true you are to the yellow paper and which language you use. If you have a special case you want to use this database for (IE searching for smart contracts, or inputs) and will implement only certain aspects of this case into your program, then I am sure you will see an increase in performance with your non-complete Geth implementation.

Warning: this is not exactly how Ethereum works, but it gives you an idea. Think of the data on the blockchain as strings of information. You need all the characters to produce any string of information on the blockchain. Ethereum would then store each character in a HUGE database. If you decode any character you will see that each one points to another, if you follow the path you can decode one string. The issue is you want to know where "dog" is stored in the blockchain, but that is actually several locations. If you know where it starts (a block header) then you can construct the information by using steps 2 and 3.

I have tried to understand these concepts myself, so if I am off, or need to clarify anything please ask! I wish you luck in pursuing this if I haven't deterred you with my lengthy response!

Related Topic