Symphonious

Living in a state of accord.

Exploring Ethereum 2: The Curious Case of the Invisible Fork

It’s like something out of a Sherlock novel – the doors are all locked, windows barred and yet somehow the jewels have been stolen. Or in this case, every block created forms a single chain and yet somehow the validator client has wound up on the wrong fork. Fortunately Sherlock, or rather Cem Ozer, was on the case. This is the story behind those very annoying “Produced invalid attestation” errors that Teku would sometimes generate.

The blocks involved look a lot like:

Slot         28    29    30    31    32
Block Root   0x01  0x02  0x03  0x04  0x05
Parent Root  0x00  0x01  0x02  0x03  0x04

There are no other blocks floating around. This looks like a chain working perfectly with no forks. However, timing matters in ETH2 and there’s an invisible fork hidden in there. Slot 32 is the start of a new epoch, so the validator needs to calculate the duties it should perform, but in this case the block for slot 32 arrived late, so the validator calculated its duties based on:

Slot         28    29    30    31    32
Block Root   0x01  0x02  0x03  0x04  <empty>
Parent Root  0x00  0x01  0x02  0x03

While in ETH1 that would just mean you’re behind head, in ETH2 whether a block exists or not effectively creates a fork. That missing block contributes to the randao value and so all the committee shuffling and duty scheduling based off the randao changes depending on whether the slot is empty or not.
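As a rough illustration of why an empty slot acts like a fork, here is a toy model of the randao mix. It assumes each block XORs the hash of its reveal into a running mix and that an empty slot contributes nothing; the reveal values, the `shuffle` helper and the eight-validator set are made up for the example and are not the real shuffling algorithm.

```python
import hashlib


def mix_in_reveal(mix: bytes, reveal: bytes) -> bytes:
    """XOR the hash of a block's randao reveal into the running mix."""
    digest = hashlib.sha256(reveal).digest()
    return bytes(a ^ b for a, b in zip(mix, digest))


def randao_after(slots, start_mix=bytes(32)):
    """Fold every slot into the mix; an empty slot (None) contributes nothing."""
    mix = start_mix
    for reveal in slots:
        if reveal is not None:
            mix = mix_in_reveal(mix, reveal)
    return mix


def shuffle(validators, seed: bytes):
    """Toy shuffle (hypothetical): order validators by a hash of seed and index."""
    return sorted(validators, key=lambda v: hashlib.sha256(seed + bytes([v])).digest())


# Slots 28..32, once with the block at slot 32 and once with that slot empty.
with_block = randao_after([b"r28", b"r29", b"r30", b"r31", b"r32"])
without_block = randao_after([b"r28", b"r29", b"r30", b"r31", None])

validators = list(range(8))
duties_with_block = shuffle(validators, with_block)
duties_without_block = shuffle(validators, without_block)
```

The two mixes differ, so anything seeded from them (committees, duties) will generally differ too, even though every block that exists sits on a single chain.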

So when the validator calculates its duties based on slot 32 being empty, it gets a different set of duties than it would if the block had already arrived. The net result is that attestation signatures appear invalid, because the validator an attestation comes from is inferred from the shuffling rather than stated explicitly in the attestation.

Later when the block for slot 32 does arrive, the beacon chain considers it as just an extension of the current fork, so doesn’t tell the validator client to recalculate duties. An epoch later when those scheduled duties actually happen, they’re still scheduled as if that slot was empty and so the signatures appear invalid.

Cem’s fix is elementary (as most things are once you understand the problem) – the beacon chain node needs to fire a reorg event when a previously empty slot is filled, which causes the validator client to recalculate its duties.

So case closed? Not quite… We’d actually already thought of this potential problem and the validator client was already listening for block imported events. When a block was imported, any duties two or more epochs after that block’s epoch were invalidated (at the start of an epoch, you can safely calculate duties for that epoch and the one after, so blocks in epoch 3 only affect duties for epoch 5 and later). That should have caught this case – why was the problem still happening?
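The invalidation rule described above can be sketched in a few lines. This is only an illustration of the rule as stated, not Teku’s code: the function names are hypothetical, and `SLOTS_PER_EPOCH = 32` is assumed because slot 32 starts a new epoch in the example.

```python
SLOTS_PER_EPOCH = 32  # assumed: slot 32 is the first slot of a new epoch


def epoch_of(slot: int) -> int:
    return slot // SLOTS_PER_EPOCH


def first_affected_epoch(block_slot: int) -> int:
    """A block in epoch N only affects duties for epoch N + 2 and later."""
    return epoch_of(block_slot) + 2


def invalidate_duties(duties_by_epoch: dict, block_slot: int) -> dict:
    """Keep only the duties that the newly imported block cannot change."""
    cutoff = first_affected_epoch(block_slot)
    return {epoch: duties for epoch, duties in duties_by_epoch.items() if epoch < cutoff}


# A block imported at slot 100 (epoch 3) leaves epochs 3 and 4 alone.
remaining = invalidate_duties({3: "a", 4: "b", 5: "c"}, block_slot=100)
```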

It turns out that while the duties were correctly invalidated when the block was imported, block import isn’t what updates the best block – running the fork choice algorithm does. The validator client wound up recalculating the duties before fork choice had run to consider the new block, and so recalculated them based on the slot still being empty. With Cem’s fix in place we’ll be able to remove this first attempt at a fix and only invalidate on reorg events.
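The ordering problem is easier to see in a toy model. The classes below are hypothetical stand-ins, not Teku’s actual types: importing a block notifies listeners without moving the head, and only `run_fork_choice` updates the head and fires the reorg event. For simplicity this sketch fires the reorg event on any head change, whereas the fix described above fires it when a previously empty slot is filled.

```python
class BeaconNode:
    def __init__(self, head_root):
        self.head_root = head_root
        self.imported_not_yet_considered = []
        self.block_imported_listeners = []
        self.reorg_listeners = []

    def import_block(self, block_root):
        # Importing stores the block and notifies listeners,
        # but does NOT move the head.
        self.imported_not_yet_considered.append(block_root)
        for listener in self.block_imported_listeners:
            listener(self)

    def run_fork_choice(self):
        # Toy fork choice: the most recently imported block becomes the head.
        if not self.imported_not_yet_considered:
            return
        new_head = self.imported_not_yet_considered[-1]
        self.imported_not_yet_considered.clear()
        if new_head != self.head_root:
            self.head_root = new_head
            for listener in self.reorg_listeners:
                listener(self)


class ValidatorClient:
    def __init__(self):
        self.duties_based_on = None

    def recalculate_duties(self, node):
        # Record which head the duties were calculated from.
        self.duties_based_on = node.head_root


node = BeaconNode(head_root="0x04")
first_attempt = ValidatorClient()  # recalculates on block import
with_fix = ValidatorClient()       # recalculates on reorg events
node.block_imported_listeners.append(first_attempt.recalculate_duties)
node.reorg_listeners.append(with_fix.recalculate_duties)

node.import_block("0x05")  # the late block for slot 32 arrives
node.run_fork_choice()
```

The client listening for block imports recalculates against the stale head `0x04`, while the one listening for reorg events sees `0x05`.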

ArchUnit

Stumbled across ArchUnit today which looks useful. In particular I think there’s power in being able to assert that certain packages really never depend on each other, although Gradle/Maven modules would probably be a better level to assert at. It’s depressingly common for code bases to be split into separate modules with the intention that they be a clear separation of concerns, only for a web of dependencies to be added because someone wanted to reuse some class and didn’t refactor to a common module.

Of course, it may not be that difficult to write module level tests in Gradle – the dependency structure is already there and easy to work with…
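To give a feel for what a module-level check might look like, here is a small sketch in plain Python. It is not ArchUnit’s API or a Gradle feature; the module names and the dependency map are invented, and it simply assumes the build’s dependency structure has been exported as a dictionary.

```python
def reachable(module, deps):
    """All modules that `module` depends on, directly or transitively."""
    seen, stack = set(), [module]
    while stack:
        for dep in deps.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def forbidden_dependencies(deps, rules):
    """rules: (module, must_not_depend_on) pairs; returns the violated ones."""
    return [(m, banned) for m, banned in rules if banned in reachable(m, deps)]


# Hypothetical module graph: validator reaches beacon only via api.
deps = {
    "validator": ["api", "util"],
    "api": ["beacon"],
    "beacon": ["storage", "util"],
    "storage": ["util"],
    "util": [],
}
rules = [("validator", "beacon"), ("storage", "validator")]
violations = forbidden_dependencies(deps, rules)
```

Checking transitive reachability rather than direct edges is a design choice here: it catches the case where a forbidden dependency sneaks in through an intermediate module.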

Exploring Ethereum 2: Weak Subjectivity Period

Occasionally the term “weak subjectivity period” pops up in Eth2 discussions. It’s a weird concept that you can usually just watch fly by and not miss too much. But when you’re talking about how to sync an existing Eth2 chain it becomes quite important. Probably the best resource for it is Vitalik’s post: Proof of Stake: How I Learned to Love Weak Subjectivity. I’ve struggled to get my head around it and why it matters, so am writing up my current understanding. There is almost certainly at least one mistake in here somewhere…

So what is the weak subjectivity period? It’s the longest a client can be offline and still, when it comes back online, reliably process blocks to reach the consensus chain head. For proof of work you can always do this, but not for proof of stake. To see why not, let’s look at an example.

Say we have an Eth2 network chugging away. Once 2/3 of its validators have attested to a particular epoch it’s considered finalised and no re-orgs can change it. In order to finalise two conflicting epochs you’d need at least 1/3 of validators to sign conflicting attestations, but doing so is a slashable offence so there’s a very strong economic incentive not to do that. That incentive is essentially what crypto-economics is all about: whether you’re talking PoW or PoS, it’s not that it’s mathematically impossible to break the chain, but that it costs more money than anyone is willing or able to spend.
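The 1/3 figure follows from simple counting. As a simplified sketch, assume N validators of equal weight, and let A and B be the sets of validators that attested to the two conflicting epochs:

```latex
\[
|A| \ge \tfrac{2}{3}N, \quad |B| \ge \tfrac{2}{3}N, \quad |A \cup B| \le N
\;\Longrightarrow\;
|A \cap B| = |A| + |B| - |A \cup B| \ge \tfrac{4}{3}N - N = \tfrac{1}{3}N
\]
```

So at least a third of all validators must appear in both sets, and those are the ones who signed conflicting attestations.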

At this point it sounds like you should be able to just process blocks reliably, confirm the attestations and signatures all line up and it all works out. What’s the catch?

The catch is that validators can withdraw their staked funds and stop being a validator. There are limits on how fast those withdrawals can happen, but once the money is out the economic incentive to not misbehave is gone. Critically, despite the validator having withdrawn the money, they still have their private key and can sign things – with no staked funds anymore they can’t be slashed for it. Nodes fully synced to the chain know they are no longer a validator and reject those signatures, but nodes further behind don’t yet have that information and see the signature as valid.

So, if 1/3 of validators have withdrawn their stake and your node is far enough back on the chain to have not seen any of those withdrawals, then 1/3 of the validators you currently think are valid have no incentive to be honest. They can sign any blocks or attestations with complete impunity and potentially form a chain which conflicts with the finalised state but is otherwise entirely valid. They can feed you those blocks to lead you down the wrong chain.

However, if your node was far enough along the chain to have seen one or more of those validators exit, you’d reject their attestations, leaving less than 1/3 of the validators as dishonest and allowing you to reliably reach the real chain head.
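A toy model makes the difference between the two nodes concrete. Everything here is an assumption for illustration: nine equally weighted validators, made-up exit epochs, and a node that knows about exactly the exits at or before the epoch it has synced to.

```python
from fractions import Fraction


def is_active(validator, exit_epochs, known_up_to_epoch):
    """A node only knows about exits at or before the epoch it has synced to."""
    exit_epoch = exit_epochs.get(validator)
    return exit_epoch is None or exit_epoch > known_up_to_epoch


def unaccountable_fraction(validators, exit_epochs, node_epoch, chain_head_epoch):
    """Share of validators the node still trusts that have exited by the real head."""
    trusted = [v for v in validators if is_active(v, exit_epochs, node_epoch)]
    exited = [v for v in trusted if not is_active(v, exit_epochs, chain_head_epoch)]
    return Fraction(len(exited), len(trusted))


validators = list(range(9))
exit_epochs = {0: 10, 1: 20, 2: 30}  # hypothetical exits; the rest never exit

# Synced only to epoch 5: none of the exits are known yet.
stale = unaccountable_fraction(validators, exit_epochs, node_epoch=5, chain_head_epoch=40)
# Synced to epoch 10: the first exit has been seen and that validator is rejected.
fresher = unaccountable_fraction(validators, exit_epochs, node_epoch=10, chain_head_epoch=40)
```

The stale node still trusts a full third of validators that can no longer be slashed, while the node that has seen just one exit is already below the 1/3 threshold.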

So the weak subjectivity period is essentially how far behind your node can be before 1/3 of validators can have exited without you knowing about it. Once you fall behind more than that, you need to confirm the chain you want to sync to out of band.
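As a back-of-the-envelope version of that definition, assume a fixed cap on how many validators can exit per epoch and equal weight per validator. Real clients use a more involved calculation, so treat the function name and the numbers below as purely illustrative.

```python
def epochs_until_third_can_exit(validator_count: int, max_exits_per_epoch: int) -> int:
    """Smallest number of epochs after which at least 1/3 of validators may have exited."""
    # ceil(validator_count / (3 * max_exits_per_epoch)) using integer arithmetic
    return -(-validator_count // (3 * max_exits_per_epoch))


# Hypothetical network: 300 validators, at most 4 exits per epoch.
period = epochs_until_third_can_exit(300, 4)
```

With these made-up numbers, a node more than 25 epochs behind could be facing a third of its trusted validators having exited.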

Exploring Ethereum: What happens to transactions in non-canonical blocks?

In Ethereum, there are often short forks near the head of the chain because two miners found the PoW solution for new blocks at around the same time. Only one of those chains will ultimately wind up being considered the canonical chain and any blocks from non-canonical chains wind up having no effect on the state. Non-canonical blocks may wind up being added as ommers in future blocks, but everything in this article applies regardless of whether that happens. 

If your transaction winds up in a non-canonical block, is it doomed for all time? Fortunately no. Most likely your transaction will still wind up being added to a block on the canonical chain even though it’s already wound up in a block on a discarded fork.

Firstly, your transaction might have been selected for both the non-canonical block and the block that wound up on the canonical chain. Those two blocks are part of different forks so the transaction may be valid on both. In that case, whichever chain gets picked your transaction is included. However, depending on which other transactions were picked and the effect they had on world state, your transaction may have done completely different things on each fork.

Secondly, it’s possible that a node saw the non-canonical block first and treated it as part of the canonical chain (because it was the best block known to it at the time). In that case, the node will have removed your transaction from the pending transaction pool because it was already processed.  When the block that actually wound up on the canonical chain turns up, the node will perform a chain re-org to switch to this new block and relegate the original block to a fork. If your transaction isn’t in the canonical block, good clients will re-add it to the pending transaction pool so it can be added to a future block.

If your transaction went into a block that was always seen as non-canonical, then it wouldn’t have been removed from the transaction pool in the first place and continues to wait for a chance to get into a canonical block.
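The pool handling during a re-org can be sketched as a set operation. This is a simplified model, not any particular client’s implementation: blocks are just lists of transaction ids, and a real client would also re-validate each transaction (nonce, balance and so on) before putting it back.

```python
def apply_reorg(pending_pool: set, removed_blocks: list, added_blocks: list) -> set:
    """Return the pending pool after switching from removed_blocks to added_blocks."""
    now_canonical = {tx for block in added_blocks for tx in block}
    dropped = {tx for block in removed_blocks for tx in block}
    # Transactions only in the discarded blocks go back to pending;
    # transactions in the new canonical blocks are no longer pending.
    return (pending_pool | (dropped - now_canonical)) - now_canonical


# tx_a was only in the discarded block, tx_b is in both, tx_c was still pending.
pool = apply_reorg(
    pending_pool={"tx_c"},
    removed_blocks=[["tx_a", "tx_b"]],
    added_blocks=[["tx_b", "tx_c"]],
)
```

Here `tx_a` returns to the pool to wait for a future block, while `tx_b` and `tx_c` are on the canonical chain and no longer pending.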

Bottom line, getting included in a non-canonical block does your transaction no harm. It will generally hang around until it makes it onto the canonical chain.

Exploring Ethereum: Ommers vs Non-Canonical Blocks

One subtle detail in the way Ethereum works is that there is a difference between Ommers and Non-Canonical Blocks. It’s common for people to use the term Ommer for both of these and most of the time the difference doesn’t matter but sometimes it does.

So what is a non-canonical block? Non-canonical blocks are ones which a client imports but which don’t wind up on the canonical chain. Maybe they were on the canonical chain for a while and then a re-org switched to a different chain or maybe they were imported to a fork and spent their entire life languishing there. Either way, they don’t form part of the current consensus chain and have absolutely no effect on the world state. It’s like their transactions have never been executed and no one gets any form of miner reward for them. Non-canonical blocks must be entirely valid blocks that could form part of the canonical chain, but weren’t because we found a better chain.

Apart from not getting any mining rewards, that probably sounds a lot like an Ommer, and it is. But actually an Ommer is just a block header which has been included in the Ommer list of another block. The block does not have to be completely valid – it just has to have a valid proof of work solution and be from a fork that started within the last 6 blocks of the block that included it. The transactions are not included so there’s no verification that they were in any way valid. The miner of the Ommer gets a small reward when it is included as an Ommer and the miner of the block that includes it also gets a reward for doing so.

A block can be both an Ommer and a non-canonical block, and in practice usually is. However, a block that would be invalid to import, whether as a canonical or a non-canonical block, can still be included as an Ommer.
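The looser rules for Ommers can be sketched as a header-only check. This is a simplification under stated assumptions: `pow_valid` is a stand-in for a real proof-of-work verification, and “a fork that started within the last 6 blocks” is approximated by comparing block numbers. Note that nothing here looks at transactions.

```python
from dataclasses import dataclass

MAX_OMMER_DEPTH = 6  # "within the last 6 blocks", as described above


@dataclass
class Header:
    number: int
    pow_valid: bool  # stand-in for a real proof-of-work check


def can_include_as_ommer(ommer: Header, including_block_number: int) -> bool:
    """Header-only check: valid PoW and recent enough; transactions are never examined."""
    depth = including_block_number - ommer.number
    return ommer.pow_valid and 1 <= depth <= MAX_OMMER_DEPTH


recent = can_include_as_ommer(Header(number=95, pow_valid=True), 100)
too_old = can_include_as_ommer(Header(number=93, pow_valid=True), 100)
bad_pow = can_include_as_ommer(Header(number=99, pow_valid=False), 100)
```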