Last week I put together a video explaining attestation inclusion and delays in ETH2. This is one of the keys to maximising your validator rewards and the health of the network.
We run a bunch of hosts on EC2 and while most just have the default DNS name, they all have a Name tag to identify their purpose. While there’s lots of automation set up via Ansible, it is often still necessary to SSH directly into a particular box, especially while debugging issues.
I find it annoying to have to log into the AWS console just to find the DNS name of the particular box I want, so I’ve written a little script that searches for hosts based on the Name tag and can then SSH into them.
It requires having the aws utility set up and able to log in, plus having [...] awssh <name> where <name> is a regex that matches any Name tag. Typically a substring match is simplest. So to SSH to the Medalla testnet bootnode we run, I’d just use [...]
If more than one node matches, it provides a list of the matching nodes, so to select one of the Medalla nodes I’d run awssh medalla then select from the resulting list.
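The lookup-then-SSH flow described above can be sketched in Python. This is a rough illustration, not the actual awssh script: it assumes the aws CLI is configured, shells out to aws ec2 describe-instances, and filters the JSON it returns on the Name tag. The function names, the numbered prompt and the choice of PublicDnsName as the SSH target are all assumptions made for the example.

```python
import json
import os
import re
import subprocess


def name_tag(instance):
    """Return the value of the instance's Name tag, or '' if it has none."""
    for tag in instance.get("Tags", []):
        if tag.get("Key") == "Name":
            return tag.get("Value", "")
    return ""


def find_hosts(describe_output, pattern):
    """Return sorted (name, dns) pairs whose Name tag matches the regex."""
    regex = re.compile(pattern)
    hosts = []
    for reservation in describe_output.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            name = name_tag(instance)
            dns = instance.get("PublicDnsName", "")
            # Instances without a public DNS name (for example stopped
            # ones) cannot be reached this way, so skip them.
            if dns and regex.search(name):
                hosts.append((name, dns))
    return sorted(hosts)


def choose(hosts, read=input):
    """Return the only match, or ask the user to pick one from a list."""
    if len(hosts) == 1:
        return hosts[0]
    for index, (name, dns) in enumerate(hosts, start=1):
        print(f"{index}) {name} ({dns})")
    return hosts[int(read("Host number: ")) - 1]


def awssh(pattern):
    """Look up hosts by Name tag and replace this process with ssh."""
    raw = subprocess.run(
        ["aws", "ec2", "describe-instances", "--output", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    hosts = find_hosts(json.loads(raw), pattern)
    if not hosts:
        raise SystemExit(f"No reachable host has a Name tag matching {pattern!r}")
    _, dns = choose(hosts)
    os.execvp("ssh", ["ssh", dns])
```

With this sketch, calling awssh("medalla") would print a numbered list when several Name tags match and connect straight away when only one does.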
It’s like something out of a Sherlock novel – the doors are all locked, windows barred and yet somehow the jewels have been stolen. Or in this case, every block created forms a single chain and yet somehow the validator client has wound up on the wrong fork. Fortunately Sherlock, or rather Cem Ozer, was on the case. This is the story behind those very annoying “Produced invalid attestation” errors that Teku would sometimes generate.
The blocks involved look a lot like:
There are no other blocks floating around. This looks like a chain working perfectly with no forks. However, timing matters in ETH2 and there’s an invisible fork hidden in there. Slot 32 is the start of a new epoch so the validator needs to calculate the duties it should perform, but in this case the block for slot 32 arrived late, so the validator calculated its duties based on:
While in ETH1 that would just mean you’re behind head, in ETH2 whether a block exists or not effectively creates a fork. That missing block contributes to the randao value and so all the committee shuffling and duty scheduling based off the randao changes depending on whether the slot is empty or not.
So when the validator calculates its duties based on slot 32 being empty, it gets a different set of duties than it would if the block had already arrived. The net result is that attestation signatures appear invalid, because the validator an attestation comes from is not stated explicitly in the attestation; it is worked out from the shuffling.
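A toy model makes the invisible fork concrete. The sketch below is not Teku’s code and not the real shuffling algorithm: it only borrows the idea that each block’s randao reveal is hashed and XORed into the randao mix, and it uses Python’s random module as a stand-in for committee shuffling. The starting mix, the reveal value, the validator count and the function names are all made up for illustration.

```python
import hashlib
import random


def mix_in_block(randao_mix, randao_reveal):
    """XOR the hash of a block's randao reveal into the current mix."""
    digest = hashlib.sha256(randao_reveal).digest()
    return bytes(a ^ b for a, b in zip(randao_mix, digest))


def toy_duties(randao_mix, validator_count=64):
    """Stand-in for committee shuffling: a validator order seeded by the mix."""
    order = list(range(validator_count))
    random.Random(int.from_bytes(randao_mix, "big")).shuffle(order)
    return order


mix_before_slot_32 = hashlib.sha256(b"mix at the end of slot 31").digest()

# View 1: the block for slot 32 is late, so the slot looks empty
# and the mix is left unchanged.
mix_if_empty = mix_before_slot_32

# View 2: the block has arrived and its reveal is mixed in.
mix_if_filled = mix_in_block(mix_before_slot_32, b"hypothetical reveal for slot 32")

print("first duties if slot 32 is empty: ", toy_duties(mix_if_empty)[:8])
print("first duties if slot 32 is filled:", toy_duties(mix_if_filled)[:8])
```

Both views sit on what looks like the same chain, yet the seed differs, so the two validator orders will almost certainly differ as well.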
Later when the block for slot 32 does arrive, the beacon chain considers it as just an extension of the current fork, so doesn’t tell the validator client to recalculate duties. An epoch later when those scheduled duties actually happen, they’re still scheduled as if that slot was empty and so the signatures appear invalid.
Cem’s fix is elementary (as most things are once you understand the problem) – the beacon chain node needs to fire a reorg event when a previously empty slot is filled, which causes the validator client to recalculate its duties.
So case closed? Not quite… We’d actually already thought of this potential problem and the validator client was already listening for block imported events. When a block was imported, any duties for two or more epochs after that block’s epoch were invalidated (at the start of an epoch, you can safely calculate duties for that epoch and the one after, so the blocks in epoch 3 only affect duties for epoch 5 and later). That should have caught this case, so why was the problem still happening?
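The invalidation rule in that parenthetical can be written down in a few lines. This is only a sketch of the rule as described here, with a hypothetical cache keyed by epoch; it is not how Teku stores duties.

```python
def invalidate_duties(duties_by_epoch, block_epoch):
    """Drop cached duties that a block imported in block_epoch could change."""
    # Duties for the block's own epoch and the next one are already safe,
    # so only epochs two or more ahead need to be recalculated.
    first_affected = block_epoch + 2
    return {
        epoch: duties
        for epoch, duties in duties_by_epoch.items()
        if epoch < first_affected
    }


# Hypothetical cache contents; the values are placeholders.
cached = {3: "duties 3", 4: "duties 4", 5: "duties 5", 6: "duties 6"}
remaining = invalidate_duties(cached, block_epoch=3)
print(sorted(remaining))  # [3, 4]
```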
It turns out that while the duties were correctly invalidated when the block was imported, block import isn’t what updates the best block: running the fork choice algorithm does. The validator client wound up recalculating the duties before fork choice had been run to consider the new block, and thus recalculated them based on the slot still being empty. With Cem’s fix in place we’ll be able to remove this first attempt at a fix and only invalidate on re-org events.
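The ordering problem is easier to see in a toy model. The class below is a deliberately simplified sketch, not Teku’s event system: the class, listener lists and slot bookkeeping are hypothetical, and it assumes every imported block ends up on the chosen chain, as in the single-chain story above.

```python
class ToyBeaconNode:
    """Tracks imported blocks separately from the head chosen by fork choice."""

    def __init__(self):
        self.imported_slots = set()  # blocks the node has imported
        self.head_slots = set()      # blocks on the chain fork choice selected
        self.block_imported_listeners = []
        self.reorg_listeners = []

    def import_block(self, slot):
        self.imported_slots.add(slot)
        # Fired immediately: fork choice has not considered the block yet.
        for listener in self.block_imported_listeners:
            listener(self)

    def run_fork_choice(self, current_slot):
        newly_adopted = self.imported_slots - self.head_slots
        self.head_slots = set(self.imported_slots)
        # A block for an earlier slot fills what looked like an empty slot,
        # so treat it as a reorg.
        if any(slot < current_slot for slot in newly_adopted):
            for listener in self.reorg_listeners:
                listener(self)


def slot_32_looks_filled(node):
    """What a duty calculation based on the current head would see."""
    return 32 in node.head_slots


node = ToyBeaconNode()
seen_on_import = []
seen_on_reorg = []
node.block_imported_listeners.append(
    lambda n: seen_on_import.append(slot_32_looks_filled(n)))
node.reorg_listeners.append(
    lambda n: seen_on_reorg.append(slot_32_looks_filled(n)))

node.import_block(32)                  # the late block arrives during slot 33
node.run_fork_choice(current_slot=33)  # only now does the head include it

print(seen_on_import)  # [False]
print(seen_on_reorg)   # [True]
```

A listener on the import event still sees slot 32 as empty, while a listener on the reorg event sees it filled, which is why recalculating duties on the reorg event gives the right answer.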
Stumbled across ArchUnit today, which looks useful. In particular I think there’s power in being able to assert that certain packages really never depend on each other, although Gradle/Maven modules would probably be a better level to assert at. It’s depressingly common for code bases to be split into separate modules with the intention that they provide a clear separation of concerns, only for a web of dependencies to be added because someone wanted to reuse some class and didn’t refactor it into a common module.
Of course, it may not be that difficult to write module level tests in Gradle – the dependency structure is already there and easy to work with…
Occasionally the term “weak subjectivity period” pops up in Eth2 discussions. It’s a weird concept that you can usually just watch fly by and not miss too much. But when you’re talking about how to sync an existing Eth2 chain it becomes quite important. Probably the best resource for it is Vitalik’s post: Proof of Stake: How I Learned to Love Weak Subjectivity. I’ve struggled to get my head around it and why it matters, so I am writing up my current understanding. There is almost certainly at least one mistake in here somewhere…
So what is the weak subjectivity period? It’s the period that a client can be offline for and when it comes back online be able to completely reliably process blocks to get to the consensus chain head. For proof of work you can always do this, but not for proof of stake. To see why not, let’s look at an example.
Say we have an Eth2 network chugging away with a set of validators. Once 2/3 of those validators have attested to a particular epoch it’s considered finalised and no re-orgs can change it. In order to finalise two conflicting epochs you’d need at least 1/3 of validators to sign conflicting attestations, but doing so is a slashable offence so there’s a very strong economic incentive not to. That incentive is essentially what crypto-economics is all about: whether you’re talking PoW or PoS, it’s not that it’s mathematically impossible to break the chain, but that it costs more money than anyone is willing or able to spend.
At this point it sounds like you should be able to just process blocks reliably, confirm the attestations and signatures all line up and it all works out. What’s the catch?
The catch is that validators can withdraw their staked funds and stop being a validator. There are limits on how fast those withdrawals can happen, but once the money is out the economic incentive to not misbehave is gone. Critically, despite having withdrawn the money, the validator still has their private key and can sign things, and with no staked funds anymore they can’t be slashed for it. Nodes fully synced to the chain know they are no longer a validator and reject those signatures, but nodes further behind don’t yet have that information and see the signatures as valid.
So, if 1/3 of validators have withdrawn their stake and your node is far enough back on the chain to have not seen any of those withdrawals, then 1/3 of the validators you currently think are valid have no incentive to be honest. They can sign any blocks or attestations with complete impunity and potentially form a chain which conflicts with the finalised state but is otherwise entirely valid, then feed you those blocks to lead you down the wrong chain.
However, if your node was far enough along the chain to have seen one or more of those validators exit, you’d reject their attestations, leaving less than 1/3 of the validators dishonest and allowing you to reliably reach the real chain head.
So the weak subjectivity period is essentially how far behind your node can be before 1/3 of validators can have exited without you knowing about it. Once you fall behind more than that, you need to confirm the chain you want to sync to out of band.
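To get a feel for the scale, here is a back-of-the-envelope sketch that follows the definition above literally: how many epochs it takes for 1/3 of validators to exit. It assumes exits are capped per epoch at max(4, validator_count // 65536), which is my understanding of the phase 0 mainnet churn limit, and that an epoch is 32 slots of 12 seconds. It ignores balances, new deposits, the churn limit changing as validators leave, and the safety margins used in real weak subjectivity calculations, so treat the numbers as illustrative only and not as the period a client should use.

```python
SECONDS_PER_EPOCH = 32 * 12     # 32 slots of 12 seconds each
MIN_PER_EPOCH_CHURN_LIMIT = 4   # assumed phase 0 mainnet value
CHURN_LIMIT_QUOTIENT = 65536    # assumed phase 0 mainnet value


def churn_limit(validator_count):
    """Maximum number of validators assumed able to exit in one epoch."""
    return max(MIN_PER_EPOCH_CHURN_LIMIT, validator_count // CHURN_LIMIT_QUOTIENT)


def epochs_until_third_exited(validator_count):
    """Epochs needed for 1/3 of validators to exit at the initial churn limit."""
    third = -(-validator_count // 3)  # ceiling division
    return -(-third // churn_limit(validator_count))


def days(epochs):
    """Convert a number of epochs to days."""
    return epochs * SECONDS_PER_EPOCH / 86400


for count in (16384, 300000, 1000000):
    epochs = epochs_until_third_exited(count)
    print(f"{count} validators: {epochs} epochs, about {days(epochs):.0f} days")
```

The useful takeaway is the shape rather than the exact figures: with a fixed minimum churn the period grows with the size of the validator set, and once the churn limit starts scaling with the validator count it stops growing.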