Author: mxs — Sébastien Rannou @ 2023-10-20
https://docs.google.com/document/d/1RwchcUsquJYxR_Dk2PAoEY3RNJtWrkpT3f4b_YO-it8/edit
This document describes the infrastructure we put in place at Kiln to run validation at scale with Web3Signer for our Holesky validators (~100k). One of the goals was to assess whether we could run validators from multiple geographical locations while keeping the guarantees Web3Signer provides. We initially opted for a naive approach, and this document covers the pitfalls we hit and the improvements we ended up with. A big thanks to the Consensys team, who helped us by providing custom flags for Web3Signer that let us tweak its threading model.
The overall architecture looks as follows:
Each validator client loads a subset of the validation keys, is connected to a beacon/exec pair, and gets signatures from a fleet of Web3Signer instances connected to an anti-slashing database. This setup is quite common among Web3Signer users. What differs from classic setups is the scale at which it runs and the fact that the validator, beacon, and exec nodes run in different geographical locations.
TL;DR: You likely want to run the Web3Signer instances as close to the database as possible.
In the above architecture, the Web3Signer instances can be positioned:
We initially assumed the placement would not affect signing latency much, since the overall end-to-end distance is the same, but this is not the case.
Whenever a validator client reaches out to Web3Signer, it sends a single HTTP request, so there is only one round trip on that leg. When Web3Signer receives the signature request, on the other hand, it opens a transaction on the database which first locks the validator to check, then sends multiple sub-queries from within that transaction (this happens in the transaction handler). Each of these is a sequential round trip, so increasing the latency between Web3Signer and the database has a ~5x impact on the overall signing latency.
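The effect of placement can be sketched with a back-of-the-envelope model. This is only an illustration: the count of 5 database round trips is inferred from the "~5x" figure above, and the round-trip times (1 ms for the local hop, 50 ms for the remote hop) are made-up values, not measurements from our setup. The function name is hypothetical.

```python
def signing_latency_ms(vc_to_signer_rtt_ms: float,
                       signer_to_db_rtt_ms: float,
                       db_round_trips: int = 5) -> float:
    """Rough model: one HTTP round trip from the validator client to the
    signer, plus N sequential round trips from the signer to the database.
    The default N=5 is an assumption derived from the ~5x figure."""
    return vc_to_signer_rtt_ms + db_round_trips * signer_to_db_rtt_ms


# Same end-to-end distance in both cases (one 1 ms hop, one 50 ms hop);
# only the position of the signer changes. Values are illustrative.
signer_near_db = signing_latency_ms(vc_to_signer_rtt_ms=50.0,
                                    signer_to_db_rtt_ms=1.0)
signer_near_validator = signing_latency_ms(vc_to_signer_rtt_ms=1.0,
                                           signer_to_db_rtt_ms=50.0)

print(f"signer near database:  {signer_near_db:.0f} ms")         # 55 ms
print(f"signer near validator: {signer_near_validator:.0f} ms")  # 251 ms
```

Under these assumptions the long hop is paid once when the signer sits next to the database, and five times when it sits next to the validator client.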
Higher latency between the Web3Signer instances and the database results in longer queue delays, and eventually timeouts, if the threading model of Web3Signer is not tuned (next section).
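One way to reason about why queues build up is Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. The sketch below is a rough estimate under stated assumptions. The request rate and per-request durations are hypothetical, and the model ignores bursts within a slot and contention on per-validator locks.

```python
import math


def required_concurrency(requests_per_second: float,
                         service_time_ms: float) -> int:
    """Little's law: in-flight requests = arrival rate x time per request.
    Returns the minimum number of concurrent workers needed to keep up
    without a growing queue."""
    return math.ceil(requests_per_second * service_time_ms / 1000.0)


# Hypothetical load of 1000 signing requests per second.
# 5 ms   ~ five 1 ms round trips to a nearby database.
# 250 ms ~ five 50 ms round trips to a distant database.
near_db = required_concurrency(requests_per_second=1000, service_time_ms=5)
far_db = required_concurrency(requests_per_second=1000, service_time_ms=250)

print(f"workers needed, database nearby:  {near_db}")   # 5
print(f"workers needed, database distant: {far_db}")    # 250
```

If the number of threads or database connections available is below this figure, requests wait in a queue, and the wait grows until clients time out.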