Operation Eat the Cluster: A Proposal

dogfooding

#1

We currently run a cluster to supplement the Status network because not enough people run nodes themselves. This is a pain point for us: it is against our principles, it costs extra money and time to support our network, and our depending on it goes against the whole idea of what we’re trying to do.

@Bruno has spent quite a bit of time creating guides and shipping at-cost machines for running a Status node yourself. I had planned on doing this but have put it off because it’s an additional expense that isn’t high on my spending priorities (US taxes are rough this year; don’t buy a house using crypto). My wife came up with this idea and I ran with it; she is smarter than me.

We are clearly trying to find ways of making the network more resilient and less dependent on the few of us who monitor it. Chaos Unicorn Day is a testament to this, as is our ability to work around problems quickly.

We currently pay ~1,380 USD/month (eth.beta) for hosting this cluster (confirmed by @jakubgs). We also have eth.staging and eth.test, but those we should keep.

I propose the following:

  • Status subsidizes the purchase of the required hardware for core contributors, so that they can all run their own mailserver and status-node.
  • Core contributors then build and run Status nodes as part of their employment (when feasible of course). It is important that contributors go through the steps to do it, and not rely on others to do it for them. We are not building things that emphasize off-loading responsibility.
  • This currently requires some technical ability; we can get around that by doing it in tranches, with each tranche building tutorials and processes that make it easier for the next. This would make the next Chaos Unicorn Day a breeze. Simple tranches are as follows:
    • technical crew
    • semi-technical crew
    • non-technical crew

Benefits of this:

  • We remove the Status network’s dependence on our cluster; if the cluster fails, the network stands. In fact, we potentially boost its resiliency drastically by increasing the total number of nodes being run, a number that grows with the company.
  • The people who work for Status have the requisite knowledge to help those around them bolster the network. In my opinion, anyone who works full time should at least understand how to run a node. If they can’t, then we’re failing.
  • We reduce the financial burn of running the entire cluster. Of course we should run at least a few nodes (bootnode, statistics gathering, etc).
  • We want people to run nodes. Why would they when we don’t? This dogfoods the concept, helps us build what needs to be built to make running a node an easy job, and keeps us behaving in the manner we ask others to follow.

Downsides:

  • Cost: at ~66 people and 360 EUR each (Bruno’s no-OS build), we’re sitting at ~24,000 EUR (being rough). This is still potentially less than the cloud services. A few folks in the company may simply not have a good way to keep a node running sustainably. And we’d be severely downgrading our cloud-hosted cluster, though that could end up being a positive in the long run.
  • Time: getting everyone to acquire the hardware, go through the steps, troubleshoot, write guides, upgrade, etc. takes time away from building the final release. I’d argue: who cares if we have an app in the main store if it doesn’t have a resilient network to support it? If we get a massive on-ramp of people who want to run their own node and we can’t onboard them quickly to the right way of doing it, then we’ve failed.
  • These things will change: with the incentivization work being done to get people to run their own nodes, this tech may change in the process, which would require everyone to upgrade. That could be drastically more annoying than throwing it on @jakubgs’s shoulders (lol).
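For a rough sanity check of the cost figures above, the back-of-envelope math works out as follows (figures are the thread’s own estimates; EUR/USD conversion is ignored, so treat this as an order-of-magnitude sketch, not a budget):

```python
# Back-of-envelope comparison of one-time hardware cost vs. recurring
# cluster cost, using the figures quoted in this thread.
contributors = 66         # rough headcount
unit_cost_eur = 360       # Bruno's no-OS build, per person
cluster_usd_month = 1380  # eth.beta hosting (per @jakubgs)

hardware_total = contributors * unit_cost_eur
print(hardware_total)  # 23760, i.e. the "~24,000" above

# Ignoring currency conversion, the one-time hardware spend equals
# roughly this many months of eth.beta hosting:
months_to_break_even = hardware_total / cluster_usd_month
print(round(months_to_break_even, 1))  # 17.2
```

So even before counting the resiliency benefits, the hardware pays for itself against the eth.beta bill in under a year and a half.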

I understand this won’t be a walk in the park, but the effort feels necessary for us to deliver what we idealize in our principles. I’d appreciate your criticism where I’ve forgotten something or glossed over why this won’t work, or your support in getting it done.


#2

I have it on my TODO list to write up a similar proposal about getting core contributors to run their own nodes, so kudos for beating me to it! I obviously agree with this proposal.

In terms of baby steps, we still need a cluster for things like CI. I’d like to see being a NODE OPERATOR become a badge of honor; perhaps this is something we can encourage through badges/flair or recognition through TH etc.? A lot of this is less technical and more of a cultural shift.

To start with, it can be enough to gather interest from a dozen or so people who want to trial this officially and also become ‘official’ bootnodes, ideally a mix of the different tranches that you mention. If we want people to run their own nodes, we have to do it ourselves first.

I’d love to see more thoughts and interests from semi-/non-technical functions like PeopleOps (cc @ceri @j12b @rajanie), Product (@rachel), UXR (@hester), Community (@cryptowanderer (?)), Marketing (@jonathan).

  1. Does the enthusiasm and proposal for this make sense?
  2. Would you be interested in running your own node?
  3. Would you or your team be interested in making this more accessible/amplifying this message?
  4. From Status’s POV this is important for us to be true to our principles and to be sustainable/decentralized/censorship-resistant etc. That said, how does this fit into your team’s roadmap?

#3
  1. Yes, I see it as part of accessibility.
  2. Yes, it’s been on my wishlist as a weekend project, but so far I’ve perceived it as too big a time, cost, and effort project to even try.
  3. I can document my non-tech process as a start for a guide. That said, the process needs to be uber simple. Some sort of batch file would be helpful.
  4. I totally agree on the importance, but don’t see it as a high priority for Design. Usability of the core app still comes first, and that work is already overflowing since the team has gone from 9 to 3 people :)

#4

This would be a great candidate for a Gitcoin Kudos badge program too. We run those for Nimbus: https://gitcoin.co/kudos/1161/nimbus_contributor

Anyway, excellent proposal @petty; a couple of notes about pricing. The cost can easily be cut by 50% if paying with DAI or (better!) if people order their own hardware directly, bypassing my tax issues here. You’d also only need a much smaller SSD, which can significantly reduce cost, and you wouldn’t need the acrylic case.

Assembling the units is literally 10 minutes of work with a screwdriver, or 5 minutes if you don’t get a case. The software is plug and play: just burn the disk image (instructions will be posted in detail), plug it in, and you’re ready. Only configuring Status remains, but that’s simple enough to document across tranches like you said.

So I would absolutely encourage everyone to order their own hardware; it’s much cheaper that way.


#5

I think it is a great proposal, especially if your goal is to create a really decentralized app (which it is). So the sooner we bite this bullet and make it happen, the better.

So, from the tech perspective, we need our nodes to replace:

  • a mailserver;
  • an RPC node;
  • an IPFS gateway;

anything else?

And if anyone wants to join me in developing the WebUI to administer the nodes that I started during BUIDLWEEK, feel free to ping me :smiley:


#6

That is not exactly correct but close, the actual numbers are:

  • eth.beta - ~1380 USD (this is the one actually used by the App)
  • eth.staging & eth.test - 2 x ~350 USD (these are for devs and testing)

So the total would be ~2430 USD, though ~1380 USD is the real number, since even if we get rid of our main cluster the test fleets would probably still remain.


#7

What kind of help would you need?


#8

Speaking with my POps hat on:

  1. Does the enthusiasm for this make sense? Absolutely - we’re all here with a common interest in decentralisation. Educating ourselves wherever we are on the continuum of tech knowledge is part of being a Status contributor. Does the proposal make sense? Yes - pending further discussion/agreement that everyone’s down with the costs involved ($ and time) cc @Dani
  2. Yes - can’t speak for the whole team, but increasing our technical understanding is high on our professional development list, and this exercise (challenging as it may be) seems like a worthwhile thing to do. On my side I anticipate I’d need quite a bit of handholding, but it looks like you’re factoring that sort of thing into the planning.
  3. Sure, we’d be happy to help with coordination/comms/cat herding if this went ahead, also could help onboard others once we’re set up.
  4. We don’t have anything like this currently in our OKRs, but we’re always open to getting involved in things that either a) push Status forward or b) support Status’s contributors - this seems to hit on both, so we could look at how to slot it in with our current plans if it goes ahead.

Question on:

  • These things will change: with the incentivization work being done to get people to run their own nodes, this tech may change in the process, which would require everyone to upgrade. That could be drastically more annoying than throwing it on @jakubgs’s shoulders (lol).

Does that mean potentially all the hardware purchased/efforts made could become obsolete? Do we have a sense of what the impact of upgrading would be?

Also - wondering about how this would impact everyone’s ability to focus on existing Status OKRs (e.g. final release), what timeframe would you be looking at for this?


#9

No. The devices @petty mentioned are weak only relatively speaking. In absolute terms they’re actually pretty beefy, and I run full Ethereum nodes on them. There’s no way for Status to become more demanding than those devices can handle; if it ever does, it has completely failed. So no worries about hardware becoming obsolete - only software.


#11

Yeah, I meant software upgrades, which require people to move slightly from a “set it and forget it” situation to upgrading the software, which is something we’ll eventually have to face anyway.


#12

Updating the initial post with those numbers, thanks.


#13

What are the storage requirements, and what will they look like in the future?

How much storage does Status keep?


#14

I would appreciate anything related to either frontend web or Golang… I can help get anyone up to speed; it’s just that I have quite a few things on my plate right now, so I need someone to help with the load.


#15

How much storage does Status keep?

Good question. Do we have metrics on this @jakubgs? None of the data Status keeps is essential, so you can basically purge everything once it’s older than 24 hours, but it would still be useful to know.

I would appreciate anything related to either frontend web or Golang… I can help get anyone up to speed; it’s just that I have quite a few things on my plate right now, so I need someone to help with the load.

Let me see what I can do, if anything. Can you shoot the details over to me via [email protected] please?


#16

Currently a fully loaded mailserver uses 2.2 GB of storage, but that depends on previous traffic, I take it.


#17

Funny, I was thinking this morning that this kind of stuff (I was thinking about it in terms of eth2 beacon nodes, but this is kind of the same…) could be the ultimate reuse for that old phone everyone has lying around in the drawer… likely it’s powerful enough to run the software, has an appropriate power consumption profile etc.

There are a few hurdles of course - storage, networking, ease of installation, etc. Something like DAppNode would obviously help here, or perhaps Termux.


#18

With what?

Random extra characters because Discourse is the absolute worst.


#19

Envelopes, so only if acting as a mailserver; otherwise I’d expect storage to be negligible (hundreds of MB, possibly less).


#20

In what time period is this - ever? If you auto-purge everything older than 24 hours, does it reduce it as much as I’d expect it to?


#21

Yes, 2.2 GB is for 30 days, so if traffic is constant and you only want to keep the last 24 hours, it should be < 100 MB (the caveat is that this store is writable by anyone, so it’s a potential attack vector, but that’s another story).
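That estimate follows directly from the 30-day figure; a quick check (assuming roughly constant traffic, as noted above):

```python
# 30 days of envelope traffic occupy ~2.2 GB; with a rolling 24-hour
# retention window and roughly constant traffic, one day's share is:
gb_per_30_days = 2.2
mb_per_day = gb_per_30_days / 30 * 1024  # GB -> MB for a single day
print(round(mb_per_day))  # 75, comfortably under the 100 MB ceiling
```

So a 24-hour purge policy should keep mailserver storage around 75 MB in the steady state, well within what the devices discussed above can handle.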