Partitioned topic post-mortem


#8

I’m pretty sure it was mentioned on the core devs call when we had an outage of the HK mailservers. It was in the action items for the postmortem: https://notes.status.im/5apzxihzQcqRN0WzpvL6SQ and there has been a discuss post about it since December: Whisper, Darkness vs. Traffic
And even a discussion in the PR here: https://github.com/status-im/status-react/pull/5004

Yep, we did not

We don’t have a protocol spec and no one has audited it, which is noted in the post-mortem, in the action items.


#9

Do you think these links (a) make it clear it is a breaking change (b) had any kind of larger consensus in Status, vs local decision making within a swarm (which is great for 99% of cases)?

We don’t have a protocol spec and no one has audited it, which is noted in the post-mortem, in the action items.

This seems fine from a technical perspective; might be useful to make it clear. The cultural aspect is more important, though. I’d also caution against relying too heavily on future work (Big Protocol Spec with Security Audit), vs just a plaintext “roughly like this” separate file that captures our knowledge of what compatibility means. We already have a few clients (well, ish), so that’d work as a litmus test, at least as context for discussions and decisions.


#10

Nope, I agree that it should be worded differently; poor job on my part on this point.

Just to make it clear: we aren’t outright breaking compatibility. People who are on 0.10.0 and 0.10.1 aren’t affected by this; their app will work just fine.

Yeah, I 100% agree. The problem is that we don’t even have that plaintext thing. We have something about what the Clojure-based message format means, but we don’t have any documentation for anything across the stack. And, as @adamb pointed out, all we do is very JS-specific, maybe because of that.

UPD: I’ll do an iteration on this post-mortem based on this discussion, thanks for the feedback :slight_smile:


#11

We had discussions about how to keep compatibility, these were the possible solutions proposed:

  1. Wait until users of an incompatible version deplete (this just delays breaking compatibility until it has low impact)
  2. Upgrade to the partitioned topic only once a positive acknowledgment is received that the other end has a compatible version (this can be carried in contact requests / messages). This approach is currently limited by the fact that multiple devices might be running different versions (e.g. on desktop, most users still use the released version of the app).

Protocol negotiation does not work well in our case given the asynchronous, mostly offline nature of communications, not to mention the fact that communications are really multicast (per public key) rather than device-to-device.

As a matter of fact, we haven’t released this version, so there is still time to backtrack (it’s just a flag in the config) and take a deeper look.

In general, I think we should also note that the first PR received little to no reviews after a week of bothering people, and the discuss post had only two contributors, although it was mentioned in the dev meetings and in the status and core-chat stand-ups.

When we did have a discussion, most people didn’t go further than saying “don’t do it” or providing not-so-relevant examples (e.g. synchronous, online apps without multi-device support).
None of this helps us move towards a solution; these points are worth mentioning, but more constructive engagement would be appreciated.

If we want to grow as a team, we should also start holding people accountable for a lack of engagement in critical moments (e.g. the discussions about the partitioned topic/compatibility, which in my opinion were fairly visible), instead of waiting for someone to act and then criticizing the post-mortem (which is not really a post-mortem, as we haven’t quite pulled the trigger yet, so come forward!)


#12

In general, I think we should also note that the first PR received little to no reviews after a week of bothering people, and the discuss post had only two contributors, although it was mentioned in the dev meetings and in the status and core-chat stand-ups.

I don’t think a PR with a specific implementation is the place to discuss a particular protocol change. I know we have been doing it this way since forever, but it’s high time we changed this process.

It’s true that it was mentioned a few times on various calls but in this kind of matter, I think, the best and only available place we have is discuss.status.im.

If we want to grow as a team, we should also start holding people accountable for a lack of engagement in critical moments (e.g. the discussions about the partitioned topic/compatibility, which in my opinion were fairly visible), instead of waiting for someone to act and then criticizing the post-mortem (which is not really a post-mortem, as we haven’t quite pulled the trigger yet, so come forward!)

+1


#13

The high traffic due to the single topic is a serious issue and will only get worse as time goes by and more users come along. We should fix it as soon as possible, because the impact can only grow the longer we wait.

The reason we are still in beta is precisely that we still have major issues like these to hammer out, which is why we should take advantage of it while we can; breaking compatibility is part of that. We can mitigate by waiting a bit until people upgrade, but eventually we have to get this done. (I guess this again raises one of our main issues in debugging user problems: lack of information on what is happening on our clients. Understandable due to privacy concerns, but maybe we could at least collect the app versions being used. But that’s a separate discussion.)

Lack of documentation is definitely an issue, but considering how we are still changing things, that is understandable. A breaking change is a great point at which to make new docs.
Though I don’t get your comment about “JS-specific” things; are you referring to the same thing as @adam in:

That does make alternative implementations difficult.
Either we make an example client library our source of truth and documentation of the protocol, or we make a document that is the source of truth for the current state of how things work. As a dev, I’d rather have a heavily commented library that explains the protocol and can serve as an example to be re-implemented in other languages, rather than a doc, but that’s my bias.


#14

Oh, so cool to see this discussion is more active and constructive than any previous discussions around this topic. Kudos guys!

Yep, that was my idea (and it looks like many people support it): instead of doing the obviously very hard thing to keep compatibility here, we just play the “we are in beta” card and softly break compatibility with old enough versions.

The major problem that I see, and that we need to fix, is that no one has looked at our current protocol holistically (transport layer, presentation layer, abstractions) and no one has done any risk analysis of how it will scale in the future. We should do that, so we are as aware as we can be of risks like this.


#15

I do refer to exactly that RNG, yes. Though it uses a standard algorithm, it is a bit tricky to feed the algorithm the same seed as JS does.

So in JS it is something like this:

```javascript
var seed = generateNumericHash('0xabcabcabc....'); // this is a hell of JS-specific code
var rng = MersenneTwister(seed);                   // this is standard
var number = rng.getDouble();                      // this is standard
```

whereas in, say, Go (or C++), you have to do something like this:

```go
seed := emulateJSMathToGetNumericHash("0xabcabcabc...") // you don't want to see the implementation, believe me
rng := MersenneTwister{seed: seed}
number := rng.getDouble()
```

#16

I believe you.
So based on what @adam said:

Can we verify that we can actually drop this pseudorandom bullshit without reducing security or teenage edginess (also known as “darkness”)? If so, we should definitely do so.


#17

Can we verify that we can actually drop this pseudorandom bullshit without reducing security or teenage edginess (also known as “darkness”)? If so, we should definitely do so.

that’s funny :slight_smile:

We could drop it, and I think it’s worth thinking about, as @adam’s method is much better; I wish I had thought of it before. The issue is that it would delay going live (essentially, 0.10.1 would not be compatible anymore, 0.11.0 would still use the shared topic, and we’d break in a release after that).

I don’t have a strong opinion myself, but it’s definitely an option on the table.


#18

If it is much better and more future-proof, maybe we should delay this thing for a few more releases and redesign it? What do you think?


#19

It’s not more future-proof, as it only differs in the way the topic is calculated, but it is definitely better: simpler and much easier to implement in other languages.

I think it’s worth it; we definitely want to include the change in 0.11.0 though, so that we could potentially go live in >= 0.12. It should be a simple change (hopefully; migration of nightlies might be a bit more complex).


#20

In my opinion, a better and much simpler approach is to get e.g. X point from the public key and do X mod N where N is a number of topic partitions.

Beyond obscurity and the difficulty of reimplementation, I tend to agree the Mersenne Twister doesn’t add much. I believe we may find it hard to get others to adopt it as some sort of community UX-friendliness standard, for this reason.

If I were making up a new scheme, I’d probably (securely) hash the public key before using it for fingerprinting, then use the bits from that hash directly in the lookup, as is usually done. I’d maybe try to make the number of bits per word match some specific part of the fingerprint so they can be translated back and forth, for better verification, for those who desire to do so.

See also Choosing a unique three-word name for the difficulty of choosing somebody else’s three-word name: long story short, how boring it is to reproduce the twister is the biggest obstacle :slight_smile:


#21

Oops, I got confused and was primarily thinking about the three-word name in the user interface when writing the above, but the same reasoning applies to topic selection, especially about spreading public key bits.


#22

Do we intend for this post-mortem to live here in Discuss? I’d rather it be a separate document, so I can start a section of post-mortems in the docs to gather them in a single place of “lessons learned”.


#23

Just wanted to add that the single topic was a progressive change.

Initially it was only used as a discovery topic, for contact requests, after which a random topic was used.

There was a bug there which led to a drop in reliability, but instead of it being fixed, the single discovery topic started to be used for everything.

After some time the random topics were removed from the code altogether, until the discovery topic became partitioned.


#24

It is more of a “pre-mortem” right now, for discussion.
The real post-mortem should be a separate document, for sure.


#25

UPDATE: We decided to NOT ENABLE partitioned topic for 0.11.0 and spend more time researching for better solutions.


#26

Update: The conversation continues in https://github.com/status-im/status-react/pull/7981#issuecomment-483939668

Copy-pasting comment in Github so this doesn’t get lost in a specific PR:

What @adambabik said. Spoke with @mandrigin about this, but it should be a separate PR to specs repo with a proposal that takes into account upgradability, etc. It doesn’t have to build on the current spec WIP PR, but it can refer to it.

Relevant here would also be to mention:

  • bandwidth usage and how it might change with launch and users
  • how this would impact the trade-off of potentially not having compatibility
  • what 100% or 99.9% compatibility would look like (multicast vs device abstraction) and what it requires
  • entertain the option of keeping listening to the discovery topic but sending less over it through some form of toggle, i.e. the 99.9% compatibility case (similar to device sync, but perhaps a corner case re recovered accounts way in the future)
  • upgrade path, both from old to new as well as from proposal to further (say we change topic partition again in future)
  • summary of options with trade-offs
  • possibly options for sacrificing compatibility in short term and regaining it later, e.g. with device data sync
  • evaluation should have a big premium/bias towards keeping compatibility, for reasons mentioned in several previous core dev calls

I realize this sounds like a lot, and it’d be easier to just hack it and implement it. But we are trying to create a protocol and spec that exists on its own, and can’t be changed ad hoc. This will only increase in importance going forward, especially as we get more clients. By doing this properly now we are changing the culture of what spec changes look like, and ensuring we know how to do it now rather than later. It is similar to Chaos Unicorn Day in that sense. It is cheap to learn this now, expensive to learn it later.

As a role model here I just want to point to @arnetheduck, who has a lot of experience working on p2p protocols, including getting 70M installs for a compatible reverse-engineered p2p app 10 years ago. @cammellos rightfully points out there are some differences in terms of protocol negotiation etc., and this is the kind of consideration we want to see in the spec proposal. Similarly, @mandrigin and others have pointed out that we do have real-world constraints, such as launching an app with non-sucky BW usage, which might lead to different trade-offs. This type of thing can also be reasoned about thoughtfully in a proposal which clearly outlines different options and trade-offs with rationale.

Once that proposal is in a better shape, let’s bring it up in Core Dev call.


#27

The very first draft (I’ll work on the rationale, etc. tomorrow): https://github.com/status-im/specs/pull/4