(Mostly) Data Sync Research Log

April 1, 2019

Continued with data sync comparison: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-comparison.md

Diff: https://github.com/status-im/status-research/commit/7e8d8864fbbb34446eef85368d26ade85664db77

Protocol weekly notes https://notes.status.im/EVxnIQeqTpquNETtdXcqNw#


April 2, 2019

Continued with data sync p2p comparison: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-comparison.md

Not happy with the dimensions breakdown. Tweaking a bit:

Compared dimensions

Request for comments: What's a better way to decompose these properties? Ideally something like 2-5 dimensions that matter most. Right now it reads a bit like a laundry list, or like an ad hoc classification.

These dimensions are largely taken from the survey paper by Martins as well, with some small tweaks. To make it easier to survey, we divide up the dimensions into rough sections.

1. Why and what are we syncing?

  • Problem domain. Why are we syncing stuff in the first place?
  • Minimal unit of replication. The minimal entity that we can replicate. The actual unit of replication is the data structure that we are interested in, usually a collection of entities. Related: version history abstraction (linear vs DAG).
  • Read-only or read and write. Is the data (the actual unit of replication) static or not?

2. Who is participating?

  • Active vs passive replication. Are participants replicating data that they are not interested in themselves?

3. When and where are we syncing?

Replication control mechanisms.

  • Single-master vs multi-master. Both entity and collection.
  • Synchronous (eager) vs asynchronous (lazy).
  • If asynchronous, optimistic or not.
  • Replica placement. Full or partial replication.

4. How are they syncing?

  • P2P Topology. Public or private P2P? Unstructured/structured/super-peer or friend-to-friend/group-based?
  • Peer Discovery. How do peers discover and connect with each other?
  • Transport requirements. Any specific requirements on the underlying transport layer?

5. What guarantees does it provide?

  • Consistency model. What are the guarantees provided?
  • Privacy preservation. Assuming underlying layers provide privacy, are these guarantees upheld?

6. Engineering and application, how does it work in practice?

  • Actively used. Is an instance of the protocol used in the real world?
  • Mobile friendliness. Any specific considerations for mobile?
  • Well-defined spec. Is there a well-defined and documented spec to refer to?
  • Protocol upgradability. Are there any mechanisms for upgrading the protocol?
  • Framing considerations. Does it support messages that don’t need replication?

More diff

As well as filling in properties and comparisons across the various approaches.

Diff: https://github.com/status-im/bigbrother-specs/commit/19ecae4ef700a5eeddf1b683dac1fa73f2f38e54

Also

Edited some phases based on https://notes.status.im/status-protocol-stack# and set up a static site https://status-im.github.io/bigbrother-specs/

Further reading


Been a while since the last post. Let's post a brief update, starting with the two goals that were posted in the OP a month ago.

Looking back

  1. Research and hacking.
    a) Continue studying relevant literature and existing solutions
    b) Document findings and create juxtaposition of different approaches
    c) Informed by above, resume PoC hacking and precise specification

    d) Once there's confidence in PoC and spec, make a plan for getting it into the app

  2. Finding great people.
    a) Keep track of researchers and people with relevant expertise
    b) Triage based on their relevance and likely availability
    c) As budget and time frame allows, reach out for collaboration

1 (a) and (b) have been done enough for (c) to be resumed. It allowed us to get a deeper understanding of things such as consistency models, how people look at replication in the literature, and to survey various approaches to this problem.

While (a) and (b) could continue for longer (much longer), we have enough to get a minimal viable version going (more on this later). As long as it supports upgradability we can improve things as we go along. This is important as it allows us to stay on the critical path.

2. Dean joined recently and has been hacking away, both on the general stack (https://github.com/status-im/bigbrother-specs/) and minimal viable data sync (Minimal Viable Data Sync Research Log).

We’ve also started having more regular calls (https://github.com/status-im/messaging-pm/) with other people under the Web3 Messaging banner. This includes closer collaboration with mixnet researchers such as Nym/Panoramix, and also recently the Swarm team.

Briefly on survey/research paper and utility thereof

For the continued research and juxtaposition, while slightly deprioritized the last few weeks, it is also something important that we should keep working on. There are about ~10 bigger-ish issues that need more thinking, as well as some feedback and misc TODOs that should be addressed. All of these are inline here: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-comparison.md

The proposed next milestone of the p2p data sync survey is to create a solid draft survey/research paper. This should be of sufficient quality that we can put it out to the larger community, including academia. The idea is that this allows us to be evaluated seriously by the security/Ethereum/dist-sys community, and it can have other positive side effects such as:

  • snowball effect in terms of research, a la Gnutella research (lots of papers on this)
  • more eyeballs and evaluation of our protocol stack
  • more qualified people taking us seriously, recommending our work
  • more qualified people interested in wanting to participate with us

The latter is already partly happening through e.g. Web3 Messaging calls, and by being public with research we have had multiple groups reach out to us, been featured in Week in Ethereum, etc. We can reach wider than the Ethereum community though.

More on this in another update.

Misc PoCs

Some PoCs that have been written recently that relate to data sync:

Briefly on specs of Status protocol stack

Slightly out of scope of data sync per se, but there are some dependencies and interface points. We've made a bunch of progress on documenting our current protocol stack, largely thanks to Adam.

This also includes discussions around bandwidth saving changes (while keeping compatibility), and the dependencies this has on things like moving to device-based comms, multidevices, and so on.

You can see the umbrella issue of things that have to be done here: https://github.com/status-im/specs/issues/10

Minimal viable data sync spec and requirements

The goal with this work is to get a minimal viable data sync spec and PoC out as soon as possible so the app can use it. This can later be enhanced and upgraded. You can read Dean's log here: Minimal Viable Data Sync Research Log

The current description is here: https://notes.status.im/O7Xgij1GS3uREKNtzs7Dyw#

As part of this we have a set of requirements that we have been going back and forth on, in an attempt to lock down the things we know now. More on these specific requirements in the next section. This is still an open discussion, and to read more about this please consult above hackmd or ping me or Dean.

Towards minimal requirements

Here are the proposed requirements as of today.

  1. MUST sync data reliably between devices.
    By reliably we mean having the ability to deal with messages being out of order, dropped, duplicated, or delayed.

  2. MUST NOT rely on any centralized services for reliability.
    By centralized services we mean any single point of failure that isn’t one of the endpoint devices.

  3. MUST allow for mobile-friendly usage.
    By mobile-friendly we mean devices that are resource-restricted, mostly offline, and often changing networks.

  4. MAY use helper services in order to be more mobile-friendly.
    Examples of helper services are decentralized file storage solutions such as IPFS and Swarm. These help with availability and latency of data for mostly-offline devices.

  5. MUST have the ability to provide causal consistency.
    By causal consistency we mean the commonly accepted definition in distributed systems literature. This means messages that are causally related can achieve a partial ordering. (One way to realize this is sketched after this list.)

  6. MUST support ephemeral messages that don’t need replication.
    That is, allow for messages that don't need to be reliably transmitted but still need to be transmitted between devices.

  7. MUST allow for privacy-preserving messages and extreme data loss.
    By privacy-preserving we mean things such as exploding messages. By extreme data loss we mean the ability for two trusted devices to recover from a deliberate or accidental removal of data.

  8. MUST be agnostic to whatever transport it is running on.
    It should not rely on specific semantics of the transport it is running on, nor be tightly coupled with it. This means a transport can be swapped out without loss of reliability between devices.

This brings down the total list of requirements from 11 conditions to 8, and 5 SHOULDs to 0. This is still a work in progress, and feedback or questions on these requirements are welcome.
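
To make requirement 5 concrete, here is a minimal sketch of one way causal ordering can arise: messages carry explicit parent references, and happened-before is simply ancestry in the resulting DAG. All names and shapes here are illustrative, not part of the spec.

```python
# Sketch: causal partial ordering via parent references (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    id: str
    parents: tuple = ()  # ids of causally preceding messages

def happened_before(a: Message, b: Message, store: dict) -> bool:
    """True iff message a is an ancestor of b via parent links."""
    frontier = list(b.parents)
    seen = set()
    while frontier:
        pid = frontier.pop()
        if pid == a.id:
            return True
        if pid not in seen:
            seen.add(pid)
            frontier.extend(store[pid].parents)
    return False

m1 = Message("m1")
m2 = Message("m2", parents=("m1",))
m3 = Message("m3")  # concurrent with m2: neither ordering holds
store = {m.id: m for m in (m1, m2, m3)}
assert happened_before(m1, m2, store)
assert not happened_before(m2, m3, store)
```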

Misc notes

Just some points that might be worth mentioning.

It was brought to our attention that Teller Network would be interested in using data sync for state channels. This is one of several alternatives being evaluated, and its requirements would roughly be reliability (why mailservers don't cut it for now) and scalability (why the Ethereum PoW chain might not be the best choice). Other alternatives being evaluated are a POA/PoS chain, etc. One idea here is to choose a short-term solution, black-box it, and work on something more desirable over the long term, such as a generalized plasma solution.

From a data sync point of view, this is interesting as a specific use case that fits the requirements. A concern here is in terms of timeline.

Another point worth emphasizing is making sure we see data sync as generalized messaging, including machine-to-machine communication, and don't get blindsided by our focus on supporting the app. There are probably more areas where we aren't taking this general view sufficiently seriously (e.g. assuming human users everywhere).

Next steps

Main strands of work.

  1. Refine minimal requirements above, the associated spec with wire protocol and protobuf, and get a working PoC that we can (later) use through console-client, and possibly other means. An aggressive deadline for the initial version of this has been set at end of April, but this might be revised.

  2. Work on getting a solid draft for all the associated specs in the spec repo. This is ongoing work, and with Adam going on vacation this might take a tad longer than expected. Old ETA for this was mid-May but may be revised.

Associated with the spec effort is also the critical path of getting the Whisper partition-topic bandwidth savings into the app while keeping compatibility. This is a must for a public launch, and interfaces with the existing spec, the app, multi-device pairing, and data sync.

  3. As bandwidth allows, continue on the data sync p2p comparison survey/research paper. No ETA set. Might collab with Jacek on this later in person.

As well as continuing general collaboration with the Web3 Messaging teams. Potentially something more with the Swarm team, as there appear to be some opportunities there in the medium term.


I probably forgot some things, but that’s the (long) gist of it in terms of updates.

Until next time!


May 6

Compilation of notes from the previous week. Today most focus was on Bloom filters. Specifically as they pertain to dealing with data loss and privacy preservation (see minimal requirements above), as well as to what extent they are a necessary optimization for bandwidth/latency, and how they relate to sync with untrusted nodes. See research questions at the end for more details.

Briefly on survey paper

Realized one thing that was bugging me about the existing survey work: it largely compares these different sync protocols in a vacuum. This means things that shouldn't be captured there are, and there's no explicit focus on the constraints of the problem, nor on proposing a solution.

While the desired output here is code in the app (this remains the critical path), we also want to keep doing this paper in parallel, being more problem- and solution-oriented as opposed to a detached survey. To be continued.

Previous

Some notes and misc resources I didn't get around to posting before. More of a dump/reference.

Also revisited SoK and conversational security a while ago:

SoK small note: they say self-destructing messages are impossible, which is true, but they are possibly missing the local threat model as seen from real-world issues.

Three small pieces: (Notes captured elsewhere)


<Notes from Git internals and object model, tags, references, remotes etc missing>

Bloom filters and their applications to data sync.

Main paper: Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. “Theory and practice of bloom filters for distributed systems.” IEEE Communications Surveys & Tutorials 14.1 (2012): 131-155.

In this paper they go over the basics of Bloom filters and survey multiple variations of Bloom filters with different characteristics.
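
As a self-contained refresher (my sketch, not code from the paper): a plain Bloom filter is an m-bit array with k hash functions; a negative test proves non-membership, while a positive test may be a false positive.

```python
# Minimal Bloom filter sketch: m-bit array, k hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def maybe_contains(self, item: bytes) -> bool:
        # False definitely means absent; True may be a false positive.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter(m_bits=8192, k=7)  # 1024 bytes, as in examples below
bf.add(b"bundle-1")
assert bf.maybe_contains(b"bundle-1")
assert not bf.maybe_contains(b"bundle-2")  # almost certainly absent
```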


They also go through various applications, from general search, caching, and network routing to p2p data sync.

Bloom filters, merkle trees and blockchains

Achieving Blockchain Scalability with Sparse Merkle Trees and Bloom Filters. The main idea in this blog post is to combine Sparse Merkle Trees (SMTs) and Bloom filters to increase Bitcoin scalability. SMTs are a sparse version of binary Merkle trees, and leverage default hash values for empty leaves. The SMT is used to prove membership of a transaction, whereas the Bloom filter is used to prove non-membership of a transaction. That is, to prove that an unspent transaction hasn't been spent in a previous block. This is useful as you can have a compact representation, instead of going over all block headers to prove a transaction hasn't been spent.

Similar ideas have been proposed for e.g. Plasma in Ethereum. See https://ethresear.ch/t/plasma-cash-with-sparse-merkle-trees-bloom-filters-and-probabilistic-transfers/2006. I believe there's a difference here if you use something UTXO-like vs account-based. There are some concerns about the performance of SMTs, but this is largely irrelevant at the interface level. See this post by Vitalik for some ideas to optimize this: https://ethresear.ch/t/optimizing-sparse-merkle-trees/3751

Bloom filters and data sync

See papers above; things to read more on.

  • Efficient PDA sync. Using Bloom filters for heterogeneous networks and synchronization protocols.

Only skimmed. Most sync uses "slow sync", which is bad for bandwidth and latency. Their idea: synchronize hosts using one message in each direction with length |A-B| + |B-A|. Their approach, CPISync, is supposedly more efficient than even standard Bloom filter approaches, using some kind of characteristic polynomial (didn't dive into the math details).

  • Informed Content Delivery Across Adaptive Overlay Networks. Optimizing throughput through better set reconciliation and collaborative downloads (p2p?). Also erasure coding.

  • Efficient Peer-to-Peer Keyword Searching. Quick search with low bandwidth using Bloom filters and DHTs. Inverted index over a DHT. Some mathy calculations and experimental results, including for CPU, latency, and various forms of caching.

  • TRIBLER: a social-based peer-to-peer system. Exploiting similar interests and friendship/community to increase usability/performance in p2p. A set of extensions to BitTorrent. Talks about 5 challenges: achieving decentralization, availability, system integrity/trust, providing proper incentives, and finally network transparency (NAT traversal, etc). Taste-buddy-based content discovery. Buddycast. Bloom filters are used for their megacaches. Using empirical data and a crawler to see Friendster average friend count (~243) and friend-of-friend connections (9147); within an order of magnitude the same for others. This means ~260 bytes to discover friends-of-friends, and enables thousands of peers to exchange a lot of data quickly. Buddycast: taste buddy exploration (random peer) vs exploitation (taste buddies). Ratio thereof.

  • How they solve the availability issue and how this relates to replication between untrusted nodes, Swarm etc.

  • Need to look into this in more detail

  • If you randomly sync with nodes, wouldn’t this impact sync scope? For a topic, say.

  • https://pdfs.semanticscholar.org/94e6/26b7ff25b07d23fd345aad254abbdd0a574b.pdf. Data bundle synchronization. Bloom filters, NAT traversal, data bundle selection algorithms. Good for high churn. Node from candidate list (probabilistic to get good behavior, 1%/45%/50% iirc, tracker/friend/f2f/random? ish). You select a range of bundles to sync, but which nodes are these? Only a range. They mention 100k bundles, so is that all their content?

Flow model: Node selection -> Bundle selection -> Bloom filter creation -> NAT traversal -> Waiting for reply. Need to read and think about this more - my mental model of the replication space is different now from last time.


Churn resilience - 1000 nodes, session times from 30m down to 30s experiment.

Also, this paper is better than their wiki (Dispersy Read the Docs system overview).

Misc

(Delft, Niels Zeilemaker etc? Where are these people now? Reach out?)

Research questions

  1. Can Bloom Filters be used to do more efficient data sync?
    Unquestionably yes.

  2. Are Bloom filters necessary for a minimal version of data sync, or can they be added after the fact?
    Leaning towards adding them after the fact as an enhancement.

TODO: Still here - the main unknown is in terms of how to deal with sync scope and sync with untrusted nodes.

Can we generalize the node selection and bundle selection step? This might make it more elegant (make it work->right->fast).

  3. Can Bloom filters be used/relied on for privacy?
    See paper (http://netgroup.uniroma2.it/wp-content/uploads/2015/04/BBL-PSD-cameraready.pdf) for analysis. Also, how does Tribler deal with this? Vs a private overlay.

  4. Can Bloom filters' ability to prove non-membership be used to deal with exploding messages and data loss?
    To be explored.

  5. How big is the performance difference for reasonable Bloom filter parameters?
    I.e. bandwidth savings, to calculate.

TODO: Still here - need to set some assumptions and do some math, or compile existing empirical data. Can probably KISS to get order of magnitude intuition.

Next

Look into the above questions in more detail, especially as they block the critical path. Study Tribler in more detail.

IIRC it doesn't: by sharing a Bloom filter you allow a node to discern information about you. It's similar to the sync-with-untrusted-nodes issue.

May 8

Looked into Dispersy bundle synchronization and Tribler in more detail. Efficient sync using subset Bloom filter selection, as well as some good ideas on dealing with challenged networks of various types.

Briefly on large group chats etc

Just to briefly capture ideas from my physical notebook. Terse version.

  1. Leverage multicast when underlying protocol allows it, i.e. use | topic | message | for sending offers/acks. This assumes you send to many IDs and these are all listening to that topic.

  2. Random selection for large group chats. I.e. pick K out of N to offer/ack to; this avoids quadratic scaling for naive syncs.

  3. Log collator idea. f(topic, logs) => topic, one log. Problem: Censorship resistance (proof of data availability~)? No denial of service.

Tribler

The main idea is to exploit the social connections between people who share similar interests, and use this to increase usability and performance in p2p networks.

Content discovery and recommendations are based on taste buddies, i.e. users with similar interests.

Talks about decentralization/availability/integrity/incentives/network transparency as being fundamental challenges in p2p networks.

Architecturally, it is a set of extensions to BitTorrent. It has a module for social networking, and 'megacaches' to locally query metadata. This metadata includes things like the friends list (information on the social network); peer cache (info on super peers); preference cache (preference lists of other peers). All of these are under 10MB in size.

The problem domain is BitTorrent, so static files. In the UI there are downloads/friends/liked files and taste buddies.

They have a buddycast algorithm (epidemic, based on the social graph) and a 'peer similarity evaluator'.

Bootstraping use random super peer to get more nodes, then use those other ones.

Session boundary problem - solved using stable peer ids, i.e. keypairs.

Bloom filters are used for storing and testing set membership in these megacaches. Also for filtering out peers from messages when they are already known to the destination.

The size of these filters is a function of expected connections. Empirical data: Friendster ~240 average friends and ~9100 friend-of-friend connections [FB ~200 median, 340 average]. (This makes sense, since ~100 friends and then 100*100 with some overlap.)

To get a Bloom filter with a 1% false positive rate you need <10 bits per element, so that's ~260 bytes to test and figure out common friends-of-friends.
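
That figure follows from the standard optimal sizing m/n = -ln(p)/(ln 2)^2 ≈ 9.6 bits per element at p = 1%; a quick back-of-envelope check (my arithmetic, in the same ballpark as the ~260 bytes above):

```python
import math

def bits_per_element(p: float) -> float:
    # Optimal Bloom filter bits per element for false positive rate p.
    return -math.log(p) / math.log(2) ** 2

print(round(bits_per_element(0.01), 2))         # 9.59 bits at 1% FPR
print(round(240 * bits_per_element(0.01) / 8))  # ~288 bytes for ~240 friends
```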

Buddycast: Each peer has a number of taste buddies in their megacache with preference lists (what content they want) and random peers (i.e. no known preference list). Then every 15s, each peer can either exploit or explore. Exploit = connect to a taste buddy, explore = connect to a random peer. This ratio can be tuned.

A buddycast message also has taste buddies, top 10 preference list and random peers. Buddycast can be disabled for privacy. Peers are listed either by freshness (random peers) or similarity (~overlapping preference lists?).

Brief discussion of BT rarest-first, tit-for-tat, and BTSlave using repeaters to increase content replication.

  • Curious, does the megacache concept scale for messages? Assume a message is 1KB, and the average user sends 100 messages a day (70 average sent for Slack). If you are talking to 100 other people (~200 friends in some constellation of N chats), that's 10k messages per day, or 10 MB per day already. The key thing here is that it's metadata Tribler is storing, not the messages themselves, which (a) allows it to be a cache, because data integrity is not a hard requirement, and (b) takes constant space, ish (prune preference list at top N).

  • What would happen if we used this on top for most popular/recent content? Still allowing for other means.

Main paper May 8: Dispersy Bundle Synchronization

This is their academic paper equivalent to https://dispersy.readthedocs.io/en/devel/introduction.html - not clear exactly what the overlap looks like. I find it interesting that in their docs they call it a Distributed Permissions System, but in the paper they emphasize data dissemination in challenged networks. EDIT: Oh, because this is about the Bundle Synchronization in particular, not Dispersy or how they use it.

Introduction

Requires minimal assumptions about the reliability of communication/network components. Decentralized and pure p2p. Specifically for challenged networks.

What are examples of challenged networks?
Mainly: Vehicular ad hoc networks (VANETs), Delay Tolerant Networks (DTNs).
Additionally: P2P Networks and smartphones.

What’s challenging about VANETs?
Connection windows are small.

What’s challenging about DTNs?
Connections are only possible when the network isn't heavily utilized.

According to Dispersy, what’s challenging about pure p2p networks?

  • Churn - node lifetime measured in minutes
  • NAT traversal
  • Malicious nodes: frequent interactions

In the past, these were treated in isolation. In Dispersy, these concerns are combined.

What’s the main idea in Dispersy?
Nodes advertise locally available data in one message, and repeated random interactions lead to quick spreading of data.

NOTE: It appears that the goal is to spread all bundles to all nodes, i.e. a form of full replication. It is also active, since all nodes are spreading all data.

Individual bundles can appear out of order, so it’s different from Delivery in Bramble. They suggest not having dependencies on bundles.

Key things in paper:

  • 1:N and M:N data dissemination with minimal complexity/state info
  • Easy repeated random interactions between nodes, wrt churn, failure and NATs
  • Allow for 100k+ bundles and M+ nodes
  • Experimental data and real-world usage

Related work:

What are the three forms of distributing updates Demers proposed?
direct mail, anti-entropy and rumor mongering.

According to Demers, how does direct mail data distribution work?
Update is sent from one node to all other nodes.

According to Demers, how does anti-entropy data distribution work?
A node chooses a random node and resolves differences in state.

According to Demers, how does rumor mongering data distribution work?
Nodes receive a new update (hot rumor), then forward the update until it goes cold (received from many nodes).

Demers initially thought anti-entropy too expensive; optimization to use checksums etc. (NOTE: Before Bloom filters were popular?)

VANETs can be seen as p2p due to churn and many nodes without a central component.

What is the network churn in VANETs due to?
Movement of the vehicle.

Lee had urban vehicles with event summaries, such as license plates, and then a Bloom filter to detect missing summaries. Bloom filters are very efficient for membership testing; in Lee's case using 1024 bytes. (NOTE: for a 1% false positive rate that'd be ~1000 cars.) They also coupled this with a push to one-hop neighbors. The Bloom filter here is thus only used to detect missing summaries. They call this a form of harvesting.

System Overview

'Bundle synchronization middleware' for challenged networks. Extends the anti-entropy method using a compact hash representation (Bloom filters). Allows two nodes to compare data sets without sending a complete copy.

TODO: Insert overview of sync and receive

In Dispersy, what do the five synchronization steps look like?

  1. Node selection
  2. Bundle selection (some range)
  3. Bloom Filter creation
  4. Send Bloom filter (NAT traversal)
  5. Wait for reply (Pause and goto start)

Synchronization extends harvesting by Lee. They only sync a range of bundles, which keeps the false positive rate low without a big Bloom filter.

NOTE/QUESTION: How does that follow? Bloom filter size is determined by bits-per-element (m/n), but if the element is the set of possible bundles, what does it matter what the specific range is? Does this imply the recipient knows which range it is supposed to receive, in order to limit the universe of possible elements?

They say "in contrast Lee used a fixed size w/o selecting bundles". This seems to imply the recipient knows what the range is.

System Components

To allow for different types of challenged networks, Dispersy is modular. They have a few different components:

  • Network Construction
  • Node Discovery
  • Robust Node Selection
  • NAT puncturing
  • Synchronization
    (- Synchronization performance)


Network construction

All nodes communicate in one or more overlays.

What’s an overlay in p2p?
An additional network built on top of e.g. the Internet.

A public key is used as the identifier of an overlay, and has to be known to nodes joining it.

NOTE: Is this sloppily written or am I misunderstanding? Is this just NodeID or do they mean some kind of bootstrap node / Network ID? EDIT: I guess if we look at each node as having their own overlay.

Why did Dispersy choose to run on UDP over TCP?
Because it allows for better NAT firewall puncturing.

Each overlay has a candidate list.

What is the candidate list that each Dispersy node has in each overlay?
List of active connections within that overlay.

When we connect to an overlay, the candidate list is empty, thus requiring node discovery.

Node Discovery

What’s the most basic form of node discovery if you have an Internet connection?
Using trackers.

What characteristics does a tracker have for node discovery?
(1) Known to all nodes (2) Must be connectable (not behind a NAT firewall).

What does a tracker do in terms of node discovery?
(1) Maintains list of nodes for several overlays (2) Returns these upon request.

After populating the candidate list with initial nodes, for example by asking a tracker, these can be used for new connections.

Node A asks node B for an introduction to a new node; B replies with node C. Then A can puncture the NAT.

Description of NAT puncture mechanism, more in later section.

Robust Node Selection

We have to select a node from the candidate list to send an intro request to, but there might be attacker nodes in the list.

Why is node selection in p2p non-trivial?
You might connect to a malicious node, leading to e.g. eclipse attack.

What is it called when the candidate list of nodes in p2p is manipulated to include attacker nodes?
Eclipse attack.

What can a malicious node do after eclipsing a node?
Control all data received by victim.

How does Dispersy create a more robust node selection algorithm?
By dividing candidate list into three categories, one with two subcategories.

What are the 3-4 selection list categories in Dispersy?

  1. Trusted nodes
  2. Nodes we contacted before (successfully)
  3. Nodes that contacted us before
    3a) Sent us an introduction request
    3b) Were introduced to us

When selecting a node from the candidate list, category 1 is chosen 1% of the time, and the remaining 99% is divided between categories 2 and 3 (category 2 gets 49.5%, and each subcategory of 3 gets 24.75%).
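
As a sketch, one selection step could look like the weighted draw below; the category weights are from the paper, everything else (names, structure) is mine:

```python
import random

# Dispersy-style candidate categories and their selection weights.
CATEGORIES = [
    ("trusted", 0.01),
    ("we_contacted", 0.495),
    ("sent_us_request", 0.2475),
    ("introduced_to_us", 0.2475),
]

def select_node(candidates: dict) -> str:
    """candidates: category name -> list of node ids (at least one non-empty)."""
    names, weights = zip(*[(n, w) for n, w in CATEGORIES if candidates.get(n)])
    category = random.choices(names, weights=weights)[0]
    return random.choice(candidates[category])

print(select_node({"we_contacted": ["b"], "sent_us_request": ["c"]}))
```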

NOTE/TODO: This probably generalizes to how Bitcoin and Ethereum deal with eclipse attacks, in terms of buckets etc. Might be worth comparing these.

For each step in Dispersy node selection, what’s the probability of choosing each category?
1% trusted, ~50% nodes we connected to before, and ~25% for each subcategory of nodes that contacted us before.

How does Dispersy’s node selection algorithm guard against eclipse attacks?
Attacker nodes will (usually!) end up in the third bucket, which has a cap on the selection probability.

In terms of which node is selected in each category, this is elaborated on later. But roughly oldest for NAT timeouts, as well as most similar friend (IIRC).

If an attacker has a lot of resources they can still eclipse a node, hence the use of trusted nodes.

NOTE/TODO: Eclipse video worth rewatching with better notes on mitigations. Also: How does the 1% impact an eclipse's possibility to do harm? If you have a limited number of connections you might still not hit the 1% case, so it seems to depend on what harm an individual/set of nodes can do.

Trusted nodes (i.e. trackers): Contacting a trusted node will completely reset the candidate list (!). They did experiments with manipulating category 2 nodes.

Why are trusted nodes (trackers) less susceptible to eclipse attacks?
They are contacted a lot and constantly by honest nodes, which makes it harder to pollute the candidate list. (N.B. Sybil attacks).

NOTE/TODO: Read up more on Sybil attack and Sybil resistance.

NOTE: They seem to claim 4 million concurrent nodes in p2p empirically (cite 18, Gnutella), and implicitly that the risk of a Sybil attack at that scale is low.

NOTE/TODO: How does recency impact the sybil attack resistance? IIRC in Eclipse presentation there’s an argument that old nodes are safer, I guess this depends on bucket design? I.e. if % allocation changes.

NAT puncturing


How many nodes connected to the Internet are behind a NAT, according to Halkes et al 2011?
Roughly two thirds (up to 64%).

NOTE/TODO: this number might be even higher nowadays with smartphones? Read up on current state of NAT traversal, and how this changes with different protocols, i.e. QUIC etc.

As a consequence of this, each introduction step is combined with a distributed UDP NAT puncturing mechanism.

How does Dispersy get around the fact that 64% of nodes are behind a NAT?
By integrating each node discovery step with a distributed UDP NAT puncturing mechanism.

How does the UDP NAT puncturing mechanism work in Dispersy? Three steps.

  1. When A asks B for an introduction, B sends an introduction-reply to A AND sends a puncture-request to C (with A's info)
  2. C then sends a puncture message to node A
  3. C's message will be blocked by A's NAT, but when A later selects C they can connect

NOTE/TODO: A bit fuzzy on the details of how the ports are matched up, worth revisiting basics. Same re NAT timeouts.

Due to NAT timeouts, nodes are invalidated after some time (~30s-1m). See Halkes (cite 9) for more.

Synchronization

Bundle synchronization through Bloom filters allows nodes to check for missing bundles. Additionally, it piggybacks on the introduction mechanism, so just one message needs to be sent.

Why is it extra important to limit the size of the Bloom filter in Dispersy?
Due to UDP + piggybacking on the introduction message, so that the MTU of the link isn't exceeded.

What happens if the MTU of the link is exceeded for UDP?
If a UDP datagram consists of multiple fragments, there's a higher probability it will be lost.

What’s a typical MTU size used on the Internet, according to RFC 3580?
1500 bytes (2304 bytes for 802.11).

NOTE: There might be more up-to-date resources for this / gotchas for various links.

NOTE/TODO: How big would e.g. the Sphinx packet format be? Same for various types of standard cryptographic payloads, e.g. handshakes. If they are written with TCP in mind, perhaps this is a gotcha.

<1000 bundles can be synchronized at a 1% false positive rate with 1024 bytes; <2000 at a 10% false positive rate.

Assuming bloom filter of 1024B, how many messages can we synchronize with 10% false positive rate?
<2000

What's a flaw in Lee's MobEyes Bloom filter design that uses a 1024-byte filter?
There's no limit on the bundles being synced, which means the false positive rate shoots up if we add e.g. 2000 vehicles (10% at 2x).

How does Dispersy get around the flaw in Lee's MobEyes Bloom filter parameter choices AND the MTU limitation?
By fixing the false positive rate at 1% and limiting the number of available bits (m) to the MTU, the Bloom filter has a fixed capacity, so we select a subset of bundles (n) to synchronize.
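
In other words, once p is pinned at 1% and m is capped by the MTU, the capacity n = m·(ln 2)²/−ln(p) is fixed too; a quick check of my arithmetic against the <1000 figure above:

```python
import math

def capacity(m_bits: int, p: float) -> int:
    # Max elements an m-bit Bloom filter can hold at false positive rate p.
    return int(m_bits * math.log(2) ** 2 / -math.log(p))

print(capacity(8192, 0.01))  # ~850 bundles for a 1024-byte filter at 1% FPR
```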

Subset selection

Based on the global time property of each bundle, which is an implementation of Lamport's logical clock.

What property is subset selection in Dispersy based on?
Global-time property of each bundle, which is using Lamport’s logical clock.

Global time is the highest global time received by the node + 1 when creating a bundle. NOT unique, but provides a partial ordering in the overlay. Assume the distribution is uniform-ish.

Two heuristics for defining which subset to sync.

How does global time property help selecting what to sync in Dispersy?
By specifying a subset range of this (non-unique) global time to sync.

What heuristics are used in Dispersy to select subset to sync in Dispersy and when are they used?
Modulo heuristic when a node is lagging behind, and pivot heuristic when it’s almost up to date.

Modulo heuristic: count the number of bundles, divide by the Bloom filter capacity. E.g. 10k bundles with a filter capacity of 1000 gives modulo 10. Then use a random offset n and select every 10th bundle starting from n, e.g. 1, 11, 21, etc. (Sketched in code below.)

What does modulo heuristic in Dispersy resemble and what does it not require?
Linear download (1, 11, 21 etc) and doesn’t require state.

What’s the downside of modulo heuristic in Dispersy?
If almost caught up, it'd still take a modulo number of steps to sync up, stepping through each offset (+1/+2 etc) of every 10th bundle.
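
A minimal sketch of the modulo heuristic as I understand it (the function and names are mine):

```python
def modulo_subset(bundles: list, capacity: int, offset: int) -> list:
    """Pick every modulo-th bundle (sorted by global time), starting at offset."""
    modulo = max(1, len(bundles) // capacity)
    return bundles[offset % modulo::modulo]

bundles = list(range(10_000))  # 10k bundles, as in the example above
subset = modulo_subset(bundles, capacity=1000, offset=1)
print(subset[:3], len(subset))  # [1, 11, 21] 1000
```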

For pivot sync, an exponential distribution between 0 and the locally known global time is used to pick a pivot. Then Bloom filters to the left and right of the pivot are selected, and one is picked~. (NOTE: A bit hazy on the exact details here, would need to revisit; exponential means a bias towards newer bundles.)

NOTE/TODO: How would we allow for these types of sync modes? It'd require the global-time partial ordering to be visible to all nodes, AFAIUI.

What does a node do when receiving a Bloom filter in Dispersy?
Checks if it has bundles the sender is missing, by testing its local subset of bundles against the Bloom filter.

What does the receiving node in Dispersy need in order to find out if it has missing bundles to send?

  1. Bloom filter
  2. Subset defined by: a) range of global-time b) modulo and c) offset value.

Synchronization performance

Nice math based on false positive rate, MTU, bundles etc., depending on synchronization level.

NOTE: Interesting to note re false positive rate: AFAIUI this is inversely related to bandwidth performance.

Evaluation and deployment

Actual simulation and empirical study of things described above. Including NAT experiments, random node selection etc. Skimming for now.

Churn resilience based on session time. Propagation speed. Bandwidth consumption. High workloads. Internet deployment (integrated into Tribler).

NOTE: Cell phone performance is much worse than WiFi connections. I.e. carrier-grade APDM routers, hard to puncture. Also UDP concerns.

NOTE/TODO: Look into smartphone NAT traversal state of the art.

To sync 100k bundles Dispersy connected to ~1000 nodes and consumed 30 MB of bandwidth per run. (NOTE: Assuming a bundle is ~1KB, 30 MB is ~30k bundles vs 100k, i.e. ~1/3? Eh, a unit is off somewhere.)

Success rate of 77% across 15 different NAT traversal scenarios (N.B. smartphones worse). Can handle excessive churn (session times of 30s).

Further reading and conclusion

Awesome paper, much better and clearer than the Tribler Dispersy docs. Wish I had read it in detail earlier. Lots of beautiful ideas, backed up by real-world experience and math. Still need to think about the untrusted stuff, but at least the dependencies and mechanics are clearer.

Some citations from paper:

  • Bloom, Burton H. “Space/time trade-offs in hash coding with allowable errors.” Communications of the ACM 13.7 (1970): 422-426.

  • Demers, Alan, et al. “Epidemic algorithms for replicated database maintenance.” ACM SIGOPS Operating Systems Review 22.1 (1988): 8-32.

  • Halkes, Gertjan, and Johan Pouwelse. “UDP NAT and Firewall Puncturing in the Wild.” International Conference on Research in Networking. Springer, Berlin, Heidelberg, 2011.

  • Huang, Hong-Yu, et al. “Performance evaluation of SUVnet with real-time traffic data.” IEEE Transactions on Vehicular Technology 56.6 (2007): 3381-3396.

  • Laoutaris, Nikolaos, et al. “Delay tolerant bulk data transfers on the internet.” ACM SIGMETRICS Performance Evaluation Review. Vol. 37. No. 1. ACM, 2009.

  • Lee, Uichin, et al. “Dissemination and harvesting of urban data using vehicular sensing platforms.” IEEE transactions on vehicular technology 58.2 (2009): 882-901.

  • Singh, Atul, et al. “Defending against eclipse attacks on overlay networks.” Proceedings of the 11th workshop on ACM SIGOPS European workshop. ACM, 2004.

Ran out of time in the timebox, and still have some synthesizing to do, but thought I'd post it here for e.g. Dean et al. TODO: understand the following points in more detail:

  1. What exactly would need to be added to our current setup to enable this (e.g. transparent global time and subset indicators)
  2. How this compares with the current/Briar-ish approach wrt untrusted node assumptions and latency (sync delay quite big) - how can we get the best of both worlds? Also Bloom filter privacy.

Extending MVDS with stateless synchronization

Background

Existing spec for MVDS: https://notes.status.im/O7Xgij1GS3uREKNtzs7Dyw?view#

Data sync research log including Dispersy Bundle Sync: (Mostly) Data Sync Research Log

Bundles and messages are used interchangeably.

Rationale

While the currently suggested implementation takes care of the requirements, it has a few things that can probably be improved:

  • performance in large group chats - minimize chatter
  • leveraging untrusted nodes better for churny nodes

Churny nodes means mostly-offline, smartphones, etc.

One way of doing this is to extend the current protocol proposal with some ideas from Dispersy Bundle Sync algorithm (DBS hereafter).

Overview

DBS uses stateless sync with Bloom filters over subsets of bundle ranges. Essentially it goes something like this:

  1. Node selection
  2. Bundle selection, subset
  3. Bloom Filter construction
  4. Send Bloom filter (they use NAT traversal, but this can be a separate component)
  5. Wait for reply

They assume you want to sync all data with all nodes. Essentially you have a candidate list and then you pick a (possibly untrusted) node from it, using a specific selection algorithm.

For us, we can do this per sync scope, and let nodes be the participants by default. As an extension, other nodes (i.e. any node) can be included, but there are some privacy considerations to work out to make this workable.

Even if we leave node selection alone it would already be an improvement, as you can leverage existing nodes in a group chat, with no difference from current MVDS setup.

The tricky part, and the one that requires some modification, is how to do bundle selection. To make the Bloom filter efficient, we need to be able to select a subset range of bundles/messages to sync. This has some requirements, things that should be doable by any node. That is, it shouldn’t be like a DAG inside the message.

For Bloom filter creation, what it gives you is a way to check that someone is missing messages. I.e. A sends B a Bloom filter over a range of bundles, and B can test against this range. If B has messages A doesn't have, it can send them, just like we do in MVDS right now. And vice versa for A. A sketch of this exchange follows.
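
In the sketch below, a plain set stands in for the Bloom filter over a selected range (so no false positives here); all names are mine, for illustration only:

```python
# Sketch of the offer/response exchange; a set stands in for the Bloom filter.

def make_offer(my_bundles: set) -> set:
    # Stand-in for building a Bloom filter over a selected bundle range.
    return set(my_bundles)

def handle_offer(offer: set, my_bundles: set) -> set:
    # B tests its local bundles against A's filter and returns what A lacks.
    return {b for b in my_bundles if b not in offer}

a_bundles = {"m1", "m2"}
b_bundles = {"m1", "m3"}
print(handle_offer(make_offer(a_bundles), b_bundles))  # {'m3'} -> B sends to A
```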

Bundle selection

To do bundle selection we need a few things. The main property that helps us cut bundles up is the global time property. This is a Lamport logical timestamp which provides a partial ordering of bundles. When a bundle is created, it takes the highest global time it has seen and adds 1. The distribution of these timestamps is assumed to be roughly uniform, and they are not assumed to be unique.
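
A tiny sketch of that rule (illustrative names):

```python
class GlobalClock:
    # Lamport-style logical clock for the 'global time' bundle property.
    def __init__(self):
        self.time = 0

    def on_receive(self, bundle_time: int):
        self.time = max(self.time, bundle_time)

    def on_create(self) -> int:
        self.time += 1  # highest global time seen, plus one
        return self.time

clock = GlobalClock()
clock.on_receive(41)
print(clock.on_create())  # 42; not unique across nodes, but a partial order
```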

We probably want this property to be outside the payload, as this allows any node to participate in this sync. I imagine this would be in the header. I imagine this will be per sync scope, but it might also be global. We need to figure out the trade-offs and how much of this needs to be specified.

In addition to this global time property, we need a few more things. These are more like mechanisms and things for the Bloom Filter Offer payload.

Essentially we want a way to take a specific subset of all bundles available. Then depending on what sync mode we have (old, lots of missing bundles) (new, just give me latest) we use different heuristics. These are the modulo heuristic and pivot heuristic. Without going into too much detail, the main thing we need to communicate here is:

  • global time range
  • some modulo number
  • some offset

In the modulo heuristic case, if we have 10k bundles, we might pick modulo 10, then offset 1…n, and in each sync send a Bloom filter with metadata: [lamport low, high], 10, 1. The receiver would then use the Bloom filter and check if the sender is missing something in bundles 1, 11, 21, etc. Then repeat the process for each offset until we have synced across the whole bundle range.

NOTE: Need to confirm this is how they structure global time range

For pivot sync, as far as I can tell, it's a global time range that we choose probabilistically based on an exponential curve. This has a bias towards the most recent messages.

Changes required

As far as I can tell the only thing we really need here to enable this is to:

  1. Logic: Make node selection step explicit
  2. Logic: Make bundle/message selection step explicit
  3. Logic/Wire: Add Lamport clock to header

As well as later on:
4) Writing the modulo and pivot heuristics
5) Adding Bloom filter creation, offer, and handling logic
- Additional payload with global time, modulo, offset
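
For illustration only, the additional offer payload might carry fields like these (a hypothetical shape, not the agreed wire format):

```python
from dataclasses import dataclass

@dataclass
class BloomOffer:
    # Hypothetical payload shape; field names are illustrative.
    global_time_low: int   # Lamport range start
    global_time_high: int  # Lamport range end
    modulo: int            # e.g. 10 in the 10k-bundle example above
    offset: int            # which residue class this filter covers
    bloom_filter: bytes    # filter over the selected subset

offer = BloomOffer(0, 10_000, 10, 1, b"\x00" * 1024)
```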

Sanity checks and comments welcome.


Briefly on MTU budget

To allow for more efficient transports, it’s useful to be mindful of the MTU. From my current reading, 1500 Bytes seems like a safe standard. This is 1.5KB, or essentially 1KB with some wrapper on it. Some questions we should answer:

  1. How much should we care about this at this stage? I lack intuition to know how big of a PITA this can be, other than knowing it can lead to huge packet loss for UDP with NAT puncturing.

  2. Is this MTU budget correct? Are there other limits in UDP / smart phone routing / mesh networks etc we should care about?

  3. Are there any aspects of our design where this MTU might be hit? Example are: Big cryptographic handshakes; Header overhead; Excessive wrapping; Sphinx packet format; Bloom Filter size; Message payload.

  4. Is there some general heuristic for budgeting these things? I.e. how big of a payload budget do we have, given that we want to encrypt it, put some header, then routing info, etc.
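
On question 4, one strawman heuristic is to budget top-down from the MTU; every overhead number below is an assumption for illustration, not a measured value:

```python
# Strawman MTU budget; all overheads are assumed values, not measurements.
MTU = 1500  # bytes, the 'safe standard' above

overheads = {
    "ip_udp_headers": 28,       # IPv4 (20) + UDP (8)
    "transport_framing": 64,    # assumed
    "encryption_wrapper": 100,  # assumed: nonce, MAC, padding
    "sync_header": 40,          # assumed: clock, ids, flags
}

print(MTU - sum(overheads.values()))  # 1268 bytes left for filter + payload
```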

How much should we care about this at this stage?

Not sure you need to care about MTU at all for a protocol at this layer. I think it makes sense if you are designing one that you know will run directly on UDP (for example a discovery protocol over UDP), but that's unlikely to be the case here, as it will probably have a few layers underneath, and in our specific case it will most likely run over TCP (devp2p).

That's not to say that we shouldn't keep the payload small and lean, but if we try to keep it under 1.5K, then it just gets padded by the encryption layer and uses TCP as a transport, where MTU is less of an issue.

While this is true for the current setup, I think assuming we'll run on TCP forever is too strong. UDP was specifically chosen in e.g. Dispersy due to NAT traversal, and similar considerations are probably true for Internet-less usage and running in the browser. The libp2p folks might have some discussion about this in their forum, but I haven't looked into it.

No assumption on running on TCP forever was made :)

My point is that, as I understand it, this layer will not be directly connected to peers (i.e. you won't have to do any NAT traversal at this layer), and will piggyback on the transport below for these kinds of things (which will have to provide a list of peers and a way to communicate with them).

So unless you are planning to have this running directly on top of UDP/TCP, I don't think we should take the MTU much into account, although as mentioned, we should keep the payload as small as possible, but that goes without saying.


May 13

First, started the paper over, more problem-solution oriented and less of a huge comparison. First section draft done, related works: https://github.com/status-im/bigbrother-specs/commit/219918e0be53d2f035eb579effedfc9c3c21cb34. Tomorrow, continue incorporating existing MVDS etc. requirements into the problem and solution sections. The aim is to get a solid draft this week (~4-5000 words).

Also looked into Swarm feeds a bunch:

Adaptive frequency: https://github.com/ethersphere/go-ethereum/pull/881 and https://github.com/ethereum/go-ethereum/pull/17559

Parallel feed lookups: https://github.com/ethersphere/go-ethereum/pull/1332

What’s the purpose of the adaptive-frequency lookup algorithm FluzCapacitor in Swarm for Feeds?
To find the latest update in a feed, i.e. the latest chunk.

What problem does the adaptive frequency MRU(Feed) in Swarm solve?
Previous implementation required users to agree upon frequency and start time beforehand to publish updates.

What does the adaptive frequency MRU algorithm in Swarm allow users to easily guess?
Where the next update ought to be.

What is an Epoch in Swarm Feeds?
A section of time.

What pieces of data does an epoch consist of in Swarm?
A start time and a level.

What does a level correspond to in Swarm feed epochs?
A duration of time, 2^n.

What is epoch id in Swarm Feeds?
A number corresponding to a binary operation on the start time and level.

What’s the harvesting algorithm for Swarm Feeds?
Describes how to find latest update of a resource by walking back in time.

How does the harvesting algorithm for Swarm Feeds roughly work?
By dividing the grid of time (start time x duration) into smaller and smaller sections for each lookup step.

For Swarm feed updates, if you have a period covered at level n with some duration (start_time to start_time + 2^n), what is always true for epochs at levels below n?
That they are covered by higher levels; each epoch provides 'shade' for the layers below.

What does the parallel feed lookup function do in Swarm?
Explores keyspace more efficiently by launching queries in parallel.

The parallel feed lookup algorithm (LongEarth) solved a timeout issue: before, each lookup to a remote node might wait 100ms, but this could result in a timeout, which results in false lookup failures. Using a longer timeout, say 1s, would solve this. But then each lookup might consist of 5 sub-lookups, which would lead to a ~5s lookup (and even 20s for the first lookup).

What does the parallel feed lookup function (LongEarth) in Swarm basically do?
Binary search, divides epoch grid into lookback area and lookahead area.

Once this is merged upstream: https://github.com/ethersphere/go-ethereum/pull/1332 then false timeouts should be less of an issue for remote node testing.


Louis' proposal, some questions: Some thoughts on chats using with shuffling feed and pss locations (reply)


Want to revisit the video Viktor made on the Swarm vision, and also the (new?) Swarm paper. See if there's time to expand on the Staples PoC a bit.

May 14

Brief update.

Draft now up to 4000 words: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md

Almost all of the system model and enhancements. A bit left on DSP. Next is adding the wire protocol and protobuf as well. Also add SSB. And isolate the Feeds list and getting actual historical data. Then example clients, desired simulations ~, then some references, and we have a solid first draft. Is latency/UX a first-class design requirement, perhaps? Also multi-device sync at some point.

Question on message id compatibility: How are chunk ids in Swarm calculated? How do they differ from our message ids, and can they be consistently mapped? What about IPFS ids? It would be useful if the same piece of content had the same address, even with ~multicodec. If that isn't possible, how would we do the mapping?


By chunk id (swarm) do you mean the 32 byte address that the chunk has on the network, used for routing etc?

That'd be useful! I don't know if it is possible, but having the same or a similar address for content stored/calculated locally, in Swarm, and in IPFS would be great. Perhaps using some multicodec stream. I didn't look into this yet so maybe it is already possible.

Also fyi @yenda I think you asked about this. I ended up hacking this for the Staples PoC by reuploading content to get the hash, hackathon style :P

I doubt the hashes will ever be the same because of the BMT hasher scheme for Swarm, designed to enable "word-size proofs" of 32 bytes. References would have to be wrapped by some other object, but of course that would be possible, I presume. The multicodec would only solve how to literally reference the location under the different schemes, no?

Indeed, the problem is how to go from one like datasync/<hash1> to swarm/<hash2> when they refer to the same content. You can imagine references forming a DAG inside a datasync payload, but that doesn't tell you how to find it in e.g. Swarm. You could overload it and say 'here are some names for this', but that seems ugly, and it'd not allow for nodes to later on replicate on Swarm. It might be doable by considering the wrapping, but I haven't thought through the consequences properly yet.

May 16

Up to 5700 words for draft. https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md

Added a brief SSB summary. Expanded on the stateless sync enhancement and Bloom filter subset considerations. Added a subsection on content-addressed storage and update pointers. Added current specification types and protobuf. Minor references. Added a Status-specific section on caveats for the minimal version (mode of operation, role of mailservers, parent IDs, current implementation).

Next:

  • (Ongoing) add more details to spec and wire protocol
  • Example client section
  • Content addressed storage: We can use multihash/codec to refer to the same piece of content, but how and where do we communicate these so it is uniform? I.e. Swarm, IPFS, Data sync (and Whisper?). I.e. go from data sync ID to swarm chunk id
  • Proof/evaluation/simulation section: desired simulations and variables (churn, latency etc)
  • Future work section (move enhancements into this?)
  • Crystallize system model with components and specific names (CAS, node etc)
  • Understand to what extent latency/UX is a requirement
  • Briefly mention multidevice sync
  • Ensure written in a generalized messaging way, i.e. not just human-to-human chat
  • Outline more generalized messaging use cases (group chat, transactions, state-channels, data feeds, ~m2m, more?)
  • Add references for systems compared
  • Ensure challenged networks captured adequately, i.e. DTN mesh networks too
  • Write conclusion
  • Turn into LaTeX
  • Review - read through whole and make sure structure makes sense; tighten up
  • Get feedback from a few people
  • Read over similar types of papers and see what is missing in terms of rigor/analysis
  • Get feedback/corrections from larger community (ethresearch, specific projects mentioned, etc)
  • Find good place to submit preprint (permissionless)

here are some names for this

I’m not sure I understand. Example?

Where exactly can I read about your message ids? And are those ids what you here call e.g. datasync/ ?

Maybe a specific sync DAG can commit to one specific storage layer, then store the disambiguations there?

I mean this message id: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md#custom-types or https://code.briarproject.org/briar/briar-spec/blob/master/protocols/BSP.md#23-messages in BSP.

By datasync/<hash1> and swarm/<hash2> I mean whatever it would look like using multiformats.

Maybe a specific sync DAG can commit to one specific storage layer, then store the disambiguations there?

Possibly, it’s not clear to me exactly what this would look like though

Couldn’t they just be concatenated multiaddrs in ascending order?