Performance of mailservers


#1

Problem

Currently fetching from mailservers can be fairly slow in some circumstances, which effectively prevent us from fetching more than 24 hours from a given topic, which has a negative impact on user experience.

Initial benchmarks

I have run a few benchmarks on our current dataset to see what it looks like, here are the parameters used:

Total number of envelopes on the database: 1467846 (This is 30 days worth of data)
Total number of envelopes in the period we are interested in (1 week): 316479

Benchmark were run on commodity hardware (16GB RAM, i7), tested on the same machine as production to make sure numbers were comparable initially and continued locally once it was clear that there was little difference.

Also it important to mention that envelope topics are not evenly distributed, as some topics have a much greater share of traffic, which tails down pretty quickly:

 1043120 | \x5c6c9b56 (What is this?, there's roughly a envelope published each second)
  351359 | \xf8946aac (Discovery topic)
    7892 | \x98ac139b
    4441 | \x13f4f374

So that needs to be taken into consideration when querying.

In our current implementation, we specify a timestamp range and a limit, the limit being the max number of envelopes per query we want to publish.

Because we always iterate over each envelope in the timestamp range, the amount of work is constant regardless of the topic, so that the first thing we want to measure (publishing is not included in the benchmark as irrelevant for our purpose):

Also worth noting that each run was performed 10 times, outliers were removed, and pre-benchmark run was made, and the average was taken)

No concurrent queries:

Iterate over a week (316479): ~6s
Iterate over the whole dataset: ~27s

2 concurrent queries:

Iterate over a week (316479): ~8s
Iterate over the whole dataset: ~32s

4 concurrent queries:

Iterate over a week (316479): ~11s
Iterate over the whole dataset: ~48s

Roughly 60/70% of the time is spend iterating over the data, while the rest is retrieving the data from the database, so we can assume that the process is CPU bound. So that the probably the first thing we would like to look at.

The second thing worth noting is that of the envelopes retrieved from the database only a subset in most cases is what we are interested in.

The worst case scenario is asking for a topic with little of no envelopes, as in that case it would still require unmarshaling all the envelopes in the time range and match those against the bloom filter of the query, but none will be published and a single query will take ~6s.
So another performance optimization would be to only retrieve those envelopes that we know are going to be a match, and that requires a different indexing strategy.

Marshaling/Unmarshaling

For each envelope we do 2 things:

  • Unmarshal the object from RLP (expensive operation)
  • Match the bloom filter against the query (very cheap)

If the envelopes matches, we encode it back to RLP and publish it to the peer, which is clearly a waste.

The only reason we have to unmarshal the object is to match the bloom filter, so in order to avoid that we either marshal the bloom filter separately so it can be accessed without unmarshaling the whole envelope, or we index the envelopes in the database so that matching happens at the db level, which is preferable.

But first we look at how performance changes if we don’t unmarshal objects, in order to see if that’s a meaningful optimization:

No concurrent queries:

Iterate over a week (316479): ~1.2s
Iterate over the whole dataset: ~4.5s

2 concurrent queries:

Iterate over a week (316479): ~1.5s
Iterate over the whole dataset: ~6s

4 concurrent queries:

Iterate over a week (316479): ~1.5s
Iterate over the whole dataset: ~6.5s

8 concurrent queries:

Iterate over a week (316479): ~3s
Iterate over the whole dataset: ~14s

So we can see that’s a massive improvement already, and interestingly as well we notice that higher concurrency has less of an impact, until we hit the total number of CPUs (4 cores in my case).

So this is definitely the most important optimization we can do, but needs to come in tandem with a different way of storing envelopes, as we want to avoid having to unmarshal the object, albeit it would still take a long time on busy periods for a user to fetch the last 30 days for a public chat (~14s).

We’ll take a look at indexing next.

Indexing strategies

Here we will look at various indexing strategies, I will describe everything that I considered for completeness, although some of these are not that useful.

Whisper bloom filter

Whisper bloom filters are structured a bit different from canonical bloom filters, but otherwise they are very similar. It is encoded in a 64 bytes array, but can be considered effectively a 512 bits bloom filter.

First we look on how a topic (4 bytes of data), gets encoded into a filter:

Given a 4-byte topic, the bloom filter will chose at most 3 bytes, and then for each of these bytes it might turn on one individual bit.
The resulting byte array will have at most 3 bits sets effectively.
When combining bloom filters, we just OR the individual bytes.

A few things:

  • how many different topics are possible? that’s 4 bytes, so 2^32 different topics: 4294967296
  • how many different bloom filters? we need to use the binomial coefficient here, and sum all the combinations of 3 & 2 & 1 elements: B(512, 3) + B(512, 2) + B(512) = 22238720 + 130816 + 512 = 22370048 .

So assuming that topic->bloom filter is randomly distributed (not sure it is), for each bloom filter encoding a single topic there are 192 possible topics that would result in that bloom filter.

Using a different data store

The first attempt was just to use a different datastore (we use leveldb), just for the sake of curiosity and get a feel on how they perform. So I loaded up the dataset in redis, postgres and mongodb, indexed the timestamp field in postgres,mongodb, and use a similar strategy for redis.

After a few benchmark we saw that redis performed slightly better than leveldb, especially for concurrent queries, followed by leveldb, mongodb and postgres, although performance different in all cases was minimal, so not the improvement we were looking for, as expected.

Boolean fields

The first idea was to encode the bloom filter in multiple fields, one for each byte or one for each bit (resulting in 64 or 512 different fields respectively).

At query time we have a few strategies, the naive way would be to fetch any topic that has at least one bit set from the querying bool filter, resulting in a query such as (using SQL as I’d expect most people will be familiar, but any other db would have a similar, as long as there is support for secondary indexes):

SELECT envelopes WHERE b1 = TRUE or b2 = TRUE....

This will return more envelopes than necessary, especially for filled bloom filters.
If we want to match only what is exactly in the querying bloom filter, our query should look like this (say we have 11110000000…):

SELECT envelopes WHERE (b1 = TRUE AND b2 = TRUE AND b3 = TRUE) OR (b1 = TRUE AND b2 = TRUE AND b4 = TRUE) OR (b2 = TRUE AND b3 = TRUE AND b4 = TRUE) OR (combinations where only two bits are sets) OR (combinations where only one bit is set)

So we can count here the number of statements that are required for each query, by counting the number of combinations of 3, 2, 1 respectively, taking into account the number of bits sets, using the binomial coefficient.

We will look at the worst case scenario first (completely filled bloom filter):

B(512, 3) + B(512, 2) + B(512) = 22238720 + 130816 + 512 = 22370048

and we use a 64 bytes array counting each byte with a value as a “on” bit:

B(64, 3) + B(64, 2) + B(64, 1) = 41664 + 2016 + 64 = 43744

Clearly there are a few issues with this strategy.

If we go for 512 version (more accurate), likely we won’t be able to build a query given large bloom filters, and we would fall back on a table scan.
The 64 version is less accurate, but still within what a database might be able to handle.

Another issue is that most databases likely won’t be handling this queries very well, as not optimized for querying large number of fields.
Just to name a few mongodb limits an index to 32 fields, postgres as well, although it can be recompiled to support larger numbers, in elastic is a 1000.

Integer field

One way to overcome the limitations of the previous approach is to use the same techniques, but instead of having n columns, have a single column containing the integer encoded in the bloomfilter.

Doing that allows us to change our query to:

SELECT * from envelopes where bloom IN (x, y, z)

This has the immediate advantage that we only have to index a single field and database should support this indexes and query patterns.

The same drawback applies here as the method before, this is only really feasible if we consider it a 64 bits field, rather than 512, otherwise our IN query would get way too big to handle (worst case scenario using the 64bit method, would be 21K, as we can flip to NOT IN if it exceeds that number).

After a few benchmarks it’s clear that this strategy works extremely well for small bloom filters (~18ms for a single topic), while for filled bloom filter is essentially a table scan.
This would solve our immediate problem (join a new chat, fetch the last 30 days).

So how fast does our 64bit bloom filters fill up?
We can calculate the expected value of bits on using the formula, as we think effectively at each bit set as a 64 die roll:

64*(1-63/64)^n

Plugging a few n we see:

E(1) = 1
E(15) = 13.4 // 5 topics
E(30) = 24.1 // 10 topics
E(60) = 39 // 20 topics
E(120) = 54 // 40 topics
E(180) = 60 // 60 topics

To calculate the number of topics needed, we can just divide n by 3, as each topic would set 3 bits (in order to make it easier to calculate we assume it always sets 3 bits).

40 topics is basically 40 chats open, which is not uncommon, and at that point already we are looking basically at a table scan.

Pro

  • portable across databases
  • fast for single topics

Cons

  • does not help for normal history queries
  • retrieves more data than it ought to, which will have to be matched against the bloom filter “client” side

Postgres bitstrings

Another way, is to just index the bloom filter using postgres bitstrings and then query it with a query such as:

SELECT data FROM envelopes2 where bloom & b'%s'::bit(512) = bloom

effectively doing the same as we do client side on the database.

This method is not as fast as the previous one for single topics (~381ms vs ~18ms) but performs better for large number of topics (~2.3s vs ~4.5s of a full table scan)

small number of topics

No concurrent queries:

Iterate over a week (316479): ~189ms
Iterate over the whole dataset: ~400ms

2 concurrent queries:

Iterate over a week (316479): ~418ms
Iterate over the whole dataset: ~800ms

4 concurrent queries:

Iterate over a week (316479): ~850s
Iterate over the whole dataset: ~1.5s

8 concurrent queries:

Iterate over a week (316479): ~1.5s
Iterate over the whole dataset: ~2.5s

Large number of topics

No concurrent queries:

Iterate over a week (316479): ~500ms
Iterate over the whole dataset: ~1.5s

2 concurrent queries:

Iterate over a week (316479): ~1s
Iterate over the whole dataset: ~3s

4 concurrent queries:

Iterate over a week (316479): ~2s
Iterate over the whole dataset: ~6.2s

8 concurrent queries:

Iterate over a week (316479): ~4s
Iterate over the whole dataset: ~12.5s

We can see it performs better on average for a large number of topics, while it performs worse on average for a small number of topics, although performance are acceptable for our use cases.

Pros

  • Good overall performance
  • Easy querying

Cons

  • Database specific, not portable
  • Perform worse on small number of topics

Topic querying

A different way to approach the problem is to instead of matching each envelope in the database against a bloom filter, we can match the topics that we have in the database against that bloom filter and then query for those topics directly.

Currently we have 9606 distinct topics in a month, although that number is bound to increase as we move to partition topics/topic negotiation, it probably safe to assume that it would be < 100K for a month.

small number of topics

No concurrent queries:

Iterate over a week (316479): ~7ms
Iterate over the whole dataset: ~8ms

2 concurrent queries:

Iterate over a week (316479): ~3ms
Iterate over the whole dataset: ~8ms

4 concurrent queries:

Iterate over a week (316479): ~5ms
Iterate over the whole dataset: ~20ms

8 concurrent queries:

Iterate over a week (316479): ~6ms
Iterate over the whole dataset: ~40ms

large number of topics

No concurrent queries:

Iterate over a week (316479): ~1s
Iterate over the whole dataset: ~4s

2 concurrent queries:

Iterate over a week (316479): ~1.3ms
Iterate over the whole dataset: ~4s

4 concurrent queries:

Iterate over a week (316479): ~2.3s
Iterate over the whole dataset: ~8s

8 concurrent queries:

Iterate over a week (316479): ~4s
Iterate over the whole dataset: ~20s

So we can see it performs extremely well for a limited number of topics, but perform badly for a large number of topics (at which point a table scan is much more efficient).

Pros

  • Performs well for single-topic / small number of topics queries
  • Portable across databases with secondary indexes

Cons

  • Performs badly with large number of topics
  • Scales with the number of different topics in the network

Conclusions

Here’s a chart for non-concurrent queries:

| Method     | 1ws   |  1ms | 1wl | 1ml  | 
| Table scan | 1.2s  | 4.5s | 1.2s|4.5s  |
| BitString  | 189ms | 400ms|500ms| 1.5s |
| Topic      |  7ms  | 8ms  | 1s  | 4s   |

1ws: 3 topics, 1 week
1ms: 3 topics, 1 month
1wl: 256 topics, 1 week
1ml: 256 topics, 1 month

Overall definitely we want to stop unmarshalling envelopes (PR is already made), which will give us the best boost, but needs to be combined with a indexing strategy (or at least a change in the way we store envelopes).

Initially I was leaning to topic querying, having only benchmarked small topic and still processing the unmarshalling, but turns out it performs very poorly as the number of topics increases, at which point a table scan is more efficient.

The decision to make is whether we are ok having postgres (an external process) as a database, which probably will have to be baked in the docker image ( @jakubgs ) , if it’s so, I think bitstring offers the best average performance. If we want to squeeze the best out of it, we can mix the approaches and query by topic for low topics, although probably unnecessary at this stage.

Either way we would solve our immediate problem (a user will be easily able to fetch the last 30 days of history of a given public chat), and as we move to partitioned topic it should also help with those use cases.

If we want to keep using an embedded database, then probably topic querying and falling back to table scan is our best bet.

Let me know what you think, other strategies, if I have missed something etc.


#2

i LIKE this post about performance :blush: my hat’s off to u


#3

Thank you for the great analysis! I don’t personally have any problems with running an out of proc db here, it looks like a good fit for out scenario


#4

I see no issue with running PostgreSQL for our status-go processes if it is the best option for good results across different types of queries. Though I’d prefer to run it as a separate container and linking them. Combining multiple services in one container is an anti-pattern and makes managing and debugging the services more difficult.

I guess if we want to make PostgreSQL as the only available DB to use then we could do it as one image, but if status-go would retain an ability to use LevelDB(or some other embedded databse) as a default, and have PostgreSQL as an option with better performance, then separate containers make more sense.

Thanks for this post, really insightful!


#5

Initially I would expect to at least running them in tandem, and write to both db, so we can easily revert if things go awry, if we decided to go for a pg strategy that is.


#6

First PR, which will improve performance using LevelDB, https://github.com/status-im/status-go/pull/1459 , which is worth doing as we might want to offer leveldb as the default option with postgres possibly as a second option if we want to improve performance.


#7

Code is now running on staging I will leave it running for a day or so to make sure, and then I will push the images on the beta fleet.


#8

Thanks for the detailed analysis. Another concern with using Postgresql is that it’s rather heavy for a p2p app, and it likely further entrenches the special treatment our cluster has. I’d prefer the default dependencies stay as nimble as possible. Are there any other p2p client apps that use Postgresql?

As long as it is pluggable and we don’t spend too much time optimizing for “server farm” use case it seems like fine to me.


#9

This is only a dependency for mailservers (clients won’t be running this code), and it will be configurable (we might want to run postgres, but the average user who has their own mailserver is probably better off using leveldb).

If we decide clients as well should be serving envelopes (a la data sync), we’ll likely need a different indexing strategy (as it is now, just a key value store would be fine), so we might want to revisit, but in any case we definitely will not want to use postgres for clients, we would use an embedded db (leveldb/sqlite).


#10

also, for data sync we are more interested in specific messages and not in date-ranges matching bloom filters, right?
that makes key value stores or SQLite much more viable there, than with our current mailserver structure.


#11

A version of the code that can use Postgres & bitstrings as now been merged to status-go https://github.com/status-im/status-go/pull/1462 and deployed for testing on mail-01.do-ams3.eth.test.

Nodes will be using leveldb by default, and optionally use postgres.

Once we validate that is working fine, we will promote to the other fleets.