Diagnosing chat reliability issues


#1

Hello,

We all experience issues with chat reliability from time to time. It may be a message not being sent or channel history not loading. Recently, we started thinking (again) about how we can track those issues and debug them. In this post, I’d like to list all the approaches we have tried in the past and start a discussion on what to do next to get a full understanding of current chat reliability problems.

How we used to monitor chat issues

Chat reliability surveys

Our users received in-app surveys asking if they feel confident about using the chat. We used Instabug to show them and gather results. As far as I remember, we did not have many answers and the average response was that the chat is rather reliable.

Received / sent ratio calculated by tracking messages in internal builds

To have a precise metric for chat reliability, we tried to track the received / sent ratio for messages. We used internal builds (nightlies) and Mixpanel to track message IDs and calculate the ratio. Unfortunately, we could not rely on this metric in the end, as there was no way to differentiate between messages that were not received because the user was offline and messages lost to real chat issues. Also, we did not want to have any tracking of this kind in the app (even in internal builds).
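To make the limitation concrete, here is a minimal sketch of how such a ratio could be computed from two lists of message IDs. The function name and the ID format are made up for illustration; this is not the Mixpanel pipeline we actually ran.

```python
def delivery_report(sent_ids, received_ids):
    """Return (ratio, missing) for one sender/receiver pair.

    ratio is None when nothing was sent, so callers never divide by zero.
    """
    sent = set(sent_ids)
    if not sent:
        return None, []
    received = sent & set(received_ids)  # ignore IDs this sender never sent
    missing = sorted(sent - received)
    return len(received) / len(sent), missing


ratio, missing = delivery_report(["m1", "m2", "m3", "m4"], ["m1", "m3", "m9"])
print(ratio, missing)  # 0.5 ['m2', 'm4']
```

Nothing in the inputs says why m2 and m4 are missing. An offline recipient and a real delivery failure produce exactly the same output, which is why the number alone could not be trusted.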

Reporting issues with logs via Instabug

Up until 0.9.30, our users could file a bug report directly from the Status app by shaking their phones. Apart from an issue description, a screenshot and logs were attached to each report. This method of reporting bugs and requesting features was really popular, and we received a lot of great feedback. The attached logs (console logs like Android logcat) helped debug the issues.

We got rid of it in 0.9.30 as a part of removing 3rd party dependencies for security and privacy reasons.

TestFairy sessions

For internal builds, we used to have TestFairy sessions with videos and logs of app usage. This was used internally to debug issues and provide additional logs for bugs reported by the QA team.

It was also removed in 0.9.30 due to security reasons.

Mailserver monitoring

We have https://canary.status.im and https://prometheus.status.im to monitor mailservers and be notified of incidents. So theoretically, we should be fully aware if there are issues with chat history that are related to the backend side (e.g. a mailserver is not responding).

What we have today

As far as I know, we currently only monitor mailservers. This means we know very little about what is happening on the client side. That is risky, as we are not aware of issues users may experience today due to strange network conditions, geolocation (far from mailservers), etc.

But the first thing we should do is provide a way for core contributors to report issues they experience, with all the logs required to understand and fix the underlying problems. This may include more detailed logs (geth.log, logcat) or a better way to share them with others.

I see the success metric as:

  1. Having a list of all chat issues
  2. Being fully aware of their reasons and impact
  3. Having a clear understanding of how to fix them

#2

cc @Chad @anna @Graeme @pedro


#3

Thanks @lukasz,

I’d be more than keen to submit logs for inspection of an issue. If we have a copy of the logs you mention, could you link to it in comments? Would be interesting to see the extent of data which is revealed.

Otherwise, is an emulator (or a set of emulators) an option for monitoring client-side performance? Something like: https://eggplant.io/products/dai/eggplant-functional/


#4

Connecting to peers all the time suddenly.
All mailservers seem to be unresponsive.


#5

Same. https://github.com/status-im/status-react/issues/6700

UPDATE: Appears to be fixed now.


#6

Thanks for the detailed post @lukasz.

I mentioned the following in the Chat meeting, but probably worth documenting it here for the discussion:

  • right now we don’t have a way to enforce that we don’t inadvertently merge a geth change that will break compatibility with the deployed eth.beta cluster. I plan to add a simple canary test to the Makefile (alongside test-e2e) that will be run by CI just so we have an early warning. There is still the issue of handling upgrades of cluster while clients slowly update (maybe we only update half of the cluster initially?)

  • I’m starting to think that the whole idea of pinning a mailserver is not bringing us anything positive. It might be useful in the future when you are paying for using the mailserver, but right now I feel selecting a mailserver should be a non-sticky selection that can be overridden if the app detects that a better choice exists;

  • at Plex we used to have a toggle in the Advanced Settings section that enabled debug logs for a set period of time (20 minutes). It worked well by doing a few useful things to help with debugging:

    • it enabled an embedded HTTP server that exposed live logs and settings (main reason for the 20 minutes timeout);
    • it set the log level to debug (we could do the same for Status.log and geth.log);

    After we reproduced an issue, we could click a button and a zip file would be generated and prepared for emailing, containing - in our case - Status.log, geth.log and maybe a file detailing the system configuration (e.g. network configuration and conditions). Having something like this in our apps (both desktop and mobile) sounds like low hanging fruit that would bring standardization and remove questions like “where do I find the logs?”, “which log files are useful?”, “am I missing something before reporting the issue?”
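    To sketch the two pieces described above (a time-limited debug session and a one-click zip of the logs), something like the following could work. The class and function names are hypothetical, the log contents are passed in as bytes rather than read from real file paths, and the embedded HTTP server is left out; this is not what Plex shipped or what Status has today.

```python
import io
import time
import zipfile


class DebugLogSession:
    """Debug logging that switches itself off after a fixed period."""

    def __init__(self, duration_seconds=20 * 60, clock=time.monotonic):
        self._duration = duration_seconds
        self._clock = clock  # injectable so the timeout can be tested
        self._enabled_at = None

    def enable(self):
        self._enabled_at = self._clock()

    def is_active(self):
        if self._enabled_at is None:
            return False
        return self._clock() - self._enabled_at < self._duration


def build_report(files):
    """Pack a {file name: bytes} mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

    The app would check is_active() when deciding the log level, and call build_report() with the contents of Status.log, geth.log and a system configuration summary when the user taps the button.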


#7

Received / sent ratio calculated by tracking messages in internal builds

i had an idea similar to this one, but using a bot. the bot would have a reader part and a writer part, isolated from each other. both would connect to 2-4 randomly discovered peers, the same as our clients do. the writer would produce a message at some interval and, after a communication round, report the sent/received ratio. additionally, they would query known mail servers and verify that each mail server has the full state.

this will provide decent healthcheck for the decentralized network and part of mail servers. it might be actually extended to all discovered mail servers. but they need to be registered somehow. so, the goal would be have insights how reliable is the network. we can even go further and simulate conditions that people are experiencing with 3g/4g or flaky wi-fis. this is pretty easy with tools like comcast.
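a rough sketch of the bot idea, with the network replaced by an in-memory stand-in that drops messages at a configurable rate. everything here (class name, message id format, the loss model) is an assumption for illustration; a real bot would talk to actual peers and mail servers, and the flaky network would be produced by a traffic shaping tool rather than by a random number generator.

```python
import random


class LossyNetwork:
    """Stand-in for the real network: drops each message with loss_rate."""

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate
        self._rng = random.Random(seed)
        self.inbox = []  # what the reader bot ends up seeing

    def publish(self, message_id):
        # random() is in [0, 1), so loss_rate=0 never drops and 1 always drops.
        if self._rng.random() >= self.loss_rate:
            self.inbox.append(message_id)


def run_round(network, count):
    """Writer sends `count` messages; returns (sent ids, received ratio)."""
    if count <= 0:
        raise ValueError("count must be positive")
    sent = [f"msg-{i}" for i in range(count)]
    for message_id in sent:
        network.publish(message_id)
    received = set(network.inbox) & set(sent)
    return sent, len(received) / len(sent)


def missing_from_mailserver(sent_ids, mailserver_ids):
    """Ids the writer sent that a mail server cannot return."""
    return sorted(set(sent_ids) - set(mailserver_ids))
```

because both sides are bots we control, a missing message here cannot be explained by a user being offline, which was the weak point of the earlier metric.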

another pain point is that it is hard to implement full e2e tests in controlled environment. we can easily create a network with any packet loss and latency, using status-scale, and collect any metrics that we want. but we cannot do it for whole status-protocol, including part in react. it might be very useful to have such tests, especially if we will be adding acks, in some form, and re-transmission.


#8

@Graeme
We already developed something similar. See https://github.com/status-im/status-react/pull/6692. From my experience, it’s hard to rely on those external device providers.

@pedro

I feel selecting a mailserver should be a non-sticky selection that can be overridden if the app detects that a better choice exists;

It makes sense to me. As a user, I just want Status to be rock solid. That means the app making decisions in my best interest (like switching mailservers). Of course, I need to fully understand what is happening (UX).
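As a rough illustration of what a non-sticky selection could look like: keep the current mailserver unless it is unreachable or another one is clearly better. The function name, the use of latency as the "better" signal, and the 100 ms switch margin are all my assumptions, not anything the app does today.

```python
def choose_mailserver(current, latencies_ms, switch_margin_ms=100.0):
    """Pick a mailserver from {name: latency in ms, or None if unreachable}.

    The margin avoids flapping between servers with similar latency.
    """
    reachable = {
        name: latency
        for name, latency in latencies_ms.items()
        if latency is not None
    }
    if not reachable:
        return current  # nothing better is known, keep what we have
    best = min(reachable, key=reachable.get)
    if current not in reachable:
        return best  # current server is down or was never set
    if reachable[current] - reachable[best] > switch_margin_ms:
        return best
    return current
```

With this rule, a server that is only slightly faster does not trigger a switch, so the UX question reduces to telling the user when a switch does happen and why.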

at Plex we used to have a toggle in the Advanced Settings section that enabled debug logs for a set period of time (20 minutes)

This would really help us. Currently, we can get the logs manually, but it takes time and (basic) skills to do it. A one-click solution would be great. Btw, Android has a similar option: when you execute “adb bugreport”, Android generates a zip file with all logs. We could do the same from Status Advanced settings so everyone can trigger it.
@pedro could you create GHI for it describing what kind of logs / data we want to have in that zip?

@dmitrys

i had an idea similar to this one, but using a bot.

I think this is the way forward. Instead of tracking users’ messages, let’s have a test environment that we can fully control. Some time ago we created a full e2e test at the UI level (cc @Anton) that did this, but it was very simple and we could not control the environment (network conditions, etc.). I really believe we should do it at the status-go level. That way it will be easier to maintain, extend and control.


#9

Here you go:


#10

For sure we should avoid involving e2e functional tests in tracking message reliability, since they were designed for a completely different purpose.

However, we can try dockerized Appium, like in https://github.com/status-im/status-react/pull/6692, to get more control over the environment.