We all experience issues with chat reliability from time to time. It may be a message not being sent or channel history not loading. Recently, we started thinking (again) about how we can track those issues and debug them. In this post, I'd like to list all the approaches we have tried in the past and start a discussion about what to do next to get a full understanding of current chat reliability problems.
How we used to monitor chat issues
Chat reliability surveys
Our users received in-app surveys asking if they felt confident about using the chat. We used Instabug to show the surveys and gather results. As far as I remember, we did not get many responses, and the average response was that the chat is rather reliable.
Received / sent ratio calculated by tracking messages in internal builds
To have a precise metric for chat reliability, we tried to track the received / sent ratio for messages. We used internal builds (nightlies) and Mixpanel to track message IDs and calculate the ratio. Unfortunately, we could not rely on this metric in the end, as there was no way to tell messages that were not received because the recipient was offline apart from messages lost to real chat issues. Also, we did not want to have any tracking of this kind in the app (even in internal builds).
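To make the limitation concrete, here is a minimal sketch of how such a ratio could be computed from tracked message IDs. It is not the code we actually ran: the `Event` type, the `received_sent_ratio` function and the sample events are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical tracking event, one per sent or received message."""
    kind: str        # "sent" or "received"
    message_id: str


def received_sent_ratio(events):
    """Return the share of sent message IDs that were also seen as received."""
    sent = {e.message_id for e in events if e.kind == "sent"}
    received = {e.message_id for e in events if e.kind == "received"}
    if not sent:
        return None  # nothing was sent, so the ratio is undefined
    return len(sent & received) / len(sent)


events = [
    Event("sent", "m1"), Event("received", "m1"),
    Event("sent", "m2"),  # recipient was simply offline
    Event("sent", "m3"),  # lost because of a real chat issue
]
ratio = received_sent_ratio(events)
```

In this sketch, `m2` and `m3` lower the ratio in exactly the same way, even though only `m3` is a reliability problem. Nothing in the tracked events tells the two cases apart, which is why the number could not be trusted.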
Reporting issues with logs via Instabug
Up until 0.9.30, our users could file a bug report directly from the Status app by shaking their phones. Apart from an issue description, a screenshot and logs were attached to each report. This method of reporting bugs and requesting features was really popular, and we received a lot of great feedback. The attached logs (console logs such as Android logcat) helped us debug the issues.
We got rid of it in 0.9.30 as part of removing third-party dependencies for security and privacy reasons.
For internal builds, we used to have TestFairy sessions with videos and logs of app usage. These were used internally to debug issues and to provide additional logs for bugs reported by the QA team.
It was also removed in 0.9.30 for security reasons.
We have https://canary.status.im and https://prometheus.status.im to monitor mailservers and notify us of incidents. So, theoretically, we should be fully aware of any issues with chat history that are related to the backend side (e.g. a mailserver is not responding).
What we have today
As far as I know, we currently monitor only mailservers. This means we know very little about what is happening on the client side. That is risky, as we are not aware of issues users may be experiencing today due to strange network conditions, geolocation (being far from mailservers), etc.
But the first thing we should do is provide a way for core contributors to report issues they experience, with all the logs required to understand and fix the underlying problems. This may include more detailed logs (geth.log, logcat) or a better way to share them with others.
I see the success metric as:
- Having a list of all chat issues
- Being fully aware of their causes and impact
- Having a clear understanding of how to fix them