Better pull requests process


#1

Problem

Our PRs are too big and are merged too slowly. This leads to general frustration, regressions, commits that are harder to revert, additional rebase workload, a longer feedback cycle, etc.

Context

We discussed this during the last Core dev call. See below for notes. This is mostly relevant to Status Core and related swarms, though expertise from other projects that have a solid workflow is much appreciated (cc @iurimatias e.g.).

Additionally, we are practicing trunk-based development https://trunkbaseddevelopment.com/, which means we use things like feature flags to hide non-ready features.
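As a rough illustration of how a trunk-based feature flag works (the flag names and rendering functions below are invented for the example, not taken from status-react):

```python
# Minimal feature-flag sketch: non-ready features are merged to trunk
# but hidden behind a flag that defaults to off.
FEATURE_FLAGS = {
    "new-wallet-flow": False,  # merged, but not yet user-visible
    "dev-mode": True,
}

def feature_enabled(name: str, flags: dict = FEATURE_FLAGS) -> bool:
    """Return True if the named feature should be shown."""
    return flags.get(name, False)

def wallet_screen() -> str:
    # Route to the new flow only when its flag is toggled on;
    # until then, users keep seeing the old flow.
    if feature_enabled("new-wallet-flow"):
        return "render new wallet UI"
    return "render old wallet UI"
```

The point is that the new code ships to trunk continuously, while the single toggle decides what users actually see.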

There’s also a release process that impacts this: release process DOC

As well as a manual QA process: <TODO: Add link cc @anna>

And a automated test process: <TODO: Add link cc @Anton)

Here’s the current contribution/PR guide: https://status.im/docs/contributor_guide.html

Design review process / design -> code process doc: <TODO Add link cc @andmironov @hester)

Here’s the pipeline we use for contributions: https://github.com/status-im/status-react/projects/7

Metrics

A sample from the biggest projects at Status. Spot the difference.

PR stats 2019-02-13

status-react: 30 open PRs. Oldest PR: 111 days. Median PR: 7 days.
status-go: 5 open PRs. Oldest PR: 65 days. Median PR: 7 days.
embark: 4 open PRs. Oldest PR: 12 days. Median PR: 3 days.
nimbus: 1 open PR. Oldest PR: 15 hours.
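Stats like these are easy to compute from a list of PR open dates; here is a minimal sketch with invented sample data:

```python
from datetime import date
from statistics import median

def pr_age_stats(opened: list, today: date) -> dict:
    """Count, oldest age, and median age (in days) of open PRs."""
    ages = [(today - d).days for d in opened]
    return {"open": len(ages), "oldest": max(ages), "median": median(ages)}

# Invented sample: four PRs still open as of 2019-02-13.
stats = pr_age_stats(
    [date(2019, 2, 1), date(2019, 2, 6), date(2019, 2, 10), date(2018, 11, 1)],
    today=date(2019, 2, 13),
)
# stats -> {"open": 4, "oldest": 104, "median": 9.5}
```

Tracking these numbers over time (rather than as a one-off) would make the "measure until we are better" suggestion below actionable.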

Problem space

  • Too long-lived PRs
  • Lack of consistent use of feature flags for new flows (add code, not change, toggle at last point)
  • Hard to revert commits
  • Manual QA bottleneck, especially near release => add code to existing PRs
  • Hard to do good review on huge PRs
  • Current process doesn’t help dev get things merged due to manual QA rules
  • QA lacks resources for manual testing, and automated testing
  • Mobile and multiple build targets means PRs and automated testing more complex

Acceptance criteria and design review:

  • Implementation doesn’t match design
  • Developer components and Figma components differ
  • Figma uses fixed styles / lacks automatic responsiveness

Solution space

  • Use flags wherever possible, allowing for feature to be turned on in dev mode, then do final design review
  • Remove manual QA as a blocker for PR merging
  • Don’t dedicate all QA to release
  • QA measuring ‘time to test/unblock’ as a target to optimize for
  • Devs writing automated tests
  • QA testing nightly
  • Decouple automated tests from PRs
  • Figure out way to decrease distance between design and dev components (responsive layout, fiddle, etc)
  • Don’t allow WIP PRs
  • Reject PRs as ‘can’t review, too big’
  • Make it easier to revert commits
  • Close PRs older than a week by default
  • Measure above until we are better
  • Culturally enforce this and make it the norm that PRs are merged within a day
  • Devs taking more responsibility to test their own stuff more and liberally reverting bad commits

Notes from Core Dev Call

Better PRs

Oskar: we tend to have too long-lived PRs. Previously we decided to use feature flags much more liberally, but it seems we have fallen out of practice with that.
Andrea P.: In some situations we don’t want to use feature flags. When we near releases, testers are too busy testing the release, which leads to people adding code to existing PRs.
Julien: It becomes hard to do a good review on a huge PR.
Pedro: there’s a lot to be said about breaking PRs into smaller ones, since people will be more keen to act on a PR review if they anticipate it will take a short time. If we see that still doesn’t work, we can try to diagnose where the bottleneck is.
Andrea P.: The system we have in place doesn’t help the dev. We should relax rules for manual QA, since some PRs which have big impact are delayed before they reach develop.
Igor: agree we should relax the manual QA rule on small PRs, otherwise we’ll start having dozens of PRs in testing and create a bottleneck for QA.
Anton: makes sense to have small PRs, since there’s less stuff to update on the tests codebase for each PR.
Oskar: taking the example of the wallet redesign PR, why do we replace the UI flow, instead of adding the new one behind a feature flag, allowing QA to decide when to move to the new tests?
Iuri: we have a rule in that we don’t allow WIP PRs, with the exception of bounty PRs. If there’s a PR open, the idea is that the PR should be merged the same day, or possibly the next. Otherwise there’s a lot of churn with constant rebasing. It’s important for PR reviewers to be sensible and not ask for changes that are unrelated to the PR at hand (e.g., no asking for bug fixes that were uncovered by the PR work, but unrelated)
Oskar: we should strive to make status-react have a cleaner PR backlog like with Embark (oldest open PR is 11 days, with 9 open, even though the repo is more active).
Igor/Pedro: what does the Embark QA process look like? QA has been an impediment in recent weeks, since with the focus on 0.9.33 release testing, the resources allocated for PR testing have been almost non-existent.
Oskar: that doesn’t give enough credit to Embark, we shouldn’t assume there’s nothing to learn because they are too different from status-react.
Barry: If anything, it’s despite the tooling, not because of the tooling, that we manage to be quick.
Oskar: how can we improve testing speed without adding new resources?
    this is the reason why we dedicated capacity to automated testing;
    Anton: we don’t have capacity to cover each scenario, so we do it manually. We plan to do it in the future, but the future never comes because there are more and more PRs;
    Oskar: I have a hard time buying that we need more resources, there must be something else we can do to improve speed;
    Igor: maybe we can not do all-hands-on-deck for release testing, and then they’ll take longer to complete. There were a couple of situations that required this type of organization to be able to respond in time.
Oskar: this is a long topic, does anyone else want to contribute to the discussion?
Andrea P. will start a Discuss post about this subject.

Acceptance criteria and design review

Rachel: I’ve been observing a lot of PRs where the implementation doesn’t exactly match the design. It’s a general problem, not specific to one or two PRs.
Hester: Eric has done a lot of good work implementing components, so hopefully these issues will be less and less as more things are componentized.
Barry: Components as laid out by designers in Figma are different from the way that developers implement them. Figma uses fixed styles.
Can we lay things out in Figma using Flexbox or percentages instead? (To-do: ask designers to follow up)
Eric: we need a way to sync on which components have been implemented, as I’ve seen components being implemented twice (not necessarily the same aspects of them). We could do a separate call to sync up (Rachel will set it up).

Next steps

  1. Continue to discuss solutions space here.
  2. Make decision about specific things we’ll try and who will ensure it happens.
  3. Follow up next core dev call and possibly TH.

ADDENDUM: The above “problem space” and “solution space” sections aren’t set in stone; they are merely a first attempt at aggregating the problems and potential solutions people have pointed out. Some people below have proactively taken these and crystallized them further, separating the wheat from the chaff to understand what the actual problems are and what some solutions might be. More of this please! Don’t get too hung up on whatever the above laundry list says. I should have made this clearer in the initial post.


#2

Regarding feature flags, I think the current workflow needs to be adjusted a bit:

Currently, in order to remove a feature flag, the feature needs to be tested (manual QA), regardless of the environment. This makes feature flags less effective, as one of the reasons you’d want a feature flag is to selectively deploy the feature to different sets of users (canary).
For example, it would be very useful to test out a feature on nightly users, and get it to QA only once it is to be enabled in release.

So that’s a possible workflow: anything that is not enabled in release can be merged with reviews only, while anything that impacts release needs to be manually QA’ed (of course with some exceptions).
Desktop remains a bit of an issue, as most users, I believe, are not on release cycles.
In my opinion it is also worth packaging nightlies as a separate app, as we have had a few issues with downgrading to release, etc., though testing upgrades becomes slightly more difficult.
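The canary-style workflow proposed here, merge with review only unless the flag is enabled in release, could be sketched as resolving flags per build channel (all feature names and channels below are hypothetical):

```python
# Sketch: resolve each feature flag per build channel, so a feature can
# be exercised by nightly users before it ever reaches release (and QA).
FLAG_CHANNELS = {
    "new-wallet-flow": {"nightly"},        # canary: nightly users only
    "group-chat": {"nightly", "release"},  # fully rolled out
}

def enabled_for(feature: str, channel: str) -> bool:
    """Is this feature visible on the given build channel?"""
    return channel in FLAG_CHANNELS.get(feature, set())

def needs_manual_qa(feature: str) -> bool:
    # Per the proposed workflow: only features enabled in release block
    # on manual QA; nightly-only features merge with review alone.
    return enabled_for(feature, "release")
```

Under this sketch, flipping a feature from nightly-only to release is exactly the point where manual QA gets involved.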

add code, not change, toggle at last point

While I agree this is a good choice in some cases, I think it needs to come with a disclaimer. What you are really doing here is delaying integration of the code (yes, you are merging it, but only nominally), similarly to what you do with long-lived branches. You are making the process much more visible, but it relies on developers who modify related parts of the codebase to ensure the new part is also updated. So care must be taken before settling on this approach, as most of the time a plain feature flag might do just as well, since it will inevitably have less un-integrated surface behind the toggle.

I know Rich Hickey talks about a similar concept, but in a different context (API versions: write a new one, leave the old one in place). Similarly, that’s common with microservices & APIs (keep the old one running, push out the new one with a changed interface). When it comes to git flows and UI changes, I have fewer references; is there anywhere you could point me to for some literature?


#3

Agree re adjusting the workflow. If a flag is on by default and we have automated tests, I don’t see why we need manual QA to remove it. Also agree re having experimental flags and generally making it more dynamic, I’d love to see that! E.g. a “new wallet flow (experimental)” toggle in dev mode. Really, a release is largely a set of features that we decide to ship based on their polish. If a feature is off and it still breaks stuff, it is bad code and should be reverted.

Nothing specifically for git flows and UI changes, but I think the same principles apply. Is there something specific that you think makes this difficult? It’s quite common to present different flows to different segments, and especially with functional programming / proper state management this is a lot less hairy. Bonus points if you can express the UI flow as a finite state machine! For git flows, one thing I’m not clear on is how, or to what extent you’d even want to, deal with flags for different versions of a library. I’d be curious to understand how Google etc. deal with this. Rich Hickey has talked about this problem as well, IIRC.
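As a toy example of expressing a UI flow as a finite state machine (the screens and events are invented for illustration): states are screens, events are user actions, and a lookup table holds the legal transitions.

```python
# Toy FSM for a wallet UI flow. Unknown (state, event) pairs simply
# keep the current state, i.e. illegal events are ignored.
TRANSITIONS = {
    ("home", "open-wallet"): "wallet",
    ("wallet", "send"): "send-form",
    ("send-form", "confirm"): "transaction-sent",
    ("send-form", "cancel"): "wallet",
}

def step(state: str, event: str) -> str:
    return TRANSITIONS.get((state, event), state)

def run(events: list, start: str = "home") -> str:
    """Fold a sequence of user events into a final screen."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

A new flow behind a flag then becomes a second transition table, which makes the old/new split explicit and testable.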

https://martinfowler.com/articles/feature-toggles.html is a good primer for flags in general and talks about a lot of techniques.


#4

Nothing specifically for git flow flows and UI changes

Just to clarify: the canonical use of feature flags (as described by Martin Fowler etc.) I have used many times in the past, and I think it’s a useful flow.

What I have not used in the past, and have some concerns about, is this spin (add code, not change, toggle at last point), which I am not sure is battle-tested; I have only used similar methods in different contexts (APIs, microservices, etc.).

Essentially this method is saying “rewrite, don’t change”, which might be an option but in my opinion should not be the default option. It works well with something that is not changing (legacy) or, of course, with new features, but it gives you integration problems on parts of the codebase that change often, as you are not really integrating the code.


#5

I don’t agree with not allowing WIP PRs; they are very convenient for building and testing on other platforms.

Another thing that hasn’t been mentioned is zero tolerance for logic in views. If there is no logic in views, there is much less room for bugs that can’t be caught with regular Clojure tests.


#6

If PRs are being used to build artifacts then this should be done in a different tool, perhaps even using Jenkins itself by monitoring pushed branches instead of PRs.


#7

To specifically address this question

There were two alternatives to this large PR:

  • piecemeal PRs with incremental changes - not possible, as a single screen is being replaced with multiple screens, so a partial solution is not runnable.
  • wallet code parallel to the existing one, as in @oskarth’s suggestion - this actually wouldn’t fix the problem without the partial approach above, as the new solution would still need to be developed in its totality before finally being merged/enabled.

However, the downside of the second approach would’ve been countless regressions. Duplicate code (especially when we’re talking about re-frame events) that does the same thing equals a super-complex, unpredictable state machine. This approach would’ve effectively undone the wallet bugfixing effort from last year. And because of the added splitting overhead, we’d be having this discussion in, say, August instead of February. But we’d have the illusion of progress because “PRs are rolling, we just keep merging one after the other”.

I’m not a fan of long lived PRs myself. For one thing, it’s much easier to work with small incremental ones. But we need to use common sense when to use what.

We had this same discussion last year over the chat protocol; does anyone remember that?


#8

@oskarth Does it mean running e2e tests only against nightly? Please, elaborate.


#9

That makes sense if there is a special job to get the artifact. Monitoring will clog our Jenkins, because building C++ code for 5 different platforms plus e2e takes time. Each of our builds consists of:

  • ios
  • ios-e2e
  • android
  • android-e2e
  • mac
  • windows
  • linux

Not all the code is a simple cross-platform thing that you can compile once and use everywhere. Each job takes about 20 minutes to complete now (though they run in parallel).
Sure, we can probably extract the Clojure building step, but it is not trivial because prerequisites differ between platforms.


On the other hand, we should probably have a manual job where a person can request an artifact build for a particular platform (or all of them), so PRs don’t stay open (cc @jakubgs)


#10

One good thing about merging early is that integration bugs and some regressions become visible much earlier for new features, because it is unlikely that you’ll be able to completely isolate a feature from the outside world (i.e., have no side effects whatsoever). If you write code with no side effects at all (say, a different hashing function), then yeah, it is probably the same as long-lived branches. Otherwise, there is a difference.

The Chrome team commits partial refactorings, which makes it extremely painful to support patches for their code :stuck_out_tongue_winking_eye:

Not really: you can start with a refactoring to support both flows (the old one and the new one), extract some common logic somewhere, and decouple it from the UI flow itself. Then just make a separate view on the same models.
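The approach described here, shared logic decoupled from the UI with two thin views over the same model, might look roughly like this sketch (all names hypothetical, not actual status-react code):

```python
# Sketch: one shared, UI-independent model feeding both the old and the
# new view, so the business logic is integrated once rather than forked.
def wallet_balance_model(accounts: list) -> dict:
    """Shared logic used by both flows."""
    return {"total": sum(a["balance"] for a in accounts)}

def old_wallet_view(model: dict) -> str:
    return f"Balance: {model['total']}"

def new_wallet_view(model: dict) -> str:
    return f"Total balance\n{model['total']} ETH"

def render(accounts: list, use_new_flow: bool) -> str:
    # The toggle only selects the view; the model is common code.
    model = wallet_balance_model(accounts)
    view = new_wallet_view if use_new_flow else old_wallet_view
    return view(model)
```

Because both views share the model, changes to the underlying logic are exercised by whichever flow is live, avoiding the un-integrated duplicate-code problem discussed above.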

Even though I feel like the numbers are heavily exaggerated, there is one point I want to take from here that I do agree with.

There can be different strategies on implementing features:
(1) submit a PR, make sure there is a minimum of regressions and the feature works, then merge it. The PR stays open for, say, 3 days.
(2) submit a PR, submit a fixup1 PR, submit a fixup2 PR, submit a fixup3 PR… Each one takes 1 day (which looks better in these stats), but the total feature time is 4 days.

Let’s not look through only one lens here, is what I wanted to say. You can always select one arbitrary metric that looks valid and solid and make a team look either bad or good with it :slight_smile:


#11

PS: I don’t really have enough insight into the wallet refactoring to judge the PR length. Maybe a compromise can be found either way, but I’m not ready to make a decision without even looking at the code/problems/etc.


#12

Not really, you can start with refactoring to support both flows (the old one and the new one), extract some common logic somewhere, and decouple it from the UI flow itself.

Again, I know it can be done, and there are various techniques which I am familiar with, I am just saying that this is not what has been proposed (or if so, it needs more details), we talked about “add code, not change, toggle at last point”, so I took that literally.

What you mentioned clearly does not fit in that bucket, as you are changing parts of the codebase.

Similarly, the comment about merging: it is useful for catching integration bugs, of course, but only if you are changing. If you are adding instead of changing, and that part is not being exercised (behind a toggle), there’s effectively no integration until you replace the existing code with the new one; then you have similar issues as with long-lived branches, in the same way @goranjovic described.


#14

Agree, but I think that you shouldn’t take @oskarth’s words that literally :wink: or I prefer not to :slight_smile:


#15

Yeah, that was sloppily written / not properly factored out. There are many roads to Rome. I think the main things we want out of this thread are:

  1. Admit we have a problem
  2. Focus on finding solutions, not just pointing out why something can’t be done
  3. Make decisions on what to do
  4. Measure (note: this is not a north star metric on its own; it’s really more about maximizing speed and quality together) and follow up

#16

We are all engineers, of course I take everything too literally :slight_smile: made me think of this: “I dabble in precision” https://www.youtube.com/watch?v=w-wbWGwZ7_k makes me chuckle every time I watch it


#17

Problem space

  • Too long-lived PRs

How many of these did we have lately? Did they cause any noticeable issues?

  • Lack of consistent use of feature flags for new flows (add code, not change, toggle at last point)

That’s objectively not a problem

  • Hard to revert commits

It’s hard for me to understand how smaller PRs make it easier to revert commits.

  • Mobile and multiple build targets means PRs and automated testing more complex

That’s just what we have, not a problem we could solve in terms of a better PR process

Solution space

  • Use flags wherever possible, allowing for feature to be turned on in dev mode, then do final design review

How do we prevent changes to the old code when the new code is only partially merged?

  • Don’t dedicate all QA to release

Then the release process should also be changed somehow; otherwise it will only take more time as a result, no?

  • Don’t allow WIP PRs

How does it make the pull request process better? How does it make changes contain fewer regressions and be merged faster? It doesn’t make much sense, IMO. If we are speaking about the “illusion of work” stuff, as was mentioned on the call, that’s a completely different subject which is not related to “Better pull requests process”.


Objective problems (IMO)

  • The PR lifecycle often requires manual QA, which consumes too much attention from the QA team
  • Sometimes it is hard to review big PRs

Steps to fix them

  • Remove manual QA from the PR lifecycle; make it by request only
  • Do QA in nightly
  • Rely on Design/UX people in reviews of UI heavy PRs
  • Encourage/oblige devs to write e2e tests
  • Reject PRs which are not reviewable (definition of not reviewable is quite a vague thing though)

#18

You mean like this one?
https://ci.status.im/job/status-react/job/manual/


#19

I would also add responsibility:

  • only add up to 3 reviewers, no more. Adding someone means they must review the PR. Nag every day until the PR is reviewed.
  • never ask for a review in-chat, it shouldn’t be necessary
  • check your pending reviews once a day and do them as a daily routine
  • don’t take requests for changes personally; request changes often. Once the requests are dealt with, the assignee can remove and re-add the reviewer for another round. Both need to be reactive; an iteration shouldn’t take more than a day.

Having QA test only nightlies seems like a good idea as well. That means the PR submitter and reviewers have more responsibility for testing their own stuff and documenting changes in their commits (to build a daily changelog that QA can use to figure out the most important areas to test). I like that. A bug triggers a search for the faulty commit and a revert; an issue is opened, and the PR to fix it will be tested by QA this time. If you want the highway, don’t break develop.

Regarding flags/smaller PRs: making them hard requirements just creates new problems. If it is just for the sake of having PRs merged faster, I don’t find it helpful. There are only some cases for which it is good, at least in the current state of the codebase.


#20

Thanks! It is all reasonable.

After giving it some thought, it is not obvious to me that we shouldn’t manually QA the PRs.

(1) The first argument for manual QA on PRs is how we use the Desktop and Mobile nightlies. If some PR breaks the Desktop nightlies (and we don’t have any auto-tests to catch that now), people will run to us saying that they have upgraded and it isn’t running anymore.

We should not use nightlies of any platform for daily usage if PRs aren’t tested. That leads us to the conclusion that some nightlies should be approved by QA for “internal usage”, and only those should be promoted.

(2) It will make finding the change that broke Desktop much harder. Before, QA were testing Desktop in place, so it was immediately obvious what broke it. If we don’t do manual QA and a nightly is broken, it takes more time to figure out why.

(The same argument holds true for iOS as well, since we don’t have smoke autotests for it either.)

So, are we okay taking these tradeoffs?


#21

So, I’m not suggesting that we are perfect and there is nothing that needs to be done.

What I’m suggesting is that we are rushing to make some policy conclusions before trying to figure out what causes this in the first place and why things are the way they are.

Long-lived PRs are more of a symptom to me than a cause. There are some underlying issues that we need to fix, and as a side effect we will probably get shorter PR times.

So I would suggest everyone think about why we have longer PR times and what the underlying reasons are.

I see 2 immediately:

  • our messy situation with nightlies/releases, especially on Desktop;
  • lack of autotest coverage for platforms other than Android;

what else?