
Lightning fast messaging and cold phone #55

Closed
tiabc opened this issue Dec 1, 2017 · 28 comments · Fixed by status-im/status-mobile#2922


@tiabc
Contributor

tiabc commented Dec 1, 2017

Preamble

Idea: <to be assigned>
Title: Lightning fast messaging and cold phone
Status: In progress
Created: 2017-12-01
Started: 2017-12-11 (estimated)

Summary

Make the app usable, from a performance point of view, for all supported user flows in the product MVP.

Vision

Use basic hypothesis testing to solve the following qualitative user statement:

"I don't want to use the app; it is slow and buggy […] My phone gets hot, it uses up too much battery, and interface is laggy"

As well as focusing on the critical path, making assumptions explicit, operating under uncertainty, and doing the minimal coding and tooling work necessary to solve the user story.

Swarm Participants

Swarm size: 10 people.

Requirements

Loosely: Understanding of supported user stories and devices. No additional requirements in this idea outside of what has been or will be specified as part of everyday work. Creating additional tooling is not part of this swarm's work.

Goals & Implementation Plan

(a) Get a global overview of this problem
(b) Use rigor and hypothesis testing to pursue the most fruitful directions
(c) Work as close to the root of the assumption tree as possible while still having leverage

See https://docs.google.com/document/d/1OZtzfojToJtZhj2LnokA9-YU7aL_gm9NzaGtI-vuZ6E/edit# for original doc.

General heuristics:

  1. Minimal scope in all dimensions to simplify the problem.
  • Supported devices: iPhone 6 and Samsung Galaxy Note 4 (fallback Galaxy S6).
  • Supported network: RPC only.
  • Supported user stories: minimal as defined by MVP (e.g. public/group chat optionally cut).
  2. Focus on iteration speed.
    Test hypotheses quickly. If it takes 1 month to test a hypothesis, then we can only do 2-3 in 3 months, and we don't learn a lot. But if we can do 2 a week, we can do 20+ and our knowledge will be correspondingly higher.

Minimum Viable Product

Goal Date: 2017-12-29
Completed: 2018-01-10 (partial completion)

Description:

  1. Identify and agree on user stories / behaviors which cause unacceptable
    performance on supported devices and are part of Status Core MVP supported
    flows.

Example: As a user on iPhone 7 I want to be able to sign up in <30s without
feeling like my phone gets hot.

  2. Reproduce these behaviors and collect metrics that are currently available to
    us; identify needs for plausible but currently opaque metrics.

Example: Figure out how to get disk IO statistics over time on Android (see the sketch after this list).

  3. Based on information in (2), outline hypotheses, with clear assumptions, as
    well as the quickest way to test these hypotheses.

Example: Assuming phone gets hot due to high CPU usage, Whisper decryption is
causing high CPU usage. Quickest way to test: disable Whisper and simulate user
story without it.
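
On the disk IO example: Android exposes cumulative per-process IO counters in /proc/<pid>/io, so one plausible starting point is polling that file over adb. A minimal sketch under stated assumptions (adb on PATH, im.status.ethereum as the package name, and the counters being readable, which newer Android versions may restrict):

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// adb runs an adb command and returns trimmed stdout (empty on error).
func adb(args ...string) string {
	out, err := exec.Command("adb", args...).Output()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(out))
}

func main() {
	// "im.status.ethereum" is an assumed package name for illustration.
	pid := adb("shell", "pidof", "im.status.ethereum")
	for {
		// read_bytes/write_bytes are cumulative, so diff successive
		// samples to get a rate over time.
		raw := adb("shell", "cat", "/proc/"+pid+"/io")
		for _, line := range strings.Split(raw, "\n") {
			if strings.HasPrefix(line, "read_bytes") || strings.HasPrefix(line, "write_bytes") {
				fmt.Println(time.Now().Format(time.RFC3339), strings.TrimSpace(line))
			}
		}
		time.Sleep(5 * time.Second)
	}
}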

Iteration 1

Goal Date: 2018-01-18
Description:

Deliverable 1: Test hypotheses:

Deliverable 2: User stories that are unacceptable right now for supported devices (Anna, Chad?)

Deliverable 3: Come up with new hypotheses based on what we learn in 1.

Iteration 2...N

Goal Date: TBD

Once we have all the tools for benchmarking in place and most bottlenecks are fixed, we need to ensure we have documentation on how to avoid performance regressions in the future, as well as the automated performance tests developed under #22.

Supporting Role Communication

Post-Mortem

Copyright

Copyright and related rights waived via CC0.

@tiabc tiabc added the draft label Dec 1, 2017
@oskarth
Contributor

oskarth commented Dec 5, 2017

I pledge 5 hours a week as tree tender / Clojure-Go integrator / Clojure contributor, as this is what has happened in practice (more than 5h last week, hopefully less this week for offline inbox MVP, TBD for two weeks time).

@JekaMas
Contributor

JekaMas commented Dec 5, 2017

I've done some Whisper tests: on a small network (<100 Whisper nodes) we get a lot of message duplication, ×5 to ×10: each node receives each message many times, restores the message, does some preparation and checks, and only after that discards the duplicate.
On a broader network this should be less critical, but we can't guarantee that the network contains no clusters that are well connected internally but badly connected to the network as a whole.
I've checked that we could do some kind of 'early stop' for duplicated messages. I can do that in 24 hours, but first I'll start the discussion in a new Ethereum issue.
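
One possible shape for such an early stop (a sketch of the idea, not the actual go-ethereum change): keep a bounded set of recently seen envelope hashes and drop duplicates before any decoding or decryption work:

package dedup

import (
	"container/list"
	"sync"
)

// SeenCache remembers the hashes of recently seen envelopes so that
// duplicates can be dropped before any expensive decode/decrypt work.
type SeenCache struct {
	mu    sync.Mutex
	max   int
	order *list.List // hashes in arrival order, for eviction
	seen  map[[32]byte]*list.Element
}

func NewSeenCache(max int) *SeenCache {
	return &SeenCache{
		max:   max,
		order: list.New(),
		seen:  make(map[[32]byte]*list.Element),
	}
}

// FirstSeen reports whether this is the first time the envelope hash has
// been observed; on a repeat, the caller can discard the message early.
func (c *SeenCache) FirstSeen(hash [32]byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[hash]; ok {
		return false
	}
	c.seen[hash] = c.order.PushBack(hash)
	if c.order.Len() > c.max {
		front := c.order.Front()
		c.order.Remove(front)
		delete(c.seen, front.Value.([32]byte))
	}
	return true
}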

@tiabc
Contributor Author

tiabc commented Dec 7, 2017

@JekaMas that sounds good! As far as I know, they're implementing bloom filters to do this more intelligently.

Another approach for us could be to set various topics instead of keeping the topic fixed for messaging. It won't scale for the "real" network, only for our small subset, though at first glance it sounds like a good approach even for the real network.
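
For illustration, per-chat topics could be derived from a shared chat identifier, so a filter only matches traffic the client cares about. A sketch assuming a chat-ID scheme (not the actual Status protocol) and Ethereum-style Keccak-256:

package topics

import "golang.org/x/crypto/sha3"

// Topic mirrors Whisper's 4-byte topic type.
type Topic [4]byte

// DeriveTopic maps a chat identifier to a topic, so filters only match
// traffic for chats the client actually participates in.
func DeriveTopic(chatID string) Topic {
	h := sha3.NewLegacyKeccak256() // Ethereum-style Keccak-256
	h.Write([]byte(chatID))
	var t Topic
	copy(t[:], h.Sum(nil)[:4])
	return t
}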

@b00ris
Contributor

b00ris commented Dec 7, 2017

I pledge a focus of 1 (30h/week). After the end of #1: 40h/week.

@JekaMas
Contributor

JekaMas commented Dec 7, 2017

The same as @b00ris wrote.

@tiabc tiabc added in-progress and removed draft labels Dec 7, 2017
tiabc added a commit that referenced this issue Dec 7, 2017
@rasom

rasom commented Dec 12, 2017

I pledge 30h/week

@oskarth
Contributor

oskarth commented Dec 13, 2017

To get this moving quickly, I sketched out an idea for how I think we can get to the MVP:

Minimum Viable Product
Goal Date: 2017-12-22

Description:

  1. Identify and agree on user stories / behaviors which cause unacceptable performance on supported devices and are part of Status Core MVP supported flows.

Example: As a user on iPhone 7 I want to be able to sign up in <30s without feeling like my phone gets hot.

  2. Reproduce these behaviors and collect metrics that are currently available to us; identify needs for plausible but currently opaque metrics.

Example: Figure out how to get disk IO statistics over time on Android.

  3. Based on information in (2), outline hypotheses, with clear assumptions, as well as the quickest way to test these hypotheses.

Example: Assuming phone gets hot due to high CPU usage, Whisper decryption is causing high CPU usage. Quickest way to test: disable Whisper and simulate user story without it.

@oskarth
Contributor

oskarth commented Dec 15, 2017

Updated the MVP in the iteration. The only change from the comment above is the goal date, which moved due to uncertain availability / a slow start.

@oskarth
Contributor

oskarth commented Dec 18, 2017

@divan do you want to commit to this swarm?

@divan
Contributor

divan commented Dec 20, 2017

Brief update on CPU investigation.

Tool for monitoring Status CPU (feel free to extend to support more metrics, forwarding to metrics collection platforms, etc): https://github.com/status-im/statusmonitor

Yesterday I tested CPU usage for the idle screen on a release build with and without status-go (using the STUB_STATUS_GO build flag). This flag effectively disables status-go usage (it is still compiled into the app, though). I wanted to figure out what portion of background CPU activity for the app in the idle state (i.e. chat screen open and no interaction) is introduced by status-go and what portion by status-react code. Here are the results I got:

Idle with and without status-go (note: this phone has 8 cores, so max is 800%):
[graph: idle]

[graph: login_screen_normal]

[graph: stub_changing_screens]

I'm going to do more measurements and data collection, including rebuilding and repeating it a couple of times from scratch to make sure I'm not making silly mistakes while running experiments.
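
For reference, the core of such a monitor boils down to diffing utime+stime ticks from /proc/<pid>/stat between samples; this is a simplified sketch of that idea, not the actual statusmonitor code, and would need to run on the device (or be adapted to shell out over adb):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

func procTicks(pid string) uint64 {
	data, err := os.ReadFile("/proc/" + pid + "/stat")
	if err != nil {
		return 0
	}
	s := string(data)
	// utime and stime are fields 14 and 15 of /proc/<pid>/stat; skip
	// past the "(comm)" field first, since it may contain spaces.
	fields := strings.Fields(s[strings.LastIndexByte(s, ')')+2:])
	utime, _ := strconv.ParseUint(fields[11], 10, 64)
	stime, _ := strconv.ParseUint(fields[12], 10, 64)
	return utime + stime
}

func main() {
	pid := os.Args[1]
	const hz = 100 // USER_HZ: kernel ticks per second, typically 100
	for {
		before := procTicks(pid)
		time.Sleep(time.Second)
		after := procTicks(pid)
		// one fully busy core over 1s = 100 ticks = 100%,
		// so an 8-core device can read up to 800%, as in the graphs
		fmt.Printf("%.1f%%\n", float64(after-before)/hz*100)
	}
}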

@JekaMas
Contributor

JekaMas commented Dec 21, 2017

Should we split this issue into two parts: a cold phone and fast messaging? As far as I can see, high CPU usage != high battery consumption in the common case. One thread could be waiting for a resource with CPU usage at 100%, so no other thread can use that CPU, yet battery consumption is low in that case.
Maybe this article could be useful: http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
Those graphs can only show consumed CPU time; they can't say anything about consumed energy or the real CPU work done (the top command can't measure that). So we are measuring our slowness, not 'battery usage'.
@divan @oskarth @tiabc
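
One way to measure drain rather than CPU time is to sample the battery counters Android itself reports, e.g. via dumpsys. A rough sketch (output format varies across devices and Android versions):

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// batteryLevel extracts the "level:" line from `adb shell dumpsys battery`.
func batteryLevel() string {
	out, _ := exec.Command("adb", "shell", "dumpsys", "battery").Output()
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "level:") {
			return strings.TrimSpace(line)
		}
	}
	return "unknown"
}

func main() {
	// Log the battery level once a minute while a scenario runs; the
	// slope over a long run approximates energy drain, not CPU time.
	for {
		fmt.Println(time.Now().Format(time.RFC3339), batteryLevel())
		time.Sleep(time.Minute)
	}
}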

@JekaMas
Contributor

JekaMas commented Dec 24, 2017

I'm going to describe a possible issue related to Whisper group chatting.
Ethereum decrypts every message twice: the p2p layer receives bytes and converts them into a packet (github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/rlpx.go, type rlpxFrameRW, method ReadMsg), then the packet is decoded into an envelope (reflection is used for this), then the envelope is decrypted into a user message.
Only near the end of the last stage can the message be discarded if it is already cached.

In fact, each chat user receives as many copies of a message as they have connections (if the TTL has not been reached). It looks like this graph and streams (in each node, the first number is the node name and the second is the message copy count):
[animation: whisper_animation]

So if a user has 50 connections, they receive 50 copies of each message, and each receive costs 2 stages of decryption plus one decode using reflection. That's 100 decryptions and 50 decodes in total for each message in the chat; it doesn't scale and there is 'no free lunch'.

I think this could hurt chat performance.

A possible solution is to change the current Ethereum protocol to allow discarding messages from given services by hash or by unique ID (like a GUID, Twitter's Snowflake, or Lazada's Luigi); a sketch of such an ID follows.
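
For illustration, a snowflake-style ID packs a timestamp, node ID, and sequence number into 64 bits; the bit layout below is illustrative only, not a proposal for the Whisper wire format:

package snowflake

import (
	"sync"
	"time"
)

// Generator mints 64-bit IDs: 41 bits of millisecond timestamp,
// 10 bits of node ID, 12 bits of per-millisecond sequence.
type Generator struct {
	mu     sync.Mutex
	nodeID uint64
	lastMs uint64
	seq    uint64
}

func New(nodeID uint64) *Generator {
	return &Generator{nodeID: nodeID & 0x3FF} // keep 10 bits
}

func (g *Generator) Next() uint64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := uint64(time.Now().UnixNano() / 1e6)
	if now == g.lastMs {
		// sequence overflow within one millisecond is ignored in this sketch
		g.seq = (g.seq + 1) & 0xFFF
	} else {
		g.lastMs, g.seq = now, 0
	}
	return now<<22 | g.nodeID<<12 | g.seq
}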

@antdanchenko
Contributor

I am going to add the ability to measure CPU and RAM differences between two provided builds across different user flows (create/recover user, send transaction/location/request, chat, group chat, etc.) as represented in the appium automated tests.
In scope of status-im/status-mobile#2816

@oskarth
Contributor

oskarth commented Dec 27, 2017

To not lose it in Slack history:

Jarrad asked "so what's the summary on performance issues so far? do we understand the underlying causes?"

My reply:

Work in progress.

  1. ULC/LES (slow/hot sync, disk filled up) less of an issue - we'll just disable the non-upstream node for MVP, then work concurrently on this as a medium/long-term investment task. (tiabc, jeka, b00ris, themue)

  2. Group chat laggy. This is distinct from public chat or 1on1 chat being laggy. Due to slow JS ECC encrypt/decrypt and using asym encryption for group chat. We'll remove group chat for MVP. Put up a bounty to investigate a better crypto lib; long term, get symkey going for private group chat (session key a la double ratchet, and/or do this on the Go side).

  3. General bad app perf. Likely due to RN<->Native render. Work in progress. Basics like switching to better list implementations and navigations (fewer roundtrips) WIP/done by @roman and also associated with RN upgrade etc. Have some more thoughts on this I'll try to share this week but it's a general RN problem.

  4. Whisper overhead in bg. @jeka has some thoughts on this. In general, @divan is also working on distinguishing status-go and status-react issues. Generally a bit opaque, but more importantly no direct user stories tied to this.

  5. To KISS for MVP, I recommend focusing on supported devices, i.e. Samsung Galaxy/iPhone 6s+ or 7+, as there are a lot of (general) RN perf issues for older Android devices. @anna is also working on user stories which are definitely in MVP user flow. We can talk more about this on Thursday (@roman wants to make it work for old Android devices too, I say fuck it for MVP - we can quantify and bounty a lot of this IMO.).

Not sure what @tiabc is investigating.

For you:

  • what device are you using?
  • are there any specific user stories that you find unacceptable? This would be really useful. For Ropsten RPC. Perhaps something for dogfooding session tomorrow (I'll be on a flight) - cc

@janherich

I'm going to commit 30-40h/week to this swarm, starting with - status-im/status-mobile#2852

@b00ris
Contributor

b00ris commented Jan 5, 2018

Investigation from me and @JekaMas related to the one-topic problem

Problem:

We use one topic for all whisper messages(https://github.com/status-im/status-react/blob/484e982bdf4b09c168aab142f190ef9427cbbfa3/src/status_im/protocol/web3/filtering.cljs#L6).

Hypothesis:

  • Whisper spends a lot of CPU decrypting every message, including those from outside users.
  • This increases our CPU usage.

Experiments:

(see https://github.com/status-im/status-go/tree/debug/whisper_perf_topic)

a) one topic

(https://github.com/status-im/status-go/blob/5a7e4c3a0019a3b5bf38cc3c76dadfefa7d14749/e2e/whisper/whisper_send_message_test.go#L89, run: make bench-one-topic)
1) Start one whisper node (receiver)
2) Create a message filter with topic 0x74657374 for that node
3) Start several whisper nodes (senders)
4) The sender group sends whisper messages with topic 0x74657374 (~80 rps)
Result: envelope.Open (https://github.com/ethereum/go-ethereum/blob/1c2378b926b4ae96ae42a4e802058a2fcd42c87b/whisper/whisperv5/filter.go#L116) executes for all messages (so ~24000 decryption attempts over 5 minutes)
[graph: cpu_bench_one]

b) many topics

(https://github.com/status-im/status-go/blob/5a7e4c3a0019a3b5bf38cc3c76dadfefa7d14749/e2e/whisper/whisper_send_message_test.go#L161, run: make bench-many-topics)
1) Start one whisper node (receiver)
2) Create a message filter with topic 0x74657374 for that node
3) Start several whisper nodes (senders)
4) The sender group sends whisper messages with random topics (~80 rps)
Result: envelope.Open (https://github.com/ethereum/go-ethereum/blob/1c2378b926b4ae96ae42a4e802058a2fcd42c87b/whisper/whisperv5/filter.go#L116) does not execute for all messages
[graph: cpu_bench_many]

c) message filter created over RPC

(https://github.com/status-im/status-go/blob/5a7e4c3a0019a3b5bf38cc3c76dadfefa7d14749/e2e/whisper/whisper_send_message_test.go#L16
1) Start one whisper node (receiver)
2) Create a message filter over RPC
Result: the message filter was created with two topics (an empty topic was added). This is a go-ethereum bug (ethereum/go-ethereum#15810); fix: ethereum/go-ethereum#15811. It affects the whole whisper network.
Every go-ethereum whisper node (with a message filter created over RPC) tries to decrypt every whisper message 0_o, because the function

func matchSingleTopic(topic TopicType, bt []byte) bool {
	if len(bt) > 4 {
		bt = bt[:4] // compare at most the 4 bytes of a full topic
	}

	// If bt is an empty slice, the loop body never runs and the
	// function falls through to `return true`, matching everything.
	for j, b := range bt {
		if topic[j] != b {
			return false
		}
	}
	return true
}

can't return false if bt is an empty slice.
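
For reference, the patched function would look roughly like this (a sketch of the kind of guard needed, along the lines of ethereum/go-ethereum#15811, not the exact merged diff):

// A byte slice shorter than a full 4-byte topic can never match, so an
// empty (or truncated) bt no longer falls through to `return true`.
func matchSingleTopic(topic TopicType, bt []byte) bool {
	if len(bt) > 4 {
		bt = bt[:4]
	}
	if len(bt) < 4 {
		return false
	}
	for j, b := range bt {
		if topic[j] != b {
			return false
		}
	}
	return true
}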

Result
Many topics: total envelopes 24041, opened envelopes 0, CPU usage <1%.
Single topic: total envelopes 24055, opened envelopes 23842, CPU usage 41.78%.

Migrating from one topic to many topics strongly reduces CPU usage for the Status app.

@oskarth
Contributor

oskarth commented Jan 9, 2018

As part of MVP:

MVP for #55. Goal Date: 2017-12-29 (2w)

  1. Identify and agree on user stories / behaviors which cause unacceptable performance on supported devices and are part of Status Core MVP supported flows.

  2. Reproduce these behaviors and collect metrics that are currently available to us; identify needs for plausible but currently opaque metrics.

  3. Based on information in (2), outline hypotheses, with clear assumptions, as well as the quickest way to test these hypotheses.

I spent some time formulating precise hypotheses for some of the major pieces of work we are currently doing. They can be found here:

The main one missing is: status-im/status-mobile#2852 which will probably be factored out into two by @janherich @dmitryn and me sometime soon.

Please have a look at the above. Also note that big infrastructure/tooling work and LES2 etc. are out of scope for this swarm. This is purely about the critical path for MVP and doing the minimal coding necessary to test and eventually fix hypotheses.

I believe that if we are in rough agreement with this, we should consider the MVP done (it's been a rough month) and figure out where we want to be in the next 1-2 weeks for iteration 1, as part of the swarm group call tomorrow.

@oskarth
Contributor

oskarth commented Jan 10, 2018

Swarm 55 update:

Meeting notes: https://docs.google.com/document/d/1KEqE3JGpA4ZKmpbffZZubcRkVbV6v9DtTxzs5EV8Z68/edit#

Swarm people:

  • Oskar 10h/w (taking over from Ivan as lead)
  • Anna 10h/w
  • Roman 30h/w
  • Eric 20h/w
  • Dmitry 30h/w
  • Boris 40h/w
  • Eugene 40h/w
  • Jan? (assume 30h/w)
  • Ivan T and Ivan D: assuming 0h until I hear otherwise
  • Sergey QA (10h/w)

Iteration 1 scope:

Goal date: January 19th

Deliverable 1: Test hypotheses:

Deliverable 2: User stories that are unacceptable right now for supported devices (Anna, Chad?)

Deliverable 3: Come up with new hypotheses based on what we learn in 1.

@oskarth
Contributor

oskarth commented Jan 10, 2018

Updated original issue. #55

@JekaMas
Contributor

JekaMas commented Jan 17, 2018

@yenda Do many topics solve the issue completely? If so, we need one more patch on Ethereum (ethereum/go-ethereum#15811). This PR is already approved but not merged yet; it should decrease CPU usage even more.

I really want to hear about the #55 testing; it's hard to believe that one fix solves such an issue :) Is the app now fast and cold?

@yenda yenda reopened this Jan 17, 2018
@oskarth
Contributor

oskarth commented Jan 19, 2018

Swarm update.

Misc

Iteration 1 update

Deliverable 1: Test hypotheses:

Partial progress: status-im/status-mobile#3072

Partial progress: status-im/status-mobile#2922 and basic logging for verification in status-go. @yenda blocked by some local tooling issues and critical release bugs.

Progress: status-im/status-mobile#2965 (comment)

  • 1.4. [Perf] Test hypothesis: re-frame event queue is getting overwhelmed (Dmitry)

Complete: status-im/status-mobile#3045 (solves/brings us back to baseline for status-im/status-mobile#2852)

Deliverable 2: User stories that are unacceptable right now for supported devices (Anna, Chad?)

Partial progress. Release blocker in terms of priority:

Main user story partially solved is: status-im/status-mobile#2852

Deliverable 3: Come up with new hypotheses based on what we learn in 1.

See iteration 2 for this.

Iteration 2

  • Continue with 1.1 (RN bridge, Roman) and 1.2 (many topics, Eric) as is.

  • Evolve 1.3 (Whisper network overhead) by checking p2p layer as Whisper overhead seems low, also verify numbers (Eugene).

  • 1.5? (realm writes to async queues, Jan). TODO: To be formulated more precisely; no specific hypothesis.

  • 1.6. [Perf] Test hypothesis 6: re-frame queue throughput causes bad app responsiveness (Igor)

Additionally, some RN navigation PR (Roman), general investigation in RN workers (Roman, Dmitry). User stories will continue to be formulated as 0.9.13 is released as well.

The main surprising thing learned (at least by OP): Whisper overhead appears to be only 20% compared to baseline geth with upstream RPC, suggesting the network overhead is at the p2p/discovery layer.

Iteration 2 goal date: 2018-02-01.

New project board for iteration 2: https://github.com/orgs/status-im/projects/8

@oskarth
Contributor

oskarth commented Jan 19, 2018

For the board, in general, we have:

  • Identified user stories. These are real user behaviors that suck and will be part of the supported app in future as well (e.g. not syncing, or iPhone 5s, or console password lags). These are the things we are trying to solve.

  • Perf hypotheses. These are things that we can quickly test, learn a lot from, and use to attack user stories. These should be rigorous in terms of what assumptions they are making, that it's the quickest way to test something, and that it attacks a specific user story. All the ones on the board have been tidied up and new ones should meet the same standard. I'm happy to help write these if you have a hunch but don't feel comfortable writing it down so rigorously.

  • Other work. This might be born out of specific hypotheses, or be cut out to make a general issue smaller. But in general this should be a minority of the work we do. A hunch or some misc numbers pointing to a specific direction is not enough to warrant big code changes if we can't verify a specific hypothesis tied to a specific user story.

@oskarth
Contributor

oskarth commented Jan 24, 2018

From Slack:

tldr splitting into two swarms

Hey all! So this swarm has grown quite a bit and is currently a bit too big. Initially we thought a lot of problems were on the status-go side, but then it turned out a bunch are on the status-react/app side. This has led to the surface area being kind of big. @themue (and others) have brought up the idea that it'd be good to split the swarm up a bit. After talking to some people, this is what we are going to do. Both would still practice the same methodology: starting with an end user story and testing a hypothesis that requires the minimal amount of work to impact the user story, but the specific goals would be a bit different.

One swarm/working group, say 55a, will be around UI/rendering stuff (Dmitry, Roman, Jan, Igor currently I think). Specifically this means hypotheses centered around status-im/status-mobile#3095 right now. @dmitryn has agreed to lead this one.

Another swarm, say 55b, is largely about network overhead status-im/status-mobile#2931 but also about the Whisper many-topics one (though this one doesn't have a specific user story attached to it, it generally seems promising). This one is largely Go related, but it also requires Clojure integration, so it's currently Eugene, Boris, and Eric. @jeka has agreed to lead this one.

As @anna and others identify more user stories we'll see where they best fit, but we can play this by ear I think.

As a start, to reduce coordination costs and decouple efforts a bit, we can just create two channels and have two separate meetings (last one was a bit rushed and covered a lot, as I'm sure some of you felt). Then it is up to @jeka and @dmitryn how they want to organize things, like creating a new idea or setting iterations or whatever.

@oskarth
Contributor

oskarth commented Feb 14, 2018

Forked swarm into two separate pieces of work: Smooth UI (#76) and Low Network (#71). Closing this one.
