
Optimization: a Stack/Heap approach #22

Open
thysultan opened this issue Jun 21, 2017 · 50 comments

Comments

@thysultan

thysultan commented Jun 21, 2017

Instead of

body {
  color: red
}

producing

[
  [RULE_START, 1],
    [SELECTOR, 'body'],
    [PROPERTY, 'color'],
    [VALUE, 'red']
  [RULE_END]
]

It could instead produce

HEAP = ['body', 'color', 'red']

[
RULE_START, 0,
SELECTOR, 0,
PROPERTY, 1,
VALUE, 2,
RULE_END, 2
]

Where the integers 0, 1, 2 are indices of the corresponding values in the HEAP.

This way you can use a flat Uint32Array typed array for maximum throughput.
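To make the idea concrete, here is a minimal sketch of how a consumer might walk such a flat array. The marker values, the toCSS name and the handling of the RULE_START/RULE_END operands are assumptions made up for this example, not part of the proposal.

// Hypothetical marker constants; any small integers would do.
const RULE_START = 1, SELECTOR = 2, PROPERTY = 3, VALUE = 4, RULE_END = 5

// The heap holds the unique strings; the bytecode holds only integers.
const HEAP = ['body', 'color', 'red']
const CODE = new Uint32Array([
  RULE_START, 0,
  SELECTOR, 0,
  PROPERTY, 1,
  VALUE, 2,
  RULE_END, 2
])

// Walk the flat array two slots at a time and resolve heap indices.
function toCSS (code, heap) {
  let css = ''
  for (let i = 0; i < code.length; i += 2) {
    switch (code[i]) {
      case SELECTOR: css += heap[code[i + 1]] + '{'; break
      case PROPERTY: css += heap[code[i + 1]] + ':'; break
      case VALUE:    css += heap[code[i + 1]] + ';'; break
      case RULE_END: css += '}'; break
      // RULE_START and the RULE_END operand are ignored in this sketch.
    }
  }
  return css
}

toCSS(CODE, HEAP) // 'body{color:red;}'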

@thysultan thysultan changed the title Explore a Stack/Heap approach. Explore a Stack/Heap approach Jun 21, 2017
@kof
Member

kof commented Jun 21, 2017

Ah I got what you mean - using separate arrays for markers and props/values/selectors!

@kof
Member

kof commented Jun 21, 2017

That's an interesting idea; we should add a benchmark for this as well.

@kof
Member

kof commented Jun 21, 2017

We could in theory end up with three arrays: function refs, strings and markers. Each of them would hold a single consistent type, and the markers would even be integers.
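A rough illustration of those three arrays, assuming (purely for this sketch) that dynamic values are kept as function refs:

// Markers and heap references in one consistent integer type.
const markers = new Uint32Array([1, 0, 2, 0, 3, 1, 4, 2, 5, 0])
// Unique strings (selectors, properties, static values).
const strings = ['body', 'color', 'red']
// Function refs (dynamic values), kept apart so the other two arrays stay monomorphic.
const functions = [props => props.color]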

@thysultan
Author

thysultan commented Jun 21, 2017

It would also mean we could reuse selectors, properties and property values, i.e.

body {
  color: red;
}

h1 {
  color: red;
}

Would produce

HEAP = ['body', 'color', 'red', 'h1']

[
RULE_START, 0,
SELECTOR, 0,
PROPERTY, 1,
VALUE, 2,
RULE_END, 2,

RULE_START, 3,
SELECTOR, 3,
PROPERTY, 1,
VALUE, 2,
RULE_END, 3
]
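The deduplication itself could be a tiny interning step during compilation. A minimal sketch, where the intern helper is hypothetical:

const HEAP = []

// Return the heap index of value, appending it only if it is not there yet.
function intern (value) {
  let index = HEAP.indexOf(value)
  if (index === -1) index = HEAP.push(value) - 1
  return index
}

intern('body')  // 0
intern('color') // 1
intern('red')   // 2
intern('h1')    // 3
intern('color') // 1 again, 'color' is reused

For a large heap a Map lookup would avoid the linear indexOf scan, but the idea is the same.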

@kof
Member

kof commented Jun 21, 2017

Oh so the HEAP array would become a collection of unique strings, neat! Not sure though how much it will bring after gzip, but def. worth considering!

@thysultan
Author

Yes, gzip might make the KB benefit a hard sell, but it should reduce the memory the JS VM has to allocate at runtime and the amount of code it has to parse.

@kof
Member

kof commented Jun 21, 2017

Yeah, if it doesn't slow down parsing, it's definitely worth optimizing!

@thysultan
Author

thysultan commented Jun 21, 2017

WIP of a compiler that produces this format from CSS strings: https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html

@kof
Member

kof commented Jun 22, 2017

How is it related to this issue?

@thysultan
Author

Yes, this is not for the bench, just a proof of concept of a compiler that generates this format from a string source; related issue: #16

@kof
Member

kof commented Jun 22, 2017

Ok I have added this approach to the bench (see fromTwoTyptedArrays fn) https://esbench.com/bench/592d599e99634800a03483d8

I didn't use Uint32Array because it slows things down a lot.

It is not very fast. It is possible that with much larger test data it will show better relative performance, but I think it's mostly a memory optimization, because this approach adds more lookups.

The "From Array of strings" test performs best so far.

@thysultan
Author

For the tests to be fair, you'd need to exclude the creation of the arrays from the bench and put it into the setup phase; otherwise you're also measuring the creation of the data structure, which in this case is meant to be a preprocess operation rather than a runtime operation.

@kof
Member

kof commented Jun 22, 2017

I do it for the other tests as well, so it's comparable.

@thysultan
Author

thysultan commented Jun 22, 2017

Since the different methods use very different data structures that are created on each run, the results would not be comparable: the creation of the data structures is meant to be a compile-time operation. This applies to fromArrayOfStrings as well.

@thysultan
Author

thysultan commented Jun 22, 2017

You're also measuring the cost of creating the function on every operation, i.e. return (input) => { ... }. This could rule out any chance of the VM generating optimized code, which means we might just be measuring a baseline (unoptimized) version of what the VM could produce.
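For illustration, this is the difference being described, in a hand-rolled micro-benchmark shape. benchNaive, benchHoisted, createToString and createData are made-up names, and it assumes a browser-style performance.now().

// Measures closure and data creation on every iteration.
function benchNaive (iterations, createToString, createData) {
  const start = performance.now()
  for (let i = 0; i < iterations; i++) {
    const toString = createToString() // rebuilt on every run, never gets hot
    toString(createData())            // data structure also rebuilt on every run
  }
  return performance.now() - start
}

// Hoists the one-time work into setup, so only the hot path is measured.
function benchHoisted (iterations, createToString, createData) {
  const toString = createToString()
  const data = createData()
  const start = performance.now()
  for (let i = 0; i < iterations; i++) {
    toString(data)
  }
  return performance.now() - start
}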

@thysultan
Author

For example, this bench: https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html (see the console).

It avoids all the issues I mentioned above and shows that this bytecode format with a typed array is faster.

1873 ops, from typedArray
87 ops, from stylis
1638 ops, fromSingleArrayOfStrings

@kof
Member

kof commented Jun 22, 2017

You're also measuring the cost of creating the function on every operation, i.e. return (input) => { ... }. This could rule out any chance of the VM generating optimized code, which means we might just be measuring a baseline (unoptimized) version of what the VM could produce.

Can you elaborate on this specific case?

@kof
Member

kof commented Jun 22, 2017

For example, this bench: https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html (see the console).

On my machine, in Chrome, "fromSingleArrayOfStrings" is the fastest.

@thysultan
Author

thysultan commented Jun 22, 2017

Can you elaborate on this specific case?

The operation calls a function that returns another function; then useIt calls this function.

So you're also measuring the cost of creating the function that converts the data structure to a string, which varies across implementations. Paired with the fact that it's being regenerated on every operation, this means the function will also never get labeled as hot by the VM and produce optimized code, since every run both creates and executes a new function.

On my machine "fromSingleArrayOfStrings" is the fastest.

What were the results?

@kof
Member

kof commented Jun 22, 2017

The operation is calling a function that returns another function

Did you see that the first function is self-invoking? I scoped the static internals inside the closure. At bench time there is only one fn call.

then useIt calls this function.

I am not sure it does what it should; my idea was to avoid smart optimizations that might end up doing nothing. I am still not sure whether it does the job, or whether the useIt function was actually necessary in this case, but in any case all tests use it, so they are fair.

So, you're also measuring the cost of creating the function that converts the data-structure to a string which varies across implementations

I don't think so.

What were the results?

I just refreshed a couple of times; the results vary between 2500 and 3500, and this time the "bytecode" variant was faster several times. I am not sure the way you measure is correct. Did you try benchmark.js? I guess they have bulletproof logic for measuring.

@kof
Member

kof commented Jun 22, 2017

I just noticed you are using "while" in one case and "for" in another; can you try making both implementations as similar as possible?

@kof
Member

kof commented Jun 22, 2017

Regarding not measuring the data generation time: I realize that we would get better parsing results if we didn't generate the data for each test, but this would be less realistic, because in practice we would generate the data first and then parse it. Both steps are involved in the real-world scenario, right?

@kof
Member

kof commented Jun 22, 2017

Regarding warming up and getting optimized hot code: I am not sure this is the right approach here. I am not sure that function will be called often enough in a real-world use case to become hot. It's not a web server running the same operations all the time. It will parse how many? Let's say 100 style sheets on average, and it's done. Also, if done right, it will never parse all of them at once, but rather only those which are required for the current UI state.

For more realistic statistics we could count the number of CSS rules on popular websites.

@thysultan
Author

Did you see that the first function is self invoking?

I missed that :P

Just refreshed a couple of times, the results vary between 2500 and 3500

Same here.

I am not sure the way you measure is correct.

What part is incorrect?

Regarding not measuring the data generation time: I realize that we would get better parsing results if we didn't generate the data for each test, but this would be less realistic, because in practice we would generate the data first and then parse it. Both steps are involved in the real-world scenario, right?

No, you would not do this on every operation at runtime; the data structure would be created once. For example:

// data-structure, created once
class Main extends React.Component {
  render () { return null }
}

// X number of operations
ReactDOM.render(<Main />, document.body)
ReactDOM.render(<Main />, document.body)

Just noticed you are using "while" in one case and "for" in another, can you try making both implementations as similar as possible?

Sure, I copied your implementation of fromSingleArrayOfStrings from the ESBench to compare with it.

@kof
Member

kof commented Jun 22, 2017

What part is incorrect?

It seems correct, but very simplistic; I think jsperf did a lot of work to get better/more stable results.

No, you would not do this on every operation at runtime, the data-structure would be created once at runtime.

Yes, but it would also be parsed just once, I assume. After that it's either something static or some models.

@thysultan
Author

In the bench it's being created on every operation; since the bench runs this operation more than 10,000 times, it affects the results.

@kof
Member

kof commented Jun 22, 2017

In the bench it's being created on every operation; since the bench runs this operation more than 10,000 times, it affects the results.

You mean data generation for every test? Yes, that's what I was trying to reproduce. I assume a test case includes data generation + parsing in the real world.

@kof
Member

kof commented Jun 22, 2017

We can for sure make separate tests with and without data generation, just to be sure that data generation doesn't eat up most of the performance budget. But in the end, what matters is a test which has both data-structure creation and parsing. Basically, if data generation costs too much time but parsing is faster, it doesn't matter; what matters is which approach is faster when both are done.

@kof
Member

kof commented Jun 22, 2017

What might be really important is to do tests with more realistic data, meaning instead of 1 rule, using an average number of rules from some popular sites. Larger data sets might completely change the benchmark results.

@thysultan
Author

thysultan commented Jun 22, 2017

In the real world the data structure should probably be created just once for the lifetime of the page; the toString implementation, on the other hand, could be executed more than once.

What might be really important is to do tests with more realistic data, meaning instead of 1 rule, using an average rule amount from some popular sites.

I agree.

@kof
Member

kof commented Jun 22, 2017

In the real world the data structure should probably be created just once for the lifetime of the page; the toString implementation, on the other hand, could be executed more than once.

My assumption is that this format is intermediate, meaning it will be parsed only once and transformed into some other structure with models, optimized for updates. This is basically a structure optimized for read operations.

So even if toString is called more than once, the underlying models will handle that, not this format.

Though I might be wrong; we don't know that for sure. It's an assumption I was working with, and it would strongly depend on how the format is used in the wild.

@thysultan
Author

thysultan commented Jun 22, 2017

Meaning it will be parsed only once

Yes, but the following is what the bench does, and it is not what you want to happen:

while (i++ < 100) arr = [...] // operations

but rather

arr = []
while (i++ < 100) // operations

BTW, I removed the warmup from this bench https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html and changed the for to a while to align both implementations.

392 ops fromIntTypedArray
6 ops stylis
132 ops fromSingleArrayOfStrings

@kof
Member

kof commented Jun 22, 2017

Yes but the following is not what you want to happen, which is what the bench does.

I thought that's exactly what I want to happen, because I want each test to:

  1. declare data
  2. parse data

BTW, removed the warmup from this bench

Ok looks similar on my machine.

@kof
Member

kof commented Jun 22, 2017

If I remove data generation from each test and hoist it to the top, and suddenly one implementation becomes faster than another, it would simply mean that the data format was slowing one of them down; but that's what would happen in real life.

@thysultan
Author

thysultan commented Jun 22, 2017

I thought that's exactly what I want to happen, because I want each test to:

In a real implementation the data would be created once; the bench does not mimic this.

To compare it to React components, the bench currently looks like

class Bench extends React.Component {
  render () {
    class Data extends React.Component { 
      render() {return null}
    }

    return <Data />
  }
}

vs how it should be implemented

class Data extends React.Component { 
  render() {return null}
}
class Bench extends React.Component {
  render () {
    return <Data />
  }
}

@kof
Member

kof commented Jun 22, 2017

In a real implementation the data would be created once

Yes, and the parsing would also happen only once. That's an assumption I work with. We agree to disagree at this point.

@thysultan
Author

It has nothing to do with parsing in this case. In the first example React will destroy and recreate the Data component on every render. This is how a VM would handle it as well: the array will be created and destroyed (GC'd) on every run, which is an anti-pattern in both cases.

@kof
Member

kof commented Jun 22, 2017

We are stuck. Anybody help.

@kof
Member

kof commented Jun 22, 2017

Ok, we had a chat on Gitter and now it has become clear. The point that hadn't been clearly communicated is that the heap idea uses just one heap for the entire application, meaning all sheets will be compiled out of one heap array. That means that every cycle in the bench doesn't need to contain both the heap and the pointer map; it only needs the pointer map.

This approach will probably show a much better performance/memory footprint when there are a lot of rules/sheets.
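In other words, a sketch with made-up marker constants: one module-level heap for the whole application, and only the small per-sheet pointer maps are produced per compile cycle.

const RULE_START = 1, SELECTOR = 2, PROPERTY = 3, VALUE = 4, RULE_END = 5

// One heap shared by every sheet in the application.
const HEAP = ['body', 'color', 'red', 'h1' /* ...every unique string in the app */]

// Per sheet, only a flat pointer map into that heap is built.
const sheetA = new Uint32Array([RULE_START, 0, SELECTOR, 0, PROPERTY, 1, VALUE, 2, RULE_END, 2])
const sheetB = new Uint32Array([RULE_START, 3, SELECTOR, 3, PROPERTY, 1, VALUE, 2, RULE_END, 3])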

@geelen

geelen commented Jun 23, 2017

Yep this sounds like a reasonable optimisation, but it doesn't need to change the semantics of the format itself. As I've said before, talk of optimisation seems pretty premature if we don't even know for sure the total scope of the format.

Once we have a working reference implementation, we can look at designing some more comprehensive benchmarks to better understand the performance tradeoffs. Personally, I favour a single inline data structure because it simplifies streaming responses and handling the injection of chunks of CSS due to dynamic values. Any kind of heap structure can be built up from that at runtime if needed.

For SC, I've built a pathological performance case (a 129KB JSX file) which is a conversion of the tachyons.io homepage (64KB HTML + 83KB CSS). There's nothing dynamic and I haven't refactored the CSS at all; it's just a simple stress test on a decent-sized DOM. It might be worth doing something similar (or even converting this page, if your find/replace fu is strong) to test both the parse and serialisation phases on a few hundred components at once.

@kof
Member

kof commented Jun 23, 2017

Yeah, I totally agree that we need to understand the scope and tradeoffs better. Though I would continue the performance experiments, because performance is the primary driving force for this project, and I am also genuinely curious.

Also, I think we need to refactor the bench into a more realistic scenario like your pathological case.

@kof
Member

kof commented Jun 23, 2017

Btw, @thysultan built the "standard" generator into stylis, so now we have a tool that can take a decent amount of regular CSS and produce a standard cssinjs array for us, which we can then use for benchmarks.

@geelen

geelen commented Jun 23, 2017

Btw, @thysultan built the "standard" generator into stylis

Wait, already? Is he... a wizard?

@kof
Member

kof commented Jun 23, 2017

@thysultan can you please produce ~100kb for each variant of the bench? If you post them here I will update. For e.g.

@gaearon
Member

gaearon commented Jun 24, 2017

cc @trueadm, who's been thinking about similar things on the React side and might offer some thoughts on formats

@thysultan
Author

@kof ATM the stylis plugin only handles converting to this proposed stack/heap format; I will add support for the other formats in the bench.

@kof
Member

kof commented Jun 24, 2017 via email

@kof
Member

kof commented Jul 5, 2017

@geelen I was thinking about streamability, and, well, it can only be fully streamable without JavaScript in it, because functions etc. are no good for this. We never intended this format to be fully serialized, as it sits on top of JS. On the other hand, the heap can be transferred first, similar to HTTP headers, and then the body with all the refs. So yes, the header needs to be fully loaded before the body can be parsed in chunks. Maybe that is not so bad, considering the header would contain only unique values.
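A sketch of that "heap first, then body in chunks" shape, assuming hypothetical URLs, a hypothetical decodeChunk helper, and a module context where top-level await is available; this is not an agreed wire format, just an illustration of the ordering constraint.

// The heap is small and must be fully loaded up front, like headers.
const heap = await (await fetch('/styles/heap.json')).json()

// The body is a stream of (marker, ref) pairs and can be consumed chunk by chunk,
// because every ref already resolves against the loaded heap.
const reader = (await fetch('/styles/body.bin')).body.getReader()
let css = ''
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  css += decodeChunk(value, heap) // hypothetical decoder for a Uint8Array chunk
}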

I think if this gives us serious perf advantages, we should go for it. It will also make up a bit for the many commas and reduce the payload.

Also, I can't think of any applications for a streaming parser for this; we will most probably embed this format into JS files or into the JS bundle.

@kof kof changed the title Explore a Stack/Heap approach Optimization: a Stack/Heap approach Oct 8, 2017
@kof kof mentioned this issue Oct 8, 2017
@kof
Member

kof commented Oct 8, 2017

I think that the heap optimization could even be done upon consumption, at the build stage. Only the consumer knows the full list of styles used in the bundle and can produce a list of properties without any duplicates, just once, e.g. as a webpack plugin.
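A sketch of what such a build-time pass could do, independent of any particular bundler's plugin API; dedupeSheets and the input shape are assumptions for illustration.

// Input: per-sheet arrays with inline strings.
// Output: one shared heap plus per-sheet arrays that reference it by index.
function dedupeSheets (sheets) {
  const heap = []
  const seen = new Map()
  const intern = value => {
    if (!seen.has(value)) seen.set(value, heap.push(value) - 1)
    return seen.get(value)
  }
  const compiled = sheets.map(sheet =>
    sheet.map(entry => (typeof entry === 'string' ? intern(entry) : entry))
  )
  return { heap, compiled }
}

const { heap, compiled } = dedupeSheets([
  ['body', 'color', 'red'],
  ['h1', 'color', 'red']
])
// heap     -> ['body', 'color', 'red', 'h1']
// compiled -> [[0, 1, 2], [3, 1, 2]]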

@evan-scott-zocdoc

Reminds me a bit of atomic CSS: breaking apart the constituent parts into a unique'd lookup table of sorts. Cool technique!
