Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance anomaly and possible related bug #6

Closed
lwdye opened this issue Apr 21, 2016 · 11 comments
Closed

performance anomaly and possible related bug #6

lwdye opened this issue Apr 21, 2016 · 11 comments

Comments

@lwdye
Copy link

lwdye commented Apr 21, 2016

Great package!

I've coded and benchmarked two functions that decode a ByteString of simple packed encoded structures. { -O2 }

decodeMessages' :: Int -> S.ByteString -> V.Vector Trade
decodeMessages' n b = V.generate n (\i -> decodeEx (S.drop (i * tradeTotalBytes ) b ) :: Trade)

-- | Peek a sequenced unit.
decodeMessages'' :: Int -> S.ByteString -> V.Vector Trade
decodeMessages'' n  = decodeExWith (replicateM n (Data.Store.peek :: Peek Trade))

The first function is about 2-3x faster. Question 1: why? I would have guessed that the second would be faster.
Question 2: If you uncomment the line

--           ,  bench "1000'' " $ whnf functionToTime100'' 1000

You get
benchmarking instance/Store/1000''
sbug: PeekException 0 "Attempted to read too many bytes for Int. Needed 8, but only 0 remain."

My guess is that replicateM is overflowing a stack with 1000...but that seems bad and also the error message might be misleading.

Question 3: Any suggestions on making this a fast as possible with Store?
Question 4: In general if you want to encode and decode N simple structures without length field what is recommended approach?

Thanks!

{-# LANGUAGE DefaultSignatures     #-}
{-# LANGUAGE DeriveGeneric         #-}
{-# LANGUAGE FlexibleContexts      #-}
{-# LANGUAGE GADTs                 #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE OverloadedStrings     #-}
{-# LANGUAGE TypeOperators         #-}

import           Control.Monad         (join, replicateM_, (>=>))
import qualified Data.ByteString.Char8 as S
--import Data.Int (Int64)
import           Data.Monoid           ((<>))
-- import System.IO.Streams (InputStream)
import           Prelude               hiding (head)
import           System.IO
import           System.IO.Streams
import qualified System.IO.Streams     as Streams

import           Data.Store
import           Data.Vector           as V
import           GHC.Generics

import           Data.Int
import           System.TimeIt
import           Control.DeepSeq
import           Criterion.Main

benchmarks :: [Benchmark]
benchmarks =
  [
       bgroup "Store"
          [
               bench "10' " $ whnf  functionToTime10' 10
            ,  bench "100' " $ whnf functionToTime100' 100
            ,  bench "1000' " $ whnf functionToTime100' 1000
            ,  bench "10'' " $ whnf functionToTime10'' 10
            ,  bench "100'' " $ whnf functionToTime100'' 100
 --           ,  bench "1000'' " $ whnf functionToTime100'' 1000
          ]
  ]

main :: IO ()
main = --do
        defaultMain
        [
          bgroup "instance" benchmarks
        ]

-- import Control.Monad

data Trade = Trade !Int !Int !Int  deriving (Show, Generic)
instance Store Trade
instance NFData Trade

-- -- | Peek a sequenced unit.
-- peekVectorN :: Store a => Int -> Peek (Vector a)
-- peekVectorN n = replicateM n (Data.Store.peek :: Store a => Peek a)

tradeTotalBytes::Int
tradeTotalBytes = S.length $! encode (Trade 1 2 3)

generateTradeByteStringN::Int->S.ByteString
generateTradeByteStringN n = S.concat $ Prelude.map (\x -> encode (Trade x 2 3) ) [1..n]

decodeMessages' :: Int -> S.ByteString -> V.Vector Trade
decodeMessages' n b = V.generate n (\i -> decodeEx (S.drop (i * tradeTotalBytes ) b ) :: Trade)

-- | Peek a sequenced unit.
decodeMessages'' :: Int -> S.ByteString -> V.Vector Trade
decodeMessages'' n  = decodeExWith (replicateM n (Data.Store.peek :: Peek Trade))

-- | Peek a sequenced unit.
peekVectorMessage :: Int -> Peek (Vector Trade)
peekVectorMessage n = replicateM n (Data.Store.peek :: Peek Trade)

b10 = Control.DeepSeq.force  $ generateTradeByteStringN 10
b100 = Control.DeepSeq.force  $ generateTradeByteStringN 100
b1000 = Control.DeepSeq.force  $ generateTradeByteStringN 1000

functionToTime10'::Int -> Int
functionToTime10' n = V.length $ V.force $ decodeMessages' n b10

functionToTime100'::Int -> Int
functionToTime100' n = V.length $ V.force $ decodeMessages' n b100

functionToTime1000'::Int -> Int
functionToTime1000' n = V.length $ V.force $ decodeMessages' n b1000

functionToTime10''::Int -> Int
functionToTime10'' n = V.length $ V.force $ decodeMessages'' n b10

functionToTime100''::Int -> Int
functionToTime100'' n = V.length $ V.force $ decodeMessages'' n b100

functionToTime1000''::Int -> Int
functionToTime1000'' n = V.length $ V.force $ decodeMessages'' n b1000
@mgsloan
Copy link
Owner

mgsloan commented Apr 21, 2016

Hey Les! Thanks, it's been quite interesting to work on it.

The decode error is because of using whnf functionToTime100'' 1000 instead of whnf functionToTime1000'' 1000

I think the performance difference is two things:

  1. Decoding actually isn't getting forced for the single tick functions. V.force only forces the vector, but not the things it contains. The docs clarify that this is to allow you to remove sharing in cases where you've got a small chunk of a large vector. Not a great name for the function..

I figured this out by doing

import Debug.Trace (trace)

debug x = trace (show x) x

decodeMessages' n b = V.generate n (\i -> decodeEx (debug (S.drop (i * tradeTotalBytes ) b )) :: Trade)

Since no trace messages were emitted, I knew that the value was never being demanded, and this is why we don't get any decode exceptions. It ran so quickly because it just created a vector of thunks. This may well be an interesting thing to explicitly support in Store. It could cause input memory leaks, but could also be quite handy for things that only need to decode part of their input.

  1. replicateM isn't well optimized for vector. See Foldable instance for Vector uses slow default implementations for many methods haskell/vector#112 - which I discovered while implementing store instances for vector.

Note that store already has instances for Vector, so you can peek and poke them directly. I should probably use V.generate instead of the current approach, but I bet the performance is the same. I'll give it a shot!

@lwdye
Copy link
Author

lwdye commented Apr 21, 2016

Hey Michael,

Thanks. That explains the performance difference. My case is always existing binary that is close to the generated (Vector a) but without the beginning length field. Almost need an decodeRawEx or something. Given that how would you code decodeMessages’ ?

Thanks,
Les

From: Michael Sloan
Reply-To: fpco/store
Date: Thursday, April 21, 2016 at 4:50 PM
To: fpco/store
Cc: Lester Dye, Author
Subject: Re: [fpco/store] performance anomaly and possible related bug (#6)

Hey Les! Thanks, it's been quite interesting to work on it.

I think the performance difference is two things:

  1. Decoding actually isn't getting forced for the single tick functions. V.force only forces the vector, but not the things it contains. The docs clarify that this is to allow you to remove sharing in cases where you've got a small chunk of a large vector. Not a great name for the function..

I figured this out by doing

import Debug.Trace (trace)

debug x = trace (show x) x

decodeMessages' n b = V.generate n (\i -> decodeEx (debug (S.drop (i * tradeTotalBytes ) b )) :: Trade)

Since no trace messages were emitted, I knew that the value was never being demanded. It ran so quickly because it just created a vector of thunks. This may well be an interesting thing to explicitly support in Store. It could cause input memory leaks, but could also be quite handy for things that only need to decode part of their input.

  1. 'replicateM' isn't well optimized for vector. See Foldable instance for Vector uses slow default implementations for many methods haskell/vector#112Foldable instance for Vector uses slow default implementations for many methods haskell/vector#112 - which I discovered while implementing store instances for vector.

Note that store already has instances for Vector, so you can peek and poke them directly. I should probably use V.generate instead of the current approach, but I bet the performance is the same. I'll give it a shot!


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/6#issuecomment-213150032

@mgsloan
Copy link
Owner

mgsloan commented Apr 21, 2016

Note that I edited my comment on github to also explain that the decoding error is from using whnf functionToTime100'' 1000 instead of whnf functionToTime1000'' 1000. The decode error doesn't happen in the single prime benchmark because no actual decoding is performed, due to laziness.

My case is always existing binary that is close to the generated (Vector a) but without the beginning length field. Almost need an decodeRawEx or something. Given that how would you code decodeMessages’ ?

I'd define it pretty much the way you are defining it in decodeMessages''

I was wrong about the replicateM, I thought you were using the one from Traversable, not the one from vector. To make this more polymorphic (but still fast when given a specific type), use replicateM from mono-traversable.

Not sure what decodeRawEx would look like. Perhaps you want decodeExWithOffset? This allows you to decode starting at an offset.

@mgsloan
Copy link
Owner

mgsloan commented Apr 22, 2016

I tried switching to the V.replicateM definition, and got a major performance hit (5x slower! argh!). haskell/vector#113

I suggest something like this definition:

import qualified Data.Vector.Mutable as MV
import qualified Data.Vector as V

peekVectorOfLength
    :: Int
    -> Peek a
    -> Peek (V.Vector a)
peekVectorOfLength n f = do
    mut <- liftIO (MV.new n)
    forM_ [0..n-1] $ \i -> f >>= liftIO . write mut i
    V.unsafeFreeze mut
{-# INLINE peekVectorOfLength #-}

@lwdye
Copy link
Author

lwdye commented Apr 22, 2016

Thanks. Yes just checking and the decodeMessages’ is just broken and when it is called gives an exception. (sorry Late night coding ;)… Yes the ‘’ version does work and I will try the mono-traversable version.

I would love to have decodeExWithOffset but it does not quite solve this problem (it solves another I have). In my case I know the number of messages ahead of the decodeEx call.

In this case and for almost all financial exchange data we have LE binary

[Header bytes count x x ][mid][Msg][mid][Msg][Header bytes count x x][mid][Msg][mid][Msg]… Nasdaq, Bats, NYSE etc very similar. The header and each message easily defined with Store.

Consider,

Data Trade = !Int !Int !Int , this is a good model for the Msg part a Vector Int gives ~ Length Int Int Int But existing skips the Length field as the is in the header record. So I read the header record and know exactly how many “Trade” records to read.

I need something that super efficiently decodes N (Peek a) just like decodeMessages’ “decodeExWithOffsetN” and “decodeExWithN”. I’m not familiar with handle internals but it looked like it might be simple to provide handle support as well. The following idea isn’t cooked but here goes

hGlimpseEx :: Store a => Handle -> a
hGlimpseExN :: Store a => Int->Handle -> m a
hGlimpseExWith :: Peek a -> Handle -> a
hGlimpseExWithN :: Peek a -> Int -> Handle ->m a
hGlimpseExWithOffset :: Peek a -> Handle -> (Offset,a)
hGlimpseExWithOffsetN :: Peek a -> Int->Handle -> (Offset,m a)

bGlimpseEx :: Store a => BS.ByteString -> a
bGlimpseExN :: Store a => Int->BS.ByteString -> m a
bGlimpseExWith :: Peek a -> BS.ByteString -> a
bGlimpseExWithN :: Peek a ->Int -> BS.ByteString -> m a
bGlimpseExWithOffsetN :: Peek a -> Int-> BS.ByteString -> (Offset,m a)

(replace Glimpse with Decode for proper naming). Handle would be very handy as often coming from handle into byte string via S.hGetSome or such then using Store.

bDecodeExN 10 bytes :: Vector Foo — assumes 10 foo payloads but does not look for a leading length
bDecodeEx bytes :: Vector Foo — assumes length and 10 Foo payloads

Thoughts?

From: Michael Sloan
Reply-To: fpco/store
Date: Thursday, April 21, 2016 at 5:20 PM
To: fpco/store
Cc: Lester Dye, Author
Subject: Re: [fpco/store] performance anomaly and possible related bug (#6)

Note that I edited my comment on github to also explain that the decoding error is from using whnf functionToTime100'' 1000 instead of whnf functionToTime1000'' 1000. The decode error doesn't happen in the single prime benchmark because no actual decoding is performed, due to laziness.

My case is always existing binary that is close to the generated (Vector a) but without the beginning length field. Almost need an decodeRawEx or something. Given that how would you code decodeMessages’ ?

I'd define it pretty much the way you are defining it in decodeMessages''

I was wrong about the replicateM, I thought you were using the one from Traversable, not the one from vector. To make this more polymorphic (but still fast when given a specific type), use replicateM from mono-traversablehttps://hackage.haskell.org/package/mono-traversable-0.10.2/docs/Data-Sequences.html#v:replicateM.

Not sure what decodeRawEx would look like. Perhaps you want decodeExWithOffset? This allows you to decode starting at an offset.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/6#issuecomment-213156702

@lwdye
Copy link
Author

lwdye commented Apr 22, 2016

Thanks! I’ll give that a try.

@lwdye
Copy link
Author

lwdye commented Apr 22, 2016

Perhaps in general?

replicatePeek::Store a => Int -> Peek a -> Peek ( m a )

@mgsloan
Copy link
Owner

mgsloan commented Apr 22, 2016

The question is, how do we create the m? I do have a variety of generic utilities for the case where length is stored:

peekSequence :: (IsSequence t, Store (Element t), Index t ~ Int) => Peek t

peekSet :: (IsSet t, Store (Element t)) => Peek t

peekMap
    :: (Store (ContainerKey t), Store (MapValue t), IsMap t)
    => Peek t

peekMutableSequence
    :: Store a
    => (Int -> IO r)
    -> (r -> Int -> a -> IO ())
    -> Peek r

I can provide equivalents of such utilities that take a known length, if that is convenient.

@lwdye
Copy link
Author

lwdye commented Apr 22, 2016

Yes, that would be a very clean solution. Thanks!

@lwdye
Copy link
Author

lwdye commented Apr 24, 2016

Hey Michael,

_peek n = V.replicateM n (peek :: Store.Peek Foo)

data Foo = Trade !Int !Int !Int deriving (Show, Generic)
instance Store Foo

Using generated Peek (Vector Foo)
theVector b = Data.Store.decodeEx b

In one case and

Data.Store.decodeExWith _peek

We get (192/258 ns) and (14.7 / 21.5 micro-s) for n = 10 and 1000. V.replicateM is in the ballpark (but slower) that your internal approach. I’m not sure how that jives with the V.replicateM result you got earlier but thought you might find this interesting.

Also, by way of sanity check and advertisement for the package documentation it might be interesting to create C version of _peek just as a reference for a simple example or two and to show relative times. Perhaps there are standard benchmarks against C already.

Cheers,
Les

From: Michael Sloan
Reply-To: fpco/store
Date: Thursday, April 21, 2016 at 6:35 PM
To: fpco/store
Cc: Lester Dye, Author
Subject: Re: [fpco/store] performance anomaly and possible related bug (#6)

The question is, how do we create the m? I do have a variety of generic utilities for the case where length is stored:

peekSequence :: (IsSequence t, Store (Element t), Index t ~ Int) => Peek t

peekMutableSequence
:: Store a
=> (Int -> IO r)
-> (r -> Int -> a -> IO ())
-> Peek r
peekMutableSequence new write = do

I can provide equivalents of such utilities that take a known length, if that is convenient.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHubhttps://github.com//issues/6#issuecomment-213178006

@mgsloan
Copy link
Owner

mgsloan commented May 31, 2016

Closing this in favor of the summarization in #40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants