influxdb? #1
> my 11% number:
Thanks for thinking about this! The Joulescope UI downsamples to pixels today using the data_get function. It computes the data from either the summary (reduction) or the raw samples. I will definitely have this same feature as part of the JLS reader implementation.

I think that InfluxDB stores a database "row" for each entry, which is prohibitive at our sample rates. As awesome as InfluxDB is, I doubt that a general-purpose time-series database will meet all the Joulescope requirements.

The Saleae format is their export format. I have no idea what they use normally for the .sai format. However, their model is capture to RAM then save, so I suspect that they don't have the same streaming concerns that we do.

Yes, having summaries (which the existing Joulescope code calls reductions) is critical for performance. The RAM implementation does 3 summaries with downsampling at [200, 100, 50]. The existing JLS file only does a single summary with downsampling at 20,000. I will make the new file format configurable since I have no idea what the "best" answer is!
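The summaries (reductions) described above can be sketched in a few lines. This is an illustrative outline only, not the actual data_get implementation; the function names and the single-level structure are assumptions:

```python
import math

def reduce_block(samples):
    """Summarize one block of raw samples as (mean, min, max, variance)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return (mean, min(samples), max(samples), var)

def reduce_signal(samples, factor):
    """Produce one 4-field summary entry per `factor` raw samples."""
    return [reduce_block(samples[i:i + factor])
            for i in range(0, len(samples) - factor + 1, factor)]

# 1000 synthetic samples reduced by the first RAM-level factor of 200
raw = [math.sin(i / 50.0) for i in range(1000)]
level1 = reduce_signal(raw, 200)   # 5 summary entries
```

A renderer can then answer a zoomed-out request from `level1` alone instead of touching the raw samples, which is the performance win the reductions exist for.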
1 + x + x^2 + x^3 + ... = 1 / (1 - x)

is the Maclaurin series for 1 / (1 - x), so we can compute this directly. A 10% increase is then

1 / (1 - 0.1) = 1.111...

which matches the 11% increase you computed.

In order to compute statistics correctly, even on the summaries, I store mean, min, max, and variance (standard deviation), which means each summary entry is 4 times the size of the initial sample. In the case of downsampling by 10, our actual per-level ratio would be 4 / 10 = 0.4 => 1 / (1 - 0.4) = 1.667, a 66% increase. Something more on the order of 100 will likely be the "answer". However, I definitely need to do a better job benchmarking performance with this new file format.

I already know the answer to interleaving. You definitely want to store blocks of the same sample (summary) rate. The software needs to be able to quickly skip the other data. JLS files can be HUGE. I actually fixed a bug where the software was storing too many summaries for downsampled data. The performance ended up being terrible. If I had multiple summary levels, this would likely not have been an issue. I intended to have multiple summary levels in the existing JLS file format, but the file format made this prohibitive. The new file format will contain doubly-linked lists, which allow the format to scale to multiple summary levels.

I like the idea of creating some typical access profiles for benchmarking! Here are some common scenarios:
File variations:

Anything you would add?
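The 11% and 66% figures in this thread both fall out of the same geometric series, which can be checked directly. A small sketch (the function name and parameters are mine, not part of any proposed API):

```python
def storage_overhead(downsample, entry_multiplier=1.0, levels=None):
    """Extra storage, as a fraction of the raw data, for repeated summarization.

    Each summary level has 1/downsample as many entries as the level below it,
    and each entry is entry_multiplier times the size of a raw sample, so the
    per-level size ratio is x = entry_multiplier / downsample and the total
    overhead is x + x**2 + ... = x / (1 - x) with unlimited levels.
    """
    x = entry_multiplier / downsample
    if levels is None:
        return x / (1 - x)              # closed form of the geometric series
    return sum(x ** k for k in range(1, levels + 1))

print(round(storage_overhead(10), 4))        # pow10 of plain samples: 0.1111 (~11%)
print(round(storage_overhead(10, 4.0), 4))   # 4-field summary entries: 0.6667 (~66%)
```

With a downsampling factor around 100, the 4-field entries cost only 4/100 = 0.04 per level, about 4.2% total, which is why a larger factor looks attractive.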
What are the current issues with the JLS file format? Is the rendering engine fine and you want to think through version 2 JLS with feedback from us, or are you facing legit problems with big datasets that you don't yet have a solution for?

I think your categories cover everything the file access API would encounter:

Item 1: initial reduction full-read

Above this, though, there are performance issues not related to the file system (e.g., are you interpolating in the renderer or just doing coordinate transforms? how often are you scanning all reduction data to provide stats? etc.)

To "file variations" I would add:
I don't know how modern HDD/SSD caches work so it might be interesting to throw these two in there for now.
I'm confused: why does the sample file show them interleaved?
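The access-profile benchmarking discussed above could be prototyped along these lines. This is a sketch under stated assumptions (a small dummy file, fixed-size reads, and my own helper names), not a real benchmark:

```python
import os
import random
import tempfile
import time

def time_reads(path, offsets, size):
    """Read `size` bytes at each offset; return elapsed wall-clock seconds."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(size)
    return time.perf_counter() - t0

# Build a small dummy file and compare two access profiles. Illustrative only:
# a real benchmark needs files far larger than RAM to defeat the OS page cache.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))            # 1 MiB of noise
    path = f.name

block = 4096
sequential = list(range(0, (1 << 20) - block, block))
random.seed(0)
scattered = random.sample(sequential, len(sequential))

t_seq = time_reads(path, sequential, block)    # "initial full-read" profile
t_rand = time_reads(path, scattered, block)    # "user jumping around" profile
os.unlink(path)
```

The same harness could replay other profiles (zoom-in sequences, pan sequences) once they are written down as offset lists.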
The JLS file format is alright, but we do have a number of outstanding issues related to the JLS v1 format:
These issues, combined with the belief that I can drastically improve performance with a rewrite, are what motivates this v2 format. I think that I have a handle on all of this, but I definitely want to give you the opportunity to participate since Steve plans on using JLS for everything. I am happy to have you participate as much or as little as you want. I am not sure what you use for your current data storage requirements, but this v2 format could also help address any issues your software may have.

Good point on trying to figure out the renderer impact. How the UI chooses to use this access is very important to performance.

Thanks for the additional file variations. I hadn't thought of simultaneous read file access. Now that you mention it, I can think of some interesting use cases where that could be nice. Yes, single signal per file vs. multiple signals per file would be good to consider. I really like one file for one capture, but it doesn't have to be that way if performance is vastly different.

Yes, they are interleaved by chunks. They are not interleaved sample-by-sample. A chunk can hold lots of samples, which will be configurable. The existing v1 format holds 400,000 samples per chunk, but all 6 fixed signals. The new v2 chunk will hold samples of a single signal.
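The chunk-interleaving and doubly-linked-list ideas above can be pictured with a toy model. Every field name and offset here is hypothetical, chosen only to illustrate how a reader skips interleaved chunks it does not care about:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical v2-style chunk: samples of one signal at one summary level."""
    signal_id: int
    level: int             # 0 = raw samples, 1+ = successive summary levels
    sample_start: int      # index of the first sample covered by this chunk
    payload: bytes
    prev_offset: int = -1  # file offset of the previous same-signal/level chunk
    next_offset: int = -1  # file offset of the next same-signal/level chunk

def walk_level(chunks_by_offset, start_offset):
    """Follow next_offset links to visit one signal/level in order, skipping
    whatever chunks of other signals or levels are interleaved between them."""
    out, off = [], start_offset
    while off != -1:
        chunk = chunks_by_offset[off]
        out.append(chunk)
        off = chunk.next_offset
    return out

# Three chunks of signal 0 / level 0 at file offsets 0, 300, and 700; chunks
# of other signals or levels would occupy the gaps without affecting the walk.
chunks = {
    0:   Chunk(0, 0, 0,       b"", next_offset=300),
    300: Chunk(0, 0, 400_000, b"", prev_offset=0, next_offset=700),
    700: Chunk(0, 0, 800_000, b"", prev_offset=300),
}
starts = [c.sample_start for c in walk_level(chunks, 0)]
```

Because each level forms its own linked list, adding more summary levels does not slow down traversal of any single level, which is the scaling property the doubly-linked lists are meant to buy.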
I tend toward separate files for separate data; I'm not a metadata-as-kitchen-sink kind of programmer (unless we're creating an actual database), so most of those features aren't that interesting to me: e.g., the metadata file contains pointers to the low-level data, not the other way around.

I do like the idea of a multi-scale file format to aid faster rendering, as I would someday like to add a faster visualization widget. I expect I'll write an export-to-JLS 2.0 function so that folks like Steve can export my data to your viewer and combine it with his own PyJoulescope experiments. I can probably also provide the TypeScript/ES6 version of the API to this repo at some point so we can benchmark across multiple languages. :)
Hi Matt,
I encountered this very issue once I started using the Joulescope, and after Steve suggested the benchmark run for 10 minutes. ;) (My solution was to blow up memory and downsample to max/avg/min based on pixels vs. samples... fast but eats a LOT of memory).
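The "downsample to max/avg/min based on pixels vs. samples" approach mentioned above can be sketched as follows. This is my own minimal version, not the code referenced in the comment:

```python
def bin_for_pixels(samples, pixels):
    """Collapse samples into at most `pixels` (min, mean, max) columns so the
    plot can draw an envelope instead of touching every sample."""
    n = len(samples)
    if n <= pixels:                  # fewer samples than pixels: plot directly
        return [(s, s, s) for s in samples]
    out = []
    for p in range(pixels):
        seg = samples[p * n // pixels:(p + 1) * n // pixels]
        out.append((min(seg), sum(seg) / len(seg), max(seg)))
    return out

columns = bin_for_pixels(list(range(100)), 10)   # 10 columns of 10 samples each
```

The memory cost the comment mentions comes from holding all the raw samples in RAM so the binning can be recomputed at every zoom level; the file-resident summaries discussed in this thread avoid exactly that.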
My gut tells me this is not a new problem, and I see you've done some prelim research. InfluxDB looks to be the most attuned to this problem, but I don't fully understand their strategy yet. ... Oh, and the Saleae format is not what I expected, but then again, I have a Saleae-16 and it is extreeeeeemly slow when plotting analog data, like, "go make a coffee and come back later" slow.
I would definitely sacrifice disk space for performance. I think storing lower-resolution data in the file makes sense (like progressive JPEG): choosing the levels of resolution to store, as well as deciding to intersperse the finer data (as you have shown) vs grouping them is going to be a function of media seek time.
If you did pow10 subsampling down to one sample, the final file would only be 11% larger and that would include all ranges.
It would be interesting to see what is faster to render based on user demand: interleaving samples, or grouping the views together by time scale. My gut (again) says the latter may be faster since it only requires one seek per resolution change, but if the data is buffered and the user is changing scales quickly, mixing timescales might be more performant.
Maybe we need to make some huge dummy files and create some user access profiles, i.e., is there a UX expectation?
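A huge-dummy-file generator for such experiments could start as simply as this. The function name, chunk size default, and synthetic waveform are all assumptions for illustration:

```python
import math
import os
import struct

def write_dummy_capture(path, n_samples, chunk_samples=400_000):
    """Write n_samples of a synthetic float32 waveform in fixed-size chunks,
    roughly mimicking a long single-signal capture for access-profile tests."""
    with open(path, "wb") as f:
        for start in range(0, n_samples, chunk_samples):
            count = min(chunk_samples, n_samples - start)
            values = (math.sin(2 * math.pi * (start + i) / 1000.0)
                      for i in range(count))
            f.write(struct.pack(f"<{count}f", *values))
    return path

# 10k samples in 4096-sample chunks -> 40 kB of float32 data
path = write_dummy_capture("dummy_capture.bin", 10_000, chunk_samples=4096)
size = os.path.getsize(path)
os.remove(path)
```

Scaling n_samples to billions gives the multi-GB files needed to make the UX expectations (seek latency, scroll smoothness) measurable rather than guessed.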