influxdb? #1
> my 11% number:
Thanks for thinking about this! The Joulescope UI downsamples to pixels today using the data_get function. It computes the data from either the summary (reduction) or the raw samples. I will definitely have this same feature as part of the JLS reader implementation.

I think that InfluxDB stores a database "row" for each entry, which is prohibitive at our sample rates. As awesome as InfluxDB is, I doubt that a general-purpose time-series database will meet all the Joulescope requirements.

The Saleae format is their export format. I have no idea what they use normally for the .sai format. However, their model is capture to RAM then save, so I suspect that they don't have the same streaming concerns that we do.

Yes, having summaries (which the existing Joulescope code calls reductions) is critical for performance. The RAM implementation does 3 summaries with downsampling at [200, 100, 50]. The existing JLS file only does a single summary with downsampling at 20,000. I will make the new file format configurable since I have no idea what the "best" answer is!
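The summaries (reductions) described above can be sketched in a few lines. This is an illustrative outline only, not the actual data_get implementation; the function names and the single-level structure are assumptions:

```python
import math

def reduce_block(samples):
    """Summarize one block of raw samples as (mean, min, max, variance)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return (mean, min(samples), max(samples), var)

def reduce_signal(samples, factor):
    """Produce one 4-field summary entry per `factor` raw samples."""
    return [reduce_block(samples[i:i + factor])
            for i in range(0, len(samples) - factor + 1, factor)]

# 1000 synthetic samples reduced by the first RAM-level factor of 200
raw = [math.sin(i / 50.0) for i in range(1000)]
level1 = reduce_signal(raw, 200)   # 5 summary entries
```

A renderer can then answer a zoomed-out request from `level1` alone instead of touching the raw samples, which is the performance win the reductions exist for.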
1 + x + x^2 + x^3 + ... = 1 / (1 - x)

is the Maclaurin series for 1 / (1 - x), so we can compute this directly. A 10% increase is then

1 / (1 - 0.1) = 1.111...

which matches the 11% increase you computed.

In order to compute statistics correctly, even on the summaries, I store mean, min, max, and variance (standard deviation), which means each summary entry is 4 times the size of the initial sample. In the case of downsampling by 10, our actual per-level ratio would be 4 / 10 = 0.4 => 1 / (1 - 0.4) = 1.667, a 66% increase. Something more on the order of 100 will likely be the "answer". However, I definitely need to do a better job benchmarking performance with this new file format.

I already know the answer to interleaving. You definitely want to store blocks of the same sample (summary) rate. The software needs to be able to quickly skip the other data. JLS files can be HUGE. I actually fixed a bug where the software was storing too many summaries for downsampled data. The performance ended up being terrible. If I had multiple summary levels, this would likely not have been an issue. I intended to have multiple summary levels in the existing JLS file format, but the file format made this prohibitive. The new file format will contain doubly-linked lists, which allow the format to scale to multiple summary levels.

I like the idea of creating some typical access profiles for benchmarking! Here are some common scenarios:
File variations:

Anything you would add?
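The 11% and 66% figures in this thread both fall out of the same geometric series, which can be checked directly. A small sketch (the function name and parameters are mine, not part of any proposed API):

```python
def storage_overhead(downsample, entry_multiplier=1.0, levels=None):
    """Extra storage, as a fraction of the raw data, for repeated summarization.

    Each summary level has 1/downsample as many entries as the level below it,
    and each entry is entry_multiplier times the size of a raw sample, so the
    per-level size ratio is x = entry_multiplier / downsample and the total
    overhead is x + x**2 + ... = x / (1 - x) with unlimited levels.
    """
    x = entry_multiplier / downsample
    if levels is None:
        return x / (1 - x)              # closed form of the geometric series
    return sum(x ** k for k in range(1, levels + 1))

print(round(storage_overhead(10), 4))        # pow10 of plain samples: 0.1111 (~11%)
print(round(storage_overhead(10, 4.0), 4))   # 4-field summary entries: 0.6667 (~66%)
```

With a downsampling factor around 100, the 4-field entries cost only 4/100 = 0.04 per level, about 4.2% total, which is why a larger factor looks attractive.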
What are the current issues with the JLS file format? Is the rendering engine fine and you want to think through version 2 JLS with feedback from us, or are you facing legit problems with big datasets that you don't yet have a solution for?

I think your categories cover everything the file access API would encounter:

Item 1: initial reduction full-read

Above this, though, there are performance issues not related to the file system (e.g., are you interpolating in the renderer or just doing coordinate transforms? how often are you scanning all reduction data to provide stats? etc.)

To "file variations" I would add:
I don't know how modern HDD/SSD caches work so it might be interesting to throw these two in there for now.
I'm confused: why does the sample file show them interleaved?
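The access-profile benchmarking discussed above could be prototyped along these lines. This is a sketch under stated assumptions (a small dummy file, fixed-size reads, and my own helper names), not a real benchmark:

```python
import os
import random
import tempfile
import time

def time_reads(path, offsets, size):
    """Read `size` bytes at each offset; return elapsed wall-clock seconds."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(size)
    return time.perf_counter() - t0

# Build a small dummy file and compare two access profiles. Illustrative only:
# a real benchmark needs files far larger than RAM to defeat the OS page cache.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))            # 1 MiB of noise
    path = f.name

block = 4096
sequential = list(range(0, (1 << 20) - block, block))
random.seed(0)
scattered = random.sample(sequential, len(sequential))

t_seq = time_reads(path, sequential, block)    # "initial full-read" profile
t_rand = time_reads(path, scattered, block)    # "user jumping around" profile
os.unlink(path)
```

The same harness could replay other profiles (zoom-in sequences, pan sequences) once they are written down as offset lists.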
The JLS file format is alright, but we do have a number of outstanding issues related to the JLS v1 format:
These issues, combined with the belief that I can drastically improve performance with a rewrite, are what motivates this v2 format. I think that I have a handle on all of this, but I definitely want to give you the opportunity to participate since Steve plans on using JLS for everything. I am happy to have you participate as much or as little as you want. I am not sure what you use for your current data storage requirements, but this v2 format could also help address any issues your software may have.

Good point on trying to figure out the renderer impact. How the UI chooses to use this access is very important to performance.

Thanks for the additional file variations. I hadn't thought of simultaneous read file access. Now that you mention it, I can think of some interesting use cases where that could be nice. Yes, single signal per file vs. multiple signals per file would be good to consider. I really like one file for one capture, but it doesn't have to be that way if performance is vastly different.

Yes, they are interleaved by chunks. They are not interleaved sample-by-sample. A chunk can hold lots of samples, which will be configurable. The existing v1 format holds 400,000 samples per chunk, but all 6 fixed signals. The new v2 chunk will hold samples of a single signal.
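The chunk-interleaving and doubly-linked-list ideas above can be pictured with a toy model. Every field name and offset here is hypothetical, chosen only to illustrate how a reader skips interleaved chunks it does not care about:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical v2-style chunk: samples of one signal at one summary level."""
    signal_id: int
    level: int             # 0 = raw samples, 1+ = successive summary levels
    sample_start: int      # index of the first sample covered by this chunk
    payload: bytes
    prev_offset: int = -1  # file offset of the previous same-signal/level chunk
    next_offset: int = -1  # file offset of the next same-signal/level chunk

def walk_level(chunks_by_offset, start_offset):
    """Follow next_offset links to visit one signal/level in order, skipping
    whatever chunks of other signals or levels are interleaved between them."""
    out, off = [], start_offset
    while off != -1:
        chunk = chunks_by_offset[off]
        out.append(chunk)
        off = chunk.next_offset
    return out

# Three chunks of signal 0 / level 0 at file offsets 0, 300, and 700; chunks
# of other signals or levels would occupy the gaps without affecting the walk.
chunks = {
    0:   Chunk(0, 0, 0,       b"", next_offset=300),
    300: Chunk(0, 0, 400_000, b"", prev_offset=0, next_offset=700),
    700: Chunk(0, 0, 800_000, b"", prev_offset=300),
}
starts = [c.sample_start for c in walk_level(chunks, 0)]
```

Because each level forms its own linked list, adding more summary levels does not slow down traversal of any single level, which is the scaling property the doubly-linked lists are meant to buy.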
I tend toward separate files for separate data; I'm not a metadata-as-kitchen-sink kind of programmer (unless we're creating an actual database), so most of those features aren't that interesting to me: e.g., the metadata file contains pointers to the low-level data, not the other way around.

I do like the idea of a multi-scale file format to aid faster rendering, as I would someday like to add a faster visualization widget. I expect I'll write an export-to-JLS 2.0 function so that folks like Steve can export my data to your viewer and combine it with his own PyJoulescope experiments. I can probably also provide the TypeScript/ES6 version of the API to this repo at some point so we can benchmark across multiple languages. :)
Hi Matt,
I encountered this very issue once I started using the Joulescope, and after Steve suggested the benchmark run for 10 minutes. ;) (My solution was to blow up memory and downsample to max/avg/min based on pixels vs. samples... fast but eats a LOT of memory).
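The "downsample to max/avg/min based on pixels vs. samples" approach mentioned above can be sketched as follows. This is my own minimal version, not the code referenced in the comment:

```python
def bin_for_pixels(samples, pixels):
    """Collapse samples into at most `pixels` (min, mean, max) columns so the
    plot can draw an envelope instead of touching every sample."""
    n = len(samples)
    if n <= pixels:                  # fewer samples than pixels: plot directly
        return [(s, s, s) for s in samples]
    out = []
    for p in range(pixels):
        seg = samples[p * n // pixels:(p + 1) * n // pixels]
        out.append((min(seg), sum(seg) / len(seg), max(seg)))
    return out

columns = bin_for_pixels(list(range(100)), 10)   # 10 columns of 10 samples each
```

The memory cost the comment mentions comes from holding all the raw samples in RAM so the binning can be recomputed at every zoom level; the file-resident summaries discussed in this thread avoid exactly that.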
My gut tells me this is not a new problem, and I see you've done some prelim research. InfluxDB looks to be the most attuned to this problem, but I don't fully understand their strategy yet. ... Oh, and the Saleae format is not what I expected, but then again, I have a Saleae-16 and it is extreeeeeemly slow when plotting analog data, like, "go make a coffee and come back later" slow.
I would definitely sacrifice disk space for performance. I think storing lower-resolution data in the file makes sense (like progressive JPEG): choosing the levels of resolution to store, as well as deciding to intersperse the finer data (as you have shown) vs grouping them is going to be a function of media seek time.
If you did pow10 subsampling down to one sample, the final file would only be 11% larger and that would include all ranges.
It would be interesting to see what is faster to render based on user demand: interleaving samples, or grouping the views together by time scale. My gut (again) says the latter may be faster since it only requires one seek per resolution change, but if the data is buffered and the user is changing scales quickly, mixing timescales might be more performant.
Maybe we need to make some huge dummy files and create some user access profiles, i.e., is there a UX expectation?
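A huge-dummy-file generator for such experiments could start as simply as this. The function name, chunk size default, and synthetic waveform are all assumptions for illustration:

```python
import math
import os
import struct

def write_dummy_capture(path, n_samples, chunk_samples=400_000):
    """Write n_samples of a synthetic float32 waveform in fixed-size chunks,
    roughly mimicking a long single-signal capture for access-profile tests."""
    with open(path, "wb") as f:
        for start in range(0, n_samples, chunk_samples):
            count = min(chunk_samples, n_samples - start)
            values = (math.sin(2 * math.pi * (start + i) / 1000.0)
                      for i in range(count))
            f.write(struct.pack(f"<{count}f", *values))
    return path

# 10k samples in 4096-sample chunks -> 40 kB of float32 data
path = write_dummy_capture("dummy_capture.bin", 10_000, chunk_samples=4096)
size = os.path.getsize(path)
os.remove(path)
```

Scaling n_samples to billions gives the multi-GB files needed to make the UX expectations (seek latency, scroll smoothness) measurable rather than guessed.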