Streaming improvements #14

Merged: vsonnier merged 7 commits from streaming_improvements into master on Mar 4, 2019
Conversation

@vsonnier (Contributor) commented Feb 24, 2019

Hello Charles, @zuckschwerdt and @guruofquality.

Following suggestions and tests by @zuckschwerdt, I've built on them by opening this PR to add improvements of my own together with the others.
@guruofquality, I don't know whether @zuckschwerdt can push to this PR?

Anyway, here are the improvements:

  • Completed the removal of the fill thread initiated by @zuckschwerdt: in the end there is no need for a buffer mutex at all, atomic counters on the buffer sizes are enough. This is even simpler than SoapySDRPlay, since iio_buffer has its own buffer management.
  • Enabled "direct copy" for CS8 and CF32 (the format used by CubicSDR): since the endianness is the same, you only have to scale and cast the int16_t samples into the destination format manually, and don't need iio_channel_convert at all (see the sketch after this list).
  • Removed the LUT for CF32, for simplification. Given that SoapySDRPlay and others don't have one, I guess the potential performance gain is not worth the additional code.
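
A minimal sketch of what such a scale-and-cast loop can look like (the function name and the 2048 full-scale divisor are illustrative assumptions, not the exact module code; the right divisor is debated further down this thread):

    // Hypothetical tight conversion loop: interleaved int16_t I/Q -> CF32,
    // with no iio_channel_convert(): same endianness, so scale and cast directly.
    static void convert_cs16_to_cf32(const int16_t *src, float *dst, size_t n)
    {
        const float scale = 1.0f / 2048.0f; // assuming 12-bit samples, LSB-aligned
        for (size_t i = 0; i < n; i++)
            dst[i] = float(src[i]) * scale; // simple enough for the compiler to vectorize
    }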

Of course, this is a work in progress:

  • I've not touched the TX part,
  • It may be entirely wrong,
  • I have not found the sweet spot for MTU vs. set_buffer_size vs. sample rate, and so far have only managed 5 Msps on CubicSDR without crackling sound.
  • Maybe the refill_thread will be needed after all, to pre-buffer the samples, if the refill_buffer op turns out to be too slow. I hope not.

@vsonnier (Contributor, Author):

BTW, no crash when changing the sample rate now on my side. It may not be perfect, but there's definitely a "less code, less bugs" trend here.

@zuckschwerdt (Member):

Let me put in the bug fixes to buf first. Then we have a clean base for the bigger restructuring.

@vsonnier (Contributor, Author) commented Feb 24, 2019

@zuckschwerdt No problem, do as you wish, I can always re-merge my changes later.

In the meantime, I've measured the readStream duration on the CubicSDR side, to see whether iio_buffer_refill() is the bottleneck or not.
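
For reference, a minimal sketch of such a measurement (device, stream and buffer setup are elided; std::chrono is standard C++11 and the helper name is illustrative):

    #include <SoapySDR/Device.hpp>
    #include <chrono>

    // Time one readStream() call in milliseconds.
    static long long timed_read(SoapySDR::Device *dev, SoapySDR::Stream *rx,
                                void *const *buffs, size_t numElems)
    {
        int flags = 0;
        long long timeNs = 0;
        auto t0 = std::chrono::steady_clock::now();
        dev->readStream(rx, buffs, numElems, flags, timeNs);
        auto t1 = std::chrono::steady_clock::now();
        // ~0 ms when samples were already buffered, ~10 ms when a refill was needed
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }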

What I measure with the current code, having buffer_size = 8 * MTU at a sample rate of 10 Msps:

  • SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times out of 8. This means the encoding is fast enough :)
  • SoapySDR::Device::readStream takes 10-12 ms 1 time out of 8, i.e. the time it needs to call iio_buffer_refill().
    ==> So it takes 10 ms to fetch 65568 samples, i.e. a 6.5 ms long sample block at 10 Msps.
    ==> This is slower than realtime :(.
    The good news is that another refill_thread wouldn't help in beating realtime.

Another attempt at 5 Msps:

  • SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times out of 8.
  • SoapySDR::Device::readStream takes 13 ms 1 time out of 8.
    ==> So it takes 12-13 ms to fetch 65568 samples, i.e. a 13 ms long sample block at 5 Msps.
    ==> This explains why there is (almost) no crackling sound at that sample rate.

I've also measured iio_buffer_refill() directly, and it matches the refill-is-called case of SoapySDR::Device::readStream.

So basically, irrespective of our processing power, iio_buffer_refill() is already slower than realtime beyond 5 Msps, at least on my machine.

@zuckschwerdt How high a sample rate could you reach on your side without crackling sound in CubicSDR?

@vsonnier vsonnier mentioned this pull request Feb 24, 2019
@zuckschwerdt (Member):

I see three topics worth addressing individually here:

Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample; this isn't inlined and is expensive. Your changes for the well-known cases will vectorize (we should confirm that). I wouldn't call it direct copy though, I'm using that term for the case which lends itself to DMA.

Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;)
Testing the code gives (note that buffer_size is in single scalar samples, not I/Q pairs):

    samplerate    n    buffer    latency     FPS
    25000        17      4096    81.9 ms    12.2 fps
    1000000      12    131072    65.5 ms    15.3 fps
    2000000      11    262144    65.5 ms    15.3 fps
    4000000      10    524288    65.5 ms    15.3 fps
    6000000       9   1048576    87.4 ms    11.4 fps
    8000000       9   1048576    65.5 ms    15.3 fps
    10000000      8   2097152   104.9 ms     9.5 fps

I might have botched that table, but the code needs a cleanup and documentation anyway.
What is the target latency or fps here? Is 60 fps / 17 ms what other drivers offer?

Dropping the refill_thread. As you mention, the current code is just synchronized.
On entry, recv() simply blocks on the refill thread when the buffer is empty.
But if we put

    if (!items_in_buffer) {
        please_refill_buffer = true;
        cond.notify_all();
    }

at the bottom of recv(), we get concurrency: recv() will then only block if you call it too fast.
This lends itself to a read-process loop in the consumer. I'm not sure if that is the anticipated model for SoapySDR, or if a tight read loop (maybe even in a thread) is the preferred model?

@zuckschwerdt (Member):

The IIO internals guide has some information on buffers.

iio_buffer_push()/iio_buffer_refill() are just the high-level interfaces. In the background, IIO will allocate multiple buffers which are used to implement a double/triple/...-buffering scheme that allows continuous streaming without losing samples. iio_buffer_push()/iio_buffer_refill() will only block if all buffers are in use. As soon as one buffer becomes available it will wake up and let the application read/write the buffer. Communication between libiio and the DMA core is highly asynchronous and mostly interrupt-driven.

Note this function will create 3 buffer blocks by default. This default can be changed by using iio_device_set_kernel_buffers_count() prior to creating the buffer.

Buffers are allocated on the kernel side, and when calling iio_buffer_refill() the next buffer will be mapped into user space.
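
For illustration, this is roughly how the knob from the quote above is used with the libiio C API (the buffer count and size here are example values, not this module's actual settings):

    #include <iio.h>
    // rx_dev is assumed to be the capture device, already looked up in the context.
    // Must be called before iio_device_create_buffer().
    iio_device_set_kernel_buffers_count(rx_dev, 4);  // default is 3, per the internals guide
    struct iio_buffer *buf = iio_device_create_buffer(rx_dev, 65536, false); // 64k samples, non-cyclic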

But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?

Using audio stutter in Cubic as a test, I can go up to 5M without the refill_thread; 6M starts to stutter. With a 32K buffer size and the refill_thread at the tail of recv() I can get 6M without stutter. But nothing above is smooth.

@vsonnier (Contributor, Author) commented Feb 24, 2019

Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample; this isn't inlined and is expensive. Your changes for the well-known cases will vectorize (we should confirm that).

Unless I'm wrong, the only thing to add compared to my direct_copy version would be swapping the bytes of the int16_t first, before converting them, to get the right endianness.

Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;)
Testing the code gives..

I don't get that part. Which code?
The latency target on CubicSDR is whatever sustains 60 fps, so that the user gets a realtime display.

Dropping the refill_thread....

The code there is all wrong I'm afraid, and needs to be rewritten based on condition-variable (is_not_full, is_not_empty) principles, without mixing in atomic variables. And you need an additional byte buffer to accumulate the samples, because iio_buffer_refill() overwrites its buffer, which is why we can only call it when the buffer is actually empty. I think.
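
A minimal sketch of the two-condition-variable principle meant here (a generic bounded queue, not the actual module code):

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    // Generic bounded queue guarded by one mutex and two condition variables.
    template <typename T>
    class bounded_queue {
        std::mutex m;
        std::condition_variable is_not_full, is_not_empty;
        std::deque<T> q;
        const std::size_t capacity;
    public:
        explicit bounded_queue(std::size_t cap) : capacity(cap) {}
        void push(T v) {
            std::unique_lock<std::mutex> lock(m);
            is_not_full.wait(lock, [this] { return q.size() < capacity; });
            q.push_back(std::move(v));
            is_not_empty.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lock(m);
            is_not_empty.wait(lock, [this] { return !q.empty(); });
            T v = std::move(q.front());
            q.pop_front();
            is_not_full.notify_one();
            return v;
        }
    };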

I was just doing that based on
https://github.com/cjcliffe/CubicSDR/blob/master/src/util/ThreadBlockingQueue.h
when I realized it was useless: no matter how many cascading buffers you have, if your sample generator, namely iio_buffer_refill() here, is no faster than realtime at producing samples, you will be starved waiting for samples if your application consumes them in a realtime manner. Like CubicSDR does.
You observe that with the 5M/6M limit on your side. 6M also only barely works with my version.

Basically, we have to find a buffer_size which maximizes iio_buffer_refill() throughput, or we may have to switch to another kind of I/O call that can sustain realtime, if any exists.

For testing buffer sizes, the most practical way would be to expose the buffer length as a SoapySDR setting that can be changed before starting to stream. This would be done by implementing
SoapySDR::ArgInfoList getSettingInfo, readSetting, and writeSetting, like other SoapySDR modules do (see the sketch below).
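
A sketch of what those overrides could look like (the class name, the "buffer_size" key, the buffer_size member and the default value are illustrative assumptions, not the actual module code):

    #include <SoapySDR/Device.hpp>
    #include <string>

    // Hypothetical setting exposing the buffer length.
    SoapySDR::ArgInfoList SoapyPlutoSDR::getSettingInfo(void) const
    {
        SoapySDR::ArgInfoList settings;
        SoapySDR::ArgInfo info;
        info.key = "buffer_size";
        info.name = "Buffer Size";
        info.description = "Number of samples fetched per iio_buffer_refill()";
        info.type = SoapySDR::ArgInfo::INT;
        info.value = "131072";
        settings.push_back(info);
        return settings;
    }

    void SoapyPlutoSDR::writeSetting(const std::string &key, const std::string &value)
    {
        if (key == "buffer_size") this->buffer_size = std::stoul(value);
    }

    std::string SoapyPlutoSDR::readSetting(const std::string &key) const
    {
        if (key == "buffer_size") return std::to_string(this->buffer_size);
        return "";
    }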

The MTU size is not related to performance at all; it is a compromise between reads that are not too big, so as to still allow 60 fps, and not too small, to prevent too many readStream calls.
For instance, we could choose something like MTU = min(buffer_size / 2, 8196).

But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?

I suspect the DMA optimization is only valid in the device's own memory space as well. You'll always have to go through the USB pipe.

That will be all for this weekend for me. I think we've made good progress already 👍

Regards,

Vincent

@zuckschwerdt (Member):

The table is about the buffer sizes selected by set_buffer_size_by_samplerate and the resulting fps/latency.

@zuckschwerdt (Member):

My comment about "Tightening the copy loop." was that it's a good thing to do and should go in as a separate PR. It's not a simple memcpy "direct copy", but a tighter loop that the compiler might vectorize.

@vsonnier (Contributor, Author) commented Feb 24, 2019

It's not a simple memcpy "direct copy"

Call it "can_do_raw_buffer_handling" if you wish.

should go in as a separate PR

Too late! I'm already using it. And it works! :)

The table is about the set_buffer_size_by_samplerate selected sizes and resulting fps/latency.

Ah OK. My tests show that, at least on my machine, 131072 for set_buffer_size is the fastest setting. The time iio_buffer_refill() takes to fetch this amount is around 10 ms, and practically independent of sample rate.
As your table shows, those are way too big buffers, which indeed have to be filled and so show a large apparent "latency".
For instance, at 10 Msps, fetching 2097152 samples, i.e. almost 1M I/Q pairs, i.e. 100 ms worth of time, takes... well, 105 ms. Close enough: you can't go faster than reality.

Now, for more reasonable buffer sizes, a well-behaved iio_buffer_refill() would indeed buffer the continuously incoming samples while the application is doing something else, and would only memcpy them into user space when called, resulting in a fast iio_buffer_refill() call.

It looks like that is not the case, because the time is almost 10 ms from 2 Msps to 10 Msps for the same amount of samples, as if iio_buffer_refill() only transferred the samples on demand.

@zuckschwerdt (Member):

Why did you pick 16384 as the RX full scale? The data is 12-bit, aligned to the LSB. I.e. excluding the sign bit you get -2048 to +2047. Scale that to -1 to 1 by dividing by 2048.
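
As a worked example of that scaling (a sketch; it assumes only the low 12 bits of the int16_t are valid, so it sign-extends first):

    #include <cstdint>

    // Convert one 12-bit LSB-aligned sample to a float in roughly -1.0 .. +1.0.
    static inline float lsb12_to_float(int16_t raw)
    {
        int16_t s = int16_t(uint16_t(raw) << 4) >> 4; // sign-extend the low 12 bits
        return float(s) / 2048.0f;                    // -2048..+2047 -> -1.0..+0.99951
    }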

@zuckschwerdt (Member):

Too late! I'm already using it. And it works! :)

I favor putting distinct topics into distinct PRs, just for the sake of others following this, the changelog, and perhaps bisecting changes if needed. (This assumes we squash-merge. If you rebase interactively into distinct topic commits, that works too.)
But that's only relevant when you merge. Trying things out in a single PR here is easier for now.

I guess to really nail a perfect buffer size, or a formula depending on sample rate, we need to test the various IIO abstractions: native, USB, network. My hunch is that native, which already has triple buffering, would be best with small buffers. Network might be best aligned around the network MTU, and USB performs badly with small buffers in my experience.

Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers; BladeRF has 4k.
They all return an MTU of the actual buffer length. I think returning MTU < buffer size is safe, but likely not optimal?

@vsonnier (Contributor, Author) commented Feb 24, 2019

But that's only relevant when you merge. Trying things out in a single PR here is easier for now.

Yes, and we are in no hurry to merge into master. This is just a way to have an experimental branch so that others can add commits to it.
For now the changes are not that big. If the refill_thread is to make a glorious return, it would need to be remade from scratch anyway, so better to get rid of the code now.
Please contribute here instead of in your own repo! Even if you create other intermediate PRs or simply branches, it will be easier to compare or merge them in this main canonical repo afterwards.

Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers, BladeRF has 4k.
They all return an MTU of the actual buffer length. I think returning MTU < buffers size is safe, but likely not optimal?

That worked just fine when I tested it, apparently. The only difference is that there will be one iio_buffer_refill() per readStream call.
That holds for applications that follow the getStreamMTU() suggestion anyway, because we can indeed request any size we want from readStream. In principle, the MTU is just the "best" size to choose.
Whatever; buffer_size = MTU = fixed is fine by me.

@zuckschwerdt (Member):

Using either a refill_thread (with or without the refill at the tail of recv), or no refill_thread, or even triple buffering in the refill thread, it always tops out at 6.3 Msps / 25.3 MBps for me (SoapyRateTest).
In fact, running a stripped-down version of ad9361-iiostream.c shows the same limit.
I guess that's just what IIO can deliver over USB.
With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8, we could perhaps achieve 10 Msps at reduced sample depth. Not sure if that is worth supporting.

@vsonnier (Contributor, Author):

Thanks for the extensive testing. I guess we can definitely strip the refill_thread and simply go synchronous, single-threaded, in recv. We don't even need atomic variables at all.

With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8 we could perhaps achieve 10 Msps at reduced sample depth – not sure if that is worth to support.

Well, technically we could implement SOAPY_SDR_CS12 on the module side (3 bytes per complex sample), encoded this way according to SoapyRemote:

    // the 3 bytes (in being a uint8_t*)
    uint16_t part0 = uint16_t(*(in++));
    uint16_t part1 = uint16_t(*(in++));
    uint16_t part2 = uint16_t(*(in++));
    // the 2 resulting I/Q samples in int16_t format
    int16_t i = int16_t((part1 << 12) | (part0 << 4));
    int16_t q = int16_t((part2 << 8) | (part1 & 0xf0));

in ClientStreamData::convertRecvBuffs.

Apparently the SoapySDRServer can stream (packed) CS12, which could be unpacked on the application side to the more usual CS16/CF32.
The Pluto module running on the Pluto itself would use some super-great DMA method to fetch the samples.

6.3 Msps x 32 bits is 200 Mbit/s alone, and I've read somewhere that USB bandwidth is reserved for TX.
If that is not the case, I don't see why the custom solution above could not go higher than 200 Mbit/s, since the theoretical limit is 480 Mbit/s for USB 2.0. Other devices can stream 10 Msps (Airspy, SDRPlay), but they are RX only.
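
For a rough sense of the per-format bandwidth (simple arithmetic on the 6.3 Msps figure, with CS16 = 4, CS12 = 3 and CS8 = 2 bytes per complex sample):

    CS16: 6.3 Msps x 32 bits = 201.6 Mbit/s
    CS12: 6.3 Msps x 24 bits = 151.2 Mbit/s
    CS8:  6.3 Msps x 16 bits = 100.8 Mbit/s
    10 Msps would need 240 Mbit/s in CS12, or 160 Mbit/s in CS8.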

Out of curiosity I tried SDRSharp with the PlutoSDR plugin, which apparently uses the network interface. Sample rates up to 10 Msps are offered, but any setting > 5 Msps is marked "(not supported)", go figure.

@vsonnier (Contributor, Author):

@zuckschwerdt Charles (@cjcliffe) and I are in contact with Robin Getz of Analog Devices by e-mail; maybe we can ask him the maximum Msps we can get from the device before setting off on a hopeless quest.

@zuckschwerdt (Member):

Great! I briefly thought about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8 Msps with CS12 and 10 Msps at CS8; I'll explore that soon.

Are you OK with me splitting out the fast-copy changes and preparing that for merge? Then we can focus on buffer size and (non-)threading here.

@vsonnier (Contributor, Author) commented Feb 25, 2019

Are you OK with me splitting out the fast-copy changes and preparing that for merge?

Excuse my French, but do you mean just adding the fast-copy enhancements to master for now? Yes, do it, by all means! The best way is probably opening a new PR containing just that.

@zuckschwerdt (Member):

I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.

@vsonnier (Contributor, Author):

I just tested at 6.3 Msps in CubicSDR, on a 200 kHz FM stereo station. Same as you: the crackling sound starts at 6.4 Msps.

I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.

So you really plan to run the module on the Pluto itself, together with a SoapySDRServer? Wow!

@guruofquality (Contributor):

Great! I briefly thought about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8 Msps with CS12 and 10 Msps at CS8; I'll explore that soon.

The client should support the conversion from CS12 over the network to CS16 or CF32: https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.hpp#L11

So if you advertise the native format as CS12, SoapyRemote should be OK. The caveat is that CS12 could be packed differently than expected, so that's something to look out for.

@zuckschwerdt (Member):

Cross-compiling and running stuff on the Pluto isn't hard; it's just a pain to set up the compiler.
I have SoapySDR with the module, GPSd, Chrony, rtl_433 and other stuff on there already. I might offer a binary package someday if there is some interest in this.

Will SoapyRemote always use the native format (CS16) for the network? If the SoapyPlutoSDR module offers CS12 and CS8, and the client side requests one of them, will SoapyRemote on the server side pick it for transport? I'll investigate in a day or so.

@zuckschwerdt (Member):

Okay, I can answer this now. If SoapyRemote is running on the Pluto and an application requests CS8 or CS12, then SoapyRemote is smart enough to use that over the network:

    [INFO] Configured sender endpoint: dgram=1452 bytes, 714 elements @ 2 bytes, window=16 KiB

for CS8, and for CS12:

    [INFO] Configured sender endpoint: dgram=1452 bytes, 476 elements @ 3 bytes, window=16 KiB

CS8 works right now, and I also have working CS12 code on top of #16. PR soon.

@zuckschwerdt (Member):

Looking at
https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83
https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L214

I get the impression that CS12 is ii qI QQ (uppercase = MSB).
@guruofquality @vsonnier Is this right? Is there any spec on CS12? It seems to work, looking at the waterfall in CubicSDR, but there is too much gain (the noise floor is way too high).

@vsonnier (Contributor, Author) commented Feb 26, 2019

@zuckschwerdt, looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83 with pen and paper, let's say types are described as [MSBit .. LSBit], so
part0, part1, part2 = [7 .. 0]
and we want to build int16_t = [15 .. 0] for I and Q.

I get:
I = [3..0 of part1, 7..0 of part0, zeroes]
Q = [7..0 of part2, 7..4 of part1, zeroes]

Looks like I/Q are MSBit-aligned here.
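
A small round-trip sketch consistent with that reading (the pack side is inferred from the SoapyRemote unpack code quoted earlier, so treat the layout as an assumption, not a spec):

    #include <cstdint>

    // Pack one 12-bit MSB-aligned I/Q pair into 3 bytes (inferred CS12 layout).
    static void pack_cs12(int16_t i, int16_t q, uint8_t *out)
    {
        out[0] = uint8_t(uint16_t(i) >> 4);                           // part0 = i[11:4]
        out[1] = uint8_t((uint16_t(i) >> 12) | (uint16_t(q) & 0xf0)); // part1 = q[7:4] | i[15:12]
        out[2] = uint8_t(uint16_t(q) >> 8);                           // part2 = q[15:8]
    }

    // Unpack, exactly as in ClientStreamData::convertRecvBuffs.
    static void unpack_cs12(const uint8_t *in, int16_t &i, int16_t &q)
    {
        uint16_t part0 = in[0], part1 = in[1], part2 = in[2];
        i = int16_t((part1 << 12) | (part0 << 4));
        q = int16_t((part2 << 8) | (part1 & 0xf0));
    }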

@vsonnier vsonnier force-pushed the streaming_improvements branch 2 times, most recently from e576421 to 0c3c0d1 on March 2, 2019 07:29
- Simplified getting Stream* using the same tricks as SoapySDRPlay,
- Revised some scoped locks
BEWARE: TX has not been modified accordingly yet.
@vsonnier vsonnier force-pushed the streaming_improvements branch from 0c3c0d1 to 327a511 on March 2, 2019 07:32
@vsonnier (Contributor, Author) commented Mar 2, 2019

@zuckschwerdt I have rebased the streaming_improvements branch onto the current master, so it is on top of CS12 support.

@zuckschwerdt (Member):

I'll test soon. First notes: I'd expect the rx/tx streams to be independent. I.e. if I need two setupStream() calls to open the streams, I would not expect one closeStream() to remove both?

@vsonnier vsonnier force-pushed the streaming_improvements branch from 3c8f126 to ccce098 on March 2, 2019 09:32
@vsonnier (Contributor, Author) commented Mar 2, 2019

@zuckschwerdt You are right. I fixed this, and also replaced shared_ptr with unique_ptr, because those streamers don't (and shouldn't) escape the parent object, so shared_ptr is overkill or merely misleading here.

@zuckschwerdt (Member):

The Pluto is advertised as full duplex. I don't know how much of that is true; perhaps independent control of both streams would be possible, e.g. running different rx/tx threads, with start/stop at different times.

@vsonnier (Contributor, Author) commented Mar 2, 2019

Indeed. So I made a change to return two distinct SoapySDR::Stream* objects, one for RX and one for TX.

@vsonnier (Contributor, Author) commented Mar 2, 2019

E.g. running different rx/tx threads, with start/stop at different times.

If you follow that path, you'll fall into the rabbit hole and soon realize that everything has to be locked by the same device_mutex, which has to cover both streaming and settings.

I've tried to put std::lock_guard<std::mutex> lock(device_mutex); into readStream, but it had a noticeable performance impact.

The cowardly solution is to leave readStream unprotected. What about spin-locks, though? We only need to lock when "changing settings" and "streaming" are actually concurrent, which is not very often in practice.

Unfortunately, they are not standard issue in C++11, but strangely enough it provides the std::atomic_flag building block:
https://en.cppreference.com/w/cpp/atomic/atomic_flag, complete with a spin-lock example.

So in the last commit, I made my own pluto_spin_mutex, used by device_mutex itself, protecting everything.
readStream seems as fast as the "unprotected" version; changing settings seems slower though, but I think this is not an issue.
Let's test and see how fast we can get with this safe version.
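
Such a pluto_spin_mutex presumably follows the cppreference example, roughly like this (a sketch, not the exact commit):

    #include <atomic>

    // Spin-lock on std::atomic_flag; satisfies BasicLockable, so it
    // works with std::lock_guard just like a std::mutex would.
    class pluto_spin_mutex {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            while (flag.test_and_set(std::memory_order_acquire))
                ; // spin; cheap when contention is rare, as argued above
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };

    // Usage: std::lock_guard<pluto_spin_mutex> guard(device_mutex);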

@zuckschwerdt (Member):

Part of the headache I had with the threading was concurrent access to stop() and recv(). There might still be a thread draining the buffer when stop() is called, and killing the buffer is a good way to crash while recv() is underway. Also something to look out for. Perhaps removing buf can be deferred to the destructor. But that also means that once buf is set up, it can't safely be changed.

@vsonnier (Contributor, Author) commented Mar 2, 2019

There might still be a thread draining the buffer when stop()

Yes, that is the main issue. With the "lock everything" strategy, a running xxxStream operation can no longer be interrupted by a closeStream or similar, so the crash can't happen.
In the inverse situation, if a closeStream is underway, the xxxStream op will have to wait to proceed.
The only thing we need to ensure is that the xxxStream op will cleanly do nothing once the terminating closeStream releases the lock.

Basically, Settings and Streaming ops get interleaved properly, so I think we may change any buffer sizing at runtime and it would still work.

@zuckschwerdt (Member):

Sorry, I meant that with a lockless or atomic variant. With the mutex lock it guards nicely, but avoiding a mutex would probably yield better performance.

@vsonnier (Contributor, Author) commented Mar 2, 2019

Well, it is called a mutex, but here it is just a CAS loop, which will almost always succeed on the first try anyway because actual contention is rare. I don't think we'll see any running performance impact on streaming itself compared to no lock at all.

The only visible effect is that a settings change will wait for the current streaming operation to complete before taking effect. At that particular moment the spin-lock is likely to be slower than anything else, because it wastes CPU cycles.

@vsonnier (Contributor, Author) commented Mar 3, 2019

RX and TX are indeed quite independent, so I've made changes to use different RX and TX locks, plus some other cleanups.
Now we can have concurrent readStream and writeStream, in theory.
I think I fixed something in the process:
previously, when I did Start/Stop device in Cubic over a SoapyRemote connection on localhost, it would crash both Cubic and the SoapySDRServer. Now it works.

@zuckschwerdt (Member):

Works very well and fixes the spurious crash on start/stop for me.
Neither Clang-9 nor GCC-8 recognise std::make_unique, strangely enough not even in C++14 mode. I had to change that to

    this->rx_stream = std::unique_ptr<rx_streamer>(new rx_streamer(rx_dev, streamFormat, channels, args));

and

    this->tx_stream = std::unique_ptr<tx_streamer>(new tx_streamer(tx_dev, streamFormat, channels, args));

which I gather is equivalent, just not as nice looking.
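
An alternative to rewriting the call sites would be a local C++11 backport of make_unique, along these lines (single-object overload only; a sketch, not what the PR actually does):

    #include <memory>
    #include <utility>

    // Minimal C++11 stand-in for std::make_unique (single-object form only).
    template <typename T, typename... Args>
    std::unique_ptr<T> make_unique(Args&&... args)
    {
        return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
    }

    // this->rx_stream = make_unique<rx_streamer>(rx_dev, streamFormat, channels, args);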

@vsonnier (Contributor, Author) commented Mar 3, 2019

My bad, std::make_unique is indeed C++14; let's stick to C++11 then.

@zuckschwerdt (Member):

Tested and works well for me. Can be merged (squash merge please) when you see fit, in my opinion.

@vsonnier vsonnier merged commit 43cda25 into master Mar 4, 2019
@vsonnier (Contributor, Author) commented Mar 4, 2019

Alea jacta est!

@vsonnier vsonnier deleted the streaming_improvements branch March 4, 2019 21:35