Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

actual crash fix this time #800

Merged
merged 2 commits into from
May 5, 2023
Merged

Conversation

gisforgirard
Copy link
Contributor

this is the actual crash fix i kept messing up and submitting pull requests for, for real this time

honestly this is probably not the best way to solve it, but there is a case where amt_produce ends up going over 32768 and it crashes the program faithfully every time, and this has appeared to solve it for me without any obvious negative side effects, there is probably a better way to go about doing this and its pretty much a hack, but it's a hack that works at least

this is the actual crash fix i kept messing up, for real this time
@Dygear
Copy link
Contributor

Dygear commented May 2, 2023

When does amt_produce go over the 32768 limit? Sounds like there is another bug lurking there somewhere else. This feels a little like a band aide to the problem. Will for example discarding past the 32767th byte cause significant data loss for voice data or something else?

@robotastic
Copy link
Owner

This is definitely interesting! let me see if I can find some suspects. amt_produced should never get that large.

@tadscottsmith
Copy link
Contributor

I say it's always a good idea to check for overflows, but I'm pretty sure you've identified the reason you're seeing the issue, and most others aren't: #773 (comment).

@robotastic
Copy link
Owner

Tracking some research:
amt_produce is an int which means it can have a range between: -2147483648 to 2147483647
it is set to the value of the queue holding the output - which is returned as a size_t.

The baffling thing is that amt_produce should never be that big. It is however many samples were processed since the last time things were processed.

Could you uncomment line 183 in that file? I am curious what values you are seeing for the output queue...

@robotastic
Copy link
Owner

oh - you maybe right @tadscottsmith That would explain why so much gets built up... I am honestly surprised this fixes things... and also why that value, it seems like an int could hold a lot more. maybe it is something further down the stream.

@faultywarrior
Copy link
Contributor

I suppose it is possible that the queue is getting built up that big due to filesystem writes being held; but it does seem odd. Usually if you start to out-strip your storage throughput, you just start dropping traffic or get spammed with O's. FWIW, I capture about a dozen of calls per second on my T-R instance to a NAS which, like @gisforgirard, I have it mounted as a local folder. I've never seen this crash occur - my TR instance has had months of uptime. However, my NAS is fully solid-state and has a 10Gb connection to the ESXi host where the T-R instance runs; so its possible that despite the large write volume (my T-R logs are witten to the same NAS); my latency is low enough that I don't build the buffer up big enough to trip this bug.

@tadscottsmith
Copy link
Contributor

If my math is right, and 160 samples = 20 ms, then 32768 samples is (32768 / 160 = 204.8 * 20 ms = 4096 ms) ~ 4.1 seconds behind.

The max size of an Int_16 is 32767 and there's an awful lot of those in the area that would make me suspicious.

int16_t *out = (int16_t *)output_items[0];

This line stands out after a quick glance.

@taclane
Copy link
Contributor

taclane commented May 3, 2023

If my math is right, and 160 samples = 20 ms, then 32768 samples is (32768 / 160 = 204.8 * 20 ms = 4096 ms) ~ 4.1 seconds behind.

That sounds about right. A couple years back there was an issue where buffers would back up and it led to call audio falling out of sequence. 60-90k in the queue time-shifted transmissions 8-10 seconds into the next call.

A little curious about "// buffer limit is 32768, see gnuradio/gnuradio-runtime/lib/../include/gnuradio/buffer.h:186". Which version of gnuradio is this from?

@Dygear
Copy link
Contributor

Dygear commented May 3, 2023

Oh, not flushing to the network is a problem. Didn't think that it would hold onto it in a buffer there, but in a file buffer somewhere else. It could save to the file system (even remote file systems) once the device becomes ready for it.

Sounds like the real fix is making a file save function that holds onto a que of data that it's waiting to write to disk somewhere in the main loop so it doesn't backpressure into this buffer.

@robotastic
Copy link
Owner

I added a few more debug values to the message. I am also checking with a friend to see if he has any ideas... either way, this seems like a good fix, I am merging it in.

@robotastic robotastic merged commit 8cabd46 into robotastic:master May 5, 2023
@gisforgirard
Copy link
Contributor Author

gisforgirard commented May 13, 2023

I suppose it is possible that the queue is getting built up that big due to filesystem writes being held; but it does seem odd. Usually if you start to out-strip your storage throughput, you just start dropping traffic or get spammed with O's. FWIW, I capture about a dozen of calls per second on my T-R instance to a NAS which, like @gisforgirard, I have it mounted as a local folder. I've never seen this crash occur - my TR instance has had months of uptime. However, my NAS is fully solid-state and has a 10Gb connection to the ESXi host where the T-R instance runs; so its possible that despite the large write volume (my T-R logs are witten to the same NAS); my latency is low enough that I don't build the buffer up big enough to trip this bug.

yeah i have a 10GBASE-T link straight to my NAS but for whatever reason it has weird latency issues where packets won't drop but it will freeze the entire system at the kernel whenever it happens, i have tried different network cards, SFP+ adapters, ethernet cords, kernel settings, and it drives me crazy trying to track down what is going on but still haven't been able to figure it out, it's probably some kind of problem with synology's software but i'm already like $3000 into this stupid thing so maybe i'll go back to truenas for the next build instead...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants