Non zero-copy dma_test errors after recent DMA refactor #73

Open
maleadt opened this issue Dec 10, 2021 · 7 comments

@maleadt
Contributor

maleadt commented Dec 10, 2021

After #65, non zero-copy dma_test occasionally results in errors:

+ ./user/litepcie_util dma_test
DMA_SPEED(Gbps) TX_BUFFERS      RX_BUFFERS      DIFF    ERRORS
          7.03       21569           21441         128  127116032
          7.02       43105           42977         128        0
          7.02       64641           64513         128        0
          7.03       86209           86081         128        0
          7.02      107745          107617         128  121705984
          7.02      129281          129153         128        0
          7.03      150849          150721         128        0
          7.02      172385          172257         128        0
          7.03      193953          193825         128  87809568
          7.02      215489          215361         128        0

FWIW, I tried with the various error checking PRs I have created here, and it isn't a read or ioctl that's silently failing.
cc @sergachev

@sergachev
Contributor

Hi!

  1. Are you sure it's the only change (fb1513e vs 7f1e393) ?
  2. Anything in dmesg?
  3. The 7 Gb/s data rate looks like PCIe x2; I don't have any hardware right now to test more than x1. I'm concerned that with the current circular buffer queue approach the hardware can start overwriting the previous data if the host is not quick enough. It would be good to rewrite this with a normal queue approach, where the hardware writes/reads each queue entry only once and stops if the queue is empty.

Could you try to run it under nice -n -20?
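
The overwrite concern in point 3 can be illustrated with a small userspace model. This is only a sketch, not the litepcie driver: RING_SLOTS, ring_overwritten() and queue_accepted() are hypothetical names, and the "hardware" is reduced to a count of entries produced.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8  /* hypothetical ring size, not the driver's value */

/* Free-running ring: the producer (modelling the hardware) advances
 * unconditionally. Returns how many unread entries were overwritten when
 * the producer has written `produced` entries while the consumer (the
 * host) has only read `consumed` of them. */
size_t ring_overwritten(size_t produced, size_t consumed)
{
    assert(consumed <= produced);
    size_t backlog = produced - consumed;   /* entries not yet read */
    return backlog > RING_SLOTS ? backlog - RING_SLOTS : 0;
}

/* Bounded queue: the producer stops once every slot holds unread data.
 * Returns how many of the `wanted` entries it could actually write. */
size_t queue_accepted(size_t wanted, size_t consumed)
{
    size_t capacity = consumed + RING_SLOTS; /* total entries writable so far */
    return wanted < capacity ? wanted : capacity;
}
```

In this model a slow host loses data silently with the ring (ring_overwritten(20, 4) reports 8 lost entries), while the bounded queue throttles the producer instead (queue_accepted(20, 4) stops at 12).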

@maleadt
Contributor Author

maleadt commented Dec 10, 2021

  1. Yes
  2. No
  3. It seems load-related indeed. Running together with stress -c $(nproc) results in many more errors:
DMA_SPEED(Gbps) TX_BUFFERS      RX_BUFFERS      DIFF    ERRORS
          5.56       17729           17569         160  308108288
          8.18       42817           42657         160  2262433792
          7.31       65345           65185         160  1593311232
          7.01       86849           86689         160  1325662208
          7.00      108321          108193         128  1343553536
          7.02      130081          129953         128  1431568384
          6.59      150305          150177         128  1027342336
          6.60      172577          172449         128  1566310400
          5.62      191521          191393         128  690487296
          9.47      220705          220577         128  3385327616

This does not happen on fb1513e:

DMA_SPEED(Gbps) TX_BUFFERS RX_BUFFERS  DIFF  ERRORS
          6.72    1116033    1115905    128       0
          7.52    1139105    1138945    160       0
          6.98    1160513    1160417     96       0
          6.54    1180769    1180641    128       0
          7.16    1203713    1203585    128       0
          7.23    1225889    1225761    128       0
          7.20    1247969    1247841    128       0
          6.92    1269313    1269185    128       0
          6.44    1290529    1290401    128       0
          6.92    1313761    1313633    128       0

Lower niceness does not help.

@sergachev
Contributor

I can cause the same problem on both old and new versions with some CPU load and some positive niceness, but can't detect any difference between them yet. Still trying.

Most likely you don't see anything in dmesg because overflows and underflows are reported with dev_dbg(), which is not compiled in by default. I do see them after enabling it; the easiest way is

#undef dev_dbg
#define dev_dbg dev_err

inside the module main.c (and probably these should be promoted to errors permanently).
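
To show why the placement of that override matters, here is a userspace mock of the two macros. It is a sketch under assumptions: the dev_err/dev_dbg definitions below are stand-ins rather than the kernel's, and report_overflow() and its message text are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

static int messages_emitted;  /* counts messages that reach the log */

/* Stand-ins for the kernel macros: dev_err always logs, dev_dbg is
 * compiled out (as it is when debugging is not enabled). */
#define dev_err(dev, ...) (messages_emitted++, fprintf(stderr, __VA_ARGS__))
#define dev_dbg(dev, ...) ((void)0)

/* The override from the comment above. It has to come after the original
 * definition (in the module: after the #include lines), otherwise the
 * header's own dev_dbg definition would clash with it. */
#undef dev_dbg
#define dev_dbg dev_err

/* Hypothetical call site; returns the number of messages logged so far. */
int report_overflow(int channel)
{
    dev_dbg(NULL, "DMA overflow on channel %d\n", channel);
    return messages_emitted;
}
```

With the two override lines removed, report_overflow() would log nothing and the counter would stay at 0, which matches the empty dmesg reported earlier in the thread.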

While reviewing the refactoring I found that the test logic should be corrected a bit - read verification should only start after all buffers are written: sergachev@52a814d - but that should only affect the beginning of the test.

@sergachev
Contributor

One more problem I spotted is that dma_test is not very good at detecting errors: even when the module reports overflows/underflows, dma_test can show 0 errors. That's most likely because of the static content and arrangement of the buffers. Changing the data pattern and buffer sequence continuously would be best, but it is also OK to at least zero the read buffers - sergachev@be856eb - the performance penalty looks unnoticeable.

This is true both pre and post #65 .
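
A minimal model of why zeroing helps, assuming one read buffer that always receives the same static pattern. BUF_WORDS, the pattern values, and both function names are hypothetical and not taken from dma_test.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 4   /* hypothetical buffer size in 32-bit words */

/* Count words in `buf` that differ from the expected static pattern. */
static int count_errors(const uint32_t *buf, const uint32_t *expected)
{
    int errors = 0;
    for (int i = 0; i < BUF_WORDS; i++)
        errors += (buf[i] != expected[i]);
    return errors;
}

/* Model two passes over one read buffer where the second DMA transfer is
 * lost. If `zero_between` is set, the buffer is cleared after the first
 * check, before it would be handed back to the hardware.
 * Returns the error count seen on the second pass. */
int errors_after_lost_transfer(int zero_between)
{
    const uint32_t pattern[BUF_WORDS] = { 0x11, 0x22, 0x33, 0x44 };
    uint32_t buf[BUF_WORDS];

    memcpy(buf, pattern, sizeof(buf));      /* pass 1: transfer arrives */
    assert(count_errors(buf, pattern) == 0);

    if (zero_between)
        memset(buf, 0, sizeof(buf));

    /* pass 2: the transfer is dropped, so buf is left untouched */
    return count_errors(buf, pattern);
}
```

Without zeroing, the stale data from pass 1 still matches and the lost transfer goes unnoticed; with zeroing, every word of the buffer is flagged.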

@enjoy-digital
Owner

With 444e995#diff-6098cfbe69612f9e6f1723251f5efca9734bc9861bf188158b0787b361e7ce6e, writes are now done continuously (as it was before in fact). We'll now just have to ensure that the length of the LFSR is different from the total length of a DMA loop.
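
The length condition can be checked with a toy model. This is a sketch under assumptions: stream_word() is a plain modular counter standing in for the LFSR (a real LFSR produces different values but repeats with a fixed period in the same way), and unchanged_slots() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a pseudo-random generator with a known period:
 * word n of the continuous stream is simply n % period. */
static size_t stream_word(size_t n, size_t period)
{
    return n % period;
}

/* Count the slots of a DMA loop of `loop_len` words whose content is
 * identical on two consecutive passes of a continuous stream. */
size_t unchanged_slots(size_t loop_len, size_t period)
{
    assert(period > 0);
    size_t same = 0;
    for (size_t i = 0; i < loop_len; i++)
        if (stream_word(i, period) == stream_word(loop_len + i, period))
            same++;
    return same;
}
```

In this model, a loop of 16 words fed by a period-8 or period-16 stream gets identical content in every slot on each pass, so stale data would still verify. With a period of 15 no slot repeats, which is the situation the continuous writes are meant to create.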

@sergachev
Contributor

writes are now done continuously (as it was before in fact)

Was it? Look at something old enough:

write_pn_data((uint32_t *) buf_wr, DMA_BUFFER_TOTAL_SIZE/4, &seed_wr);

@enjoy-digital
Owner

Sorry, this was indeed the case before, but in an older version (c4c8705). I'm the one who simplified it :)
