
sr_watch is leaking in flow test because of pclean tests #208

Closed
benlapETS opened this issue May 27, 2019 · 7 comments


benlapETS commented May 27, 2019

I discovered this after fixing the error and warning summary (#206): now that I am more confident of the counts from big sample tests, it seems that pclean f90 generates inotify events, which are translated into sr_watch posts, while it executes its random tests that create new files with new extensions (S, P, C (this one is disabled), hlink, slink, ...). Those posts are then propagated and checked in the other pclean tests...
Here is the output of the flow check:

ERROR Summary:

     1440201 sr_sender_tsource2send_f50 (40 file) [ERROR] Delivery failed /local/ben/sarra_devdocroot/sent_by_tsource2send/2019052
     19194 sr_shovel_pclean_f90.2019-05-24 (1 file) [ERROR] file /local/ben/sarra_devdocroot/20190524/WXO-DD/meteocode/ont/cmml/TR
     10558 sr_subscribe_q_f71.2019-05-24 (10 file) [ERROR] util/writelocal mismatched file length writing FCUK54_EGRR_242000___799
     6225 sr_shovel_pclean_f90.2019-05-24 (1 file) [ERROR] sr_shovel/run going badly, so sleeping for 0.01
     735 sr_subscribe_cp_f61.2019-05-24 (20 file) [ERROR] Download failed sftp://ben@localhost//local/ben/sarra_devdocroot/sent_by
     282 sr_subscribe_t_f30.2019-05-24 (5 file) [ERROR] Server couldn't fulfill the request. Error code: 404, File not found
     130 sr_watch_f40 (2 file) [ERROR]   File "/local/ben/anaconda3/envs/base_sarra/bin/sr_watch", line 11, in <module>
     78 sr_poll_f62.2019-05-24 (26 file) [ERROR] /local/ben/anaconda3/envs/base_sarra/lib/python3.7/site-packages/paramiko/kex_ecd
     26 sr_poll_f62.2019-05-24 (26 file) [ERROR]   self.curve, Q_S_bytes
     6 sr_watch_f40 (2 file) [ERROR] load_entry_point('metpx-sarracenia', 'console_scripts', 'sr_watch')()

More than 10 TYPES OF ERRORS found... for the rest, have a look at /local/ben/.cache/sarra/log/flow_check_ERRORs_logged.txt for details

WARNING Summary:

     5406120 sr_report_tsarra_f20.2019-05-24 (63 file) [WARNING] total: Excessive lag! Messages posted a minute ago 
     1890596 sr_shovel_pclean_f90.2019-05-24 (3 file) [WARNING] /20190524/WXO-DD/observations/swob-ml/20190524/CTBF/2019-05-24-192
     479873 sr_sender_tsource2send_f50 (60 file) [WARNING] attempt 2
     479839 sr_sender_tsource2send_f50 (60 file) [WARNING] attempt 3
     161061 sr_shovel_pclean_f90.2019-05-24 (1 file) [WARNING] propagated = 0
     94761 sr_sarra_download_f20.2019-05-24 (5 file) [WARNING] truncating toolong header at 206 characters (to fit 255 byte AMQP l
     19078 sr_subscribe_ftp_f70.2019-05-24 (10 file) [WARNING] onfly_checksum d,680a8a01e839aa216e5fbe6343c07ac2 differ from messa
     3551 sr_subscribe_cp_f61.2019-05-24 (10 file) [WARNING] could not move /local/ben/sarra_devdocroot/downloaded_by_sub_cp/20190
     638 sr_poll_pulse (4 file) [WARNING] hb_pulse connection problem...reconnecting
     547 sr_poll_f62 (4 file) [WARNING] sr_poll sleep too low (10) secs is less than 10% of poll time (357.696395)

More than 10 TYPES OF WARNINGS found... for the rest, have a look at /local/ben/.cache/sarra/log/flow_check_WARNINGs_logged.txt for details

	Maximum of the shovels is: 100488

		TEST RESULTS

                 | dd.weather routing |
test  1 success: sr_shovel	 (100488) t_dd1 should have the same number of items as t_dd2	 (100470)
test  2 success: sr_winnow	 (200958) should have the same of the number of items of shovels	 (200958)
test  3 success: sr_sarra	 (94761) should have the same number of items as winnows'post	 (100207)
test  4 success: sr_subscribe	 (94691) should have the same number of items as sarra		 (94761)
                 | watch      routing |
test  5 FAILURE: sr_watch		 (3662882) should be 4 times subscribe t_f30		  (94691)
test  6 FAILURE: sr_sender		 (2855260) should have the same number of items as sr_watch  (3662882)
test  7 FAILURE: sr_subscribe u_sftp_f60 (209649) should have the same number of items as sr_sender (2855260)
test  8 FAILURE: sr_subscribe cp_f61	 (209645) should have the same number of items as sr_sender (2855260)
                 | poll       routing |
test  9 FAILURE: sr_poll test1_f62	 (149293) should have half the same number of items of sr_sender	 (2855260)
test 10 success: sr_subscribe q_f71	 (149293) should have the same number of items as sr_poll test1_f62 (149293)
                 | flow_post  routing |
test 11 FAILURE: sr_post test2_f61	 (137319) should have half the same number of items of sr_sender 	 (2855260)
test 12 success: sr_subscribe ftp_f70	 (136429) should have the same number of items as sr_post test2_f61 (137319)
test 13 success: sr_post test2_f61	 (137319) should have about the same number of items as shim_f63	 (137309)
                 | py infos   routing |
test 14 FAILURE: sr_shovel pclean_f90 (64780) should have the same number of watched items winnows' post	 (100207)
test 15 FAILURE: sr_shovel pclean_f92 (18452) should have the same number of removed items winnows' post	 (100207)
test 16 success: 0 messages received that we don't know what happened.
test 17 success: count of truncated headers (94761) and subscribed messages (94761) should have about the same number of items
                 | C          routing |
test 18 success: cpump both pelles (c shovel) should receive about the same number of messages (156417) (156417)
test 19 success: cdnld_f21 subscribe downloaded (33405) the same number of files that was published by both van_14 and van_15 (33405)
test 20 FAILURE: veille_f34 should post twice as many files (3911) as subscribe cdnld_f21 downloaded (33405)
test 21 FAILURE: veille_f34 should post twice as many files (3911) as subscribe cfile_f44 downloaded (33405)
test 22 FAILURE: Overall 11 of 21 passed (sample size: 100207) !
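
To make the suspected mechanism concrete: every companion file the pclean tests create (.hlink, .slink, .moved, ...) lands in the tree that sr_watch monitors, so each one raises its own inotify event and its own post. A minimal sketch of that multiplication, assuming the watchdog package and a hypothetical /tmp/watched directory (not the flow-test tree or sr_watch itself):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class PostLogger(FileSystemEventHandler):
    def on_created(self, event):
        # a real sr_watch would build and publish a post message here
        print("would post:", event.src_path)

observer = Observer()
observer.schedule(PostLogger(), "/tmp/watched", recursive=True)
observer.start()
try:
    # while this runs, touching foo, foo.hlink and foo.slink under
    # /tmp/watched yields three separate events, hence three posts
    time.sleep(30)
finally:
    observer.stop()
    observer.join()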
@benlapETS benlapETS added the bug Something isn't working label May 27, 2019
@benlapETS benlapETS self-assigned this May 27, 2019
@benlapETS

Note1: It is easy to reproduce: just make pclean fail (i.e. in f91) by not adding the next header (i.e. in f90), then let sr_watch and the pclean shovels continue to run. One can also run a big sample test (which takes much longer to reproduce the problem).
Note2: I refactored all the f9x plugins today and started both tests to try to reproduce the problem in debug mode.


benlapETS commented May 29, 2019

Note1: Finally, the issue was not that easy to reproduce, as neither test supplied clues about the problem. I will restart the same test, but without the .inc reject enabled.


benlapETS commented May 31, 2019

Note1: I managed to reproduce the problem with a limit of 100 000 and a 2-day run.
Note2: Here is a script to count the per-date distribution of the messages posted by the watches (I don't know how it could be useful yet):

# sum, per day of May 2019, the matching post timestamps across all rotated sr_watch logs
for i in 28 29 30 31; do
 echo "$i"
 grep -c "201905${i}T.*\..*" sr_watch_f40_01.log* | cut -d":" -f2- | paste -sd+ | bc;
done
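
For reference, a Python equivalent of the same count (a sketch assuming the same log naming, not part of the test suite):

import glob
import re

# sum, per day, the sr_watch log lines that mention that day's timestamps
for day in ("28", "29", "30", "31"):
    pat = re.compile(r"201905" + day + r"T.*\..*")
    total = 0
    for path in glob.glob("sr_watch_f40_01.log*"):
        with open(path, errors="replace") as f:
            total += sum(1 for line in f if pat.search(line))
    print(day, total)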

Note3: Below is the diff of flow check numbers taken a few hours apart. I am still unsure of the relevance of these results, except that the senders and the pclean modules are still active after everything else is done. Digging into the logs of those components will help me understand.

   195739  sr_sender_tsource2send_f50       (30 file)  [ERROR] |	  205551  sr_sender_tsource2send_f50       (30 file)  [ERROR]
  37      sr_cpump_xvan_f14                (2 file)   [ERROR] |	  40      sr_cpump_xvan_f14                (2 file)   [ERROR]
  858375  sr_shovel_pclean_f90.2019-05-29   (46 file)  [WARNI |	  884055  sr_shovel_pclean_f90.2019-05-29   (46 file)  [WARNI
  62088   sr_sender_tsource2send_f50        (30 file)  [WARNI |	  64549   sr_sender_tsource2send_f50        (30 file)  [WARNI
  62058   sr_sender_tsource2send_f50        (30 file)  [WARNI |	  64522   sr_sender_tsource2send_f50        (30 file)  [WARNI
  425     sr_poll_pulse                     (3 file)   [WARNI |	  435     sr_poll_pulse                     (3 file)   [WARNI
  124     sr_poll_f62                       (3 file)   [WARNI |	  129     sr_poll_f62                       (3 file)   [WARNI
test  5 FAILURE: sr_watch (813377) should be 4 times subscrib |	test  5 FAILURE: sr_watch (837830) should be 4 times subscrib
test  6 success: sr_sender (751386) should have the same numb |	test  6 success: sr_sender (773308) should have the same numb
test  7 FAILURE: sr_subscribe u_sftp_f60 (343649) should have |	test  7 FAILURE: sr_subscribe u_sftp_f60 (343649) should have
test  8 FAILURE: sr_subscribe cp_f61 (343649) should have the |	test  8 FAILURE: sr_subscribe cp_f61 (343649) should have the
test  9 FAILURE: sr_poll test1_f62 (181521) should have half  |	test  9 FAILURE: sr_poll test1_f62 (181526) should have half 
test 10 success: sr_subscribe q_f71 (181536) should have the  |	test 10 success: sr_subscribe q_f71 (181541) should have the 
test 11 FAILURE: sr_post test2_f61 (181136) should have half  |	test 11 FAILURE: sr_post test2_f61 (181136) should have half 

Note4: After I stopped all processes, I counted 837830 post_log entries, so that should mean there have been 837830 inotify events (unsure?).
Note5: I noticed that the winnows posted only half of what they should have. The shovels were at 186k while the winnows posted only 92k. I also counted the flow balance manually with this:

> grep -c 'msg_total' sr_*winnow*.log* | cut -d":" -f2- | paste -sd+ | bc;
186708
> grep -c 'Ignored' sr_*winnow*.log* | cut -d":" -f2- | paste -sd+ | bc;
93761
> expr 186708 - 93761
92947

So close to 94K messages have been ignored, and I still don't know why. Here is the code being executed in sr_winnow (aka sr_subscribe):

        if self.caching and not self.msg.isRetry :
           new_msg = self.cache.check_msg(self.msg)

           if not new_msg :
              if self.reportback : self.msg.report_publish(304,'Not modified1')
              self.logger.info("Ignored %s not modified" % (self.msg.notice))
              if self.msg.isRetry: self.consumer.msg_worked()
              return True
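
For context, cache.check_msg is what flags the duplicates here: it remembers a key for each message and returns False when the key was already seen, which produces the "Ignored ... not modified" lines. A minimal sketch of the idea (hypothetical class, not the sarracenia cache implementation):

import time

class DedupCache:
    # remembers recently seen message keys (e.g. checksum + path)
    def __init__(self, ttl=300):
        self.ttl = ttl          # seconds before a key may be posted again
        self.seen = {}          # key -> time last seen

    def check_msg(self, key):
        now = time.time()
        # forget entries older than the ttl
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        is_new = key not in self.seen
        self.seen[key] = now
        return is_new           # False means: ignore, already winnowed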


benlapETS commented Jun 3, 2019

Note1: Here is a script to trace the propagation of one file:

grep "\[INFO\].*20190529T\?182216.600" *.log* > foo
echo $p | awk '{print $2}' foo | sort | uniq > bar
while read p; do 
 grep "$p" foo
done < bar > foobar
cut -c -160 foobar | less

Here is the route of that file with timing:

sr_cpump_pelle_dd2_f05_01.log.2019-05-29:2019-05-29 14:21:30,507 [INFO] received: 20190529182216.600 http://dd5.weather.gc.ca /bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 topic=v02.post.bulletins.alphanumeric.20190529.UB.KMSC.18 sum=d,19ddb348155d43a403231f3a5fcfcd08 source=WXO-DD to_clusters=DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM from_cluster=DDSR.CMC mtime=20190529182216.600 atime=20190529182216.600 mode=0644 parts=1,569,1,0,0 sundew_extension=cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214 filename=msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214
sr_cpump_pelle_dd2_f05_01.log.2019-05-29:2019-05-29 14:21:30,509 [INFO] published: 20190529182216.600 http://dd5.weather.gc.ca /bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 topic=v02.post.bulletins.alphanumeric.20190529.UB.KMSC.18 sum=d,19ddb348155d43a403231f3a5fcfcd08 source=WXO-DD to_clusters=DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM from_cluster=DDSR.CMC mtime=20190529182216.600 atime=20190529182216.600 mode=0644 parts=1,569,1,0,0 sundew_extension=cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214 filename=msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214
sr_cpump_xvan_f15_01.log.2019-05-29:2019-05-29 14:21:30,540 [INFO] received: 20190529182216.600 http://dd5.weather.gc.ca /bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 topic=v02.post.bulletins.alphanumeric.20190529.UB.KMSC.18 sum=d,19ddb348155d43a403231f3a5fcfcd08 source= to_clusters=DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM from_cluster=DDSR.CMC mtime=20190529182216.600 atime=20190529182216.600 mode=0644 parts=1,569,1,0,0 filename=msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214 sundew_extension=cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214
sr_cpump_xvan_f15_01.log.2019-05-29:2019-05-29 14:21:30,549 [INFO] published: 20190529182216.600 http://dd5.weather.gc.ca /bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 topic=v02.post.bulletins.alphanumeric.20190529.UB.KMSC.18 sum=d,19ddb348155d43a403231f3a5fcfcd08 source= to_clusters=DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM from_cluster=DDSR.CMC mtime=20190529182216.600 atime=20190529182216.600 mode=0644 parts=1,569,1,0,0 filename=msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214 sundew_extension=cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214
sr_shovel_t_dd1_f00_01.log.2019-05-29:2019-05-29 14:21:35,577 [INFO] post_log notice=20190529182216.600 https://hpfx.collab.science.gc.ca /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'mtime': '20190529182216.600', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'atime': '20190529182216.600', 'from_cluster': 'DDSR.CMC', 'parts': '1,569,1,0,0', 'source': 'WXO-DD'}
sr_shovel_t_dd2_f00_01.log.2019-05-29:2019-05-29 14:21:36,141 [INFO] post_log notice=20190529182216.600 https://hpfx.collab.science.gc.ca /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'mtime': '20190529182216.600', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'atime': '20190529182216.600', 'from_cluster': 'DDSR.CMC', 'parts': '1,569,1,0,0', 'source': 'WXO-DD'}
sr_winnow_t00_f10_01.log.2019-05-29:2019-05-29 14:21:43,906 [INFO] post_log notice=20190529182216.600 https://hpfx.collab.science.gc.ca /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'mtime': '20190529182216.600', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'atime': '20190529182216.600', 'from_cluster': 'DDSR.CMC', 'parts': '1,569,1,0,0', 'source': 'WXO-DD'}
sr_winnow_t00_f10_01.log.2019-05-29:2019-05-29 14:21:44,362 [INFO] Ignored 20190529182216.600 https://hpfx.collab.science.gc.ca /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 not modified
sr_sarra_download_f20_01.log.2019-05-29:2019-05-29 14:21:53,225 [INFO] post_log notice=20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'mtime': '20190529182216.600', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'atime': '20190529182216.600', 'from_cluster': 'DDSR.CMC', 'parts': '1,569,1,0,0', 'source': 'WXO-DD'}
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:21:54,868 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:05,522 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:05,523 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:11,434 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:11,436 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:19,152 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:19,161 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:26,884 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:26,886 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:32,490 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:32,490 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:38,940 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f90_01.log.2019-05-29:2019-05-29 14:22:38,943 [INFO] post_log notice=20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'atime': '20190529182216.600', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'from_cluster': 'DDSR.CMC', 'mtime': '20190529182216.600', 'parts': '1,569,1,0,0', 'source': 'WXO-DD', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'pclean_f90': '.hlink'}
sr_shovel_pclean_f91_01.log.2019-05-29:2019-05-29 14:22:40,511 [INFO] appended to retry list file 20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f91_01.log.2019-05-29:2019-05-29 14:22:51,530 [INFO] Retrying notice  20190529182216.600 http://localhost:8001/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045
sr_shovel_pclean_f91_01.log.2019-05-29:2019-05-29 14:22:51,531 [INFO] post_log notice=20190529182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'atime': '20190529T182216.600', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'from_cluster': 'DDSR.CMC', 'mtime': '20190529T182216.600', 'parts': '1,569,1,0,0', 'source': 'WXO-DD', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'pclean_f91': '.hlink', 'pubTime': '20190529T182216.600', 'baseUrl': 'http://localhost:8001', 'relPath': '/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045', 'integrity': {'method': 'md5', 'value': 'Gd2zSBVdQ6QDIx86X8/NCA=='}, 'size': 569}
sr_shovel_pclean_f92_01.log.2019-05-29:2019-05-29 14:22:55,716 [INFO] post_log notice=20190529T182216.600 http://localhost:8001 /20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045 headers={'atime': '20190529T182216.600', 'filename': 'msg_ddsr-WXO-DD3_ec5b52df0e6a57033493ef056cc97ab9:cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'from_cluster': 'DDSR.CMC', 'mtime': '20190529T182216.600', 'source': 'WXO-DD', 'sundew_extension': 'cvt_nws_bulletins-sr:KMSC:UB:3:Direct:20190529182214', 'to_clusters': 'DDSR.CMC,DDI.CMC,CMC,SCIENCE,EDM', 'pubTime': '20190529T182216.600', 'baseUrl': 'http://localhost:8001', 'relPath': '/20190529/WXO-DD/bulletins/alphanumeric/20190529/UB/KMSC/18/UBUS01_KMSC_291800___69045', 'sum': 'd,19ddb348155d43a403231f3a5fcfcd08', 'parts': '1,569,1,0,0', 'pclean_f92': '.hlink'}

The routing seems OK in this particular case: we see that the winnow ignores a message that is already there, and the file goes through the flow test in 1m30. Ignored messages do not seem to be related to the current issue. I should now find a file that does not complete pclean, to see how the system behaves in that case.
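
The f90 log above also shows the pattern that worries me: each failed check re-appends the notice to the retry list, so a file whose companion never materializes cycles forever, and each cycle adds log and message volume. A sketch of that feedback (hypothetical code, not the plugin itself):

from collections import deque

retry = deque(["notice-A"])          # a hypothetical stuck notice

def companion_exists(notice):
    return False                     # its .hlink never shows up

for _ in range(3):                   # capped so the sketch halts; the real loop is not
    notice = retry.popleft()
    print("Retrying notice", notice)
    if not companion_exists(notice):
        print("appended to retry list", notice)
        retry.append(notice)         # re-queued: the cycle seen in the log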


benlapETS commented Jun 4, 2019

Note1: I added a switch to disable the summaries in the flow_check and tried to speed up the computation.
Note2: I tried to trace where the flow test watch is leaking. To do that, I tried to segregate watch events by file extension (taken from the relPath):

  1. Here is a count of the normal inotify events that watch posted:

The snippet

grep '\[INFO\] post_log' sr_watch_f40_*.log* > post_logs.txt
# count the posts whose relPath does NOT end in one of the pclean test extensions
grep -c -v "relPath': '/.*\.slink'\|relPath': '/.*\.hlink'\|relPath': '/.*\.moved'" post_logs.txt

The results

476317
  2. Here are the watch's posts logged because of the pclean tests:

The snippet

# count posts per pclean companion-file extension
for i in \.hlink \.slink \.moved; do
   echo $i
   grep -c -h "relPath': '/.*$i'" post_logs.txt
done

The results

.hlink
121215
.slink
120725
.moved
119573
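
The same classification can be done in one pass; a sketch in Python, assuming the post_logs.txt produced above:

import re
from collections import Counter

# bucket each sr_watch post by the pclean extension in its relPath,
# or 'normal' when none of the test extensions match
pat = re.compile(r"relPath': '/[^']*(\.hlink|\.slink|\.moved)'")
counts = Counter()
with open("post_logs.txt", errors="replace") as f:
    for line in f:
        m = pat.search(line)
        counts[m.group(1) if m else "normal"] += 1
print(counts)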


benlapETS commented Jun 5, 2019

Note1: I continued yesterday's analysis:

  • Here is the distribution over the dates:
grep -c '\[INFO\] post_log' sr_watch_f40_*.log*
sr_watch_f40_01.log:151428
sr_watch_f40_01.log.2019-05-29:351422
sr_watch_f40_01.log.2019-05-30:334980
  • As we can see, the watch should have received enough early on 2019-05-30, but it kept posting even after the dd* shovels were stopped at '2019-05-30 00:01:14,631'.
  • In f40 we have the reject .*\.tmp, so I don't understand why I am seeing posts with .tmp in sr_watch; this represents 80k events (see the sketch below).
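
For reference, the reject option applies a regex against the path, so a post whose path matches .*\.tmp should never be published. A minimal sketch of the expected filtering (plain Python re, not the sarracenia masks code):

import re

reject = re.compile(r".*\.tmp")

# only paths that do not match the reject pattern should be posted
paths = ["/data/foo.txt", "/data/foo.txt.tmp"]
print([p for p in paths if not reject.match(p)])   # ['/data/foo.txt']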

Note2: I refined my script to classify watch posts by file extension.

Note3: What I see now is that this is not only a problem with the pclean tests; I also need to investigate what was going on after the shovels stopped.


benlapETS commented Jun 11, 2019

Note1: I think I am done with this task. Basically, I removed the retry functionality from the pclean plugin and added a message delay in the config; this stabilized the flow test, which had been unmanageable because retries kept popping up constantly and duplicating a lot of messages.
Note2: I just need to test it with at least one high-volume sample to see if we have made progress.
Note3: After that fix is merged, I will also need to wrap up related issues like #182 and #179.

@benlapETS benlapETS added the likely-fixed likely fix is in the repository, success not confirmed yet. label Jun 14, 2019