Skip to content
This repository has been archived by the owner on Apr 17, 2023. It is now read-only.

Refactor Metrics Collection to be deterministic and complete #671

Merged
merged 14 commits into from
Mar 17, 2016

Conversation

lfryc
Copy link
Contributor

@lfryc lfryc commented Jan 14, 2016

https://issues.jboss.org/browse/AGPUSH-1512

refactor metric collection logic so that it is triggered externally in cycles until everything is processed

  • this involves new TriggerMetricCollectionQueue that is used to store a "request for metrics processing"
  • will allow us to update metrics more efficiently since one update cycle will cover not only several BatchLoaded events but also several VariantMetricInformation events

See diagram for more clarity: https://docs.google.com/drawings/d/12cUrYf7ACLYEoYsIWul2szN3lmQcYFMjjaZDq2fr9qQ/edit?usp=sharing

🚢 deployed to https://upsstaging-aerogearz.rhcloud.com/ag-push/ (feel free to test it there, ping me for a password)
🎯 targets master

@lfryc
Copy link
Contributor Author

lfryc commented Jan 15, 2016

Tests fail because of WildFly Creaper is not in Maven Central wildfly-extras/creaper#28

Locally tests should pass with JBoss Users Maven Settings: https://developer.jboss.org/wiki/MavenGettingStarted-Users

@lfryc
Copy link
Contributor Author

lfryc commented Jan 20, 2016

Waiting for engineering operations to set artifacts synchronization up, see linked JIRA.

@lfryc lfryc force-pushed the AGPUSH-1512-refactor-metrics-rebase branch from b90f147 to b3442b2 Compare January 25, 2016 12:36
@Ladicek
Copy link

Ladicek commented Jan 27, 2016

Small hint: OnlineManagementClient is Closeable, yet you're not closing it. You really want to :-) (Also, it's fairly heavy to create, so creating it once and reusing is preferable.)

@lfryc
Copy link
Contributor Author

lfryc commented Jan 27, 2016

Nice catch, @Ladicek , should be fixed now.

@lfryc
Copy link
Contributor Author

lfryc commented Jan 28, 2016

@sebastienblanc @edewit could you please 👀 this?

@sebastienblanc
Copy link
Member

Ok, after testing this PR , I still have the hanging Pending status issue but seems to happen only after I start the server for the first psuh notifications, after that it works. If I stop the server and restart, again, the first 2 push notifications are stuck on pending.

I can see that in the log for the one pending :

16:09:09,458 WARN  [org.hornetq.core.server] (Thread-18 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=a2af8517-cf3b-11e5-aa9d-ad3ca46f2f82-312435197-500839064)) HQ222149: Message Reference[2758]:RELIABLE:ServerMessage[messageID=2758,durable=true,userID=0047c1b2-cf3f-11e5-bc76-275315df99b3,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.BatchLoadedQueue,properties=TypedProperties[__HQ_CID=ffe6dcf5-cf3e-11e5-bc76-275315df99b3,variantID=7448ae17-d711-44cc-b4b5-9ba23a7c5b9e]]@1434059439 has reached maximum delivery attempts, sending it to Dead Letter Address jms.queue.DLQ from jms.queue.BatchLoadedQueue
16:09:09,459 WARN  [org.hornetq.core.server] (Thread-18 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=a2af8517-cf3b-11e5-aa9d-ad3ca46f2f82-312435197-500839064)) HQ222149: Message Reference[2857]:RELIABLE:ServerMessage[messageID=2857,durable=true,userID=109f4263-cf3f-11e5-bc76-275315df99b3,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.BatchLoadedQueue,properties=TypedProperties[__HQ_CID=ffe6dcf5-cf3e-11e5-bc76-275315df99b3,variantID=7448ae17-d711-44cc-b4b5-9ba23a7c5b9e]]@1028020992 has reached maximum delivery attempts, sending it to Dead Letter Address jms.queue.DLQ from jms.queue.BatchLoadedQueue

@sebastienblanc
Copy link
Member

Ok , I can really confirm the pattern : 1.start AS 2. the 2 first notifications got stuck on pending 3. the third and following notifications works. 4. stop and restart the AS 5. Go back to step 2

@lfryc
Copy link
Contributor Author

lfryc commented Feb 10, 2016

@sebastienblanc as I mentioned on hangout, it is necessary to run the CLI script above to set address-settings for TriggerMetricCollection.

However I believe I could overlook a setup for BatchLoaded and AllBatchesLoaded queues - it is likely that these two queues will require an extra setting as well to avoid sendings them to DLQ.

@lfryc
Copy link
Contributor Author

lfryc commented Feb 14, 2016

I have tested this setup locally:

[standalone@localhost:9999 /] /subsystem=messaging/hornetq-server=default/address-setting=jms.queue.TriggerMetricCollectionQueue:add(redelivery-delay=1000, max-delivery-attempts=300)
[standalone@localhost:9999 /] /subsystem=messaging/hornetq-server=default/address-setting=jms.queue.MetricsQueue:add(max-delivery-attempts=-1)
[standalone@localhost:9999 /] /subsystem=messaging/hornetq-server=default/address-setting=jms.queue.BatchLoadedQueue:add(max-delivery-attempts=-1)
[standalone@localhost:9999 /] /subsystem=messaging/hornetq-server=default/address-setting=jms.queue.AllBatchesLoadedQueue:add(max-delivery-attempts=-1)

Let me update the PR.

@lfryc
Copy link
Contributor Author

lfryc commented Feb 15, 2016

Applied patch that sets max-delivery-attempts to infinite for queues that we pull from in MetricCollector.

connect
batch

/subsystem=messaging/hornetq-server=default/address-setting=jms.queue.TriggerMetricCollectionQueue:add(redelivery-delay=1000, max-delivery-attempts=300)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wonder if a hard limit 5 mins for delivery (300x 1000ms) is enough here. We could either consider much highter values (hours) or even set it to infinite (-1).

For newer WildFly/EAP versions (WF 10, EAP7), we can set up exponential backoff so that if delivery is really long, it does not try redeliver each second, but rather each 15s or so,

such as: redelivery-delay=1000, redelivery-multiplier=1.5, max-redelivery-delay=15000, max-delivery-attempts=300)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfryc can you create a new JIRA for including this in the installation guide?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfryc the file needs to be included in the DIST bundle (via our maven assembly)

Something else, is that in database we already some .cli scripts - is this something we can combine?
and perhaps just rename them?)

@matzew
Copy link
Contributor

matzew commented Mar 1, 2016

@lfryc @sebastienblanc does this require some extra configuration steps to be run? (wondering because of the CLI changes).

If yes, where are they documented?

@sebastienblanc
Copy link
Member

yes you need to run the jms-setup-wildfly.cli script.

On Tue, Mar 1, 2016 at 11:25 AM, Matthias Wessendorf <
[email protected]> wrote:

@lfryc https://github.com/lfryc @sebastienblanc
https://github.com/sebastienblanc does this require some extra
configuration steps to be run? (wondering because of the CLI changes).

If yes, where are they documented?


Reply to this email directly or view it on GitHub
#671 (comment)
.

@matzew
Copy link
Contributor

matzew commented Mar 1, 2016

This causes a new concurrency issue..

  • I submit a loop of two: works (on the UI)
  • I submit another loop of two: stuck in pending
  • a bit later... I submit one more loop of two: the previous two are now succeeding :) but the current onces are stuck in pending

Also, I see something like this:

11:49:04,544 WARN  [org.hornetq.core.server] (Thread-8 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1130858087-136844650)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=53982,durable=true,userID=374dd275-df9b-11e5-a01d-7322db26dd1f,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1456829345040,__HQ_CID=374d5d42-df9b-11e5-a01d-7322db26dd1f,_HQ_DUPL_ID=4d3f0636-234d-4009-8e12-773cd0769c1a]]@737050820

in the logs

PS: all push notifications have been delivered, but the code that deals with the analytic UI is broken and has (concurrency) issues

@matzew
Copy link
Contributor

matzew commented Mar 1, 2016

@lfryc regarding the analytic UI updates, I think the current code is broken when there are concurrent push requests.

For instance: have a loop of two push requests (via java sender), sending the same payload. Internally the endpoint issues to different 'submit' to the router, but than, the Message Queue is bound to the variantID. Of course, this is the same for different push jobs.

RESULT:
in case of the loop of two, the data for the first request is 2 btaches and 0 for the second.
If you do a sleep(1000) between the sending of the requests, you get 1 batch for each (as expected). I think... we need to have something like variantID+pushjobID to make this more robust

@lfryc
Copy link
Contributor Author

lfryc commented Mar 14, 2016

Thanks for testing @matzew , going to reproduce this locally...

@matzew
Copy link
Contributor

matzew commented Mar 14, 2016

@lfryc the concurrency is issue is already fixed, but I think the other problem here remains.

@lfryc lfryc force-pushed the AGPUSH-1512-refactor-metrics-rebase branch from e628c12 to 4a8844f Compare March 15, 2016 18:04
@lfryc
Copy link
Contributor Author

lfryc commented Mar 15, 2016

rebased on latest master


@matzew this message rather reflects a feature:

11:49:04,544 WARN  [org.hornetq.core.server] (Thread-8 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1130858087-136844650)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=53982,durable=true,userID=374dd275-df9b-11e5-a01d-7322db26dd1f,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1456829345040,__HQ_CID=374d5d42-df9b-11e5-a01d-7322db26dd1f,_HQ_DUPL_ID=4d3f0636-234d-4009-8e12-773cd0769c1a]]@737050820

if you look at the diagram [1] CollectTriggerEvent can be queued from several places - in order to make sure there is just one of them processed, HornetQ deduplicates it. Regrettably it also prints Warning for that case. We could recommend users to supress that log if it should be annoying.

[1] https://docs.google.com/drawings/d/12cUrYf7ACLYEoYsIWul2szN3lmQcYFMjjaZDq2fr9qQ/edit

@matzew
Copy link
Contributor

matzew commented Mar 16, 2016

👀

<dependency>
<groupId>org.wildfly.extras.creaper</groupId>
<artifactId>creaper-core</artifactId>
<version>0.9.2</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0.0 is out, and on maven central http://search.maven.org/#search%7Cga%7C1%7Ccreaper-core

@lfryc
Copy link
Contributor Author

lfryc commented Mar 16, 2016

Upgraded WildFly Creaper to 1.0.0.

Btw I can squash all the previous commits, I just wanted to keep the specific commits around, presumably it may make it simpler for review.

@matzew
Copy link
Contributor

matzew commented Mar 16, 2016

@squash: sounds good

On Wed, Mar 16, 2016 at 3:23 PM, Lukáš Fryč [email protected]
wrote:

Upgraded WildFly Creaper to 1.0.0.

Btw I can squash all the previous commits, I just wanted to make the
specific commits around, presumably it may make it simpler for review.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#671 (comment)

Matthias Wessendorf

blog: http://matthiaswessendorf.wordpress.com/
sessions: http://www.slideshare.net/mwessendorf
twitter: http://twitter.com/mwessendorf

@matzew
Copy link
Contributor

matzew commented Mar 16, 2016

2016-03-16 15:46:18,121 INFO  [org.jboss.aerogear.unifiedpush.message.TokenLoader] (Thread-75 (HornetQ-client-global-threads-1371010113)) Loaded batch #2880, containing 1 tokens, for android variant (ae359806-9f6f-4367-8165-e27bcae65f87)
2016-03-16 15:46:18,122 SEVERE [org.jboss.aerogear.unifiedpush.message.sender.GCMPushNotificationSender] (Thread-86 (HornetQ-client-global-threads-1371010113)) Error sending payload to GCM server
2016-03-16 15:46:18,122 WARNING [org.jboss.aerogear.unifiedpush.message.NotificationDispatcher] (Thread-86 (HornetQ-client-global-threads-1371010113)) Error on &apos;android&apos; delivery
2016-03-16 15:46:18,123 WARN  [org.hornetq.core.server] (Thread-62 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1307557425-1088932442)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1043515,durable=true,userID=d75fde12-eb85-11e5-9714-670252db5bfe,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458139578622,__HQ_CID=d75fb6fc-eb85-11e5-9714-670252db5bfe,_HQ_DUPL_ID=4e74a97d-0a5f-41ae-b991-2de4e07fee0a]]@1871682950
2016-03-16 15:46:18,125 SEVERE [org.jboss.aerogear.unifiedpush.message.sender.GCMPushNotificationSender] (Thread-73 (HornetQ-client-global-threads-1371010113)) Error sending payload to GCM server
2016-03-16 15:46:18,125 WARNING [org.jboss.aerogear.unifiedpush.message.NotificationDispatcher] (Thread-73 (HornetQ-client-global-threads-1371010113)) Error on &apos;android&apos; delivery
2016-03-16 15:46:18,126 WARN  [org.hornetq.core.server] (Thread-62 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1307557425-1088932442)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1043550,durable=true,userID=d7605352-eb85-11e5-9714-670252db5bfe,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458139578625,__HQ_CID=d760534d-eb85-11e5-9714-670252db5bfe,_HQ_DUPL_ID=4e74a97d-0a5f-41ae-b991-2de4e07fee0a]]@220335884
2016-03-16 15:46:18,128 WARN  [org.hornetq.core.server] (Thread-61 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1307557425-1088932442)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1043576,durable=true,userID=d760a178-eb85-11e5-9714-670252db5bfe,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458139578628,__HQ_CID=d760a175-eb85-11e5-9714-670252db5bfe,_HQ_DUPL_ID=4e74a97d-0a5f-41ae-b991-2de4e07fee0a]]@2011038242
2016-03-16 15:46:18,136 INFO  [org.jboss.aerogear.unifiedpush.message.TokenLoader] (Thread-79 (HornetQ-client-global-threads-1371010113)) Preparing message delivery and loading tokens for the ANDROID 3rd-party Push Network (for 1 variants)
2016-03-16 15:46:18,136 SEVERE [org.jboss.aerogear.unifiedpush.message.sender.GCMPushNotificationSender] (Thread-84 (HornetQ-client-global-threads-1371010113)) Error sending payload to GCM server
2016-03-16 15:46:18,136 WARNING [org.jboss.aerogear.unifiedpush.message.NotificationDispatcher] (Thread-84 (HornetQ-client-global-threads-1371010113)) Error on &apos;android&apos; delivery
2016-03-16 15:46:18,141 WARN  [org.hornetq.core.server] (Thread-61 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1307557425-1088932442)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1043674,durable=true,userID=d7629d54-eb85-11e5-9714-670252db5bfe,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458139578640,__HQ_CID=d7627641-eb85-11e5-9714-670252db5bfe,_HQ_DUPL_ID=4e74a97d-0a5f-41ae-b991-2de4e07fee0a]]@1098199559

@matzew
Copy link
Contributor

matzew commented Mar 16, 2016

Sending one push, to a variant (default config for batchSize etc), containing one iOS device, gives me:

16:31:35,564 INFO  [org.jboss.aerogear.unifiedpush.message.NotificationRouter] (default task-38) Internal dispatching of push message for one ios variant
16:31:35,674 INFO  [org.jboss.aerogear.unifiedpush.message.TokenLoader] (Thread-15 (HornetQ-client-global-threads-387973010)) Preparing message delivery and loading tokens for the IOS 3rd-party Push Network (for 1 variants)
16:31:35,729 INFO  [org.jboss.aerogear.unifiedpush.message.NotificationDispatcher] (Thread-14 (HornetQ-client-global-threads-387973010)) Received UnifiedPushMessage from JMS queue, will now trigger the Push Notification delivery for the ios variant (ce1ae9d9-536b-4076-8dac-b93a5432c334)
16:31:35,742 INFO  [org.jboss.aerogear.unifiedpush.message.TokenLoader] (Thread-15 (HornetQ-client-global-threads-387973010)) Loaded batch #1, containing 1 tokens, for ios variant (ce1ae9d9-536b-4076-8dac-b93a5432c334)
16:31:35,780 WARN  [org.hornetq.core.server] (Thread-18 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1373525482-701823084)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1049029,durable=true,userID=2b391c8a-eb8c-11e5-8b4c-ad4cf54d44f6,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458142296275,__HQ_CID=2b38ce67-eb8c-11e5-8b4c-ad4cf54d44f6,_HQ_DUPL_ID=9ff56197-e586-4d36-9a6e-06516de38a48]]@688328975
16:31:36,892 INFO  [org.jboss.aerogear.unifiedpush.message.sender.APNsPushNotificationSender] (Thread-14 (HornetQ-client-global-threads-387973010)) Sent push notification to the Apple APNs Server for 1 tokens
16:31:36,907 WARN  [org.hornetq.core.server] (Thread-19 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=aac76191-2158-11e5-b1f2-09b1f5772f56-1373525482-701823084)) HQ222059: Duplicate message detected - message will not be routed. Message information:
ServerMessage[messageID=1049045,durable=true,userID=2be56226-eb8c-11e5-8b4c-ad4cf54d44f6,priority=4, bodySize=1500,expiration=0, durable=true, address=jms.queue.TriggerMetricCollectionQueue,properties=TypedProperties[_HQ_SCHED_DELIVERY=1458142297405,__HQ_CID=2be53b13-eb8c-11e5-8b4c-ad4cf54d44f6,_HQ_DUPL_ID=9ff56197-e586-4d36-9a6e-06516de38a48]]@149082573

Not sure why we see the Duplicate message warnings;

@squash
Copy link

squash commented Mar 16, 2016

@matzew did you pull me into this discussion on purpose?

@matzew
Copy link
Contributor

matzew commented Mar 16, 2016

@squash haha - sorry :) I was talking to @lfryc about squashing :)

@lfryc
Copy link
Contributor Author

lfryc commented Mar 17, 2016

@matzew I've sanitized those warnings by our own de-duplication process. The process is also pretty readable from debug log (level=fine).

@matzew
Copy link
Contributor

matzew commented Mar 17, 2016

I start server with these sys props:

 ./bin/standalone.sh -b 0.0.0.0 --server-config=standalone-full.xml -Daerogear.android.batchSize=1 -Daerogear.ios.batchSize=1

but I am seeing this:

14:51:44,446 INFO  [org.jboss.aerogear.unifiedpush.message.TokenLoader] (Thread-6 (HornetQ-client-global-threads-737541343)) Loaded batch #1, containing 2 tokens, for ios variant (205077df-25ef-4466-84b2-e7bc935ff5ef)

there should not be two tokens in the batch, @lfryc

/subsystem=messaging/hornetq-server=default/address-setting=jms.queue.BatchLoadedQueue:add(max-delivery-attempts=-1)
/subsystem=messaging/hornetq-server=default/address-setting=jms.queue.AllBatchesLoadedQueue:add(max-delivery-attempts=-1)

run-batch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfryc does the latest queue belong in here too ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matzew they don't need any special handling, I believe,

the MetricsProcessingStarted topic is just "optimization" for cluster, so that it does not have to go to the DB so often...

and TriggerVariantMetricCollection will be triggered each time the notification is dispatched by PushSender, so it will eventually succeed and push TriggerMetricCollection event as a result

@lfryc
Copy link
Contributor Author

lfryc commented Mar 17, 2016

@matzew I can see the two batches are loaded when aerogear.ios.batchSize=1, to be sure that the property is loaded correctly, I've created a new test

@matzew
Copy link
Contributor

matzew commented Mar 17, 2016

@lfryc awesome!

this also fixes the concurrency issue I noticed earlier this week 🌹 💋

@matzew matzew merged commit 7c2f2c2 into aerogear:master Mar 17, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants