Question about obtaining the benchmark result #7

cherishwsx · 2020-05-27T23:40:54Z

Thank you for all the amazing work you've done!

I successfully ran through the training and prediction process of the DeepLog model using the same HDFS data file that you use (from loghub).

I'm using Drain as my parsing tool to get the structured log data, and I ended up with 48 unique event IDs in the template file. I'm using around 5000 sessions for training, and the training and validation losses converged to about 0.2 (starting from 0.8) after 300+ epochs. I didn't change the default parameter settings in the deeplog.py file except for the number of classes (48 in my case).

The result that I got from prediction is shown below. It does not look as promising as the benchmark.

I'm not sure why, but is it because of the parsing tool?

Any ideas or suggestions for improving the model results are welcome!!

cherishwsx · 2020-05-28T00:05:23Z

I also forgot to ask: could you briefly explain what the num_candidates parameter is for in prediction?

Thank you!!!!

d0ng1ee · 2020-05-28T01:50:24Z

It depends on your parsing tool. My benchmark result is based on the "ground truth" number of templates (28) in the dataset.
num_candidates means that a log key is labeled as normal if it is among the top num_candidates predictions.
(You need to read the DeepLog paper to get a better understanding of num_candidates...)
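To illustrate the top num_candidates rule described above, here is a minimal sketch in plain Python. The function name `in_top_candidates` and the score values are made up for illustration and are not taken from the repository's code; it only assumes the model produces one score per log key for the next event.

```python
from typing import Sequence


def in_top_candidates(scores: Sequence[float], actual_key: int, num_candidates: int) -> bool:
    """Return True if actual_key is among the num_candidates highest-scoring keys."""
    # Rank log keys from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return actual_key in ranked[:num_candidates]


# Hypothetical scores for 4 log keys (illustrative values only).
scores = [0.05, 0.60, 0.25, 0.10]
print(in_top_candidates(scores, actual_key=2, num_candidates=2))  # key 2 is ranked second
print(in_top_candidates(scores, actual_key=3, num_candidates=2))  # key 3 is ranked third
```

A larger num_candidates accepts more next events as normal, so it tends to produce fewer false alarms but may miss more anomalies.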

Try to fine-tune num_candidates to get a better F1 score.
Try to modify your parsing code to get a result close to the ground truth (28 templates).
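Tuning num_candidates for F1 could look roughly like the sketch below. It is not the repository's implementation: the helper names are hypothetical, the toy data is invented, and it assumes a session is flagged as anomalous as soon as one observed log key falls outside the top num_candidates predictions.

```python
from typing import List, Sequence, Tuple

# A session is (per-step score vectors, observed next keys); hypothetical layout.
Session = Tuple[List[Sequence[float]], List[int]]


def session_is_anomalous(step_scores, actual_keys, num_candidates: int) -> bool:
    """Flag the session if any observed key is outside the top num_candidates."""
    for scores, key in zip(step_scores, actual_keys):
        ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        if key not in ranked[:num_candidates]:
            return True
    return False


def f1(predicted: List[bool], labels: List[bool]) -> float:
    """F1 score with 'anomalous' as the positive class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_num_candidates(sessions: List[Session], labels: List[bool], candidates):
    """Try each num_candidates value and keep the one with the highest F1."""
    best_k, best_f1 = None, -1.0
    for k in candidates:
        preds = [session_is_anomalous(s, a, k) for s, a in sessions]
        score = f1(preds, labels)
        if score > best_f1:
            best_k, best_f1 = k, score
    return best_k, best_f1


# Toy data: three one-step sessions over 3 log keys (illustrative only).
sessions = [
    ([[0.5, 0.3, 0.2]], [1]),  # normal, observed key ranked second
    ([[0.5, 0.3, 0.2]], [2]),  # anomalous, observed key ranked third
    ([[0.7, 0.2, 0.1]], [0]),  # normal, observed key ranked first
]
labels = [False, True, False]
print(best_num_candidates(sessions, labels, candidates=range(1, 4)))
```

In practice the sweep should run on a labeled validation split rather than on the test set, so the reported F1 is not tuned on the data it is measured on.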

cherishwsx · 2020-05-28T03:24:40Z

Thank you so much for the suggestions! That's really helpful!

One follow-up question, which may sound naive: do we always know the ground-truth number of templates for a log? And when using a parsing tool, do we modify the parsing code so that the resulting templates are as close as possible to the ground-truth number we know?

d0ng1ee · 2020-05-28T03:43:31Z

In industrial applications, constantly updated logs have no definite ground-truth templates, so you need to continuously optimize the model based on performance indicators :)

cherishwsx · 2020-05-29T02:01:18Z

Got it! Thank you! I don't have any further questions for now! :))

cherishwsx closed this as completed May 29, 2020

tongxiao-cs mentioned this issue Dec 9, 2021

In HDFS templates count is 28? #30

Open
