Synthetic data generation #4

Closed
iamgroot42 opened this issue Jun 16, 2021 · 12 comments

Comments

@iamgroot42
Contributor

Hi.

Fascinating work! Nice to finally see a usable collection of network-based datasets for GNN-driven methods. Could you please add the code used for synthetic graph generation? I'm looking to generate more graphs with some additional constraints, so I'd appreciate any useful links or pointers to the generation code used by the authors 😸

@jzhou316
Collaborator

Hi @iamgroot42, thanks for your interest! We are working on sorting out the code for the graph generation part and will point you to it ASAP!

@iamgroot42
Contributor Author

@jzhou316 thanks! Looking forward to it :D

@jzhou316
Collaborator

Hi @iamgroot42 we have added the code for generating botnets for detection (overlaying synthetic botnet topologies on real network traces) here. Hope that can give you some ideas!
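
For readers following along, here is a minimal sketch of what "overlaying" a synthetic botnet topology on a background graph can look like. It is not the repository's code: `ring_topology`, `overlay_botnet`, the ring shape, and the uniformly random choice of host nodes are all assumptions made for illustration, and the tiny background edge list stands in for a graph built from real traces.

```python
import random


def ring_topology(n):
    """Edges of a ring over botnet node ids 0..n-1 (a stand-in topology, assumed)."""
    return [(i, (i + 1) % n) for i in range(n)]


def overlay_botnet(background_edges, botnet_edges, seed=0):
    """Map botnet node ids onto randomly chosen background nodes.

    Returns (merged undirected edge set, set of nodes labelled as bots).
    Hypothetical helper: the real generator may select hosts differently.
    """
    rng = random.Random(seed)
    background_nodes = sorted({u for e in background_edges for u in e})
    botnet_nodes = sorted({u for e in botnet_edges for u in e})
    if len(botnet_nodes) > len(background_nodes):
        raise ValueError("botnet has more nodes than the background graph")

    # Pick distinct background nodes to play the role of bots.
    hosts = rng.sample(background_nodes, len(botnet_nodes))
    mapping = dict(zip(botnet_nodes, hosts))

    # Store edges as frozensets so (u, v) and (v, u) count once; skip self-loops.
    merged = {frozenset(e) for e in background_edges if e[0] != e[1]}
    merged |= {
        frozenset((mapping[u], mapping[v])) for u, v in botnet_edges if u != v
    }
    return merged, set(hosts)


background = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
edges, bots = overlay_botnet(background, ring_topology(4), seed=1)
print(len(bots), "bot nodes;", len(edges), "edges after overlay")
```

The returned bot set can serve as node labels for a detection task, while the merged edge set is the graph a model would actually see.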

@iamgroot42
Contributor Author

Thanks a lot, @jzhou316 ! Really appreciate the effort :)

@iamgroot42
Contributor Author

@jzhou316 I have a follow-up question on the exact process used to generate the data used in the paper (and available via the download links).

  • Which timestamp files were used? (I can see that a particular one is the default value in botnetGenerator.py.)
  • What time ranges were used?

iamgroot42 reopened this Jun 26, 2021

@jzhou316
Collaborator

jzhou316 commented Jun 29, 2021

looping in @xuzhiying9510 for better explanation

@xuzhiying9510
Contributor

xuzhiying9510 commented Jun 30, 2021

@iamgroot42 Thanks for the question!

@iamgroot42
Contributor Author

iamgroot42 commented Jun 30, 2021

@xuzhiying9510, thanks for the clarification!

  • Each month's directory has 2 files per minute: dirA and dirB. Do you combine both their pcap files to generate the graph for that particular minute?
  • Looking at the dataset, there are more than 1200 graphs in total for 2018, whereas the current dataset has fewer than 1000. Did you apply some form of thresholding to exclude some graphs? (I saw that some of the pcap files were only a few MB, while others were several GB.)

@iamgroot42
Contributor Author

Bumping this up again @xuzhiying9510 @jzhou316 :) I would greatly appreciate it if you could help with these clarifications!

@iamgroot42
Contributor Author

@jzhou316 would it be possible to get the underlying .pcap files (the ones before botnets are overlaid on top of them)? I've been trying to recreate the graphs based on the instructions and the provided generation code, but the generated graphs are FAR from the current graphs: they have around 272K nodes and 949K edges, and are much sparser than the graphs provided as part of the dataset.
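
As a quick way to put a number on the sparsity mentioned above, the sketch below computes average degree and density from the node and edge counts quoted in the comment. It assumes a simple undirected graph and treats the rounded figures (272K nodes, 949K edges) as exact; `graph_stats` is a hypothetical helper, not part of the repository.

```python
def graph_stats(num_nodes, num_edges):
    """Average degree and density of a simple undirected graph (assumed)."""
    avg_degree = 2 * num_edges / num_nodes
    density = 2 * num_edges / (num_nodes * (num_nodes - 1))
    return avg_degree, density


# Rounded counts quoted in the comment above.
avg_degree, density = graph_stats(272_000, 949_000)
print(f"average degree ~ {avg_degree:.2f}, density ~ {density:.2e}")
```

Under these assumptions the regenerated graphs average about 7 edges per node. Running the same function on a graph from the released dataset gives a like-for-like comparison.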

@jzhou316
Collaborator

jzhou316 commented Jul 5, 2021

Hi @iamgroot42 I see. There might be some processing there. We'll check about it and get back to you.

@iamgroot42
Contributor Author

Clarification (fix) added via #14. Thanks a lot for looking into it!


3 participants