Add Python data compression algorithm #172

Merged: 14 commits into HSF:master on Feb 5, 2021

Conversation

rmaganza (Contributor)

As previously discussed, I implemented Andrea Sciabà's data compression algorithm in a new Python script, and also added a new section to the README based on the one for the visualization script.

The algorithm could certainly be extended (e.g. by providing some information on the original series' variance), but it seems like a good starting point to me.

The script has been run through black and flake8.
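
For readers landing here without the script: the core idea, as a rough sketch rather than the actual prmon_compress_output.py (the column names, separator, and threshold below are illustrative assumptions), is to drop rows whose metrics have not moved appreciably since the previous sample.

```python
import pandas as pd

# Rough sketch only, not the actual prmon_compress_output.py.
# Assumed: tab-separated prmon text output; "vmem"/"rss" are placeholder
# metric names; 0.05 is an illustrative relative-change threshold.
df = pd.read_csv("memory_monitor_output.txt", sep="\t")

precision = 0.05
metrics = ["vmem", "rss"]

# Keep a row whenever any metric moved by more than `precision` relative to
# the previous sample; the first and last samples are always kept.
keep = df[metrics].pct_change().abs().gt(precision).any(axis=1)
keep.iloc[[0, -1]] = True
df[keep].to_csv("compressed_output.txt", sep="\t", index=False)
```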

Riccardo Maganza added 2 commits October 20, 2020 15:20
Add data compression script explanation to README

Add option to delete original uncompressed file
@graeme-a-stewart graeme-a-stewart self-requested a review October 27, 2020 13:31
@amete amete self-requested a review January 27, 2021 10:34
@amete (Collaborator) left a comment

Hi @rmaganza,

Thanks a lot for this PR and apologies for the belated review. I think we have a few things that need to be sorted out before we move ahead with this.

Let me attach an example output from a production job (memory_monitor_output.txt). In the current implementation we seem to be a bit too aggressive, and we have some trouble interpreting the plateaus:

Original: text file size = 50 KB (plot of the full series)
Default: text file size = 4.4 KB (plot of the default compression)
Default + interpolate: text file size = 15 KB (plot of the interpolated compression)

I think we should interpolate by default, as it seems to provide a nice balance between disk space and level of detail. A simple tar of the output reduces the size to 23 KB, so there will still be a sizable gain in that configuration.

Many thanks.

Best,
Serhan
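
As a hedged illustration of the pruning-plus-interpolation being discussed (again with placeholder column names and an illustrative threshold, not the shipped implementation), the "interpolate" mode could look roughly like this: keep the union of each metric's changepoints, then store each metric at the kept timestamps as a linear interpolation between its own changepoints.

```python
import numpy as np
import pandas as pd

# Hedged sketch of an "interpolate" mode, not the shipped implementation.
# Assumed: a "Time" column, placeholder metric names, illustrative threshold.

def changepoints(s: pd.Series, precision: float = 0.05) -> pd.Series:
    """Flag samples whose relative change since the previous sample exceeds
    `precision`; the first and last samples are always flagged."""
    mask = s.pct_change().abs() > precision
    mask.iloc[[0, -1]] = True
    return mask

df = pd.read_csv("memory_monitor_output.txt", sep="\t")
metrics = ["vmem", "rss"]

# Keep the union of every metric's changepoints...
masks = {m: changepoints(df[m]) for m in metrics}
keep = pd.concat(masks, axis=1).any(axis=1)
out = df.loc[keep, ["Time"] + metrics].copy()

# ...and store each metric at the kept timestamps as a linear interpolation
# between that metric's own changepoints, so plateaus survive as ramps
# between kept points instead of collapsing into coarse steps.
for m in metrics:
    cp = df.loc[masks[m]]
    out[m] = np.interp(out["Time"], cp["Time"], cp[m])

out.to_csv("compressed_output.txt", sep="\t", index=False)
```

Under these assumptions the stored values are approximate between a metric's own changepoints, which is exactly the disk-space versus level-of-detail trade-off described above.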

Resolved review threads (outdated): package/scripts/prmon_compress_output.py (×2), README.md
rmaganza (Contributor, Author) commented on Jan 27, 2021

Thank you for your suggestions @amete, and don't worry about the timing. I added some comments in reply to your remarks, which were correct.
Let me know what you think about the problem with the interpolation I mentioned.

Regards,
Riccardo.

amete (Collaborator) commented on Jan 27, 2021

Thanks a lot Riccardo. Please see above for my response and let me know if there are any loose ends.

rmaganza (Contributor, Author) commented:

Hi Serhan (@amete),
the points we discussed should now be fixed.

Please take a last look and let me know if anything is off.

Regards,
Riccardo

@amete amete self-requested a review January 28, 2021 13:16
@amete (Collaborator) left a comment

Thanks a lot @rmaganza. It looks like we're almost there. I have two minor comments. Then we should be good to go. Please let me know.

Resolved review threads (outdated): README.md, package/scripts/prmon_compress_output.py
rmaganza (Contributor, Author) commented:

@amete Fixed the last points in the README as well as the default precision value.

@amete (Collaborator) left a comment

Just a few follow-ups/suggestions.

Resolved review threads: README.md (×2), package/scripts/prmon_compress_output.py
rmaganza (Contributor, Author) commented:

Thank you for your suggestions @amete.
I added a comment for the skip-interpolation case, which I think is the only remaining thing to iron out.

Regards,
Riccardo

rmaganza (Contributor, Author) commented on Feb 1, 2021

Hi @amete,
I added the suggestions to the README.

I also added back the utime and stime variables. I grouped them with the CPU, GPU, and thread counts, since they operate on a smaller scale than the rest of the metrics, and linear interpolation might not be well suited in that case.

Let me know if that works for you.

amete (Collaborator) commented on Feb 3, 2021

Hi @rmaganza,

Could you quickly check the most up-to-date script along with the file I posted above? Locally, I see we get no size reduction anymore. We seem to drop only two rows, and instead replace all the metrics with the interpolated values. Am I missing something?

Best,
Serhan

rmaganza (Contributor, Author) commented on Feb 3, 2021

Hi @amete,
unfortunately, it seems that adding back utime and stime prevents almost every row from being deleted: every data point of the utime and stime series is marked as a changepoint, so no rows are dropped.

If you run the same code but remove utime and stime from the required metrics, the output file size is 17 KB.

Let me know if you have any ideas about this, as I am not sure how we could keep them in the output file while maintaining the same pruning logic.

Regards,
Riccardo
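
For illustration, the behavior Riccardo describes is easy to reproduce with made-up numbers: a monotonically increasing counter trips a relative-change test at every sample.

```python
import pandas as pd

# Made-up values standing in for utime: every consecutive pair differs by
# more than 5%, so every point after the first is flagged as a changepoint
# and no row ever qualifies for deletion.
utime = pd.Series([1.0, 2.1, 3.2, 4.2, 5.3])
print((utime.pct_change().abs() > 0.05).tolist())
# [False, True, True, True, True]
```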

amete (Collaborator) commented on Feb 3, 2021

Yes, utime and stime increase monotonically, at least typically. I think you can treat them like the other metrics, e.g. memory; this way you should be able to keep the points where there are large enough changes but drop most of them. The other trick you can try is to take the first-order derivative, i.e. df['utime'].diff(), and use that to decide whether a point should be dropped, instead of df['utime']. If you play around with these a little I'm sure you can find a good compromise, but please let me know how it goes.
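
A hedged sketch of the diff-based variant suggested here (the threshold choice and column names are illustrative, not taken from the actual script):

```python
import pandas as pd

# Hedged sketch of the diff-based variant; the threshold choice and column
# names are illustrative, not taken from the actual script.
df = pd.read_csv("memory_monitor_output.txt", sep="\t")

# utime is a monotonic counter, so testing the raw values flags everything.
# Its first-order derivative is the per-sample CPU-time increment; applying
# the changepoint test to that keeps only samples where the *rate* changes.
rate = df["utime"].diff()
threshold = 2.0 * rate.abs().median()   # illustrative threshold
keep = rate.diff().abs() > threshold    # large jump in the increment
keep.iloc[[0, -1]] = True               # always keep the endpoints
compressed = df[keep]
```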

rmaganza (Contributor, Author) commented on Feb 3, 2021

Thanks. I was afraid that interpolating a time variable would not make much sense logically, but you're right that treating them like the memory variables does result in compression.

The test file size after compression is now 18 KB.

@amete amete self-requested a review February 5, 2021 12:57
@amete (Collaborator) left a comment

Thanks a lot @rmaganza. I think this looks good to go in as a first version. There are a few things we can still improve, but given that this is an optional script that won't affect the core functionality, let's follow up on those later.

@amete amete merged commit 1858e7e into HSF:master Feb 5, 2021
@amete amete mentioned this pull request Feb 5, 2021