
Why do we repeat our backing up in R-Instat if nothing has changed in the log file? #9236

Open

rdstern opened this issue Nov 6, 2024 · 8 comments · May be fixed by #9283

Comments

@rdstern (Collaborator) commented Nov 6, 2024

@ChrisMarsh82, @N-thony and @Patowhiz I am writing the help and have got as far as the log file. Here is a bit of my latest draft:

[Screenshot from the draft help page, showing the log file with backups recorded every 10 minutes.]

I often have R-Instat open while I am doing other things. I notice, from the figure above, that it backs up every 10 minutes, even if I have not used it since the last backup. Couldn't it check that, and only back up if something has changed? I can imagine situations (with large datasets) where this could be very annoying.
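For illustration only, here is a minimal sketch in R of such a check: take a digest of the data book at each backup and skip the save when the digest has not changed. The names `data_book` and `backup_path` are placeholders, not R-Instat's actual code:

```r
# Minimal sketch (assumptions, not R-Instat's implementation): skip the timed
# backup when the data book is unchanged since the last one.
library(digest)

last_backup_digest <- NULL

backup_if_changed <- function(data_book, backup_path) {
  current_digest <- digest::digest(data_book)
  if (identical(current_digest, last_backup_digest)) {
    return(invisible(FALSE))  # nothing changed: write no file, add no log entry
  }
  saveRDS(data_book, backup_path)
  last_backup_digest <<- current_digest
  invisible(TRUE)
}
```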

@Patowhiz (Contributor) commented Nov 7, 2024

@rdstern I agree.

I'm surprised by that as well, but what surprises me even more is why the backup operations are written to the log file. I think this was not the case in previous versions. The backup operations should not be in the user's log file. Backing up should be silent, something a user never needs to be aware of until the application crashes or the user unintentionally kills it.

@rdstern (Collaborator, Author) commented Nov 7, 2024

@Patowhiz I'm happy with your support of my first point. The times of backing up have been included recently, and I'm afraid at my request! That's by @N-thony, after discussions with Stephen. They form part of our recent "ultimate undo system" for those who can cope, and I hope you'll like it.
That's also why I was so pleased to have Antoine's new "spreadsheet style" undo system in the latest versions. That's in contrast to the R-user type of undo, which uses the information in the log file.
Remember we back up the data every 10 minutes (by default). Now suppose we have a problem and need to go back a bit:
a) Use Tools > Backup to restore the data up to the last backup.
b) Use the log file, from the date/time of the last backup, for as much as you need to complete your data rescue.

Neat, eh, for those who can cope?

rdstern changed the title from "Why do we repeate our backing up in R-Instat if nothing has changed in the log file?" to "Why do we repeat our backing up in R-Instat if nothing has changed in the log file?" Nov 7, 2024
@rdstern (Collaborator, Author) commented Nov 27, 2024

@N-thony after my experience in the Bangladesh workshop I would like to take the backing up a stage further.
a) As above, it should only back up the data book if something has changed.
b) Could a record be kept, perhaps in the data frame metadata, of the last time each data frame was changed? Then, after the initial backup, a data sheet would only be backed up again if that sheet has changed (see the sketch at the end of this comment).

This is particularly important for our climatic data. The main data file is usually there. It is often used to get summaries, etc., without being changed. And soon we will be dealing with larger files from sub-daily data. Then often the daily data will be generated. It will still be useful to keep the primary data available, but it will rarely be changed and hence will rarely need backing up anew.
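For illustration, a minimal sketch in R of the kind of per-sheet check meant in b). The metadata field `last_changed` and the helper functions are placeholders, not R-Instat's actual API:

```r
# Illustrative sketch only: record a "last changed" timestamp per data frame
# (here in a plain metadata list; R-Instat's real metadata structure may differ)
# and back up just the sheets modified since the previous backup.
mark_changed <- function(metadata, data_name) {
  metadata[[data_name]]$last_changed <- Sys.time()
  metadata
}

backup_changed_sheets <- function(data_frames, metadata, last_backup_time, backup_dir) {
  for (data_name in names(data_frames)) {
    changed_at <- metadata[[data_name]]$last_changed
    if (!is.null(changed_at) && changed_at > last_backup_time) {
      saveRDS(data_frames[[data_name]],
              file.path(backup_dir, paste0(data_name, ".rds")))
    }
  }
}
```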

@Patowhiz (Contributor) commented:

@rdstern,

What has been the current backup experience with all the data frames (hourly, daily, summaries, etc.) open? I recall hearing that you were working with sub-daily data containing 7 million records, which I believe will become increasingly common in the next 5–10 years.

I'm also curious about how this operation performed on users' machines. Did they notice any performance degradation during the process?

I’m asking these questions with the understanding that R is single-threaded and might face input/output bottlenecks. The fact that no users have complained about the backup process is encouraging news to me and shows how personal computing has improved in developing countries.

@rdstern (Collaborator, Author) commented Nov 27, 2024

@Patowhiz good questions and I hope to have some partial answers soon. We didn't get far with these data in the course, except to note that the current backing-up process was becoming a bottleneck. Hence this issue, which @N-thony is working on, and which I hope will soon come to you to check.

@Patowhiz (Contributor) commented:

@rdstern, I’m looking forward to your answers and @jkmusyoka's experience as well.

It would also be helpful to check with @N-thony whether this bottleneck could be related to the recent undo feature. Memory issues can sometimes slow down an application. My suggestion would be to follow the same steps using a version of the application without the undo feature (or with undo switched off in the current version, if that's possible) and see if the same performance degradation occurs.

The reason I think this should be tested is that it might be simpler to optimise the undo functionality first, before addressing the optimisation of saving the data book contents.

@rdstern (Collaborator, Author) commented Nov 27, 2024

@Patowhiz thank you for the words of caution here.

There are two distinct levels to the tasks proposed here. The first is that we don't back up a new copy of the data book if nothing has changed. I suggest this is urgent, probably simple, and not related to your point above, so it could be done soon and merged.

Then the second task is more ambitious and does relate to the backing up of individual sheets, which is what is done in undo. Following your words of caution, let's leave that for now. I am happy that it is not in the December release, and that we come back to the backing up/undo area in 2025.

And for reference, the current undo has limits on the data frames it works on, and there is an option to turn it off completely. So undo is only possible on data sheets where backing up is not an issue, except that the data book may contain large data frames that are not possible to undo but will, of course, be backed up.

@Patowhiz (Contributor) commented:

@rdstern Thank you for the feedback! I've now understood your separate points.

Regarding point 1,

I'm not sure if addressing it will lead to any significant improvement in the performance bottleneck. Over 90% of R-Instat operations involve changes to the data book, and it's rare for a user to go 10 minutes without performing an operation that alters it (in my opinion). I think that's why the developer of the original implementation went for a simpler solution. By "changes to the data book," I mean modifications to the data objects (e.g., data frames) or output objects (e.g., additions or deletions).

If we define data book changes more narrowly as modifications to the data frames alone, there might be a very small performance improvement. However, I could be completely wrong and missing the point here. Looking forward to seeing the implementation!
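For what it's worth, a small sketch in R of the two definitions, assuming (purely for illustration) a data book held as a list with `data_frames` and `output_objects` components:

```r
# Hypothetical structure, for illustration only.
library(digest)

book_changed_broad <- function(data_book, last_digest) {
  # broad definition: any change to data frames or output objects
  !identical(digest::digest(data_book), last_digest)
}

book_changed_narrow <- function(data_book, last_frames_digest) {
  # narrow definition: only changes to the data frames themselves
  !identical(digest::digest(data_book$data_frames), last_frames_digest)
}
```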
