Support HSDS server with omas_h5.py function #313
Fantastic work Sunjae! The 3.9 test was failing, but it was just a fluke. I re-triggered the test and it passed, no problem. Could you please add the Also, for me to understand better, could you please give an example of how using
from omas import *
ods = ODS()
ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
ods.save_omas_h5("???", hsds=True)
ods1 = load_omas_h5("???", hsds=True)
How do you pass the HSDS server information? Thanks! |
Hello. First, please check the added parameters. For now, the HSDS server information should be set up before using omas.
The server info only needs to be provided correctly once, at the beginning, so I didn't add it to the omas function. Since HSDS does not currently offer speed optimizations such as multithreading, we plan to additionally use our own VEST module to provide dynamic and static data separately until the speed optimization is achieved. (That module contains the config function and quite a few other functions, so I didn't add it to omas.) HSDS multimanager Below is an example using the server we configured. (In the omas save/load functions the server info is optional, since it's already stored in the config file.)
from omas import *
ods = ODS()
ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
ods.save_omas_h5("http://127.0.0.0:5101/home/sample.h5", hsds=True) # same as ods.save_omas_h5("/home/sample.h5", hsds=True) and the server info is diff
ods1 = load_omas_h5("/home/sample.h5", hsds=True) # or ods1 = load_omas_h5("http://127.0.0.0:5101/home/sample.h5", hsds=True) |
I'll let others comment on this PR, but it does look good to me. @smithsp ? @torrinba ? Once OMAS saves the data in HDF5 hierarchical format, using HSDS seems like a great way to serve IMAS data!!! Based on your example this should work:
from omas import *
ods = ODS()
ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
save_omas_h5(ods, "http://127.0.0.0:5101/home/sample.h5")
import h5pyd as h5py
h5_file = h5py.File("http://127.0.0.0:5101/home/sample.h5", 'r')
ip_data = h5_file['equilibrium/time_slice/0/global_quantities/ip'].value
print(ip_data)
OMAS does not yet support dynamic loading (i.e. lazy loading) for h5 files, like it does for NetCDF, IMAS, and machine mappings. It should not be too difficult to add though. See how it's done for NetCDF here: https://github.com/gafusion/omas/blob/master/omas/omas_nc.py#L106-L146 Once that's done, you should be able to dynamically load from a (local or remote) HDF5 file only the data that you access. Something like this:
with ods.open("http://127.0.0.0:5101/home/sample.h5"):
    print(ods['equilibrium.time_slice.0.global_quantities.ip']) # after implementing `dynamic_omas_h5` this will only load the data that is accessed, not everything in the h5 file |
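To make the lazy-loading idea concrete, here is a minimal sketch of a mapping that reads a node only when it is first accessed. It is not the OMAS `dynamic_omas_h5` implementation (which does not exist yet according to the comment above); `LazyNodes`, `fetch`, `backing`, and `reads` are hypothetical names, and a plain dict stands in for the local or remote HDF5 file.

```python
from collections.abc import Mapping


class LazyNodes(Mapping):
    """Read-only mapping that fetches a value only on first access.

    `fetch` is a hypothetical callable standing in for a (possibly remote)
    HDF5 read; it is not part of OMAS, h5py, or h5pyd.
    """

    def __init__(self, paths, fetch):
        self._paths = list(paths)
        self._fetch = fetch
        self._cache = {}

    def __getitem__(self, key):
        # OMAS-style keys use dots; HDF5 paths use slashes.
        path = key.replace(".", "/")
        if path not in self._paths:
            raise KeyError(key)
        if path not in self._cache:
            self._cache[path] = self._fetch(path)
        return self._cache[path]

    def __iter__(self):
        return iter(self._paths)

    def __len__(self):
        return len(self._paths)


# Stand-in for a remote file: a dict plus a log of which paths were read.
backing = {
    "equilibrium/time_slice/0/global_quantities/ip": 6,
    "equilibrium/time_slice/0/global_quantities/beta_pol": 0.5,
}
reads = []


def fetch(path):
    reads.append(path)
    return backing[path]


nodes = LazyNodes(backing, fetch)
ip = nodes["equilibrium.time_slice.0.global_quantities.ip"]
```

Only the accessed path ends up in `reads`; the second entry is never fetched, and a repeated access is served from the cache.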
Yes, I confirmed that it works well by just changing the IP address in the example to our address. Just to emphasize, the example will work only if the username and password have been set in advance using hsconfigure. Once that’s done, you can use h5pyd exactly the same way as h5py. I checked the link you provided and it seems straightforward to implement. I will request improvements related to loading and saving speeds in h5pyd, and since they are also working on speed improvements like MultiManager, I will request modifications at that time. Thanks |
@satelite2517 can you please comment on the performance of If you find that it is slow, are you sure that's not an OMAS problem? Substituting With this in mind, I would suggest that you first try to benchmark the HSDS performance directly using
import h5pyd
h5_file = h5pyd.File("http://127.0.0.0:5101/home/sample.h5", 'r')
datasets = []
def visitor_func(name, obj):
    # collect the path of every dataset in the file
    if isinstance(obj, h5pyd.Dataset):
        datasets.append(name)
h5_file.visititems(visitor_func)
for dataset in datasets:
    h5_file[dataset].value  # read each dataset in full
h5_file.close()
I also found this post that goes into details about Local HSDS performance vs local HDF5 files |
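One way to keep such comparisons fair is to time both the raw `h5pyd` read and the OMAS load with the same small harness. The sketch below assumes nothing about HSDS: `time_call`, `read_all`, and `fake_store` are hypothetical names, and an in-memory dict stands in for the file so the snippet runs anywhere.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return (best_seconds, last_result) over `repeats` calls of func."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def read_all(store):
    """Stand-in for 'visit every dataset and read it'."""
    return [store[name] for name in store]


# In-memory stand-in for an HDF5 file with 50 datasets.
fake_store = {f"dataset_{i}": list(range(100)) for i in range(50)}
seconds, values = time_call(read_all, fake_store)
print(f"read {len(values)} datasets in {seconds:.6f} s")
```

In a real benchmark, `read_all` would be replaced by the visitor loop above or by a call such as `load_omas_h5`, so that both paths are measured identically.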
As you noticed from the link you sent, using h5pyd to access data takes more than 20 times longer than using h5py. I tested uploading the same file using h5py in omas_h5, which took about 20 seconds, whereas using h5pyd took around 50 minutes. Thus, I concluded that it is impossible to include static data in HSDS. Before proceeding with this pull request, I tried to optimize the omas function to increase speed. Here are the attempts I made:
Thank you for informing me about the
import h5pyd
h5_file = h5pyd.File("/public/dynamic_test.h5", 'r')
datasets = []
def visitor_func(name, obj):
    print(name)
    # collect the path of every dataset in the file
    if isinstance(obj, h5pyd.Dataset):
        datasets.append(name)
h5_file.visititems(visitor_func)
for dataset in datasets:
    h5_file[dataset].value  # read each dataset in full
h5_file.close()
datasets
I tested this function just in case, and it took an average of 2 minutes and 37 seconds. On the other hand:
import omas
filename = '/public/dynamic_test.h5'
ods = omas.ODS()
ods = omas.load_omas_h5(filename)
This function took 2 minutes and 40 seconds, so there isn't a significant difference in speed. The options I could think of to improve speed were chunking or using Additionally, the biggest issue is the time taken to upload to HSDS. I have been in continuous communication with the HSDS developer, and I will share part of his response: John Readey It seems that the current HSDS system is not well prepared to handle this kind of dataset. Additionally, due to these issues, the HSDS developer asked me for a meeting to discuss system upgrades, potential solutions, and the best way to use ods with HSDS. I will update you with any progress after the meeting next Monday. |
@satelite2517 could you please provide me with a copy of the h5 file you are using for your benchmarks? If it's not crazy big it would be great if you simply upload it as part of this issue. |
I don't mind sending you the file, but GitHub issues do not support the h5 file format. Would you like me to send it to you another way? |
For the record, I moved the files that you sent here: You mentioned that using this file (I assume reading it?) took about 70 minutes on your computer, and with HSDS it took 108 minutes. I am surprised it's taking so long. When I run this:
from omas import *
from time import *
tic=time(); ods=load_omas_h5("/Users/meneghini/Downloads/39020_16.h5"); toc=time();
print(toc-tic)
I get the data back in 24.6 seconds. How are you running your tests? |
Did you load the file from your HSDS server? Loading the local file took about 20 seconds for me too. The 70 minutes I mentioned was the time to load the file from the HSDS server. |
Ok. I now understand. And what was the 108 minutes? |
I sent this same file to the HSDS developer (who uses another HSDS server), and he told me that it took about 108 minutes. I may have conveyed my message unclearly due to my limited English proficiency. Sorry |
This is no good :( but perhaps not surprising. Ideally one would want to be able to reduce the number of queries that are made to the server. For example, one could make one single query to request what data is available under a specific location in the data structure. With that meta-data in hand the client could then request all the data it needs in a single request. I bet this would speed up the data fetching enormously. By the way, reading the same file in Julia with IMASDD.jl takes less than 3 seconds. Perhaps you can ask the HSDS developer that you are in contact with if there's a way to do what I described above. 1. Retrieve metadata about the structure of the HDF5 file with one single query, and 2. request multiple nodes in the HDF5 file also with one single query. |
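The effect of the two-step approach on the number of round trips can be illustrated with a toy model. This is not the HSDS or h5pyd API: `CountingServer` and its `get`, `list_paths`, and `get_many` methods are hypothetical stand-ins that only count how many requests a client would make.

```python
class CountingServer:
    """Toy stand-in for a remote data service; counts round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.requests = 0

    def get(self, path):
        # one request per node
        self.requests += 1
        return self._data[path]

    def list_paths(self, prefix):
        # step 1: one request for the structure under a location
        self.requests += 1
        return [p for p in self._data if p.startswith(prefix)]

    def get_many(self, paths):
        # step 2: one request for all the nodes the client needs
        self.requests += 1
        return {p: self._data[p] for p in paths}


data = {
    f"equilibrium/time_slice/{i}/global_quantities/ip": float(i)
    for i in range(100)
}

# Node-by-node access: one round trip per dataset.
one_by_one = CountingServer(data)
for path in data:
    one_by_one.get(path)

# Metadata query followed by a single bulk request.
batched = CountingServer(data)
paths = batched.list_paths("equilibrium/")
values = batched.get_many(paths)

print(one_by_one.requests, batched.requests)
```

With 100 nodes the node-by-node client makes 100 requests and the batched client makes 2; whether HSDS can serve such bulk requests is exactly the question to raise with its developer.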
Okay. I understand what you said. I will request to see if improvements are possible. Thanks, and I will update the new issue. |
This PR has not seen any activity in the past 60 days. It is now marked as stale and will be closed in 7 days if no further activity is registered. |
dict2hdf5 function to accept an hsds parameter to switch between h5py and h5pyd.
save_omas_h5 to pass the hsds parameter to dict2hdf5.
convertDataset to pass the hsds parameter recursively.
load_omas_h5 to accept an hsds parameter and use h5pyd when hsds is True.
These changes enable the use of HSDS (Highly Scalable Data Service) for handling HDF5 files, which can directly connect the HSDS service and OMAS. (HSDS is an HDF5-based database system: https://github.com/HDFGroup/hsds)
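A minimal sketch of how a boolean flag can switch between the two libraries is shown below. It is not the code from this PR: `backend_name` and `open_h5` are hypothetical helpers, and the sketch relies only on the fact that `h5py.File` and `h5pyd.File` accept the same `(name, mode)` arguments.

```python
import importlib


def backend_name(hsds=False):
    """Name of the module to use: h5pyd for an HSDS server, h5py otherwise."""
    return "h5pyd" if hsds else "h5py"


def open_h5(filename, mode="r", hsds=False):
    """Open `filename` with the selected backend.

    The import is deferred so that h5pyd is only required when hsds=True.
    """
    module = importlib.import_module(backend_name(hsds))
    return module.File(filename, mode)
```

Deferring the import keeps h5pyd an optional dependency, so users who only work with local files are not forced to install or configure it.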