
Restart issue #63

Closed
srozario121 opened this issue Sep 24, 2018 · 6 comments


@srozario121

Hello,
I'm trying to start a simulation from the output of a previous simulation. I have successfully restarted a simulation from two previous checkpoints. However, on the third attempt I get a rather odd error:
```
READING fields and particles for restart
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-0 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-1 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-1 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-0 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-0 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-0 is not 1D but -1D
ERROR src/Checkpoint/Checkpoint.cpp:588 (restartPatch) Number of species differs between dump (0) and namelist (3)
ERROR src/Tools/H5.h:324 (getVect) Reading vector Position-0 is not 1D but -1D
ERROR src/Tools/H5.h:324 (getVect) Reading vector Momentum-0 is not 1D but -1D
```

I don't understand this error, as it is trying to read data that Smilei itself wrote. I've attached all the log and input deck files that I've used for this simulation.

log_1.txt
log_2.txt
log_3.txt
log_4.txt
SimulationFiles.zip

Thanks
Savio Rozario

@srozario121 srozario121 changed the title Restart dump issue Restart issue Sep 24, 2018
@jderouillat
Contributor

Dear Savio,
This behavior is surprising.
In my opinion, the first thing to do is to check the integrity of the checkpoint.
If you haven't already, can you confirm the result of the following command:

```
$ h5dump  -a /patch-000000/species   ../ClusterSim_2/checkpoints/dump-00000-0000000000.h5
```

According to your error, it should return:

```
... {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 0
   }
}
}
```

If that is the case, do you have an error file from the third simulation?

Regards.

Julien

@iclaserplasma

(screenshot attached)

It seems the checkpoint file itself might be corrupted. I've attached the log file from the final run to show the error it reports.
log5.txt

If ClusterSim2 does need to be rerun, can you recommend how I can avoid this error?
Thanks
Savio

@iltommi
Contributor

iltommi commented Sep 25, 2018

It looks like the h5dump command is not properly installed, so we still don't know whether the file is corrupted, or why it was corrupted.

Since the simulation wrote several checkpoints, a wise precaution is to keep more than one checkpoint on disk. You can achieve this with keep_n_dumps: https://smileipic.github.io/Smilei/namelist.html#keep_n_dumps

Set it to 2, and even if the latest checkpoint is corrupted you will still have the previous one.
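For reference, a minimal `Checkpoints` block using this option might look like the sketch below. This is not taken from the attached namelists; the `dump_step` and `restart_dir` values are placeholders for illustration (see the namelist documentation linked above for the full parameter list):

```python
# Sketch of a Smilei namelist Checkpoints block (namelists are Python).
# dump_step and restart_dir are placeholder values, not from this issue.
Checkpoints(
    restart_dir  = "../previous_run",  # directory of the run to restart from
    dump_step    = 10000,              # write a checkpoint every 10000 iterations
    keep_n_dumps = 2,                  # keep the two most recent checkpoints on disk
)
```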

@srozario121
Author

Ah, I forgot to load some of the MPI modules earlier. Here are the results from h5dump:
```
[svr11@cx2-login checkpoints]$ h5dump -a /patch-000000/species dump-00000-0000000000.h5
HDF5 "dump-00000-0000000000.h5" {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 3
   }
}
}
```

```
[svr11@cx2-login checkpoints]$ h5stat dump-00000-00000000*.h5
Filename: dump-00000-0000000000.h5
File information
# of unique groups: 48501
# of unique datasets: 336547
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 12126
File space information for file metadata (in bytes):
Superblock: 96
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 9013816/0
Datasets(exclude compact data): 91540784/43684480
Datatypes: 0/0
Groups:
B-tree/List: 51970648
Heap: 9471152
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Free-space managers:
Header: 0
Amount of free space: 0
Small groups (with 0 to 9 links):
# of groups with 0 link(s): 11106
# of groups with 9 link(s): 25269
Total # of small groups: 36375
Group bins:
# of groups with 0 link: 11106
# of groups with 1 - 9 links: 25269
# of groups with 10 - 99 links: 12125
# of groups with 10000 - 99999 links: 1
Total # of groups: 48501
Dataset dimension information:
Max. rank of datasets: 1
Dataset ranks:
# of dataset with rank 1: 336547
1-D Dataset information:
Max. dimension size of 1-D datasets: 57353
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 4: 7
# of datasets with dimension sizes 7: 14
# of datasets with dimension sizes 9: 21
Total # of small datasets: 42
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 42
# of datasets with dimension size 10 - 99: 52716
# of datasets with dimension size 100 - 999: 146748
# of datasets with dimension size 1000 - 9999: 131273
# of datasets with dimension size 10000 - 99999: 5768
Total # of datasets: 336547
Dataset storage information:
Total raw data size: 2459043134
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 336547
Dataset layout counts[CHUNKED]: 0
Dataset layout counts[VIRTUAL]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 336547
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 3
Dataset datatype #0:
Count (total/named) = (260739/0)
Size (desc./elmt) = (22/8)
Dataset datatype #1:
Count (total/named) = (25269/0)
Size (desc./elmt) = (14/2)
Dataset datatype #2:
Count (total/named) = (50539/0)
Size (desc./elmt) = (14/4)
Total dataset datatype count: 336547
Small # of attributes (objects with 1 to 10 attributes):
# of objects with 1 attributes: 12125
# of objects with 2 attributes: 36375
Total # of objects with small # of attributes: 48500
Attribute bins:
# of objects with 1 - 9 attributes: 48500
# of objects with 10 - 99 attributes: 1
Total # of objects with attributes: 48501
Max. # of attributes to objects: 12
Free-space persist: FALSE
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
Total # of small size sections: 0
Free-space section bins:
Total # of sections: 0
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 161996496 bytes
Raw data: 2459043134 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 1747760 bytes
Total space: 2622787390 bytes
```

I'll try keeping two dump files; hopefully at least one will work.

@mccoys
Contributor

mccoys commented Oct 4, 2018

One note about this issue. If you terminate your job too soon after the time of the checkpoint, the writing of data to the checkpoint files may be interrupted, producing corrupt files. To avoid this, you should give the simulation at least 5 minutes to complete the checkpoint. In some cases, even 5 minutes is not sufficient.
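If wall-time limits are the concern, one way to avoid a truncated dump (assuming the documented `dump_minutes` and `exit_after_dump` parameters of Smilei's `Checkpoints` block) is to write the checkpoint well before the scheduler kills the job and then stop cleanly. A sketch, with a placeholder timing for a 4-hour job:

```python
# Sketch: dump a checkpoint after 230 minutes of wall time and exit
# cleanly, so the write is never cut off by the scheduler.
# The dump_minutes value is a placeholder, not from this issue.
Checkpoints(
    dump_minutes    = 230.,   # checkpoint after 230 minutes of run time
    exit_after_dump = True,   # stop the simulation once the dump completes
    keep_n_dumps    = 2,      # keep the two most recent checkpoints
)
```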

@iclaserplasma

The issue was in fact that I had run out of disk space, so the checkpoint file couldn't finish saving! I think it is running fine now.
Thanks!
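Since the root cause turned out to be a full disk, a cheap pre-flight check before (re)launching can catch this early. Below is a sketch using only the Python standard library; the function name and the rule of thumb (free space at least the size of the last checkpoint times the number of dumps kept) are my own, not part of Smilei:

```python
import shutil

def enough_space_for_checkpoints(checkpoint_dir, last_dump_bytes, keep_n_dumps=2):
    """Return True if the filesystem holding checkpoint_dir has room for
    keep_n_dumps checkpoints the size of the last one (a rough heuristic)."""
    free = shutil.disk_usage(checkpoint_dir).free
    return free >= last_dump_bytes * keep_n_dumps

# Example: the h5stat output above reported ~2.6 GB for one checkpoint file.
if not enough_space_for_checkpoints(".", 2_622_787_390, keep_n_dumps=2):
    print("WARNING: not enough free space for the next checkpoints")
```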
