New AFQMC test failures #5235

Closed
prckent opened this issue Nov 20, 2024 · 30 comments

@prckent
Contributor

prckent commented Nov 20, 2024

Describe the bug

A few new AFQMC test failures appeared in the nightlies after PRs from the last couple of days, presumably #5234 or #5228. Affected configurations include GCC14-Complex-Mixed-Release. Some failures look like changes in error-tolerance handling (e.g. https://cdash.qmcpack.org/tests/8777436), but others are more serious (e.g. https://cdash.qmcpack.org/tests/8777462).

To Reproduce

Nightly complex builds, develop.

Expected behavior

Tests pass

System:

sulfur

@prckent prckent added the bug label Nov 20, 2024
@prckent
Contributor Author

prckent commented Nov 20, 2024

@correaa , @ye-luo : Ideas for likely cause?

@correaa
Contributor

correaa commented Nov 20, 2024

The changes shouldn't have introduced any semantic change (they should compile to the same machine code).

My changes in QMCPACK should also work with the previous version of Multi, so the first thing I would try is reverting Multi.

What is different in these tests with respect to the CI?

@prckent
Contributor Author

prckent commented Nov 20, 2024

> What is different in these tests with respect to the CI?

The only differences are supposed to be compiler and library versions, since we test more combinations in the nightlies. However (I can't look fully now), I couldn't quickly spot an AFQMC-enabled build in the current CI tests (!), implying we lost an important configuration in the last round of CI updates. Do you see one?

@correaa
Contributor

correaa commented Nov 20, 2024

I don't know; I only edited .cpp files. How is a configuration "lost"?

(BTW, the changes I made should compile to the same code.)

@prckent
Contributor Author

prckent commented Nov 20, 2024

Lost: probably missing for months, unless I overlooked the AFQMC results.

@correaa
Contributor

correaa commented Nov 20, 2024

If these tests were lost for months, perhaps it was broken before #5228.

I am not sure what #5234 was about.
It seems that it removed std:: from std::real and std::imag, loosely analogous to removing std:: from std::get in the other PR.
But the reason for removing std:: from std::get in Multi-related code was not just simplification: Multi cannot customize std::get for the custom tuples it uses for sizes.
In other words, it was not just cosmetic; I didn't have a choice.

(Multi had to reimplement tuples rather than simply use std::tuple so that they work on the GPU, i.e. compile with CUDA for the GPU.)
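For context, a minimal sketch of why generic code calls `get` unqualified. A library-defined tuple cannot add overloads of `get` to namespace `std`, so the library provides its own `get` in its own namespace, where argument-dependent lookup (ADL) finds it; a `using std::get;` declaration keeps `std::tuple` working through the same call. `mylib::tiny_tuple` and `first_of` are hypothetical names for illustration, not Multi's actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for a library-defined tuple (NOT Multi's real type).
namespace mylib {
template <class T0, class T1>
struct tiny_tuple {
  T0 first;
  T1 second;
};

// The library's own get, found by ADL because tiny_tuple lives in mylib.
template <std::size_t I, class T0, class T1>
constexpr auto get(tiny_tuple<T0, T1> const& t) {
  if constexpr (I == 0) {
    return t.first;
  } else {
    return t.second;
  }
}
}  // namespace mylib

// Generic code that works for both std::tuple and mylib::tiny_tuple.
template <class Tuple>
auto first_of(Tuple const& t) {
  using std::get;    // makes std::get visible for std::tuple arguments
  return get<0>(t);  // unqualified call: ADL also finds mylib::get
}
```

Calling `first_of(std::tuple<int, double>{3, 4.5})` picks `std::get`, while `first_of(mylib::tiny_tuple<int, double>{7, 8.5})` picks `mylib::get`. Writing `std::get<0>(t)` inside `first_of` would reject the second call.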

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

Manual checking shows the tests started failing with #5228:
2f8ef74 bad
3bc3c85 good

@correaa
Contributor

correaa commented Nov 20, 2024

thank you for trying, can you try reverting the version of Multi by itself?

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

> thank you for trying, can you try reverting the version of Multi by itself?

The code does not compile if only the boost_multi part is reverted.

@correaa
Contributor

correaa commented Nov 20, 2024

That is strange, and a red flag too.

OK, feel free to revert all of #5228 (including the Multi update), or as much as needed for it to compile.
(I don't know how to do that and I don't want to mess up the repository.)

I can try again later with the necessary changes; the changes I have to make in QMCPACK should be mechanical.

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

I changed the code as follows:

diff --git a/src/AFQMC/Utilities/kp_utilities.hpp b/src/AFQMC/Utilities/kp_utilities.hpp
index 7d73b35149..13237d9614 100644
--- a/src/AFQMC/Utilities/kp_utilities.hpp
+++ b/src/AFQMC/Utilities/kp_utilities.hpp
@@ -70,7 +73,16 @@ bool get_nocc_per_kp(Vector const& nmo_per_kp, CSR const& PsiT, Array&& nocc_per
       }
     }
     ++nocc_per_kp[Q];
   }
+
+  {
+    for(int i=0; i < nocc_per_kp.size(); i++)
+      std::cout << nocc_per_kp[i] << std::endl;
+    int nocca_tot = std::accumulate(nocc_per_kp.begin(), nocc_per_kp.begin() + nkpts, 0);
+    std::cout << "nocc_per_kp.size " << nocc_per_kp.size() << " get_nocc_per_kp nocca_tot " << nocca_tot << std::endl;
+  }
+
   return true;
 }

Then, in the build directory:

$ cd ~/opt/qmcpack/build_gnu_cplx_MP/src/AFQMC/Propagators/tests
$ make -j32
$ ctest --output-on-failure -R deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf
...
1
1
1
1
1
1
1
1
nocc_per_kp.size 8 get_nocc_per_kp nocca_tot 0

The expected printout of nocca_tot is 8.
So std::accumulate no longer works as expected. It seems to be caused by internal changes in boost_multi.
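When an algorithm like `std::accumulate` disagrees with a hand-written loop, computing the same quantity both ways separates an iterator problem from a data problem. A minimal sketch of such a cross-check, assuming only that the container supports `begin()`, `iterator + n` and `operator[]` (as both `std::vector` and the `nocc_per_kp` subarray in the printout above do); the helper name is hypothetical and not part of QMCPACK.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>  // std::vector is used in the usage example below

// Hypothetical diagnostic helper: sums the first n elements twice, once through
// the iterator range [begin, begin + n) and once through operator[], so a bug
// in iterator arithmetic or comparison shows up as a mismatch.
template <class Container>
bool iterator_sum_matches_index_sum(Container const& c, std::ptrdiff_t n) {
  int const by_iterators = std::accumulate(c.begin(), c.begin() + n, 0);
  int by_index = 0;
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    by_index += c[i];
  }
  return by_iterators == by_index;
}
```

For a healthy container such as `std::vector<int>(8, 1)` the check returns true. In the failing build above, the index loop printed eight 1s (sum 8) while `std::accumulate` returned 0, so this check would return false and point at the iterators rather than the stored values.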

@correaa
Contributor

correaa commented Nov 20, 2024

Shouldn't the loop go from 0 to nkpts (instead of nocc_per_kp.size()) for this to actually test accumulate?

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

Let me check.

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

nkpts is also 8.

int nocca_tot = std::accumulate(nocc_per_kp.begin(), nocc_per_kp.begin() + nocc_per_kp.size(), 0);

makes no difference.

@correaa
Contributor

correaa commented Nov 20, 2024

Great, thanks. Let me see what happens with accumulate (I am adding a test). Can you check that nocc_per_kp contains integers (and not, for example, doubles)?

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

Yes, integers. I did an additional check:

    std::cout << "operator check (==) " << (nocc_per_kp.begin() == nocc_per_kp.begin() + nkpts) << ", (!=) " << (nocc_per_kp.begin() != nocc_per_kp.begin() + nkpts) << std::endl;

I got:

operator check (==) 0, (!=) 0

Both comparisons are false, which is clearly wrong.
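This result is consistent with the earlier printout of `nocca_tot` being 0: a typical `std::accumulate` loops while `first != last`, so an `operator!=` that wrongly returns false ends the loop before any element is added, leaving the initial value. A minimal sketch of the invariant being violated, with a toy iterator that reproduces the symptom; both `equality_is_consistent` and `broken_iterator` are hypothetical and unrelated to Multi's implementation.

```cpp
#include <cassert>
#include <vector>  // std::vector is used in the usage example below

// Invariant check (hypothetical helper): for any well-behaved iterator pair,
// exactly one of (a == b) and (a != b) is true.
template <class It>
bool equality_is_consistent(It const& a, It const& b) {
  return (a == b) != (a != b);
}

// Toy iterator reproducing the reported symptom: both operators return false.
// Purely illustrative.
struct broken_iterator {
  int* ptr;
  friend bool operator==(broken_iterator, broken_iterator) { return false; }
  friend bool operator!=(broken_iterator, broken_iterator) { return false; }
};
```

For `std::vector<int>` iterators the check holds for any pair, for example `v.begin()` against `v.begin() + 8`; for `broken_iterator` it fails, mirroring the `(==) 0, (!=) 0` output above.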

@correaa
Contributor

correaa commented Nov 20, 2024

Hmm, this is very strange. The only thing I can imagine is that the internal pointer in .begin() is (erroneously) nullptr and the arithmetic doesn't work.

I did a bunch of paranoid tests of iterator arithmetic and it seems to work. If you can capture the exact type of nocc_per_kp.begin(), it may give a clue.

	{
		int const count = 8;

		multi::array<int, 1> arr({24}, 1);
		BOOST_TEST( arr.size() == 24 );

		BOOST_TEST(  arr.begin() == arr.begin() );
		BOOST_TEST(!(arr.begin() != arr.begin()));

		BOOST_TEST(  arr.begin() != arr.end() );
		BOOST_TEST(!(arr.begin() == arr.end()));

		BOOST_TEST(  arr.begin() != arr.begin() + count );
		BOOST_TEST(!(arr.begin() == arr.begin() + count));

		BOOST_TEST( std::accumulate(arr.begin(), arr.begin() + count, 0) == count );

		auto const& arr_strided = arr.strided(3);

		BOOST_TEST(  arr_strided.begin() == arr_strided.begin() );
		BOOST_TEST(!(arr_strided.begin() != arr_strided.begin()));

		BOOST_TEST(  arr_strided.begin() != arr_strided.end() );
		BOOST_TEST(!(arr_strided.begin() == arr_strided.end()));

		BOOST_TEST(  arr_strided.begin() != arr_strided.begin() + count );
		BOOST_TEST(!(arr_strided.begin() == arr_strided.begin() + count));

		BOOST_TEST( std::accumulate(arr_strided.begin(), arr_strided.begin() + count, 0) == count );
	}
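The strided half of the test above exercises iterators that advance several elements per step. A toy model of such an iterator may help when reading it: a pointer plus a stride, where `it + n` moves `n * stride` elements and equality compares positions. `strided_iterator` and `strided_sum` are hypothetical sketches, not Multi's `array_iterator`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Toy strided iterator (hypothetical; NOT boost::multi::array_iterator).
// It walks a raw buffer `stride` elements at a time.
struct strided_iterator {
  using iterator_category = std::input_iterator_tag;
  using value_type = int;
  using difference_type = std::ptrdiff_t;
  using pointer = int const*;
  using reference = int const&;

  int const* ptr;
  std::ptrdiff_t stride;

  reference operator*() const { return *ptr; }
  strided_iterator& operator++() { ptr += stride; return *this; }
  strided_iterator operator++(int) { strided_iterator old = *this; ++*this; return old; }

  // it + n advances n logical elements, i.e. n * stride raw elements.
  friend strided_iterator operator+(strided_iterator it, difference_type n) {
    it.ptr += n * it.stride;
    return it;
  }
  // Equality compares positions. If operator!= wrongly returned false,
  // a typical std::accumulate loop would stop at once and return its init.
  friend bool operator==(strided_iterator a, strided_iterator b) { return a.ptr == b.ptr; }
  friend bool operator!=(strided_iterator a, strided_iterator b) { return !(a == b); }
};

// Sums `count` elements of v, taking every `stride`-th one.
// Requires (count - 1) * stride < v.size() so all reads stay in bounds.
inline int strided_sum(std::vector<int> const& v, std::ptrdiff_t stride, std::ptrdiff_t count) {
  strided_iterator const first{v.data(), stride};
  return std::accumulate(first, first + count, 0);
}
```

Mirroring `arr.strided(3)` in the test, `strided_sum(std::vector<int>(24, 1), 3, 8)` sums eight ones and yields 8; the end iterator lands exactly one past the buffer, which is a valid pointer to form.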

@ye-luo
Contributor

ye-luo commented Nov 20, 2024

nocc_per_kp is of type boost::multi::subarray<int, 1, shm::shm_ptr_with_raw_ptr_dispatch<int>, boost::multi::layout_t<1> >

nocc_per_kp.begin() is of type ‘boost::multi::const_subarray<int, 1, shm::shm_ptr_with_raw_ptr_dispatch<int>, boost::multi::layout_t<1> >::iterator’ {aka ‘boost::multi::array_iterator<int, 1, shm::shm_ptr_with_raw_ptr_dispatch<int>, false, false>’}

@correaa
Contributor

correaa commented Nov 21, 2024

what is exactly missing in this configuration to reproduce the error?
https://gitlab.com/correaa/boost-multi/-/jobs/8430051596#L2501

@ye-luo
Contributor

ye-luo commented Nov 21, 2024

> what is exactly missing in this configuration to reproduce the error? https://gitlab.com/correaa/boost-multi/-/jobs/8430051596#L2501

It needs -DQMC_COMPLEX=ON.

@correaa
Contributor

correaa commented Nov 21, 2024

Do you know why this was not caught by the QMCPACK CI?

@ye-luo
Contributor

ye-luo commented Nov 21, 2024

AFQMC is only enabled in the CUDA CI builds. We should probably turn it on in a few CPU builds.

@correaa
Contributor

correaa commented Nov 21, 2024

I run my own CI of QMCPACK with CPU and GPU builds, but unfortunately not with QMC_COMPLEX=1.

@correaa
Contributor

correaa commented Nov 21, 2024

It fails with Clang on Mac too:

OMPI_MCA_btl=^tcp ctest --output-on-failure -R deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf
Test project /Users/correatedesco1/qmcpack/build
    Start 59: deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf
1/1 Test #59: deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf ...***Failed    0.24 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.
Assertion failed: (Gw.num_elements() == nwalk * (nocca_tot + noccb_tot) * npol * nmo_tot), function vbias, file KP3IndexFactorization.hpp, line 1371.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_prop_factory is a Catch v2.13.10 host application.
Run with -? for options

-------------------------------------------------------------------------------
propg_fac_shared
-------------------------------------------------------------------------------
/Users/correatedesco1/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:418
...............................................................................

/Users/correatedesco1/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:418: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases: 1 | 1 failed
assertions: 8 | 7 passed | 1 failed

[caladan:64855] *** Process received signal ***
[caladan:64855] Signal: Abort trap: 6 (6)
[caladan:64855] Signal code:  (0)
[caladan:64855] [ 0] 0   libsystem_platform.dylib            0x000000018c432584 _sigtramp + 56
[caladan:64855] [ 1] 0   libsystem_pthread.dylib             0x000000018c401c20 pthread_kill + 288
[caladan:64855] [ 2] 0   libsystem_c.dylib                   0x000000018c30ea30 abort + 180
[caladan:64855] [ 3] 0   libsystem_c.dylib                   0x000000018c30dd20 err + 0
[caladan:64855] [ 4] 0   test_afqmc_prop_factory             0x0000000104eb7690 _ZN11qmcplusplus5afqmc21KP3IndexFactorization5vbiasIN5boost5multi9array_refINSt3__17complexIdEELl2EPKS8_EERNS5_IS8_Ll2EPS8_EEvvEEvRKT_OT0_ddi + 2836
[caladan:64855] [ 5] 0   test_afqmc_prop_factory             0x0000000104eb6b6c _ZN11qmcplusplus5afqmc21KP3IndexFactorization5vbiasIN5boost5multi9array_refINSt3__17complexIdEELl1EPS8_EERNS4_5arrayIS8_Ll1ENS6_9allocatorIS8_EEEEvvvEEvRKT_OT0_ddi + 276
[caladan:64855] [ 6] 0   test_afqmc_prop_factory             0x0000000104eb6a4c _ZZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRN5boost5multi9array_refINSt3__17complexIdEELl1EPS8_EERNS4_5arrayIS8_Ll1ENS6_9allocatorIS8_EEEEEEEvDpOT_ENKUlOT_E_clIRNS0_21KP3IndexFactorizationEEEDaSL_ + 52
[caladan:64855] [ 7] 0   test_afqmc_prop_factory             0x0000000104eb6a0c _ZNK5boost6detail7variant15result_wrapper1IZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRNS_5multi9array_refINSt3__17complexIdEELl1EPSB_EERNS7_5arrayISB_Ll1ENS9_9allocatorISB_EEEEEEEvDpOT_EUlOT_E_RS5_EclIRNS4_21KP3IndexFactorizationEEEvSO_ + 36
[caladan:64855] [ 8] 0   test_afqmc_prop_factory             0x0000000104eb69dc _ZN5boost6detail7variant14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRNS_5multi9array_refINSt3__17complexIdEELl1EPSC_EERNS8_5arrayISC_Ll1ENSA_9allocatorISC_EEEEEEEvDpOT_EUlOT_E_RS6_EELb0EE14internal_visitIRNS5_21KP3IndexFactorizationEEENS_12disable_if_cIXaaLb0Esr7is_sameISO_SO_EE5valueEvE4typeESP_i + 40
[caladan:64855] [ 9] 0   test_afqmc_prop_factory             0x0000000104eb69a8 _ZN5boost6detail7variant27visitation_impl_invoke_implINS1_14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRNS_5multi9array_refINSt3__17complexIdEELl1EPSD_EERNS9_5arrayISD_Ll1ENSB_9allocatorISD_EEEEEEEvDpOT_EUlOT_E_RS7_EELb0EEEPvNS6_21KP3IndexFactorizationEEENSP_11result_typeEiRSP_T0_PT1_N4mpl_5bool_ILb1EEE + 60
[caladan:64855] [10] 0   test_afqmc_prop_factory             0x0000000104eb44ac _ZN5boost6detail7variant22visitation_impl_invokeINS1_14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRNS_5multi9array_refINSt3__17complexIdEELl1EPSD_EERNS9_5arrayISD_Ll1ENSB_9allocatorISD_EEEEEEEvDpOT_EUlOT_E_RS7_EELb0EEEPvNS6_21KP3IndexFactorizationENS_7variantINS6_5dummy10dummy_HOpsEJNS6_6THCOpsENS6_12SparseTensorISD_SD_EESW_NS6_29KP3IndexFactorization_batchedINSH_INSC_IfEELl2ENSI_IS14_EEEEEENS13_INSH_IS14_Ll2EN3shm39allocator_shm_ptr_with_raw_ptr_dispatchIS14_EEEEEEEE18has_fallback_type_EEENSP_11result_typeEiRSP_T0_PT1_T2_i + 52
[caladan:64855] [11] 0   test_afqmc_prop_factory             0x0000000104eb40dc _ZNR5boost7variantIN11qmcplusplus5afqmc5dummy10dummy_HOpsEJNS2_6THCOpsENS2_12SparseTensorINSt3__17complexIdEES9_EENS2_21KP3IndexFactorizationENS2_29KP3IndexFactorization_batchedINS_5multi5arrayINS8_IfEELl2ENS7_9allocatorISF_EEEEEENSC_INSE_ISF_Ll2EN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISF_EEEEEEEE13apply_visitorINS_6detail7variant15result_wrapper1IZNS2_21HamiltonianOperations5vbiasIJRNSD_9array_refIS9_Ll1EPS9_EERNSE_IS9_Ll1ENSG_IS9_EEEEEEEvDpOT_EUlOT_E_RSU_EEEENS16_11result_typeERS16_ + 332
[caladan:64855] [12] 0   test_afqmc_prop_factory             0x0000000104eb3f50 _ZN5boost13apply_visitorIZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRNS_5multi9array_refINSt3__17complexIdEELl1EPS9_EERNS5_5arrayIS9_Ll1ENS7_9allocatorIS9_EEEEEEEvDpOT_EUlOT_E_RS3_EEDcSM_OT0_NS_10disable_ifINS_6detail7variant15has_result_typeISL_EEbE4typeE + 56
[caladan:64855] [13] 0   test_afqmc_prop_factory             0x0000000104eae938 _ZN11qmcplusplus5afqmc21HamiltonianOperations5vbiasIJRN5boost5multi9array_refINSt3__17complexIdEELl1EPS8_EERNS4_5arrayIS8_Ll1ENS6_9allocatorIS8_EEEEEEEvDpOT_ + 60
[caladan:64855] [14] 0   test_afqmc_prop_factory             0x0000000104eacc4c _ZN11qmcplusplus5afqmc5NOMSDIN2ma6sparse10csr_matrixINSt3__17complexIdEEiiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchIS7_EENS3_7is_rootENS9_IiEESC_EEE3vMFIRN5boost5multi5arrayIS7_Ll1ENS5_9allocatorIS7_EEEEEEvOT_ + 1136
[caladan:64855] [15] 0   test_afqmc_prop_factory             0x0000000104eac7d0 _ZZN11qmcplusplus5afqmc12Wavefunction3vMFIJRN5boost5multi5arrayINSt3__17complexIdEELl1ENS6_9allocatorIS8_EEEEEEEvDpOT_ENKUlOT_E_clIRNS0_5NOMSDIN2ma6sparse10csr_matrixIS8_iiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchIS8_EENSM_7is_rootENSP_IiEESS_EEEEEEDaSH_ + 36
[caladan:64855] [16] 0   test_afqmc_prop_factory             0x0000000104eac7a0 _ZNK5boost6detail7variant15result_wrapper1IZN11qmcplusplus5afqmc12Wavefunction3vMFIJRNS_5multi5arrayINSt3__17complexIdEELl1ENS9_9allocatorISB_EEEEEEEvDpOT_EUlOT_E_RS5_EclIRNS4_5NOMSDIN2ma6sparse10csr_matrixISB_iiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISB_EENSR_7is_rootENSU_IiEESX_EEEEEEvSK_ + 36
[caladan:64855] [17] 0   test_afqmc_prop_factory             0x0000000104eac72c _ZN5boost6detail7variant14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc12Wavefunction3vMFIJRNS_5multi5arrayINSt3__17complexIdEELl1ENSA_9allocatorISC_EEEEEEEvDpOT_EUlOT_E_RS6_EELb0EE14internal_visitIRNS5_5NOMSDIN2ma6sparse10csr_matrixISC_iiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISC_EENST_7is_rootENSW_IiEESZ_EEEEEENS_12disable_if_cIXaaLb0Esr7is_sameISK_SK_EE5valueEvE4typeESL_i + 40
[caladan:64855] [18] 0   test_afqmc_prop_factory             0x0000000104eac6d0 _ZN5boost6detail7variant27visitation_impl_invoke_implINS1_14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc12Wavefunction3vMFIJRNS_5multi5arrayINSt3__17complexIdEELl1ENSB_9allocatorISD_EEEEEEEvDpOT_EUlOT_E_RS7_EELb0EEEPvNS6_5NOMSDIN2ma6sparse10csr_matrixISD_iiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISD_EENSU_7is_rootENSX_IiEES10_EEEEEENSL_11result_typeEiRSL_T0_PT1_N4mpl_5bool_ILb0EEE + 80
[caladan:64855] [19] 0   test_afqmc_prop_factory             0x0000000104eac3e0 _ZN5boost6detail7variant22visitation_impl_invokeINS1_14invoke_visitorINS1_15result_wrapper1IZN11qmcplusplus5afqmc12Wavefunction3vMFIJRNS_5multi5arrayINSt3__17complexIdEELl1ENSB_9allocatorISD_EEEEEEEvDpOT_EUlOT_E_RS7_EELb0EEEPvNS6_5NOMSDIN2ma6sparse10csr_matrixISD_iiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISD_EENSU_7is_rootENSX_IiEES10_EEEENS_7variantINS6_5dummy18dummy_wavefunctionEJS12_NSS_INSA_ISD_Ll2ESY_EEEENS6_5PHMSDEEE18has_fallback_type_EEENSL_11result_typeEiRSL_T0_PT1_T2_i + 52
[caladan:64855] [20] 0   test_afqmc_prop_factory             0x0000000104eac058 _ZNR5boost7variantIN11qmcplusplus5afqmc5dummy18dummy_wavefunctionEJNS2_5NOMSDIN2ma6sparse10csr_matrixINSt3__17complexIdEEiiN3shm39allocator_shm_ptr_with_raw_ptr_dispatchISB_EENS7_7is_rootENSD_IiEESG_EEEENS5_INS_5multi5arrayISB_Ll2ESE_EEEENS2_5PHMSDEEE13apply_visitorINS_6detail7variant15result_wrapper1IZNS2_12Wavefunction3vMFIJRNSK_ISB_Ll1ENS9_9allocatorISB_EEEEEEEvDpOT_EUlOT_E_RST_EEEENS12_11result_typeERS12_ + 276
[caladan:64855] [21] 0   test_afqmc_prop_factory             0x0000000104eabf04 _ZN5boost13apply_visitorIZN11qmcplusplus5afqmc12Wavefunction3vMFIJRNS_5multi5arrayINSt3__17complexIdEELl1ENS7_9allocatorIS9_EEEEEEEvDpOT_EUlOT_E_RS3_EEDcSI_OT0_NS_10disable_ifINS_6detail7variant15has_result_typeISH_EEbE4typeE + 56
[caladan:64855] [22] 0   test_afqmc_prop_factory             0x0000000104ea632c _ZN11qmcplusplus5afqmc12Wavefunction3vMFIJRN5boost5multi5arrayINSt3__17complexIdEELl1ENS6_9allocatorIS8_EEEEEEEvDpOT_ + 48
[caladan:64855] [23] 0   test_afqmc_prop_factory             0x0000000104ea5c98 _ZN11qmcplusplus5afqmc17PropagatorFactory20buildAFQMCPropagatorERNS0_10TaskGroup_EP8_xmlNodeRNS0_12WavefunctionERNS_10RandomBaseIdEE + 2096
[caladan:64855] [24] 0   test_afqmc_prop_factory             0x0000000104b571fc _ZN11qmcplusplus5afqmc17PropagatorFactory15buildPropagatorERNS0_10TaskGroup_EP8_xmlNodeRNS0_12WavefunctionERNS_10RandomBaseIdEE + 324
[caladan:64855] [25] 0   test_afqmc_prop_factory             0x0000000104b12250 _ZN11qmcplusplus5afqmc17PropagatorFactory13getPropagatorERNS0_10TaskGroup_ERKNSt3__112basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEERNS0_12WavefunctionERNS_10RandomBaseIdEE + 300
[caladan:64855] [26] 0   test_afqmc_prop_factory             0x0000000104b0ecc4 _ZN11qmcplusplus16propg_fac_sharedERN5boost4mpi312communicatorE + 6000
[caladan:64855] [27] 0   test_afqmc_prop_factory             0x0000000104b163a0 _ZN11qmcplusplusL19C_A_T_C_H_T_E_S_T_0Ev + 156
[caladan:64855] [28] 0   test_afqmc_prop_factory             0x0000000104d0f1d0 _ZNK5Catch21TestInvokerAsFunction6invokeEv + 28
[caladan:64855] [29] 0   test_afqmc_prop_factory             0x0000000104d082e8 _ZNK5Catch8TestCase6invokeEv + 40
[caladan:64855] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 64855 on node caladan exited on
signal 6 (Abort trap: 6).
--------------------------------------------------------------------------


0% tests passed, 1 tests failed out of 1

Label Time Summary:
afqmc              =   0.24 sec*proc (1 test)
deterministic      =   0.24 sec*proc (1 test)
quality_unknown    =   0.24 sec*proc (1 test)
unit               =   0.24 sec*proc (1 test)

Total Test time (real) =   0.29 sec

The following tests FAILED:
         59 - deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf (Failed)
Errors while running CTest

@correaa
Contributor

correaa commented Nov 21, 2024

I bisected Multi, and the error was introduced in July:

ef8043d4bf37518731356a14e658489aa951a875 is the first bad commit
commit ef8043d4bf37518731356a14e658489aa951a875
Author: Alfredo Correa <[email protected]>
Date:   Sun Jul 14 00:26:45 2024 -0700

    fix element access with paren

 include/boost/multi/array_ref.hpp | 6 ++++++
 test/element_access.cpp           | 3 +++
 2 files changed, 9 insertions(+)

@correaa
Contributor

correaa commented Nov 21, 2024

One of the errors happens here:

  BufferAllocatorGenerator(MemoryResource const& a, long initial_size = 0, Constructor const& c = {})
      : base_mr(a),
        _size(initial_size),
        _start(static_cast<pointer>(base_mr.allocate(_size, Align))),   // <----- HERE
        mr_({_start, _size}, std::addressof(base_mr)),
        constr_(c)
  {}

This is getting complicated; it is going to take time.
Feel free to roll back the changes in #5228, including the update of Multi.
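The thread does not establish why the marked line in the `BufferAllocatorGenerator` constructor above fails. One generic C++ pitfall worth ruling out at such a site: non-static data members are initialized in declaration order, not in the order written in the constructor's initializer list, so `_start` is only safe to compute from `base_mr` and `_size` if both are declared before it. A minimal sketch of the safe ordering, using `std::pmr::memory_resource` as a stand-in for the snippet's `MemoryResource`; `buffer_holder` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical buffer holder mirroring the shape of the constructor above.
// Members are initialized in DECLARATION order, so base_mr_ and size_ must be
// declared before start_, whose initializer uses both.
class buffer_holder {
 public:
  explicit buffer_holder(std::pmr::memory_resource* mr, std::size_t initial_size = 0)
      : base_mr_(mr),
        size_(initial_size),
        start_(static_cast<std::byte*>(base_mr_->allocate(size_, alignof(std::max_align_t)))) {}

  ~buffer_holder() { base_mr_->deallocate(start_, size_, alignof(std::max_align_t)); }

  buffer_holder(buffer_holder const&) = delete;
  buffer_holder& operator=(buffer_holder const&) = delete;

  std::size_t size() const { return size_; }
  std::byte* data() const { return start_; }

 private:
  std::pmr::memory_resource* base_mr_;  // declared first: initialized first
  std::size_t size_;                    // declared second
  std::byte* start_;                    // declared last: may use the two above
};
```

If `start_` were declared before `size_`, its initializer would read an uninitialized `size_` even though the initializer list lists `size_` first; GCC and Clang flag the mismatch with `-Wreorder`.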

@ye-luo
Contributor

ye-luo commented Nov 21, 2024

#5228 has been reverted and new CI coverage added.

@ye-luo ye-luo closed this as completed Nov 21, 2024
@correaa
Contributor

correaa commented Nov 21, 2024

sorry for the confusion, is it possible to revert back the Multi update, but not the using std::get updates in QMCPACK?

@ye-luo
Contributor

ye-luo commented Nov 21, 2024

> sorry for the confusion, is it possible to revert back the Multi update, but not the using std::get updates in QMCPACK?

That didn't work; I tried it earlier (see my comment above in #5235).

Once you have a fix in upstream Multi, first revert my revert commit and then update Multi in QMCPACK.

@correaa
Contributor

correaa commented Nov 21, 2024

Can you push a branch in which the std::get replacements do not work with the old version of Multi? (I don't know how to do that.)
