Event builder v2 tools #307

Open
fengyvoid wants to merge 5 commits into base: Application

Conversation

fengyvoid
Contributor

This PR includes tools for two toolchains, LAPPD_EB and EventBuilderV2, plus one tool for the BeamClusterAnalysis toolchain.

  1. LAPPD_EB generates a ROOT file containing LAPPD timestamps and CTC timestamps for fitting. A script is then used to fit these timestamps and generate an offset file for LAPPD event building (LAPPDTreeMaker and offsetFit_MultipleLAPPD.cpp).

  2. EventBuilderV2 is based on the data decoders from version 1 of event building, which primarily used ANNIEEventBuilder. It introduces a tool similar to the LAPPD data decoder (LAPPDLoadStore). Version 2 adopts a different strategy for loading raw data, discards the build type implementation (in EBLoadRaw), and separates the event-building processes for each subsystem into distinct tools (the other EB* tools).

  3. For BeamClusterAnalysis, I provide an updated version of the PhaseIITreeMaker: ANNIEEventTreeMaker. This tool generates a tree entry per event, rather than per CTC-MRD or CTC-cluster pair. It also includes more information, such as beam POT.

More details can be found in the slides from the collaboration meeting and the analysis workshop.
https://annie-docdb.fnal.gov/cgi-bin/sso/ShowDocument?docid=5633

fengyvoid and others added 5 commits October 11, 2024 00:30
Tools for Event Builder version 2, contains EB tools, and two tools for LAPPD_EB tool chain, and one script for LAPPD timestamp fitting
ANNIEEventTreeMaker tool for processed data from EventBuilderV2 tool chain, like the PhaseIITreeMaker
@marc1uk
Collaborator

marc1uk commented Nov 26, 2024

Hi Yue. Sorry for the delay on this.

First, some top-level comments on the PR - as always, please try to address just one thing with each PR. If this addresses 3 things, it would be better as 3 smaller PRs. If bugs spring up later, someone can check out a commit from before/after each PR merge, check whether the code works, and relatively easily pinpoint where the bug was introduced. Minimal PRs = fewer changes = less code to search for the bug in. Keeping all related changes together also makes it easier to spot when a change that impacts other tools has been introduced but not all impacts accounted for, etc.

Secondly, validation is good (especially for any PRs that affect "standard" toolchains), but I feel like it's really critical for event building. Have you processed the same runs with the old and new toolchains and compared the outputs?

If you can run in an 'equivalent' kind of configuration, a simple diff would be ideal, but probably such a simple apples-to-apples comparison isn't possible. In which case (or for other event building "modes") can we get a validation tool that confirms the same number of events in a run, same PMT / TDC hits in those events, the same trigger information (or equivalent as expected), etc etc. We could also check that building in 'PMT data only' and 'PMT+TDC' gives the same PMT data, and so on.
Please attach the corresponding comparisons here, specifying the files compared and the git commit used for the 'current' toolchain, with histograms of event metrics - numbers of hit PMTs in events, total charge in events, trigger times of different types, numbers of extended readouts, etc. Statistical comparisons covering multiple files would be ideal, but also event-level comparisons to check that all detector information is properly recorded (triggers, channels, times, charges).

That may be a lot of work, but changes to event building impact all analysis of data out of the detector - I do think we should validate before we 'tag' this as the new official version (especially for a rewrite of this magnitude, but the same tools should be used for any EB changes going forwards).

@marc1uk
Collaborator

marc1uk commented Nov 27, 2024

Related but independent: I also think it would be good to benchmark performance.
This should be simple enough, but gives us a heads up on things like memory leaks and steps that introduce significant processing slowdowns in the future. Process a few runs of different types (specifying which files for later comparison), and use the time shell builtin (or binary) to check processing time. Also check memory usage over the processing to make sure it's stable and not gradually increasing (indicating a memory leak). Something simple for the latter might be this script to wrap pmap.

@marc1uk
Collaborator

marc1uk commented Nov 27, 2024

Regarding 1. in your PR comment specifically

Tool X generates a file containing X... A script is then used to Y....

is a sentence I dread reading. Is there really no way to linearise this? Running a toolchain, then a script, then another toolchain is really antithetical to the ToolChain process. We've now had this with trigger overlap, with beam information, with LAPPD timing, with muon fitting.... The process to analyse data becomes a long string of scripts, executables, a bunch of intermediate files....
What exactly is the reason this step needs to be run independently? Our event building, by definition, is asynchronous - buffer events from multiple sources, match things up, combine and push them to an output buffer. What prevents that kind of situation being possible for LAPPD timing? ML model training, file overlap, and IFDB information not being accessible from grid worker nodes are all things we can't really work around, but we should really check, in every situation where we do this, that it's unavoidable.

n.b. is this the LoadOffsetsAndCorrections function in the LAPPDLoadStore tool? I see you're opening a TFile called offsetFitResult.root. Is there any check that the file being read does in fact relate to the data being processed (e.g. checking the run/subrun numbers match)? I think it would be good to ensure the file used is applicable, and not just a stale one (in case someone forgot to do that multi-step process - a good reason to avoid this requirement).
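For instance, conceptually something like this when the offsets are loaded (a sketch only - the tree/branch and variable names here are hypothetical):

	// verify the offset file corresponds to the run / part file being built
	TFile* fofs = TFile::Open("offsetFitResult.root", "READ");
	TTree* tofs = (TTree*)fofs->Get("offsetFit");                 // hypothetical tree name
	int fitRunNumber = -1, fitPartFileNumber = -1;
	tofs->SetBranchAddress("runNumber", &fitRunNumber);           // hypothetical branch names
	tofs->SetBranchAddress("partFileNumber", &fitPartFileNumber);
	tofs->GetEntry(0);
	if(fitRunNumber != currentRunNumber || fitPartFileNumber != currentPartFileNumber){
		Log("LAPPDLoadStore: offsetFitResult.root does not match the data being processed!", v_error, verbosity);
		return false;
	}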

@marc1uk
Collaborator

marc1uk commented Nov 27, 2024

OK, some specific comments on the actual code:

  • In ANNIEEventTreeMaker:
m_variables.Get("isData", isData);
m_variables.Get("HasGenie", hasGenie);

Can we get these from the input file? Surely we should not need to ask the user to tell us whether it's data or not.
LoadWCSim, for example, sets m_data->CStore.Set("WCSimVersion", WCSimVersion); so you could check whether the file being read is from WCSim via m_data->CStore.Has("WCSimVersion");. I would guess there's probably some appropriate equivalent in LoadRATPAC and LoadGenie, or something appropriate in data. Fewer config options = less for users to configure wrong. :)
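Something along these lines might work (untested sketch, assuming the MC loader tools run before this one):

	// sketch: infer data vs MC from the CStore rather than from a config option
	isData = !m_data->CStore.Has("WCSimVersion");
	// an explicit config override could still be kept if really needed:
	// m_variables.Get("isData", isData);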

  • Does the closing } at line 486 need to be moved to line 479? Shouldn't cuts be applied even if fillCleanEventsOnly is true, for clean events?

  • line 970 - replace cout usage.

  • Why is there a Get into an unsigned int then static cast into int? BoostStore::Get uses the streamer operator>>, should be fine to populate a signed integer directly. This is done a few times throughout the code.
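i.e. instead of something like this (illustrative store, key and member names), per the above it should be possible to Get straight into the signed member:

	unsigned int tmp_runnumber;
	SomeStore->Get("RunNumber", tmp_runnumber);      // SomeStore: whichever store is being read here
	fRunNumber = static_cast<int>(tmp_runnumber);
	// simpler:
	SomeStore->Get("RunNumber", fRunNumber);         // fRunNumber declared as int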

  • Lines 988, 993 - can just use fTankMRDCoinc = pmtmrdcoinc for each of these. The compiler will implicitly convert bool to int.

  • Line 1023 seems redundant (done by ResetVariables call).

  • Line 1136 - here and in several other places, I don't know why we do this:

    unsigned long channel_key_data = channel_key;
    if (!isData)
    {
      int wcsimid = channelkey_to_pmtid.at(channel_key);
      channel_key_data = pmtid_to_channelkey[wcsimid];
    }

surely channelkey_to_pmtid and pmtid_to_channelkey are inverses of each other, which would make channel_key_data == channel_key, which is already the case at line 1136, making this if block redundant?

  • ANNIEEventTreeMaker.cpp lines 1114 to 1189, and a bunch of other places where data and MC are being handled essentially identically by two different blocks of code, could be consolidated without the need for the split if/else, casting, weird loop_tank checks etc. For example:
	// processing function - agnostic of Hit or MCHit
	// (capture by reference so geom, ChannelKeyToSPEMap, fNHits etc. are visible inside)
	auto thefunc = [&](auto&& apair){
		
		unsigned long channel_key = apair.first;
		
		Detector *this_detector = geom->ChannelToDetector(channel_key);
		Position det_position = this_detector->GetDetectorPosition();
		unsigned long detkey = this_detector->GetDetectorID();
		
		if(ChannelKeyToSPEMap.find(channel_key)==ChannelKeyToSPEMap.end()) return;
		
		auto&& thehits = apair.second;  // will be either vector<Hit>& or vector<MCHit>& as appropriate
		fNHits += thehits.size();
		
		for(auto&& ahit : thehits){
			
			// ahit will be either Hit or MCHit as appropriate,
			// but we only call base class functions so we don't care which it is.
			// Could use `MCHit* hitp = dynamic_cast<MCHit*>(&ahit);` to check which, if desired
			
			double hit_charge = ahit.GetCharge();
			double hit_PE = hit_charge / ChannelKeyToSPEMap.at(channel_key);
			// etc etc
			
		}
		
	};
	
	// call the processing function with either the Hit or MCHit container as appropriate
	if(isData){
		std::for_each(Hits->begin(), Hits->end(), thefunc);
	} else {
		std::for_each(MCHits->begin(), MCHits->end(), thefunc);
	}
  • Probably a similar thing could be applied at line 1465.

  • Similarly, there looks to be a lot of overlap between LoadTankClusterHits and LoadTankClusterHitsMC - probably the same remark applies here. It seems like the only difference between these functions is again the odd treatment of channel ID mapping:

	unsigned long detkey = cluster_detkeys.at(i);
	int channel_key = (int)detkey;                 << redundant type cast
	int tubeid = cluster_hits.at(i).GetTubeId();
	unsigned long utubeid = (unsigned long)tubeid; << redundant type cast
	int wcsimid = channelkey_to_pmtid.at(utubeid);             << channel key -> pmt id
	unsigned long detkey_data = pmtid_to_channelkey[wcsimid];  <<                pmt id -> channel key
	int channel_key_data = (int)detkey_data;       << redundant type cast
	std::map<int, double>::iterator it = ChannelKeyToSPEMap.find(channel_key_data);

if there's some reason this cyclic mapping is needed, perhaps we should wrap it in a single function in the datamodel, rather than needing to repeat this multi-step operation.
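If it is needed, a rough sketch of such a helper (hypothetical function name, assuming the two maps are accessible from the DataModel):

	// map an MC channel key to the corresponding 'data' channel key via the WCSim PMT id;
	// for real data the key is returned unchanged
	unsigned long DataModel::GetDataChannelKey(unsigned long channel_key, bool isData){
		if(isData) return channel_key;
		int wcsimid = channelkey_to_pmtid.at(channel_key);
		return pmtid_to_channelkey.at(wcsimid);
	}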

  • Line 1731 - if this is a fatal error (genuinely only if a user has not included a necessary Tool) it would be good to set m_data->vars.Set("stop_loop",1) to terminate the toolchain. Printouts can often be missed.
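e.g. something like (a sketch, assuming the tool's usual Log / v_error / verbosity members):

	Log("ANNIEEventTreeMaker: required upstream information missing - is the necessary Tool in the ToolChain?", v_error, verbosity);
	m_data->vars.Set("stop_loop", 1);   // stop the toolchain rather than relying on the printout being seen
	return false;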

  • same repetition of code at 1771 for TDCData / TDCData_MC.

  • Feels like the checks and corrections at 1974 should not be in this Tool - should be part of the reconstruction...

  • EBLoadRaw line 295 - maybe relax this check to if(line[0]=='#'), otherwise a trailing comment would result in that file being omitted.
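i.e. something like:

	if(!line.empty() && line[0]=='#') continue;   // skip only whole-line comments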

  • Can we translate EBPMT.cpp line 386 to English please?

  • EBSaver has a lot of cout usage. Can we switch these for Log calls?
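e.g. (illustrative message, assuming the usual v_message / verbosity members):

	// instead of: std::cout << "EBSaver: saving entry..." << std::endl;
	Log("EBSaver: saving entry...", v_message, verbosity);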

  • The block of lines 163-214 seems to be printing out all removed events - is this needed? Maybe we can put this all behind a debug verbosity check.
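e.g. wrapping it in something like (same verbosity members assumed as above):

	if(verbosity >= v_debug){
		// existing printout of removed events here
	}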

  • Lines 215, 242, 267 - English please.

  • EBSaver Finalise - it seems like this is printing out some information about unbuilt events, with the MRD information specifically written to file. This could be useful information to track over time. Perhaps all that information (maybe also with summary information on what was built) could be put into a text summary for that event-builder output, and stored as meta-information with the output file. This seems like it would be a nice way to track information about file contents and runs. We could read the contents of those meta-info files to track those metrics over time, or possibly later put them into SAM dataset meta-info.

  • LAPPDLoadStore around lines 674/705 looks like similar code to that in the monitoring toolchain, which crashes and has to be worked around. There are some try/catches here too - does this throw? Perhaps these were added at my request, to investigate - did you compare what this is doing with the online monitoring code, to see if there are differences that may explain why the monitoring code doesn't like it?

  • I see a new without a delete - in fact, is there any reason PedestalValues needs to be on the heap at all? As far as I can tell it could just be a member vector, not a pointer.

@fengyvoid
Contributor Author

> What exactly is the reason this step needs to be run independently? [...]
>
> n.b. is this the LoadOffsetsAndCorrections function in the LAPPDLoadStore tool? I see you're opening a TFile called offsetFitResult.root. Is there any check that the file being read does in fact relate to the data being processed (e.g. checking the run/subrun numbers match)?

Thanks for the reply!
For this process, the reason is that the LAPPD timing offset needs a set of PPS events to be fitted (usually from one part file). If I included this in the toolchain processing, the pairing of LAPPD events would be delayed until the end of processing that part file, and would introduce other complications.
I do have checks to ensure the file used is correct: the run number, part file number and event numbers are saved independently in the fit result, then loaded and indexed, so they are checked when loading and using the offsets.
