I've had many questions over the years about PRESTO. This FAQ is an attempt at getting some of them -- and their answers -- down on digital paper. If you have any additions, please let me know!
Scott Ransom [email protected]
I've read the tutorial, but there seem to be a lot of other programs in $PRESTO/bin
. What are some useful ones and what do they do?
Here are a few. Remember that for most PRESTO routines, if you run the command with no additional options, it gives you a brief usage statement.
-
cal2mjd
andmjd2cal
: These compute a UTC MJD from a calendar date and time, or vise-versa. -
psrfits_quick_bandpass.py
: If you have PSRFITs searchmode data, this routine will quickly compute (and plot if requested) the average and standard deviation of the bandpass of your data. You can ignore theDAT_SCL
andDAT_OFFS
if you want, as well (i.e. to check bit occupation). -
rfifind_stats.py
: Do a kind of integrated statistics onrfifind
result files, including creating files that have an ASCII bandpass (".bandpass"), the channels recommended to zap (".zapchans"), and recommended weights (".weights") for each channel. -
weights_to_ignorechan.py
: Convert a ".weights" file fromrfifind_stats.py
into a compact format that can be passed to PRESTO routines using the-ignorechan
flag, or to PSRCHIVE'spaz
routine. -
combine_weights.py
: If you use the above routines to get multiple ".weights" files per observation (i.e. by runningrfifind
on subsets of a long observation), this will combine the files (via logical "or") to make a "combined.weights" file. -
toas2dat
: If you have events (e.g. photon arrival times), this simple routine will bin those into a time series (i.e. a ".dat" file) that can be processed by PRESTO. -
show_pfd
: Do various things to aprepfold
".pfd" file, including zap interference as a function of time (-killparts
) or freq (-killsubs
), partially fixing corrupted$\chi^2$ values due to interference (-fixchi
), and make publication-quality phase-vs-time plots (-justprofs
). -
pygaussfit.py
: A interactive gaussian fitter that reads in a ".pfd.bestprof" file and outputs text that can be stored in a ".gaussians" file, which can be used as a template to get TOAs withget_TOAs.py
. -
pyplotres.py
: An interactive timing residuals viewer that works with original TEMPO. -
sum_profiles.py
: A routine that will let you correctly sum various ".pfd" files, including rotating them so that they are aligned, to make a high signal-to-noise profile. Can also be used to give radiometer equation estimates of flux density for the same profiles. -
dat2sdat
andsdat2dat
: Not used much anymore, but a way to shrink the size of ".dat" files by a factor of 2, by saving them as short integers (and vise-versa). -
gotocand.py
: Would need to be modified for your own computer system(s), but a way to much more easily foldaccelsearch
candidates, even if they are potentially on other nodes (viampiprepsubband
). The-local
option is useful if you are just using a single machine. -
orbellipsefit.py
: Get an initial orbit fit using the ellipse-fitting technique as described in Freire, Kramer, & Lyne 2001. Note that this does not coherently connect observations via the orbital period. -
fit_circular_orbit.py
: Fit a circular orbit to data using ".bestprof" files (or daily ".par" files) as input, where the orbit is coherently connected between observations. This has helped to solve many binary pulsars! -
fitorb.py
: Similar to the above routine, but it can fit eccentric orbits if needed (once again uses ".bestprof" or daily ".par" files as input). -
quickffdots.py
: Quickly plot a small portion of the f/fdot plane in an ".fft" file (including summing up to 4 harmonics). -
quick_prune_cands.py
: A quick sifting script for individual accelsearch result files. It is quite good at throwing out non-pulsars. -
pfd_for_timing.py
: Returns true or false for a ".pfd" file to let you know if it can be used to get TOAs or not (i.e. was it used to search for the best p/pdot or not). -
makedata
: A crude (but effective!) routine that lets you generate a time series with simulated pulsations. This was the very first routine in PRESTO, and my first routine in "C" -- the code is particularly gross. :-)
There are currently two different versions of GPU-accelerated accelsearch
,
although to my knowledge, neither of them do jerk searches.
I believe that they can both be used as drop-in replacements for the regular
accelsearch
, but with everything else coming from the current, standard
PRESTO.
From Jintao Luo: https://github.com/jintaoluo/presto_on_gpu
From Chris Laidler: https://github.com/ChrisLaidler/presto
And slightly further afield, there is a new GPU implementation of the Fourier Domain Acceleration Search here: https://github.com/AstroAccelerateOrg/astro-accelerate
I've read the tutorial, but this is just too complicated. Is there any easier way to run this software and find pulsars?!
Yes! Thanks to Alessandro Ridolfi, there is PULSAR
MINER,
which uses PRESTO under the hood, but with a mostly automated pipeline that you
set up with a simple configure script. It can even use Jintao Luo's
GPU-accelerated accelsearch
by default.
PULSAR MINER has been used to find many of the newly discovered globular cluster MSPs from MeerKAT.
Many of these routines are really slow, and they don't seem to get faster using the -ncpus
option? Why is that?
Most of PRESTO was written before multi-core processing was really much of a
thing. And so it was designed to be mostly serial, with parallelism coming via
independent searches of different DMs at the same time (which is why
mpiprepsubband
was written: to speed-up de-dispersion and to distribute the
time series among many processors/nodes).
I've added some OpenMP pragmas over the years, and they make some small
improvements in, for example, dedispersion loops. But the improvement is not
great. For prepdata
, prepsubband
, and rfifind
, I'd recommend not using
more than 3-4 CPUs currently with -ncpus
. And even then, you won't see
anything close to a 3-4x speedup (probably only a couple tens of percent).
accelsearch
is the exception, and it does fairly well with -ncpus
, although
the performance got much worse after the addition of the jerk-search code.
In summary, the OpenMP stuff is very much a work in progress, and I'd love to work with someone on this if there is any interest. I suspect that significant speed-ups could be had without a ton of work. It is on my ToDo list!
One of the biggest problems with dedispersion is the I/O bottleneck -- you are taking data in one big raw-data file and producing hundreds or even many thousands of time series. Most computer systems don't do well with many parallel output streams, and so it can take a long time to write all of those dedispersed files.
One way to help the output is to split it over many different machines. That
way, each machine gets a fraction of the time series and there is less
bottleneck on each machine. This is what mpiprepsubband
does. A single CPU
reads the raw data, it distributes it over the network to different nodes, and
each node (using many cpus on each node) then dedisperses a fraction of the DMs.
In order to translate a dedispersion plan (as generated by DDplan.py
using the
-s
subband option) into a proper call for mpiprepsubband
, there are several
things to remember:
- You will probably not see much of an advantage unless you dedisperse using multiple different machines. Many cores on the same machine does not solve the I/O problem.
- You need one CPU on a machine that can see the raw data. And then N other CPUs on each of M other machines/nodes to do the dedispersion. That means a total of M * N + 1 CPUs. And the bigger M is (i.e. number of dedispersing nodes), the better your I/O performance will be.
- A single
mpiprepsubband
call is like a full line of theDDplan.py
output: each of the M * N dedispersing CPUs is equivalent to an independentprepsubband
call. - The total number of DMs from an
mpiprepsubband
run must be divisible by M * N (i.e. each CPU generates the same number of output DMs). DDplan.py
knows about these constraints and will use them if you use the-p numprocs
flag.
If your data have multiple frequency channels, you first need to integrate them (with de-dispersion, if necessary) in order create a 1-D time series. A very good way to do that, which will let you use almost all of PRESTO's tools is to convert the data to the relatively simple SIGPROC filterbank format (there are some tools in PRESTO to help with that -- see below).
If your data are events, you need to put them into a binned 1-D time series if
you are going to search them with accelsearch
. There is a PRESTO utility for
that called toas2dat
. Note that prepfold
can fold events directly using the
-events
option, if you specify their format and have a ".inf" file that
describes the observation.
Finally, your 1-D time series need to be saved in binary form as 32-bit floating point numbers. You should give it a ".dat" suffix. You then need an ASCII ".inf" file to go with it. You can use the PRESTO tutorial and de-disperse one of the test datasets to see what a radio ".inf" file looks like and simply edit it for your data.
Note that you probably need accelsearch
. And even then, you need to be very
careful about the frequency limits -flo
and -fhi
if they are outside the
normal pulsar-search regime!
I have some baseband data (perhaps from a Software-defined Radio or SDR setup), can I get that into a format for PRESTO to process?
I'd recommend that you convert your data to the simple (except for the header...ugh) SIGPROC filterbank format after "detecting" and channelizing.
To get data in SIGPROC format, you will need to write a SIGPROC header, and then simply append the binary-format channelized Stokes I (i.e. total intensity) data right after that.
The DSPSR software suite does have the
capability to do some of these things (i.e. channelizing and detecting baseband
and writing it to SIGPROC or PSRFITS format), but it might not know about the
exact format of data you are using. The routine in DSPSR you would want is
called digifil
.
Alternatively, there is some python code in PRESTO that you can hack so that you
can write a filterbank header and then stream the data into the file from
numpy
arrays (for instance). The code is called sigproc.py
in
$PRESTO/python/presto/
. And there is some related code in the same directory
called filterbank.py
.
Do the p and pdot (or f and fdot) values returned by PRESTO refer to beginning of the dataset or the middle of the dataset?
Depends on the search tool:
-
For
accelsearch
, the f, fdot, (and fdotdot, if requested) returned are average values during the observation, so they apply to the midpoint of the dataset. -
For
prepfold
, the reference time is always the beginning of the observation.
Can I get the average barycentric velocity of an observatory during an observation in a nicer way than running (e.g.) prepdata
on the raw data file?
Yes, there is a python convenience function available in the 2nd-level presto
module (i.e. from presto import presto
):
get_baryv(ra, dec, mjd, T, obs="PK"):
Determine the average barycentric velocity towards 'ra', 'dec'
during an observation from 'obs'. The RA and DEC are in the
standard string format (i.e. 'hh:mm:ss.ssss' and 'dd:mm:ss.ssss').
'T' is in sec and 'mjd' is (of course) in MJD. The obs variable
is the standard two character string from TEMPO: PK, GB, AO, GM, JB, ...
Note: there is also a general-purpose barycentering code you can run called
bary
, where you feed it topocentric UTC MJDs in ASCII from STDIN (and give it
a single '.' to tell it you are done).
Some of my plot *.ps
outputfiles are not being written correctly. The filenames are chopped off! What's up with that?
PGPLOT has a hardcoded limit on the length of a filename of something like 90 characters. So if you are including a long absolute path, or specifying a very long output filename, you might run into this and get a truncated filename!
If an absolute path is the issue, I recommend writing the output files to the current directory, and then moving the files once they are written. This isn't something I can really fix in PRESTO.
RFI is handled multiple ways in PRESTO, most of which are optional. There is a pretty good discussion about them in Lazarus et al 2015.
The quick summary is that RFI is handled in at least six different ways:
- Any processing of a raw data file (i.e. SIGPROC filterbank or PSRFITS
search-mode) by default has clipping turned on, which will zap zero-DM
short-duration pulses or data dropouts. It is controlled with the flag
-clip
which sets the threshold time-domain S/N to clip, or turned off completely with-noclip
. I highly recommend that you not turn this off! It is almost always useful and will not harm (the vast majority of) dispersed astrophysical pulses. rfifind
finds and masks short duration and/or narrow band periodicities or time-domain statistical anomalies in the data. There are many options. You can also generate channel zaplists usingrfifind_stats.py
andweights_to_ignorechan.py
, which can used handled by the variousprep*
routines.- Red-noise suppression in the power spectrum of de-dispersed data (i.e.
".fft" file) using the
rednoise
command. - ".birds"-file zapping of known periodic signals during Fourier searches via
zapbirds
,simple_zapbirds.py
, oraccelsearch
itself (if searching ".dat" files directly). This is used to zap broad-band periodic signals such as known pulsars, power line modulation, etc, directly in the power spectrum (i.e. ".fft" file). ACCEL_sift.py
can remove anomalous candidates during search sifting (user configurable).- You can use
-zerodm
processing (see Eatough et al, 2009) after runningrfifind
and using the resulting mask files with theprep*
routines. Be careful with this as it definitely removes power from mildly-dedispersed pulsar signals! (But it can also be extremely useful.)
What is the difference between using -ignorechan
and explicitly including channels that you want to zap in an rfifind
mask using -zapchan
? Is one preferred over the other?
-ignorechan
is a recent-ish addition to PRESTO, and, in general, gives
better performance for most cases if know know for certain what channels
you want to zap for the full observation. You pass the list of bad
channels to rfifind
and all of the other raw-data routines in PRESTO
(i.e. prepdata
, prepsubband
, prepfold
, and mpiprepsubband
), and
those routines then completely ignore those channels. If those channels
have bad data (i.e. they are at band edges or something like that), this
will allow rfifind
to do its job on the channels where there is real
signal and real noise.
The difference between using -ignorechan
and -zapchan
is that the
former completely ignores those channels in all processing (as if they
were pure zeros), while the latter, when used with an rfifind
mask, will
effectively replace the channel values with a smooth-ish running median
(as determined from the rfifind
"stats" file) for that channel over the
observation.
If the data are well behaved, then they should both do about the same thing. However, if the input data are badly behaved, for example if the power levels change a lot during the observation, and especially if the statistics of the channels in question are highly variable, then masking the data can end up leaving low-frequency artifacts in the resulting time series or folds. And those often show up in your searches or folds and are not good.
If you don't know what channels might be bad, and which you might want to
use -ignorechan
on, you could do a quick first pass with rfifind
on a
portion of the data (for instance) to identify the bad channels, and then
make an -ignorechan
list from that. That's exactly what rfifind
+
rfifind_stats.py
+ weights_to_ignorechan.py
does. And if you are
lucky, you can simply use the resulting -ignorechan
and its values and
not even use an rfifind
mask.
The way I use -zapchan
(which is only rarely) is almost always after I
do a search (or after closely looking at my rfifind
mask results) where
I see that some periodic RFI is leaking through into certain channels. So
I change the mask (via rfifind -nocompute
) to explicitly add those
channels (they may have been partly masked but didn't pass the -intfrac
cutoff to have them zapped completely).
I'm seeing strong 60 (or 50!) Hz signals in my accelsearch
results which are obviously from the power mains. Why doesn't rfifind
filter that out?
If the 50Hz signal is strong enough to show up in individual frequency
channels for relatively short periods of time, then rfifind will flag it (and
mask away that channel). However, usually power mains signals are weaker and
show up across the band (i.e. they are too weak to be detected in a single
channel after only integrating for a few seconds). We usually zap those in the
FFT searches using a ".birds" file (with the zapbirds
or simple_zapbirds.py
commands). That zeros out portions of the power spectrum where there are known
RFI signals (and their harmonics!). Run simple_zapbirds.py
with no command
line options for more information.
This is discussed a little bit in the PRESTO tutorial, and in more detail in Lazarus et al 2015.
Yes, at least to get a pretty good removal of the rednoise, try this:
- Prepare and de-disperse your data as per usual to get a ".dat"/".inf" file pair.
- Run
realfft
on the ".dat" file to get a ".fft" file. - Run
rednoise
on the ".fft" file. That will create a "_red.fft" file that will have been de-reddened. - If you simply want to search the data, you can search the "_red.fft" file.
- If you want a de-reddened time series, perhaps for folding with
prepfold
, runrealfft
on the "_red.fft" file, which will inverse-FFT it and create a new "_red.dat" file that will be the de-reddened time series.
This tends to work quite well!
Each small chunk (in time) of data from each frequency channel is FFT'd,
converted to powers, and then normalized before being searched by the routine
search_fft()
which is in minifft.c
. The normalization constant is computed
from the standard deviation of the time series (and the number of points) for
the small chunk of data that is being FFT'd.
rfifind
then takes the result of that search and checks if any of the powers
in the short FFT (in one interval in one channel) are above the equivalent
gaussian significance -freqsig
if you search an FFT of that length (including
trials factors).
Yes. There is a rfifind
module in PRESTO's Python tools. That at least lets you read and play with mask and statistics values from rfifind
. It is not currently possible to write "mask" files, but that could change (and would not be difficult). Here is an example of usage:
In [1]: import presto.rfifind as r
In [2]: r = r.rfifind("GBT_Lband_PSR_rfifind.mask")
In [3]: r.nint
Out[3]: 37
In [4]: r.read_mask()
In [5]: shape(r.mask_zap_chans_per_int)
Out[5]: (37,)
In [6]: r.mask_zap_chans_per_int
Out[6]:
[array([56, 59, 94], dtype=int32),
array([18, 32, 53, 94], dtype=int32),
...
array([44, 50, 94], dtype=int32),
array([94], dtype=int32)]
Note that the first r.rfifind()
call does a read_mask()
automatically. So
that is already available in r.mask_zap_chans_per_int
as soon as you load the
first file.
So PRESTO always processes data so that the lowest frequency channel is channel 0. If the bandwidth is reversed in a filterbank or PSRFITS file, it will flip it in memory automatically.
You can tell if this is happening by running readfile
on the file. If the BW
is lower sideband (i.e. needs flipped) you will see:
Invert the band? = True
If you see that, you likely will need to change your channel ordering if you want to zap channels using other software (i.e. PSRCHIVE).
You can view the band and channel mask by using the command rfifind_stats.py MYFILE_rfifind.inf
Maybe! I highly recommend you browse the code in $PRESTO/python
, and
especially the highly useful psr_utils.py
. I normally load that with: import presto.psr_utils as pu
.
You can also read in rfifind
result files using the presto.rfifind
module
and prepfold
files using the presto.prepfold
module. The ATNF pulsar
database is available in presto.pypsrcat
. And there are tools for reading
filterbank and PSRFITs files into PRESTO as well (i.e. psrfits.py
,
filterbank.py
, and spectra.py
). You can read TEMPO residuals using
residuals.py
.
What are reasonable ranges to use for -zmax
and/or -wmax
if I want to find binary millisecond pulsars?
This was discussed a bit near the end of section 2.2 in Andersen & Ransom
2018. I ran a
bunch of simulations on MSPs around stellar-mass-type companions, and the vast
majority can be detected with -zmax
< 200 or so, and -wmax
< 600, if the
orbital period is longer than about 10 times observation duration.
Note that all acceleration and jerk effects affect higher harmonics of a pulsar
signal more than they do the fundamental. So you can typically drop the number
of harmonics you search (-numharm
) down to 4 or so if you are looking for
really accelerated systems.
If you are looking for massive companions (i.e. big black holes), then the
zmax
and wmax
values can be quite a bit larger (i.e. in the thousands). The
problem there lies in the fact that as you go to very large zmax
values, it is
likely that the acceleration isn't constant during the observation, and so you
need to use some jerk searching to compensate. The same thing happens for the
constant jerk assumption in jerk searches, but PRESTO doesn't have "snap"
searches yet... ;-)
In short, sigma is the probability that a given signal might be due to noise, but expressed in terms of equivalent gaussian sigma, despite what the true probability distribution for noise values is. In other words, it is shorthand for a probability.
The way that sigmas are computed in searches is a tricky thing. And even in an
individual accelsearch
"cands" file, the sigma term means different things
depending where you are looking in the file.
The "summary" candidate list at the top of the file is from the search through the full f/fdot plane, where the Fourier powers are normalized by block running medians.
In the bottom, where the harmonics of each candidate are analyzed, each harmonic is individually maximized in the f/fdot plane, and the average local power level is measured using powers around the harmonic but not including the 5 closest Fourier bins on each side.
If the data are statistically very well behaved and if the harmonic is not super strong, so that the sinc-function sidelobes don't mess up the local power computation, then the normalized powers measured in both ways should be quite similar.
But if there is a lot of RFI or red noise or some other reason why the power spectrum isn't "nice" (as you would get from pure and unchanging gaussian noise in the time domain), then the two different normalizations might be quite different.
The normalization used in the search is faster, while the normalization used for optimization is "more correct" (since it doesn't at all use the frequency in question, but only nearby frequencies).
But that is just the power. The sigmas as calculated from the powers have another important difference:
- The significance at the top of the files in the summary candidate list includes a correction for the number of independent frequencies/fdots searched (for that single run of accelsearch, i.e. not including others DM trials, for instance). That is the so-called "look-elsewhere" effect.
- The significances at the bottom of the candidates file (where you get detailed information about each harmonic of each candidate), assume that you are searching only a single Fourier frequency (i.e. single trial, so there is no trials correction). That means that for a single-harmonic search, even with pure gaussian data, you would see different significances top and bottom.
Note, however, that ACCEL_sift.py
is made for comparing results between
different searches (i.e. over DMs, for instance), and so the sigmas that it
returns are all single-trial, so that they can be compared!
Finally, as to what the sigmas mean, in general, they are the equivalent
gaussian significance (i.e. meaning as if the powers were gaussian distributed,
which they are not, they are
How can one obtain the spectral signal-to-noise (S/N) from a given sigma value for an accelsearch candidate?
I actually recommend that you don't do that. I think that S/N in the Fourier domain (and especially in powers) is not a really useful quantity. You can't directly compare the S/Ns of different numbers of harmonics, for example. But you can compare their equivalent gaussian significances. That is precisely why I use sigma (i.e. probabilities) and not S/N.
However, if you need to do it for some reason, basically, the power spectrum is
normalized so that the mean noise values have a level of one. But these are
powers. S/N should be in amplitude. So you would take the normalized power
level of a harmonic, take the sqrt of it (to convert it to an amplitude), and
subtract 1 from the result (since the mean is 1). You can sum the S/N of each
harmonic together to get a rough idea of the S/N of the signal as a whole. But
once again, beware that you cannot properly compare the S/N of a sum of 16
harmonic powers with the S/N of a sum of 8 harmonics, for instance. They have
very different probability distributions (
If you simply run accelsearch
you will see all of the default options. One of
those is -sigma
, with a default value of 2. That number is the equivalent
gaussian significance of a candidate, using the initial median-block
normalization (or normalization via number of photons if using -photon
), and
including the number of independent trials searched for that time series.
Given that that is already quite low, you probably don't want to go much below 1.5-sigma or so, or you'll start getting a lot of false positives.
I've restricted the frequency range of my search using -flo
/-fhi
or -rlo
/-rhi
, but accelsearch
is still returning candidates outside of that range! Why is that?
Those options definitely work, however, they correspond to the frequencies of
the highest harmonic searched. The subharmonics of those frequencies will
also get searched, and those subharmonics might be below the frequency of
-flo
. If you are searching only with a single harmonic (-numharm 1
), then
the frequency ranges work exactly as you would expect.
The reason accelsearch
does this is that harmonic summing is done based on the
highest harmonic in order to save operations and memory. If we search a region
around frequency N, that corresponds to a single-harmonic search. If we add
to that the frequencies around N/2, then it is a two-harmonic search. To get
a four harmonic search, we only need to then add the frequencies around N/4
and 3N/4. And all of these subharmonics are smaller "chunks" of the f/fdot
plane that can be interpolated to exactly correspond to the full-resolution
search around frequency N.
Note that the frequency ranges also change the number of independent trials
searched, and so they will affect the sigmas returned by accelsearch
.
No. Very simply, you run search_bin
on ".fft" files, and manually examine the
resulting candidate files. If there are good candidates, you can try to get a
phase-coherent model of the signal using the bincand
command (which is quite
time-consuming). But both of these codes are really stale. I don't think I've
run them in 8+ years, so there could easily be bugs. It has been on my to-do
list to update for a long time...
For reference, the details as to what is going on in these codes is described in Ransom, Cordes, & Eikenberry, 2003.
Since prepfold
is based on folding data, which by definition needs to account
for pulse phase correctly, the search ranges are based on a accounting for error
in phase drift over the duration of the observation for p and p-dot and over the
bandwidth of the observation for DM. In general, we allow for 1 or 2 full phase
wraps of error (meaning one or two pulse periods, which we have divided into the
number of bins that are in the pulse profile,
For period and p-dot, it is easier to think in terms of spin frequency
So prepfold
allows usually +/-2 full phase wraps of error, which is
quantized for each pulse period by the number of bins in the profile, -pstep
and -pdstep
and the number of rotations we allow
using -npfact
.
It works very similarly with DM in that we quantize the DM delays as a function
of observing frequency in the -dmstep
and the number of rotations using -ndmfact
. This is one of the main
reasons why if you are folding a slow pulsar (which means you don't have good
resolution in DM) that you want to use a larger number of bins (100 or more) in
your pulse profile in order to better resolve the DM curve. Note that we don't
plot or search negative DMs, so DM=0 is always a hard minimum.
There are -fine
and -coarse
options that give you finer or coarser
resolution in these dimensions, as well as a -slow
option that helps you out
for folding slow pulsars.
prepfold
uses the reduced chi-squared (or
Basically, this is like a normal
Because of the exact way that prepfold
folds binned data, the statistics are
slightly more complicated than that, in reality (see below), but the bottom line
is that the larger the
Note that for the statistics to be reasonably correct (in the face of strong
RFI, for instance), the off-pulse (i.e. away from the periodicity in question)
In a prepfold
plot of a search candidate, there is a significance provided by the chi-square. Does that include the number of trials searched?
No, the prepfold
is single-trial significance. And it is only
valid if the mean of the off-pulse (i.e. away from the periodicity in question)
Just as for sigma in accelsearch
, sigma in prepfold
is the probability
that a given pulsed signal might be due to noise, but expressed in terms of
equivalent gaussian sigma, despite that the true probability distribution for
prepfold
trials is
As described above and below, prepfold
uses reduced chi-squared to determine
pulsation significance (which, if all due to noise, would be distributed as a
For real pulsar signals, that probability can be a tiny number, so I convert it
to the equivalent gaussian significance in sigma (i.e. meaning as if the
significance distribution was gaussian distributed rather than
The folding algorithm in prepfold
is different than in many other pulsar
folding codes. Instead of assuming that each datapoint is a delta function in
time, and therefore corresponding to an infinitely short portion of a pulsar's
rotation (which can be put all in one pulse profile bin), prepfold
assumes
that the data point is finite in duration (i.e. integrated in time), which most
data actually are, especially search-mode data which PRESTO is focused on.
prepfold
knows the instantaneous start and end phases of each input data bin,
and it effectively "drizzles" the content of that data bin uniformly over as
many pulse profile bins as are needed to contain it in phase. If the duration of
the data bins are similar to or longer than the duration of the pulse profile
bins, the "drizzling" causes significant correlations between neighboring pulse
profile bins. And that leads to effectively smoother profiles (i.e. less RMS
from the noise) and fewer effective degrees of freedom if you use prepfold
does.
There is a semi-analytic correction for this effect that has been tested using
large numbers of simulations with fake data. You can see it in the DOF_corr
function/method in src/fold.c
or python/presto/prepfold.py
, respectfully.
That correction can also be used to correct for the noise level in the pulse
profile, for more accurate flux densities estimates via the radiometer equation,
for example (sum_profiles.py
uses the correction; newrms = oldrms /
sqrt(DOF_corr)).
The effect and the correction are described in detail in Appendix E of the recent paper Bachetti et al., 2021.
-
In general, if you are folding candidates from an
accelsearch
run:- Use the
-accelcand #
and-accelfile FILE
options to specify the basic p/pdot/pdotdot values of the fold, directly from the search. - If you are folding raw data, make sure you also specify the
-dm
! - If you find that your candidate is actually a harmonic (or if you want to
check), you can adjust the folding parameters automatically to account for
that using
-pfact
or-ffact
, which are multiplicative factors on the period or spin frequency.
- Use the
-
If you are folding topocentric time series data in order to get quick-and-dirty TOAs for solving a timing solution, for example, make sure that you fold with
-nosearch
and-n
which is a power-of-two so that you will be able to get TOAs usingget_TOAs.py
. -
For folding raw data, in general, I recommend setting
-nsub
to be between 50-250, with good values for most pulsar instruments being 64 or 128 (since most have powers-of-two channels). That is high enough frequency resolution to remove narrow band interference, if needed, but coarse enough frequency resolution so that you are integrating enough signal to see your pulsar in each subband! The default-nsub
tries to combine roughly 8-10 frequency channels into each subband. -
Also for folding raw data, remember that you can use an
rfifind
mask file using-mask
or specify channels to ignore via an-ignorechan
option. Those can really help with bad RFI. You can also use-zerodm
to fold with zero-DM subtraction. -
If you have a particularly faint pulsar candidate, the
-fine
option can help make sure that you latch on to it, rather than any other nearby noise peaks, since it searches a smaller region of p/pd/pdd/DM volume. -
The
-coarse
option doubles the normal size of each dimension of the p/pd/pdd/DM volume, and so is useful for searching for brighter pulsars where you don't have an exact ephemeris and so need to search a bit more. -
If you fold a slow pulsar (for most pulsar searches, that would be probably periods greater than 100 ms, or so), I recommend that you use the
-slow
option. That specifies the-fine
option, which helps prevent you from latching on to nearby RFI, and it also gives you 100 profile bins (i.e.-n 100
). Since most slow pulsars don't show much acceleration, you should also probably use-nopdsearch
, which will also help avoid latching on to interference. If you don't have a lot of dispersion, I also recommend using-nodmsearch
, or you can easily latch onto DM=0 interference. -
If you have a good timing-based parfile (i.e. ephemeris) for your pulsar, you should probably fold with
-timing
. That provides multiple advantages:- It folds with polycos and doesn't do any searching. That allows you to get
TOAs from the
.pfd
file using PRESTO'sget_TOAs.py
or PSRCHIVE'spat
. - It automatically sets
-dm
based on the parfile value. - It makes sure that the number of profile bins is a power of two (needed for PRESTO's implementation of FFTFIT to get TOAs).
- It sets the number of intervals to 60 by default, which is highly factorable
(and allows you to generate a variety of different numbers of TOAs using
get_TOAs.py
). - It uses
-fine
, which makes (IMO) the final plots look nicer. - Note that if you want to use the absolute phase information in a parfile
(which might be able to keep all of your pulse profiles aligned over many
days), you can use the
-absphase
option. You can't currently get good TOAs using data folded that way, though. If you need this functionality, please let me know. - Folding with
-par
only uses polycos to fold, and doesn't do any of the other things. So you probably don't want to use that ever.
- It folds with polycos and doesn't do any searching. That allows you to get
TOAs from the
-
If you are just doing a very rough and non-scientific fold of some data (i.e. a test pulsar), you can use the
-psr
option and specify a pulsar name.prepfold
will grab the best information for the pulsar from the ATNF Pulsar Database and fold with that (allowing searching). You cannot use.pfd
files folded in that way for TOAs. -
For some PSRFITs data, if you fold with
-noscales
and-nooffsets
you effectively get a bandpass-flattened fold. That can often be useful. -
Folding with binary parameters specified is almost never needed. If you have a binary, you should be folding with
accelsearch
-determined parameters from a search, or with-timing
if you have a good ephemeris. -
You should only use
-searchpdd
if you are not also searching in DM! -
You normally do not need to specify
-psrfits
or-filterbank
unless your filename is bizarre. -
If you notice after the fold that interference or some other issue has made the off-candidate
$\chi^2$ values far from 1 (meaning you get bad statistics and pulse significance, and sometimes see drops in the accumulated$\chi^2$ curve), you can attempt a post-facto fix of that usingshow_pfd -fixchi MYFILE.pfd
You can use the show_pfd
routine to re-generate the prepfold
plots and
statistics. That routine has -killparts
and -killsubs
options for zapping
individual, or ranges, of intervals in time, or subbands in frequency.
In the prepfold
plots, the numbering of the intervals and subbands is shown on
the right side of the phase-vs-time and phase-vs-frequency greyscale plots.
As an example, if you want to zap channels 10, 13-17, 20-30, and 94 from you plot, the command would be:
> show_pfd -killsubs 10,13:17,20:30,94 MYFILE.pfd
Note that ":" replaces "-" for the range, and there are no spaces! And also remember that the 1st channel/subband is 0 and not 1!
How reliable are the uncertainties in the best period and period-derivative(s) as determined by prepfold
?
The answer is that they can be pretty good if your data is well-behaved (i.e. gaussian noise, constant background levels, no interference, etc). But that isn't like most radio data, obviously.
In general, those errors come from combining the uncertainties for each fourier
harmonic of the signal (via equations derived by John Middleditch, and available
in Ransom, Eikenberry, &
Middleditch).
But they seem to underestimate the uncertainty for real signals a bit --
especially really bright ones. It helps if your stepsize in the P/Pdot plane is
very fine (use the -fine
option, for instance, and use more bins in the pulse
profile).
De-reddening the time series data also helps as that makes the noise more like gaussian white noise (which the error derivations assumed).
Bottom line: they are likely underestimated by between 20-100% for real data,
although very recently (June 2021) I pushed up a commit related to the
correlation of prepfold
's profile bins that likely fixes most of this issue.
I'm planning to test it with fake data in the future.
Yes! The prepfold.py
module has a bunch of methods to let you do many
interesting things with your ".pfd" files. Here is an example usage:
In [1]: import presto.prepfold as pp
In [2]: a = pp.pfd("GBT_Lband_PSR_4.62ms_Cand.pfd")
In [3]: a.use_for_timing() # can we use this pfd file for TOAs?
Out[3]: False
In [4]: a.dedisperse(16) # dedisperse at a DM of 16 pc/cm^3
In [5]: a.sumprof # return the summed profile of the dedispersed data
Out[5]:
array([5656790.31402834, 5652502.51116502, 5654345.94100014,
5656388.48898718, 5656145.69576171, 5655103.75782315,
5656093.92149403, 5654931.8004717 , 5654154.6155577 ,
5655499.99197552, 5658468.12322909, 5658051.62727781, ...
There are lots of plotting and analysis options. Just take a read through the
python/presto/prepfold.py
file, especially all the docstrings for the pfd
class methods and attributes.
Also, note that PSRCHIVE can load in and work with some ".pfd" files!
In the "Optimizing" portion of a prepfold
fold/search, it says "Warning!: This is 6535218305 trials! This will take forever!
", and it is! How should I do this?
When you fold raw data for a search candidate, by default it searches over
nearby DMs, periods, and p-dots ("pd"). That is a 3-dimensional search space,
where each dimension is a few times the number of bins in the pulse profile.
That is a lot of trials. If you then add searching over period second
derivatives ("pdd", i.e. -searchpdd
) that makes things orders of magnitudes
worse since it is now a much bigger 4-D search.
What you should do if you get a good search candidate is to find the DM where
the search signal was maximized and do a search over p, pd, and (if needed) pdd
on the time series. Once that is optimized, and the candidate looks good in
the time series fold, then fold the raw data using -nopsearch
and
-nodsearch
and only allow it to search over DM. You can specify the best
-pdd
on the command line if it was needed, but don't specify -searchpdd
.
Note that you can read the ASCII .pfd.bestprof
file from the time series fold
to get the optimized best values for -p
, -pd
, and -pdd
instead of reading
them off the prepfold
plot.
With pat
in PSRCHIVE, the left edge of the template is always phase=0, what does get_TOAs.py
use for phase=0?
By default, get_TOAs.py
uses the old "Princeton" method of rotating the
template, so that the phase of the fundamental harmonic of the template is set
to be at zero phase. So if you have a perfectly Gaussian template, then TOAs
will be referenced at its peak, which would be phase zero (since the fundamental
harmonic peaks at phase zero). This method lets you automatically do
"pretty-good" alignment of similar templates even if you don't manually
rotate/align them yourself.
You can get the same "don't rotate my template" behavior of pat
if you use the
--norotate
(or -r
) option to get_TOAs.py
. And the TOAs generated by pat
and get_TOAs.py
, in that case, should be nearly identical.
It is likely that you folded allowing searching, and get_TOAs.py
doesn't have
enough information to properly determine the correct spin parameters and timing.
Try running pfd_for_timing.py
on your ".pfd" file. If it returns True, you
should be able to get good TOAs. If not, you probably need to re-fold using
-nosearch
.
For de-dispersion, where in a frequency channel does PRESTO assume that the frequency for that channel corresponds? Bottom, center, or top of the channel?
PRESTO does the same as SIGPROC and most other pulsar software and assumes that the frequency corresponds to the center of the channel. In addition, PRESTO never shifts the highest frequency channel during de-dispersion, but only brings the lower frequency channels "forward" in time.
Note that this also means that the MJD Epoch mentioned in a .inf
file
corresponds to the UTC time of the start of the first sample of the center of
the highest frequency channel.
This has changed several times in the past. Since PRESTO processes things in
terms of blocks (i.e. number of samples in time), it is useful to have that
number be a big enough chunk of data that you can do useful work with it. But
you also want it to be small enough to give you useful granularity in rfifind
or in the prepfold
subints in time.
For PSRFITS data, the minimum block size is the duration of each PSRFITS SUBINT
row. But for SIGPROC filterbank files, it is hardcoded in PRESTO. If you have
a need (and there are some special reasons why you might), it is possible to
change that size. It is in $PRESTO/src/sigproc_fb.c
in the line:
s->spectra_per_subint = 2400; // use this as the blocksize
You then need to re-run "make".
I'm trying to downsample my data using prepdata
or prepsubband
and it is not letting me use certain downsampling factors. Why is that?
Your downsampling factor must be evenly divisible into the Spectra per subint
for your data, which can be seen using the readfile
command.
That number is the duration of each PSRFITS row, or for SIGPROC filterbank data, is hard-coded (see above question) at 2400 (which is highly factorable):
❯ factor 2400
2400: 2 2 2 2 2 3 5 5
So you can use -downsamp
of 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 25, ...
etc, which is pretty good.
If you use DDplan.py
to plan your de-dispersion by pointing it at your raw
data file, it will only give you downsample factors that your data supports.
PRESTO's ".dat" files are straight 32-bit floats (i.e. numpy.float32
, or
typecode "f"), and the ".fft" files are straight 2x32-bit floats treated as
complex numbers (i.e. numpy.complex64
, or typecode "F"). This means that you
can easily read them using numpy.fromfile
:
In [1]: import numpy as np
In [2]: dat = np.fromfile("myfile.dat", dtype=np.float32)
In [3]: fft = np.fromfile("myfile.fft", dtype=np.complex64)
For the FFT data, the zeroth frequency bin (i.e. fft[0]
) has the real-valued
DC component in the real portion, and the real-valued Nyquist frequency stored
as the imaginary portion! That is a common trick for packing FFT results so that
the number of complex data points is N/2, where N was the length of the input
time series, rather than N/2+1. The FFT amplitudes are also completely
unnormalized (equivalent to using the "raw" normalization mode with
explorefft
).
Note that you can read in the information in the related ".inf" files using either:
In [4]: from presto import presto
In [5]: inf1 = presto.read_inffile("myfile.inf")
Reading information from "myfile.inf"
In [6]: inf1
Out[6]: <presto.presto.prestoswig.infodata; proxy of <Swig Object of type 'INFODATA *' at 0x7f8061d1ac00>
which uses PRESTO's C-code and structures, as wrapped by swig
, or:
In [7]: from presto import infodata
In [8]: inf2 = infodata.infodata("myfile.inf")
In [9]: inf2
Out[9]: <presto.infodata.infodata at 0x7f80c9d2d400>
which is pure Python (and therefore much easier to modify, if needed).
Please let me know if you have other things that you think should go in this file!
Scott Ransom [email protected]