pyhande.lazy
Tools for the lazy amongst us: automation of common HANDE analysis tasks.
- pyhande.lazy.find_starting_iteration_mser_min(data, md, start_max_frac=0.9, n_blocks=100, verbose=None, end=None)
Find the best iteration to start analysing CCMC/FCIQMC data based on the MSER minimization scheme.
Warning
Use with caution, check whether output is sensible and adjust parameters if necessary.
This function estimates the optimal starting iteration using the MSER minimization heuristic. The method chooses the starting iteration \(d\) that minimizes the evaluation function MSER(\(d\)) = \(\sum_{i=1}^{n-d} ( X_{i+d} - X_{mean}(d) )^2 / (n-d)^2\). Here, \(n\) is the length of the time series, \(X_i\) is ‘sum H_0j N_j’ / ‘N_0’ at the \(i\)-th step, and \(X_{mean}(d)\) is the average of \(X_i\) after the \(d\)-th step.
- Parameters:
data (pandas.DataFrame) – Calculation output for a FCIQMC or CCMC calculation.
md (dict) – Metadata corresponding to the calculation in data.
n_blocks (int) – This analysis takes a long time when \(n\) is large. We therefore pick \(d\) every ‘n_blocks’ samples, calculate MSER(\(d\)) for these values only, and choose the optimal starting iteration from among them.
start_max_frac (float) – MSER(\(d\)) may oscillate or become unreasonably small when \(n-d\) is small. We therefore calculate MSER(\(d\)) only for \(d\) < (\(n\) * start_max_frac) and choose the optimal starting iteration within this range of \(d\).
verbose (int) – Inactive. This variable has no effect.
end (int or None) – Last iteration included in analysis. If None, the last iteration included is the last iteration of the data set.
- Returns:
starting_iteration – Iteration from which to start reblocking analysis for this calculation.
- Return type:
integer
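As an illustration of the heuristic described for find_starting_iteration_mser_min(), the following is a minimal pure-Python sketch of MSER minimization on a generic time series. It is not pyhande's implementation: the helper names mser and mser_min_start and the parameter n_candidates are hypothetical, the even spacing of candidate values of d is an assumption (it does not claim to reproduce the exact meaning of n_blocks), and the toy series stands in for the ‘sum H_0j N_j’ / ‘N_0’ data.

```python
import random


def mser(x, d):
    """MSER(d): squared deviations of x[d:] from its mean, over (n - d)^2."""
    tail = x[d:]
    mean = sum(tail) / len(tail)
    return sum((v - mean) ** 2 for v in tail) / len(tail) ** 2


def mser_min_start(x, start_max_frac=0.9, n_candidates=100):
    """Return the candidate d < n * start_max_frac with the smallest MSER(d)."""
    d_max = int(len(x) * start_max_frac)
    # Assumption: candidates are evenly spaced over [0, d_max).
    step = max(1, d_max // n_candidates)
    return min(range(0, d_max, step), key=lambda d: mser(x, d))


# Toy series: a linear warm-up transient followed by stationary noise.
rng = random.Random(0)
series = [5.0 * (1 - i / 200) for i in range(200)]
series += [rng.gauss(0.0, 0.1) for _ in range(800)]
start = mser_min_start(series)
```

On such a series the selected start should fall near the end of the transient, which is why the warning above about checking the output still applies: the result depends on the noise and on the candidate spacing.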
- pyhande.lazy.lazy_hybrid(calc, md, start=0, end=None, batch_size=1)
New post-analysis on zero-temperature QMC calculations.
Note
std_analysis() is recommended unless custom processing is required before blocking analysis is performed.
This scheme hybridizes two post-analysis methods, the AR model and Straatsma. The former is comparatively good at estimating the statistical error for shorter time series, the latter for longer ones. This method simply takes the larger of the statistical errors given by the two methods. The mathematical details of both methods are explained in an upcoming paper.
- Parameters:
calc (pandas.DataFrame) – Zero-temperature QMC calculation output.
md (dict) – Metadata for the calculation in calc.
start – See std_analysis().
end – See std_analysis().
batch_size (int) – The energy time series is coarse-grained by averaging batch_size sequential samples into one sample, and the statistical error is calculated for the coarse-grained time series.
- Returns:
info (collections.namedtuple()) – See std_analysis().
[todo] - Catch ValueError from statsmodels when there is too little data.
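The two ingredients described for lazy_hybrid(), coarse-graining by batch_size and keeping the larger of two error estimates, can be sketched as follows. This is only an illustration: coarse_grain and hybrid_error are hypothetical helpers, dropping a trailing incomplete batch is an assumption, and the error values passed to hybrid_error are placeholders rather than results of an AR-model or Straatsma analysis.

```python
def coarse_grain(series, batch_size):
    """Average each run of batch_size sequential samples into one sample."""
    n_batches = len(series) // batch_size
    # Assumption: a trailing incomplete batch is dropped.
    return [
        sum(series[i * batch_size:(i + 1) * batch_size]) / batch_size
        for i in range(n_batches)
    ]


def hybrid_error(err_ar, err_straatsma):
    """Keep the larger (more conservative) of the two error estimates."""
    return max(err_ar, err_straatsma)


batched = coarse_grain([1.0, 2.0, 3.0, 4.0, 5.0], batch_size=2)
```

With batch_size=1 the series is returned unchanged, which matches the default in the signature above.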
- pyhande.lazy.std_analysis(datafiles, start=None, end=None, select_function=None, extract_psips=False, reweight_history=0, mean_shift=0.0, calc_inefficiency=False, verbosity=1, starts_reweighting=None, extract_rep_loop_time=False, analysis_method='reblocking', warmup_detection='hande_org')
Perform a ‘standard’ analysis of HANDE output files.
- Parameters:
datafiles (list of strings) – names of files containing HANDE QMC calculation output.
start (int or None) – iteration after which the blocking analysis is performed. The start iteration itself is not included in the analysis. If start is None, then attempt to automatically determine a good iteration using find_starting_iteration().
end (int or None) – iteration until which the blocking analysis is performed. The end iteration is included in the analysis. If end is None, the last iteration included is the last iteration of the data set.
select_function (function) – function which returns a boolean mask for the iterations to include in the analysis. Not used if set to None (default). Overrides start. See below for examples.
extract_psips (bool) – also extract the mean number of psips from the calculation.
reweight_history (integer) – reweight in an attempt to remove population control bias. According to [Umrigar93] this should be set to be a few correlation times.
mean_shift (float) – prevent the weights from becoming too large.
calc_inefficiency (bool) – determines whether inefficiency should be calculated.
verbosity (int) – values greater than 1 print out blocking information when automatically finding the starting iteration. 0 and 1 print out the starting iteration if automatically found. Negative values print out nothing from the automatic starting point search.
starts_reweighting (list of floats) – used by the reweighting_graph function to pass more than one starting iteration
extract_rep_loop_time (bool) – also extract the mean time taken per report loop from the calculation.
analysis_method (string) – determines which post-analysis method is used to estimate the statistical error. Currently ‘reblocking’ and ‘hybrid’ are available.
warmup_detection (string) – determines which method is used to decide the starting iterations to be discarded before calculating the statistical error. Currently ‘hande_org’ and ‘mser_min’ are available.
- Returns:
info – raw and analysed data, consisting of:
- metadata, data: from pyhande.extract.extract_data_sets(). If data consists of several concatenated calculations, then the only metadata object is from the first calculation.
- data_len, reblock, covariance: from pyblock.pd_utils.reblock(). The projected energy estimator (evaluated by pyhande.analysis.projected_energy()) is included in reblock.
- opt_block, no_opt_block: from pyhande.analysis.qmc_summary(). A ‘pretty-printed’ estimate string is included in opt_block.
- Return type:
list of collections.namedtuple()
Examples
The following are equivalent and will extract the data from the file called hande.fciqmc.out, perform a blocking analysis from the 10000th iteration onwards, calculate the projected energy estimator and find the optimal block size from the blocking analysis:
>>> std_analysis(['hande.fciqmc.out'], 10000)
>>> std_analysis(['hande.fciqmc.out'],
...              select_function=lambda d: d['iterations'] > 10000)
References
- Umrigar93
Umrigar et al., J. Chem. Phys. 99, 2865 (1993).
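To make the select_function argument of std_analysis() concrete, here is a small sketch of how such a function produces a boolean mask on a toy pandas.DataFrame. The data values and the select helper are invented for illustration. Only the 'iterations' column name is taken from the std_analysis() example.

```python
import pandas as pd

# Toy stand-in for calculation output; values are invented for illustration.
calc = pd.DataFrame({
    'iterations': [5000, 10000, 15000, 20000],
    'Shift': [0.0, -0.10, -0.12, -0.11],
})


def select(d):
    """Boolean mask keeping iterations after 10000 (10000 itself is excluded)."""
    return d['iterations'] > 10000


mask = select(calc)
selected = calc[mask]
```

The strict inequality mirrors the documented convention that the start iteration is not included in the analysis.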
- pyhande.lazy.check_key(calc, key)
Check if this key is present in calc, and if not, append “_1”.
- Parameters:
calc (pandas.DataFrame) – Zero-temperature QMC calculation output.
key (str) – key name to check in calc.
- Returns:
key_ – modified key name.
- Return type:
str
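The behaviour described for check_key() can be sketched in a single expression. This is an illustration written from the one-line description above, not the library source, and check_key_sketch is a hypothetical name. It relies on the fact that the in operator tests column labels on a pandas.DataFrame and keys on a dict, so a plain dict is used here as a stand-in.

```python
def check_key_sketch(calc, key):
    """Return key if calc contains it, otherwise key with '_1' appended."""
    return key if key in calc else key + '_1'


# A dict stands in for the DataFrame; `in` checks its keys.
present = check_key_sketch({'Shift': []}, 'Shift')
renamed = check_key_sketch({'Shift_1': []}, 'Shift')
```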
- pyhande.lazy.zeroT_qmc(datafiles, reweight_history=0, mean_shift=0.0)
Extract zero-temperature QMC (i.e. FCIQMC and CCMC) calculations.
Reweighting information is added to the calculation data if requested.
Note
std_analysis() is recommended unless custom processing is required before blocking analysis is performed.
- Parameters:
datafiles – See std_analysis().
reweight_history – See std_analysis().
mean_shift – See std_analysis().
- Returns:
calcs (list of pandas.DataFrame) – Calculation outputs for just the zero-temperature/ground-state QMC calculations contained in datafiles.
metadata (list of dict) – Metadata corresponding to each calculation in calcs.
- pyhande.lazy.lazy_block(calc, md, start=0, end=None, select_function=None, extract_psips=False, calc_inefficiency=False, extract_rep_loop_time=False)
Standard blocking analysis on zero-temperature QMC calculations.
Note
std_analysis() is recommended unless custom processing is required before blocking analysis is performed.
- Parameters:
calc (pandas.DataFrame) – Zero-temperature QMC calculation output.
md (dict) – Metadata for the calculation in calc.
start – See std_analysis().
end – See std_analysis().
select_function – See std_analysis().
extract_psips – See std_analysis().
calc_inefficiency – See std_analysis().
extract_rep_loop_time – See std_analysis().
- Returns:
info – See std_analysis().
- pyhande.lazy.filter_calcs(outputs, calc_types)
Select calculations corresponding to a given list of calculation types.
- Parameters:
outputs (list of (dict, pandas.DataFrame or pandas.Series)) – List of (metadata, data) tuples for each calculation, as created in pyhande.extract.extract_data_sets().
calc_types (iterable of strings) – Calculation types (e.g. ‘FCIQMC’, ‘CCMC’, etc.) to select.
- Returns:
filtered – As in pyhande.extract.extract_data_sets() but containing only the desired calculations.
- Return type:
list of (dict, pandas.DataFrame or pandas.Series)
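The selection performed by filter_calcs() can be pictured with the sketch below. It is an assumption-laden illustration, not the library code: filter_calcs_sketch is a hypothetical name, and the metadata key 'calc_type' used to look up the calculation type is an assumption about how the metadata dict is laid out. Plain lists stand in for the data objects.

```python
def filter_calcs_sketch(outputs, calc_types, type_key='calc_type'):
    """Keep the (metadata, data) tuples whose calculation type is wanted."""
    wanted = set(calc_types)
    # Assumption: the calculation type is stored under metadata[type_key].
    return [(md, data) for (md, data) in outputs if md.get(type_key) in wanted]


# Toy outputs: metadata dicts paired with placeholder data.
outputs = [
    ({'calc_type': 'FCIQMC'}, [1, 2]),
    ({'calc_type': 'DMQMC'}, [3, 4]),
    ({'calc_type': 'CCMC'}, [5, 6]),
]
filtered = filter_calcs_sketch(outputs, ['FCIQMC', 'CCMC'])
```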
- pyhande.lazy.concat_calcs(metadata, data)
Concatenate data from restarted calculations to analyse together.
- Parameters:
metadata (list of dicts) – Extracted metadata for each calculation.
data (list of pandas.DataFrame) – Output of each QMC calculation.
- Returns:
calcs_metadata (list of dicts) – Metadata for each calculation, with duplicates from restarting dropped.
calcs (list of pandas.DataFrame) – Output of each QMC calculation, with parts of a restarted calculation combined.
- pyhande.lazy.find_starting_iteration(data, md, frac_screen_interval=300, number_of_reblockings=30, number_of_reblocks_to_cut_off=1, pos_min_frac=0.8, verbose=0, show_graph=False, end=None)
Find the best iteration to start analysing CCMC/FCIQMC data.
Warning
Use with caution, check whether output is sensible and adjust parameters if necessary.
First, consider only data from when the shift begins to vary. We are interested in finding the minimum in the fractional error in the error of the shift, weighted by 1/sqrt(number of data points left). Both the error in the shift and the error in that error vary as 1/sqrt(number of data points to analyse). If we were looking for the minimum in either of these quantities, the minimum would therefore be biased towards lower iterations, as more data points are then included in the analysis. However, we have noticed that the error in the shift and its error fluctuate as fewer iterations remain to analyse, which means that the search for the minimum could easily get trapped in a local minimum. We therefore consider their ratio. Because one is divided by the other in the fractional error, the 1/sqrt(number of data points to analyse) dependence cancels, so it is reintroduced explicitly as a weight. To be more conservative, we also find the minimum in the weighted fractional error in the error of # H psips, N_0 and sum H_0j N_j. We then take, out of these four minima, the one at the highest number of iterations.
The best estimate of the iteration to start the blocking analysis is found by:
1. Discard data during the constant shift phase.
2. Estimate the weighted fractional error in the error of the shift, # H psips, N_0, sum H_0j N_j, by blocking the remaining data \(n\) times, where the blocking analysis considers the last \(1-i/f\) fraction of the data and where \(i\) is the number of blocking analyses already performed, \(n\) is number_of_reblockings and \(f\) is frac_screen_interval.
3. Find the iteration which gives the minimum estimate of the weighted fractional error in the error of the shift, numerator of projected energy, reference and total population. We then focus on the minimum out of these four minima which is at the highest number of iterations. If this is in the first pos_min_frac fraction of the blocking attempts, go to 4, otherwise repeat 2 and perform an additional number_of_reblockings attempts.
4. To be conservative, discard the first number_of_reblocks_to_cut_off blocks from the start iteration, where each block corresponds to roughly the autocorrelation time, and return the resultant iteration number as the estimate of the best place to start blocking from.
- Parameters:
data (pandas.DataFrame) – Calculation output for a FCIQMC or CCMC calculation.
md (dict) – Metadata corresponding to the calculation in data.
frac_screen_interval (int) – Number of intervals into which the iterations, from where the shift started to vary to the end, are divided. Has to be greater than zero.
number_of_reblockings (int) – Number of reblocking analyses done, in steps set by the width of an interval, before it is checked whether a suitable minimum error in the error has been found. Has to be greater than zero.
number_of_reblocks_to_cut_off (integer) – Number of reblocking analysis blocks to cut off in addition to the data before the best iteration with the lowest error in the error. Has to be non-negative. It is highly recommended not to set this to zero.
pos_min_frac (float) – The minimum has to be in the first pos_min_frac part of the tested data to be taken as the true minimum. Has to be greater than a small number (here 0.00001) and can at most be equal to one.
verbose (int) – If greater than 1, prints out which blocking attempt is currently being performed.
show_graph (bool) – Determines whether a window showing the shift vs iteration graph pops up highlighting where the minimum was found and - after also excluding some reblocking blocks - which iteration was found as the best starting iteration to use in reblocking analyses.
end (int or None) – Last iteration included in analysis. If None, the last iteration included is the last iteration of the data set.
- Returns:
starting_iteration – Iteration from which to start reblocking analysis for this calculation.
- Return type:
integer
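The quantities used in the search described for find_starting_iteration() can be sketched as below. This is a simplified illustration and not the library implementation: the helper names are hypothetical, n_points is assumed to count only the data after the shift starts to vary, and the error values would in practice come from a blocking analysis rather than being supplied by hand.

```python
import math


def screening_starts(n_points, frac_screen_interval=300, number_of_reblockings=30):
    """Start offsets of the first number_of_reblockings blocking attempts.

    Attempt i discards the first i/f of the data, i.e. it analyses the
    last 1 - i/f fraction, where f is frac_screen_interval.
    """
    return [
        (i * n_points) // frac_screen_interval
        for i in range(number_of_reblockings)
    ]


def weighted_fractional_error(err, err_of_err, n_left):
    """Fractional error in the error, weighted by 1/sqrt(points left)."""
    return (err_of_err / err) / math.sqrt(n_left)


def latest_minimum(scores_by_quantity):
    """Find the attempt minimising each quantity; return the latest of them."""
    return max(
        min(range(len(scores)), key=scores.__getitem__)
        for scores in scores_by_quantity.values()
    )


starts = screening_starts(3000, frac_screen_interval=300, number_of_reblockings=3)
best_attempt = latest_minimum({'Shift': [3.0, 1.0, 2.0], 'N_0': [5.0, 4.0, 1.0]})
```

Taking the latest of the per-quantity minima reflects the conservative choice described in the text: the start is placed where all monitored quantities have passed their minimum.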
- pyhande.lazy.reweighting_graph(datafiles, start=None, verbosity=1, mean_shift=0.0)
Plot a graph of reweighted projected energy vs. reweighted factor W.
Detecting biases by reweighting is described in [Umrigar93] and [Vigor15]; see pyhande.weight for details. The graph produced by this function is similar to figure 4 in [Vigor15].
A similar function has been published in Neufeld, V., & Thom, A. J. Research data and further information supporting “A study of the dense uniform electron gas with high orders of coupled cluster” [Dataset]. https://doi.org/10.17863/CAM.14336 under Attribution 4.0 International (CC BY 4.0).
- Parameters:
datafiles (list of strings) – names of files containing HANDE QMC calculation output.
start (int or None) – iteration from which the blocking analysis is performed. If None, then attempt to automatically determine a good iteration using find_starting_iteration().
verbosity (int) – values greater than 1 print out blocking information when automatically finding the starting iteration. 0 and 1 print out the starting iteration if automatically found. Negative values print out nothing from the automatic starting point search.
mean_shift (float) – prevent the weights from becoming too large.
References
- Umrigar93
C.J. Umrigar et al., J. Chem. Phys. 99, 2865 (1993).
- Vigor15
W.A. Vigor et al., J. Chem. Phys. 142, 104101 (2015).
Thanks to Will Vigor for original implementation.