diff --git a/src/semeio/fmudesign/probabilit_idea.md b/src/semeio/fmudesign/probabilit_idea.md
new file mode 100644
index 00000000..8370561d
--- /dev/null
+++ b/src/semeio/fmudesign/probabilit_idea.md
@@ -0,0 +1,276 @@
+# Probabilit - A Python library for Monte Carlo Analysis
+
+**Probabilit** is a Python library for performing Monte Carlo-based risk analysis. It supports **Latin Hypercube Sampling** (LHS) with **Iman–Conover** correlation, providing an open-source alternative to Excel plugins such as @RISK, ModelRisk, and Crystal Ball. PyMC exists for Bayesian analyses (Markov chain Monte Carlo) but does not focus on LHS or on straightforward correlation specification. **Probabilit** aims to fill that gap, offering an intuitive API for users in Jupyter Notebooks, scripts, or custom GUIs.
+
+## Background and Motivation
+
+### Core Features
+
+- **Latin Hypercube Sampling** by default (with support for standard random sampling as well).
+- **Correlation** via direct Cholesky or Iman–Conover.
+- **Near-Correlation Matrix Adjustments**: logic similar to MATLAB's `nearcorr` for repairing an invalid (e.g., non-positive-definite) correlation matrix.
+- **Flexible Compositions**: Combine random variables in general expressions (e.g., `total_cost = material_cost + labor_cost * overhead`).
+- **Random Variables as Parameters**: Use one random variable as the parameter of another (e.g., `exp.Normal(mean=another_variable, std=1)`).
+- **Comprehensive Distribution Support**, including:
+  - **Continuous Distributions**:
+    - Normal (Gaussian)
+    - Log-Normal
+    - Uniform
+    - Triangular
+    - Beta
+    - Beta-PERT (Program Evaluation and Review Technique)
+    - Gamma
+    - Weibull
+    - Exponential
+    - Student's t
+  - **Discrete Distributions**:
+    - Poisson
+    - Binomial
+    - Negative Binomial
+    - Discrete Uniform
+  - **Data-Driven Distributions**:
+    - Kernel Density Estimation (KDE) from data
+- **Built-in Plotting**: Common plotting methods (histograms, box plots, correlation plots, etc.).
+- **Lazy Evaluation**: Sampling is deferred until results are explicitly needed (e.g., a call to `.sample()`, `.hist()`, or `.report()`).
+- **Graph Visualization**: Visualize the computational DAG.
+- **Report Generation**: Export a summary of the experiment, including variable definitions, correlation matrices, plots, and statistics, in HTML, PDF, or Markdown.
+- **Flexible Distribution Definition**:
+  - Define distributions using traditional parameters (mean, std, min, max, etc.)
+  - Define distributions using percentiles (P10, P50, P90)
+  - Automatic parameter conversion between different specification methods
+  - Support for both symmetric and skewed distributions via percentiles
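+
+Nothing below exists yet; this is a minimal sketch of the intended API, mirroring the test case at the end of this document (`from_percentiles`, `Uniform`, arithmetic composition, and `.sample()` all appear there). It shows percentile-based definition, composition, and deferred sampling:
+
+```python
+from probabilit import Probabilit
+
+exp = Probabilit(n=10_000, sampling_method="latin_hypercube")
+
+# A skewed distribution pinned down by two percentiles instead of mean/std.
+cost = exp.LogNormal.from_percentiles(p10=50_000, p90=150_000, name="Cost")
+rate = exp.Uniform(min_val=0.10, max_val=0.20, name="Overhead_Rate")
+
+# Composition builds a lazy expression node; nothing has been sampled yet.
+total = cost * (1 + rate)
+
+samples = total.sample()  # sampling happens here, on demand
+```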
+
+### Lazy Evaluation Requirements
+
+1. **Deferred Sampling**
+   All sampling operations happen only when results are actually needed (e.g., a call to `.sample()`).
+2. **On-Demand Computation**
+   If an expression depends on multiple random variables, samples for those variables are generated only when the expression is first accessed.
+3. **Re-Evaluation on Changes**
+   If a random variable's parameters or data change, any expressions or final results that depend on it are re-evaluated when next accessed.
+4. **Eager Evaluation Option**
+   On explicit request, samples can be generated and stored at once for all nodes (no caching is performed by default).
+
+### DAG Structure
+
+**Nodes**:
+- **Source Nodes**: Base random variables (e.g., Normal, Uniform, user-fitted distributions).
+- **Transform Nodes**: Mathematical operations or any custom function combining one or more nodes.
+- **Correlation Nodes**: Store correlation requirements (a matrix plus a method such as Cholesky or Iman–Conover) and ensure the final correlated samples comply with the matrix (adjusting via a `nearcorr`-like repair if necessary).
+
+Each node tracks:
+- **Distribution Parameters** (for source nodes)
+- **Correlation Requirements** (if relevant)
+- **Transform Functions** (for transform nodes)
+- **Sampling State** (e.g., sample size, sampling method)
+- **Dependency List** (which other nodes it depends on)
+- **Lazy Computation Triggers** (logic to generate or re-generate samples on demand)
+
+**Edges**:
+- **Data Flow**: How random variables and transform nodes feed into other transforms or final expressions.
+- **Correlation Requirements**: Connect variables that share a correlation specification.
+- **Re-Evaluation Triggers**: Ensure that when a node's definition changes, dependent nodes know they must re-generate samples on the next request.
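+
+As an illustration of the node concept (a minimal sketch, not the actual design — the names `Node`, `draw`, and `transform` are hypothetical): a source node draws fresh samples, while a transform node lazily pulls samples from its dependency list, so nothing is computed until `sample()` is called at the root:
+
+```python
+from dataclasses import dataclass, field
+from typing import Callable, Optional
+
+import numpy as np
+
+
+@dataclass
+class Node:
+    """Minimal DAG node: re-samples on every request (no caching)."""
+
+    name: str
+    parents: list = field(default_factory=list)  # dependency list
+    draw: Optional[Callable] = None       # source nodes: n -> samples
+    transform: Optional[Callable] = None  # transform nodes: parent samples -> samples
+
+    def sample(self, n: int) -> np.ndarray:
+        if self.draw is not None:  # source node
+            return self.draw(n)
+        # Transform node: lazily pull samples from every parent, then combine.
+        return self.transform(*(p.sample(n) for p in self.parents))
+
+
+rng = np.random.default_rng(42)
+a = Node("a", draw=lambda n: rng.normal(0.0, 1.0, n))
+b = Node("b", draw=lambda n: rng.uniform(0.0, 2.0, n))
+total = Node("total", parents=[a, b], transform=lambda x, y: x + y)
+
+samples = total.sample(1_000)  # sampling happens only here
+```
+
+A real implementation would also deduplicate shared parents (so a diamond-shaped DAG does not re-sample the same node twice per request) and attach correlation nodes, but the lazy-pull structure stays the same.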
+
+### Evaluation Process
+
+1. **Build the DAG** with all source variables, transforms, and correlation nodes.
+2. **Topological Sort** to determine execution order.
+3. **Collect Correlation Requirements** and decide on the correlation application strategy (Cholesky vs. Iman–Conover; a sketch of the latter follows this list).
+4. **Generate Samples When Accessed**: Only when the user calls `.sample()`, `.hist()`, or `.report()` does the library walk the DAG and produce samples for each required node.
+5. **No Caching**: Each time results are requested, sampling and transforms are performed anew (or stored only temporarily in the user's workspace).
+6. **Propagate Changes**: If a parameter changes, or new data is fitted, the DAG reflects that, ensuring the next call to `.sample()` or any plot method uses the updated settings.
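+
+Iman–Conover is a published rank-reordering method; the sketch below is one straightforward NumPy/SciPy rendering of it, not Probabilit code. It imposes a target rank correlation on independently sampled columns by reordering them to match the ranks of a correlated score matrix:
+
+```python
+import numpy as np
+from scipy.stats import norm, rankdata, spearmanr
+
+
+def iman_conover(X: np.ndarray, C: np.ndarray, seed=None) -> np.ndarray:
+    """Reorder each column of X so the rank correlation approximates C.
+
+    X: (n, d) samples with independent columns (e.g., from LHS).
+    C: (d, d) target correlation matrix (positive definite).
+    """
+    rng = np.random.default_rng(seed)
+    n, d = X.shape
+    # Van der Waerden (normal) scores, independently shuffled per column.
+    scores = norm.ppf(np.arange(1, n + 1) / (n + 1))
+    M = np.column_stack([rng.permutation(scores) for _ in range(d)])
+    # Remove M's incidental correlation, then impose the target C:
+    # corr(M) = Q Q^T and C = P P^T (lower-triangular Cholesky factors).
+    Q = np.linalg.cholesky(np.corrcoef(M, rowvar=False))
+    P = np.linalg.cholesky(C)
+    T = M @ np.linalg.inv(Q).T @ P.T  # corr(T) == C
+    # Rearrange each column of X to match the rank order of T's column.
+    Y = np.empty_like(X)
+    for j in range(d):
+        order = rankdata(T[:, j], method="ordinal").astype(int) - 1
+        Y[:, j] = np.sort(X[:, j])[order]
+    return Y
+
+
+# Two independent marginals, pushed toward a rank correlation of 0.7.
+rng = np.random.default_rng(0)
+X = np.column_stack([rng.lognormal(0, 0.5, 1000), rng.uniform(0, 1, 1000)])
+Y = iman_conover(X, np.array([[1.0, 0.7], [0.7, 1.0]]), seed=0)
+print(spearmanr(Y[:, 0], Y[:, 1]).statistic)  # close to 0.7
+```
+
+Because the columns are only reordered, the marginal distributions are preserved exactly; only the joint ordering changes — which is why the method pairs well with LHS.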
+
+### Report Generation
+
+A high-level method such as `exp.report()` can generate a default document with all the standard content:
+- **Configuration Overview** (sample size, sampling method, correlation strategy, etc.).
+- **Variable List** with definitions (names, distribution types, parameters).
+- **Correlation Matrices** used in the experiment.
+- **Summary Statistics** (mean, standard deviation, percentiles) for each variable or expression.
+- **Visualizations** (histograms, box plots, correlation plots) for quick inspection.
+- **Interpretations or Key Findings** (optional narrative about which variables significantly influence results).
+- **Output Formats**: HTML, PDF, Markdown.
+- **Customization**: Possibly via YAML or similar config files to tailor the final report.
+
+---
+
+With these updates, **Probabilit** retains *lazy evaluation* (generating samples on demand) but drops any notion of caching the generated samples. Each new request for results triggers fresh sampling (unless the user explicitly decides otherwise).
+
+The extended test case below exercises the full envisioned API in a realistic project context:
+
+```python
+from probabilit import Probabilit
+
+
+def test_comprehensive_project_analysis():
+    """
+    Extended test case using all distributions in a project context,
+    ensuring all variables contribute to final results.
+    """
+    exp = Probabilit(n=10_000, sampling_method="latin_hypercube")
+
+    # === Direct Material and Equipment Costs ===
+    steel_cost = exp.Normal.from_percentiles(
+        p10=800_000,
+        p90=1_200_000,
+        name="Steel_Cost"
+    )
+
+    maintenance_cost = exp.LogNormal.from_percentiles(
+        p10=50_000,
+        p90=150_000,
+        name="Maintenance_Cost"
+    )
+
+    fuel_cost = exp.Uniform(
+        min_val=1000,
+        max_val=1500,
+        name="Daily_Fuel_Cost"
+    )
+
+    equipment_lifetime = exp.Weibull(
+        shape=2.5,
+        scale=5000,
+        name="Equipment_Lifetime"
+    )
+
+    # Equipment replacement cost based on lifetime
+    equipment_replacement_cost = 100_000 * (10_000 / equipment_lifetime)
+
+    # === Labor and Productivity ===
+    productivity = exp.Triangular.from_percentiles(
+        p10=85,
+        p50=100,
+        p90=110,
+        name="Productivity_Factor"
+    )
+
+    worker_absences = exp.DiscreteUniform(
+        min_val=0,
+        max_val=5,
+        name="Daily_Worker_Absences"
+    )
+
+    # Base labor cost adjusted for productivity and absences
+    base_labor_cost = 1_000_000 * (100 / productivity) * (1 + worker_absences * 0.01)
+
+    # === Quality and Inspection Costs ===
+    quality_score = exp.Beta(
+        alpha=5,
+        beta=2,
+        name="Quality_Score"
+    )
+
+    quality_issues = exp.Poisson(
+        lambda_param=3.5,
+        name="Weekly_Quality_Issues"
+    )
+
+    successful_inspections = exp.Binomial(
+        n=10,
+        p=0.8,
+        name="Successful_Inspections"
+    )
+
+    certification_attempts = exp.NegativeBinomial(
+        r=3,
+        p=0.6,
+        name="Certification_Attempts"
+    )
+
+    # Quality-related costs
+    rework_cost = quality_issues * 5000 * (1 - quality_score)
+    inspection_cost = (10 - successful_inspections) * 2000
+    certification_cost = certification_attempts * 10_000
+
+    # === Schedule and Timing ===
+    schedule_duration = exp.BetaPERT(
+        min_val=8,
+        most_likely=10,
+        max_val=14,
+        name="Schedule_Duration"
+    )
+
+    repair_time = exp.Gamma(
+        shape=2,
+        scale=1.5,
+        name="Repair_Time"
+    )
+
+    time_between_failures = exp.Exponential(
+        rate=0.1,
+        name="Time_Between_Failures"
+    )
+
+    # Schedule impact costs
+    downtime_cost = (repair_time / time_between_failures) * schedule_duration * 5000
+    schedule_delay_cost = schedule_duration * fuel_cost
+
+    # === Market and Price Variations ===
+    price_variation = exp.StudentT(
+        df=5,
+        name="Price_Variation"
+    )
+
+    # Multiplicative market adjustment applied to the base costs
+    market_adjusted_cost = 1 + price_variation * 0.1
+
+    # === Overhead and Risk Factors ===
+    historical_overhead = [0.12, 0.15, 0.14, 0.13, 0.16, 0.15, 0.14]
+    overhead_rate = exp.KDE.from_data(
+        data=historical_overhead,
+        name="Overhead_Rate"
+    )
+
+    # === Final Cost Calculation ===
+    # Direct costs
+    direct_cost = (
+        steel_cost
+        + maintenance_cost
+        + equipment_replacement_cost
+        + base_labor_cost
+    ) * market_adjusted_cost
+
+    # Indirect costs
+    indirect_cost = (
+        rework_cost
+        + inspection_cost
+        + certification_cost
+        + downtime_cost
+        + schedule_delay_cost
+    )
+
+    # Total project cost
+    total_cost = (direct_cost + indirect_cost) * (1 + overhead_rate)
+
+    # === Validation and Reporting ===
+    results = total_cost.sample()
+
+    assert len(results) == 10_000
+    assert all(cost > 0 for cost in results)
+
+    p10_value = total_cost.percentile(10)
+    p50_value = total_cost.percentile(50)
+    p90_value = total_cost.percentile(90)
+    assert p10_value < p50_value < p90_value
+
+    # Generate outputs
+    total_cost.hist(title="Total Project Cost Distribution")
+    exp.tornado_plot(total_cost, title="Sensitivity Analysis")
+
+    report = exp.report(
+        output_format="html",
+        title="Comprehensive Project Risk Analysis",
+        description="""
+        Complete risk analysis with all variables contributing to final cost:
+        - Direct costs (materials, equipment, labor)
+        - Quality and inspection impacts
+        - Schedule effects
+        - Market variations
+        - Risk adjustments
+        """
+    )
+
+    assert report is not None
+    exp.save("comprehensive_risk_model.yml")
+```
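+
+Finally, while the `Probabilit` API above is hypothetical, the sampling machinery it would rest on exists today. A minimal sketch using SciPy's `scipy.stats.qmc` module: uniform LHS draws are mapped through inverse CDFs to obtain the target marginals (the normal parameters below are chosen so that P10 ≈ 800,000 and P90 ≈ 1,200,000, echoing `Steel_Cost`, and the triangular on [85, 110] with mode 100 echoes `Productivity_Factor`):
+
+```python
+import numpy as np
+from scipy import stats
+from scipy.stats import qmc
+
+n = 10_000
+sampler = qmc.LatinHypercube(d=2, seed=42)
+u = sampler.random(n)  # n points in [0, 1)^2, stratified along each axis
+
+# Inverse-CDF mapping from uniform strata to the desired marginals.
+steel_cost = stats.norm.ppf(u[:, 0], loc=1_000_000, scale=156_000)
+productivity = stats.triang.ppf(u[:, 1], c=0.6, loc=85, scale=25)
+
+# The LHS property: each of the n equal-width bins contains exactly one sample.
+assert np.array_equal(np.sort(np.floor(u[:, 0] * n).astype(int)), np.arange(n))
+```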