Difference between revisions of "SIPs and SLURPs"

From Testiwiki
(own thoughts)
(Procedure: using res table for textual distributions)

Revision as of 05:09, 1 November 2010

A stochastic information packet (SIP) is a format for describing random samples from probability distributions. A SIP is essentially a Monte Carlo sample of possible values, using a standard sample size, with a distribution representative of the possible outcomes. Importantly, the SIP is treated as the representation of both the value and the uncertainty of the quantity. To capture relationships between quantities, multiple SIPs are bundled into a SLURP (Stochastic Library Unit with Relationship Preserved). SIPs and SLURPs may be exchanged between people within an organization and used directly in decision models. By managing a standardized set of SIPs and SLURPs, an organization can combine probabilistic estimates from different groups within models in a coherent fashion.

The sample values within SIPs and SLURPs appear in a random order, as would be the case in a Monte Carlo sample, but the specific ordering of the samples is critical: it captures the relationships between quantities. Suppose one SIP represents the remaining cost to complete a construction project and another SIP is the remaining time to completion. In scenarios, or samples, with an exceptionally low cost, the remaining time will also usually be small. Likewise, cost overruns usually coincide with delays. These two SIPs are coherent when the ordering of samples captures this relationship, meaning that the nth point of remaining_cost should correspond to the same scenario as the nth point of remaining_time. Coherence in this fashion captures correlation between the quantities as well as other, more subtle dependencies that may not be apparent in terms of correlations. Remaining cost and time are SIPs that should be bundled within the same SLURP. [1]
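The role of sample order can be illustrated with a small sketch. The numbers, the common "difficulty" driver, and the helper `pearson` are assumptions made up for this illustration; only the names remaining_cost and remaining_time come from the text above.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical scenarios: one shared driver makes cost and time move together.
remaining_cost, remaining_time = [], []
for _ in range(1000):
    difficulty = rng.gauss(0, 1)
    remaining_cost.append(100 + 20 * difficulty + rng.gauss(0, 5))
    remaining_time.append(12 + 3 * difficulty + rng.gauss(0, 1))

# Coherent SIPs: the nth cost and the nth time belong to the same scenario.
coherent = pearson(remaining_cost, remaining_time)

# Reordering one SIP keeps its marginal distribution but loses the relationship.
shuffled_time = remaining_time[:]
rng.shuffle(shuffled_time)
broken = pearson(remaining_cost, shuffled_time)
```

Here `coherent` is strongly positive and `broken` is close to zero, although `shuffled_time` contains exactly the same values as `remaining_time`.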

Scope

SIPs and SLURPs are based on a commercial DIST 1.1 Standard by ProbiliTech. However, the same idea of packing random samples while retaining the original order of samples can be implemented using other means. What is a good way of packing random samples that is not bound by commercial standards?

Definition

Input

The method should take in a random sample of values (or text) and pack it efficiently with a minimal loss of information. The user should be able to adjust the critical parameters, for example:

  • The rounding precision prec (2 = two decimal digits, 0 = integer, -1 = rounded to tens)
  • The smallest value sampled min
  • The largest value sampled max
  • Number n of bins used. The default is 256 (2^8), which is used if n is omitted. However, prec, min, and max may constrain the number of possible values, and if that is smaller than n, the smaller number will be used.
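As a minimal sketch of how the parameters above could interact, the following function counts the distinct values that prec, min, and max allow and caps n accordingly. The function name `effective_bins` and the counting rule (one value per rounding step between min and max, both ends included) are assumptions for illustration, not part of the method as specified.

```python
def effective_bins(prec, lo, hi, n=256):
    """Number of bins actually used (assumed rule, see lead-in).

    prec: rounding precision (2 = two decimals, 0 = integer, -1 = tens)
    lo, hi: smallest and largest sampled value (min and max in the text)
    n: requested number of bins, 256 if omitted
    """
    if hi < lo:
        raise ValueError("max must not be smaller than min")
    step = 10.0 ** (-prec)  # distance between two neighbouring rounded values
    possible = int(round((hi - lo) / step)) + 1
    return min(n, possible)
```

For example, integer values between 1 and 6 allow only six distinct values, so `effective_bins(0, 1, 6)` gives 6 even though the default n is 256.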

Output

A text string with all necessary information to unpack the sample should be the output.

  • An identifier for a sip: SAMPLE
  • The parameter values for the four parameters.
  • If the distribution is a probability table with text values, a list of all possible values is given with sequence numbers.
  • The packed sequence of random draws. If n is 256, 8 bits will be used for each draw. These are changed into characters with the codes 33-288, which most character encoding systems are expected to understand unambiguously (strictly speaking, only codes up to 127 are ASCII). There might be some exceptions: for example, asc(256) and "|" should not be used. For efficient packing, n should be equal to or slightly smaller than some power of 2, as n = 257 and n = 512 both take 9 bits.

For example, the output can look like this:

SAMPLE|prec=2|min=260.47|max=294.37|n=16||eKu8W)=εñ"-$§▼eT4i.Mî║|

With n = 16, each draw takes 4 bits, i.e., two draws per character. This example has 23 characters and therefore contains 46 draws. The bar | is used as a separator between parameters.
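The packing described above can be sketched as follows. This is only one possible reading of the format: the function names, the bit order within a character, the restriction to at most 8 bits per draw, and the requirement that the sample fills whole characters are all assumptions. Draws are bin indices from 0 to n-1. The sketch does not avoid problematic characters such as "|" in the packed part; it only makes sure that parsing still works when they occur.

```python
OFFSET = 33  # first character code used, as in the text

def bits_per_draw(n):
    """Bits needed to store one bin index in the range 0..n-1."""
    return max(1, (n - 1).bit_length())

def pack(draws, n):
    """Pack bin indices into characters holding 8 bits each."""
    bits = bits_per_draw(n)
    if bits > 8:
        raise ValueError("this sketch supports at most 8 bits per draw")
    per_char = 8 // bits
    if len(draws) % per_char:
        raise ValueError("sample length must fill whole characters")
    chars = []
    for i in range(0, len(draws), per_char):
        value = 0
        for d in draws[i:i + per_char]:
            if not 0 <= d < n:
                raise ValueError("draw outside 0..n-1")
            value = (value << bits) | d  # earlier draws end up in higher bits
        chars.append(chr(OFFSET + value))
    return "".join(chars)

def unpack(text, n):
    """Inverse of pack: recover the bin indices in their original order."""
    bits = bits_per_draw(n)
    per_char = 8 // bits
    mask = (1 << bits) - 1
    draws = []
    for ch in text:
        value = ord(ch) - OFFSET
        group = [(value >> (bits * k)) & mask for k in range(per_char)]
        draws.extend(reversed(group))
    return draws

def to_sip(draws, prec, lo, hi, n):
    """Build a string laid out like the example above."""
    header = f"SAMPLE|prec={prec}|min={lo}|max={hi}|n={n}"
    return header + "||" + pack(draws, n) + "|"

def from_sip(s):
    """Split a SIP string into its parameter fields and bin indices."""
    header, _, body = s.partition("||")  # the first "||" ends the header
    fields = dict(item.split("=") for item in header.split("|")[1:])
    return fields, unpack(body[:-1], int(fields["n"]))  # drop the final "|"
```

With n = 16, `pack` produces one character for every two draws, so unpacking 23 characters yields 46 draws, in line with the example above.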

Procedure

The method is based on an assumption of Latin hypercube sampling. This means that the numbers drawn from a distribution are not random; instead, the whole distribution is divided into n bins that are equally spaced and have different probabilities. In effect, the distribution is treated as a frequency distribution with x1 observations from bin 1, x2 from bin 2, ... and xn from bin n. These values are clearly deterministic given the distribution, but they will be shuffled randomly. When the minimum, the maximum, and the number of bins are known, the values can be deduced. The packed part of the SIP only contains the order of values that come from different bins.
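A minimal sketch of how values could be deduced from bin numbers is given below. It assumes that the n bin values are equally spaced from min to max with both ends included, that bins are numbered from 0, and that values are rounded to prec; the function names are hypothetical and the text does not fix any of these details.

```python
def bin_values(prec, lo, hi, n):
    """Representative value of each bin, equally spaced from lo to hi."""
    if n == 1:
        return [round(lo, prec)]
    step = (hi - lo) / (n - 1)
    return [round(lo + k * step, prec) for k in range(n)]

def quantize(sample, lo, hi, n):
    """Replace each sampled value by the index of the nearest bin (0..n-1)."""
    if n == 1:
        return [0] * len(sample)
    step = (hi - lo) / (n - 1)
    return [min(n - 1, max(0, round((x - lo) / step))) for x in sample]

def restore(draws, prec, lo, hi, n):
    """Turn a sequence of bin indices back into values, keeping the order."""
    values = bin_values(prec, lo, hi, n)
    return [values[d] for d in draws]
```

Under these assumptions, a restored value differs from the original by at most half the bin spacing (plus rounding), and the order of the sample is preserved.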

Probability distributions are located in the cell table. Currently, there is a sip field, but this should perhaps be extended so that each parameter has a separate field (prec, min, max, n, sample=sip).

If the probability distribution is a classified distribution with text values, then each possible value (i.e., the result range) should be stored in the res table. Then, the sample only contains the obs value for the particular text result, and that value is used to pick the right result from the res table.
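The lookup for text-valued distributions can be sketched as below. The dictionary stands in for rows of the res table; its contents and the function names are hypothetical, and only the idea of picking the text result by its obs value comes from the text above.

```python
# Hypothetical res table rows: obs number -> text result.
res_table = {1: "low", 2: "medium", 3: "high"}

def encode_text_sample(sample, res_table):
    """Replace each text result by its obs number."""
    obs_of = {text: obs for obs, text in res_table.items()}
    return [obs_of[text] for text in sample]

def decode_text_sample(obs_sample, res_table):
    """Pick the text result for each obs number, keeping the sample order."""
    return [res_table[obs] for obs in obs_sample]

obs_sample = encode_text_sample(["medium", "low", "high", "high"], res_table)
decoded = decode_text_sample(obs_sample, res_table)
```

The sample itself then holds only small integers, which can be packed in the same way as bin numbers.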

See also

References