Difference between revisions of "SIPs and SLURPs"

From Testiwiki
(first draft based on the homepage)
 
m (Procedure)
 
(4 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{encyclopedia|moderator=Jouni}}
+
{{method|moderator=Jouni}}
 
A '''stochastic information packet (SIP)''' is a format for describing random samples from probability distributions. A SIP is essentially a Monte Carlo sample of possible values, using a standard sample size, with a distribution representative of the possible outcomes.  Importantly, the SIP is treated as the representation of both the value and the uncertainty of the quantity.  To capture relationships between quantities, multiple SIPs are bundled into a '''SLURP (Stochastic Library Unit with Relationship Preserved)'''.  SIPs and SLURPs may be exchanged between people within an organization and used directly in decision models.  When an organization manages a standardized set of SIPs and SLURPs, probabilistic estimates from its different groups can be combined within models in a coherent fashion.
  
 
The sample values within SIPs and SLURPs appear in a random order, as would be the case in a Monte Carlo sample, but the specific ordering of the samples is critical: it captures the relationships between quantities.  Suppose one SIP represents the remaining cost to complete a construction project and another SIP is the remaining time to completion.  In scenarios, or samples, with an exceptionally low cost, the remaining time will also usually be small.  Likewise, cost overruns usually coincide with delays.  These two SIPs are coherent when the ordering of samples captures this relationship, meaning that the nth point of remaining_cost corresponds to the same scenario as the nth point of remaining_time.  Coherence in this fashion captures correlation between the quantities as well as other, more subtle dependencies that may not be apparent from correlations alone.  Remaining cost and time are SIPs that should be bundled within the same SLURP. <ref>[http://www.lumina.com/ana/SIPsandSLURPs.htm SIPs and SLURPs with Analytica]</ref>
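
The role of the ordering can be illustrated with a small R sketch. The numbers and the shared <code>delay_factor</code> are invented for illustration only; the variable names follow the example in the text. Shuffling one SIP leaves its own distribution unchanged but destroys the pairing of scenarios.

```r
set.seed(1)
n <- 1000

# Hypothetical scenarios: a shared driver makes cost and time move together.
delay_factor   <- rlnorm(n, meanlog = 0, sdlog = 0.3)
remaining_cost <- 100 * delay_factor * runif(n, 0.9, 1.1)
remaining_time <- 12 * delay_factor * runif(n, 0.9, 1.1)

# Coherent SIPs: the nth values of both vectors belong to the same scenario.
coherent_cor <- cor(remaining_cost, remaining_time)

# Reordering one SIP keeps its marginal distribution but breaks the pairing.
shuffled_time <- sample(remaining_time)
shuffled_cor  <- cor(remaining_cost, shuffled_time)

c(coherent = coherent_cor, shuffled = shuffled_cor)
```

In this sketch the coherent pair is strongly correlated, while the shuffled pair is close to uncorrelated, even though <code>shuffled_time</code> contains exactly the same values as <code>remaining_time</code>.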

==Question==

SIPs and SLURPs are based on the commercial DIST 1.1 Standard by ProbiliTech. However, the same idea of packing random samples while retaining the original order of samples can be implemented by other means. '''What is a good way of packing random samples that is not bound by commercial standards?'''

==Answer==

===Input===

The method should take in a random sample of values (or text) and pack it efficiently with minimal loss of information. The user should be able to adjust the critical parameters, for example:
* v = The version ''v'' of SIP that is used. Current version: 1. Default: 1.
* prec = The rounding precision ''prec'' (2 = two decimal digits, 0 = integer, -1 = rounded to tens, 99 = 16 significant digits at whatever position). Default: 99.
* log = Whether the bins are evenly spaced on a logarithmic scale (TRUE) or on an arithmetic scale (FALSE). Default: FALSE.
* min = The smallest value sampled ''min''. Default: 0.
* max = The largest value sampled ''max''. Default: 1.
* bins = Number of ''bins'' used. Default: 256 (2<sup>8</sup>). SIP uses log<sub>2</sub>(''bins'') bits for each draw, rounded up to a whole number of bits. Parameters ''prec, min, max'' may constrain the number of possible values to fewer than ''bins''; the smaller number will be used.
* n = Number ''n'' of samples (or iterations) drawn. Default: 1000.
* infminus = Whether the value -Inf is possible. Default: FALSE
* infplus = Whether the value Inf is possible. Default: FALSE {{comment|# |Not sure if these are needed for anything.|--[[User:Jouni|Jouni]] 18:38, 20 August 2011 (EEST)}}
* draw = The actual random draw packed as a text string.
* levels = When the draw is actually a list of text values (or a list of unevenly spaced numbers), this parameter contains the different levels of the draw (i.e., the list of possible answers). The format is the one common in [[R]]: c("First possible answer","Second possible answer","Third possible answer"). The user does not supply this parameter; it is derived from the ''sample'' if and only if the sample contains text values, in which case ''prec, min, max, bins'' are not used.

===Output===

The output should be a text string with all the information necessary to unpack the sample:
* An identifier for a SIP: SIP()
* The values of the parameters needed.
* If the distribution is a probability table with text values, a list of all possible values, given with sequence numbers.
* The packed sequence of random draws.

For example, the output can look like this:

SIP(v=1,prec=2,min=260.47,max=294.37,bins=16,n=46,draw="eKu8W)=εñ'-$§▼eT4i.Mî║")

With ''bins'' = 16, each draw takes 4 bits, i.e., two draws per character. This example has 23 characters and therefore contains 46 draws. A comma is used as a separator between parameters.
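
The arithmetic above can be checked with a few lines of R. The helper names <code>bits_per_draw</code> and <code>chars_needed</code> are hypothetical and only serve this illustration; the sketch assumes that one character carries 8 bits.

```r
# Hypothetical helpers, not part of any SIP specification.
# Assumption: each stored character carries exactly 8 bits.
bits_per_draw <- function(bins) ceiling(log2(bins))
chars_needed  <- function(n, bins) ceiling(n * bits_per_draw(bins) / 8)

bits_per_draw(16)       # bits used by one draw when bins = 16
8 / bits_per_draw(16)   # draws that fit into one 8-bit character
chars_needed(46, 16)    # characters needed for n = 46 draws
```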

Other methods for storing random sample information can also be used.
; PARAM(mean= , sd=, min=, max=...): for storing statistical parameters of a distribution.
; QUANT(value, value, value...): for storing the quantiles of a distribution. Values must be in ascending order. The cumulative probability distribution is cut into evenly spaced quantiles, and their number is determined by the number of values given. The first and last values are min and max, respectively.
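
As a rough sketch of the QUANT idea, evenly spaced quantiles can be computed in R with <code>quantile()</code>. The sample, the number of stored values <code>k</code>, and the rounding to four significant digits are assumptions made for this illustration only.

```r
set.seed(2)
x <- rnorm(1000, mean = 10, sd = 2)   # hypothetical sample

k <- 5                                # number of stored values, including min and max
probs <- seq(0, 1, length.out = k)    # evenly spaced cumulative probabilities
q <- quantile(x, probs = probs, names = FALSE)

# Build the text representation; rounding is an arbitrary choice for this sketch.
quant <- paste0("QUANT(", paste(signif(q, 4), collapse = ","), ")")
quant
```

Unlike a SIP, this representation keeps only the shape of the distribution; the order of the original draws, and with it any relationship to other quantities, is lost.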

===Procedure===

{{todo|Take a look at this. Do you think it matches what we talked about on Thursday? Who might have time to finish writing the R code? Would Pauli perhaps be interested in trying? --[[User:Jouni|Jouni]] 18:38, 20 August 2011 (EEST)|Einari Happonen, Juha Villman, Teemu Rintala, Pauli Ordén}}

The method is based on evenly spaced bins. This means that the numbers drawn from a distribution are first rounded to these bins. In effect, the distribution is treated as a frequency distribution with x<sub>1</sub> observations from bin 1, x<sub>2</sub> from bin 2, ... and x<sub>''bins''</sub> from the last bin. When ''min, max, bins'' (or alternatively the ''levels'') are known, the exact value for each bin can be deduced. The packed part of the SIP (i.e., ''draw'') simply contains the bin number of each sampled value.
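
The binning step can be sketched in R as follows. This is only one possible reading: it assumes that the bin values are evenly spaced from ''min'' to ''max'' inclusive (a step of (max-min)/(bins-1)) and that each draw is assigned to the nearest bin value. The sample itself is invented.

```r
set.seed(3)
x <- runif(10, 260, 295)     # hypothetical sample
bins <- 16

# Assumed bin values: evenly spaced from min to max, both included.
bin_values <- seq(min(x), max(x), length.out = bins)

# Nearest bin for each draw, numbered from 1; the order of draws is preserved.
bin_number <- round((x - min(x)) / (max(x) - min(x)) * (bins - 1)) + 1

# Frequency view: number of observations in each bin (x_1, ..., x_bins).
counts <- tabulate(bin_number, nbins = bins)

# Unpacking only needs min, max, bins and the bin numbers.
x_rounded <- bin_values[bin_number]
max_error <- max(abs(x - x_rounded))
```

Under these assumptions the rounding error of any single draw is at most half the bin width.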

Probability distributions are located in the [[Opasnet base structure|cell]] table.

If the probability distribution is a classified distribution with text values, then each possible value is listed in the ''levels'' parameter. The ''draw'' then simply contains the packed level number for each iteration.
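
For text values, R's factors already provide this split into levels and level numbers. A minimal sketch with invented answers:

```r
answers <- c("Yes", "No", "Maybe", "No", "Yes", "Yes")   # hypothetical draws

f <- factor(answers)
lev <- levels(f)                # the distinct values: "Maybe" "No" "Yes"
level_number <- as.integer(f)   # one number per iteration: 3 2 1 2 3 3

# Unpacking: look up each level number in the list of levels.
restored <- lev[level_number]
```

Only <code>lev</code> and <code>level_number</code> need to be stored; the original draws, in their original order, are recovered exactly.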

If ''bins'' is 256, 8 bits will be used for each draw. These are changed into characters with the code points 33-288; the offset of 33 skips the control characters and the space. Only code points up to 127 belong to ASCII, so the character encoding of the stored string must be fixed for the higher values to be read unambiguously. There may be some exceptions, like asc(256) and '"', that should not be used. For efficient packing, ''bins'' should be exactly or slightly less than some power of 2, as ''bins'' = 257 and ''bins'' = 512 both take 9 bits.
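
The following R sketch illustrates both points: the number of bits needed for a given number of bins, and the mapping of 8-bit values to characters with an offset of 33. The helper <code>bits_needed</code> and the example bin numbers are hypothetical; <code>intToUtf8()</code> and <code>utf8ToInt()</code> are base R functions and produce UTF-8 strings, which is one possible choice of encoding.

```r
# Bits per draw for different numbers of bins (hypothetical helper).
bits_needed <- function(bins) ceiling(log2(bins))
bits <- sapply(c(256, 257, 512), bits_needed)

# Zero-based bin numbers of four 8-bit draws, shifted by 33 into code points.
bin_index   <- c(0, 1, 94, 255)
code_points <- bin_index + 33

packed   <- intToUtf8(code_points)    # one character per draw
unpacked <- utf8ToInt(packed) - 33    # back to bin numbers

# Code points above 127 are outside ASCII and depend on the chosen encoding.
non_ascii <- code_points > 127
```

Note that the second draw maps to code point 34, the double quote, which is one of the characters the text above suggests avoiding.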

'''R code for encoding a SIP'''

{{attack|# |The code does not contain even a full idea yet.|--[[User:Jouni|Jouni]] 18:38, 20 August 2011 (EEST)}}

{{comment|# |Functions I'd recommend to be used: ''as.character(as.raw(x))'', where x is a number from 0  to 255 (the bin); and ''cut(sample, nbins)'' for binning ''sample'' into ''nbins'' evenly spaced bins.|--[[User:Teemu R|Teemu R]] 11:14, 22 August 2011 (EEST)}}

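<rcode>
# This function creates, from a random sample, a SIP text string that can be stored into Opasnet Base.
# Bin numbers are packed into bits, and each byte is stored as the character with code point byte + 33.
# If the string is later parsed as R code, '"' and '\' in draw need escaping (not handled here).
SIP.make <- function(sample, prec=99, log=FALSE, min=0, max=1, bins=256, infminus=FALSE, infplus=FALSE) {
  v <- 1
  n <- length(sample) # sample must be a vector.
  # infminus and infplus are not used for anything at the moment.
  if(is.character(sample) || is.factor(sample)) {
    sample <- as.factor(sample)
    levels <- levels(sample)
    bins <- length(levels)
    index <- as.integer(sample) - 1 # Zero-based level number of each draw
    pars <- paste("levels=c(", paste('"', levels, '"', sep="", collapse=","), ")", sep="")
  } else {
    min <- min(sample) # As in the first draft, min and max are taken from the sample.
    max <- max(sample)
    x <- if(log) log(sample) else sample
    # Open question: prec is stored but not applied. How should rounding be made coherent with the bins?
    # If prec=0 then all bins should be integers. How is this done if e.g. min=0, max=10, bins=25?
    r <- range(x)
    index <- if(r[1] == r[2]) rep(0, n) else round((x - r[1]) / (r[2] - r[1]) * (bins - 1)) # Nearest bin, zero-based
    pars <- paste("prec=",prec,",log=",log,",min=",min,",max=",max,",bins=",bins, sep="")
  }
  nbits <- ceiling(log2(bins)) # Bits per draw; assumes bins >= 2
  bits <- unlist(lapply(as.integer(index), function(i) intToBits(i)[seq_len(nbits)])) # Least significant bit first
  bits <- c(bits, raw((-length(bits)) %% 8)) # Pad with zero bits to whole bytes
  draw <- intToUtf8(as.integer(packBits(bits, "raw")) + 33L) # One character per byte
  sip <- paste("SIP(v=",v,",",pars,",n=",n,",draw=\"",draw,"\")", sep="") # This row piles up all the parameters created.
  sip
}
</rcode>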

'''R code for decoding a SIP'''

{{attack|# |The code contains a rough idea about what it should do but it does not work yet.|--[[User:Jouni|Jouni]] 18:38, 20 August 2011 (EEST)}}

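<rcode>
# This function translates a SIP text string into a random sample, i.e. a data.frame
# whose first column is run and it contains 1:n (n=samplesize) and
# whose second column is result and it contains the random sample.
# Assumed packing: code point of each character = byte + 33, least significant bit first.
SIP <- function(v=1, prec=99, log=FALSE, min=0, max=1, bins=256, n=1000, infminus=FALSE, infplus=FALSE, levels=NULL, draw) {
  if(is.null(levels)) {
    if(log) {min <- log(min); max <- log(max)} # Go to logarithmic scale
    levels <- seq(min, max, length.out=bins) # Find the values for each bin
    if(log) {levels <- exp(levels)} # Go back to arithmetic scale
  } else {
    bins <- length(levels) # With text values, the number of bins is the number of levels
  }
  nbits <- ceiling(log2(bins)) # Bits per draw; assumes bins >= 2
  bytes <- utf8ToInt(draw) - 33L # Change draw into bytes
  bits <- as.integer(unlist(lapply(bytes, function(b) intToBits(b)[1:8]))) # Change bytes into a stream of zeros and ones
  bits <- matrix(bits[seq_len(n * nbits)], nrow=nbits) # Split the stream into pieces of nbits bits, one column per draw
  temp <- colSums(bits * 2^(seq_len(nbits) - 1)) + 1 # Convert those into bin numbers starting from 1
  sample <- data.frame(run=1:length(temp), result=levels[temp]) # Pick the correct level from levels for each iteration.
  sample
}
</rcode>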
  
 
==See also==
 
 
* [http://lumina.com/wiki/index.php/SipEncode SipEncode] [http://lumina.com/wiki/index.php/SipDecode SipDecode] (In Analytica wiki, requires password)
 
 
* [http://www.21stcenturyrisk.com/ 21st Century Risk Modeling]
 

==Keywords==

Random sample, probability, distribution
  
 
==References==
 
  
 
<references/>
 

==Related files==

{{mfiles}}

Latest revision as of 08:14, 22 August 2011

A stochastic information packet (SIP) is a format for describing random samples from probability distributions. A SIP is essentially a Monte Carlo sample of possible values, using a standard sample size, with a distribution representative of the possible outcomes. Importantly, the SIP is treated as the representation of both the value and the uncertainty of the quantity. To capture relationships between quantities, multiple SIPs are bundled into a SLURP (Stochastic Library Unit with Relationship Preserved). SIPs and SLURPs may be exchanged between people within an organization and used directly in decision models. When an organization manages a standardized set of SIPs and SLURPs, probabilistic estimates from its different groups can be combined within models in a coherent fashion.

The sample values within SIPs and SLURPs appear in a random order, as would be the case in a Monte Carlo sample, but the specific ordering of the samples is critical: it captures the relationships between quantities. Suppose one SIP represents the remaining cost to complete a construction project and another SIP is the remaining time to completion. In scenarios, or samples, with an exceptionally low cost, the remaining time will also usually be small. Likewise, cost overruns usually coincide with delays. These two SIPs are coherent when the ordering of samples captures this relationship, meaning that the nth point of remaining_cost corresponds to the same scenario as the nth point of remaining_time. Coherence in this fashion captures correlation between the quantities as well as other, more subtle dependencies that may not be apparent from correlations alone. Remaining cost and time are SIPs that should be bundled within the same SLURP. [1]

Question

SIPs and SLURPs are based on the commercial DIST 1.1 Standard by ProbiliTech. However, the same idea of packing random samples while retaining the original order of samples can be implemented by other means. What is a good way of packing random samples that is not bound by commercial standards?

Answer

Input

The method should take in a random sample of values (or text) and pack it efficiently with minimal loss of information. The user should be able to adjust the critical parameters, for example:

  • v = The version v of SIP that is used. Current version: 1. Default: 1.
  • prec = The rounding precision prec (2 = two decimal digits, 0 = integer, -1 = rounded to tens, 99 = 16 significant digits at whatever position). Default: 99.
  • log = Whether the bins are evenly spaced on a logarithmic scale (TRUE) or on an arithmetic scale (FALSE). Default: FALSE.
  • min = The smallest value sampled min. Default: 0.
  • max = The largest value sampled max. Default: 1.
  • bins = Number of bins used. Default: 256 (2^8). SIP uses log2(bins) bits for each draw, rounded up to a whole number of bits. Parameters prec, min, max may constrain the number of possible values to fewer than bins; the smaller number will be used.
  • n = Number n of samples (or iterations) drawn. Default: 1000.
  • infminus = Whether the value -Inf is possible. Default: FALSE
  • infplus = Whether the value Inf is possible. Default: FALSE --# : Not sure if these are needed for anything. --Jouni 18:38, 20 August 2011 (EEST)
  • draw = The actual random draw packed as a text string.
  • levels = When the draw is actually a list of text values (or a list of unevenly spaced numbers), this parameter contains the different levels of the draw (i.e., the list of possible answers). The format is the one common in R: c("First possible answer","Second possible answer","Third possible answer"). The user does not supply this parameter; it is derived from the sample if and only if the sample contains text values, in which case prec, min, max, bins are not used.

Output

The output should be a text string with all the information necessary to unpack the sample:

  • An identifier for a SIP: SIP()
  • The values of the parameters needed.
  • If the distribution is a probability table with text values, a list of all possible values, given with sequence numbers.
  • The packed sequence of random draws.

For example, the output can look like this:

SIP(v=1,prec=2,min=260.47,max=294.37,bins=16,n=46,draw="eKu8W)=εñ'-$§▼eT4i.Mî║")

With bins = 16, each draw takes 4 bits, i.e., two draws per character. This example has 23 characters and therefore contains 46 draws. A comma is used as a separator between parameters.

Other methods for storing random sample information can also be used.

PARAM(mean= , sd=, min=, max=...)
for storing statistical parameters of a distribution.
QUANT(value, value, value...)
for storing the quantiles of a distribution. Values must be in ascending order. The cumulative probability distribution is cut into evenly spaced quantiles, and their number is determined by the number of values given. The first and last values are min and max, respectively.

Procedure

TODO: {{#todo:Take a look at this. Do you think it matches what we talked about on Thursday? Who might have time to finish writing the R code? Would Pauli perhaps be interested in trying? --Jouni 18:38, 20 August 2011 (EEST)|Einari Happonen, Juha Villman, Teemu Rintala, Pauli Ordén|}}


The method is based on evenly spaced bins. This means that the numbers drawn from a distribution are first rounded to these bins. In effect, the distribution is treated as a frequency distribution with x_1 observations from bin 1, x_2 from bin 2, ... and x_bins from the last bin. When min, max, bins (or alternatively the levels) are known, the exact value for each bin can be deduced. The packed part of the SIP (i.e., draw) simply contains the bin number of each sampled value.

Probability distributions are located in the cell table.

If the probability distribution is a classified distribution with text values, then each possible value is listed in the levels parameter. The draw then simply contains the packed level number for each iteration.

If bins is 256, 8 bits will be used for each draw. These are changed into characters with the code points 33-288; the offset of 33 skips the control characters and the space. Only code points up to 127 belong to ASCII, so the character encoding of the stored string must be fixed for the higher values to be read unambiguously. There may be some exceptions, like asc(256) and '"', that should not be used. For efficient packing, bins should be exactly or slightly less than some power of 2, as bins = 257 and bins = 512 both take 9 bits.


R code for encoding a SIP

# : The code does not contain even a full idea yet. --Jouni 18:38, 20 August 2011 (EEST)

--# : Functions I'd recommend to be used: as.character(as.raw(x)), where x is a number from 0 to 255 (the bin); and cut(sample, nbins) for binning sample into nbins evenly spaced bins. --Teemu R 11:14, 22 August 2011 (EEST)


R code for decoding a SIP

# : The code contains a rough idea about what it should do but it does not work yet. --Jouni 18:38, 20 August 2011 (EEST)


See also

Keywords

Random sample, probability, distribution

References

Related files
