Regenerating distributions

From Testiwiki
Jump to: navigation, search


Question

How to regenerate a probability distribution from data that only contains mean, sd or variance, and a few quantiles?

Answer

Rationale

This Matlab code was obtained from Patrycja Gradowska

[1].

% This script recreates the probability distribution of a random variable whose
% only mean, variance/standard deviation and a couple of percentiles are known. 
% We start by constructing the initial cumulative distribution function (CDF) 
% of the variable by linear interpolation between known percentiles
% (including lower and upper limits defined by the intrinsic range). 
% In general, this distribution will not reproduce the mean and 
% standard deviation as provided by the data. Therefore, the constrained optimization 
% is further used to find the CDF that satisfies 
% the percentiles given and whose mean and variance are as close as possible 
% to the mean and variance from the data.

clear all;
clc;

% ----- Input data: known summary statistics of the variable distribution -----
% (1) mean and standard deviation
EX  =  3.538520795; 
SdX = 4.046127233;
VarX  =  SdX^2; 

% (2) 5th, 25th, 50th, 75th and 95th percentiles
Q = [0.486401325	0.606605988	1.945751342	4.944405412	10.52461027];
   
q05    =  Q(1);    
q25    =  Q(2);    
q50    =  Q(3);   
q75    =  Q(4);   
q95    =  Q(5);   

% ----- Determine the intrinsic range ----- 
ql=Q(1)-0.1*(Q(5)-Q(1));
qu=Q(5)+0.1*(Q(5)-Q(1));

if ql<0  %Intake is positive among users/consumers 
   ql=0.0001;
end

% ----- Linear interpolation among successive percentiles -----
Y = zeros(1001,1); 

Y(1:51) = interp1([0.03 50],[ql,q05],[0.03 1:50],'linear'); 
Y(51:251) = interp1([50 250],[q05,q25],[50:250],'linear'); 
Y(251:501) = interp1([250 500],[q25,q50],[250:500],'linear'); 
Y(501:751) = interp1([500 750],[q50,q75],[500:750],'linear'); 
Y(751:951) = interp1([750 950],[q75,q95],[750:950],'linear');
Y(951:1001) = interp1([950 999.98],[q95,qu],[950:999 999.98],'linear');

% ----- Optimization part -----

diff=@(Y)((mean(Y)-EX)^2+(var(Y)-VarX)^2); %Objective function

% ----- Optimization constraints -----
u=-1*ones(1000,1);
A=eye(1001)+diag(u,1);
A(1001,1001)= 0;
b=zeros(1,1001);
bb=[zeros(50,1);1;zeros(199,1);1;zeros(249,1);1;zeros(249,1);1;zeros(199,1);1;zeros(50,1)];
Aeq=diag(bb);
beq=[zeros(50,1);Q(1);zeros(199,1);Q(2);zeros(249,1);Q(3);zeros(249,1);Q(4);zeros(199,1);Q(5);zeros(50,1)];

Y0=Y; %Starting point for optimization
lb=0.0001*ones(1,1001); %Lower bound for distribution percentiles
ub=qu*ones(1,1001); %Upper bound for distribution percentiles

options=optimset('LargeScale','off','Display','iter', 'MaxFunEval', 500000, 'MaxIter', 500000);% Options for Matlab solver
[sol,fval,exitflag,output] = fmincon(diff,Y0,A,b,Aeq,beq,lb,ub,[],options); 

% ----- Saving data into a .dis file -----
data = [[0.03 1:999 999.98]' sol];
save BS_Herring_Pregnant_Women.dis data /ascii

See also

References

  1. Patrycja Gradowska: Regenerating distributions. Personal communication, email 21.4.2009