Object-oriented programming in Opasnet

From Testiwiki
Revision as of 08:13, 13 April 2012 by Jouni (talk | contribs) (Methods)

Object-oriented programming is an approach where programs (or, in Opasnet, typically assessment models) have a modular structure in such a way that each part is considered as a separate object that has specific properties and interacts with other objects in standard ways.

Question

How should object-oriented programming be utilised in Opasnet in such a way that

  • it has seamless connections to R-tools,
  • it is easy to understand by non-expert users and contributors,
  • it uses the variable structure and other information structures (e.g. universal object) used in open assessment, and
  • it enables standards for typical processes in environmental health assessments (such as distribution modelling, life tables, decision optimising, etc.).

Answer

Structure of objects

Objects have two implementations: a wiki page in Opasnet, and an S4 class object called oavariable (open assessment variable) in R-tools. The wiki page is the user-friendly interface, and oavariable is the versatile format for efficient, standardised modelling. The default direction for data is long (using the terminology in the merge function).
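
As a rough illustration of the long direction, the sketch below converts a small wide table into long form using base R only. The table contents, the column names Age, Male and Female, and the helper name to_long are invented for this example and are not part of the Opasnet R-tools; the column names Parameter and Result follow the defaults given for the locations attribute on this page.

```r
# Hypothetical wide table: one index column (Age), two observation columns.
wide <- data.frame(
  Age    = c("child", "adult"),
  Male   = c(1.2, 3.4),
  Female = c(1.1, 3.0),
  stringsAsFactors = FALSE
)

# to_long() is an illustrative helper, not an Opasnet function.
# 'locations' names the observation columns; all other columns are indices.
to_long <- function(wide, locations) {
  index_cols <- setdiff(names(wide), locations)
  pieces <- lapply(locations, function(loc) {
    data.frame(
      wide[index_cols],         # index columns are repeated for each location
      Parameter = loc,          # which observation column the value came from
      Result    = wide[[loc]],  # the observation itself
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

long <- to_long(wide, locations = c("Male", "Female"))
print(long)
```

In the long form every row holds exactly one observation in Result, and the former column headers become locations of the new Parameter index.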

--# : Should we have an attribute "target" that defines the target of the variable estimate? For example, "height" may estimate the whole variation of heights of individuals in a population, or it may estimate the mean height of the population. Somehow the population, the target that is the basic unit (individual in this case), and the statistical parameter should be explicitly described. Can this be done by using vector attributes that have a value for each index column in the sample? Is this an index-specific issue, or variable-specific? --Jouni 07:50, 9 April 2012 (EEST)

For each attribute, the entries below give what it contains, how it is implemented in the wiki, and how it is implemented in R-tools as an S4 class object oavar.

These attributes are needed in R-tools.

data
  • What it contains: Observations, expert judgement, discussions, and other pieces of information.
  • In the wiki: Subheading under Rationale.
  • In R-tools: Slot data = "data.frame". The data frame must contain Obs and Unit columns, at least one index column, and Result as the observation column. However, it must not contain an Iter column.

sample
  • What it contains: A random sample from the distribution (default is 10000 iterations).
  • In the wiki: Not shown.
  • In R-tools: Slot sample = "data.frame". The data frame must contain columns Iter, Obs, Unit, one column for each index, and Result (if there is originally more than one observation column, they are molten with the melt function).

marginal
  • What it contains: A Boolean vector with the size = number of indices = ncol(data) - 1. TRUE if an index is indexing a marginal distribution in sample, FALSE if joint distribution. The difference is that in a marginal distribution there are n iterations for each location of the index, while in a joint distribution there are altogether n iterations in such a way that the frequencies of locations match their probabilities.
  • In the wiki: Not implemented in wiki.
  • In R-tools: Slot marginal = "vector". Especially for indices with many locations, a joint distribution needs much less memory.

formula
  • What it contains: A computer code or algorithm to derive the answer from rationale and objects listed in dependencies. The formula may assume a deterministic dependency (e.g. y <- k*x + b), a conditional probability structure (y ~ dnorm(x, sd)), or a rank correlation matrix.
  • In the wiki: Subheading under Rationale, often using <rcode> tags.
  • In R-tools: Slot formula = "list". There may be several competing algorithms. Each of them is described (as a function?) as one entry in the list. When implementing the formula, the algorithm that is implemented is randomly selected for each iteration with equal probability unless otherwise specified in formula.prob.

These attributes are not (yet) implemented in R-tools.

question
  • What it contains: A research question that defines the topic of the object.
  • In the wiki: First main heading.
  • In R-tools: Slot question = "character". Contains the question as text.

answer
  • What it contains: The current best answer to the question, shown as text, data table, or distribution.
  • In the wiki: Second main heading; contains a single data table. NOTE! The data table is actually under rationale/data but often it is the same as answer. The actual answer is precisely described by distribution and sample (see below).
  • In R-tools: Only sub-attributes are implemented.

locations
  • What it contains: List of names for observation columns in the wide format (columns that are not indices), i.e. the same as the measure.vars parameter in the function melt.
  • In the wiki: The same as the locations parameter in the t2b tag.
  • In R-tools: Slot locations = "vector". A vector with all observation column names. Can be integer (observation column position) or string (observation column name). By default, the column name for locations is "Parameter" and the column name for the actual observations is "Result".
  # : Do we actually need this in R, if a variable is always molten using melt before actual use? --Jouni 14:54, 4 April 2012 (EEST)

observation
  • What it contains: An identifier of an individual when the answer consists of a group of individuals.
  • In the wiki: Obs column (usually implicit because the default is that each row is an observation) in the data table.
  • In R-tools: Obs column in data.frames data and sample. Not explicitly needed as a slot in S4 object.
  --# : How do we operate, if there are different Obs in different variables and they are merged? a) Repeat shorter Obs's until they reach the length of the longest. b) Refuse to operate unless the user renames or removes all but one Obs column; however, if Obs only has 1 observation, it can be temporarily removed during merge. --Jouni 14:54, 4 April 2012 (EEST)
  # : I prefer b). a) is too abstract so that the user is just confused. --Jouni 14:54, 4 April 2012 (EEST)

iteration
  • What it contains: An identifier of a probabilistic run or iteration. Sometimes it is also called a possible world or realisation.
  • In the wiki: Iter column in the data table (data usually not shown probabilistically in wiki).
  • In R-tools: Iter column in the data.frames sample (and rarely in data). Not explicitly needed as a slot in S4 object.

distribution
  • What it contains: A joint probability distribution (with indices as dimensions) describing the answer mathematically.
  • In the wiki: Not shown.
  • In R-tools: Slot distribution = "distribution?". A distribution created with e.g. dnorm(0,1).
  --# : We don't know yet how to actually implement this and how the indices are included. --Jouni 14:54, 4 April 2012 (EEST)

rationale
  • What it contains: Any information that is needed to convince a critical reader that the answer is good.
  • In the wiki: Third main heading.
  • In R-tools: Only sub-attributes are implemented.

unit
  • What it contains: The measurement unit(s) that are used in the answer to measure the topic. The format used is kg m^2 /s^2 where a space implies a multiplication.
  • In the wiki: Subheading under Rationale with plain text. Also mentioned in the data table with parameter unit. If data table rows differ in units, there must be a Unit index.
  • In R-tools: There is no separate slot for unit. The unit is merged with data and subsequently with sample. This must be done, because if rows are reordered, it is impossible to attach units to the right rows based on separate information.

dependencies
  • What it contains: List of upstream objects that are causally related to this object.
  • In the wiki: Subheading under Rationale with a list of links to upstream (and sometimes downstream) wiki pages.
  • In R-tools: Slot dependencies = "vector". A character vector where entries have the format "Op_fi:Vaativuusluokkien keskipalkat". If the wiki identifier is omitted, the default is op_en.

formula.prob
  • What it contains: A list of probabilities assigned to the competing algorithms in formula. The default is that each has an equal probability.
  • In the wiki: A detail in <rcode> code.
  • In R-tools: Slot formula.prob = "vector". Should have the same size as formula.
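
To make the slot descriptions above more concrete, here is a minimal sketch of how such a class could be declared with a validity check. It is an illustration under assumptions, not the actual R-tools code: the page declares marginal, dependencies and formula.prob as "vector", and the sketch narrows them to logical, character and numeric to match their descriptions. The helper demo_variable and the example data are hypothetical.

```r
library(methods)

# Sketch only. Slot names follow the attribute descriptions on this page;
# the three "vector" slots are narrowed to basic types (an assumption).
setClass(
  "oavariable",
  representation(
    data         = "data.frame",
    sample       = "data.frame",
    marginal     = "logical",
    formula      = "list",
    dependencies = "character",
    formula.prob = "numeric"
  ),
  validity = function(object) {
    problems <- character(0)
    lacking <- setdiff(c("Obs", "Unit", "Result"), names(object@data))
    if (length(lacking) > 0)
      problems <- c(problems, paste("data lacks column:", lacking))
    if ("Iter" %in% names(object@data))
      problems <- c(problems, "data must not contain an Iter column")
    if (ncol(object@sample) > 0) {  # an empty sample is allowed before updating
      lacking <- setdiff(c("Iter", "Obs", "Unit", "Result"), names(object@sample))
      if (length(lacking) > 0)
        problems <- c(problems, paste("sample lacks column:", lacking))
    }
    if (length(object@formula.prob) > 0 &&
        length(object@formula.prob) != length(object@formula))
      problems <- c(problems, "formula.prob must have the same size as formula")
    if (length(problems) == 0) TRUE else problems
  }
)

# Hypothetical convenience constructor that fills every slot explicitly.
demo_variable <- function(data, marginal) {
  new("oavariable", data = data, sample = data.frame(), marginal = marginal,
      formula = list(), dependencies = character(0), formula.prob = numeric(0))
}

height <- demo_variable(
  data.frame(Obs = 1:2, Unit = "cm", Sex = c("Male", "Female"),
             Result = c(178, 165), stringsAsFactors = FALSE),
  marginal = TRUE
)

# A data slot with an Iter column is rejected by the validity check.
bad <- tryCatch(
  demo_variable(data.frame(Iter = 1, Obs = 1, Unit = "cm", Result = 1),
                marginal = logical(0)),
  error = function(e) conditionMessage(e)
)
```

Putting the column requirements into the validity function means that a malformed object fails at creation time rather than later inside a merge or a plot.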

Methods

R code should be developed in such a way that there are object-specific implementations of critical functions. The user should see straightforward content, and all messy indexing etc. should happen behind the scenes.

These methods should be implemented for oavariable objects.

  • show, print: show the data slot.
  • plot: plot the sample, showing one (the first by default) marginal index with all locations and all other marginal indices with the first location only.
  • tidy: applies to data: remove the id column; add Obs and Iter columns if they do not exist; change the direction from wide to long.
  • createSample: create sample directly from data using interp.input.
  • GetSample, GetData: extract sample and data from the object, respectively.
  • Ops: applies to sample: merge two oavariables based on index columns, then perform the Ops operation to the result columns.
  • standardUnits: Based on units, transform the result column of data to SI units using Unit transformations table; then update unit.
  • demarginalize: turn one specified index from marginal to joint format. This function has parameter jointlimit: if the length(index_i) * n > jointlimit, then index_i is demarginalized. The default for jointlimit is 1000000. n is the length of Iter.
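
The Ops item in the list above (merge on index columns, then operate on the result columns) can be sketched as follows. This is a hedged illustration rather than the R-tools implementation: the class is reduced to its sample slot, the Obs and Unit columns are left out for brevity, and the variables exposure and slope with their values are invented.

```r
library(methods)

# Reduced stand-in for oavariable: only the sample slot is kept (assumption).
setClass("oavariable", representation(sample = "data.frame"))

setMethod("Ops", signature(e1 = "oavariable", e2 = "oavariable"),
  function(e1, e2) {
    # Shared columns other than Result act as the merge keys (Iter, indices).
    by_cols <- setdiff(intersect(names(e1@sample), names(e2@sample)), "Result")
    merged <- merge(e1@sample, e2@sample, by = by_cols)
    # merge() renames the clashing Result columns to Result.x and Result.y.
    # callGeneric() applies whichever operator was used (+, *, >, ...).
    merged$Result <- callGeneric(merged$Result.x, merged$Result.y)
    merged$Result.x <- NULL
    merged$Result.y <- NULL
    new("oavariable", sample = merged)
  }
)

# Hypothetical variables: exposure has an Area index, slope does not.
exposure <- new("oavariable", sample = data.frame(
  Iter = c(1, 2, 1, 2), Area = c("A", "A", "B", "B"), Result = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
))
slope <- new("oavariable", sample = data.frame(
  Iter = c(1, 2), Result = c(10, 20)
))

risk <- exposure * slope
print(risk@sample)
```

Because the merge happens inside the method, the user only writes exposure * slope. The .x and .y suffixes that merge() adds to clashing columns are presumably related to the naming rule about ".x" given under the important rules.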

Oavariable functions available: make, update, plot, print, oarbind, merge, callGeneric Ops


Example code


Important rules:

  1. Never use ".x" in the name of an oavariable.
  2. The observation column must always be named "Result".
  3. The iteration (aka sample, run) column must always be named "Iter".
  4. In the data slot, there is always a column "Obs". It defines the observations that are from the same individual. If omitted, each row is assumed to be from a different individual.
  5. The data slot must not have an Iter column; the sample may or may not have one, but typically it does.
  6. Variable n is reserved for number of iterations.
  7. Column name Source is reserved for the source of the result (either Data or Formula).
  8. After using make.oavariable(), use update() to update the sample of the oavariable based on both Data and Formula. This is not done automatically, because running the formula part of the code often causes problems.
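
Rule 8 leaves the formula step to update(). As a hedged sketch of one part of that step, the code below shows how one of several competing algorithms could be chosen separately for each iteration, weighted by formula.prob as the formula attribute describes. The two example algorithms, the input x and the weights 0.8 and 0.2 are invented; this is not the actual update() code.

```r
set.seed(1)

n <- 1000  # number of iterations (the name n is reserved for this, rule 6)

# Two hypothetical competing algorithms for the same variable.
formula <- list(
  linear    = function(x) 2 * x + 1,
  quadratic = function(x) x^2
)

# Probabilities of the competing algorithms. Setting this to NULL would make
# sample() pick each algorithm with equal probability, the stated default.
formula.prob <- c(0.8, 0.2)

x <- runif(n)  # made-up upstream input, one value per iteration

# Choose one algorithm for each iteration.
choice <- sample(seq_along(formula), size = n, replace = TRUE,
                 prob = formula.prob)

# Evaluate the chosen algorithm iteration by iteration.
Result <- vapply(seq_len(n),
                 function(i) formula[[choice[i]]](x[i]),
                 numeric(1))

# Source marks these rows as coming from the formula (rule 7).
sample_df <- data.frame(Iter = seq_len(n), Source = "Formula", Result = Result)

# Share of iterations that used each algorithm.
print(table(names(formula)[choice]) / n)
```

With this approach the resulting sample is a mixture over the competing algorithms, so disagreement between them shows up as additional spread in Result.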

Rationale

See also

References

