Difference between revisions of "Modelling in Opasnet"

Revision as of 09:31, 14 July 2013


Opasnet modelling environment is an open, web-based platform for collaboratively developing numerical models.

Opasnet wiki-R modelling platform

Introduction

Opasnet has integrated tools for building and running easily accessible statistical models in the wiki. The platform is completely modular, and individual variables are designed to be fully reusable.

Main features

  • Wiki - pages provide a natural analogy to variables in a statistical model. They contain descriptive information as well as the necessary metadata (e.g. scope).
  • R - an open-source statistical programming language comparable to, for example, Matlab. The Opasnet wiki has an extension (R-tools) for including executable R scripts on any page. The output is displayed in HTML format as an applet or in a separate tab.
  • Database - MongoDB is used to store variable-related data.
  • Interfaces - all these components need to work together, and we have built an interface solution for each combination: R-tools for running wiki-integrated R scripts, table2base for uploading wiki tables to the database, the OpasnetBase wiki extension for showing database entries in the wiki, and the opbase script family for communication between R and the database.
  • OpasnetUtils - an R package (library) that contains tools for building mathematical models within our modelling framework, described below in detail. OpasnetUtils is completely:
    • modular
    • recursive
    • customizable: a knowledgeable user can take over any automated model part and resume automation for the rest of the model

Usage

Mathematical models consist of variables, which may be known or unknown and/or can be derived from other variables using further models. Modelling in Opasnet is variable-centric: since variables are defined universally, they should be reusable (partly or wholly) in all other models. Naturally, more complex models with extremely large datasets will need more customised and static definitions to run efficiently.

In practice, known variables can be defined by writing parseable tables (t2b) on the wiki pages, by uploading datasets directly to the OpasnetBase and downloading them in the R code that defines the variable, or by using existing data tools within packages installed on the Opasnet server (e.g. Scraper).

Latent variables are usually defined using R code that depends on other defined variables, which should be listed under the dependencies of that variable.

Variables can be defined both latently and by data. Both definitions are stored and can be compared.
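As a rough illustration of the variable-centric idea, the pattern can be sketched in plain R with data.frames standing in for variables. All names and numbers below are invented for illustration; real Opasnet variables carry this structure inside ovariables, described later.

```r
# A "known" variable, as it might be parsed from a t2b wiki table
# (illustrative data, not actual Opasnet content):
population <- data.frame(
  Area   = c("North", "South"),
  Result = c(12000, 34000)       # inhabitants
)

# A "latent" variable, defined by a formula that depends on 'population'
# (the constant is an assumption made up for this sketch):
per_capita_dose <- 0.8           # dose per person, illustrative units
total_dose <- data.frame(
  Area   = population$Area,
  Result = population$Result * per_capita_dose
)

print(total_dose)
```

The point of the pattern is that `total_dose` never hard-codes population figures; if the upstream variable is improved, the downstream one can simply be recomputed.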

OpasnetUtils features

The modelling itself is done using R, and the OpasnetUtils package provides most of the actual tools.

Ovariables

The ovariable is a class defined by OpasnetUtils. It has the following separate "slots":

  • name
  • output
  • data
  • marginal
  • formula
  • dependencies
  • ddata

Ovariables also come with automated checks that are run when they are evaluated.
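To make the slot list above concrete, here is a minimal stand-in class written with base-R S4 tools. This is only a sketch of the structure, not the real OpasnetUtils class definition; the slot types are assumptions inferred from the list above.

```r
library(methods)  # S4 class machinery

# Minimal stand-in for the ovariable class (illustrative only):
setClass("ovariable_sketch", representation(
  name         = "character",   # identifier of the variable
  output       = "data.frame",  # evaluated result
  data         = "data.frame",  # observed/known values
  marginal     = "logical",     # which output columns are indices
  formula      = "function",    # how to compute output from dependencies
  dependencies = "data.frame",  # upstream variables needed by formula
  ddata        = "character"    # pointer to downloadable data (assumed type)
))

# A "known" variable would mainly fill the 'data' slot:
o <- new("ovariable_sketch",
  name = "exposure",
  data = data.frame(Age = c("child", "adult"), Result = c(1.2, 0.7))
)
o@name
```

A latent variable would instead fill `formula` and `dependencies`, leaving `data` empty until evaluation.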

Utilising ovariables

Defining and analyzing the endpoints of a model can be as easy as fetching a relevant variable, evaluating it (using EvalOutput), and applying some of the available functions (e.g. summary() for ovariables).
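The workflow might look roughly like the following pseudocode-style sketch. It will only run inside a working Opasnet environment with OpasnetUtils installed, and the page identifier below is a made-up placeholder:

```r
library(OpasnetUtils)

# Fetch a variable defined on some wiki page ("Op_en0000" is a placeholder):
objects.latest("Op_en0000", code_name = "exposure")

# Evaluate it: dependencies are resolved recursively and the
# 'output' slot is filled.
exposure <- EvalOutput(exposure)

# Inspect the result with ordinary functions:
summary(exposure)
```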

...

Page structure for modelling pages

This is a plan for an improved page structure for pages related to modelling, databases, and codes in Opasnet.

Portal:Modelling with Opasnet is the main page. It contains a brief introduction and links to the content.

Practices
Tools

Question

How should modelling be done in Opasnet in practice? This page should give general guidance on principles, not be a technical manual for using different tools.

What should be the main functionalities of Opasnet modelling environment such that

  • it supports decision analysis,
  • it supports BBNs and Bayesian inference,
  • it mainly contains modelling functionalities for numerically describing reality, but
  • it is also possible to numerically describe scenarios (i.e., deliberate deviations from the truth in order to be able to compare two alternative worlds that are the same in all respects other than the deliberate deviation).

Answer

For a general instruction about contributing, see Contributing to Opasnet.

Obs | Property | Guidance
1 | Structure | Answer should be a data table, either on the page or uploaded to Opasnet Base using R code.
2 | Structure | The indices should logically match those of parent objects.
3 | Applicability | The question of an object should primarily be tailored to the particular needs of the assessment under work, and only secondarily to general use.
4 | Coherence | The answer of an object should be coherent with all information used in the object. In addition, it should be coherent with all other objects. If some information in another object affects the answer of this object, a link to the other object should be placed under Rationale, and specifically under Dependencies if there is a causal connection.
5 | Coherence | Ensuring coherence is a huge task. Therefore, simple things should be done first and more laborious ones only if there is a need. The usual order is: a) Search for similar objects and topics in Opasnet. b) If found, make links to them in both directions. c) Discuss the related info in Rationale. d) Include the info in the calculations of the Answer. e) Merge the two related objects into one larger object that contains all information from both and is internally coherent.
6 | Coherence | When you find two (or more) pieces of information about one topic that are inconsistent, describe the answer in this way (from simple to complex): a) Describe qualitatively what was found. b) Describe the answer quantitatively as a list of possible hypotheses, one per piece of information. c) Describe the hypotheses probabilistically, giving the same probability to each. d) Using expert judgement and/or open critical discussion, adjust the probabilities to give less weight to less convincing hypotheses. e) Develop a probabilistic model that explicitly describes the hypotheses, based on our understanding of the topic itself and the quality of the info, and use the info as input data.
7 | Multi-site assessment | When several similar assessments are performed for several sites, the structure should contain a) a single page for the multi-site assessment, including code that takes the site name as input, b) a single summary page containing a list of all sites and their individual pages, structured as a data table, c) an individual page for each site containing a data table with all site-specific parameter values needed in the assessment.
8 | Formula | Whenever possible, all computing code should be written in R.
9 | Formula | Instead of copying the same code to several pages, the multi-site assessment approach should be used. Alternatively, the #include_code function should be used (once it is finalised and functional).
10 | Formula | Some procedures repeat over and over in impact assessments. These can be written as functions. Common or important functions can be included in libraries available in R-tools. Search the R-tools libraries so that you learn to use the same functions as others do.
11 | Formula | When you develop your own functions with a general purpose, suggest adding them to an R-tools library.
12 | Preferred R code | Objects should be described as data.frames. The use of arrays is discouraged.
13 | Preferred R code | Probabilistic information is incorporated in a data.frame using the index Run, which contains the Monte Carlo iteration number from 1 to n (the sample size).
14 | Preferred R code | Graphs are drawn using the ggplot2 graphics package.
15 | Preferred R code | Uploading to and downloading from Opasnet Base is done using the OpasnetBaseUtils package. Uploading is only possible from certain computers with specific IP addresses.
16 | Preferred R code | When possible and practical, summary parameters and sub-object functions are computed with the tapply family of functions.
17 | Preferred R code | Two data.frames with one or more common indices are put together using the merge function.
18 | Preferred R code | When two data.frames have identical rows (or columns), they are put together with cbind (which adds more columns) or rbind (which adds more rows).
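The "Preferred R code" rows above (the Run index, merge, and the tapply family) can be combined into one small sketch. The distributions and numbers are purely illustrative assumptions, not actual Opasnet data:

```r
n <- 1000                                   # Monte Carlo sample size
set.seed(1)                                 # for a reproducible sketch

# Probabilistic variables as data.frames with a 'Run' index:
conc <- data.frame(Area = rep(c("North", "South"), each = n),
                   Run = rep(1:n, 2),
                   Result = rlnorm(2 * n, meanlog = 0, sdlog = 0.5))
intake <- data.frame(Run = 1:n,
                     Result = rnorm(n, mean = 2, sd = 0.2))

# merge() joins the two variables on their common index 'Run':
dose <- merge(conc, intake, by = "Run")
dose$Result <- dose$Result.x * dose$Result.y

# Summaries over an index with tapply():
tapply(dose$Result, dose$Area, mean)
```

Because the uncertainty travels with the Run column, any derived variable built this way stays probabilistic without extra bookkeeping.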

Links related to the answer: Data table, Opasnet Base, R, Parent object, Child object, R-tools, OpasnetBaseUtils, ggplot2, tapply, merge, data.frame, rbind, cbind

Note! The text talks about objects, meaning any information objects. The most common objects are variables.

Relationship of Answer and Rationale

All variable pages should have a clear question and a clear answer. The answer should typically be in a form of a data table that has all indices (explanatory columns) needed to make the answer unambiguous and detailed enough. If the answer table is very large, it might be a bad idea to show it on the page; instead, a description is shown about how to calculate the answer based on Dependencies and Rationale, and only a summary of the result is shown on the page; the full answer is saved into Opasnet Base.

The answer should be a clear and concise answer to the specific question, not a general description or discussion of the topic. The answer should be understandable to anyone who has general knowledge and has read the question.

In addition, the answer should be convincing to a critical reader who reads the following data and believes it is correct:

  • The Rationale section of the page.
  • The Answer sections of all upstream variables listed in the Dependencies section.
  • In some cases, also downstream variables may be used in inference (e.g. in hierarchical Bayes models).

It should be noted that the data mentioned above should itself be backed up by original research from several independent sources, a good rationale, etc. It should also be noted that ALL information needed to convince the reader should be put into the places mentioned and not somewhere else. In other words, when the reader has read the rationale and the relevant results, (s)he should be able to trust that (s)he is now aware of all the major points related to the specific topic that have been described in Opasnet.

This results in guidance for info producers: if there is a relevant piece of information that you are aware of but it is not mentioned, you should add it.

Indices of the data table

The indices, i.e. explanatory columns, should match in variables that are causally connected in a causal diagram (i.e., mentioned in Dependencies). This does not mean that they must be the same (as not all explanations are relevant for all variables), but it should be possible to see which parts of the results of two variables belong together. An example is a geographical grid for two connected variables such as the concentration field of a pollutant and an exposure variable for a population. If the concentration and population use the same grid, the exposure is easy to compute. They can also be used together with different grids, but then it must be explained how one dataset can be converted to the other grid for calculating exposures.
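A minimal sketch of the shared-grid case in R; the grid cells and values are invented for illustration:

```r
# Two causally connected variables sharing the index 'Cell':
conc <- data.frame(Cell = c("A1", "A2", "B1"),
                   Result = c(10, 20, 5))     # concentration, illustrative units
pop  <- data.frame(Cell = c("A1", "A2", "B1"),
                   Result = c(100, 50, 200))  # inhabitants per cell

# Because the index matches, exposure is a plain merge and multiply:
exposure <- merge(conc, pop, by = "Cell")
exposure$PersonConc <- exposure$Result.x * exposure$Result.y
sum(exposure$PersonConc)                      # population-weighted total
```

With mismatched grids, an extra conversion table (cell-to-cell weights) would be needed before the merge; the shared index makes that step unnecessary.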

Increasing the precision of the answer

This is a rough order of emphasis to guide the work when starting from scratch and proceeding towards highly sophisticated and precise answers. The first step always requires careful thinking, but if there is plenty of easily available data, you may proceed through steps 2-4 quickly; with very little data it might be impossible to get beyond step 3.

  1. Describe the variables, their dependencies and their indices (explanations) to develop a coherent and understandable structure and causal diagram.
  2. Describe the variables as point estimates and simple (typically linear) relations to make the first runnable model. Check that all parts make sense and that all units are consistent. It is desirable, but not yet critical, that all values and results are plausible.
  3. Describe the variables as ranges such that the true value is most likely within the range. This is more important than having a very precise range (and thus a higher probability of not covering the truth). This may result in vague conclusions (like: it might be a good idea to do this, but on the other hand, it might be a bad idea). But that is exactly how it should be: in the beginning, we should be uncertain about conclusions. Only later, as we collect more data and things become more precise, are the conclusions clarified. At this step, you can use sensitivity analyses to see where the most critical parts of your model are.
  4. The purpose of an assessment model is to find recommendations for actions. Except in the clearest cases, this is not possible using variable ranges; instead, probability distributions are needed. Then the model can be used for optimising, i.e., finding optimal decision combinations.
  5. When you have developed your model this far, you can use value of information analysis (VOI analysis) to find the critical parts of your model. The difference from a sensitivity analysis is that a VOI analysis tests which parts would change your recommendation, not which parts would change your estimate of the outcome. Often the two analyses point in the same direction, but a VOI analysis is more about what you care about, while a sensitivity analysis can be performed even if no explicit decision has yet been clarified.
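As a small illustration of moving from step 3 (ranges) to step 4 (distributions), one can encode a range as a roughly matching distribution. The lognormal choice and its parameters below are assumptions made for this sketch, not guidance from the text:

```r
# Step 3: a deliberately wide range likely to contain the true value.
range_low  <- 0.5
range_high <- 5

# Step 4: a probability distribution over the same quantity, here a
# lognormal whose central 95% roughly matches the range above:
set.seed(1)
x <- rlnorm(10000,
            meanlog = log(sqrt(range_low * range_high)),
            sdlog   = (log(range_high) - log(range_low)) / 4)
quantile(x, c(0.025, 0.975))   # close to the original range
```

Unlike the bare range, the distribution can be propagated through the model and used for optimisation and VOI analysis.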

Rationale

A draft based on the author's own thinking. Not even the topics are clear yet.

How many high school students are needed to replace one expert? Calculate an expert's effective study time and the fraction of it that is needed to solve the problem in question.

A guess: 10. Experts look down on superficial knowledge, and theirs is deep. What is the difference? Connections. If two things are each possible but not simultaneously, an expert recognises this but a layperson does not. High school students can be turned into experts by teaching them a method for describing connections. After that, all the knowledge no longer needs to be in one person's head.

It is hard for people to grasp that numerous problems can be solved at once with the same method. On the other hand, large crowds can be motivated to solve a single problem if the topic is important to them. Should we therefore find that one important issue? The others would then start to get solved as a side effect.

It is also difficult to see meta-level questions, i.e. the system or oneself as part of a larger structure within which the possible worlds also exist and from within which the solutions are found.

The noblest kind of imagination is to imagine good things that could exist but do not, and the path between their non-existence and existence.

Scientific science is like the American dream: the methods of science produce enough breakthroughs that every generation gets its own success stories and idols, but in practice the scientific method is too far removed from a researcher's everyday work to really influence it. So researchers live on an illusion, just as Americans do, and toil without the means to reach their true goals, which are greater and more influential than what the system of science makes possible. Researchers' time and resources go into thinking about two things: where do I get money, and how do I get my ideas published. There is always too little time for developing the ideas themselves. Thus not only the goals but also the abilities are greater than what the system bends to. Those who do best are the CEO types, who know how to organise fundraising, publishing, and institutions around their own interests.

A meta-analysis produces an estimate (with confidence intervals) of the average exposure-response. However, we are interested in the distribution of the individual exposure-response functions. If we know the distribution of the individual distributions, we can always produce the distribution for the average. In addition, if we assume that our sample of individuals is not a random sample from the whole population of interest, we can try to assess the subgroup's bias relative to a random sample, and in this way produce sub-population-specific exposure-response distributions. In other words, we can mimic a real study using bootstrapping (biased bootstrapping, if we have an estimate of how the study is biased) from our individual-level whole-population distribution; in this way, we can test whether it is plausible that the study data actually came from such a whole-population distribution in the way we think. This is a difficult task, but if it works, it offers a method for systematically including any study in the same examination of individual-level (not average) exposure-response functions.
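The bootstrapping idea can be sketched in a few lines of R. The distribution of individual slopes and the bias weights are entirely invented for illustration:

```r
set.seed(1)

# Assumed individual-level exposure-response slopes for the whole
# population (illustrative distribution, not real data):
slopes <- rnorm(5000, mean = 1, sd = 0.3)

# Plain bootstrap: mimic a study that samples 200 individuals at random.
study <- sample(slopes, 200, replace = TRUE)

# Biased bootstrap: oversample high-slope individuals to mimic a
# hypothesised selection bias in the study population.
w <- exp(slopes)                      # illustrative bias weights
biased_study <- sample(slopes, 200, replace = TRUE, prob = w)

mean(study); mean(biased_study)       # the biased sample tends to shift upward
```

Comparing such simulated "studies" against published results is one way to test whether an assumed whole-population distribution is consistent with the study data.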

See also

Keywords

Modelling, Opasnet Base, scenario
