The connections between causation, probability, and network representations suggested that with appropriate assumptions and background knowledge, something about the causal structure can be learned from observation, and the outcomes of some ideal interventions can be predicted. In the second part of the three-part series on the subject, Ashoka gives us unique insight into scientific revolutions, in the weekly column, exclusively in Different Truths.
In the early 1980s, several statisticians developed a network representation of probability relations that formalised and generalised ideas that had long been used in biology, social science, and elsewhere. In this representation, suppose we have data for a number of variables, each of which takes a definite value in each individual object or case (the variables might be height, weight, ratio of Democrats to Republicans, whatever; the individual objects, or cases, could be people, rats, cells, state governments, whatever). Represent each variable as a node and draw arrows from some nodes to other nodes.
In the early 1990s, a group of philosophers and statisticians at Carnegie Mellon noted that many of the information restrictions, or conditional independence facts, represented in a network also hold in a related way if the arrows represent causal relations, and, relying on the Markov condition, they gave a general characterisation of the relation between network structure, probabilities, and causal claims.
The idea is easiest to see for interruptions of a simple causal chain. For instance, if pushing the doorbell button causes the bell to ring, which in turn causes the house parrot to say “hello,” then if you intervene to keep the bell from ringing, pushing or not pushing the doorbell button will not change the probability that the parrot says “hello.” After your intervention, the state of the button and the state of the parrot will be independent of each other; neither will provide information about the other. But if you do not intervene to disconnect the bell, pushing the button will be independent of the parrot’s speech conditional on the state of the bell, ringing or not ringing; if you know whether the bell is ringing, the parrot’s speech won’t give you any more information as to whether someone is at the door. In many cases, the independence relations produced by interventions in a system parallel the conditional independence relations implied by the network representation of the causal structure of the system.
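This screening-off relation can be checked with a small simulation. The sketch below is illustrative only: the probabilities (0.5, 0.9) and the function names are assumptions, chosen to make the chain button → bell → parrot concrete. Once the state of the bell is fixed, the button carries no further information about the parrot:

```python
import random

random.seed(0)

def simulate(n=100_000, intervene_bell=None):
    """Draw n samples from the chain button -> bell -> parrot.
    If intervene_bell is set, the bell is forced to that state
    regardless of the button (an ideal intervention)."""
    samples = []
    for _ in range(n):
        button = random.random() < 0.5          # someone pushes the button
        if intervene_bell is None:
            # the bell usually follows the button, with a little noise
            bell = button if random.random() < 0.9 else not button
        else:
            bell = intervene_bell
        # the parrot usually says "hello" when the bell rings
        parrot = bell if random.random() < 0.9 else not bell
        samples.append((button, bell, parrot))
    return samples

def p_parrot(samples, conditions):
    """P(parrot says hello | conditions), where conditions maps
    tuple positions (0=button, 1=bell) to required values."""
    rows = [s for s in samples
            if all(s[i] == v for i, v in conditions.items())]
    return sum(s[2] for s in rows) / len(rows)

data = simulate()
# Given that the bell rings, the button adds no information:
print(p_parrot(data, {1: True, 0: True}))   # approximately equal
print(p_parrot(data, {1: True, 0: False}))  # approximately equal
```

The two printed probabilities agree up to sampling noise, which is exactly the conditional independence the network representation encodes; calling `simulate(intervene_bell=False)` severs the chain entirely, making button and parrot unconditionally independent.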
These connections between causation, probability, and network representations suggested that with appropriate assumptions and background knowledge, something about the causal structure can be learned from observation, and the outcomes of some ideal interventions can be predicted.
Even before any data are collected, the class of alternative networks that might conceivably describe the causal relations among a set of variables is astronomical, even for small numbers of variables; with larger numbers of variables it remains huge even if some of the variables are ordered so that one can assume that later variables do not influence earlier ones.
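Just how astronomical can be made concrete: the number of directed acyclic graphs on a set of labelled variables is given by a classical inclusion-exclusion recurrence due to Robinson. A short Python sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Count the directed acyclic graphs on n labelled nodes,
    using Robinson's inclusion-exclusion recurrence."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k)
               * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(3))   # 25 candidate structures for just 3 variables
print(num_dags(10))  # already more than 4 * 10**18
```

Three variables admit 25 candidate structures; ten variables admit more than four quintillion, which is why brute-force enumeration is hopeless and efficient search matters.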
Even so, early in the 1990s, researchers at the University of Pittsburgh, Carnegie Mellon, UCLA, and Microsoft developed algorithms and software for searching for the class of diagrams that can account for any given set of independence relations among variables. Since then, many related algorithms have been proposed and applied by others. These procedures search efficiently for information within the huge space of alternative possible causal structures, and, unlike stepwise regression, some of them come with a weak guarantee of reliability: given the Markov condition and one further technical assumption, the Bayes net search programs ‘converge,’ as the size of the sample increases, to correct information about the causal structure behind the data.
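One kind of independence pattern such procedures exploit can be shown in a toy example (a sketch only, not any of the published algorithms): two independent causes of a common effect are unconditionally independent but become dependent once the common effect is conditioned on, and asymmetries like this help orient the arrows.

```python
import random

random.seed(1)
N = 50_000

def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Collider structure: X -> Y <- Z, with X and Z independent causes.
X = [random.gauss(0, 1) for _ in range(N)]
Z = [random.gauss(0, 1) for _ in range(N)]
Y = [x + z + random.gauss(0, 0.5) for x, z in zip(X, Z)]

print(corr(X, Z))  # near zero: the two causes are independent
# Conditioning on Y (here, keeping only samples with Y near 0)
# induces a strong dependence between X and Z:
near_zero = [(x, z) for x, y, z in zip(X, Y, Z) if abs(y) < 0.1]
xs, zs = (list(t) for t in zip(*near_zero))
print(corr(xs, zs))  # clearly negative
```

A chain X → Y → Z shows the opposite signature, dependent marginally but independent given Y, so independence tests alone can sometimes tell the two structures apart.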
To estimate the effect of lead, Scheines and his Dutch collaborators turned to a relatively new technique in Bayesian statistics. Bayesian statistics proceeds by assigning ‘prior probabilities’ to alternative hypotheses, computing for each hypothesis the probability of the data on the assumption that that hypothesis is true, and, from all this, computing a new, or ‘posterior,’ probability for each hypothesis or range of parameters considered. For a long time, because the posterior probabilities often could not be computed, Bayesian statistics was chiefly a toy for simple problems; computational developments in the last two decades have changed that considerably. Scheines used the economists’ judgments of the probability distribution for values of parameters related to measurement error to assign prior probabilities to their measurement error model. Then he and his collaborators computed the posterior probability distribution for values of the parameter representing the influence of lead on IQ. By this method, they found that low-level lead exposure is almost certainly at least twice as damaging to cognitive ability as Needleman had estimated.
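The prior-times-likelihood arithmetic behind such a computation can be shown on a toy problem. This is a generic grid-approximation sketch, not the Scheines model (whose parameters and priors are not given here): infer the bias of a coin from 7 heads in 10 flips, starting from a uniform prior.

```python
# Grid approximation of a Bayesian update.
grid = [i / 100 for i in range(101)]   # candidate values of the coin's bias
prior = [1.0] * len(grid)              # uniform prior over the candidates
heads, flips = 7, 10

# Probability of the observed data under each candidate bias
likelihood = [p ** heads * (1 - p) ** (flips - heads) for p in grid]

# Posterior = prior * likelihood, normalised to sum to one
unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# Posterior probability that the coin is biased towards heads
p_heads_biased = sum(po for p, po in zip(grid, posterior) if p > 0.5)
print(round(p_heads_biased, 2))
```

The same three steps, assigning priors, computing the probability of the data under each hypothesis, and normalising, scale up to serious problems; modern sampling methods stand in for the grid when the parameter space is large.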
Every cell in your body has the same DNA, but cells in different tissues look and function very differently – brains, after all, are not bones. The difference comes from the proteins that make up the physical structure of a cell and regulate – indeed, in some sense constitute – its metabolism. The thousands of different kinds of proteins are themselves produced by a collaborative manufacturing process in the cell. Amino acids – any of twenty simple molecules provided to the cell from outside – are stitched together to form a protein, which may then fold and combine chemically or physically with other proteins. Each basic protein is assembled along a template of ribonucleic acid (RNA) outside the nucleus, and different template molecules – different kinds of RNA molecules – make different proteins. The template is messenger RNA (mRNA), itself copied from DNA inside the nucleus. Whether a piece of DNA is copied into mRNA within any interval of time depends on several things, including the chemical sequence of the particular DNA piece (whether it is a coding sequence, i.e., a gene), the chemical sequences of other regions of the chromosome that are physically close (regulator sites), the concentrations of small molecules inside the nucleus of the cell, and the concentrations of proteins. Certain proteins attach to the regulator sites of a gene and cause the gene to be copied (in other terminology, ‘transcribed’ or ‘expressed’) into RNA, which in turn goes on to make proteins. An important clue to fundamental biology and its medical applications lies in this process of gene expression: in knowing which genes respond to new chemical or physical environments, and which cellular functions are influenced by the proteins those responding genes produce.
Traditionally, this kind of problem had been approached one gene at a time – for instance, by finding some of the proteins that regulate a gene, finding the protein or proteins the gene yields, and identifying some of the roles those proteins play in cellular metabolism. But about ten years ago, biologists developed techniques for simultaneously measuring the concentrations of thousands – and in some contexts essentially all – of the distinct kinds of mRNA molecules present in a collection of cells. Biologists could get a snapshot of how much each gene in the cells had been copied, or expressed. Multiple snapshots, moreover, could be taken at different times, as little as a few minutes apart, so that researchers could see the varying responses of the cell genome to changing conditions. So what affects which genes? Answers to this question are coming in at an astonishing rate.
[To be continued]
©Ashoka Jahnavi Prasad
#ProbabilityInStudyOfScience #DevelopmentOfSoftwareForCalculatingProbability #SoftwareForGeneticStudy #ProbabilityAndVariables #StatisticsAndStudyOfScience #MidweekMusing #DifferentTruths