Causal theory for the study of gene cause-effect relationships

By studying changes in gene expression, scientists learn how cells work at the molecular level, which could help them understand the development of certain diseases.

But humans have about 20,000 genes that can interact with each other in complex ways, so even knowing which groups of genes to target is an enormously complicated problem. Also, genes work together in modules that regulate each other.

MIT researchers have now developed the theoretical basis for methods that could identify the best way to aggregate genes into related groups in order to efficiently learn the underlying cause-and-effect relationships between many genes.

Importantly, the new method does so using only observational data. This means that researchers do not need to conduct costly and sometimes impractical interventional experiments to obtain the data needed to infer the underlying causal relationships.

In the long term, the technique could help scientists identify potential gene targets to induce certain behaviors in a more precise and efficient way, potentially allowing them to develop precise treatments for patients.

“In genomics, it is very important to understand the mechanism underlying cellular states. But cells have a multi-level structure, so the level of summarization is also very important. If you figure out the right way to aggregate the observed data, the information you learn about the system should be more interpretable and useful,” says graduate student Jiaqi Zhang, a member of the Eric and Wendy Schmidt Center and co-author of a paper on the technique.

Zhang is joined on the paper by co-author Ryan Welch, currently a master’s student in engineering; and senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS), who is also director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS). The research will be presented at the Conference on Neural Information Processing Systems.

Learning from observational data

The problem the researchers set out to tackle involves learning gene programs. These programs describe which genes work together to regulate other genes in a biological process, such as cell development or differentiation.

Because scientists can’t effectively study how all 20,000 genes interact, they use a technique called causal disentanglement to learn how to combine related groups of genes into a representation that allows them to efficiently investigate cause-and-effect relationships.

In previous work, the researchers demonstrated how this can be done efficiently in the presence of interventional data, which is data obtained by perturbing variables in the network.

However, conducting interventional experiments is often expensive, and in some scenarios such experiments are either unethical or beyond what current technology can reliably achieve.

With only observational data, researchers cannot compare genes before and after an intervention to see how groups of genes work together.

“Most research in causal disentanglement assumes access to interventions, so it wasn’t clear how much information you can disentangle using observational data alone,” Zhang says.

The MIT researchers developed a more general approach that uses a machine-learning algorithm to efficiently identify and aggregate groups of observed variables, such as genes, using only observational data.

They can use this technique to identify causal modules and reconstruct an accurate underlying representation of the cause-and-effect mechanism.

“While this research was motivated by the problem of elucidating cellular programs, we first had to develop a new causal theory to understand what can and cannot be learned from observational data. With this theory in hand, in future work we can apply our knowledge to genetic data and identify gene modules and their regulatory relationships,” says Uhler.

Layered representation

Using statistical techniques, the researchers can calculate a mathematical function known as the variance of the Jacobian of each variable’s score. Causal variables that do not affect any subsequent variables should have a variance of zero.
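To make the zero-variance criterion concrete, here is a minimal sketch on a two-variable toy model. It is not the researchers’ algorithm: the model (x1 causes x2 through a squared term, with unit Gaussian noise), the variable names, and the hand-derived formulas are all assumptions chosen for illustration. The “score” here means the gradient of the log density, and the quantity of interest is how much each diagonal entry of its Jacobian varies across samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy model (an assumption for illustration): x1 causes x2, x2 causes nothing.
x1 = rng.normal(size=n)
x2 = x1**2 + rng.normal(size=n)

# log p(x1, x2) = -x1^2/2 - (x2 - x1^2)^2/2 + const, so the score is
#   s1 = -x1 + 2*x1*(x2 - x1^2)   and   s2 = -(x2 - x1^2).
# Diagonal entries of the score's Jacobian, derived by hand from the above:
d_s1_dx1 = -1.0 + 2.0 * x2 - 6.0 * x1**2  # depends on the sample
d_s2_dx2 = np.full(n, -1.0)               # the same for every sample

var_x1 = d_s1_dx1.var()
var_x2 = d_s2_dx2.var()
print(f"variance for x1: {var_x1:.3f}")
print(f"variance for x2: {var_x2:.3f}")
```

In this toy setting, x2 affects no other variable and its variance is exactly zero, while x1, which does have a downstream effect, shows a clearly positive variance.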

The researchers reconstruct the representation in a layer-by-layer structure, starting by removing the variables in the bottom layer that have zero variance. Then they work backward, layer by layer, removing variables with zero variance to determine which variables, or groups of genes, are connected.
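The layer-by-layer idea can be sketched in a simplified setting where every variable is observed directly and the density is known in closed form. This is an illustrative assumption, not the setting of the paper, which works with aggregated groups of variables. The four-variable model, the helper names (`MODEL`, `sample`, `log_density`, `score_jacobian_diag_var`, `peel_layers`), the finite-difference step, and the tolerance are all hypothetical. The sketch also assumes nonlinear mechanisms with Gaussian noise.

```python
import numpy as np

# Hypothetical toy model: variable -> (parent indices, nonlinear mean function).
# Edges: x0 -> x1, x0 -> x2, x1 -> x3. Every variable gets unit Gaussian noise.
MODEL = {
    0: ((), lambda p: 0.0),
    1: ((0,), lambda p: np.sin(p[:, 0])),
    2: ((0,), lambda p: p[:, 0] ** 2),
    3: ((1,), lambda p: np.tanh(2.0 * p[:, 0])),
}


def sample(n, seed=0):
    """Draw n samples; the indices are already listed in causal order."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n, len(MODEL)))
    for j in sorted(MODEL):
        parents, mean = MODEL[j]
        x[:, j] = mean(x[:, list(parents)]) + rng.normal(size=n)
    return x


def log_density(x, active):
    """Log density (up to a constant) of the variables in `active`.

    Valid as long as `active` contains the parents of each of its members,
    which holds here because only childless variables are ever removed.
    """
    total = np.zeros(len(x))
    for j in active:
        parents, mean = MODEL[j]
        total -= 0.5 * (x[:, j] - mean(x[:, list(parents)])) ** 2
    return total


def score_jacobian_diag_var(x, active, j, h=1e-3):
    """Variance over samples of d^2 log p / d x_j^2 (central differences)."""
    step = np.zeros(x.shape[1])
    step[j] = h
    second = (log_density(x + step, active) - 2.0 * log_density(x, active)
              + log_density(x - step, active)) / h**2
    return second.var()


def peel_layers(x, tol=1e-6):
    """Repeatedly remove the variables whose variance is numerically zero."""
    active, layers = sorted(MODEL), []
    while active:
        layer = [j for j in active
                 if score_jacobian_diag_var(x, active, j) < tol]
        if not layer:
            raise RuntimeError("no zero-variance variable found")
        layers.append(layer)
        active = [j for j in active if j not in layer]
    return layers


layers = peel_layers(sample(2000))
print(layers)  # [[2, 3], [1], [0]]
```

The first pass removes x2 and x3, which affect nothing else. Once they are gone, x1 has no remaining effects and is removed next, leaving x0 as the top layer. In practice the density is not known and the score would have to be estimated from data, which this sketch sidesteps.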

“Identifying the variances that are zero quickly becomes a combinatorial problem that is quite difficult to solve, so deriving an efficient algorithm that could solve it was a big challenge,” says Zhang.

The result of their method is an abstracted representation of the observed data with layers of interconnected variables that accurately summarize the underlying structure of cause and effect.

Each variable represents an aggregate group of genes that work together, and the relationship between two variables represents how one group of genes regulates the other. Their method effectively captures all the information used in determining each layer of variables.

After proving that their technique was theoretically correct, the researchers ran simulations to show that the algorithm could effectively infer meaningful causal representations using only observational data.

In the future, the researchers want to apply this technique to real-world genetics applications. They also want to explore how their method might provide additional insights in situations where some interventional data are available, or help scientists understand how to design effective genetic interventions. Ultimately, the method could help researchers more efficiently determine which genes work together in the same program, which could help identify drugs that target those genes to treat certain diseases.

This research is funded in part by the US Office of Naval Research, the National Institutes of Health, the US Department of Energy, a Simons Investigator Award, the Eric and Wendy Schmidt Center at the Broad Institute, the Advanced Undergraduate Research Opportunities Program at MIT, and an Apple AI/ML PhD Fellowship.
