AI researchers try to decode the cogs of decision-making in neural networks

Researchers can train brain-like artificial neural networks to classify images, such as cat pictures. Using a series of manipulated images, scientists can determine which part of the image – say the whiskers – is used to identify it as a cat. However, when the same technology is applied to DNA, researchers don’t know which parts of the sequence are important to the neural network. This unknown decision process is known as the “black box”. (Source: Ben Wigler/CSHL, 2021)

With more than three billion base pairs of DNA sequence to wade through, the human brain cannot determine at a glance whether a given stretch of the code binds a transcription factor or sits in an accessible region of chromatin.

These functional attributes can be identified by artificial intelligence algorithms called neural networks, loosely modeled after the human brain, which are designed to recognize patterns.

Peter Koo, PhD, assistant professor at Cold Spring Harbor Laboratory (CSHL), and collaborator Matt Ploenzke, PhD, of Harvard University’s Department of Biostatistics, are using a type of neural network called a convolutional neural network (CNN) to train machines to predict the function of DNA sequences.

These findings are reported in an article titled “Improving representations of genomic sequence motifs in convolutional networks with exponential activations,” published in Nature Machine Intelligence.

The study reports that teaching neural networks to predict the functions of short DNA sequences enabled them to decipher patterns in larger sequences as training progressed. The initial experiments were conducted on synthetic DNA sequences; the scientists then generalized the results to real DNA across several in vivo data sets. In future studies, the researchers hope to analyze more complex DNA sequences that regulate the activity of genes involved in development and disease.
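To make the synthetic-sequence setup concrete, here is a toy sketch, not the authors’ actual benchmark, of how labeled synthetic data of this kind can be constructed: random DNA sequences in which a known motif is implanted only in the positive class. The motif, sequence length, and function names below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = "ACGT"
MOTIF = "TGACTCA"  # hypothetical motif implanted only in positive sequences


def random_sequence(length=200):
    """Draw a uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(list(BASES), size=length))


def make_example(positive, length=200):
    """Return (sequence, label); positives carry the motif at a random position."""
    seq = random_sequence(length)
    if positive:
        start = rng.integers(0, length - len(MOTIF))
        seq = seq[:start] + MOTIF + seq[start + len(MOTIF):]
    return seq, int(positive)


# Build a small, balanced toy dataset: the network's task is to detect the motif.
data = [make_example(i % 2 == 0) for i in range(1000)]
```

Because the implanted motif is known in advance, a setup like this lets researchers check whether the patterns a network learns actually correspond to the ground truth.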

Although artificial intelligence and deep learning scientists have figured out how to train computers to recognize images, they do not yet fully understand how the machines learn to classify objects so quickly and accurately. Understanding how machines classify abstract patterns, such as those found in DNA sequences, is even more difficult, because humans cannot recognize the correct answers themselves.

Machine learning researchers have trained neural networks to recognize common objects such as cats or airplanes by repeatedly training these neural networks with many images. The program is then tested by presenting it with a new image of a cat or an airplane and noting whether it classifies the new image correctly.

Translating the same approach to test whether neural networks can detect sequence patterns in DNA is not straightforward, however. A human can judge whether the neural network has correctly identified an object as a cat or a dog, because the human brain can draw the same conclusion on its own. That is not the case for detecting biologically significant patterns in DNA sequences.

The human brain cannot recognize functional patterns in DNA sequences. Therefore, even if the neural network comes up with a set of meaningful patterns in a segment of DNA, researchers may not be able to tell whether the computer has correctly identified a significant pattern.

Human programmers are therefore unable to judge how the neural networks learn or how accurate the decisions they arrive at really are. This hidden process, which makes it hard to trust the output of the neural network, is what researchers call a “black box.”

“It can be quite easy to interpret these neural networks, because they will just point to, say, a cat’s whiskers. And that’s why it’s a cat versus a plane. In genomics, it’s not that simple, because the genomic sequences are not in a form where humans really understand the patterns that these neural networks are pointing to,” says Koo.

The authors train the CNNs by showing them genomic DNA sequences. This learning process resembles the way the brain processes images.
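As an illustration of what “showing a CNN genomic DNA sequences” means in practice, here is a minimal sketch of the standard setup, not the authors’ published architecture: each sequence is one-hot encoded, with one channel per base (A, C, G, T), and fed to a small convolutional network whose first-layer filters slide along the sequence much like motif scanners. Layer sizes and the output label are placeholders.

```python
import numpy as np
import tensorflow as tf

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        x[pos, idx[base]] = 1.0
    return x


# A small 1D CNN: first-layer filters scan across the sequence like motif detectors.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 4)),
    tf.keras.layers.Conv1D(32, 19, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling1D(25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary label, e.g. "bound / not bound"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```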

CNNs have become increasingly popular and are the state-of-the-art technology for accurately predicting a variety of regulatory motifs in genomic DNA. Their success comes from their ability to learn patterns directly from training data. However, as with many other deep learning algorithms, little is understood about the inner workings of CNNs, which is why they are labeled black boxes.

The study introduces a new method for teaching one layer of the CNN to recognize important DNA patterns. This allows the neural network to build on those patterns to identify more complex ones. Koo’s discovery looks inside the black box and identifies some of the key features that drive the computer’s decision-making.
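The paper’s title points to exponential activations as the mechanism. A minimal sketch of that idea, assuming a Keras-style model like the one above and not the authors’ exact configuration, is to swap the first convolutional layer’s activation for an exponential function while leaving the rest of the network unchanged.

```python
import tensorflow as tf

# Same toy architecture as before, but the first convolutional layer now uses an
# exponential activation instead of ReLU; downstream layers are unchanged.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 4)),
    tf.keras.layers.Conv1D(32, 19, activation=tf.keras.activations.exponential,
                           padding="same"),
    tf.keras.layers.MaxPooling1D(25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

After training, the first-layer filters, or the subsequences that activate them most strongly, can be inspected to see whether they resemble known regulatory motifs.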

But Koo has a bigger goal in mind for the field of artificial intelligence. There are two qualities by which a neural network can be improved: interpretability and robustness. Interpretability refers to the ability of humans to decipher why a machine gives a certain prediction. The ability to produce a reliable answer even when there are errors in the data is called robustness. Usually researchers focus on one or the other.

“What my research is trying to do is connect these two things because I don’t think they’re separate entities. I think we get better interpretability if our models are more robust,” Koo says.

Deep learning has the potential to make a significant impact on basic biology, but a major challenge is understanding the reasons behind the models’ predictions. Koo’s research develops methods to interpret black box models, with the goal of understanding the mechanisms underlying sequence-function relationships in gene regulation.
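One common family of interpretation methods, shown here purely to illustrate what interrogating a black box model can look like rather than as the paper’s specific procedure, is gradient-based attribution: compute the gradient of the model’s prediction with respect to the one-hot input to see which positions in the sequence most influence the output.

```python
import tensorflow as tf


def saliency(model, x):
    """Gradient-times-input attribution for a one-hot DNA input of shape (length, 4).

    Returns one importance score per sequence position; larger magnitudes suggest
    positions the model relies on more heavily for its prediction.
    """
    x = tf.convert_to_tensor(x[None, ...])  # add a batch dimension
    with tf.GradientTape() as tape:
        tape.watch(x)
        pred = model(x)
    grads = tape.gradient(pred, x)          # shape (1, length, 4)
    # Collapse the four base channels into a single score per position.
    return tf.reduce_sum(grads[0] * x[0], axis=-1).numpy()
```

Peaks in such an attribution map can then be compared against known motifs, which is one way researchers probe whether a network has learned biologically meaningful patterns.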

Koo hopes that if a machine can find robust, interpretable DNA patterns related to gene regulation, it will help geneticists understand how mutations affect cancer and other diseases. “We partnered with other members of the CSHL Cancer Center to study the sequence basis of epigenomic differences between healthy and cancerous cells,” Koo notes.

James G. Williams