CancerMine: training computers to deliver cancer-gene associations from literature

Precision oncology depends on knowing which genes are important in cancer. We applied machine learning to the published literature to extract this information.

Our paper was driven by BC Cancer’s efforts toward precision oncology in British Columbia. Precision oncology uses DNA and RNA sequencing to find molecular anomalies that may be the driving force behind a cancer. At BC Cancer, we use this information to guide treatment decisions as part of our Personalized OncoGenomics (POG) program.

However, it’s one thing to have sequence information from a tumour and entirely another to interpret what it means for the specific cancer patient from whom it came. This is often the job of bioinformaticians, who take raw sequence mutation information, put it in context with the specific patient’s cancer type and create a report for an oncologist.

Figure 1. The data can give an overview of the important cancers associated with each gene (above is TERT) or the importance of each gene in a cancer type.

To do this, analysts need a list of important genes for each cancer type, and knowledge bases such as CIViC can help with interpretation. These resources are especially important when an analyst is faced with a sample from a cancer type that they have not seen before.

While several such resources exist, they either do not provide the context of the cancer or do not provide a citation to supporting literature. We realised that we would need to create our own. New oncology literature is published constantly, and curating a database from it by hand is extremely costly. So, we decided to apply our developing expertise in machine learning to the task.

Our group of bioinformaticians at Canada’s Michael Smith Genome Sciences Centre at BC Cancer (GSC) in Vancouver has been looking at different applications of machine learning for biomedical text processing, and we developed a tool, Kindred, to extract relations from text. So, we asked ourselves: could we train Kindred to find mentions of oncogenes and tumour suppressors in abstracts or full text from oncology literature?

First, Kindred needed examples. This required someone to sit down, read sentences and annotate their meaning. But humans make mistakes, or understand sentences differently, so three people annotated the text examples, which let us deal with discrepancies between them. This turned out to be the longest part of the project. With the sentences annotated, Kindred learned to identify the relevant text and was set to work crunching PubMed. CancerMine was born.

Knowing that clinicians and analysts wouldn’t accept a large number of false positives, we built a machine learning approach with very high precision (~86%), accepting lower recall (~30%) in exchange. We worked with the hypothesis that important cancer genes would be mentioned multiple times, so that missing some individual mentions would still leave the gene in the knowledge base. This was indeed the case, with genes like ERBB2 being mentioned in hundreds of papers. We also worked hard to make sure that the knowledge base would reflect the current knowledge in the field, with monthly updates.
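
To see why a recall of around 30% can be tolerable, here is a small back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each mention of a gene-cancer association is picked up independently with the same probability; real mentions are unlikely to be fully independent, and the function name is ours rather than part of Kindred or CancerMine.

```python
def chance_of_capture(recall: float, mentions: int) -> float:
    """Probability that at least one of `mentions` mentions is extracted,
    ASSUMING each is caught independently with probability `recall`."""
    # Probability of missing every mention is (1 - recall) ** mentions.
    return 1.0 - (1.0 - recall) ** mentions


# Recall of ~30% is the figure quoted in the text; the mention counts are arbitrary.
for n in (1, 5, 10, 20):
    print(f"{n:>2} mentions -> {chance_of_capture(0.30, n):.3f}")
```

Under that simplifying assumption, an association mentioned ten times has roughly a 97% chance of being captured at least once, which is why low per-mention recall matters less for well-studied genes than for those mentioned only once or twice.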

Figure 2. CancerMine lets you quickly check a list of genes for those associated with cancer.

The data is easy to download and view. We hope it will prove useful to the cancer bioinformatics community and serve as an example of a method for building a knowledge base that could be applied to many other fields in biology.
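
As a rough illustration of checking a gene list against downloaded data, the Python sketch below filters a tab-separated table by gene symbol. The column names (`gene`, `role`, `cancer`) and the three rows are hypothetical stand-ins made up for this example, not the actual CancerMine file layout, so adjust them to match the file you download.

```python
import csv
import io

# Toy stand-in for a downloaded table. Column names and rows are
# illustrative assumptions, NOT the real CancerMine schema or data.
TOY_TSV = (
    "gene\trole\tcancer\n"
    "ERBB2\tOncogene\tbreast cancer\n"
    "TP53\tTumour suppressor\tcolorectal cancer\n"
    "TERT\tOncogene\tglioblastoma\n"
)


def genes_with_associations(tsv_text, genes):
    """Return {gene: [(role, cancer), ...]} for the genes found in the table."""
    wanted = {g.upper() for g in genes}  # match symbols case-insensitively
    hits = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["gene"].upper() in wanted:
            hits.setdefault(row["gene"], []).append((row["role"], row["cancer"]))
    return hits


# GAPDH is absent from the toy table, so it is simply left out of the result.
result = genes_with_associations(TOY_TSV, ["erbb2", "TERT", "GAPDH"])
for gene, associations in result.items():
    print(gene, associations)
```

For a real download, the same function could be given the contents of the file instead of `TOY_TSV`, once the column names have been changed to those the file actually uses.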

Written by Kevin Sauve and Jake Lever