Phylogenetic Data Mining


Thousands of laboratories sequence the genomes of infectious disease pathogens. For COVID-19 alone, millions of sequenced cases are shared in public repositories (for example, the GISAID database contains more than 14 million COVID-19 genome sequence submissions).

The genetic information allows us to infer the genetic relationships between cases: like a paternity test for humans, this phylogenetic information lets us assess which case is a genetic descendant of which other case. The figure above shows a resulting phylogenetic tree of COVID-19 cases, which connects each case (except for the root) with its most recent genetic ancestor.
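
As a minimal sketch of the tree structure described above, each case can store a pointer to its most recent genetic ancestor, with the root pointing to nothing. The case IDs and the helper names `ancestors` and `is_descendant` are hypothetical and chosen only for illustration; they are not taken from any existing dataset or tool.

```python
# Hypothetical toy data: each case ID maps to the ID of its most recent
# (sampled) genetic ancestor. The root has no ancestor and maps to None.
parent_of = {
    "case_A": None,
    "case_B": "case_A",
    "case_C": "case_A",
    "case_D": "case_B",
    "case_E": "case_D",
}


def ancestors(case_id, parent_of):
    """Return the chain of ancestors, from the direct parent up to the root."""
    chain = []
    current = parent_of[case_id]
    while current is not None:
        chain.append(current)
        current = parent_of[current]
    return chain


def is_descendant(case_id, candidate_ancestor, parent_of):
    """True if candidate_ancestor lies on the path from case_id to the root."""
    return candidate_ancestor in ancestors(case_id, parent_of)


print(ancestors("case_E", parent_of))  # ['case_D', 'case_B', 'case_A']
```

A parent-pointer map keeps the "which case descends from which" question to a simple walk toward the root, at the cost of needing a separate index if one wants to walk downward to descendants.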

Combined with spatial and temporal information available for these cases, we are able not only to understand where cases occurred but also where they came from and where they spread to.
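
The idea of combining the tree with location and time can be sketched as follows, assuming every case record carries a location label and a sampling date. The `Case` fields, the region names, and the functions `came_from` and `spread_to` are hypothetical illustrations, not an existing API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Case:
    case_id: str
    parent_id: Optional[str]  # most recent sampled genetic ancestor
    location: str
    sampled_on: date


# Hypothetical toy data, for illustration only.
cases = {c.case_id: c for c in [
    Case("c1", None, "Region X", date(2020, 1, 5)),
    Case("c2", "c1", "Region X", date(2020, 1, 20)),
    Case("c3", "c2", "Region Y", date(2020, 2, 3)),
    Case("c4", "c3", "Region Y", date(2020, 2, 15)),
    Case("c5", "c3", "Region Z", date(2020, 3, 1)),
]}


def came_from(case_id, cases):
    """Walk toward the root and return the nearest ancestor that was
    sampled in a different location, or None if there is none."""
    start = cases[case_id]
    current = start
    while current.parent_id is not None:
        current = cases[current.parent_id]
        if current.location != start.location:
            return current
    return None


def spread_to(case_id, cases):
    """Return the set of other locations reached by descendants of a case."""
    children = defaultdict(list)
    for c in cases.values():
        if c.parent_id is not None:
            children[c.parent_id].append(c.case_id)
    own_location = cases[case_id].location
    reached = set()
    stack = list(children[case_id])
    while stack:
        cid = stack.pop()
        if cases[cid].location != own_location:
            reached.add(cases[cid].location)
        stack.extend(children[cid])
    return reached


origin = came_from("c4", cases)
print(origin.location, origin.sampled_on)
print(spread_to("c1", cases))
```

Because only a fraction of cases is sequenced, the ancestor returned by such a walk is the nearest sampled one, so the reported origin should be read as a lower bound on the path a lineage actually took.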

Prior Work

We have started exploring phylogenetic data using a visual analytics framework called PhyloView [1]. PhyloView allows users to explore and visualize both the spatiotemporal ecology of COVID-19 (on a map) and the phylogenetic relationships between cases (as a phylogenetic tree). Cases can be selected interactively on the map. This allows local communities to understand where observed cases originated and to identify ports of entry. Users can even query their own case (if it was sequenced) to understand where it originated. Such information may be instrumental in preventing future pandemics.

Research Directions

In our previous work, we obtained and curated the data. We now have a large phylogenetic tree with more than 14 million leaves (cases), each with location and time information. To make this data useful for decision-makers, a number of research challenges need to be overcome:

  • Phylogenetic data imputation: Phylogenetic data is extremely sparse. Only ~5% of observed cases are sequenced. In addition, only an unknown fraction of COVID-19 cases is actually reported. Thus, our phylogenetic data contains only 5% of an unknown fraction of all cases. As a result, most phylogenetic relationships do not point to the immediate genetic parent of a case but to a more distant ancestor (a great-great-[...]-grandparent). A big challenge is to impute the missing cases, using both the other observed cases and human mobility data.

  • Phylogenetic data representation and prediction: Can we learn the underlying ecology of infectious disease spread and model/represent it? For example, spatiotemporal graph convolutional neural networks have been very successful at predicting road traffic conditions. Can we model the flow of infections (which we know through our phylogenetic relationships) in a similar way, both to predict future cases and to impute unobserved cases?
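
The sparsity argument in the imputation challenge above can be made concrete with a back-of-the-envelope calculation. The sketch below rests on a strong simplifying assumption that is not part of the project description: every true ancestor of a case is sampled independently with the same probability. Under that assumption, the number of unsampled ancestors skipped before reaching a sampled one follows a geometric distribution. The reporting rates are hypothetical placeholders, since the true rate is unknown, and the function names are illustrative.

```python
def sampled_fraction(sequencing_rate, reporting_rate):
    """Fraction of all cases that end up in the phylogenetic data."""
    return sequencing_rate * reporting_rate


def expected_missing_ancestors(p):
    """Expected number of unsampled ancestors between a case and its nearest
    sampled ancestor, ASSUMING each ancestor is sampled independently with
    probability p (mean of a geometric distribution counting failures)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return (1 - p) / p


SEQUENCING_RATE = 0.05  # ~5% of observed cases are sequenced (from the text)

# Hypothetical reporting rates; the true value is unknown.
for reporting_rate in (0.25, 0.5, 1.0):
    p = sampled_fraction(SEQUENCING_RATE, reporting_rate)
    gap = expected_missing_ancestors(p)
    print(f"reporting={reporting_rate:.2f}  sampled={p:.4f}  "
          f"expected missing ancestors={gap:.1f}")
```

Even in the optimistic scenario where every case is reported, this simple model suggests that a typical edge in the tree hides many intermediate transmissions, which is why imputation is listed as a core challenge.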


This project is funded by NSF Grant 2109647, titled "Data-Driven Modeling to Improve Understanding of Human Behavior, Mobility, and Disease Spread." Funding is available for one PhD student (fully funded for three years).


This work will be in collaboration with the Department of Geography and Geoinformation Science at George Mason University.

[1] Le, M.T., Attaway, D., Anderson, T., Kavak, H., Roess, A. and Züfle, A., 2022, June. PhyloView: A System to Visualize the Ecology of Infectious Diseases Using Phylogenetic Data. In 2022 23rd IEEE International Conference on Mobile Data Management (MDM) (pp. 222-229). IEEE.