Mathematicians Use AI to Identify Emerging COVID-19 Variants
03/12/2024
Like many other RNA viruses, COVID-19 has a high mutation rate and a short time between generations, meaning it evolves extremely rapidly. As a result, identifying new strains that are likely to be problematic in the future requires considerable effort.
Currently, there are almost 16 million sequences available on the GISAID database (the Global Initiative on Sharing All Influenza Data), which provides access to genomic data of influenza viruses and of the virus that causes COVID-19.
Mapping the evolution and history of all COVID-19 genomes from this data currently requires extremely large amounts of computer and human time.
The described method allows such tasks to be automated. The researchers processed 5.7 million high-coverage sequences in only one to two days on a standard modern laptop, which would not be possible with existing methods. Because it needs fewer resources, the new approach puts the identification of concerning pathogen strains in the hands of more researchers.
Thomas House, Professor of Mathematical Sciences at The University of Manchester, said: “The unprecedented amount of genetic data generated during the pandemic demands improvements to our methods to analyse it thoroughly. The data is continuing to grow rapidly but without showing a benefit to curating this data, there is a risk that it will be removed or deleted.
“We know that human expert time is limited, so our approach should not replace the work of humans altogether but work alongside them to enable the job to be done much quicker and free our experts for other vital developments.”
The proposed method works by breaking down genetic sequences of the COVID-19 virus into smaller “words” (called 3-mers) and turning each sequence into a set of numbers by counting how often each word occurs. It then groups similar sequences together based on their word patterns using machine learning techniques.
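The idea of counting 3-mers and grouping by word pattern can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the article does not say which machine learning technique they used, so the greedy cosine-similarity grouping, the 0.9 threshold, and the function names `kmer_counts`, `cosine_similarity` and `group_sequences` are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt

K = 3  # word length: 3-mers, as described in the article
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 64 possible 3-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_counts(sequence):
    """Return a 64-element list counting each 3-mer in the sequence."""
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # skip words containing ambiguous bases such as N
            counts[idx] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_sequences(sequences, threshold=0.9):
    """Greedy grouping (an illustrative stand-in, not the published method):
    each sequence joins the first group whose representative vector is
    similar enough, otherwise it starts a new group."""
    groups = []  # list of (representative_vector, member_indices)
    for i, seq in enumerate(sequences):
        vec = kmer_counts(seq)
        for rep, members in groups:
            if cosine_similarity(vec, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((vec, [i]))
    return [members for _, members in groups]


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]
    print(group_sequences(toy))  # [[0, 1], [2, 3]]
```

Each sequence, however long, is reduced to just 64 numbers, which is one reason an approach of this kind can be cheap enough to run on a laptop. Real genomes would call for a proper clustering method rather than the greedy pass shown here.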