Hamming distance project
Initial problem
- Bottleneck in computation: parallelized Python code to calculate Hamming distances from a large set of gene expressions
- Pre-processing step that took many hours to complete, even when using many cores
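
To illustrate why this step is expensive, here is a minimal pure-Python sketch of an all-pairs Hamming distance calculation. It is not the project's original code; the function names and the toy sequences are assumptions made for illustration. For n sequences there are n(n-1)/2 pairs to compare, so the work grows quadratically with the size of the dataset.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(seqs):
    """Return {(i, j): distance} for every unordered pair with i < j."""
    return {
        (i, j): hamming_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


# Hypothetical toy data, not taken from the real dataset.
sequences = ["ACGT", "ACGA", "TCGA"]
distances = pairwise_distances(sequences)
```

Each comparison here walks the sequences one character at a time in the interpreter, which is the kind of inner loop that benefits most from a compiled, vectorized implementation.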
What we did
- Replaced it with a highly optimized, vectorized and parallelized C++ version
- Packaged it as a Python library, hammingdist, to fit into the existing user workflow
- Set up Continuous Integration and Continuous Delivery to provide automated testing and deployment of the code
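
The idea behind vectorization can be sketched in Python with NumPy: compare one sequence against all others in a single array operation instead of looping over characters. This is only an illustration of the technique under stated assumptions; it is not the C++ implementation and does not use the hammingdist API. The helper names `encode` and `distance_matrix` are hypothetical.

```python
import numpy as np


def encode(seqs):
    """Encode equal-length ASCII sequences as a 2-D uint8 array (one row each)."""
    return np.array(
        [np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in seqs]
    )


def distance_matrix(seqs):
    """Return the full symmetric matrix of pairwise Hamming distances."""
    codes = encode(seqs)
    n = codes.shape[0]
    dist = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        # Compare row i against every row at once, then count mismatches per row.
        dist[i] = np.count_nonzero(codes != codes[i], axis=1)
    return dist


# Hypothetical toy data, not taken from the real dataset.
matrix = distance_matrix(["ACGT", "ACGA", "TCGA"])
```

The remaining loop over rows is independent per iteration, which is what makes this kind of computation straightforward to spread across cores.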
Result
- Code now runs in minutes on a single core with the original dataset
- Code now scales using HPC resources to process hundreds of thousands of gene expressions
- Publication: arXiv:2106.07292 [q-bio.PE]
- Pipeline: CoVtRec: Topological Surveillance of Recurrent Mutations in SARS-CoV-2