Initially developed to verify the antifreeze glycoproteins in the genomes of two icefishes (see Rivera-Colón, 2023), klumpy has since expanded into a suite of tools. It can assess whether a particular region of a genome assembly is misassembled by leveraging the original data (i.e., raw reads) used in the assembly process, or it can search for a gene of interest in a set of sequences.
Depending on which component of klumpy the user wants to run, klumpy will need
klumpy can either scan a reference genome and detect regions that may have been misassembled by examining how the raw reads align to the assembly, or detect genes of interest in the form of klumps. In both cases, klumpy can generate images for easier interpretation.
Locate sequences of interest.
Using the provided query sequences, klumpy will generate a map of the k-merized query sequences onto the subject sequences, which can then be readily used to detect regions where the query sequence(s) are located.
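As a rough illustration of the idea (not klumpy's actual implementation), the sketch below k-merizes a query, records where those k-mers occur in a subject sequence, and groups nearby hits into clusters. The function names, the exact-match lookup, and the `max_gap` grouping rule are all assumptions made for this example.

```python
def kmerize(seq, k):
    """Return every overlapping k-mer in seq with its 0-based start position."""
    seq = seq.upper()
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]


def map_query_kmers(query, subject, k):
    """Return ascending 0-based subject positions where any query k-mer occurs exactly."""
    query_kmers = {kmer for kmer, _ in kmerize(query, k)}
    return [pos for kmer, pos in kmerize(subject, k) if kmer in query_kmers]


def group_hits(positions, max_gap):
    """Merge ascending hit positions into (first, last) clusters.

    A hit joins the current cluster when it lies within max_gap of the
    cluster's last hit (hypothetical rule, chosen for illustration).
    """
    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][1] <= max_gap:
            clusters[-1][1] = pos
        else:
            clusters.append([pos, pos])
    return [tuple(c) for c in clusters]


# Toy data: the query occurs in full at position 4 and partially at position 22.
query = "ACGTAC"
subject = "TTTT" + "ACGTAC" + "T" * 12 + "GTAC" + "GG"
hits = map_query_kmers(query, subject, k=4)
print(hits)                         # [4, 5, 6, 22]
print(group_hits(hits, max_gap=5))  # [(4, 6), (22, 22)]
```

Dense clusters of hits point to a likely copy of the query, while isolated hits are more likely partial or spurious matches.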
Illustrate sequences of interest.
After klumps have been generated, their locations can be illustrated in a Klump Plot.
Locate misassembled regions in a genome assembly.
Given a set of alignments onto a genome assembly, klumpy will use a sliding window approach to locate regions where the raw reads are unable to properly tile across the region. If a GFF/GTF file is provided, only regions containing genes will be examined.
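One way to picture the sliding window check is the minimal sketch below. It assumes that a window counts as "tiled" when at least one read alignment covers it end to end; that criterion, the function name, and the parameters are assumptions for illustration and may differ from what klumpy does internally.

```python
def untiled_windows(alignments, ref_length, window, step, min_spanning=1):
    """Return (start, end) windows spanned end to end by fewer than min_spanning reads.

    alignments: list of (start, end) half-open, 0-based reference intervals.
    """
    flagged = []
    for w_start in range(0, ref_length - window + 1, step):
        w_end = w_start + window
        # Count reads whose alignment covers the whole window.
        spanning = sum(
            1 for a_start, a_end in alignments
            if a_start <= w_start and a_end >= w_end
        )
        if spanning < min_spanning:
            flagged.append((w_start, w_end))
    return flagged


# Toy data: no read bridges the junction between positions 600 and 650.
reads = [(0, 500), (100, 600), (650, 1000)]
print(untiled_windows(reads, ref_length=1000, window=100, step=100))
# [(600, 700)]
```

Restricting the scan to annotated genes, as described above, would amount to evaluating only the windows that overlap a gene interval.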
Illustrate sequence alignments.
Given a SAM or BAM file and a target region, a set of alignments can be filtered and viewed for examination. Additional features such as klumps, exons, and assembly gaps can be included for a more thorough image.
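The filtering step can be sketched with nothing but the SAM text format itself. The example below is an assumption-laden illustration, not klumpy's code: it parses uncompressed SAM lines, derives each record's reference span from its CIGAR string, and keeps records overlapping a 1-based inclusive region. The helper names are hypothetical, and a BAM file would need a dedicated reader instead.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) a record covers on the reference."""
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


def filter_sam_lines(lines, chrom, start, end):
    """Yield names of mapped records on chrom overlapping [start, end] (1-based, inclusive)."""
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":  # unmapped or off-target
            continue
        a_start, a_end = reference_span(pos, cigar)
        if a_start <= end and a_end >= start:
            yield fields[0]


# Toy records: r1 and r2 overlap chr1:140-310, r3 is unmapped, r4 is on chr2.
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t300\t60\t10S40M5D10M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t0\tchr2\t120\t60\t50M\t*\t0\t0\t*\t*",
]
print(list(filter_sam_lines(sam, "chr1", 140, 310)))  # ['r1', 'r2']
```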
For a practical guide, users can view the workflow we used in our publication for all four test cases here.
klumpy is implemented in Python 3 and is released under the GNU GPL.
If you use klumpy in your work, please cite our Mol Ecol Resour manuscript:
Madrigal G, Minhas BF, Catchen JM. (2025) Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs. Mol Ecol Resour 25: e13982. DOI: 10.1111/1755-0998.13982
klumpy was developed by Gio Madrigal <gm33@illinois.edu>, with contributions from Bushra Fazal Minhas <bfazal2@illinois.edu> and Julian Catchen <jcatchen@illinois.edu>.