Moving Just Enough Deep Sequencing Data to Get the Job Done

Nicholas Mills; Ethan M Bensman; William L Poehlman; Walter B Ligon 3rd; F Alex Feltus

doi:10.1177/1177932219856359

Bioinform Biol Insights. 2019 Jun 14:13:1177932219856359. doi: 10.1177/1177932219856359. eCollection 2019.

Authors

Nicholas Mills¹, Ethan M Bensman², William L Poehlman³, Walter B Ligon 3rd¹, F Alex Feltus³

Affiliations

¹ Holcombe Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, USA.
² School of Computing, Clemson University, Clemson, SC, USA.
³ Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.

Abstract

Motivation: As the size of high-throughput DNA sequence datasets continues to grow, the cost of transferring and storing the datasets may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.

Results: Using 4 high-throughput DNA sequence datasets of differing sequencing depth from 2 species as use cases, we demonstrate the effect of processing partial datasets on the number of RNA transcripts detected by an RNA-Seq workflow. We used the number of detected transcripts to choose a cutoff point, that is, the minimal partial dataset worth transferring. We then physically transferred this minimal partial dataset and compared its transfer time with that of the full dataset; the partial transfer reduced the total transfer time by approximately 25%. These results suggest that as sequencing datasets get larger, one way to speed up analysis is to transfer only the minimal amount of data that still sufficiently detects the biological signal.
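
As an illustration of the idea described above, the following is a minimal Python sketch, not the authors' workflow. It assumes uncompressed FASTQ input with exactly 4 lines per record, takes the partial dataset as the leading reads of the file, and picks the cutoff as the smallest fraction whose detected-transcript count reaches a chosen share of the full-dataset count. The function names, the leading-reads subsetting rule, and the `min_ratio` threshold are all hypothetical; the abstract does not specify how the subset or the cutoff was defined.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group a FASTQ line stream into records.

    Assumption: each record is exactly 4 lines (header, sequence, '+',
    qualities), i.e. sequences are not wrapped across lines.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def leading_subset(lines: Iterable[str], n_reads: int) -> List[List[str]]:
    """Return the first n_reads records (hypothetical subsetting rule)."""
    return list(islice(fastq_records(lines), n_reads))


def smallest_sufficient_fraction(
    detected: Dict[float, int], min_ratio: float = 0.95
) -> float:
    """Pick the smallest dataset fraction that detects enough transcripts.

    detected maps a fraction of the reads (0 < f <= 1.0) to the number of
    transcripts the workflow detected on that partial dataset; it must
    contain the full dataset under the key 1.0. min_ratio is an assumed,
    user-chosen threshold, not a value taken from the paper.
    """
    full = detected[1.0]
    for fraction in sorted(detected):
        if detected[fraction] >= min_ratio * full:
            return fraction
    return 1.0
```

In this sketch the transcript counts would come from running the RNA-Seq workflow once per candidate fraction; only the fraction returned by `smallest_sufficient_fraction` would then need to be transferred. For paired-end data, both mate files would have to be cut at the same read count so that pairs stay aligned.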

Availability: All results were generated using public datasets from NCBI and publicly available open source software.

Keywords: FASTQ; RNA-Seq; data transfers; high-throughput DNA sequencing.