Splitting FASTQ Files with SeqKit

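When you are learning data analysis, the raw files are often very large, which makes the feedback loop painfully slow. Alignment is a typical example: on an ordinary personal computer, a single FASTQ file can take several hours to align, which seriously slows down troubleshooting and raises the cost of learning. With that in mind, we can split the FASTQ file to be analyzed into several smaller files and align only one of them. SeqKit can do this. It offers two subcommands: seqkit split, which is mainly intended for FASTA files, and seqkit split2, which handles single-end or paired-end FASTQ files.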

seqkit split2 -h # show the help text

split sequences into files by part size or number of parts

This command supports FASTA and paired- or single-end FASTQ with low memory occupation and fast speed.

The file extensions of output are automatically detected and created according to the input files. # output file names follow the input file names

Usage:
seqkit split2 [flags]

Flags:
-l, --by-length string split sequences into chunks of N bases, supports K/M/G suffix
-p, --by-part int split sequences into N parts # split into a fixed number of parts
-s, --by-size int split sequences into multi parts with N sequences # split by number of sequences per part
-f, --force overwrite output directory
-h, --help help for split2
-O, --out-dir string output directory (default value is $infile.split)
-1, --read1 string (gzipped) read1 file # first file of a paired-end run
-2, --read2 string (gzipped) read2 file # second file of a paired-end run

Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also set with environment variable SEQKIT_THREADS) (default 2)
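The help text lists -s as an alternative to -p. Here is a minimal sketch for the case where you would rather cap each part at a fixed number of reads and choose the output directory yourself. The function name split_pair_by_reads, the read count and the directory name are hypothetical; the only flags used are ones that appear in the help output above.

```shell
# Split a paired-end FASTQ into parts of N reads each.
# Hypothetical helper; it only uses flags listed in `seqkit split2 -h`.
split_pair_by_reads() {
    read1=$1      # read1 FASTQ file (-1)
    read2=$2      # read2 FASTQ file (-2)
    nreads=$3     # number of reads per part (-s)
    outdir=$4     # output directory (-O)

    # Stop early if seqkit is not installed or not on PATH.
    if ! command -v seqkit >/dev/null 2>&1; then
        echo "seqkit not found in PATH" >&2
        return 127
    fi

    # -f overwrites the output directory if it already exists.
    seqkit split2 -1 "$read1" -2 "$read2" -s "$nreads" -O "$outdir" -f
}

# Example call (file names from this post; 100000 is an arbitrary choice):
# split_pair_by_reads SRR12846241_1.fastq SRR12846241_2.fastq 100000 SRR12846241.split
```

With -s the number of output files depends on how many reads the input holds, whereas -p fixes the number of files and lets the part size vary.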

For a given pair of paired-end sequencing files, the command looks like this:

seqkit split2 -1 SRR12846241_1.fastq -2 SRR12846241_2.fastq -p 20 # split each read file into 20 parts

# The output files are written to the directory SRR12846241_1.fastq.split/
SRR12846241_1.part_001.fastq SRR12846241_1.part_011.fastq SRR12846241_2.part_001.fastq SRR12846241_2.part_011.fastq
SRR12846241_1.part_002.fastq SRR12846241_1.part_012.fastq SRR12846241_2.part_002.fastq SRR12846241_2.part_012.fastq
SRR12846241_1.part_003.fastq SRR12846241_1.part_013.fastq SRR12846241_2.part_003.fastq SRR12846241_2.part_013.fastq
SRR12846241_1.part_004.fastq SRR12846241_1.part_014.fastq SRR12846241_2.part_004.fastq SRR12846241_2.part_014.fastq
SRR12846241_1.part_005.fastq SRR12846241_1.part_015.fastq SRR12846241_2.part_005.fastq SRR12846241_2.part_015.fastq
SRR12846241_1.part_006.fastq SRR12846241_1.part_016.fastq SRR12846241_2.part_006.fastq SRR12846241_2.part_016.fastq
SRR12846241_1.part_007.fastq SRR12846241_1.part_017.fastq SRR12846241_2.part_007.fastq SRR12846241_2.part_017.fastq
SRR12846241_1.part_008.fastq SRR12846241_1.part_018.fastq SRR12846241_2.part_008.fastq SRR12846241_2.part_018.fastq
SRR12846241_1.part_009.fastq SRR12846241_1.part_019.fastq SRR12846241_2.part_009.fastq SRR12846241_2.part_019.fastq
SRR12846241_1.part_010.fastq SRR12846241_1.part_020.fastq SRR12846241_2.part_010.fastq SRR12846241_2.part_020.fastq
# This yields 20 parts per read file (40 files in total), each about 55 MB, so the downstream analysis can be run quite comfortably.
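Before aligning a single part, it can be worth confirming that the two mates of that part hold the same number of reads. This is a rough sketch, not part of the original workflow. It assumes uncompressed files with standard four-line FASTQ records (no wrapped sequence lines), and the function names count_reads and check_pair are hypothetical.

```shell
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print both read counts; succeed only if the two mates have the same count.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example call, using the first part from the listing above:
# check_pair SRR12846241_1.fastq.split/SRR12846241_1.part_001.fastq \
#            SRR12846241_1.fastq.split/SRR12846241_2.part_001.fastq
```

A matching count does not prove that the read IDs line up one to one, but a mismatch is a quick sign that something went wrong before alignment starts.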
  • Author: 括囊无誉
  • Permalink: Linux/seqkit/
  • Copyright: All posts on this blog are original work. Please credit the source when reposting!