Splitting FASTQ Files with SeqKit

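When you are learning data analysis, the raw files are often very large, which makes the feedback loop painfully slow. Alignment is a typical example: on an ordinary personal computer, a single FASTQ file can take several hours. That makes troubleshooting much harder and raises the cost of learning. To work around this, we can split the FASTQ file into several smaller files and run the alignment on just one of them. SeqKit can do this. It offers two subcommands: seqkit split, which is mainly aimed at FASTA files, and seqkit split2, which handles single-end or paired-end FASTQ files.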

seqkit split2 -h # show the help text

split sequences into files by part size or number of parts # split by chunk size or by number of parts

This command supports FASTA and paired- or single-end FASTQ with low memory occupation and fast speed. # supports FASTA and single- or paired-end FASTQ

The file extensions of output are automatically detected and created according to the input files. # output files are named automatically after the input files

Usage:
  seqkit split2 [flags]

Flags:
  -l, --by-length string   split sequences into chunks of N bases, supports K/M/G suffix 
  -p, --by-part int        split sequences into N parts # split into a fixed number of parts
  -s, --by-size int        split sequences into multi parts with N sequences # split into parts of N sequences each
  -f, --force              overwrite output directory
  -h, --help               help for split2
  -O, --out-dir string     output directory (default value is $infile.split) # output directory
  -1, --read1 string       (gzipped) read1 file # first file of a paired-end run
  -2, --read2 string       (gzipped) read2 file # second file of a paired-end run

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
      --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others. can also set with environment variable SEQKIT_THREADS) (default 2) # number of threads
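
Of the flags listed above, -s fixes the number of reads per chunk instead of the number of chunks, which can be handy when you want a test file of a predictable size. The following is only a minimal sketch: split_by_size is a helper name made up for this post, and the file names in the commented example call are placeholders. It uses only the -1, -2, -s, -O and -f flags shown in the help text.

```shell
# Split a read pair into chunks holding a fixed number of reads each.
# Arguments: read1 file, read2 file, reads per chunk, output directory.
# (split_by_size is a hypothetical helper, not part of SeqKit.)
split_by_size() {
    # Bail out early if seqkit is not installed.
    if ! command -v seqkit >/dev/null 2>&1; then
        echo "seqkit not found in PATH" >&2
        return 127
    fi
    # -s: reads per chunk, -O: output directory, -f: overwrite that directory
    seqkit split2 -1 "$1" -2 "$2" -s "$3" -O "$4" -f
}

# Example call with placeholder file names:
# split_by_size sample_1.fastq sample_2.fastq 100000 sample.split
```

Because -f overwrites the output directory, point -O at a directory you do not mind losing.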

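For the paired-end files used here, the command looks like this:

seqkit split2 -1 SRR12846241_1.fastq -2 SRR12846241_2.fastq -p 20 # split each read file into 20 parts

# The output is written to the directory SRR12846241_1.fastq.split/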
SRR12846241_1.part_001.fastq  SRR12846241_1.part_011.fastq  SRR12846241_2.part_001.fastq  SRR12846241_2.part_011.fastq
SRR12846241_1.part_002.fastq  SRR12846241_1.part_012.fastq  SRR12846241_2.part_002.fastq  SRR12846241_2.part_012.fastq
SRR12846241_1.part_003.fastq  SRR12846241_1.part_013.fastq  SRR12846241_2.part_003.fastq  SRR12846241_2.part_013.fastq
SRR12846241_1.part_004.fastq  SRR12846241_1.part_014.fastq  SRR12846241_2.part_004.fastq  SRR12846241_2.part_014.fastq
SRR12846241_1.part_005.fastq  SRR12846241_1.part_015.fastq  SRR12846241_2.part_005.fastq  SRR12846241_2.part_015.fastq
SRR12846241_1.part_006.fastq  SRR12846241_1.part_016.fastq  SRR12846241_2.part_006.fastq  SRR12846241_2.part_016.fastq
SRR12846241_1.part_007.fastq  SRR12846241_1.part_017.fastq  SRR12846241_2.part_007.fastq  SRR12846241_2.part_017.fastq
SRR12846241_1.part_008.fastq  SRR12846241_1.part_018.fastq  SRR12846241_2.part_008.fastq  SRR12846241_2.part_018.fastq
SRR12846241_1.part_009.fastq  SRR12846241_1.part_019.fastq  SRR12846241_2.part_009.fastq  SRR12846241_2.part_019.fastq
SRR12846241_1.part_010.fastq  SRR12846241_1.part_020.fastq  SRR12846241_2.part_010.fastq  SRR12846241_2.part_020.fastq
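# This yields 20 parts per read file (40 files in total), each about 55 MB, so the downstream analysis can be completed much more easily.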
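
Before aligning a single chunk, it may be worth checking that the two mates of that chunk still contain the same number of reads. This is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record. count_reads is a helper name made up here, not a SeqKit command. The paths follow the output listing above.

```shell
# Count the reads in an uncompressed FASTQ file.
# Assumes exactly 4 lines per record (no wrapped sequence or quality lines).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Compare the first chunk of each mate, if both files are present.
dir=SRR12846241_1.fastq.split
r1="$dir/SRR12846241_1.part_001.fastq"
r2="$dir/SRR12846241_2.part_001.fastq"
if [ -f "$r1" ] && [ -f "$r2" ]; then
    if [ "$(count_reads "$r1")" -eq "$(count_reads "$r2")" ]; then
        echo "part_001: both mates have the same number of reads"
    else
        echo "part_001: read counts differ" >&2
    fi
fi
```

If the counts match, the pair part_001 of _1 and _2 can be passed to the aligner in place of the full files.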