Roche 07789688001 Sequencing Solutions Heat-Seq HSQutils User manual

For Research Use Only.
Not for use in diagnostic procedures.
HEAT-Seq HSQutils Software
User’s Guide
Version 1.3
HEAT-Seq HSQutils Software User’s Guide, v1.3
Copyright
© 2016-2019 Roche Sequencing Solutions, Inc. All Rights Reserved.
Roche Molecular Systems, Inc.
1080 US Highway 202 South
Branchburg, NJ 08876 USA
Editions
Version 1.0, July 2016. Version 1.2, December 2017. Version 1.3, June 2019.
Restrictions and Liabilities
This document is provided “as is” and Roche Sequencing Solutions, Inc. (Roche) assumes no responsibility for any typographical, technical,
or other inaccuracies in this document. Roche reserves the right to periodically change information that is contained in this document;
however, Roche makes no commitment to provide any such changes, updates, enhancements, or other additions to this document to you in
a timely manner or at all.
OTHER THAN THE LIMITED WARRANTY CONTAINED IN THIS USER GUIDE, ROCHE MAKES NO REPRESENTATIONS, WARRANTIES,
CONDITIONS OR COVENANTS, EITHER EXPRESS OR IMPLIED (INCLUDING WITHOUT LIMITATION, ANY EXPRESS OR IMPLIED
WARRANTIES OR CONDITIONS OF FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, MERCHANTABILITY, DURABILITY,
TITLE, OR RELATED TO THE PERFORMANCE OR NON-PERFORMANCE OF ANY PRODUCT REFERENCED HEREIN OR PERFORMANCE OF
ANY SERVICES REFERENCED HEREIN).
This document might contain references to third party sources of information, hardware or software, products, or services and/or third party
web sites (collectively the “Third-Party Information”). Roche does not control, and is not responsible for, any Third-Party Information,
including, without limitation the content, accuracy, copyright compliance, compatibility, performance, trustworthiness, legality, decency,
links, or any other aspect of Third-Party Information. The inclusion of Third-Party Information in this document does not imply endorsement
by Roche of the Third-Party Information or the third party in any way.
Roche does not in any way guarantee or represent that you will obtain satisfactory results from using Roche products as described herein.
The only warranties provided to you are included in the Limited Warranty enclosed with this guide. You assume all risk in connection with
your use of Roche products.
Roche is not responsible nor will be liable in any way for your use of any software or equipment that is not supplied by Roche in connection
with your use of Roche products.
Conditions of Use
Roche does not guarantee the results achieved from experiments performed using HSQutils software. You are responsible for
understanding and using the software described within.
Use Restrictions
For patent license limitations for individual products please refer to: www.technical-support.roche.com.
Table of Contents
Preface
  Intended Use
    HSQutils
  Contact Information
    Software and User Guide Updates
    Technical Support
    Manufacturer and Distribution
    Reporting Issues
  Conventions Used in This Manual
    Symbols
    Text
Chapter 1. Before You Begin
  What is HSQutils software?
  System Requirements
  Software Command Line Conventions
Chapter 2. Installing HSQutils
  Obtaining HSQutils software
  Installing HSQutils software
  Running HSQutils software
Chapter 3. Complete Analysis Workflow
  Step 1. Index a Reference Genome
  Step 2. Examine Sequence Read Quality
  Step 3. Filter on Sequence Read Quality
  Step 4. Trim Sequence Reads to Remove UID and Primers
  Step 5. Map Trimmed Reads
  Step 6. Remove Duplicates and Precisely Trim Primers
  Step 7. Variant Calling and Filtering
Chapter 4. HSQutils trim
  HSQutils trim description
  HSQutils trim options
  HSQutils trim output files
Chapter 5. HSQutils dedup
  HSQutils dedup description
  HSQutils dedup options
  HSQutils dedup output
    Probe details file
    HSQutils dedup summary file
References
Glossary
Appendix A. Frequently Asked Questions
Appendix B. Probe Information File
Appendix C. Troubleshooting
Appendix D. Limited Warranty
Preface
Intended Use
For Research Use Only. Not for use in diagnostic procedures.
HSQutils
The HSQutils software is a utility package intended to be used in an analysis workflow which contains a separately
sourced sequencing read mapper and variant caller.
Contact Information
Software and User Guide Updates
Roche provides updates to HSQutils software at the website below. Check this website periodically for important
updates.
github.com/NimbleGen/bioinformatics/releases
The most recent version of this user guide and other information on the HEAT-Seq product family can be found at
the following location:
sequencing.roche.com/support.html
Technical Support
If you have technical questions, contact your local Roche Technical Support. Go to
sequencing.roche.com/support.html for contact information.
Support cannot be provided for modified HSQutils source code.
Manufacturer and Distribution
Manufacturer
Roche Molecular Systems, Inc.
Branchburg, NJ USA
Distribution
Roche Diagnostics GmbH
Mannheim, Germany
Distribution in USA
Roche Diagnostics Corporation
Indianapolis, IN USA
Reporting Issues
If you experience issues while using HSQutils software, send information regarding the nature of the problem to
Roche by contacting your local Roche Technical Support. Go to
sequencing.roche.com/support.html for contact information.
Describe in as much detail as possible the nature of the problem and the steps you took to produce the
problem, including command line parameters used.
The HSQutils trim and HSQutils dedup applications each generate a .log file every time they are run. The contents
of the log file may help you solve any problems you have with the software. Save all log files
so that they are available to submit to Roche Technical Support when you report your problem.
Report the version of HSQutils being used. Modified source code cannot be supported.
If possible, save all files related to the problem. These files are sometimes too large to be attached and sent via email.
Archive a copy of these files so that they are available for Roche Technical Support if needed. Roche Technical
Support can provide instructions on how to transfer the files via secure FTP if they are needed for troubleshooting.
Conventions Used in This Manual
Symbols
Symbol
Description
Important Note: Information critical to the success of the procedure or use of the product. Failure to
follow these instructions could result in compromised data.
Information Note: Designates a note that provides additional information concerning the current
topic or procedure.
Text
Convention: Description
Numbered listing: Indicates steps in a procedure that must be performed in the order listed.
Italic type, blue: Identifies a resource in a different area of this manual or on a web site.
Italic type: Identifies the names of dialog boxes, windows, tabs, panels, views, or message boxes in the software.
Bold type: Identifies names of menus and controls (buttons, checkboxes, etc.) in the software.
Chapter 1. Before You Begin
This User’s Guide describes the use of the HSQutils software utilities and how to combine the HSQutils software
utilities with a mapper and variant caller to analyze sequencing reads generated during a Roche HEAT-Seq capture
experiment.
What is HSQutils software?
HSQutils is a software package consisting of two tools, HSQutils trim and HSQutils dedup, which help analyze
sequencing reads generated during a Roche HEAT-Seq capture experiment.
The HSQutils trim tool is applied to sequencing reads in FASTQ format to trim UID and probe primer sequences. It
provides new FASTQ files ready for mapping using BWA or another mapper, which in turn produces a SAM or
BAM file as output.
The HSQutils dedup tool processes the output of the read mapper (the BAM file) to remove amplification duplicates
and precisely trim probe primer sequences from sequencing reads. Precise primer trimming is necessary to avoid
false negative variant calls in bases overlapping a primer. The ability to remove amplification duplicates from
HEAT-Seq reads is unique among amplification-based target enrichment technologies and helps to avoid allele biases
which may be introduced during amplification steps.
Figure 1: Analysis workflow incorporating HSQutils as a part of an analysis solution for sequencing reads generated during a
HEAT-Seq target enrichment experiment. Steps performed using HSQutils are highlighted in blue. Details and relevant external
references for each of these steps are provided later in this document.
System Requirements
Linux OS (tested on Red Hat Enterprise Linux 6, 64-bit architecture)
Java 8+ (see www.java.com/en/download/index.jsp)
At least 8 GB RAM (tested on 30 million reads with 8 GB RAM on a dedicated system; additional RAM may be
required, or may improve processing speed, for larger numbers of sequencing reads)
100 GB hard disk space (requirements vary depending on sequencing volume per sample)
Software Command Line Conventions
Example software command line entries are included throughout this User’s Guide, using standard conventions. Use
the following rules to translate the examples into actual command line instructions for use with your own data:
Replace the text ‘SAMPLE’ in the example commands with a unique sample name.
Replace the text ‘DESIGN’ in the example commands with the appropriate design name.
Replace the text ‘ref.fa’ with the actual file name of a reference genome (e.g. ‘hg19.fa’).
Replace ‘/path/to/…’ with a valid path on your system.
It is assumed that all SAMPLE and DESIGN files are located in the current directory, and that this will
also be the location of the SAMPLE output files and report files.
Type the entire command shown for each step on a single line, despite the way it appears on the printed
page.
There should be no spaces within a file path, but there must be spaces before and after each option.
Tip: use the Tab key to auto-complete paths and file names while typing.
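The substitution rules above can be sketched as a small shell helper. This is only an illustration and is not part of HSQutils: the function name fill_template and the example names NA12878 and MyDesign are hypothetical, and the helper assumes that sample and design names contain only letters, digits, underscores, and hyphens. Path placeholders still have to be replaced by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of HSQutils): replace the SAMPLE and DESIGN
# placeholders in an example command and print the result on a single line.
# Assumes names contain only letters, digits, underscores, and hyphens.
fill_template() {
    # $1 = sample name, $2 = design name, $3 = command template
    printf '%s\n' "$3" | sed -e "s/SAMPLE/$1/g" -e "s/DESIGN/$2/g"
}

# Print the trim command for a hypothetical sample and design.
fill_template NA12878 MyDesign \
    'java -Xmx3g -Xms3g -jar /path/to/hsqutils.jar trim --r1 SAMPLE_R1_quality_filtered.fastq --r2 SAMPLE_R2_quality_filtered.fastq --probe DESIGN_probe_info.txt'
```

Printing the command before running it makes it easy to confirm that it is on one line and that no placeholder text remains.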
Chapter 2. Installing HSQutils
This chapter describes how to obtain and install HSQutils software.
Obtaining HSQutils software
The HSQutils software is available for download at the following location:
github.com/NimbleGen/bioinformatics/releases
See the Downloads section on the GitHub page for the most current version of HSQutils and download the ZIP file.
The ZIP file contains a pre-compiled executable JAR file.
Installing HSQutils software
Decompress (unzip) the ZIP file downloaded from GitHub using a decompression utility. In Linux, the following
command line can be used if the unzip utility is installed (where VERSION is the version string in the ZIP file name):
> unzip hsqutils_VERSION.zip
Copy the HSQutils JAR file (hsqutils.jar) into your preferred local directory.
The pre-compiled Java JAR file is ready to use. No additional installation is required.
Running HSQutils software
Invoke HSQutils using the following command line:
> java -Xmx3g -Xms3g -jar hsqutils.jar
If the default system version of Java is not 8+ then you will have to provide the full path to the Java version 8+
installation, similar to:
> /path/to/version8/java -Xmx3g -Xms3g -jar hsqutils.jar
The -Xmx and -Xms options tell Java the maximum and initial heap sizes. Because Java needs additional
overhead beyond the 3 GB heap given to the utility, a minimum of 8 GB of memory is required in total.
When the HSQutils jar is invoked in this way with no parameters, names and descriptions of all utilities available in
the HSQutils software package will be listed.
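Before invoking the JAR, it can help to confirm that the Java on your path meets the 8+ requirement. The following is a minimal sketch, not part of HSQutils: the function name java_major is hypothetical, and it assumes the usual version string layouts ("1.8.0_202" for Java 8 and earlier, "11.0.2" for Java 9 and later).

```shell
#!/bin/sh
# Hypothetical helper (not part of HSQutils): print the major Java version
# for a version string. Java 8 and earlier report "1.x" (e.g. 1.8.0_202);
# Java 9 and later report the major version first (e.g. 11.0.2).
java_major() {
    case "$1" in
        1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;
        *)   printf '%s\n' "${1%%[.+-]*}" ;;
    esac
}

# Usage sketch (assumes java is on the PATH; java -version writes to stderr):
#   ver=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
#   [ "$(java_major "$ver")" -ge 8 ] || echo "Java 8+ is required" >&2
```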
Chapter 3. Complete Analysis Workflow
This chapter describes how to use HSQutils software tools as part of a complete analysis workflow to map
sequencing reads and call variants (see Figure 2). Specific tools for sequencing read quality filtering, mapping, and
variant calling are demonstrated here (see list in Table 1), but other tools may work if they use the standard
SAM/BAM alignment format.
Package (version)     Tool                        Function as used in this document
BCFtools (1.2)        call                        Variant calling and filtering.
                      filter                      Variant filtering.
BWA (0.7.12-r1039)    index                       Generate an indexed genome from FASTA sequence.
                      mem                         Map sequencing reads to an indexed genome.
FastQC (0.11.3)       fastqc                      Assess sequencing read quality (per-base quality plot).
Picard (1.134)        CreateSequenceDictionary    Generate a sequence dictionary (.dict) for the reference genome.
SAMtools (1.2)        faidx                       FASTA indexing.
                      mpileup                     Call variants on a BAM file.
                      view                        View or extract header or read data.
Trimmomatic (0.33)    trimmomatic                 Raw sequencing read quality filtering.
Table 1: Third-party data analysis tools used in this chapter. The examples described in this document were tested using
the software versions listed in parentheses. See links in References for installation instructions and explanations of command
options. These tools were tested on a Red Hat Enterprise Linux 64-bit system.
Figure 2: HSQutils provides essential parts of the analysis of sequencing reads generated during a HEAT-Seq target
enrichment experiment. Steps performed using HSQutils are highlighted in blue. The use of other tools listed here is described in this
chapter, but these tools were not developed by Roche. Substitution with similar tools may work but they have not been tested by Roche.
Details and relevant external references for each of these steps are provided in this chapter.
Step 1. Index a Reference Genome
Index the FASTA formatted genome sequence with chromosomes in karyotype sort order using UCSC Genome
Browser (genome.ucsc.edu) naming conventions, e.g. chr1, chr2, ..., chr10, chr11, … chrX, chrY, chrM. Unassembled
and haplotype-specific sequences may be optionally omitted.
[…]e same reference genome can use the […]. […]s such as 'ref.fa' in the command line examples. […] of the subsequent step[…]. An indexed reference genome consists of the genome FASTA file and all […]
Package     Tool(s) Used
BWA         index
SAMtools    faidx
Picard      CreateSequenceDictionary
Input(s)
ref.fa
Output(s)
ref.fa {indexed}
ref.fa = unmodified reference genome
ref.fa.amb, ref.fa.ann, ref.fa.bwt, ref.fa.pac, ref.fa.sa = reference genome index files
ref.fa.fai = FASTA index
ref.dict = reference sequence dictionary
Text in blue should be replaced as appropriate for your system and sample, everything else
should be typed as shown but on one line.
Generate Reference Genome Index
/path/to/bwa index -a bwtsw /path/to/ref.fa
Generate FASTA Index (needed only if using tools which require a FASTA index)
/path/to/samtools faidx /path/to/ref.fa
Generate Sequence Dictionary (needed only if using tools which require a DICT file)
java -Xmx3g -Xms3g -jar /path/to/Picard/picard.jar
CreateSequenceDictionary REFERENCE=/path/to/ref.fa OUTPUT=ref.dict
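After indexing, a quick check that all expected BWA index files exist can catch an interrupted run before mapping starts. The sketch below is not part of HSQutils or BWA: the function name missing_index_files is hypothetical, and the file list is taken from the Output(s) listed above. The optional .fai and .dict files are deliberately not checked.

```shell
#!/bin/sh
# Hypothetical helper (not part of HSQutils or BWA): print every file of an
# indexed reference genome that is missing. Checks the FASTA file itself and
# the five BWA index files listed under Output(s) for Step 1.
missing_index_files() {
    ref=$1
    for f in "$ref" "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

# Usage sketch:
#   missing_index_files /path/to/ref.fa
# No output means that all six files are present.
```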
Step 2. Examine Sequence Read Quality
Use fastqc to generate a per-base sequence quality plot and report based on the raw sequencing reads. The
fastqc tool can work on either compressed or uncompressed FASTQ files. It’s important to review sequencing
quality as it may impact the ability to map reads and also may impact variant calling.
Package     Tool(s) Used
FastQC      fastqc
Input(s)
SAMPLE_R1.fastq
SAMPLE_R2.fastq
Output(s)
SAMPLE_R1_fastqc.zip
SAMPLE_R2_fastqc.zip
Use FastQC to assess per-base quality. Text in blue should be replaced as appropriate for
your system and sample, everything else should be typed as shown but on one line.
/path/to/fastqc --outdir . --nogroup SAMPLE_R1.fastq SAMPLE_R2.fastq
A .zip file is created for each SAMPLE input file in the current directory. An HTML report named fastqc_report.html
is created that is viewable in a web browser. The authors of FastQC have posted the following examples of the
QC report for a good and a bad sequencing run:
www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
If poor quality sequencing is observed, analysis may proceed with the knowledge that SNP call reliability may be in
question. The presence of excess sequencing error may result in a poor rate of reads mapping to the genome and/or
false positive SNP calls. The sequence quality trimming performed by Trimmomatic below will be effective in
mitigating this risk to some degree, but resequencing will be needed in the worst cases.
Step 3. Filter on Sequence Read Quality
Use Trimmomatic to remove poor quality sequencing read bases from raw sequencing reads. Trimmomatic will
accept gzip-compressed or uncompressed FASTQ files. Removing poor quality bases and reads is
important because it maximizes the ability to map reads to the genome and reduces the risk of false positive
variants due to sequencing error.
Package        Tool(s) Used
Trimmomatic
Input(s)
SAMPLE_R1.fastq.gz
SAMPLE_R2.fastq.gz
Output(s)
SAMPLE_R1_quality_filtered.fastq
SAMPLE_R1_unpaired.fastq
SAMPLE_R2_quality_filtered.fastq
SAMPLE_R2_unpaired.fastq
Quality trim reads using Trimmomatic. Text in blue should be replaced as appropriate for
your system and sample, everything else should be typed as shown but on one line.
java -Xms3g -Xmx3g -jar /path/to/trimmomatic.jar PE -phred33
SAMPLE_R1.fastq SAMPLE_R2.fastq SAMPLE_R1_quality_filtered.fastq
SAMPLE_R1_unpaired.fastq SAMPLE_R2_quality_filtered.fastq
SAMPLE_R2_unpaired.fastq TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:50
The Trimmomatic application will produce four files. The
SAMPLE_R1_quality_filtered.fastq and
SAMPLE_R2_quality_filtered.fastq contain the reads that are still paired after quality filtering. The
SAMPLE_R1_unpaired.fastq and SAMPLE_R2_unpaired.fastq contain singleton reads, where the other mate
of the pair was discarded because of poor quality or because the remaining read was shorter than the minimum
length of 50 bp. In this workflow, unpaired reads are discarded. If you want to increase the percentage of passing
reads, adjust the Trimmomatic parameters, especially MINLEN, which is the required minimum length after
trimming. For more details on Trimmomatic parameters, see the authors' website listed in the References section of
this document.
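To judge whether the Trimmomatic parameters need adjusting, the percentage of read pairs that survive filtering can be estimated by counting FASTQ records. This is a minimal sketch, not part of HSQutils or Trimmomatic: the function names count_reads and percent_kept are hypothetical, and the code assumes uncompressed FASTQ files with exactly four lines per record.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HSQutils or Trimmomatic).

# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Print the whole-number percentage (rounded down) of reads kept.
# $1 = raw R1 FASTQ file, $2 = quality filtered R1 FASTQ file
percent_kept() {
    raw=$(count_reads "$1")
    kept=$(count_reads "$2")
    [ "$raw" -gt 0 ] || { echo "no reads found in $1" >&2; return 1; }
    echo $(( 100 * kept / raw ))
}

# Usage sketch:
#   percent_kept SAMPLE_R1.fastq SAMPLE_R1_quality_filtered.fastq
```

Because the paired output files hold the same number of reads, comparing the R1 files alone is enough for this estimate.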
Step 4. Trim Sequence Reads to Remove UID and Primers
The HSQutils trim application is used to trim the quality filtered reads appropriately so that the UID bases and
HEAT-Seq primer bases are completely removed prior to mapping against the reference genome. If the quality
filtering step was skipped, raw reads may be provided as input, otherwise the quality filtered FASTQ files are
provided as input. The probe information file provided by Roche is also required.
Package     Tool(s) Used
HSQutils    trim
Input(s)
SAMPLE_R1_quality_filtered.fastq
SAMPLE_R2_quality_filtered.fastq
DESIGN_probe_info.txt
Output(s)
trimmed_SAMPLE_R1_quality_filtered.fastq
trimmed_SAMPLE_R2_quality_filtered.fastq
HSQutils_trim_[timestamp].log
Trim reads using HSQutils trim. Text in blue should be replaced as appropriate for your
system and sample, everything else should be typed as shown but on one line.
java -Xmx3g -Xms3g -jar /path/to/hsqutils.jar trim
--r1 SAMPLE_R1_quality_filtered.fastq
--r2 SAMPLE_R2_quality_filtered.fastq
--probe DESIGN_probe_info.txt
Output from the HSQutils trim tool is a new pair of FASTQ files ready for mapping, as well as a log file.
Step 5. Map Trimmed Reads
The trimmed reads can now be mapped to the indexed reference genome. Although BWA is the only mapper that is
demonstrated here, other mappers which accept Illumina’s paired end FASTQ-formatted reads and which generate a
properly formatted SAM or BAM file may work.
Package     Tool(s) Used
BWA         mem
SAMtools    view
Input(s)
trimmed_SAMPLE_R1_quality_filtered.fastq
trimmed_SAMPLE_R2_quality_filtered.fastq
ref.fa {indexed}
Output(s)
SAMPLE_initial_mapping.bam
Map trimmed reads and convert to BAM format. Text in blue should be replaced as
appropriate for your system and sample, everything else should be typed as shown but on
one line.
/path/to/bwa mem -R
"@RG\tID:1\tDS:HEATSEQ\tPL:ILLUMINA\tLB:LIBRARY\tSM:SAMPLE"
/path/to/ref.fa -M trimmed_SAMPLE_R1_quality_filtered.fastq
trimmed_SAMPLE_R2_quality_filtered.fastq | /path/to/samtools view -Sb
- > SAMPLE_initial_mapping.bam
The command line presented here consists of two commands separated by a pipe character (“|”). The first
command invokes the BWA mapper. The second command invokes SAMtools to convert the output of BWA
directly to BAM format, eliminating the need to generate an intermediate SAM file. The information with the “-R”
option supplied to the BWA mapper becomes the read group header in the BAM file. This is required by some tools
such as GATK, but isn’t required by HSQutils. It’s included here for compatibility with some commonly used
analysis tools.
The generated output file “SAMPLE_initial_mapping.bam” is a compressed BAM file containing alignments
of the trimmed reads to the genome. This file is provided as input to the HSQutils dedup tool.
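When many samples are mapped, the read group string passed with -R has to be rebuilt for each sample name. A minimal sketch follows; the function name read_group is hypothetical and not part of BWA or HSQutils, and the field values other than SM are copied unchanged from the example command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of BWA or HSQutils): build the read group
# string for the -R option. The output contains a literal backslash followed
# by "t" between fields, exactly as typed in the example command.
read_group() {
    # $1 = sample name to place in the SM field
    printf '@RG\\tID:1\\tDS:HEATSEQ\\tPL:ILLUMINA\\tLB:LIBRARY\\tSM:%s\n' "$1"
}

# Usage sketch:
#   /path/to/bwa mem -R "$(read_group SAMPLE)" /path/to/ref.fa -M ...
```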
Step 6. Remove Duplicates and Precisely Trim Primers
The HSQutils dedup application is used to identify duplicates (reads which derive from the same capture event)
and correct the mapping coordinates to fully match the capture target region of the probes while excluding any
HEAT-Seq primer sequences. The HSQutils dedup tool accepts BAM files generated by mapping reads trimmed
by the HSQutils trim tool. If the input BAM file is not sorted or indexed, HSQutils dedup will perform
these steps in the indicated temporary directory (default is /tmp). The FASTQ files provided as input to the
HSQutils dedup tool should be the same quality filtered FASTQ files that were provided as input to the
HSQutils trim tool.
Package     Tool(s) Used
HSQutils    dedup
Input(s)
SAMPLE_initial_mapping.bam
SAMPLE_R1_quality_filtered.fastq
SAMPLE_R2_quality_filtered.fastq
DESIGN_probe_info.txt
Output(s)
SAMPLE_dedup.bam
HSQutils_dedup_summary.txt
probe_details.txt
HSQutils_dedup_[timestamp].log
Remove duplicates and perform precise primer trimming. Text in blue should be replaced
as appropriate for your system and sample, everything else should be typed as shown but
on one line.
java -Xmx3g -Xms3g -jar hsqutils.jar dedup
--r1 SAMPLE_R1_quality_filtered.fastq
--r2 SAMPLE_R2_quality_filtered.fastq
--probe DESIGN_probe_info.txt
--inputBam SAMPLE_initial_mapping.bam --outputBamFileName
SAMPLE_dedup.bam
The generated output BAM file “SAMPLE_dedup.bam” will contain one representative read pair for each capture
event and each read pair will be trimmed to map to only the capture target region of each probe. This BAM file can
be used for variant calling and other downstream target enrichment or sequencing analysis applications which
accept a BAM file as input.
See the output summary file (HSQutils_dedup_summary.txt) for a summary of HEAT-Seq experiment performance.
The summary report metrics are described in more detail in the HSQutils dedup chapter.
A log file is output by HSQutils dedup. A message near the end of the log file indicates if the tool completed
successfully. If there was a runtime error during data processing, the HSQutils_dedup_summary.txt file will not
exist.
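In a scripted workflow, the presence of the summary file can serve as a simple success check before variant calling. The sketch below is not part of HSQutils: the function name dedup_succeeded is hypothetical, and it assumes the outputs are written to the directory you pass in. The additional requirement that the output BAM file be non-empty is an assumption of this example, not a documented guarantee.

```shell
#!/bin/sh
# Hypothetical helper (not part of HSQutils): succeed only if the dedup
# summary file exists and the output BAM file is present and non-empty.
# $1 = directory holding the dedup outputs (assumed: the working directory)
# $2 = file name that was given to --outputBamFileName
dedup_succeeded() {
    [ -e "$1/HSQutils_dedup_summary.txt" ] && [ -s "$1/$2" ]
}

# Usage sketch:
#   dedup_succeeded . SAMPLE_dedup.bam || echo "dedup failed; check the log file" >&2
```

The log file remains the authoritative record; this check only flags runs that need a closer look.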
Step 7. Variant Calling and Filtering
Variants can be called using the BAM file produced by the HSQutils dedup tool. Here we use the SAMtools
mpileup command to call germline variants (SNPs), and BCFtools call and filter commands to filter the
variants by read depth. Other variant calling software may work as long as it accepts a BAM file as input.
Non-diploid, low frequency, and somatic variants must be called using other methods not described in this document.
Package     Tool(s) Used
SAMtools    mpileup
BCFtools    call
BCFtools    filter
Input(s)
ref.fa {indexed}
SAMPLE_dedup.bam {indexed}
DESIGN_capture_targets.bed
Output(s)
SAMPLE_filtered_variants.vcf
Text in blue should be replaced as appropriate for your system and sample; everything
else should be typed as shown, but on one line.
Call Genomic Variants
/path/to/samtools mpileup -Bugf /path/to/ref.fa SAMPLE_dedup.bam -l
DESIGN_capture_targets.bed | /path/to/bcftools call -vm -O u -o
SAMPLE_samtools_raw_variants.bcf
Filter Raw Variants
/path/to/bcftools filter -i 'MQ>=30 && DP>=10 && DP<=50000'
SAMPLE_samtools_raw_variants.bcf > SAMPLE_filtered_variants.vcf
Note that the SAMtools mpileup option -l is the letter ‘l’ (EL), not the number one (‘1’). This SAMtools
mpileup option results in the tool making variant calls only in the given list of regions (BED format).
Variant filtering options shown here may need to be optimized for your specific research use. Here, the BCFtools
filter command is being used to filter variant calls to remove low confidence calls. Mapping quality is filtered by
‘MQ>=30’ in the example. Reads which don’t map uniquely in the genome automatically receive a mapping quality
score of 0, so this filter will remove from variant calling any non-unique regions which may have been
sequenced. As the minimum mapping quality is increased, the number of variants called decreases. Also in the
example, ‘DP>=10’ and ‘DP<=50000’ are filtering on minimum and maximum read depth, respectively. As you
increase the minimum read depth you will start to lose true variant calls with low coverage, but variants with a depth
of fewer than five reads are generally considered unreliable due to sequencing error.
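To make the filter expression concrete, the sketch below applies the same three thresholds to a single pair of values. It is an illustration of the logic only, not a replacement for the BCFtools filter command. The function name passes_filter is hypothetical, and both values are assumed to be integers.

```shell
# Sketch: mirror the expression 'MQ>=30 && DP>=10 && DP<=50000'
# for one variant record, given its MQ and DP values as integers.
passes_filter() {
    mq="$1"
    dp="$2"
    [ "$mq" -ge 30 ] && [ "$dp" -ge 10 ] && [ "$dp" -le 50000 ]
}

# Example use:
# passes_filter 60 25 && echo "kept"       # MQ and DP within range
# passes_filter 0 25  || echo "filtered"   # non-unique mapping, MQ of 0
# passes_filter 60 5  || echo "filtered"   # depth below the minimum
```

Changing any of the three thresholds in the BCFtools expression changes the kept set in the same way as changing the matching number in this function.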
BCFtools creates a VCF file according to the VCF file format specification version 4.2. Older versions of GATK
appear to work only with v4.1 VCF files. If necessary, use Picard VcfFormatConverter to reformat the v4.2
VCF file to appear as v4.1 (command not shown).
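Before deciding whether a conversion is needed, the VCF version can be read from the file itself. The VCF specification requires the first line of the file to be a fileformat header line such as ##fileformat=VCFv4.2. The sketch below prints the version from that line; the function name vcf_version is hypothetical, and an uncompressed VCF file is assumed.

```shell
# Sketch: print the VCF format version (for example "4.2") declared on the
# first header line of an uncompressed VCF file.
vcf_version() {
    head -n 1 "$1" | sed -n 's/^##fileformat=VCFv//p'
}

# Example use:
# vcf_version SAMPLE_filtered_variants.vcf
```

If the function prints nothing, the first line is not a fileformat header and the file should be inspected by hand.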
Additional downstream variant analysis is not covered in this document, but may consist of comparison against
known variants for a sample, comparison of SNP calls against dbSNP, variant classification, and variant effect
analysis.
low frequency variants from heterogeneous
must be performed using alternate analysis methods. These alternate methods
accept BAM files as input for variant calling but are not described here.
mapping may result in missing or amplification-biased variant calls.
Chapter 4. HSQutils trim
This chapter describes the HSQutils trim tool, including its purpose, recommended usage, available options, and
output files.
mand line examples.
HSQutils trim description
The HSQutils trim tool removes all HEAT-Seq primer bases and UID bases from the reads prior to mapping,
which improves genomic mapping efficiency. The output of the HSQutils trim tool is a new pair of FASTQ
files ready for mapping against a reference genome.
HSQutils assumes FASTQ files with a base quality ASCII encoding of Phred+33.
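In Phred+33 encoding, each quality character stores the base quality score as its ASCII code minus 33. The sketch below decodes a single quality character; the function name phred33_score is hypothetical and is shown only to illustrate the encoding that HSQutils assumes.

```shell
# Sketch: convert one Phred+33 quality character to its numeric quality score.
phred33_score() {
    # A leading quote makes printf output the character's numeric code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Example use:
# phred33_score I     # 'I' has ASCII code 73, so the score is 40
# phred33_score '!'   # '!' has ASCII code 33, so the score is 0
```

FASTQ files that use a different offset would decode to shifted scores under this rule, which is why the encoding matters.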
An example command line is shown below:
java -Xmx3g -Xms3g -jar /path/to/hsqutils.jar trim --r1
SAMPLE_R1.fastq.gz --r2 SAMPLE_R2.fastq.gz --probe DESIGN_probe_info.txt
--outputPrefix SAMPLE
HSQutils trim options
Option argument         Command line option   Description
Version flag            --version             Output the version and quit.
Read 1 FASTQ file       --r1                  Path to read 1 input FASTQ file (required). Can be uncompressed or gzip compressed (.gz).
Read 2 FASTQ file       --r2                  Path to read 2 input FASTQ file (required). Can be uncompressed or gzip compressed (.gz).
Probe Information File  --probe               Roche probe info file (required).
Output Directory        --outputDir           Directory for output files (optional). Defaults to current directory.
Output File Prefix      --outputPrefix        Text prefix for output file names (optional). Default behavior adds “trimmed_” to the beginning of the input FASTQ filenames. If a prefix is provided, an underscore is used to separate the prefix from the rest of the filename.
HSQutils trim output files
The HSQutils trim tool produces the following files:
Output filename                                      Description
[Output File Prefix]_trimmed_[FASTQ R1 Filename]     A trimmed version of the Read 1 FASTQ file.
[Output File Prefix]_trimmed_[FASTQ R2 Filename]     A trimmed version of the Read 2 FASTQ file.
[Output File Prefix]_HSQutils_trim_[timestamp].log   A log of the trimming application events.
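The naming pattern for the trimmed FASTQ files can be sketched as a small helper, which is useful when a script needs to predict the file names to pass to the mapper. The function name trimmed_name is hypothetical. Using only the base name of the input path and keeping its extension unchanged are assumptions based on the pattern listed above.

```shell
# Sketch: predict the trimmed FASTQ file name for an input FASTQ path and an
# optional output prefix.
# Assumption: only the file name (not the directory) of the input is used.
trimmed_name() {
    fastq_name=$(basename "$1")
    prefix="$2"
    if [ -n "$prefix" ]; then
        echo "${prefix}_trimmed_${fastq_name}"
    else
        echo "trimmed_${fastq_name}"
    fi
}

# Example use:
# trimmed_name SAMPLE_R1.fastq.gz SAMPLE   # with --outputPrefix SAMPLE
# trimmed_name SAMPLE_R1.fastq.gz          # without a prefix
```

Under this pattern, an input named SAMPLE_R1.fastq.gz with the prefix SAMPLE gives SAMPLE_trimmed_SAMPLE_R1.fastq.gz. Confirm the actual names against the files that HSQutils trim writes on your system.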
Chapter 5. HSQutils dedup
This chapter describes the HSQutils dedup tool, including its purpose, recommended usage, available options, and
output files.
mand line examples.
HSQutils dedup description
Amplification during capture and sequencing can result in multiple sequenced reads that represent the same DNA
fragment precursor. This can bias variant calling in regions of DNA which amplify preferentially.
The goal of duplicate removal (“deduplication” or “dedup”) is to identify reads which derive from the
same capture event by comparing the mapped coordinates of each read and the associated UID. The HSQutils dedup
tool identifies duplicate groups and selects one read pair to represent the entire duplicate group. The other duplicate
reads can be removed or marked, as specified in the command line options.
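The grouping idea can be illustrated with a simplified sketch. It is not the HSQutils implementation: the real tool works on BAM records and the probe info file, and how it chooses the representative read pair is not described here. The sketch assumes a hypothetical tab-separated input with four columns (read pair ID, chromosome, mapped start position, UID) and keeps the first read pair seen for each combination of coordinates and UID.

```shell
# Sketch: keep one read pair per duplicate group, where a group is defined by
# identical chromosome, start position, and UID.
# Assumed input columns (tab-separated): pair_id, chrom, start, uid.
dedup_sketch() {
    awk -F '\t' '!seen[$2 FS $3 FS $4]++'
}

# Example use:
# printf 'p1\tchr1\t100\tACGT\np2\tchr1\t100\tACGT\np3\tchr1\t100\tTTTT\n' | dedup_sketch
# p1 and p2 share coordinates and UID, so only p1 is kept; p3 has a different
# UID and is kept as a separate capture event.
```

The example shows why the UID matters: read pairs with identical coordinates but different UIDs are treated as separate capture events and are both retained.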
Additionally, the HSQutils dedup tool will precisely trim the reads to include as much of the capture target region as
possible while excluding any primer sequence. This is necessary to avoid interfering with variant calls in the primer
region, especially if that region is covered by another probe.
It is also possible to run HSQutils with duplicate removal or marking disabled. Precise primer trimming would still
be provided. This may be appropriate in experiments where amplification bias is not perceived to be high and/or
total available sequencing data is low, so that all possible read pairs are used for variant calling.
HSQutils dedup may make use of alternative mapping information when present in the provided input BAM file in
the “XA” tag. This is applicable to read pairs in which one or both reads do not map uniquely but both have one
alignment option which fits in the context of the panel (as defined in the probe info file). Note that not all mappers
generate an “XA” tag with alternative read alignments.
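To find out whether a mapper wrote alternative alignments, the alignment records can be scanned for the tag. The sketch below counts SAM records that carry an XA tag in the form XA:Z:, which is the form written by BWA; other mappers may differ. The function name count_xa_records is hypothetical. It reads SAM text on standard input, so a BAM file would first be converted with samtools view.

```shell
# Sketch: count SAM alignment records that carry an XA:Z: optional field.
# Header lines (starting with "@") are skipped; optional fields begin at
# column 12 of a SAM record.
count_xa_records() {
    awk -F '\t' '
        !/^@/ {
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^XA:Z:/) { n++; break }
            }
        }
        END { print n + 0 }
    '
}

# Example use:
# /path/to/samtools view SAMPLE_initial_mapping.bam | count_xa_records
```

A count of zero suggests that no alternative alignment information is available for HSQutils dedup to use in that file.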
An example command line is shown below:
java -Xmx3g -Xms3g -jar /path/to/hsqutils.jar dedup
--r1 SAMPLE_R1.fastq.gz --r2 SAMPLE_R2.fastq.gz
--probe DESIGN_probe_info.txt --inputBam SAMPLE.BAM
--outputBamFileName SAMPLE_dedup.bam --outputPrefix SAMPLE
will not work for removing duplicates from
HEAT-Seq target enrichment experiment.
HSQutils dedup options
Option argument            Command line option       Description
Version flag               --version                 Output the version and quit.
Read 1 FASTQ file          --r1                      Path to raw read 1 input FASTQ file (required). Can be uncompressed or gzip compressed (.gz). This is not the trimmed FASTQ file generated by HSQutils trim.
Read 2 FASTQ file          --r2                      Path to raw read 2 input FASTQ file (required). Can be uncompressed or gzip compressed (.gz). This is not the trimmed FASTQ file generated by HSQutils trim.
Probe Information File     --probe                   Roche probe info file (required).
Input BAM or SAM Filepath  --inputBam                Path to input BAM or SAM file containing the aligned reads (required).
Output Directory           --outputDir               Directory for output files (optional; local directory is the default).
Output BAM Filename        -o, --outputBamFileName   Name for output BAM file (required).
Output File Prefix         --outputPrefix            Text prefix for output file names (optional). If a prefix is provided, an underscore is used to separate the prefix from the rest of the output filename. The prefix is not added to the output BAM filename.
Temporary Directory        --tmpDir                  Location to store temporary files (optional). Default is /tmp.
Number of Processors       --numProcessors           The number of threads to run in parallel (optional). The maximum number of threads is 10. If not specified, this defaults to the number of cores available on the machine, up to 10.
Mark Duplicates            --markDuplicates          Flag (optional). Mark duplicate reads in the BAM file but don’t remove them. Default omits duplicates from output BAM.