How to convert a FASTQ file to a BAM file?

Converting a FASTQ file to a BAM file is a multi-step process in next-generation sequencing data analysis: the raw reads in the FASTQ file are aligned to a reference genome, and the resulting alignments are written in the binary BAM format. The aligned data is the starting point for downstream applications such as variant calling.

Understanding FASTQ and BAM Files

Before diving into the conversion process, it's helpful to understand the nature of these file formats:

FASTQ File: This is the standard format for the raw reads produced by a sequencing run. It contains the reads along with their corresponding quality scores. Each read typically consists of four lines: a read identifier beginning with "@", the sequence, a separator line beginning with a plus sign, and the quality scores (one character per base).
BAM File: A Binary Alignment/Map (BAM) file is the compressed binary representation of a SAM (Sequence Alignment/Map) file. It stores sequencing reads aligned against a reference genome, along with alignment information, flags, mapping quality, and other data crucial for variant discovery and genome analysis. A coordinate-sorted BAM file can be indexed, allowing for quick retrieval of specific regions.
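
To make the four-line FASTQ layout concrete, here is a minimal Python sketch that reads records from an uncompressed FASTQ stream and decodes the quality line. It assumes the common Phred+33 encoding (quality = character code minus 33) and that each record occupies exactly four lines; the names read_fastq and phred_to_error_probability are made up for this illustration and are not part of any library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str             # identifier line without the leading "@"
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (Phred+33 assumed)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


def phred_to_error_probability(q: int) -> float:
    """Convert a Phred score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


# A single made-up record, held in memory instead of a file.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

Real FASTQ files are usually gzip-compressed; opening them with gzip.open(path, "rt") from the standard library would give a text handle that this function can read in the same way.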

The Multi-Step Conversion Pipeline

The conversion from raw FASTQ to an analysis-ready BAM file is typically managed through a series of bioinformatics steps, often automated as a pipeline. Here's a breakdown of the key stages:

1. Preparation and Setup

Before initiating the alignment process, several preparatory steps are crucial:

Raw FASTQ Files: Ensure your raw sequencing data in FASTQ format is readily accessible. This is the primary input for the entire process.
Reference Genome: A high-quality reference genome (in FASTA format) is indispensable, since reads will be aligned against it. It must be indexed with the chosen aligner's indexing command (for example, bwa index) before reads can be aligned.
Configuration File: For automated pipelines, a configuration file (e.g., cohort.config) is often prepared. This file specifies parameters, input file paths, and output directories, streamlining the entire workflow.
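
As a rough illustration of reference preparation, the sketch below builds the indexing commands that a BWA, Samtools and GATK based workflow typically needs. The file name reference.fa and the function name are placeholders, and the code only prints the commands; running them requires bwa, samtools and gatk to be installed.

```python
import shlex
from typing import List


def reference_index_commands(reference_fasta: str) -> List[List[str]]:
    """Return the indexing commands for a reference FASTA (nothing is executed)."""
    return [
        # BWA index files (.amb, .ann, .bwt, .pac, .sa) used by the aligner
        ["bwa", "index", reference_fasta],
        # FASTA index (.fai) used by Samtools and GATK for random access
        ["samtools", "faidx", reference_fasta],
        # Sequence dictionary (.dict) required by GATK tools
        ["gatk", "CreateSequenceDictionary", "-R", reference_fasta],
    ]


for command in reference_index_commands("reference.fa"):
    print(shlex.join(command))
```

Indexing is done once per reference and reused for every sample aligned against it.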

2. Core Processing Steps

Once the preparations are complete, the pipeline proceeds through several computational stages:

FASTQ File Splitting (If Necessary):
- Large FASTQ files, especially from whole-genome sequencing projects, might be split into smaller, manageable chunks. This can be done per sequencing lane or based on read group identifiers to facilitate parallel processing and optimize resource utilization.
Read Alignment:
- This is the core step where raw sequencing reads are mapped to the reference genome. Algorithms compare each read to the reference to find its most probable location. Tools like BWA (Burrows-Wheeler Aligner) are widely used for this purpose. The output of this step is typically a SAM (Sequence Alignment/Map) file, which is a human-readable text format.
SAM to BAM Conversion and Merging:
- The SAM file generated from alignment is then converted into the more compact, binary BAM format. During this stage, if reads were aligned from multiple FASTQ files (e.g., from different lanes or splits), their respective BAM files are often merged into a single, comprehensive BAM file. Tools like Samtools are commonly used for this conversion and merging. The BAM file is then sorted by coordinate and indexed for efficient data access.
Post-Alignment Processing:
- Marking Duplicate Reads: Polymerase Chain Reaction (PCR) amplification during library preparation can introduce duplicate reads. Identifying and marking these duplicates (e.g., using GATK MarkDuplicates or Samtools markdup) is crucial for accurate variant calling: duplicates are not independent observations, so leaving them unmarked can skew variant allele frequencies and inflate the apparent support for false variants.
- Extracting Split and Discordant Reads: For advanced analyses, especially structural variant calling, two classes of reads are extracted or specifically flagged: discordant pairs, whose mates map with an unexpected insert size or orientation, and split reads, whose segments align to different locations in the reference.
Base Quality Score Recalibration (BQSR - Optional):
- This optional but highly recommended step adjusts reported base quality scores in the BAM file to more accurately reflect the true probability of a sequencing error. GATK BaseRecalibrator analyzes covariates (e.g., read group, base context, cycle), using a set of known variant sites so that real variants are not counted as errors, and builds a recalibration model; GATK ApplyBQSR then applies that model to the BAM file. This improves the accuracy of downstream analyses, particularly variant calling.
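
The alignment, conversion, merging and duplicate-marking stages above can be sketched as command lines. The Python below only builds and prints them; it assumes paired-end reads, BWA-MEM as the aligner, and GATK MarkDuplicates for duplicate marking. All file names, sample names, and helper function names are placeholders chosen for the example. Piping bwa mem straight into samtools sort, as done here, skips writing an intermediate SAM file to disk.

```python
import shlex
from typing import List


def read_group(rg_id: str, sample: str, platform: str = "ILLUMINA") -> str:
    """Build the @RG line for `bwa mem -R`.

    bwa expects a literal backslash followed by "t" between fields and
    turns it into a real tab when it writes the SAM header.
    """
    return f"@RG\\tID:{rg_id}\\tSM:{sample}\\tPL:{platform}"


def align_and_sort(reference: str, fastq1: str, fastq2: str,
                   out_bam: str, rg: str, threads: int = 4) -> str:
    """Shell pipeline: align paired-end reads, then coordinate-sort into BAM."""
    bwa = ["bwa", "mem", "-t", str(threads), "-R", rg, reference, fastq1, fastq2]
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return f"{shlex.join(bwa)} | {shlex.join(sort)}"


def merge_bams(out_bam: str, in_bams: List[str]) -> List[str]:
    """Merge per-lane sorted BAM files into one (-f overwrites existing output)."""
    return ["samtools", "merge", "-f", out_bam, *in_bams]


def mark_duplicates(in_bam: str, out_bam: str, metrics: str) -> List[str]:
    """Flag duplicate reads and write a metrics report."""
    return ["gatk", "MarkDuplicates", "-I", in_bam, "-O", out_bam, "-M", metrics]


def index_bam(bam: str) -> List[str]:
    """Create the index for a coordinate-sorted BAM file."""
    return ["samtools", "index", bam]


# Two hypothetical lanes of the same sample.
lane_bams = []
for lane in ("lane1", "lane2"):
    bam = f"sampleA.{lane}.sorted.bam"
    print(align_and_sort("reference.fa", f"sampleA.{lane}_R1.fastq.gz",
                         f"sampleA.{lane}_R2.fastq.gz", bam,
                         read_group(lane, "sampleA")))
    lane_bams.append(bam)

print(shlex.join(merge_bams("sampleA.merged.bam", lane_bams)))
print(shlex.join(mark_duplicates("sampleA.merged.bam", "sampleA.dedup.bam",
                                 "sampleA.dup_metrics.txt")))
print(shlex.join(index_bam("sampleA.dedup.bam")))
```

Giving each lane its own read group ID while keeping the same sample name lets later tools treat the lanes separately where that matters, such as in BQSR, while still recognizing them as one sample.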

3. Executing the Pipeline

Many bioinformatics groups develop and utilize optimized scripts to automate these steps. For instance, a common approach involves running a single bash script, such as bash create_project.bash, which orchestrates all the aforementioned steps from initial FASTQ preparation to the generation of the final analysis-ready BAM file. This script typically takes the prepared configuration file, raw FASTQ files, and the reference genome as inputs.
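
What such a driver script does internally is not described here, so the following is only a hedged sketch of the general idea in Python: read settings from a configuration file, then run the steps in order and stop at the first failure. The KEY=VALUE layout, the key names REFERENCE and FINAL_BAM, and both function names are assumptions made for this example; they do not describe the actual format of cohort.config or the behavior of create_project.bash.

```python
import shlex
import subprocess
from typing import Dict, Iterable, List


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments (assumed format)."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def run_steps(steps: Iterable[List[str]], dry_run: bool = True) -> List[str]:
    """Run commands in order; with dry_run=True they are only recorded."""
    executed = []
    for step in steps:
        executed.append(shlex.join(step))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure
            subprocess.run(step, check=True)
    return executed


config_text = """
# hypothetical settings
REFERENCE = reference.fa
FINAL_BAM = sampleA.dedup.bam
"""
settings = parse_config(config_text)
steps = [
    ["samtools", "faidx", settings["REFERENCE"]],
    ["samtools", "index", settings["FINAL_BAM"]],
]
for line in run_steps(steps, dry_run=True):
    print(line)
```

A dry run that prints the planned commands is a cheap way to check paths and parameters before committing hours of compute to a full run.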

Summary of Stages and Outputs

The following table summarizes the typical input and output at different stages of the conversion process:

Stage	Key Input(s)	Primary Tool(s) (Examples)	Key Output(s)
Preparation	Raw FASTQ files, Reference Genome	BWA index, Samtools faidx	Indexed Reference Genome
Alignment	FASTQ files, Indexed Reference	BWA (or Bowtie2, Minimap2)	SAM file
Conversion & Merging	SAM file(s)	Samtools	Sorted, Merged BAM file
Duplicate Marking	Sorted BAM file	GATK MarkDuplicates, Samtools markdup	Marked Duplicates BAM file
BQSR (Optional)	Marked BAM file, Reference	GATK BaseRecalibrator, ApplyBQSR	Recalibrated BAM file
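
The optional BQSR row corresponds to two GATK invocations: one that builds the recalibration table and one that applies it. The Python sketch below builds those two command lines without running them. The file names are placeholders, and the known-sites VCF is something you must supply yourself (for human data, typically a public catalog of known variants).

```python
import shlex
from typing import List, Tuple


def bqsr_commands(in_bam: str, reference: str, known_sites: List[str],
                  out_bam: str, table: str = "recal.table"
                  ) -> Tuple[List[str], List[str]]:
    """Return (BaseRecalibrator, ApplyBQSR) command lines; nothing is executed."""
    recalibrate = ["gatk", "BaseRecalibrator", "-I", in_bam, "-R", reference]
    for vcf in known_sites:
        # Positions in these files are treated as real variation, not errors
        recalibrate += ["--known-sites", vcf]
    recalibrate += ["-O", table]

    apply = ["gatk", "ApplyBQSR", "-I", in_bam, "-R", reference,
             "--bqsr-recal-file", table, "-O", out_bam]
    return recalibrate, apply


for command in bqsr_commands("sampleA.dedup.bam", "reference.fa",
                             ["known_sites.vcf.gz"], "sampleA.recal.bam"):
    print(shlex.join(command))
```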

Following these steps turns the raw reads in FASTQ files into a sorted, indexed, duplicate-marked and (optionally) recalibrated BAM file that is ready for downstream analyses such as variant calling.