ChIP-seq pipeline steps and reports

Analysis pipeline steps

Reads trimming: Reads are trimmed using cutadapt (DOI: 14806/ej.17.1.200) (with the parameters –times 2 -q 20 -m 25). In this process, primers corresponding to the Tru-seq protocol are removed.
Quality control: Reads quality control is evaluated using FastQC (with the parameter –cassava). A report file, containing quality results for all of the samples is generated using multiQC.
Mapping to genomes: The quality trimmed reads are mapped to Mouse and Human genomes using Bowtie2 (doi: 10.1038/nmeth.1923.) (with the parameters –local for all analyses and, in addition, -X 2000 for paired end analyses). Refseq annotation is provided for the mapped genes.

4. Alignment filtering: Following the alignment, reads are filtered using samtools view with the parameters -F 4 to remove unmapped reads that are output by bowtie2, -q 39 and with -f 0x2 for paired end reads. The remaining unique reads are then indexed and sorted, using samtools index and samtools sort.

Generation of statistics on the alignment using flagstat.
Visualization in graphs: The reads are graphically visualized using ngsplot (with the parameters -G -R genebody -C -O samples -D refseq -L 50000).

7. Peak calling: Significant chip regions (peaks) are evaluated and compared to control samples if present using macs2 callpeak (with the parameters –bw 300 -B -f –SPMR -g -keep-dup auto -q 0.01 for all analyses, BAMPE –nomodel for paired end analyses, and BAM for single end analyses). The resulting peaks “*_peaks_filtered.broadPeak” are filtered to exclude peaks from the blacklist (https://github.com/Boyle-Lab/Blacklist/tree/master/lists).

Conversion to BigWig format: Files containing the predicted peaks coordinates in BedGraph format are converted to BigWig format using bedtools slop (with the parameters -g -b 0), bedClip stdin and bedGraphToBigWig (with default parameters).

9. Peak annotation: The predicted peaks are collected from all samples using multiIntersectBed and then annotated according to the corresponding genome using Homer (with default parameters). Analysis of peaks distribution in genomic regions, and around TSS is done using ChipSeeker, together with a Venn diagram of overlap of peaks sets.

Pipeline report

Upon completion of the analysis, you will be sent an email with links to the results report.

The report includes several sections:

Sequencing and Mapping QC
1. Figure 1 - Plots the average quality of each base across all reads. Qualities of 30 (predicted error rate 1:1000) and above are good.
2. Figure 2 - Histogram showing the number of reads for each sample in the raw data.
3. Figure 3 - Histogram showing the percentage of reads discarded after trimming the adapters (after removing adapters, short, polyA/T and low quality reads are discarded by the pipeline). No figure will be presented if the percentage of reads discarded after trimming for all samples is lower than 1%.
4. Figure 4 - Histogram with the number of reads for each sample in each step of the pipeline.
5. Figure 5 - Plots sequence coverage on and near gene regions.
MACS peak calling
1. Figure 6 - MACS (peak calling) results table for each sample or for each comparison, if the pipeline was run with a control.
2. Figure 7 - Histogram with the number of peaks for all samples or for comparisons, if the pipeline was run with a control.
3. Figure 8 - Histogram showing Peaks distribution in genomic regions.
4. Figure 9 - Histogram showing Peaks distribution around TSS.
5. Figure 10 - Venn diagram of peak overlaps among the first four comparisons.
Bioinformatics Pipeline Methods - Description of pipeline methods.
Links to additional results - Links for downloading tables with raw, normalized counts, log normalized values (rld), and statistical data of contrasts.

Output folders

0_concatenating_fastq

1_cutadapt

2_fastqc

3_multiQC

4_mapping

5_filtered_alignment

6_peaks_prediction

7_peaks_annotation

8_graphs

9_BigWig

10_reports

Log directory

Annotation file

For Peak annotation, we use annotation files (gtf format) from “Ensembl” or “GENCODE”.

Examples of reports

ChIP-Seq example