C201SC18122416 QC Analysis Report

04-March-2019


Novogene Co., Ltd



A. Library Preparation and Sequencing

From RNA sample to final data, every step, including sample testing, library preparation, and sequencing, influences data quality, and data quality directly affects the analysis results. To ensure reliable data, quality control (QC) is performed at each step of the procedure. The workflow is as follows:

1 Sample Quality Control

There are three main methods of QC for RNA samples:

(1) Nanodrop: Preliminary quantitation

(2) Agarose Gel Electrophoresis: tests RNA degradation and potential contamination

(3) Agilent 2100: checks RNA integrity and quantitation

2 Library Construction and Quality Control

After the QC procedures, mRNA from eukaryotic organisms is enriched using oligo(dT) beads. For prokaryotic samples, rRNA is removed using the Ribo-Zero kit, leaving the mRNA. First, the mRNA is fragmented randomly by adding fragmentation buffer. cDNA is then synthesized using the mRNA template and random hexamer primers, after which a custom second-strand synthesis buffer (Illumina), dNTPs, RNase H and DNA polymerase I are added to initiate second-strand synthesis. Second, after terminal repair, A-tailing and sequencing adaptor ligation, the double-stranded cDNA library is completed through size selection and PCR enrichment.

Library quality control consists of three steps:

(1) Qubit 2.0: tests the library concentration preliminarily.

(2) Agilent 2100: tests the insert size.

(3) Q-PCR: quantifies the library effective concentration precisely.

The workflow chart is as follows:

3 Sequencing

Qualified libraries are pooled according to their effective concentrations and expected data volumes, then loaded onto Illumina sequencers.




B. Results and Instructions

1 Data Quality Control

1.1 Distribution of Sequencing Quality

Here “e” is the sequencing error rate and Qphred is the base quality value, with Qphred = -10 × log10(e). The relationship between the sequencing error rate (e) and the base quality value (Qphred) is shown below:

Phred score    Incorrect base call rate    Base call accuracy    Q-score
10             1/10                        90%                   Q10
20             1/100                       99%                   Q20
30             1/1000                      99.9%                 Q30
40             1/10000                     99.99%                Q40
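
As a minimal sketch, the conversion between error rate and Phred score can be reproduced in a few lines of Python. The function names are illustrative and do not come from the vendor pipeline.

```python
import math


def error_to_phred(e: float) -> float:
    """Qphred = -10 * log10(e), as defined in the text."""
    return -10.0 * math.log10(e)


def phred_to_error(q: float) -> float:
    """Inverse relationship: e = 10 ** (-Qphred / 10)."""
    return 10.0 ** (-q / 10.0)


# Print the error rate and base call accuracy for the scores in the table.
for q in (10, 20, 30, 40):
    e = phred_to_error(q)
    print(f"Q{q}: error rate {e:g}, accuracy {100 * (1 - e):g}%")
```

Each step of 10 in the Phred score corresponds to a tenfold drop in the error rate, which is why Q20 and Q30 are common reporting thresholds.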

The distribution of quality score is shown in Fig.1:

Fig.1 Distribution of Sequencing Quality

The base position is on the horizontal axis and the sequencing quality is on the vertical axis




1.2 Distribution of Sequencing Error Rate

For Illumina SBS technology, the distribution of sequencing error rate has two features:

(1) The error rate increases along the read as sequencing proceeds, because sequencing reagent is consumed. This is common on the Illumina high-throughput sequencing platform (Erlich Y. et al. 2008; Jiang et al. 2011).

(2) The first six bases have a higher error rate because the random hexamer primers bind incompletely to the RNA template during cDNA synthesis (Jiang et al. 2011).

The error rate of this project is shown in Fig.2:


Fig.2 Error Rate Distribution

The base position is on the horizontal axis and the single base error rate is on the vertical axis
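
One way to obtain a per-position error rate like the one plotted in Fig.2 is to average the error probabilities implied by the quality strings. The sketch below assumes Phred+33 quality encoding and is not necessarily the method used to produce the figure; the function name is illustrative.

```python
def mean_error_by_position(quality_strings, offset=33):
    """Return the mean error rate at each read position.

    Each quality character is converted to a Phred score (assumed
    Phred+33 encoding) and then to an error probability via
    e = 10 ** (-Q / 10).
    """
    totals = []  # summed error probability per position
    counts = []  # number of reads covering each position
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0.0)
                counts.append(0)
            totals[i] += 10 ** (-(ord(ch) - offset) / 10)
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Reads of different lengths are handled by counting only the reads that cover a given position.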




1.3 Distribution of A/T/G/C Base

This check uses the per-cycle base composition to detect separation between A and T or between G and C. By base complementarity, the A and T contents should be equal, and the G and C contents should be equal, at each sequencing cycle, and they should stay constant and stable throughout the sequencing run. A strand-specific library (dUTP library) retains information from only one strand, so its base composition fluctuates noticeably and AT/GC separation is normal for such libraries.

The distribution of GC content is shown in Fig.3:

Fig.3 A/T/G/C Distribution

The base position is on the horizontal axis and the single base percentage is on the vertical axis
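
A per-position base composition like the one plotted in Fig.3 can be computed as in the following sketch. The function name and the output layout are illustrative assumptions, not the vendor's implementation.

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read position, the percentage of A, T, G, C and N."""
    columns = []  # one Counter of base counts per read position
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            if i == len(columns):
                columns.append(Counter())
            columns[i][base] += 1
    result = []
    for col in columns:
        total = sum(col.values())
        result.append({b: 100.0 * col[b] / total for b in "ATGCN"})
    return result
```

For a non-stranded library the A and T curves, and the G and C curves, would be expected to track each other closely after the first few cycles.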




1.4 Results of Raw Data Filtering

The sequenced reads (raw reads) often include low-quality reads and reads containing adapters, which affect the quality of the analysis. The raw reads are therefore filtered to obtain clean reads. The filtering process is as follows:

(1) Remove reads containing adapters.

(2) Remove reads in which more than 10% of the bases are N (N denotes a base that cannot be determined).

(3) Remove reads in which low-quality bases (Qscore <= 5) make up more than 50% of the read.

Adapter sequences:

  5' Adapter:

  5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'

  3' Adapter (the 6 bp index, underlined in the original layout, is ATCACG in this sequence):

  5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG-3'
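
The three filtering rules can be sketched as a single predicate on one read. This is an illustration under stated assumptions: quality values are Phred+33 encoded, and adapter contamination is detected by an exact match to the first 12 bases of the 3' adapter listed above. The report does not describe how adapters are actually detected, so that part in particular is a simplification.

```python
# Assumption: first 12 bases of the 3' adapter listed in the report.
ADAPTER_PREFIX = "GATCGGAAGAGC"


def passes_filter(seq, qual, offset=33, adapter=ADAPTER_PREFIX):
    """Return True if a read survives the three filtering rules."""
    n = len(seq)
    if n == 0:
        return False
    # Rule 1: remove reads containing adapter sequence (exact match only).
    if adapter in seq:
        return False
    # Rule 2: remove reads in which more than 10% of bases are N.
    if seq.count("N") / n > 0.10:
        return False
    # Rule 3: remove reads in which bases with Qscore <= 5 exceed 50%.
    low_quality = sum(1 for ch in qual if ord(ch) - offset <= 5)
    if low_quality / n > 0.50:
        return False
    return True
```

For example, `passes_filter("ACGTACGTAC", "IIIIIIIIII")` keeps the read, while a read whose quality string is mostly `#` (Q2 under Phred+33) is rejected by rule 3.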

The data filtering results for this project are shown in Fig.4:

Fig.4 Composition of Raw Data

Each color represents a different component:

(1) Adapter related: (reads containing adapter) / (total raw reads)

(2) Containing N: (reads with more than 10% N) / (total raw reads)

(3) Low quality: (reads of low quality) / (total raw reads)

(4) Clean reads: (clean reads) / (total raw reads)




2 Summary of Sequencing Data Information

Total sequencer output: 143.5 G of raw data, of which 137.5 G remained as clean data after filtering.

Detailed quality statistics for the sequencing data are shown in Table 1.

Table 1 Data Quality Summary

Sample     Raw Reads   Clean Reads  Raw Base(G)  Clean Base(G)  Effective Rate(%)  Error Rate(%)  Q20(%)  Q30(%)  GC Content(%)
F_1_7_1    23965039    23009933     7.2          6.9            96.01              0.03           97.73   93.54   42.82
F_1_7_3    29146123    28274630     8.7          8.5            97.01              0.03           97.43   92.84   42.88
F_1_7_4    22040082    21080620     6.6          6.3            95.65              0.03           97.55   93.10   43.30
F_1_7_7    24805236    23897333     7.4          7.2            96.34              0.03           97.77   93.59   43.22
F_1_10_1   22345231    21562999     6.7          6.5            96.50              0.03           97.02   91.68   43.91
F_1_10_2   22956626    22107993     6.9          6.6            96.30              0.03           97.43   92.90   43.88
F_1_10_3   21598106    20915710     6.5          6.3            96.84              0.03           96.75   91.18   43.80
F_1_10_4   24641451    22588280     7.4          6.8            91.67              0.03           97.59   93.26   43.78
F_2_4_3    28380852    26450804     8.5          7.9            93.20              0.03           97.29   92.60   42.86
F_2_4_4    20414226    19656998     6.1          5.9            96.29              0.03           97.78   93.63   42.91
F_2_4_6    23491316    22619486     7.0          6.8            96.29              0.03           97.35   92.31   42.68
F_2_4_7    27082744    25862366     8.1          7.8            95.49              0.03           97.66   93.39   43.43
F_2_F_4    23823479    22819749     7.1          6.8            95.79              0.03           97.62   93.30   43.68
F_2_F_F    26315475    25421239     7.9          7.6            96.60              0.03           97.38   92.73   43.64
F_2_F_6    21551806    20731364     6.5          6.2            96.19              0.03           97.65   93.34   43.85
F_2_F_8    24122472    23170799     7.2          7.0            96.05              0.03           97.77   93.59   43.64
WT1        21397005    20196968     6.4          6.1            94.39              0.03           97.14   92.42   43.03
WT2        22954251    22111759     6.9          6.6            96.33              0.03           97.48   92.98   43.94
WT3        23256472    22568369     7.0          6.8            97.04              0.03           97.37   92.73   43.88
WT5        24141584    23334653     7.2          7.0            96.66              0.03           97.41   92.82   43.68

Sample: sample name.
Raw reads: total number of reads in the raw data, with every four lines counted as one read. For paired-end sequencing, it is the number of read1 and read2 reads; for single-end sequencing, it is the number of read1 reads.
Clean reads: total number of reads in the clean data, with every four lines counted as one read. For paired-end sequencing, it is the number of read1 and read2 reads; for single-end sequencing, it is the number of read1 reads.
Raw bases: (Raw reads) * (sequence length), expressed in G. For paired-end sequencing such as PE150 the sequence length is 150; for single-end sequencing such as SE50 it is 50.
Clean bases: (Clean reads) * (sequence length), expressed in G. For paired-end sequencing such as PE150 the sequence length is 150; for single-end sequencing such as SE50 it is 50.
Effective Rate(%): (Clean reads / Raw reads) * 100%
Error rate: base error rate.
Q20, Q30: (number of bases with Phred value > 20 or > 30) / (total base count)
GC content: (G and C base count) / (total base count)
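
The ratio-based columns of Table 1 can be recomputed from read counts and from sequence/quality pairs, as in the sketch below. It assumes Phred+33 quality encoding and follows the strict ">" comparison given in the definitions; the function names are illustrative.

```python
def effective_rate(raw_reads, clean_reads):
    """Effective Rate(%) = (Clean reads / Raw reads) * 100."""
    return 100.0 * clean_reads / raw_reads


def quality_summary(records, offset=33):
    """Return Q20, Q30 and GC content as percentages.

    `records` is an iterable of (sequence, quality string) pairs.
    """
    total = q20 = q30 = gc = 0
    for seq, qual in records:
        total += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        for ch in qual:
            q = ord(ch) - offset
            if q > 20:
                q20 += 1
            if q > 30:
                q30 += 1
    if total == 0:
        raise ValueError("no bases to summarize")
    return {
        "q20": 100.0 * q20 / total,
        "q30": 100.0 * q30 / total,
        "gc": 100.0 * gc / total,
    }


# Sample F_1_7_1 from Table 1: 23965039 raw reads, 23009933 clean reads.
print(round(effective_rate(23965039, 23009933), 2))  # 96.01
```

The printed value matches the Effective Rate(%) reported for F_1_7_1 in Table 1.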




C. Appendix

1. Introduction of Sequencing Data Format

The original data obtained from the high throughput sequencing platforms are transformed to sequenced reads by base calling. Raw data are recorded in a FASTQ file which contains sequenced reads and corresponding sequencing quality information. Every read in FASTQ format is stored in four lines as follows (Cock P.J.A. et al. 2010):

@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT   
NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
+
#55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (such as a FASTA title line).

Line 2 is the sequence of the read.

Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.

Line 4 encodes the quality values for the bases in Line 2.
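
The four-line record structure can be read with a short generator such as the one below. This is a minimal sketch rather than a full FASTQ parser: it assumes each sequence and quality string sits on a single line, as in the example above, and the function name is illustrative.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (identifier, sequence, quality) for each four-line record."""
    # Delivered files may be gzip-compressed (.gz); plain text also works.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            chunk = list(islice(handle, 4))
            if not chunk:
                break
            if len(chunk) < 4:
                raise ValueError("truncated FASTQ record")
            header, seq, plus, qual = (line.rstrip("\r\n") for line in chunk)
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed FASTQ record")
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            yield header[1:], seq, qual
```

The length check reflects the format rule that Line 4 carries exactly one quality character per base in Line 2.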

The details of Illumina sequence identifier are as follows:

Identifier    Meaning
HWI-ST1276    Instrument - unique identifier of the sequencer
71            Run number - run number on the instrument
C1162ACXX     FlowCell ID - ID of the flowcell
1             LaneNumber - positive integer
1101          TileNumber - positive integer
1208          X - x coordinate of the spot; integer, which can be negative
2458          Y - y coordinate of the spot; integer, which can be negative
1             ReadNumber - 1 for single reads; 1 or 2 for paired ends
N             Whether it is filtered - Y if the read is filtered out (not in the delivered fastq file), N otherwise
0             Control number - 0 when none of the control bits are on, otherwise an even number
CGATGT        Illumina index sequence
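
The identifier fields can be split programmatically, as in the following sketch. The class and field names are illustrative, and the code assumes the header layout shown in the example record above.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    is_filtered: bool
    control_number: int
    index_sequence: str


def parse_illumina_id(header: str) -> IlluminaId:
    """Parse a header such as '@INSTR:RUN:FLOWCELL:LANE:TILE:X:Y READ:FILTER:CONTROL:INDEX'."""
    header = header.strip().lstrip("@")
    left, right = header.split(None, 1)  # the two parts are space-separated
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, index = right.split(":")
    return IlluminaId(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


record_id = parse_illumina_id(
    "@HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT"
)
print(record_id.flowcell_id, record_id.lane, record_id.index_sequence)
```

Headers that do not follow this layout raise a ValueError when the fields are unpacked.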





2. Notes on the Sequencing Data

(1) The delivered data are compressed files in '.fq.gz' format. Before delivery, we calculate the md5 value of each compressed file; please verify it when you receive the data. There are two ways to check the md5 value. In a Linux environment, run 'md5sum -c <*md5.txt>' in the data directory. In a Windows environment, use a checksum tool such as HashMyFiles. If the md5 value of a compressed file does not match the one provided in the md5 file in the data directory, the file may have been damaged during transfer.
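
Where neither tool is available, the same check can be sketched in Python with the standard hashlib module. The sketch assumes the md5 file uses the usual md5sum layout (hex digest, whitespace, file name, one entry per line), which this report does not confirm; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_listing(md5_file):
    """Return {file name: True/False} for each entry in an md5sum-style file.

    Assumed line format: '<hex digest>  <file name>', with file names
    relative to the directory that contains the md5 file.
    """
    md5_path = Path(md5_file)
    results = {}
    for line in md5_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        name = name.strip().lstrip("*")  # '*' marks binary mode in md5sum output
        results[name] = md5_of_file(md5_path.parent / name) == expected.lower()
    return results
```

Any entry that maps to False should be downloaded again before analysis.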

(2) For paired-end (PE) sequencing, every sample should have 2 data files (a read1 file and a read2 file). The two files have the same number of lines; in a Linux environment you can check this by running 'wc -l' on the decompressed files. The number of lines divided by 4 is the number of reads.
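
The same check can be done without decompressing the files to disk. The sketch below counts lines directly in the '.fq.gz' files and confirms that both files of a pair hold the same number of reads; the function names are illustrative.

```python
import gzip


def count_reads(path):
    """Count reads in a FASTQ file (number of lines divided by 4)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4:
        raise ValueError(f"{path}: line count {lines} is not a multiple of 4")
    return lines // 4


def check_pair(read1_path, read2_path):
    """Return the read count if both files agree, otherwise raise ValueError."""
    n1 = count_reads(read1_path)
    n2 = count_reads(read2_path)
    if n1 != n2:
        raise ValueError(f"read counts differ: {n1} vs {n2}")
    return n1
```

A mismatch between the two counts usually points to an incomplete download, which the md5 check should also reveal.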

(3) The data size is the space the data occupy on the hard disk. It depends on the disk format and the compression ratio, and it does not reflect the number of sequenced bases, so the read1 and read2 files may differ in size.

(4) When a customer's samples need a large amount of data, e.g. whole genome sequencing, we use a separate-lane sequencing strategy to ensure data quality, so one sample may have several sets of sequencing data. For example, if sample 1 has two read1 files, sample1_L1_1.fq.gz and sample1_L2_1.fq.gz, the sample was sequenced on different lanes.

(5) About the quality control standard. When we promise to deliver clean data, we filter the data strictly according to this standard to obtain high-quality clean data suitable for further research and publication. A read pair is discarded when either read contains adapter contamination, when more than 10 percent of the bases in either read are undetermined, or when more than 50 percent of the bases in either read are low quality (base quality less than 5). Analysis results based on this standard are accepted by high-impact journals (Yan L.Y. et al. 2013). For more information, please refer to the official Novogene website (www.novogene.com).

(6) About the sequenced reads. Except for special libraries, the index sits in the middle of the adapter during library preparation and sequencing. Read1 and Read2 are assigned to samples by the index read, and both consist entirely of sample sequence, so there is no need to trim the beginning or end of the reads in downstream analysis (e.g. mapping).

(7) We delete outdated data ninety days after delivery, so please store your data safely. If you have any questions, please contact us as soon as possible. Have a nice day!




3. References

Cock P.J.A. et al. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38, 1767-1771.

Hansen K.D. et al. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research 38, e131-e131.

Erlich Y. et al. (2008). Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods 5, 679-682.

Jiang L.C. et al. (2011). Synthetic spike-in standards for RNA-seq experiments. Genome Research 21, 1543-1551.

Yan L.Y. et al. (2013). Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol.