
or selectively remove bases that are not measured accurately (Iyer et al., 2015). The clean data obtained after filtering are then assembled into transcripts by software. There are two approaches to transcript assembly. One is reference-based assembly, in which TopHat is used to align the sequencing reads to the reference genome (Kim, 2014) and Cufflinks is then used to assemble the alignments into transcripts (Trapnell, 2014). This approach has high sensitivity and requires relatively little memory. The other is to obtain transcripts directly from the overlaps between sequencing reads, without a reference genome. This approach does not depend on alignment software or an existing reference genome, but it requires large memory resources as well as higher sequencing depth.
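The reference-based route can be sketched as follows. This is only an illustrative wrapper around the two tools named above, assuming paired-end FASTQ files, a pre-built Bowtie index with the prefix genome_index, and default TopHat/Cufflinks parameters; all file and directory names are placeholders.

```python
import subprocess

# Placeholder index prefix and read files -- replace with the actual data set.
BOWTIE_INDEX = "genome_index"
READS_1, READS_2 = "sample_R1.fastq", "sample_R2.fastq"

# Step 1: align the clean reads to the reference genome with TopHat.
subprocess.run(
    ["tophat", "-o", "tophat_out", BOWTIE_INDEX, READS_1, READS_2],
    check=True,
)

# Step 2: assemble the alignments into transcripts with Cufflinks.
subprocess.run(
    ["cufflinks", "-o", "cufflinks_out", "tophat_out/accepted_hits.bam"],
    check=True,
)
```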
After the transcripts are assembled, the next step is to screen for lncRNAs. There are several basic screening criteria for lncRNAs: one is transcript length, another is exon number. For instance, transcripts longer than or equal to 200 bp are usually selected from the assembled transcripts according to the typical length distribution of lncRNAs (Wilusz et al., 2009). The resulting transcripts are then screened to predict their potential to encode proteins. The commonly used analysis methods are as follows: CPC, CNCI, Pfam protein domain analysis, and CPAT. In general, the intersection of the results from these tools is taken to reduce the false positives that any single tool might produce.
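A minimal sketch of this intersection step is given below, assuming each predictor's noncoding calls have already been exported to a plain-text file with one transcript ID per line; the file names and the use of Biopython are illustrative assumptions, not part of the original pipeline.

```python
from Bio import SeqIO

# Hypothetical files: one transcript ID per line, exported from each predictor.
prediction_files = ["cpc_noncoding.txt", "cnci_noncoding.txt",
                    "pfam_no_domain.txt", "cpat_noncoding.txt"]

def read_ids(path):
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

# Keep only transcripts that every tool calls noncoding (the intersection),
# which lowers the false-positive rate of any single predictor.
noncoding_ids = set.intersection(*(read_ids(p) for p in prediction_files))

# Apply the basic length filter (>= 200 bp) and write the candidate lncRNAs.
candidates = (rec for rec in SeqIO.parse("assembled_transcripts.fasta", "fasta")
              if rec.id in noncoding_ids and len(rec.seq) >= 200)
SeqIO.write(candidates, "candidate_lncRNA.fasta", "fasta")
```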
Once lncRNAs are obtained, downstream analyses can be carried out, such as lncRNA family classification, lncRNA expression analysis, differential expression analysis, lncRNA-mRNA co-expression analysis, pathway enrichment analysis, and so on. Finally, the functional analysis of lncRNAs is combined with the properties of the samples to discuss the relevant biological questions.
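As one example of these downstream steps, lncRNA-mRNA co-expression is often assessed by correlating expression profiles across samples. The sketch below assumes expression matrices (transcripts in rows, samples in columns) have already been quantified; the file names and the correlation cutoff of 0.9 are only illustrative.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical expression tables: rows are transcripts, columns are samples.
lnc = pd.read_csv("lncRNA_expression.csv", index_col=0)
mrna = pd.read_csv("mRNA_expression.csv", index_col=0)

# Report lncRNA-mRNA pairs whose expression profiles are strongly correlated;
# the |r| >= 0.9 and p < 0.05 thresholds are example values.
pairs = []
for l_id, l_vals in lnc.iterrows():
    for m_id, m_vals in mrna.iterrows():
        r, p = pearsonr(l_vals, m_vals)
        if abs(r) >= 0.9 and p < 0.05:
            pairs.append((l_id, m_id, r, p))

pd.DataFrame(pairs, columns=["lncRNA", "mRNA", "pearson_r", "p_value"]) \
  .to_csv("coexpression_pairs.csv", index=False)
```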
With the continuous advance of sequencing technology, third-generation high-throughput sequencing has been recognized and accepted by more and more researchers, and the latest technology has been applied to scientific research practice in a timely manner, so we have a powerful tool on the way to exploring life science. In particular, the third-generation high-throughput sequencing technology represented by the single-molecule real-time (SMRT) sequencing of Pacific Biosciences is widely applied in scientific research. One of the reasons for its popularity is that its ultra-long reads are extremely convenient for research. For instance, the average read length of PacBio single-molecule real-time sequencing can reach about 10 kb, which enables us to obtain complete transcript sequences without relying on assembly. Because of the short read length of second-generation sequencing, full-length transcript sequences cannot be obtained directly, and an additional transcript assembly step must be added during analysis. However, assembly inevitably introduces errors, which limits research on transcripts. Benefiting from the ultra-long reads of third-generation high-throughput sequencing, we have more flexible methods for studying transcripts. The main steps of lncRNA identification using third-generation high-throughput sequencing can be summarized as follows.
First, RNA is extracted from the sequencing object and reverse transcribed into cDNA. Then, libraries of different insert sizes are constructed according to specific requirements, and finally the libraries are sequenced. After the raw sequencing data are obtained, they still need to be filtered. High-quality inserts are classified into full-length and non-full-length transcripts. The full-length transcripts are clustered, and the non-full-length transcripts are used to correct the clustered full-length transcripts (Gordon et al., 2016) to obtain high-quality transcripts.
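In practice this classification is performed by the dedicated Iso-Seq tools; the sketch below only illustrates the underlying idea, namely that a full-length read carries the library primer near its 5' end and a poly(A) stretch near its 3' end. The primer sequence, window sizes, and file names are placeholder assumptions.

```python
from Bio import SeqIO

# Placeholder primer and poly(A) definitions -- in practice these come from the
# library preparation protocol and the dedicated classification software.
FIVE_PRIME_PRIMER = "AAGCAGTGGTATCAACGCAGAGTAC"
POLY_A = "A" * 10

def is_full_length(read_sequence):
    """Treat a read as full length when the 5' primer appears near its start
    and a poly(A) stretch appears near its end (a simplified criterion)."""
    return (FIVE_PRIME_PRIMER in read_sequence[:100]
            and POLY_A in read_sequence[-100:])

full_length, non_full_length = [], []
for rec in SeqIO.parse("reads_of_insert.fasta", "fasta"):
    (full_length if is_full_length(str(rec.seq)) else non_full_length).append(rec)

SeqIO.write(full_length, "full_length.fasta", "fasta")
SeqIO.write(non_full_length, "non_full_length.fasta", "fasta")
```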
The subsequent procedures are similar to the analysis steps after next-generation sequencing. The high-quality transcripts obtained are classified by the PLEK software into protein-coding and non-coding transcript sequences. Sequences longer than or equal to 200 bp are then selected from the non-coding transcripts, and EMBOSS is used to remove transcripts whose ORFs encode more than 100 amino acids. Finally, BLAST is used to compare the remaining transcripts against the NR protein database to further filter out protein-coding gene sequences, and highly reliable lncRNA transcript sequences are obtained.
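A minimal sketch of the length and ORF-length filters is given below, using Biopython in place of the EMBOSS tool named in the text; the 200 bp and 100 amino acid thresholds come from the pipeline above, while the file names are placeholders. The BLAST comparison against NR would follow as a separate step.

```python
import re
from Bio import SeqIO

def longest_orf_aa(seq):
    """Length (in amino acids) of the longest ORF found in the six frames."""
    longest = 0
    for strand in (seq, seq.reverse_complement()):
        for frame in range(3):
            sub = strand[frame:]
            sub = sub[: len(sub) - len(sub) % 3]          # whole codons only
            protein = str(sub.translate())
            # An ORF is a run starting at M and ending before a stop codon (*).
            for match in re.finditer(r"M[^*]*", protein):
                longest = max(longest, len(match.group()))
    return longest

kept = []
for rec in SeqIO.parse("plek_noncoding.fasta", "fasta"):
    # Keep transcripts >= 200 bp whose longest ORF encodes at most 100 amino acids.
    if len(rec.seq) >= 200 and longest_orf_aa(rec.seq) <= 100:
        kept.append(rec)

SeqIO.write(kept, "lncRNA_candidates_for_blast.fasta", "fasta")
```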