Araip.K30076.a1.G1 GLEAN-based gene models from BGI, for the Peanut Genomics Consortium, June 2014.

This is a PROVISIONAL ANNOTATION VERSION, FOR EVALUATION BY PGC MEMBERS. This annotation is for Arachis ipaensis, genotype accession K30076, assembly version a1.

Repeats were first identified using Tandem Repeat Finder, RepeatMasker, LTR_FINDER, and RepeatModeler.

Gene prediction:

1. De novo prediction was performed on the repeat-masked genome using two programs (AUGUSTUS and GlimmerHMM) with the Arabidopsis training set.

2. Homology-based predictions were made by comparing protein sequences of Glycine max (Gmw82.a1.v1.1), Lotus japonicus (v2.5), Medicago truncatula (Mt3.5v4), and Phaseolus vulgaris (v1.0) to the genome using genBlastA. Genomic regions with aligned homologs (the best record for each protein), together with 4,000 bp of flanking sequence at the 5' and 3' ends, were extracted for gene structure prediction using GeneWise.

3. For the evidence-based approach, the assembled transcriptomes (provided by the Peanut Foundation) and ESTs/cDNAs (downloaded from NCBI) were aligned to the genome using BLAT, and gene structures were identified using PASA. The RNA-seq reads were aligned to the genome using TopHat, and cDNAs were then predicted using Cufflinks. ORFs were predicted by Cuffcompare with Markov parameters trained on a preliminary GLEAN gene set.

4. The final gene set was generated using GLEAN, combining the de novo, homology-based, and evidence-based gene predictions. Genes were filtered out if shorter than 150 bp or containing more than 30% Ns. In post-processing (S. Cannon), genes aligning over >= 50% of their length to Arachis transposable elements (from D. Bertioli et al.) at >= 80% identity were also filtered out.
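
The filtering thresholds described above can be illustrated with a short Python sketch. This is not the code used by BGI or in post-processing; the function names, the choice of which sequence the 150 bp and 30% N limits are measured on, and the way alignment coverage is summarized (a single aligned length and percent identity per gene) are all assumptions made for illustration.

```python
# Illustrative sketch only; names and input conventions are hypothetical.

def n_fraction(seq):
    """Return the fraction of bases in seq that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def passes_basic_filter(seq, min_len=150, max_n_frac=0.30):
    """Keep a gene unless it is shorter than 150 bp or has more than 30% Ns."""
    return len(seq) >= min_len and n_fraction(seq) <= max_n_frac


def is_te_like(gene_len, aligned_len, pct_identity,
               min_cov=0.50, min_ident=80.0):
    """True if >= 50% of the gene aligns to a transposable element
    at >= 80% identity (assumes one summarized alignment per gene)."""
    if gene_len <= 0:
        return False
    return aligned_len / gene_len >= min_cov and pct_identity >= min_ident


def keep_gene(seq, te_aligned_len=0, te_pct_identity=0.0):
    """Apply both filters; a gene is kept only if it passes each one."""
    if not passes_basic_filter(seq):
        return False
    return not is_te_like(len(seq), te_aligned_len, te_pct_identity)
```

For example, keep_gene("ACGT" * 100, te_aligned_len=250, te_pct_identity=92.0) returns False under these assumptions, because 250 of 400 bp (62.5%) align to a transposable element at more than 80% identity.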

