Theobroma cacao Matina1-6 V1.0 Release

This assembly is primarily composed of Roche 454 Titanium STD reads
(USDA-ARS, Stoneville), augmented by 6 recombination paired Titanium
libraries (CGB, Indiana U), 3 fosmid packaged Sanger sequenced libraries
(USDA-ARS, Miami and CUGI, Clemson), and 3 Sanger sequenced BAC end
libraries (CUGI, Clemson). Also included were 2 additional paired
libraries and linear data from a pre-release version of the Roche 1k
system (Roche). We also made extensive use of ~100x of 90 bp Illumina
paired sequence (NCGR) and 36x of 101 bp paired sequence (CGB, Indiana U).

The assembly process was as follows:
------------------------------------
 1. Removed duplicate pairs and unpaired read data from the 454
    recombination pairs.
 2. Discarded STD Titanium reads < 100 bp.
 3. Corrected indel errors in the 454 data by aligning all Illumina reads
    from a 99x set of V2 Illumina data.
 4. Assembled the 454 data, FES (fosmid end sequences), and BES (BAC end
    sequences) using our in-house version of the Arachne2 assembler,
    including a step to remove unanchored repetitive sequence.
 5. Applied subsequent filtering steps to remove short (< 150 bp) contigs
    and 454-pair-only contigs, to minimize chimeric projections.
 6. Combined the resulting assembly with the Cacao physical map (CUGI,
    Clemson) and the existing Cacao genetic marker maps (CATIE, MCCS, and
    PNG) (USDA-ARS, Miami) to identify potentially misassembled scaffolds.
    Applied 47 scaffold breaks.
 7. Rearranged scaffolds based on the genetic map, additional FES data,
    and the physical map to build chromosome-scale pseudomolecules with
    103 joins (represented by 10k Ns in the final scaffolds).
 8. Screened the resulting scaffolds for mitochondrial and chloroplast
    sequence, unanchored rDNA, and potential prokaryotic contaminants,
    and removed the affected scaffolds.
 9. Sequestered small, highly repetitive scaffolds (>= 95% composed of
    24mers occurring at least 4 times in the larger scaffolds) and small
    scaffolds with less than 1 kb of sequence content.
10. Post-processed the final scaffolds to correct 454 insertion/deletion
    errors using 36x TruSeq Illumina data.
11. Validated the resulting pseudomolecules against EST data sets
    collected as part of this project (CGB, Indiana U) and against a
    995 kb Sanger based contig (CUGI, Clemson).
12. Inserted the reference sequence into chromosome 5, replacing the WGS
    sequence in that region.

The resulting pseudomolecules (numbered 1-10) capture 94.6% of the
assembled sequence and 98.2% of the sample EST set.

Assembled Sequence Coverage per Library:
----------------------------------------
LIB      COV.    INSERT    STDDEV
-----   ------   ----------------
LINE    15.58x      N/A
TC3A     0.56x     2368 +/-   707
TC3F     0.51x     2545 +/-   801
TC3D     0.56x     3948 +/-   455
TC3B     0.92x     3949 +/-   452
TC5C     0.92x     6278 +/-   662
TC8E     1.04x     7071 +/-   988
TC8F     0.30x     7176 +/-  1258
TC8A     0.97x     8128 +/-  1005
TCFB     0.01x    35552 +/-  4362
TCFA     0.01x    35706 +/-  4341
TCFC     0.26x    36122 +/-  4467
TCCB     0.03x    93963 +/- 14897
TCCC     0.03x   114008 +/- 24776
TCCA     0.04x   127062 +/- 22613
Total   21.74x

Final Assembly Statistics:
--------------------------
Main genome scaffold total:            711
Main genome contig total:              20103
Main genome scaffold sequence total:   346.0 Mb
Main genome contig sequence total:     330.8 Mb  (-> 4.4% gap)
Main genome scaffold N/L50:            5/34.4 Mb
Main genome contig N/L50:              1080/84.4 kb
Number of scaffolds > 50 kb:           74
% main genome in scaffolds > 50 kb:    98.4%

Minimum    Number      Number    Total         Total         Scaffold
Scaffold   of          of        Scaffold      Contig        Contig
Length     Scaffolds   Contigs   Length        Length        Coverage
--------   ---------   -------   -----------   -----------   --------
All              711    20,103   345,993,675   330,821,837     95.61%
1 kb             711    20,103   345,993,675   330,821,837     95.61%
2.5 kb           570    19,917   345,771,487   330,635,450     95.62%
5 kb             438    19,654   345,274,253   330,363,628     95.68%
10 kb            260    19,111   343,977,657   329,666,795     95.84%
25 kb            113    17,955   341,704,814   328,516,629     96.14%
50 kb             74    17,270   340,358,744   327,606,137     96.25%
100 kb            36    15,857   337,733,377   325,806,488     96.47%
250 kb            14    13,751   334,312,246   323,434,351     96.75%
500 kb            13    13,655   333,885,350   323,071,182     96.76%
1 Mb              11    13,560   332,179,961   321,406,397     96.76%
2.5 Mb            10    13,460   330,456,197   319,759,406     96.76%
5 Mb              10    13,460   330,456,197   319,759,406     96.76%

EST validation:
---------------
We used several 454 EST runs to validate expected coding sequence space.
We first removed reads < 300 bp and removed all duplicate reads (to avoid
overcounting 454 duplicates). We then placed the remaining reads against
the assembly at 90% identity and 85% coverage. This is a completeness
measure, rather than a comprehensive examination of gene space.

Organelle sequence excluded:
  1015064 total sequences.
   992071 sequences (98.15%) place at >90% identity and >85% coverage
     4288 library artifacts (0.42%)
     9763 sequences have >=50 percent coverage (0.97%)
     8942 sequences are not found (0.88%)

Organelle sequence included:
  1015064 total sequences.
  1000012 sequences (98.90%) place at >90% identity and >85% coverage
     3883 library artifacts (0.38%)
     8485 sequences have >=50 percent coverage (0.84%)
     2684 sequences are not found (0.27%)

The ESTs that do not place are primarily composed of rDNA and prokaryotic
contamination.
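
The percentages in the EST validation summaries can be re-derived from the raw counts. The sketch below is a minimal check in Python, not part of the release pipeline. It rests on one assumption that is inferred from the numbers and not stated in these notes: the library-artifact percentage is taken over all sequences, while the other three percentages are taken over the sequences that remain once library artifacts are set aside. The function name `est_percentages` and its argument names are hypothetical.

```python
def est_percentages(total, placed, artifacts, partial, not_found):
    """Re-derive the EST summary percentages from raw counts.

    Assumption (inferred from the reported numbers, not stated in the
    release notes): artifacts are a share of all sequences; every other
    category is a share of the non-artifact sequences.
    """
    # The four categories should account for every sequence exactly once.
    assert placed + artifacts + partial + not_found == total
    usable = total - artifacts
    return {
        "placed": round(100.0 * placed / usable, 2),
        "artifacts": round(100.0 * artifacts / total, 2),
        "partial": round(100.0 * partial / usable, 2),
        "not_found": round(100.0 * not_found / usable, 2),
    }


# Counts copied from the two EST validation summaries.
excluded = est_percentages(1015064, 992071, 4288, 9763, 8942)
included = est_percentages(1015064, 1000012, 3883, 8485, 2684)
print("organelle excluded:", excluded)
print("organelle included:", included)
```

Under that assumption the results match the reported 98.15 / 0.42 / 0.97 / 0.88 and 98.90 / 0.38 / 0.84 / 0.27 figures. Dividing every category by the full 1015064 sequences would instead give about 97.7% placed for the organelle-excluded set.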
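
The "N/L50" and gap figures in the Final Assembly Statistics follow from simple arithmetic on sequence lengths. The sketch below illustrates this in Python. Judging by the reported values (for example 5/34.4 for scaffolds), the first number is a sequence count and the second a length, and the function returns them in that order. The function name `n_l50` and the toy length list are hypothetical. Only the two totals in the gap calculation come from the statistics table.

```python
def n_l50(lengths):
    """Return (count, length) in the "N/L50" order used in these notes.

    count  = fewest sequences, taken longest first, whose combined length
             reaches half of the total length
    length = length of the shortest sequence in that set
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths, only to show the calculation.
toy_lengths = [40, 30, 20, 5, 5]
print(n_l50(toy_lengths))  # (2, 30): 40 + 30 reaches half of 100

# Gap fraction from the "All" row of the scaffold table.
scaffold_bp = 345_993_675
contig_bp = 330_821_837
coverage_pct = 100.0 * contig_bp / scaffold_bp
gap_pct = 100.0 - coverage_pct
print(f"gap: {gap_pct:.1f}%")
```

The gap value printed here agrees with the "(-> 4.4% gap)" note next to the contig sequence total.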
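
Step 9 of the assembly process describes a k-mer based test for repetitive scaffolds (>= 95% composed of 24mers occurring at least 4 times in the larger scaffolds). The Python sketch below shows one way such a test could work. It is an illustration under stated assumptions, not the code used for this release: "composed of" is read as the fraction of bases covered by at least one frequent k-mer, reverse complements are ignored, and all function names are hypothetical. The toy example uses k=3 so the sequences stay short.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def repetitive_fraction(seq, counts, k, min_count=4):
    """Fraction of bases in seq covered by a k-mer seen >= min_count times."""
    if len(seq) < k:
        return 0.0
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        # Counter returns 0 for k-mers that were never counted.
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(seq)


def is_repetitive(seq, counts, k=24, min_count=4, min_fraction=0.95):
    """Apply the >= 95% threshold from step 9 (assumed interpretation)."""
    return repetitive_fraction(seq, counts, k, min_count) >= min_fraction


# Hypothetical toy data: one tandem-repeat "large scaffold" and one other.
large = ["ACGACGACGACGACG", "TTTTGGGG"]
counts = kmer_counts(large, k=3)
print(is_repetitive("ACGACG", counts, k=3))  # every 3-mer is frequent
print(is_repetitive("TTTTGG", counts, k=3))  # no 3-mer reaches 4 copies
```

A production filter would more likely rely on a dedicated k-mer counter and canonical (strand-independent) k-mers. The bookkeeping above is only meant to make the threshold in step 9 concrete.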