- How did you assemble the transcriptome?
- In the paper you implemented a novel strategy to cluster unigenes, how does it work?
- Did the novel clustering approach have any impact?
- Which reference species did you use and why?
- What does each unigene cluster represent?
- What can I do with the sequences in each cluster?
- How does the “Search by gene symbol” option work?
- What can I do with the advanced search?
- How do I interpret the results of search/advanced search?
- What can I do with the "Go" buttons in the unigene view page?
- What's the best way to look for my gene?
- Something is wrong with the interface, what do I do?
- Something's wrong with the annotation, what do I do?
- Software dependencies and known issues?
- Can I have the source code for the web interface?
- Can I have the source code of the re-clustering pipeline?
- Can I have the bulk data?
1. How did you assemble the transcriptome?
We used the Oases pipeline with the multiple k-mer option, with k-mer lengths ranging from 23 to 41.
2. In the paper you implemented a novel strategy to cluster unigenes, how does it work?
First, we used a sequence identity threshold (CD-HIT-EST) to partition the original loci formed by Oases into secondary clusters. Then we annotated a reference transcript from each secondary cluster (the longest sequence) using the proteomes of our reference species. Next, we merged secondary clusters displaying identical annotations and estimated their average sequence identity and overlap. Finally, we took the rightmost 1% tail of the overlap and identity distributions as (arbitrary) cut-offs and merged secondary clusters whose overlap and overlap identity surpassed these thresholds.
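The annotation-based merging step described above can be sketched roughly as follows. This is only an illustration under our own assumptions, not the code of the actual pipeline: the function names, the dictionary layout and the toy identifiers are hypothetical, and the later percentile-based merging step is not shown.

```python
from collections import defaultdict


def reference_transcript(cluster):
    """Return the ID of the longest sequence in a cluster ({seq_id: sequence})."""
    return max(cluster, key=lambda seq_id: len(cluster[seq_id]))


def merge_by_annotation(clusters, annotations):
    """Group secondary clusters whose reference transcripts share an annotation.

    clusters:    {cluster_name: {seq_id: sequence}}   (hypothetical layout)
    annotations: {seq_id: annotation} for annotated reference transcripts
    """
    merged = defaultdict(list)
    for name, cluster in clusters.items():
        ref = reference_transcript(cluster)
        annotation = annotations.get(ref)
        # Clusters without an annotation are kept on their own.
        key = ("annotated", annotation) if annotation else ("unannotated", name)
        merged[key].append(name)
    return [sorted(names) for names in merged.values()]


# Toy input: c1 and c2 have reference transcripts with the same annotation.
clusters = {
    "c1": {"t1": "ACGTACGT", "t2": "ACG"},
    "c2": {"t3": "ACGTAC"},
    "c3": {"t4": "GGGG"},
}
annotations = {"t1": "GENE_X", "t3": "GENE_X"}
print(merge_by_annotation(clusters, annotations))
```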
3. Did the novel clustering approach have any impact?
After applying our clustering strategy, the Pearson correlation coefficients between gene family sizes estimated for S. rexii and for other fully sequenced plant genomes (according to Plaza gene family annotations) consistently improved. Additionally, the number of clusters sharing identical annotations was greatly reduced by our procedure, suggesting that loci derived from partial transcripts with limited overlap had been successfully merged.
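For readers unfamiliar with this kind of comparison, here is a minimal sketch of how a Pearson correlation between gene family sizes in two species can be computed. The family sizes below are made-up toy numbers, not values from the paper, and the function name is our own.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy example: sizes of the same four gene families in two species.
sizes_species_a = [10, 4, 7, 1]
sizes_species_b = [12, 5, 9, 2]
print(pearson(sizes_species_a, sizes_species_b))
```

A coefficient closer to 1 means that families that are large in one species also tend to be large in the other.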
4. Which reference species did you use and why?
Arabidopsis thaliana (TAIR 10) (Lamesch et al., 2012), Solanum lycopersicum (Solanum genome browser) (Bombarely et al., 2011), and Mimulus guttatus, Glycine max and Oryza sativa japonica (all from Phytozome) (Goodstein et al., 2012). We chose Arabidopsis and rice because they are "gold standards" with high-quality annotations, and the others for their relative phylogenetic proximity to S. rexii.
5. What does each unigene cluster represent?
In an ideal world, each unigene would represent a cluster of alternative transcripts for a specific gene in Streptocarpus. In reality, you may find alleles or very close paralogs in the same cluster.
6. What can I do with the sequences in each cluster?
You can think of the sequences in our clusters of unigenes as "reconstructed ESTs". You could use them as a starting point to identify or clone your favourite gene, or to design primers for an RT-PCR experiment. You can even use the sequences for phylogenetic analyses. But beware: the sequences are not perfect in any sense. They may contain assembly errors and frameshifts, so we recommend treating them carefully (we often use the MACSE software to try to identify frameshifts).
7. How does the “Search by gene symbol” option work?
For now, gene symbols are Arabidopsis-oriented. For a more elaborate search, please use the advanced search form.
8. What can I do with the advanced search?
You can search genes by sequence identifier (probably the most sensible way if you already know which gene you want), by GO term description (to retrieve large groups of functionally related genes) or by gene description. You can also use Plaza gene family identifiers to restrict the search to specific gene families.
Please note that when you specify values for more than one search field, the fields are combined with the logical connector AND, so using multiple values will narrow your search. The database is unigene-oriented: it was designed under the assumption that the user is looking for one particular gene or group of genes at a time.
Please bear in mind that gene annotations are not completely standardized in the biology community; sequence similarity searching can help overcome this problem.
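To illustrate why filling in more fields narrows the results, here is a small sketch of AND-combined searching. It is not the code behind the web interface: the field names (`description`, `go_term`, `family`), the matching rule (case-insensitive substring) and the records are all hypothetical.

```python
def advanced_search(records, **criteria):
    """Return the records that match ALL given criteria (logical AND).

    A criterion matches when its value occurs, ignoring case, in the
    corresponding field of the record.
    """
    hits = []
    for record in records:
        if all(
            str(value).lower() in str(record.get(field, "")).lower()
            for field, value in criteria.items()
        ):
            hits.append(record)
    return hits


# Made-up records for illustration only.
records = [
    {"id": "u1", "description": "MADS-box transcription factor",
     "go_term": "regulation of transcription", "family": "FAM_A"},
    {"id": "u2", "description": "bHLH transcription factor",
     "go_term": "regulation of transcription", "family": "FAM_B"},
    {"id": "u3", "description": "sucrose transporter",
     "go_term": "sucrose transport", "family": "FAM_C"},
]

# One field: two hits. Two fields: the second criterion removes one of them.
print([r["id"] for r in advanced_search(records, go_term="transcription")])
print([r["id"] for r in advanced_search(records, go_term="transcription", family="FAM_A")])
```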
9. How do I interpret the results of search/advanced search?
10. What can I do with the "Go" buttons in the unigene view page?
11. What's the best way to look for my gene?
Sequence similarity searching (BLAST) is the best choice if you have a homologous sequence from a related organism. Searching by sequence identifier also works, but only if you know the identifier for one of our reference species.
12. Something's wrong with the interface, what do I do?
Send us an email, we'll try to fix it.
13. Something's wrong with the annotation, what do I do?
Same as above.
14. Software dependencies and known issues?
An up-to-date version of Java is needed to run the Jalview applet.
15. Can I have the source code of the web interface?
For now the source code does not comply with production standards (it is mostly undocumented). However, at your own risk, you can send us an email or download it here.
16. Can I have the source code of the re-clustering pipeline?
Same as above.
17. Can I have the bulk data? (right click and save as to download)
- Reference transcripts: contains the representative sequences
- Complete collection: all the transcripts in the database
- Short transcripts: transcripts shorter than 350 bp (see the paper)
- Annotation table
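If you want to apply a length cut-off like the 350 bp one yourself after downloading the sequences, a sketch along these lines may help. It assumes the downloads are plain FASTA text, which is our assumption here; the function names are our own.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=350):
    """Split (header, sequence) pairs into (short, long) lists.

    Sequences shorter than min_length go to the first list.
    """
    short, long_enough = [], []
    for header, sequence in records:
        target = short if len(sequence) < min_length else long_enough
        target.append((header, sequence))
    return short, long_enough
```

For a downloaded file you could call, for example, `split_by_length(read_fasta(open("transcripts.fasta")))`, where the file name is a placeholder.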