Can I add new genome references for my data analysis with Kangooroo?
Yes.
There are two possible options:
We can add a new genome reference to the database for you. Please note this will come with a fee. For more information, please contact sales@lexogen.com.
You can upload your genome reference of interest directly on the platform. To do so, please follow the guidelines below.
1 - Download the genome and annotation references for your species of interest from Ensembl (https://ftp.ensembl.org/).
Select the directory “pub/”.
Select the release folder of your choice (e.g., release-110).
Select the “gtf/” folder to download the annotation file, or the “fasta/” folder to download the genome file.
Select your organism of interest. For the genome file, then open the “dna/” subfolder.
Download the gtf.gz file for the annotation reference.
Download the toplevel.fa.gz file for the genome reference.
Note: If the annotation is in gff/gff3 format, use AGAT (https://agat.readthedocs.io/en/latest/gff_to_gtf.html#agat) to convert gff3 to gtf.
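The same files can also be fetched from the command line. The sketch below is only an illustration: it uses mouse (GRCm39) and release 110 as example values, and it assumes the Ensembl FTP layout pub/release-N/gtf/<species>/ for annotations and pub/release-N/fasta/<species>/dna/ for genomes. The variable names and the fetch_refs helper are hypothetical, so check the resulting URLs in a browser before downloading.

```shell
#!/bin/sh
# Example values only; adjust to your release, species, and assembly.
RELEASE=110
SPECIES=mus_musculus          # directory name on the FTP server (lowercase)
PREFIX=Mus_musculus.GRCm39    # file name prefix (species and assembly)
BASE="https://ftp.ensembl.org/pub/release-${RELEASE}"

GTF_URL="${BASE}/gtf/${SPECIES}/${PREFIX}.${RELEASE}.gtf.gz"
FASTA_URL="${BASE}/fasta/${SPECIES}/dna/${PREFIX}.dna.toplevel.fa.gz"

# Hypothetical helper: download both files into the current directory.
# -f fails on HTTP errors, -L follows redirects, -O keeps the remote file name.
fetch_refs() {
  curl -fLO "$GTF_URL" && curl -fLO "$FASTA_URL"
}

# Print the URLs so they can be checked before downloading.
echo "$GTF_URL"
echo "$FASTA_URL"
```

Once the printed URLs look right, run fetch_refs to start the download.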
2 - Decompress the gtf and fasta files and rename them as follows:
gtf file - annotation_organism_ercc_sirv_biotyped.gtf
Example: Mus_musculus.GRCm39.110.gtf.gz decompressed and renamed to annotation_organism_ercc_sirv_biotyped.gtf
fasta file - annotation_organism_ercc_sirv.fa
Example: Mus_musculus.GRCm39.dna.toplevel.fa.gz decompressed and renamed to annotation_organism_ercc_sirv.fa
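The examples above go from a .gz download to an uncompressed file name, so each file has to be decompressed as well as renamed. A minimal sketch of doing both in one step is shown below; decompress_as is a hypothetical helper name, and the original download is left untouched.

```shell
#!/bin/sh
# Hypothetical helper: decompress a .gz file under a new name.
# The compressed original is kept, so the step can be repeated safely.
decompress_as() {
  src=$1
  dest=$2
  [ -f "$src" ] || { echo "missing input: $src" >&2; return 1; }
  gunzip -c "$src" > "$dest"
}
```

With the mouse example files, this would be called as:
decompress_as Mus_musculus.GRCm39.110.gtf.gz annotation_organism_ercc_sirv_biotyped.gtf
decompress_as Mus_musculus.GRCm39.dna.toplevel.fa.gz annotation_organism_ercc_sirv.fa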
Important: To ensure compatibility with our data analysis pipelines, the .gtf annotation file must include entries for the “gene”, “transcript”, and “exon” features. Gene entries must include the attributes gene_id and gene_name. Transcript entries must include gene_id, transcript_id, gene_name, and transcript_name. In addition, each gene_id and transcript_id must be associated with at least one exon entry, which is required for correct read assignment and counting during the analysis.
Example:
1 Source gene 1000 5000 . + . gene_id "GENE001"; gene_name "MYGENE";
1 Source transcript 1000 5000 . + . gene_id "GENE001"; transcript_id "TX001"; gene_name "MYGENE"; transcript_name "MYGENE-01";
1 Source exon 1000 1200 . + . gene_id "GENE001"; transcript_id "TX001"; exon_number "1";
1 Source exon 1500 1800 . + . gene_id "GENE001"; transcript_id "TX001"; exon_number "2";
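Before uploading, these requirements can be checked locally. The sketch below is not the validation performed by the platform; it is a rough pre-check under the assumption that the file is a standard tab-separated GTF with the attributes in column 9. The helper name check_gtf is hypothetical. It prints one message per problem and returns a non-zero status if any problem is found.

```shell
#!/bin/sh
# Hypothetical helper: report GTF entries that break the rules described above.
# Assumes a tab-separated GTF with attributes in column 9.
check_gtf() {
  awk -F '\t' '
    function bad(msg) { print msg; errors++ }
    /^#/ || NF < 9 { next }    # skip comment lines and lines without 9 columns
    {
      feature = $3; a = $9; gid = ""; tid = ""
      # Extract the quoted values of gene_id and transcript_id, if present.
      if (match(a, /gene_id "[^"]+"/))       gid = substr(a, RSTART + 9,  RLENGTH - 10)
      if (match(a, /transcript_id "[^"]+"/)) tid = substr(a, RSTART + 15, RLENGTH - 16)
      gname = (a ~ /gene_name "[^"]+"/)
      tname = (a ~ /transcript_name "[^"]+"/)
      seen[feature] = 1

      if (feature == "gene") {
        if (gid == "" || !gname) bad("line " NR ": gene entry needs gene_id and gene_name")
        if (gid != "") genes[gid] = 1
      } else if (feature == "transcript") {
        if (gid == "" || tid == "" || !gname || !tname) bad("line " NR ": transcript entry needs gene_id, transcript_id, gene_name, transcript_name")
        if (tid != "") transcripts[tid] = 1
      } else if (feature == "exon") {
        if (gid != "") exon_gene[gid] = 1
        if (tid != "") exon_tx[tid] = 1
      }
    }
    END {
      if (!("gene" in seen))       bad("no gene entries found")
      if (!("transcript" in seen)) bad("no transcript entries found")
      if (!("exon" in seen))       bad("no exon entries found")
      # Every gene_id and transcript_id must have at least one exon entry.
      for (g in genes)       if (!(g in exon_gene)) bad("gene_id " g " has no exon entry")
      for (t in transcripts) if (!(t in exon_tx))   bad("transcript_id " t " has no exon entry")
      exit (errors > 0)
    }
  ' "$1"
}
```

For example, check_gtf annotation_organism_ercc_sirv_biotyped.gtf prints nothing and returns 0 when no problem is detected.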
3 - Generate a final directory containing both genome and annotation files.
First, create the directory and copy the files into it:
mkdir -p Directory_folder
cp annotation_organism_ercc_sirv.fa annotation_organism_ercc_sirv_biotyped.gtf ./Directory_folder/
Then, compress the resulting folder (tar.gz):
tar czfv Directory_folder.tar.gz Directory_folder
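It can be worth confirming that the archive contains both renamed files before uploading it. The sketch below lists the archive contents with tar and looks for the two expected file names; verify_archive is a hypothetical helper name and is not part of Kangooroo.

```shell
#!/bin/sh
# Hypothetical helper: confirm the archive lists both renamed reference files.
verify_archive() {
  # List the archive contents without extracting anything.
  listing=$(tar -tzf "$1") || return 1
  for name in annotation_organism_ercc_sirv.fa annotation_organism_ercc_sirv_biotyped.gtf; do
    printf '%s\n' "$listing" | grep -Eq "(^|/)${name}\$" || {
      echo "not found in archive: $name" >&2
      return 1
    }
  done
}
```

For example, verify_archive Directory_folder.tar.gz returns 0 when both files are present and prints the missing file name otherwise.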
4 - Upload the compressed folder (Directory_folder.tar.gz) to Kangooroo as described in the following FAQ: How do I upload my files in the Kangooroo platform?
5 - Tag the reference as a genome file.
By default, every uploaded file is assigned the type “File”. To be used as a reference by the pipeline, the type of the uploaded file must be changed to “Genome”. Please watch our tutorial video on how to tag your file as a “Genome” type file here.