Can I add new genome references for my data analysis with Kangooroo?
Yes. There are two possible options:
We can add a new genome reference to the database for you. Please note this will come with a fee. For more information, please contact sales@lexogen.com.
You can upload your genome reference of interest directly to the platform. To do so, please follow the guidelines below:
1 - Download the genome and annotation references for your species of interest from Ensembl (https://ftp.ensembl.org/).
Select the directory "pub/"
Select the release folder of your choice (e.g., release-110)
Select the "gtf/" folder to download the annotation file, or the "fasta/" folder to download the genome file
Select your organism of interest (for the genome file, then open its "dna/" subfolder)
Download the gtf.gz file for the annotation reference
Download the toplevel.fa.gz file for the genome reference
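The same two files can also be fetched from the command line. The following is a minimal sketch, not an official procedure: it assumes the usual Ensembl layout (gtf/<species>/ and fasta/<species>/dna/ under the release folder), that curl is installed, and the function names are hypothetical helpers introduced only for illustration.

```shell
# Hypothetical helpers: build Ensembl download URLs for one species and release.
# Assumed layout: pub/release-N/gtf/<species>/ and pub/release-N/fasta/<species>/dna/
ENSEMBL_BASE="https://ftp.ensembl.org/pub"

ensembl_gtf_url() {      # args: release, species (lowercase), Species.Assembly
  printf '%s/release-%s/gtf/%s/%s.%s.gtf.gz\n' "$ENSEMBL_BASE" "$1" "$2" "$3" "$1"
}

ensembl_genome_url() {   # args: release, species (lowercase), Species.Assembly
  printf '%s/release-%s/fasta/%s/dna/%s.dna.toplevel.fa.gz\n' "$ENSEMBL_BASE" "$1" "$2" "$3"
}

download_reference() {   # saves both files in the current directory
  curl -fLO "$(ensembl_gtf_url "$@")"
  curl -fLO "$(ensembl_genome_url "$@")"
}
```

For the mouse example used in this FAQ, the call would be: download_reference 110 mus_musculus Mus_musculus.GRCm39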
2 - Decompress the gtf and fasta files and rename them as follows:
gtf file - annotation_organism_ercc_sirv_biotyped.gtf
Example: Mus_musculus.GRCm39.110.gtf.gz decompressed and renamed to annotation_organism_ercc_sirv_biotyped.gtf
fasta file - annotation_organism_ercc_sirv.fa
Example: Mus_musculus.GRCm39.dna.toplevel.fa.gz decompressed and renamed to annotation_organism_ercc_sirv.fa
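The examples above go from a .gz download to a plain .gtf or .fa file, which implies decompressing as well as renaming. One way to do both in a single step is sketched below; decompress_as is a hypothetical helper name, and gunzip is assumed to be available.

```shell
# Decompress a .gz download directly under the file name required in step 2.
# gunzip -c writes to standard output, so the original .gz file is kept.
decompress_as() {        # args: input.gz, output file name
  gunzip -c "$1" > "$2"
}
```

With the mouse files, this would be used as:
decompress_as Mus_musculus.GRCm39.110.gtf.gz annotation_organism_ercc_sirv_biotyped.gtf
decompress_as Mus_musculus.GRCm39.dna.toplevel.fa.gz annotation_organism_ercc_sirv.fa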
3 - Create a gene_description.txt file.
The gene_description.txt file is a tab-separated file. Its first two columns hold the Gene ID and the Gene name; the command below also writes three further columns (Gene description, Chromosome/scaffold name, Gene start), which are left empty.
To generate the gene_description.txt file, please use the command below, which extracts the Gene ID and Gene name from the annotation file.
((printf 'Gene stable ID\tGene name\tGene description\tChromosome/scaffold name\tGene start (bp)\n') && (cat annotation_organism_ercc_sirv_biotyped.gtf | awk 'BEGIN{FS="\t"}{if($3 ~/gene/){if($9 ~ /gene_id/ && $0 ~ /gene_name/){match($9,/gene_id \"([^\"]+)\"/); gid = substr($9,RSTART, RLENGTH); match($9,/gene_name \"([^\"]+)\"/); gn = substr($9,RSTART, RLENGTH); map[gid] = gn }}}END{for(key in map){print key,map[key]}}' | sed 's/\"//g' | awk '{print $2"\t"$4"\t\t\t"}')) > gene_description.txt
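Before moving on, it can be worth checking that the generated file has the expected shape. The sketch below assumes that every data row should have five tab-separated columns, as the command above is meant to produce; check_columns is a hypothetical helper, and the header line is skipped.

```shell
# Hypothetical check: list data rows (header skipped) of a tab-separated file
# that do not have the expected number of columns; non-zero exit if any exist.
check_columns() {        # args: file, expected column count
  awk -F'\t' -v n="$2" '
    NR > 1 && NF != n { printf "line %d: %d columns\n", NR, NF; bad = 1 }
    END { exit bad + 0 }' "$1"
}
```

For example: check_columns gene_description.txt 5 && head -n 3 gene_description.txt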
4 - Create a .bed file from the gtf file.
To create a .bed file, please download the tools gtfToGenePred and genePredToBed (UCSC Genome Browser utilities) and run the following commands:
gtfToGenePred annotation_organism_ercc_sirv_biotyped.gtf annotation_organism_ercc_sirv.genepred
genePredToBed annotation_organism_ercc_sirv.genepred annotation_organism_ercc_sirv.bed
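Both commands fail if the tools are not on your PATH. A small pre-flight check along the following lines can catch that early; require_tools is a hypothetical helper, not part of the UCSC utilities.

```shell
# Report which of the given commands are missing from PATH;
# returns non-zero if at least one is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example: require_tools gtfToGenePred genePredToBed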
5 - Generate a final directory containing both genome and annotation files.
First, create the directory and copy the files into it:
mkdir -p Directory_folder
cp gene_description.txt annotation_organism_ercc_sirv.fa annotation_organism_ercc_sirv_biotyped.gtf annotation_organism_ercc_sirv.bed ./Directory_folder/
Then, compress the resulting folder (tar.gz):
tar czfv Directory_folder.tar.gz Directory_folder
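Before uploading, you may want to confirm that the archive really contains all four files. The sketch below lists the archive with tar and looks for each name; archive_has is a hypothetical helper, and the match is a simple substring test on the listing.

```shell
# List the archive and confirm that each expected file name appears in it.
archive_has() {          # args: archive.tar.gz, then one or more file names
  archive=$1; shift
  listing=$(tar tzf "$archive") || return 1
  for name in "$@"; do
    printf '%s\n' "$listing" | grep -qF -- "/$name" \
      || { echo "not in archive: $name" >&2; return 1; }
  done
}
```

For example: archive_has Directory_folder.tar.gz gene_description.txt annotation_organism_ercc_sirv.fa annotation_organism_ercc_sirv_biotyped.gtf annotation_organism_ercc_sirv.bed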
6 - Upload the compressed folder (Directory_folder.tar.gz) to Kangooroo as described in the following FAQ: How do I upload my files in the Kangooroo platform?
7 - Tag the reference as a genome file.
By default, all uploaded files have a designated type marked as “File”. To be used as a reference by the pipeline, the type of the uploaded file must be changed to “Genome”. Please watch our tutorial video on how to tag your file as a “Genome” type file here.