(Replaced content with "Moved to https://helpwiki.sharcnet.ca/wiki/Graham%E2%80%99s_Reference_Dataset_Repository")
 
Since May 2021 we have been testing a [https://en.wikipedia.org/wiki/Network_File_System Network File System (NFS)] data mount that provides our users with commonly used datasets in [[#Bioinformatics|Bioinformatics]] and [[#AI|AI]]. The mount is intended to serve users better and to reduce the storage that these common datasets would otherwise take up in their project accounts. The datasets are mounted on <code>/datashare/</code>. You can explore the top-level directories by listing the mount:
<syntaxhighlight lang="bash">
[jshleap@gra-login1 ~]$ ls -lL /datashare/
total 152
drwxrwxr-x 9 jshleap sn_staff        4096 Jul  6 11:14 1000genomes
drwxrwxr-x 2 jshleap sn_staff      94208 Jun  4 15:30 BLASTDB
drwxrwxr-x 2 jshleap sn_staff        107 Jun  4 15:30 BLAST_FASTA
drwxrwxr-x 5 jshleap sn_staff        229 Jun  4 18:49 CIFAR-10
drwxrwxr-x 5 jshleap sn_staff        221 Jun  4 18:49 CIFAR-100
drwxrwxr-x 6 jshleap sn_staff        115 Apr 27 10:00 COCO
drwxrwxr-x 2 jshleap sn_staff        135 Jun 10 18:23 DIAMONDDB_2.0.9
drwxrwxr-x 6 jshleap sn_staff        321 Feb  4 17:39 EggNog
drwxrwxr-x 2 jshleap sn_staff          6 Mar 16 16:42 github_mirror
drwxrwxr-x 3 jshleap sn_staff          46 Mar 23 14:23 hg38
drwxrws--- 9 jshleap imagenet-optin  244 Jun 16 09:22 ImageNet
drwxrwxr-x 8 jshleap sn_staff        4096 Jun  7 16:58 kraken2_dbs
drwxrwxr-x 2 jshleap sn_staff        191 Jun  4 18:49 MNIST
drwxrwxr-x 2 jshleap sn_staff          50 Jun  4 18:51 MPI_SINTEL
drwxrwxr-x 2 jshleap sn_staff        4096 Jun  9 17:09 NCBI_taxonomy
drwxrwxr-x 6 jshleap sn_staff        145 Feb  4 22:44 PANTHER
drwxrwxr-x 5 jshleap sn_staff        4096 Apr 19 17:24 PFAM
drwxrwxr-x 7 jshleap sn_staff        4096 Mar 29 09:52 SILVA
drwxrwxr-x 6 jshleap sn_staff        257 Feb  4 22:46 SVHN
drwxrwxr-x 4 jshleap sn_staff        189 Apr 19 17:59 UNIPROT
drwxrwx--- 5 jshleap voxceleb-optin    98 Apr 23 15:15 VoxCeleb
</syntaxhighlight>
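
In the listing, most datasets are world-readable, while ImageNet and VoxCeleb are only accessible to members of the <code>imagenet-optin</code> and <code>voxceleb-optin</code> groups. The following is a minimal sketch for checking whether your account is in one of these groups before submitting a job. The helper name <code>in_group</code> is hypothetical, and how group membership is granted is not covered here.

```bash
#!/bin/bash
# in_group GROUP [GROUP_LIST]: succeed if GROUP appears in GROUP_LIST.
# GROUP_LIST defaults to the current user's groups as printed by `id -nG`.
in_group() {
    local wanted="$1"
    local groups="${2:-$(id -nG)}"
    local g
    for g in $groups; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

if in_group imagenet-optin; then
    echo "Member of imagenet-optin: /datashare/ImageNet should be readable"
else
    echo "Not a member of imagenet-optin: /datashare/ImageNet will not be readable"
fi
```

The optional second argument only exists so that the check can be exercised with an explicit group list; in normal use you would call <code>in_group voxceleb-optin</code> with no further arguments.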

Below is a detailed description of each dataset and how to access it.

== Bioinformatics ==
Bioinformatics software often relies on reference datasets (often referred to as databases) to work properly. At [https://www.sharcnet.ca SHARCNET] we provide the following datasets for bioinformatics:

=== 1000 Genomes ===
In human genetics, the [https://en.wikipedia.org/wiki/1000_Genomes_Project 1000 Genomes Project (1KGP)] was an effort to catalogue human genetic variation, and it has become a reference and comparison point for many studies. We provide a copy of the data from the project's [http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ FTP site], which will be checked for updates twice a year (June and December).

==== Directory structure ====

<div class="toccolours mw-collapsible mw-collapsed">
1000 Genomes directory tree (up to level 2):
<div class="mw-collapsible-content">
<pre>
├── CHANGELOG
├── data_collections
│  ├── 1000G_2504_high_coverage
│  ├── 1000G_2504_high_coverage_SV
│  ├── 1000_genomes_project
│  ├── gambian_genome_variation_project
│  ├── gambian_genome_variation_project_GRCh37
│  ├── geuvadis
│  ├── han_chinese_high_coverage
│  ├── HGDP
│  ├── HGSVC2
│  ├── hgsv_sv_discovery
│  ├── HLA_types
│  ├── illumina_platinum_pedigree
│  ├── index.html
│  ├── README_data_collections.md
│  └── simons_diversity_data
├── historical_data
│  ├── former_toplevel
│  ├── index.html
│  └── README_historical_data.md
├── index.html
├── phase1
│  ├── analysis_results
│  ├── data
│  ├── index.html
│  ├── phase1.alignment.index
│  ├── phase1.alignment.index.bas.gz
│  ├── phase1.exome.alignment.index
│  ├── phase1.exome.alignment.index.bas.gz
│  ├── phase1.exome.alignment.index.HsMetrics.gz
│  ├── phase1.exome.alignment.index.HsMetrics.stats
│  ├── phase1.exome.alignment.index_stats.csv
│  ├── README.phase1_alignment_data
│  └── technical
├── phase3
│  ├── 20130502.phase3.analysis.sequence.index
│  ├── 20130502.phase3.exome.alignment.index
│  ├── 20130502.phase3.low_coverage.alignment.index
│  ├── 20130502.phase3.sequence.index
│  ├── 20130725.phase3.cg_sra.index
│  ├── 20130820.phase3.cg_data_index
│  ├── 20131219.populations.tsv
│  ├── 20131219.superpopulations.tsv
│  ├── data
│  ├── index.html
│  ├── integrated_sv_map
│  ├── README_20150504_phase3_data
│  └── README_20160404_where_are_the_phase3_variants
├── pilot_data
│  ├── data
│  ├── index.html
│  ├── paper_data_sets
│  ├── pilot_data.alignment.index
│  ├── pilot_data.alignment.index.bas.gz
│  ├── pilot_data.sequence.index
│  ├── README.alignment.index
│  ├── README.bas
│  ├── README.sequence.index
│  ├── release
│  ├── SRP000031.sequence.index
│  ├── SRP000032.sequence.index
│  ├── SRP000033.sequence.index
│  └── technical
├── PRIVACY-NOTICE.txt
├── README_ebi_aspera_info.md
├── README_file_formats_and_descriptions.md
├── README_ftp_site_structure.md
├── README_missing_files.md
├── README_populations.md
├── README_using_1000genomes_cram.md
├── release
│  ├── 2008_12
│  ├── 2009_02
│  ├── 2009_04
│  ├── 2009_05
│  ├── 2009_08
│  ├── 20100804
│  ├── 2010_11
│  ├── 20101123
│  ├── 20110521
│  ├── 20130502
│  └── index.html
└── technical
    ├── browser
    ├── index.html
    ├── method_development
    ├── ncbi_varpipe_data
    ├── other_exome_alignments
    ├── other_exome_alignments.alignment_indices
    ├── phase3_EX_or_LC_only_alignment
    ├── pilot2_high_cov_GRCh37_bams
    ├── pilot3_exon_targetted_GRCh37_bams
    ├── qc
    ├── README.reference
    ├── reference
    ├── retired_reference
    ├── simulations
    ├── supporting
    └── working
</pre>
</div>
</div>

As per '''their''' README, the directory structure is:

<span style="font-size:110%">'''changelog_details'''</span><br>

This directory contains a series of files detailing the changes made to the FTP site over time.

<span style="font-size:110%">'''data_collections'''</span><br>

The data_collections directory contains directories for various collections of data, typically generated by different projects. Among the data collections is the '''1000 Genomes Project''' data.

For each collection of data, README and index files within that collection's directory provide information on the collection. Under each collection directory there is a data directory, under which files are organised by population and then by sample. Further information can be found in <code>/datashare/1000genomes/data_collections/README_data_collections.md</code>.

<span style="font-size:110%">'''historical_data'''</span><br>

This directory was created during a rearrangement of the FTP site in September 2015. It houses README and index files that were formerly present at the top level of the site, including dedicated index directories. Further information is available in <code>/datashare/1000genomes/historical_data/README_historical_data.md</code>.

<span style="font-size:110%">'''phase1'''</span><br>

This directory contains data that supports the publications associated with phase 1 of the 1000 Genomes Project.

<span style="font-size:110%">'''phase3'''</span><br>

This directory contains data that supports the publications associated with phase 3 of the 1000 Genomes Project.

<span style="font-size:110%">'''pilot_data'''</span><br>

This directory contains data that supports the publications associated with the pilot phase of the 1000 Genomes Project.

<span style="font-size:110%">'''release'''</span><br>

The release directory contains dated directories, which contain analysis result sets plus README files explaining how those data sets were produced.

Originally, the date in a release subdirectory name was the date on which the given release was made. Thereafter, the release subdirectory dates were based on the date in the name of the corresponding YYYYMMDD.sequence.index file. In future, the date in the directory name will be chosen in a manner appropriate to the data and the nature of the release.

Examples of release subdirectories are <code>/datashare/1000genomes/release/2008_12/</code> and <code>/datashare/1000genomes/release/20100804/</code>.

In cases where release directories are named after the date of the YYYYMMDD.sequence.index file, the SNP calls, indel calls, etc. in these directories are based on alignments produced from the data listed in that YYYYMMDD.sequence.index file.

For example, the directory <code>/datashare/1000genomes/release/20100804/</code> contains the release versions of SNP and indel calls based on the <code>/datashare/1000genomes/historical_data/former_toplevel/sequence_indices/20100804.sequence.index</code> file.
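
The naming rule above can be sketched as a small helper that maps a date-named release directory to the sequence index it was built from. The function name <code>release_index</code> is hypothetical, and the sketch assumes that other YYYYMMDD releases follow the same pattern as the 20100804 example; it does not apply to the earlier YYYY_MM releases.

```bash
#!/bin/bash
# Root of the 1000 Genomes copy on the data mount.
BASE=/datashare/1000genomes

# release_index YYYYMMDD: print the path of the sequence index that a
# date-named release directory is based on. Rejects any other name format.
release_index() {
    local release="$1"
    case "$release" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *)
            echo "release_index: '$release' is not a YYYYMMDD release name" >&2
            return 1
            ;;
    esac
    echo "${BASE}/historical_data/former_toplevel/sequence_indices/${release}.sequence.index"
}

release_index 20100804
```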

<span style="font-size:110%">'''technical'''</span><br>

The technical directory contains subdirectories for other data sets such as simulations, files for method development, interim data sets, reference genomes, etc.

An example of data stored under technical is <code>/datashare/1000genomes/technical/simulations/</code>.

<div class="warning">
'''WARNING: /datashare/1000genomes/technical/working/'''
  The working directory under technical contains data that has experimental (non-public release) status
  and is suitable for internal project use only. Please use with '''caution'''.
</div>

=== BLASTDB ===
[https://blast.ncbi.nlm.nih.gov/Blast.cgi BLAST] uses a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches. These databases contain the sequence information deposited at NCBI and are made available here as pre-formatted databases with the same structure as the /db directory of the [ftp://ftp.ncbi.nlm.nih.gov/blast/db/ BLAST FTP site].

The pre-formatted databases offer the following advantages:
* Pre-formatting removes the need to run [https://www.ncbi.nlm.nih.gov/books/NBK569841/ makeblastdb]
* Species-level taxonomy IDs are included for each database entry
* Sequences in FASTA format can be generated from the pre-formatted databases with the [https://www.ncbi.nlm.nih.gov/books/NBK569853/ blastdbcmd utility]
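
As an illustration of the last point, the sketch below writes every sequence of a pre-formatted database to a FASTA file with <code>blastdbcmd</code>. The wrapper name <code>dump_fasta</code> and the choice of swissprot are only for illustration, and the sketch assumes that the blast+ module has already been loaded.

```bash
#!/bin/bash
# dump_fasta DB OUT: write all sequences of the BLAST database DB to OUT
# in FASTA format (the default output format of blastdbcmd).
dump_fasta() {
    local db="$1" out="$2"
    if ! command -v blastdbcmd >/dev/null 2>&1; then
        echo "blastdbcmd not found; load the blast+ module first" >&2
        return 127
    fi
    blastdbcmd -db "$db" -entry all -out "$out"
}

# Example call:
# dump_fasta /datashare/BLASTDB/swissprot "${SLURM_TMPDIR}/swissprot.fasta"
```

For the larger databases (such as nr or nt) the resulting FASTA file can be very large, so it is worth writing it to scratch or node-local storage rather than to a project directory.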

<div class="warning">
'''IMPORTANT:''' The BLAST databases in this folder are version 5 (v5). Information on the features newly enabled by the v5 databases can be found [https://ftp.ncbi.nlm.nih.gov/blast/db/blastdbv5.pdf here].
</div>

All available pre-formatted databases are located in Graham's <code>/datashare/BLASTDB</code> and will be updated every 3 months (Jan, Apr, Jul, Oct).

==== Directory structure ====
<code>/datashare/BLASTDB</code> contains all the pre-formatted databases in a single directory, without subfolders. We include the following:

{| class="wikitable"
|-
!|Name
!|Type
!|Title
|-
|16S_ribosomal_RNA
|DNA
|16S ribosomal RNA (Bacteria and Archaea type strains)
|-
|18S_fungal_sequences
|DNA
|18S ribosomal RNA sequences (SSU) from Fungi type and reference material
|-
|28S_fungal_sequences
|DNA
|28S ribosomal RNA sequences (LSU) from Fungi type and reference material
|-
|Betacoronavirus
|DNA
|Betacoronavirus
|-
|GCF_000001405.38_top_level
|DNA
|Homo sapiens GRCh38.p12 [GCF_000001405.38] chromosomes plus unplaced and unlocalized scaffolds
|-
|GCF_000001635.26_top_level
|DNA
|Mus musculus GRCm38.p6 [GCF_000001635.26] chromosomes plus unplaced and unlocalized scaffolds
|-
|ITS_RefSeq_Fungi
|DNA
|Internal transcribed spacer region (ITS) from Fungi type and reference material
|-
|ITS_eukaryote_sequences
|DNA
|ITS eukaryote BLAST
|-
|env_nt
|DNA
|environmental samples
|-
|nt
|DNA
|Nucleotide collection (nt)
|-
|patnt
|DNA
|Nucleotide sequences derived from the Patent division of GenBank
|-
|pdbnt
|DNA
|PDB nucleotide database
|-
|ref_euk_rep_genomes
|DNA
|RefSeq Eukaryotic Representative Genome Database
|-
|ref_prok_rep_genomes
|DNA
|Refseq prokaryote representative genomes (contains refseq assembly)
|-
|ref_viroids_rep_genomes
|DNA
|Refseq viroids representative genomes
|-
|ref_viruses_rep_genomes
|DNA
|Refseq viruses representative genomes
|-
|refseq_rna
|DNA
|NCBI Transcript Reference Sequences
|-
|refseq_select_rna
|DNA
|RefSeq Select RNA sequences
|-
|env_nr
|Protein
|Proteins from WGS metagenomic projects (env_nr)
|-
|landmark
|Protein
|Landmark database for SmartBLAST
|-
|nr
|Protein
|All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
|-
|pdbaa
|Protein
|PDB protein database
|-
|pataa
|Protein
|Protein sequences derived from the Patent division of GenBank
|-
|refseq_protein
|Protein
|NCBI Protein Reference Sequences
|-
|refseq_select_prot
|Protein
|RefSeq Select proteins
|-
|swissprot
|Protein
|Non-redundant UniProtKB/SwissProt sequences
|-
|split-cdd
|Protein
|CDD split into 32 volumes
|-
|tsa_nr
|Protein
|Transcriptome Shotgun Assembly (TSA) sequences
|}

==== Usage ====
The most efficient way to use these databases is to copy the specific database to <code>$SLURM_TMPDIR</code> at the beginning of your sbatch script. The copy adds between 5 and 30 minutes to the job (depending on the database you are copying), so use this approach only when you expect your BLAST run to take longer than one hour. For example, your sbatch script can look something like this:

    #!/bin/bash
    #SBATCH --time=02:00:00
    #SBATCH --mem=32G
    #SBATCH --cpus-per-task=8
    #SBATCH --account=def-someuser
    module load StdEnv/2020 gcc/9.3.0 blast+/2.11.0  # load BLAST and its dependencies
    # copy all files of the required database (in this case nr) to $SLURM_TMPDIR; stop if the copy fails
    cp /datashare/BLASTDB/nr.* "${SLURM_TMPDIR}/" || exit 1
    blastp -db "${SLURM_TMPDIR}/nr" -num_threads "${SLURM_CPUS_PER_TASK}" -query myquery.fasta

Note that the example above assumes that you launched the job from the directory where myquery.fasta is located, that myquery.fasta contains protein sequences, and that nr is the required database.

You can also point <code>-db</code> directly at <code>/datashare/BLASTDB/nr</code>, but this might be slower than reading the database from local disk.
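
The two options above can be combined: try to stage the database on node-local disk, and fall back to the copy on the data mount if staging fails (for example, because the local disk is full). This is a minimal sketch rather than a recommended recipe. The function name <code>stage_db</code> is hypothetical, and it assumes that a database named nr consists of files matching <code>nr.*</code> in the source directory.

```bash
#!/bin/bash
# stage_db SRC_DIR DB_NAME DEST_DIR
# Copies SRC_DIR/DB_NAME.* into DEST_DIR and prints the path to pass to -db.
# If the copy fails, prints the original path under SRC_DIR instead.
stage_db() {
    local src="$1" db="$2" dest="$3"
    if [ -d "$dest" ] && cp "$src/$db".* "$dest"/ 2>/dev/null; then
        echo "$dest/$db"
    else
        echo "staging $db failed; using $src/$db directly" >&2
        echo "$src/$db"
    fi
}

# Inside an sbatch script, after loading the blast+ module:
# DB=$(stage_db /datashare/BLASTDB nr "$SLURM_TMPDIR")
# blastp -db "$DB" -num_threads "$SLURM_CPUS_PER_TASK" -query myquery.fasta
```

Printing the path instead of exporting a variable keeps the helper free of side effects, so the same function can stage several databases in one job.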

==== Other Compute Canada Sources ====
BLAST databases can also be found on all clusters through a CVMFS repository (see https://docs.computecanada.ca/wiki/Genomics_data). Unfortunately, these databases are based on the NCBI cloud FTP, which is out of date.

=== BLAST_FASTA ===

=== DIAMONDDB_2.0.9 ===

=== EggNog ===

=== hg38 ===

=== kraken2_dbs ===

=== NCBI_taxonomy ===

=== PANTHER ===

=== PFAM ===

=== SILVA ===

=== SVHN ===

=== UNIPROT ===

== AI ==

=== CIFAR-10 ===

=== CIFAR-100 ===

=== COCO ===

=== ImageNet ===

=== MNIST ===

=== MPI_SINTEL ===

=== VoxCeleb ===

Latest revision as of 11:33, 3 September 2021

Moved to https://helpwiki.sharcnet.ca/wiki/Graham%E2%80%99s_Reference_Dataset_Repository