Overview

The overall goal of this project was to develop software to conduct a comprehensive, genome-wide survey of mutations among human SNPs (single nucleotide polymorphisms) based on information theory. The second-round proposal outlined four goals (each with a time estimate, not reproduced here):

  1. to develop software for the purpose of comprehensively predicting the effects of known SNPs on splice information in the human transcriptome,
  2. to develop software to compute and populate a database with the Ri values (information measures) of sequences that are required for mRNA splicing for all annotated human genes,
  3. to use structured queries to predict which SNP-related changes in splice site information are likely to affect mRNA splicing of the gene in which the SNP occurs, and
  4. to organize results obtained in aim 3 in an OWL ontology framework, based on the type of splicing mutation predicted from a SNP and the annotation of the gene in which it resides, and to compare predicted mutant mRNA with the structures of expressed transcripts.

Result

The final result of this project is a Perl library implemented in C. This library provides:

  1. a FASTA parser,
  2. a RIBL parser,
  3. a SNP parser/application system, and
  4. an R_i calculation routine.

The FASTA parser is a highly optimized, error-checking finite-state machine, generated in C by Ragel, that easily outpaces single-disk systems (e.g., it reaches 70 MB/s from a hot cache on my Core 2 laptop). The RIBL parser is not as highly optimized (it is not on the critical path), but it is both error-checking and verbose. The SNP parser/application system, like the FASTA parser, is a highly optimized, error-checking, Ragel-generated finite-state machine in C. The R_i calculation routine is also highly optimized: it breaks out into various case-specific tight inner loops that were designed to optimize cache usage.

Design

Internally, the computation is done entirely in C, but the results are returned as appropriate Perl types through the Perl C API. The marshaling involved means it can never be as fast as a pure C solution, but that comparison is not meaningful, because the output always has to be processed further in ways that cannot be determined in advance. A pure C solution would require, for example, a C program writing to a file or pipe and (most likely) Perl code re-parsing that output. Supplying the values directly to the Perl code is both more flexible and more efficient.

Complex SNPs

One significant feature is support for arbitrarily complex SNPs and the associated SNP alignment. A SNP can replace an arbitrary number of alleles with a different arbitrary number of alleles. Given such a SNP, the system tries all possible ways to map it onto the sequence, using any combination of flipping the strand, reversing the orientation, and swapping which side is the reference. As some SNPs can be mapped in multiple ways, a mask flag is provided to limit the mappings tried to just those desired. Like the FASTA parser, the SNP parser is a highly optimized state machine written in C using the Ragel state machine compiler.
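As a rough illustration of the alignment search, the following sketch enumerates the eight candidate forms of a SNP string obtained by choosing which side is the reference, which strand it is given on, and which orientation it is written in. The helper names complement and candidate_mappings are hypothetical and not part of the library; the real search is done by the C state machine and also checks each candidate against the sequence.

```perl
use strict;
use warnings;

# Complement a string of IUPAC nucleotide codes (case preserved).
sub complement {
    my ($s) = @_;
    (my $c = $s) =~ tr/ACGTRYKMSWBDHVNacgtrykmswbdhvn/TGCAYRMKSWVHDBNtgcayrmkswvhdbn/;
    return $c;
}

# Enumerate candidate [removed, added] pairs for a SNP such as 'A/CT'.
# Hypothetical helper, not part of the Rogan API.
sub candidate_mappings {
    my ($snp) = @_;
    my ($lhs, $rhs) = split m{/}, $snp, -1;
    my @out;
    for my $swap (0, 1) {            # which side is taken as the reference
        for my $comp (0, 1) {        # explicit or complementary strand
            for my $rev (0, 1) {     # 5' to 3' or 3' to 5' orientation
                my ($old, $new) = $swap ? ($rhs, $lhs) : ($lhs, $rhs);
                ($old, $new) = map { complement($_) } ($old, $new) if $comp;
                ($old, $new) = map { scalar reverse $_ } ($old, $new) if $rev;
                push @out, [$old, $new];
            }
        }
    }
    return @out;
}
```

For 'A/CT' this yields, among others, the pairs (A, CT), (T, GA), and (CT, A). Which of them can actually be applied depends on the alleles present at the target location.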

Lazy Cache

A major optimization is the reformulation of the R_i calculation as a delta calculation. Instead of recomputing the R_i values from scratch after applying a SNP, the prior R_i value can be modified by subtracting and adding the contributions of the alleles removed and added, respectively, by the SNP. This reduces the order of the underlying computation. To further speed things up, the prior R_i values were all pre-computed at startup so they could be reused across multiple SNP queries. The downside of this is a significantly increased startup time.
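The delta idea can be sketched in a few lines of Perl. This is a simplified model, not the library's C code: it assumes that the score at a position is a plain sum of per-offset weights, handles only a single-base substitution, and uses hypothetical names (ri_at, apply_delta, and a weight table $w indexed as $w->{$offset}{$base}).

```perl
use strict;
use warnings;

# Score of the window anchored at position $p: the sum of weights over offsets.
# $w->{$l}{$base} is a hypothetical weight for base $base at offset $l.
sub ri_at {
    my ($seq, $w, $off0, $off1, $p) = @_;
    my $sum = 0;
    for my $l ($off0 .. $off1) {
        my $q = $p + $l;
        next if $q < 0 || $q >= length($$seq);   # outside the sequence
        $sum += $w->{$l}{ substr($$seq, $q, 1) };
    }
    return $sum;
}

# Substitute base $new at position $q and patch the cached scores in @$ri.
# Only windows that cover $q change, each by (new weight - old weight).
sub apply_delta {
    my ($seq, $w, $off0, $off1, $ri, $q, $new) = @_;
    my $old = substr($$seq, $q, 1);
    for my $l ($off0 .. $off1) {
        my $p = $q - $l;                         # window that sees $q at offset $l
        next if $p < 0 || $p > $#$ri;
        $ri->[$p] += $w->{$l}{$new} - $w->{$l}{$old};
    }
    substr($$seq, $q, 1) = $new;
    return;
}
```

Under these assumptions a substitution touches one term in each affected window, so the update costs one pass over the window width, whereas recomputing every affected window from scratch costs the width squared.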

The library combines the best of both worlds. The delta computation works with complex SNPs as long as the number of alleles removed equals the number added; in other cases the system falls back to re-computation. R_i values are computed on demand and cached for reuse across multiple SNPs at page-sized granularity (configurable at compile time), so startup is almost instantaneous. In all cases, the cache serves as a direct source of R_i values outside the range affected by the SNP in question.
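A minimal sketch of such an on-demand, page-granular cache follows. It is illustrative only: make_lazy_cache and its arguments are hypothetical, the page size is passed at run time rather than fixed at compile time, and positions are assumed to be non-negative.

```perl
use strict;
use warnings;

# Build a lazily filled, page-granular cache around a scoring callback.
# $compute->($p) returns the score at position $p (assumed >= 0).
sub make_lazy_cache {
    my ($compute, $page_size) = @_;
    my %pages;                          # page index -> array ref of scores
    my $fills = 0;                      # number of pages computed so far
    my $get = sub {
        my ($p) = @_;
        my $page = int($p / $page_size);
        unless ($pages{$page}) {        # first touch: compute the whole page
            my $base = $page * $page_size;
            $pages{$page} = [ map { $compute->($base + $_) } 0 .. $page_size - 1 ];
            $fills++;
        }
        return $pages{$page}[ $p % $page_size ];
    };
    return ($get, sub { $fills });      # accessor and a fill counter
}
```

Nothing is computed when the cache is created, which is why startup is cheap; the first query that lands on a page pays for that page, and later queries nearby are served from memory.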

Perl API

The latest version of the library is available from the SHARCNET git repository. The Perl man pages for the 1.00 release follow as an overview of the API.

Details regarding the underlying calculation can be found in

  • Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188:415-431, 1986.
  • Rogan PK, Faux BM, Schneider TD. Information analysis of human splice site mutations. Hum. Mutat., 12(3):153-171, 1998. Erratum in: Hum. Mutat., 13(1):82, 1999.

Rogan

Perl extension for Peter Rogan's information calculation


SYNOPSIS

 use Rogan::FASTA;
 use Rogan::RIBL;
 use Rogan::Query;
 # parse() returns a list of FASTA objects; take the first
 open($fh, '<', 'myfastafile') or die "myfastafile: $!";
 ($fasta) = Rogan::FASTA::parse($fh);
 close($fh);
 open($fh, '<', 'myriblfile') or die "myriblfile: $!";
 $ribl = Rogan::RIBL::parse($fh);
 close($fh);
 $query = Rogan::Query::query($fasta, $ribl);
 ($new_information_forward) = $query->apply('A/CT', Rogan::Query::SNP_ANY, 100, -10, 10,
   Rogan::Query::RESULT_NEW_INFORMATION_FORWARD);

DESCRIPTION

The sub-modules under this one can be used to calculate the information content along a FASTA sequence according to an Ri(b,l) (RIBL) weight matrix under single (or more complex) nucleotide polymorphisms (SNPs).


EXPORT

None by default.

Rogan::FASTA

Rogan::FASTA - Interface to loading FASTA files


SYNOPSIS

 use Rogan::FASTA;
 open($fh, '<', 'myfastafile') or die "myfastafile: $!";
 @fastas = Rogan::FASTA::parse($fh);
 close($fh);

DESCRIPTION

This module provides access to an extremely fast FASTA parser. The interface supports object-oriented access, so, for example, "@headers = Rogan::FASTA::headers($fasta)" can also be written as "@headers = $fasta->headers()".


@fastas = Rogan::FASTA::parse($handle)

Returns a list @fastas of FASTA objects decoded from the PerlIO stream $handle.


@headers = Rogan::FASTA::headers($fasta)

Returns a list @headers of headers (a list of references to each header's keys) for the given FASTA object $fasta.


$number = Rogan::FASTA::number($fasta)

Returns the $number of alleles for the given FASTA object $fasta.


($sequence_forward, $sequence_reverse) = Rogan::FASTA::sequence($fasta, $o_dump_offset0, $o_dump_offset1)

Returns strings ($sequence_forward, $sequence_reverse) encoding the alleles (in FASTA format) from $o_dump_offset0 to $o_dump_offset1 for the given FASTA object $fasta.


EXPORT

None by default.

Rogan::RIBL

Interface to loading RIBL files


SYNOPSIS

 use Rogan::RIBL;
 open($fh, '<', 'myriblfile') or die "myriblfile: $!";
 $ribl = Rogan::RIBL::parse($fh);
 close($fh);

DESCRIPTION

This module provides access to a RIBL parser. The interface supports object-oriented access, so, for example, "$header = Rogan::RIBL::header($ribl)" can also be written as "$header = $ribl->header()".


$ribl = Rogan::RIBL::parse($handle)

Returns the RIBL object $ribl decoded from the PerlIO stream $handle.


$header = Rogan::RIBL::header($ribl)

Returns the $header header for the given RIBL object $ribl.


@offset = Rogan::RIBL::offset($ribl)

Returns the IV packed, forward and backward offsets @offset (in row-major order with dimensions [2][2]) for the given RIBL object $ribl.


@matrix = Rogan::RIBL::matrix($ribl)

Returns the float packed, concatenated, forward and backward information matrices @matrix (in row-major order with dimensions [2][offset1-offset0+1][16]) for the given RIBL object $ribl. The final dimension corresponds to a bit encoding of which of the A, C, G, and T (least significant bit to most) alleles are present.
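The bit encoding of the final dimension can be illustrated with a short sketch that maps a FASTA letter to its index in the range 0 to 15. The bit order (A least significant, T most) follows the description above; the helper name allele_index and the use of the standard IUPAC ambiguity table are assumptions. The letter X is left out because its mapping is not stated here, and index 0 corresponds to no alleles.

```perl
use strict;
use warnings;

# One bit per base, least significant first: A=1, C=2, G=4, T=8.
my %BIT = (A => 1, C => 2, G => 4, T => 8);

# IUPAC ambiguity codes expanded to the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  K => 'GT',  M => 'AC',  S => 'CG', W => 'AT',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Index (0..15) into the final matrix dimension for one FASTA letter.
# Hypothetical helper, not part of the Rogan API.
sub allele_index {
    my ($letter) = @_;
    my $bases = $IUPAC{ uc $letter };
    die "unsupported letter '$letter'\n" unless defined $bases;
    my $index = 0;
    $index |= $BIT{$_} for split //, $bases;
    return $index;
}
```

With this encoding, R (A or G) selects index 5 and N (any base) selects index 15.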


$default = Rogan::RIBL::default($ribl)

Returns $default, the sum of the matrix values corresponding to no alleles (i.e., the default Ri value to use when there are no alleles), for the given RIBL object $ribl.


@count = Rogan::RIBL::count($ribl)

Returns the IV packed, concatenated, forward and backward count matrices @count (in row-major order with dimensions [2][offset1-offset0+1][16]) for the given RIBL object $ribl. The final dimension corresponds to a bit encoding of which of the A, C, G, and T (least significant bit to most) alleles are present.


$mean = Rogan::RIBL::mean($ribl)

Returns the mean statistic $mean for the given RIBL object $ribl.


$deviation = Rogan::RIBL::deviation($ribl)

Returns the standard-deviation statistic $deviation for the given RIBL object $ribl.


$ri_consensus = Rogan::RIBL::riConsensus($ribl)

Returns the Ri value $ri_consensus of the consensus sequence for the given RIBL object $ribl.


$ri_anticonsensus = Rogan::RIBL::riAnticonsensus($ribl)

Returns the Ri value $ri_anticonsensus of the anticonsensus sequence for the given RIBL object $ribl.


$ri_random = Rogan::RIBL::riRandom($ribl)

Returns the average Ri value $ri_random of a random sequence for the given RIBL object $ribl.


$number = Rogan::RIBL::number($ribl)

Returns the $number of sequences used to create the information matrix for the given RIBL object $ribl.


$ri_bound = Rogan::RIBL::riBound($ribl)

Returns the lower bound $ri_bound on Ri for the given RIBL object $ribl.


$z_bound = Rogan::RIBL::zBound($ribl)

Returns the lower bound $z_bound on Z for the given RIBL object $ribl.


$p_bound = Rogan::RIBL::pBound($ribl)

Returns the upper probability bound $p_bound for the given RIBL object $ribl.


EXPORT

None by default.

Rogan::Query

Interface to querying SNP effects based on information content


SYNOPSIS

 use Rogan::FASTA;
 use Rogan::RIBL;
 use Rogan::Query;
 # parse() returns a list of FASTA objects; take the first
 open($fh, '<', 'myfastafile') or die "myfastafile: $!";
 ($fasta) = Rogan::FASTA::parse($fh);
 close($fh);
 open($fh, '<', 'myriblfile') or die "myriblfile: $!";
 $ribl = Rogan::RIBL::parse($fh);
 close($fh);
 $query = Rogan::Query::query($fasta, $ribl);

DESCRIPTION

This module provides a way to apply RIBL data to FASTA sequences and to query the resulting information values and the effects of SNPs (including indels). The interface supports object-oriented access, so, for example, "($new_information_forward) = Rogan::Query::apply($query, 'A/CT', Rogan::Query::SNP_ANY, 100, -10, 10, Rogan::Query::RESULT_NEW_INFORMATION_FORWARD)" can also be written as "($new_information_forward) = $query->apply('A/CT', Rogan::Query::SNP_ANY, 100, -10, 10, Rogan::Query::RESULT_NEW_INFORMATION_FORWARD)".


$query = Rogan::Query::query($fasta, $ribl)

Returns a query object for querying the application of the RIBL object $ribl to the FASTA object $fasta.


$applicable = Rogan::Query::applicable($query, $snp, $location)

Returns the possible ways SNP $snp can be applied at location $location.

The possible ways a SNP can be applied are formed by a bitwise OR of SNP_OLD_FORWARD_FORWARD, SNP_OLD_REVERSE_REVERSE, SNP_NEW_FORWARD_FORWARD, SNP_NEW_REVERSE_REVERSE, SNP_OLD_FORWARD_REVERSE, SNP_OLD_REVERSE_FORWARD, SNP_NEW_FORWARD_REVERSE, and SNP_NEW_REVERSE_FORWARD.

The OLD|NEW pair determines whether the right-hand or left-hand side of the SNP, respectively, is considered to be the polymorphism. The first FORWARD|REVERSE pair determines whether the SNP is specified relative to the explicit FASTA sequence or the implied complementary one, respectively. The last FORWARD|REVERSE pair determines whether the SNP is oriented along the 5' to 3' or 3' to 5' direction of the explicit FASTA sequence, respectively. For convenience, the following bitwise combinations of the above are also predefined: SNP_FORWARD_FORWARD, SNP_REVERSE_REVERSE, SNP_FORWARD_REVERSE, SNP_REVERSE_FORWARD, SNP_OLD_DIRECTION_5TO3, SNP_OLD_DIRECTION_3TO5, SNP_NEW_DIRECTION_5TO3, SNP_NEW_DIRECTION_3TO5, SNP_OLD_STRAND_FORWARD, SNP_OLD_STRAND_REVERSE, SNP_NEW_STRAND_FORWARD, SNP_NEW_STRAND_REVERSE, SNP_OLD_DIRECTION_FORWARD, SNP_OLD_DIRECTION_REVERSE, SNP_NEW_DIRECTION_FORWARD, SNP_NEW_DIRECTION_REVERSE, SNP_DIRECTION_5TO3, SNP_DIRECTION_3TO5, SNP_STRAND_FORWARD, SNP_STRAND_REVERSE, SNP_DIRECTION_FORWARD, SNP_DIRECTION_REVERSE, SNP_OLD, SNP_NEW, and SNP_ANY.
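To make the mask arithmetic concrete, the sketch below builds a stand-in flag table and shows how a combined mask can be composed from, and decoded back into, the eight basic flags. The bit values are hypothetical (the real constants come from Rogan::Query), and the assumption that a combination such as SNP_OLD is the OR of the four SNP_OLD_* flags is an inference from the description above, not something the man page spells out.

```perl
use strict;
use warnings;

# Hypothetical bit assignments; the real values come from Rogan::Query.
my @NAMES = qw(
    SNP_OLD_FORWARD_FORWARD SNP_OLD_REVERSE_REVERSE
    SNP_NEW_FORWARD_FORWARD SNP_NEW_REVERSE_REVERSE
    SNP_OLD_FORWARD_REVERSE SNP_OLD_REVERSE_FORWARD
    SNP_NEW_FORWARD_REVERSE SNP_NEW_REVERSE_FORWARD
);
my %FLAG = map { $NAMES[$_] => 1 << $_ } 0 .. $#NAMES;

# OR together every basic flag whose name matches a pattern.
sub combine {
    my ($pattern) = @_;
    my $mask = 0;
    $mask |= $FLAG{$_} for grep { /$pattern/ } @NAMES;
    return $mask;
}

# List the names of the basic flags set in a mask, in the order above.
sub flag_names {
    my ($mask) = @_;
    return grep { $mask & $FLAG{$_} } @NAMES;
}
```

Here combine(qr/^SNP_OLD_/) plays the role of SNP_OLD and combine(qr/./) that of SNP_ANY. With the real constants, the same bitwise AND test can be used to list which applications a value returned by applicable contains.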


($old_sequence_forward, $old_sequence_reverse, $new_sequence_forward, $new_sequence_reverse, $old_information_forward, $old_information_reverse, $new_information_forward, $new_information_reverse) = Rogan::Query::apply($query, $snp, $application, $location, $range0, $range1, $results)

Returns the old and new sequence and information values associated with applying the SNP $snp at the location $location in the interval [$range0, $range1] about the location.

The specified SNP should be of the form "[ACGTRYKMSWBDHVNX]*/[ACGTRYKMSWBDHVNX]*", matched case-insensitively (i.e., any number of FASTA letters in either case, followed by '/', followed by any number of FASTA letters in either case). The SNP application mask $application is a bitwise OR of SNP_OLD_FORWARD_FORWARD, SNP_OLD_REVERSE_REVERSE, SNP_NEW_FORWARD_FORWARD, SNP_NEW_REVERSE_REVERSE, SNP_OLD_FORWARD_REVERSE, SNP_OLD_REVERSE_FORWARD, SNP_NEW_FORWARD_REVERSE, and SNP_NEW_REVERSE_FORWARD. Each of the allowed applications will be tried, in the order given, until either one succeeds or they all fail. See applicable for a description of the masks and convenient predefined combinations of them.
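A caller may want to check a SNP string before passing it to apply. The following sketch validates the documented form with a Perl regular expression and reports whether both sides have the same length, which is the case where the delta computation applies. Both helper names are hypothetical and the check is independent of the library's own parser.

```perl
use strict;
use warnings;

# Split a SNP string into its left- and right-hand allele strings.
# Returns an empty list if the string is not of the documented form.
sub split_snp {
    my ($snp) = @_;
    return unless defined $snp;
    return unless $snp =~ m{\A([ACGTRYKMSWBDHVNX]*)/([ACGTRYKMSWBDHVNX]*)\z}i;
    return ($1, $2);
}

# True when the SNP removes and adds the same number of alleles;
# otherwise it is an insertion or deletion.
sub is_length_preserving {
    my ($lhs, $rhs) = @_;
    return length($lhs) == length($rhs);
}
```

For example, 'A/CT' splits into ('A', 'CT') and is not length-preserving, while 'a/' is a valid pure deletion with an empty right-hand side.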

The list of information returned is masked by $results, which can be a bitwise OR of RESULT_OLD_SEQUENCE_FORWARD, RESULT_OLD_SEQUENCE_REVERSE, RESULT_NEW_SEQUENCE_FORWARD, RESULT_NEW_SEQUENCE_REVERSE, RESULT_OLD_INFORMATION_FORWARD, RESULT_OLD_INFORMATION_REVERSE, RESULT_NEW_INFORMATION_FORWARD, and RESULT_NEW_INFORMATION_REVERSE. The OLD|NEW pair determines whether the pre-SNP or post-SNP data, respectively, is returned. The SEQUENCE|INFORMATION pair determines whether the sequence or information data, respectively, is returned. The FORWARD|REVERSE pair determines whether data along the explicit FASTA sequence or the implied complementary one, respectively, is returned. For convenience, the following bitwise combinations of the above are also predefined: RESULT_OLD_FORWARD, RESULT_OLD_REVERSE, RESULT_NEW_FORWARD, RESULT_NEW_REVERSE, RESULT_OLD_SEQUENCE, RESULT_NEW_SEQUENCE, RESULT_OLD_INFORMATION, RESULT_NEW_INFORMATION, RESULT_SEQUENCE_FORWARD, RESULT_SEQUENCE_REVERSE, RESULT_INFORMATION_FORWARD, RESULT_INFORMATION_REVERSE, RESULT_FORWARD, RESULT_REVERSE, RESULT_OLD, RESULT_NEW, RESULT_SEQUENCE, RESULT_INFORMATION, and RESULT_ALL.


($old_sequence_forward, $old_sequence_reverse, $old_information_forward, $old_information_reverse) = Rogan::Query::current($query, $location, $range0, $range1, $results)

Returns the old sequence and information values at location $location in the interval [$range0, $range1] about the location.

The list of information returned is masked by $results, which can be a bitwise OR of RESULT_OLD_SEQUENCE_FORWARD, RESULT_OLD_SEQUENCE_REVERSE, RESULT_OLD_INFORMATION_FORWARD, and RESULT_OLD_INFORMATION_REVERSE. See apply for a description of the masks and convenient predefined combinations of them.


EXPORT

None by default.