oxfordfeb162012:start

1. Analysis of Measurements of a Polymer

1. Analysis of Measurements of a Polymer

Priority: Feb. 16, 2012

1.1 Nanopore Structure [0001]-[0009]

• [0003] Original nanopore idea by Branton et al. (US-5,795,782) in 1996.
• [0003] Early experiments performed on homopolymer of nucleotides.
• [0003] Translocation was very vast 5-us/base (and difficult to measure).
• [0004] Branton proposed addition of polymerase to control speed of translocation.
• [0006] In 2010 Akeson showed Phi29 DNAP can function on top of nanopore.

1.2 Sequence Determination [0010]-[0012]

• [0010] “Generally in these approaches, each of the states within the signal have been analysed independently by comparing the current levels of these states to known current levels from reference data.” Conversion from signal space to sequence space*.
• [0011] Blunt reader heads read k-mers.
• [0011] If there are n possible polymer units you have <m>n^k</m> different k-mers to resolve.
• [0012] Much research has focused on measurement systems that try to resolve individual polymer units. A very difficult prospect.
• [0012] Other work has accepted k-mer measurement, but still tries to distinguish them. But signals from different k-mers often overlap.

1.3 Present Invention [0013]-[0020]

• [0013] The present invention shows a method for analyzing measurements dependent on the identity of k-mers consisting of…
• [0014] 1.) deriving a feature vector from the measurements.
• [0015] 2.) determining the similarity to at least one other feature vector.
• [0017] The present invention does not try to extract the exact sequence. Many applications don't need this.

1.4 Information from Feature Vectors [0021]-[0032]

• [0021] Compare one feature vector (derived or measured) to another that is stored in memory. Determine similarity between entirety or parts of vectors.
• [0022] Degree of similarity can be basis for polymer classification.
• [0023] Can compare to library of features stored in memory.
• [0024] Can compare to combined feature vector (bunch of stored features spliced together).
• [0027] Derived feature vectors can be compared to one another.

1.5 Performing a Series of Measurements [0033]-[0037]

• [0033] Can make continuous series of measurements.
• [0034] Looking for presence\absence\amount of target.

1.6 Applications [0038]-[0045]

1.7 What Molecules Can Be Used [0072]-[0081]

• All sorts of molecules.

1.8 Samples That Can Be Used [0082]-[0086]

1.9 Membranes That Can Be Used [0087]-[0094]

1.10 Nanopores That Can Be Used [0095]-[0142]

1.11 Strand Sequencing Strategies [0143]-[0155]

1.12 Measurement Systems [0156]-[0165]

1.13 Nature of Measurements [0166]-[0171]

1.14 Preferable Apparatus [0172]-[0177]

1.15 Measurement and Signals [0178]-[0195]

• [0178] A typical set-up.
• [0179] A typical signal.
• [0188] Known polymers can be used to determine measurement dependence on k-mers.
• [0189] Use polymers with identical sequences except for k-mer at pre-determined position.
• [0190] Note from the histogram how distinctive a level a 6-mer can produce (surrounded by variable sequences 11 bases to the left and 8 bases to the right).
• [0191] This approach can be used to identify the location of a minimal k-mer!!!. That is if you are looking for particular k-mers, just learn their levels first and just look for these patterns!!! You don't have to sequence the whole thing first and then look for the patterns!
• [0193] Can't expect k-mers to be perfectly resolvable. Look at the histogram above. The 6-mer has a range of about 2 pA around its average. The net current range is about 40-pA (from 30 pA to 70 pA). Clearly the 4096 6-mer possibilities will be overlapping.

1.16 Invention: Method for Analyzing Time-Ordered Sequence of Measurements [0196]-[0213]

• [0200] State detection S1 identifies successive groups of measurements (i.e. event levels).
• [0201] S1 may be performed by looking for short-term increases in the derivative S1-1 of the input signal.

• [0203] S1-2 LPFs the noisy differential.
• [0204] S1-3 thresholds the filtered signal.
• [0205] Derive values for group features (e.g. average, median, etc.).
• [0206] The feature values form the feature vector 12.
• [0210] Simple state detection is with sliding window method. This compares the means of two adjacent window of data. A threshold can be put on difference in mean or based on variance in the two windows (e.g. using Student's t-statistic).
• [0211] Example of reduction using moving window t-test.
• [0213] S3 can be implemented in a variety of manners.

1.17 S3 Example 1 [0214]-[0220]

• [0214] Compare a derived feature vector 12 with one or more stored feature vectors 14.
• [0216] Can do those over many measurements 11 of other polymers.
• [0217] S4 counts the number of polymers in each class.
• [0218] S5 compares features 12 and 14 classified to same class.
• [0218] S5 identifies regions where classified 12 and 14 differ.

1.18 S3 Example 2 [0221]-[0225]

• [0221] S3 works on plural feature vectors 12.
• [0221] E.g. polymers from same sample or fragments of common polymer.
• [0223] S3-1 determines similarity between the plurality of vectors 12.
• [0224] S3-2 produces classification data 16.
• [0224] S3-2 clusters features 12 based on similarity.
• [0224] Clusters are identified as a class.
• [0225] S4 and S5 can process classification data 16 as discussed above.

1.19 S3 Example 3 [0226]

• [0226] S3 compares plural feature vectors 12 derived from the same polymer.

1.20 S3 Example 4 [0227]

• [0227] Compare 12 to known 14.
• [0227] Similar advantages to step S5 above, but is applicable where the expected type of polymer is known in advance (thus allowing comparison to 14).

1.21 Mathematics of Similarity (S3 and S5) [0228]-[0241]

• [0229] Can modify existing pairwise dynamic programming sequence alignment algorithms for matching feature vectors.

Needleman-Wunsch: global alignment
Smith-Waterman: local alignment

• [0230] Example modification replaces substitution matrix with distance measure:

Distance could be absolute difference in current between data points.
Distance could consider multiple measurements at each position (e.g. mean, var).

• [0231] Possibly make modifications to gap scoring (const. gap penalties, linear gap penalties, affine gap penalties, etc.).
• [0232] Algorithms determine similarity based on alignment score.
• [0232] Alignment score a function of:

two feature vectors
distance function
gap penalties

• [0233] Modified alignment algorithms can be used for:

clustering
consensus building
pattern matching

although other algos exist for these as well.
• [0234] Can also modify multiple alignment algorithms.
• [0235] The matching methods above are gapped alignment techniques.
• [0235] Alternatively can express vector (1,2,3,4,5) with sub-vectors {(1,2,3),(2,3,4),(3,4,5)}.
• [0236] Similarity based on how closely sub-vectors match.
• [0236] Potentially more efficient than gapped approach since sub-vectors compared directly without gaps.
• [0237] Discretizing sub-vectors (e.g. to nearest 0.1) allows exact or partial sub-vector matches to be used and thus similarity to be calculated in terms of full/partial matches.
• [0237] Discretization allows integer arithmetic to be used.
• [0237] Hash functions may be applied to sub-vectors. See: Efficient randomized pattern-matching algorithms
• [0238] Similar ideas of splitting data in short fragment substrings to match against a large library are used by BLAST. See: Basic local alignment search tool.
• [0239] HMM Viterbi path is alternate approach.
• [0240] In alignment-based and sub-vector-based methods, measures are symmetric so similarity of A to B is the same as similarity of B to A.
• [0241] However if one vector is a library feature vector it is natural to treat it as if it were the model or training sequence.
• [0241] In this case alignment can be performed using MEW methods with models constructed in a similar manner to the forced path training models (Forced Viterbi Alignment ?!?!?!?)
• [0241] Forwards-Backwards algorithm could be used in place of Viterbi.
• [0241] In Viterbi the alignment score is total likelihood of path. Total likelihood not guaranteed to be equal if role of vectors swapped, but this is not generally an issue.

1.22 Clustering [0242]-[0260]

• [0243] Clustering involves determining the number and/or types of polymer present according to some similarity criteria.
• [0244] Given a matrix of distances methods for hierarchical clustering are well known (see Classification).
• [0244] Hierarchical agglomerative methods also used for sequence alignment in packages such as CLUSTAL.
• [0245] Typically hierarchical agglomerative clustering is used on a similarity matrix.
• [0246] Two extremes of agglomerative clustering scoring for a pair of clusters are:

single-link:score on most similar feature vector pair
complete-link:score on least similar feature vector pair

• [0247] single-link clustering and local alignment scores are appropriate for feature vectors showing high similarity over overlapping fragments. For example if clusters made up of feature vectors with overlapping fragments of pairs. E.g. see [0326]-[0333] where sequences 1 and 2 overlap and sequences 2 and 3 overlap. If we wished to identify these as a single cluster we would most likely succeed if we used a local alignment score to identify the short overlapping regions. Single link would join the sequences into the same cluster (1 and 2 plus 2 and 3), but complete link would be a poor choice as 1 and 3 have no overlap.
• [0248] complete-link cluster and global alignment scores appropriate where clusters are expected to be near identical across the entire feature vector. For example where feature vector start and stop is known relative to a known reference and just looking for subtle differences.
• [0249] Iterative algorithm to generate a single reference feature vector for a cluster:

Generate landmark vector (long initial feature vector). [0257] discusses means of deciding on this initial vector.
Align each vector to landmark (see [0229])
Generate new empty landmark
Move along aligned vectors from 2 and when proportion p lie within range r add the mean to empty landmark. [0259] elaborates.
Repeat 2-4 until landmark at 4 is same for consecutive iterations (or max iterations reached)

• [0256] Landmark thus constructed produces consensus of feature vectors.
• [0260] This consensus building process provides a multiple alignment algorithm in terms of feature vectors.

1.23 Classification [0261]-[0276]

• [0262] Classification assigns a query feature vector to one of m classes. There is a library of target feature vectors 14 belonging to these m classes.
• [0263] Method of solution depends on whether target feature vectors are:

heterogeneous: mutually dissimilar at global level
homogenous: globally similar with typically subtle, localized differences

• [0264] In HETEROGENOUS: calculate similarity between query and targets and assign to class of maximum similarity.
• [0265] Derive summary target if multiple targets in class.
• [0266] Alternatively can deal with each target independently.
• [0267] Can use learning algorithms to derive classifiers (i.e. an algorithm that implements classification)

multi-class linear discriminant analysis: a multivariate learning technique, feed it the vector of differences to target feature vector to produce improved classifier
standard hashing algorithms: produce fixed length vector from sub-vectors and input to learning algorithm

• [0269] In HOMOGENOUS: same methods apply but random variations in feature may mask systematic local differences between classes.
• [0270] Thus often more efficient to learn key differences between targets and learn rules for correct classification.
• [0271] Standard supervised machine-learning classification techniques include:

decision tree classifier: rules based on where the feature vector aligns
Bayesian networks: where expert knowledge may be incorporated
Black Box Methods (NNs, random forests, SVMs): not necessarily generating interpretable rules

* [0274] To avoid over-fitting use cross-validation and hold-out sets.

1.24 Localized Region Determination [0277]-[0281]

• [0278] Align to target and identify positions that vary between query and target.
• [0279] For more than one target per class make landmark (consensus building) from targets.

1.25 Assembly [0282]-[0286]

• [0282] Assembly of large feature vectors from small feature vectors.
• [0284] Feature vectors first discretized

representing feature vector as series of deltas
representing feature vector as series of classes based on current level
representing feature vector as series of milestone (well characterized) features

• [0285] Apply standard assembly algorithms.
• [0286] See Velvet or Batzoglou.

1.26 Specific Applications [0288]-[0317]

1.26.1 Counting Molecules Against Known Library [0288]-[0299]

• [0288] Counting molecules against a know library. Like 1st example of S3.
• [0289] Library's feature vectors 14 generated using supervised or unsupervised learning (i.e. put in know samples and store for ensuing analysis).
• [0290] E.g. features of DNA/RNA of known disease may be known beforehand (from measurements or generative model) and measured molecules compared to such.
• [0291] Examples of things to count relative to library:

expression profiles: compare abundance of mRNA transcripts
abundance of biomarker miRNA: 20-25-mer RNA circulating in blood; expression associated with disease/cancers
foetal copy number variation: fragmented foetal DNA circulates in maternal blood; some have additional chromosome copies ( aneuploidy); use capture probes to exons of chromosomes of interest and then sequence these; foetal and maternal DNA could be distinguished by looking for methylation features
comparative genomic hybridization (CGH): profile copy number changes; copy number of various genomic regions can be altered in tumour cells (and in foetuses as above)
viral or bacterial load: measure of infection severity; number of pathogen RNA and DNA per ml of blood is measured (possibly with enrichment); does not have to be done on whole pathogen genome; early/late state measurements could identify antigenic drift and/or antigenic variation; applications in epidemiology like identification (strain typing), disease spread/evolution
probes: e.g. aptamers to a biomarker panel some of which attach to a target molecule; count those that did/didn't bind

• [0299] Identity of organisms could be determined, e.g. in food or cultures

1.26.2 Quantification of Major Populations [0300]-[0301]

• [0301] Possibly a quality control step. When DNA is being synthesized the output could be read to make sure that errors aren't creeping in. Current quality control methods are PAGE, HPLC, and mass spectrometry

1.26.3 Measurement of Differences [0302]-[0305]

• [0303] Example 1: Calling a SNP. This may enable identification of new loci (i.e. location of a gene). This may enable identification of paralog-specific variants (PSVs) in NAHR.
• [0304] Example 2: Methylation. Can do estimation of bulk methylation state of molecules (e.g. whether 100% of the population is 50% modified or whether 50% of the population is 100% modified). Methylation state of certain genes can be used as a biomarker for cancer.
• [0305] Example 3: Splice variants and/or translocation breakpoints; identify position where feature vectors stop matching; or, where half a feature vector maps to locus A and the other half maps to locus B.

1.26.4 Identification of Presence (with Confidence) [0306]-[0311]

• [0308] Example 1: Identify populations related to some degree, but not identical to known molecule. E.g. homology of DNA or protein in rapidly mutating diseases.
• [0309] Example 2: Fusion transcripts (as in splice variants); detection of fusion transcripts is used in cancer diagnosis; e.g. presence of Bcl-abl fusion transcript indicates leukemia.
• [0310] Example 3: Diagnosis of NAHR; messed up recombination (between non-alelic loci) can be detected in change in copy number (see CGH above).
• [0311] Example 4: Comparing plural parts of DNA to plural stored features; e.g. DNA sequence for known protein domains may be stored in library; part of derived feature vector may match a catalytic domain and part may match a DNA binding domain thus the function of the protein may be deduced.

1.26.5 Assembly Application [0312]-[317]

•

1.27 Use Examples [0318]-[0351]

1.27.1 Data Acquisition [0318]-[0325]

• [0323] Single-channel currents measured on Axopatch 200B equipped with 1440A digitizers.
• [0323] Via Pt electrodes cis connected to ground of Axopatch head stage and trans connected to active electrode of the headstage.

1.27.2 Identification and Quantification of DNA [0326]-[0333]

• [0326] This example describes the process of identification of DNA molecules in a solution from a pre-determined library of feature vectors.
• [0327] Library construction performed as follows:

Take PhiX174 5-kb genome
Chop it up into 18 400-mer sequences overlapping adjacent strands by 100 bases
All strands contain sequence at beginning and end common to all strands and not part of the larger genome (these are artificially attached on?!?!?!?)
Library feature vectors constructed for mean current (5-mer model)

• [0329] Candidate molecule are reduce to feature vectors (by measurement)
• [0330] Candidates compared against library using alignment algorithm. Output score from alignment used to measure similarity to each library member.
• [0331] Comparison by alignment score used a gap penalty of -1 and a scoring function of reciprocal absolute difference (i.e. closer matches are higher scores). See the match to library component 13.
• [0333] All test molecules used in this experiment were molecule 13. Occasionally (< 10% of the time by my guess) these are incorrectly identified as molecule 12 from the library.

1.27.3 Measurement of SNPs [0334]-[0339]

• [0335] The candidate molecule is a 19th pattern. Identical to library 13 molecule except for the 3 SNPs [old][position][new]: T335A, G357T, C385A. The effect of these SNPs on the measured current is shown below.
• [0336] Alignment based identification method used previously was repeated. Majority of molecules still correctly identified with library 13 (erroneous correspondence to 12 rose to maybe just above 10%).
• [0337] For SNP calling:

HMM and Viterbi path used for alignment
This has better path constraint (i.e. will align better through mismatched SNP regions) than…
Needleman-Wunsch

Alignments shown below compare well with idealized library mutations shown above.
• [0338] 176 molecules were aligned and their SNPs clearly identified. Figure below shows the difference in current between Viterbi aligned library and candidate feature vectors. The 3 SNPs are visible.
• [0338] In the case of 335 and 357 changes seen at several position (i.e. not just 1 as intended). This is because several measured features are affected by each single change (i.e. a single change to a sequence affects several adjacent k-mers).

1.27.4 Identification of Population and Sub-Population [0340]-[0343]

• [0341] This example is worked through with simulated data.
• [0341] A set of 60 mean current feature vectors of library component 13 is simulated.
• [0341] 10 of these contain a SNP.
• [0341] Gaussian noise of std. dev. 1-pA is added to each value and 5% of values within each vector are deleted at random.
• [0342] Using this dataset a consensus is constructed using the landmark process outlined earlier. The difference at SNP location 337 is clear.
• [0343] SNP of populations 51-60 gathered from consensus is clear in the figure below.

1.27.5 Identification of a Number of Populations [0344]-[0347]

• [0345] Experiments on 2 and 3 simulated species were done. That is populations of 2 and 3 different DNA samples were fed into the machine. Pairwise alignment is used to obtain similarity scores and hence a similarity tree is made by neighbour joining.
• [0347] Identifications (as in example 2) were run for both experiments identifying the population distributions (see histograms).

1.27.6 Assembly of a Library [0348]-[0351]

• [0349] Using the 1-18 overlapping strands mentioned above (but without the extra common pattern tacked on to both ends of each strand).
• [0350] A tree by neighbour joining on pairwise alignment scores was constructed.
• [0350] Since relatively large non-similar regions were expected, a scoring function that does not penalize gaps at the beginning or end of the alignment as strongly as those within alignment was used (see result below).
• [0350] All sequences have similar relation to two other sequences representing the ~100 base overlap on either side.
• [0351] Progressing through the tree in order of relatedness:

consensus landmarks for the aligned sequences are constructed
the landmark from that alignment then serves as the feature vector for alignment with the next sequence
output of the process is a fully assembled feature vector

Table of Contents