Analysis example: double-stranded DNA
The examples in this tutorial were done using Nanolyzer™, Northern Nanopore's data analysis software.
1. Exploring the data
As an example of the full workflow, we're going to analyze a simple DNA experiment consisting of 4000 bp dsDNA translocating through a 5.6 nm diameter nanopore in a 10 nm thick SiNx membrane under a negative applied voltage. You will have received a copy of this data along with your initial Nanolyzer license, and we encourage you to follow along and ensure that you are able to reproduce the results seen here. To load the data, refer to the previous post on the subject.
The first task when analyzing a dataset is usually to look through the raw data and get a sense of the events that will be fitted, which we can do using the Raw Data tab in Nanolyzer.
Looking through, we see a modest rate of high-SNR events, which usually makes for easy analysis.
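Nanolyzer does this inspection in the GUI, but if you ever want to eyeball a trace outside the application, a minimal sketch along these lines works. Note that the filename, array format, and sample rate here are placeholders, not properties of the tutorial dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

SAMPLE_RATE = 4_166_667          # Hz; placeholder, set this to your amplifier's rate
current = np.load("trace.npy")   # hypothetical file holding the current trace in pA

t = np.arange(current.size) / SAMPLE_RATE

# Plot a one-second window to eyeball event rate and SNR.
window = slice(0, SAMPLE_RATE)
plt.plot(t[window], current[window], lw=0.5)
plt.xlabel("Time (s)")
plt.ylabel("Current (pA)")
plt.show()
```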
2. Analysis setup
Once we have visually inspected the data and decided it is worth analyzing, we click Data->New Analysis, choose a location to store the output, and wait a moment as Nanolyzer automatically detects the baseline current statistics. This may take a few seconds if you have chosen a long segment of baseline to plot, or if the data is very densely sampled, for example from the Chimera VC100.
Once it has found the baseline, verify visually that it is accurate: the green box should cover the baseline range closely, and the red threshold line should not intersect the noise. It should look like the image below, with the mean current and baseline RMS noise reported in the figure legend. We will need these values later.
You'll note that a new control panel has appeared, labelled "CUSUM Analysis Setup Controls". This is the panel into which we input the information Nanolyzer needs in order to fit events. Most of the parameters are already filled out for you and just need to be checked quickly. A full reference to the meaning of each parameter can be found in the previous section on the subject.
If the baseline was not accurately fitted or the threshold line needs to be adjusted, navigate through the drop-down labelled "IO Settings" to "Event Segmentation Settings" and tweak the baseline parameters. Alternatively, plot a different section of the current and simply press the "Fit Baseline" button to have it try again.
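For intuition about what a baseline fit needs to accomplish, here is a rough sketch of a sigma-clipped estimate of the mean and RMS on a synthetic trace. This illustrates the general idea only; it is not Nanolyzer's actual algorithm:

```python
import numpy as np

def baseline_stats(current, clip_sigma=3.0, iterations=3):
    """Estimate baseline mean and RMS noise by iterative sigma-clipping,
    so that deep event blockages do not bias the estimate."""
    mask = np.ones(current.size, dtype=bool)
    for _ in range(iterations):
        mean = current[mask].mean()
        rms = current[mask].std()
        mask = np.abs(current - mean) < clip_sigma * rms
    return mean, rms

# Synthetic stand-in trace: flat baseline with ~56.5 pA RMS noise plus one event.
rng = np.random.default_rng(0)
current = rng.normal(10_000.0, 56.5, 1_000_000)
current[200_000:205_000] -= 1200.0

mean, rms = baseline_stats(current)
print(f"Baseline mean: {mean:.1f} pA, RMS noise: {rms:.1f} pA")
```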
For this tutorial we're going to use all of the defaults, except for two parameters that regularly need to change: Sensitivity and Minimum Step Size. Navigate through the drop-down labelled "IO Settings" and select "Event Fitting Settings". To choose a Sensitivity value, we need to zoom in on one of our events. We pick one at random that looks like it has some internal structure, as shown below. We want to be able to fit both of those internal sub-levels, which means we need to set Sensitivity equal to the smallest significant current change we want to fit, expressed in multiples of the baseline RMS noise. In this case there are three current changes. Hovering the mouse over the figure shows the current values, and we estimate that each of the smaller step changes is about 1200 pA. Dividing by our baseline RMS, 1200/56.5 is about 21, which is what we set as our Sensitivity parameter. Minimum Step Size we set to half of this value, 10.5. Since this dataset contains only events in a narrow range of passage times, there is no need to use Elasticity, so we leave it at zero.
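The arithmetic is simple enough to check in a couple of lines; the numbers below are just the values read off the figure in this tutorial:

```python
# Sensitivity is the smallest step we want to resolve, expressed in
# multiples of the baseline RMS noise.
smallest_step_pA = 1200.0  # estimated by hovering over the sub-levels
baseline_rms_pA = 56.5     # from the baseline fit legend

sensitivity = round(smallest_step_pA / baseline_rms_pA)  # 1200 / 56.5 -> 21
min_step_size = sensitivity / 2                          # 10.5

print(sensitivity, min_step_size)  # 21 10.5
```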
Satisfied with our setup, we press "Save and Run Analysis" and wait a few moments while event fitting proceeds.
3. Sanity checking
After analysis, we get a report on the fits. It looks like the one below, and summarizes the amount of good baseline current detected, the number of events, and the breakdown of those events into successful fits and specific reasons for fitting failure. Note that if you used a different region of the data for your baseline fitting, you may end up with slightly different results depending on the event fitting settings chosen.
Locating events...
Read 573.8 seconds of good baseline
Read 0 seconds of bad baseline
--------------------------------------------------------
Event Summary: 631 events detected
Success: 96.4 %
Failed: 3.65 %
--------------------------------------------------------
Event Type Count Percentage Reason
0 608 96.4 % cusum success
1 0 0 % stepfit success
--------------------------------------------------------
2 0 0 % baseline differs
3 0 0 % too long
4 0 0 % too short
5 0 0 % too few levels
6 0 0 % cannot read data
7 0 0 % cannot pad event
8 23 3.65 % fitted step too small
9 0 0 % stepfit found zero
10 0 0 % stepfit degenerate
11 0 0 % maxiters reached
12 0 0 % stepfit failed (f)
13 0 0 % stepfit failed (o)
14 0 0 % stepfit failed (p)
15 0 0 % stepfit out of memory
16 0 0 % stepfit invalid input
17 0 0 % stepfit user interrupt
18 0 0 % stepfit found NaN
--------------------------------------------------------
Cleaning up memory usage...
The meanings of the error codes are presented in a previous post.
This dataset was pretty well-behaved overall. Most events were fitted successfully, while a handful were rejected for being too shallow. Usually this means that the data briefly got noisy and crossed the threshold, so unless this accounts for an overwhelming majority of events, we are not overly concerned with rejecting a few percent of events this way. If we go back and re-plot the raw data, we can now see an overlay showing successfully fitted events marked in green and rejected events in red. The dashed yellow line shows the local threshold used for event finding.
In this case the rejected "event" is clearly not DNA, and so there is no issue. You should spend a bit of time plotting the data and observing the events that are rejected and making sure that you are happy with the breakdown.
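Nanolyzer's event finder uses a local threshold and is considerably more careful than this, but as a conceptual illustration of threshold-based event detection on a synthetic trace:

```python
import numpy as np

def find_events(current, baseline_mean, baseline_rms, n_sigma=6.0):
    """Toy event finder: flag contiguous runs of samples more than
    n_sigma * RMS below the baseline. Assumes negative-going blockages
    and a trace that starts and ends at baseline."""
    threshold = baseline_mean - n_sigma * baseline_rms
    below = current < threshold
    edges = np.flatnonzero(np.diff(below.astype(np.int8)))
    return list(zip(edges[::2] + 1, edges[1::2] + 1))  # (start, end) sample pairs

# Synthetic trace with two square events on a noisy baseline.
rng = np.random.default_rng(1)
current = rng.normal(10_000.0, 56.5, 500_000)
current[100_000:101_000] -= 1200.0   # unfolded-depth event
current[300_000:300_500] -= 2400.0   # folded-depth event

print(find_events(current, 10_000.0, 56.5))  # two (start, end) pairs
```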
Next, we look through the fits themselves to make sure they are reasonable. Move to the Event Viewer tab and click Next a few times to observe the fits. Note that the green bar denotes regions after the Intra-event Threshold (green dashed line) was crossed and before the Intra-event Hysteresis (yellow dashed line) was crossed; because we didn't set these parameters for the purposes of this analysis, we can ignore them for now. A typical example is shown below.
4. Visualizing the results
Satisfied with our fits, we move to the Statistics tab, which is now accessible. The default setup gives us a plot of Sublevel Duration versus Sublevel Blockage. Tick "Log" for Sublevel Duration and click Update Plot.
This plot reports the sub-levels within events, without considering the event they came from, which is to say an N-level event contributes N points to this plot. This means that should a long-lasting event (e.g. a clog) be misidentified as an event with hundreds of sub-levels, it could contribute hundreds of datapoints and skew the visualization. In this case, the data looks reasonable, with a single-file DNA level around 1200 pA and a folded level around 2400 pA, but just to be sure that we aren't biasing our results with overzealous fits, let's plot a 1D histogram of the Number of Levels, which gives us the plot below.
Note that Nanolyzer counts baseline before and after the event as a sub-level, so the minimum number of levels a valid event can possibly have is 3. Here, we see that roughly 250 events have 3 levels (i.e. they are not folded), while roughly 350 events have some folding. While there are a handful of events with sub-level counts up to 9, they are a small minority. We'll come back to that later.
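If you export the event metadata, the same histogram is straightforward to reproduce; the counts below are made up to mimic this dataset's figure:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sub-level counts: ~250 unfolded events (3 levels: baseline,
# one blockage level, baseline), ~350 folded, and a few outliers up to 9.
rng = np.random.default_rng(2)
n_levels = np.concatenate([
    np.full(250, 3),
    rng.choice([4, 5], size=350),
    rng.integers(6, 10, size=8),
])

plt.hist(n_levels, bins=np.arange(2.5, 10.5, 1.0), edgecolor="k")
plt.xlabel("Number of Levels")
plt.ylabel("Event count")
plt.show()
```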
For fun, let's consider a 3D scatterplot. In this case we will use event metadata: Dwell Time, Maximum Blockage, and Number of Levels. Since these are single numbers for each event, each event contributes only one point to this plot.
It's also possible to mix column types. If we want to consider total Dwell Time but still see the blockages from individual levels, we can select those columns. In this case, Nanolyzer will duplicate the Dwell Time and Number of Levels values to match the sub-level metadata counts: each event with N levels now contributes N points to the plot. Note that where we had a single point in the plot above with 9 sub-levels (since only one event has 9 sub-levels), we now have 9 points in that region.
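In other words, event-level columns are broadcast against sub-level columns. A toy version of that bookkeeping, on a hypothetical two-event table:

```python
import numpy as np

# One Dwell Time per event; variable-length sub-level blockages per event.
dwell_time = np.array([0.8, 1.5])        # ms
sublevel_blockages = [
    [1200.0],                            # unfolded event: one sub-level
    [1200.0, 2400.0, 1200.0],            # partially folded event: three
]

counts = [len(levels) for levels in sublevel_blockages]
dwell_repeated = np.repeat(dwell_time, counts)   # event value copied per sub-level
blockage_flat = np.concatenate(sublevel_blockages)

print(dwell_repeated)  # [0.8 1.5 1.5 1.5]
print(blockage_flat)   # [1200. 1200. 2400. 1200.]
```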
There are a huge variety of possible plot combinations that can be visualized with Nanolyzer. Play around with the plots and see what interesting views you can come up with. A full list of available metadata for plotting is presented in a previous post.
5. Identifying sub-levels
Next, we'll use the Clustering tab to separate the events into sub-levels, which is described in more detail in a previous post. This dataset has two very clearly separated sub-level populations so this is easy using both Clustering algorithms. First, using Gaussian Mixtures and 2 clusters, we can separate out the folded and unfolded states easily using the log of the Sublevel Duration and the Sublevel Blockage columns by pressing Update Sublevels, as below.
HDBscan performs similarly, but note the main difference: HDBscan also automatically detects and labels outliers with the -1 cluster label.
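Both algorithms are standard, so the same separation can be reproduced outside Nanolyzer. A minimal scikit-learn sketch on synthetic stand-ins for the log Sublevel Duration and Sublevel Blockage columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

# Synthetic stand-ins for log10(Sublevel Duration) and Sublevel Blockage (pA).
rng = np.random.default_rng(3)
features = np.vstack([
    np.column_stack([rng.normal(-0.5, 0.1, 300), rng.normal(1200, 60, 300)]),
    np.column_stack([rng.normal(-0.8, 0.1, 200), rng.normal(2400, 90, 200)]),
])
X = StandardScaler().fit_transform(features)  # put both columns on one scale

gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
hdb_labels = HDBSCAN(min_cluster_size=20).fit_predict(X)

# HDBSCAN labels outliers -1, mirroring Nanolyzer's -1 cluster.
print("GMM cluster sizes:", np.bincount(gmm_labels))
print("HDBSCAN outliers:", int(np.sum(hdb_labels == -1)))
```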
Having performed this clustering, Nanolyzer defines new columns in the database that can be used in the statistics plotting window: Sublevel Labels and Sublevel Label Confidence. These can now be used for visualization and event database filtering, which we will discuss shortly.
We can also use Clustering in this case to separate out the events that contain a fold from those that do not. We'll use HDBscan with the Dwell Time, Maximum Blockage, and Number of Levels columns for this purpose, and cluster using the Update Event IDs button. This will define new columns for use in plotting and filtering: Cluster ID and Cluster Confidence.
Because we added the extra Number of Levels feature, our clustering actually gives us additional information. Cluster -1 is outliers, as before. Cluster 0 is unfolded events. Cluster 1 is events that are entirely folded, with no single-file level, and Cluster 2 is events that contain both a folded and an unfolded level. Note that this operation was applied to Subset 0. This is important because only Subset 0 will have these properties defined.
6. Filtering our event database
Next, we want to clean up our event database for further analysis. First, go back to the Statistics tab and plot a 2D scatterplot of Dwell Time against Maximum Blockage, which should look like this:
Let's remove all the events that our clustering classified as outliers. We open the Data Manager (Data->Data Manager), click New Filter, rename it, and build a simple filter based on the Cluster ID we just defined, as shown below in the section labelled 1. Add filtering operation(s). Review the filter you made in section 2. Review filter operations, and in section 3. Select root operation to build filter, select the operation you just defined. Apply it to the subset Full Dataset, since that is where we defined our Cluster IDs, and give the new subset to be created a name (no_outliers). We'll tick Auto-update Plot so that we can easily see the effect we are having on our data. If you are having trouble finding any of these controls, the dedicated section of the tutorial has a full controls reference.
Hit the Filter and Create Subset button to the right and go back to the Statistics tab to see the new plot. We're now showing both subsets, with the orange plot having removed all of the outliers for us. We can use this subset as the basis for defining further subsets, which will by definition only contain events we have decided are not outliers.
Next, let's again use the Cluster ID column to split up our datasets into subsets by cluster label. Go back to the Data Manager and repeat the exercise of filtering to make a subset for every Cluster ID (-1, 0, 1, and 2). After you're done, your Data Manager and plot window should look something like the images below.
It's a bit crowded, so let's disable the plots for the Full Dataset and the no_outliers subset: while these are useful for defining more subsets, we aren't interested in seeing them by themselves. Go back to the Data Manager, select the offending subsets in the bottom right, and untick Enable Plot. Don't worry; they are still available for filtering operations and plotting later on. Our cleaned-up plot looks much better.
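The same subset bookkeeping is easy to emulate on an exported event table; here is a pandas sketch with a hypothetical table carrying the Cluster ID column:

```python
import pandas as pd

# Hypothetical exported event table with the post-clustering Cluster ID column.
events = pd.DataFrame({
    "dwell_time_ms":   [0.8, 1.5, 12.0, 0.9, 1.4],
    "max_blockage_pA": [1200, 2400, 300, 1250, 2350],
    "cluster_id":      [0, 2, -1, 0, 1],
})

no_outliers = events[events["cluster_id"] != -1]      # like the no_outliers subset
subsets = {cid: df for cid, df in no_outliers.groupby("cluster_id")}

print({cid: len(df) for cid, df in subsets.items()})  # {0: 2, 1: 1, 2: 1}
```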
Explore the Statistics window with extra subsets created. For example, a 1D histogram of the Equivalent Charge Deficit (ECD) looks like this:
While one expects that the ECD would be conserved and independent of folding state, it's clearly not perfect!
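The intuition behind that expectation: a common definition of ECD integrates the current blockage over the whole event, so a folded molecule blocks roughly twice the current for roughly half the time. A sketch, assuming that definition:

```python
import numpy as np

def ecd_fC(blockages_pA, durations_ms):
    """Equivalent charge deficit: blockage integrated over the event,
    approximated as a sum over sub-levels; pA * ms = fC."""
    return float(np.dot(blockages_pA, durations_ms))

# Unfolded vs fully folded passage of the same molecule:
print(ecd_fC([1200.0], [1.0]))   # 1200.0 fC
print(ecd_fC([2400.0], [0.5]))   # 1200.0 fC -- conserved, in the ideal case
```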
7. Fitting the capture rate
The last thing we'll do in our demo is fit the capture rate. Right now, this is very simple: go to the Capture Rate tab, select the subset for which you would like to extract the capture rate, and press the button. In this case we'll consider the unfolded events, which have a capture rate of 0.34 Hz:
The fits will automatically run, and you'll get a visual representation and a report. For details on what the reported number means practically, see the dedicated discussion on the subject.
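For intuition about what that number means: if capture is a Poisson process, the waiting times between events are exponentially distributed, and the capture rate is the reciprocal of their mean. A sketch on synthetic event times (the 0.34 Hz simply echoes this tutorial's result):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic event start times (s) from a 0.34 Hz Poisson capture process.
event_starts = np.cumsum(rng.exponential(scale=1 / 0.34, size=200))

waits = np.diff(event_starts)
capture_rate_Hz = 1.0 / waits.mean()
print(f"Estimated capture rate: {capture_rate_Hz:.2f} Hz")
```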
8. Wrapping up
This tutorial covers only the simplest operations that can be performed with Nanolyzer; the possibilities are endless. The tools we provide are sufficiently general that we're not only confident you will find them useful in your research, but sure that you will find ways to apply them that we haven't even thought of yet. We love to see creative applications of these tools to data analysis, so if you have something to show off, or you need help constructing the filters and visualizations you need, reach out and tell us about it.
If some analysis you need is not yet supported, not to worry. Nanolyzer gains new features regularly, driven largely by requests from our users. Let us know how we can enable your research, and we'll work on building it into a future release.