miRMaster Tutorial

Getting started

Introduction

With miRMaster you can analyse your NGS miRNA samples. To get started you only need two things:

  • Sequencing files in FASTQ format (uncompressed or .gz compressed) without barcode sequence.
  • The 3' adapter used in the library preparation.

Once you have this information, click the Launch experiment button on the homepage or in the navigation bar. You will be guided through 3 steps, in which you name your experiment, provide the 3' adapter that was used and finally upload your sequencing files.

If you have samples from multiple cohorts, you can specify their groups either by clicking the add second group button or by uploading a tab-separated sample-to-group file. Once you have chosen your files and clicked the launch button, your data will be preprocessed and sent to the server. You can follow the progress in real time by clicking the follow button.

This will open the experiment status page in a new tab, where you will be able to track the progress of the analysis of all your uploaded samples. Real-time web reports are provided for each sample that has been uploaded, allowing you to directly inspect your data. They provide information on the preprocessing, mapping, quantification and prediction steps.

As soon as all samples have been analyzed, the results can be downloaded and an overall web report is created and displayed at the top of the page.

Demo run and example samples

We provide two ways for trying out miRMaster:

  1. You can directly inspect a pre-generated demo run of 70 samples (48 Alzheimer's Disease, 22 controls) from GEO (GSE46579).
  2. You can download an example set of 4 samples (2 Alzheimer's Disease, 2 controls), configure your own run and follow the result generation in real-time.

Want to know more?

For a more detailed explanation of the options provided for the analysis setup please read the Starting an analysis section. If you would like more information on the report pages please read the Inspecting results section.


Starting an analysis

The analysis configuration is divided into the following three steps.

1 - Experiment information

Email (optional) We will notify you as soon as the analysis of your uploaded samples is done.
Experiment Name (required) Assign a name to your experiment run so that you will be able to recognize it later.
Aggregated Data Analysis You acknowledge and agree that your data are included in an aggregated analysis. Please note: no experimental information or other data than summarized reads are considered. With the aggregated data analysis you will learn about the specificity of findings in your experiment and at the same time you further the development of miRMaster and miRCarta.

2 - Settings

The settings are divided into the following categories. All options except the 3' adapter have reasonable default settings. The options for each category can be shown by clicking on the category bars. Even more options can be shown by toggling the expert mode switch.

Preprocessing

3'-Adapter (required) Provide or choose the 3' adapter of your sequencing run.
Minimum read length Only trimmed reads with at least the specified length will be kept.
Maximum adapter mismatches Allow a certain number of mismatches when trying to find the adapter in the read.
Minimum read/adapter overlap How many nucleotides must overlap between the found adapter and read, in case the adapter is not fully contained in the read.
Trim leading Ns The first X leading bases are discarded if they contain Ns.
Trim trailing Ns The last X trailing bases are discarded if they contain Ns.
Discard reads containing Ns If reads still contain Ns after the trimming they will be discarded.
Sliding window size The size of the sliding window to look at when averaging base qualities.
Sliding window required quality The minimum average window quality needed to keep the sequence. If the quality falls below the minimum the sequence is trimmed from that position on.
Number of threads The number of threads to use for preprocessing your samples in the browser.
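
The sliding window options can be illustrated with a minimal Python sketch. It assumes a Trimmomatic-style scan from the 5' end and hypothetical default values; the function name and the exact behaviour of miRMaster's own implementation are not given in this tutorial.

```python
def sliding_window_trim(qualities, window_size=4, required_quality=20):
    """Return how many leading bases to keep.

    qualities: list of per-base quality scores (already decoded to integers).
    window_size / required_quality: hypothetical example values, not the
    miRMaster defaults.
    """
    n = len(qualities)
    if n == 0:
        return 0
    # Reads shorter than the window are treated as a single window (assumption).
    width = min(window_size, n)
    for start in range(0, n - width + 1):
        window = qualities[start:start + width]
        if sum(window) / width < required_quality:
            # Quality dropped below the threshold: trim from this position on.
            return start
    return n


# Example: the average of the window starting at index 3 is 15, so 3 bases are kept.
kept = sliding_window_trim([30, 30, 30, 30, 10, 10, 10, 10])
```

After trimming, the minimum read length setting decides whether the remaining sequence is kept at all.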

Mapping

Genome Reference genome for your reads.
Allowed number of mappings Reads mapping to more than the allowed number of locations will be discarded.
Mapping seed length Only mismatches in the seed region count.
Allowed mismatches Maximum number of allowed mismatches in the seed region.
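
As an illustration of the seed-based mismatch rule, the following sketch counts mismatches only inside the seed region of an ungapped alignment. The function names are hypothetical and the sketch assumes the seed is the first seed_length bases of the read; it is not the aligner used by miRMaster.

```python
def seed_mismatches(read, reference, seed_length):
    """Count mismatches between read and reference within the seed region only."""
    seed_read = read[:seed_length]
    seed_ref = reference[:seed_length]
    return sum(1 for a, b in zip(seed_read, seed_ref) if a != b)


def passes_seed_filter(read, reference, seed_length, allowed_mismatches):
    """Keep an alignment if its seed mismatches do not exceed the allowed number."""
    return seed_mismatches(read, reference, seed_length) <= allowed_mismatches


# The only mismatch is at the last base, outside a seed of length 4.
ok = passes_seed_filter("ACGTACGT", "ACGTACGA", seed_length=4, allowed_mismatches=0)
```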

Precursor Excision

Minimum read stack height The minimum count for reads from which precursors are excised.
Downstream window size In downstream (5' to 3') direction, the highest read stack is searched in a window size of X nucleotides, from which two potential precursors are excised.
Short/long sequence extension Related to the highest read mapping stack. On one side a short sequence extends the mapping region, while on the other side a long sequence extends it. As result for that region, two potential precursors are excised: short-mapped-long and long-mapped-short.
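
The two excised candidates can be sketched as follows. This is an illustration under assumptions: coordinates are 0-based and half-open, the extension lengths are placeholder values, and the candidates are clamped to the chromosome bounds. None of the names come from miRMaster itself.

```python
def excise_candidates(stack_start, stack_end, short_ext, long_ext, chrom_length):
    """Return the two candidate precursor intervals around a read stack.

    short-mapped-long: short extension upstream, long extension downstream.
    long-mapped-short: long extension upstream, short extension downstream.
    """
    def clamp(start, end):
        return max(0, start), min(chrom_length, end)

    short_mapped_long = clamp(stack_start - short_ext, stack_end + long_ext)
    long_mapped_short = clamp(stack_start - long_ext, stack_end + short_ext)
    return [short_mapped_long, long_mapped_short]


# Hypothetical stack at positions 100..122 with extensions of 10 and 70 nt.
candidates = excise_candidates(100, 122, short_ext=10, long_ext=70, chrom_length=1000)
```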

Filtering

Minimal/Maximal length of potential major miRNA If the length of a potential mature miRNA is lower/higher than this threshold, then the corresponding precursor will be discarded.
Minimal required base pair ratio in the potential major miRNA If a potential mature miRNA has a base pair ratio lower than this threshold, the corresponding precursor will be discarded.
Maximal allowed difference in length between the 5p- and 3p-segment If the length difference between a 5p- and 3p-segment is higher than this threshold, then the corresponding precursor will be discarded.
Maximal portion of reads mapping inconsistently according to Dicer processing If the portion of reads mapping inconsistently to a precursor according to Dicer processing is higher than the specified maximum, that precursor will be discarded.

Merging and Global Signature Filtering

Maximal read score portion of reads mapping inconsistently according to Dicer processing If the read score portion of reads mapping inconsistently to a precursor according to Dicer processing is higher than the specified maximum, that precursor will be discarded.
Use default penalty parameters Reads mapping with mismatches are penalized per default by a dividing factor if they occur in at most 10% of all samples (but at most 10 samples). The dividing factor is the limit of occurring samples minus 1, but at least 2. This means that penalties are only introduced when uploading 10 or more samples.
Mismatch score penalty
(modifiable when default penalty = NO)
When scoring the global signature we penalize reads mapping with mismatches to the precursor by a dividing factor.
Maximum occurring number of samples for score penalty
(modifiable when default penalty = NO)
If a read mapping with a mismatch occurs in more than the specified number of samples its score is not penalized.
Maximum merging distance Precursors are merged if the distances of their start and end positions are within the specified maximum distance.
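
The default penalty rule described above can be worked through in a short sketch. It assumes that 10% of the samples is rounded down, which is consistent with penalties only applying from 10 samples on; the function name is hypothetical.

```python
def default_penalty(n_samples):
    """Return (sample_limit, dividing_factor), or None if no penalty applies.

    sample_limit: a read with mismatches is penalized if it occurs in at most
    this many samples (10% of all samples, capped at 10).
    dividing_factor: sample_limit - 1, but at least 2.
    """
    sample_limit = min(10, n_samples // 10)  # rounding down is an assumption
    if sample_limit < 1:
        return None  # fewer than 10 samples: no penalty
    dividing_factor = max(2, sample_limit - 1)
    return sample_limit, dividing_factor


for n in (9, 10, 50, 200):
    print(n, default_penalty(n))
```

For example, with 50 samples a read mapping with mismatches would be penalized if it occurs in at most 5 samples, and its score would be divided by 4.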

Quantification

Maximal allowed mismatches for the quantification of other ncRNAs The maximum number of mismatches allowed during the quantification of other ncRNAs.
Maximal allowed mismatches for miRNA quantification The maximum number of mismatches allowed during the quantification of known and novel miRNAs.
Maximal allowed distance from the 5'/3' end for miRNA/isomiR/mismatch quantification The maximum allowed distance of the read 5'/3' end to the 5'/3' end of the annotated miRNA position.
Maximal allowed mismatches for isomiR quantification The maximum number of mismatches allowed inside a read (the flanks are excluded to account for non-templated nucleotide additions) during the isomiR quantification of known and novel miRNAs.
Maximal allowed mismatches for quantification of mismatches in miRNAs The maximum number of mismatches allowed during the quantification of mismatches in known and novel miRNAs.

3 - Data upload

In the last step you can optionally enable a group-based analysis by adding group names using the Add a second group button, or by uploading a tab-separated sample-to-group file. Afterwards your samples can be chosen via drag-and-drop or by clicking in the upload box. When group-based analysis is enabled, the group of each file is displayed in the table, allowing you to cross-check the assignments.
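
A sample-to-group file can be generated with a few lines of Python. The exact layout expected by miRMaster is not specified in this tutorial; the sketch assumes one line per file, with the file name and the group name separated by a tab and no header line. The file names are made up.

```python
import csv
import io


def write_sample_groups(assignments, handle):
    """Write (file name, group name) pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for filename, group in assignments:
        writer.writerow([filename, group])


# Hypothetical file names and group labels.
buffer = io.StringIO()
write_sample_groups(
    [("ad_1.fastq.gz", "AD"), ("ctrl_1.fastq.gz", "control")],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("groups.tsv", "w", newline="") as the handle.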

Click the launch button once you have added all your files to start the analysis. A button to follow the real-time progress of the analysis will then appear. The preprocessing and upload progress are shown in progress bars in a table. Once a bar reaches 100% and the file has been uploaded, it will turn green. During preprocessing, adapters are removed from the reads, bad quality reads are filtered out or trimmed, and reads with identical sequences are collapsed, retaining only each sequence and the number of identical reads.
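
Read collapsing as described above amounts to counting identical sequences. A minimal sketch, not the in-browser implementation:

```python
from collections import Counter


def collapse_reads(sequences):
    """Collapse identical read sequences into sequence -> count."""
    return Counter(sequences)


collapsed = collapse_reads(["TGAGGTAGTAGGTTGTATAGTT", "TGAGGTAGTAGGTTGTATAGTT", "ACGTACGT"])
for sequence, count in collapsed.most_common():
    print(sequence, count)
```

Only the unique sequences and their counts need to be transferred, which is much smaller than the original FASTQ file when many reads are identical.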


Inspecting results

Status page (example)

The status page displays the real-time progress of your analysis. You can see how many samples have been transferred to the server and their current analysis progress. Each row in the status table shows the timestamp when the sample reached the server, the sample name, its current analysis progress, the status and a link to its real-time report. Groups are displayed in the table as well, if specified.

The progress bar shows 10 steps that are performed for each sample. If an analysis fails for some reason, the bar will turn red and a failure icon will be shown in the status column. Samples that have not been processed yet will have grey bars and a pending status icon.

Once all individual samples have been processed an overall report will be generated. When done, the results will be available for download at the top of the page, as well as a link to the web-report.

Single sample report (example)

The report is divided into the following six categories. All plots and tables can be downloaded in multiple formats (png, svg, pdf, csv, xlsx).

1 - Overview

The overview page provides information about the experiment id, name and the sample name.

2 - Preprocessing

The preprocessing page provides information about the sequences found in the sample and is subdivided into seven tabs.

Overview

Basic information on the number of sequences processed, the mean GC-content and the detected FASTQ quality encoding.

Base quality

A plot displaying the mean quality and standard deviation of all valid sequences at each position.

Sequence quality distribution

A plot displaying the number of sequences having a specific mean quality. Most sequences should have a mean quality near 40.
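
The mean quality of a single sequence can be computed from its FASTQ quality string as sketched below. The sketch assumes Phred+33 encoding by default; the report detects the encoding for you, and the function name is hypothetical.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred quality of one read, given its FASTQ quality string."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


# 'I' encodes quality 40 and '5' encodes quality 20 in Phred+33.
print(mean_quality("IIII"), mean_quality("5555"))
```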

Base distribution

A plot displaying the base distribution at each position found in all valid sequences. Unknown bases that could not be recognized by the sequencer are described by N.

GC content distribution

A plot showing the distribution of the mean GC content in percent among the valid sequences.

Sequence length distribution

A plot showing the distribution of the sequence length among the sequences. For miRNA sequencing data peaks should be observed around 22 nucleotides.

Overrepresented sequences

A table showing the 50 most frequently found sequences, displaying their sequence, count and percentage of all valid sequences.

3 - Mapping

The mapping page provides information on the number of mapped sequences against different references. We report the mapping statistics regarding all reads on one hand, and regarding the collapsed reads (leading to only unique sequences and counting each as occurring only once) on the other. It is subdivided into five tabs.

Genome

Reads are mapped against the human genome (hg38) per default without mismatches in the first 18 nucleotides and reads mapping to more than 5 locations (per default) are discarded. Pie charts for the number and portion of reads mapping to the genome are shown.

miRNA

Reads are mapped against the human miRBase v21 [1] while allowing per default one mismatch. We report the number and portion of reads mapping to known miRNAs within a 2 bases upstream and 5 bases downstream window. Further we report the portion of reads mapping exclusively to a miRBase stem-loop, but not to a miRNA.

Other ncRNA

Reads are mapped against other ncRNAs of Ensembl (release 85) [2] without mismatches.

Bacteria/Viruses

We map non-human reads (all reads that did not align to the human genome with at most one mismatch) to all 7,556 bacteria and 7,026 virus sequences of NCBI RefSeq release 74 [3] and report the number of perfectly mapping reads. Reads mapping to bacteria or viruses can indicate exogenous miRNAs, but also reagent contamination or diseases such as sepsis.

4 - Quantification other ncRNA

The ncRNA quantification page provides information about the amount of found known ncRNAs from Ensembl, piRBase and GtRNAdb. It is subdivided into seven tabs.

lincRNA

lincRNAs are taken from Ensembl and counts are reported for each detected sequence.

piRNA

piRNAs are taken from piRBase and counts are reported for each detected sequence.

rRNA

rRNAs are taken from Ensembl and counts are reported for each detected sequence.

tRNA

tRNAs are taken from GtRNAdb and counts are reported for each detected sequence.

snoRNA

snoRNAs are taken from Ensembl and counts are reported for each detected sequence.

scaRNA

scaRNAs are taken from Ensembl and counts are reported for each detected sequence.

snRNA

snRNAs are taken from Ensembl and counts are reported for each detected sequence.

5 - Quantification miRNA

The miRNA quantification page provides information about the amount of found known miRNA sequences in the sample. All found known miRNAs are listed in a table and include the number of reads that mapped to them within a 2 bases upstream and 5 bases downstream window.

Each miRNA is clickable and additional information is provided for the selected miRNA and its precursor stem-loop. The sequence and the minimum free energy are reported, as well as the secondary structure. The secondary structure nucleotides are per default colored according to the type of structure they belong to:

Green: Stems (canonical helices)
Red: Multiloops (junctions)
Yellow: Interior loops
Blue: Hairpin loops
Orange: 5' and 3' unpaired region

The 3' and 5' miRNA are highlighted specifically as well.

The pileup plot of the reads mapping to the stem-loop can be inspected. It can be filtered according to the minimum number of reads mapping to a position. Further, the plot can be dragged and zoomed to visualize the exact sequences. The number of reads with a given sequence is shown on the left of the pileup plot (e.g. x1234 is a read occurring 1,234 times).

The read distribution on the precursor is visualized showing the total number of sequences mapping at each position. The distribution plot can be displayed either as bar plot or as area plot.
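
The values behind such a distribution plot can be sketched as a per-position coverage computation. The read representation (0-based start on the precursor, length, count of identical reads) is an assumption made for illustration.

```python
def coverage(precursor_length, reads):
    """Total number of sequences covering each precursor position.

    reads: iterable of (start, length, count) tuples, start being 0-based.
    """
    depth = [0] * precursor_length
    for start, length, count in reads:
        first = max(0, start)
        last = min(precursor_length, start + length)
        for position in range(first, last):
            depth[position] += count
    return depth


# Two collapsed reads: one seen 2 times, one seen 5 times, overlapping at position 2.
depth = coverage(6, [(0, 3, 2), (2, 3, 5)])
```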

6 - Prediction

The prediction page provides information about the predicted precursors in the sample, before the merging step performed for the overall report. All predictions are listed in a table and include the number of reads mapping to the 5' and 3' predicted miRNAs, the genomic location of the precursor, its predicted probability of being a true precursor and whether it contains already known miRNAs. Since we don't exclude known precursors, many entries in this table will contain re-discovered known precursors.

Each precursor is clickable and additional information is provided for the selected prediction, as in the quantification tab. The sequence and the minimum free energy are reported, as well as the secondary structure. The additional information can be shown for the precursor sequence (determined by the start of the 5' miRNA and the end of the 3' miRNA) and also for the precursor stem-loop which was determined during the precursor excision step.

The secondary structure nucleotides are per default colored according to the type of structure they belong to:

Green: Stems (canonical helices)
Red: Multiloops (junctions)
Yellow: Interior loops
Blue: Hairpin loops
Orange: 5' and 3' unpaired region

The 3' and 5' miRNA are highlighted specifically as well.

The pileup plot of the reads mapping to the precursor/stem-loop can be inspected. It can be filtered according to the minimum number of reads mapping to a position. Further, the plot can be dragged and zoomed to visualize the exact sequences. The number of reads with a given sequence is shown on the left of the pileup plot (e.g. x1234 is a read occurring 1,234 times).

The read distribution on the precursor/stem-loop is visualized showing the total number of sequences mapping at each position. The distribution plot can be displayed either as bar plot or as area plot.

Overall report (example)

The report is divided into six categories and is structured similarly to the single sample report. All plots and tables can be downloaded in multiple formats (png, svg, pdf, csv, xlsx). For all categories except the first, at least an overview table is shown. A detailed per-sample table is available for the same categories as well; it is hidden by default to allow faster page loading. The overview table provides the median, mean, standard deviation and coefficient of variation for each reported value. If groups were specified, the measures are provided for each group.
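
The four measures of the overview table can be reproduced from the per-sample values with Python's statistics module. Whether miRMaster uses the sample or the population standard deviation is not stated here; the sketch assumes the sample standard deviation.

```python
import statistics


def summarize(values):
    """Median, mean, standard deviation and coefficient of variation."""
    mean = statistics.mean(values)
    # Sample standard deviation (assumption); undefined for a single value.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    cv = sd / mean if mean != 0 else float("nan")
    return {
        "median": statistics.median(values),
        "mean": mean,
        "sd": sd,
        "cv": cv,
    }


summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

With groups, the same function would simply be applied to the values of each group separately.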

1 - Overview

The overview page provides information about the experiment id, name, the total number of analyzed samples and reads.

2 - Preprocessing

The preprocessing page provides information about the preprocessing of the sequences and their mean GC content.

Overview table.

The detailed sample information table shows all numbers per sample.

3 - Mapping

The mapping page provides information on the number of mapped sequences against different references. It is subdivided into five tabs.

Genome

Reads are mapped against the human genome (hg38) per default without mismatches in the first 18 nucleotides and reads mapping to more than 5 locations (per default) are discarded. We report the mapping statistics regarding all reads on one hand, and regarding the collapsed reads (leading to only unique sequences and counting each as occurring only once) on the other.

Overview table.

The detailed sample information table shows all mapping statistics per sample.

miRNA

Reads are mapped against the human miRBase v21 [1] while allowing one mismatch per default. We report the number and portion of reads mapping to known miRNAs within a 2 bases upstream and 5 bases downstream window. Further we report the portion of reads mapping exclusively to a miRBase stem-loop, but not to a miRNA.

Overview table.

The detailed sample information table shows all mapping statistics per sample.

Other ncRNA

Reads are mapped against other ncRNAs of Ensembl (release 85) [2] without mismatches per default.

Overview table.

The detailed sample information table shows all mapping statistics per sample.

Bacteria/Viruses

We map non-human reads (all reads that did not align to the human genome with at most one mismatch) to all 7,556 bacteria and 7,026 virus sequences of NCBI RefSeq release 74 [3] and report how many reads per million (RPM) are perfectly mapping to each species. Reads mapping to bacteria or viruses can indicate exogenous miRNAs, but also reagent contamination or diseases such as sepsis.
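
Reads per million (RPM) normalization scales raw counts by the library size. The following sketch shows the arithmetic; which read total miRMaster uses as the denominator is not specified in this tutorial, so total_reads is left as an explicit parameter and the counts are made-up values.

```python
def to_rpm(counts, total_reads):
    """Convert raw counts to reads per million, given the library size."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {name: count * 1_000_000 / total_reads for name, count in counts.items()}


# Hypothetical counts in a library of 2,000,000 reads.
rpm = to_rpm({"species_a": 50, "species_b": 150}, total_reads=2_000_000)
```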

Overview table.

The detailed sample information table shows all mapping statistics per sample.

4 - Quantification other ncRNA

The ncRNA quantification page shows the expression of all known ncRNAs from Ensembl, piRBase and GtRNAdb. It is subdivided into seven tabs.

lincRNA

lincRNAs are taken from Ensembl. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

piRNA

piRNAs are taken from piRBase. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

rRNA

rRNAs are taken from Ensembl. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

tRNA

tRNAs are taken from GtRNAdb. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

snoRNA

snoRNAs are taken from Ensembl. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

scaRNA

scaRNAs are taken from Ensembl. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

snRNA

snRNAs are taken from Ensembl. In the overview the expression statistics are reported for all samples regarding the RPM normalized expression.

The detailed sample information table shows all expression values per sample.

5 - Quantification miRNA

The miRNA quantification page shows the expression of all known miRNAs of miRBase v21 [1].

miRNA enrichment analysis using miEAA [4] can be performed by clicking one of the provided options in the toolbar that will send the expressed miRNAs ranked according to their mean or median expression when no grouping was specified. If groups were given, miRNAs are sorted according to their fold-change.

The target networks of all miRNAs can be inspected by following the link to miRTargetLink [5] provided in the overview table for each miRNA.

If groups are given the overview table contains the false discovery rate adjusted [6] and unadjusted Kruskal-Wallis p-values, and the (log) fold changes of all group comparisons in addition to the usual measures.
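
The false discovery rate adjustment and the log fold change can be sketched in plain Python. The p-value adjustment follows the Benjamini-Hochberg step-up procedure; the fold change definition (log2 ratio of group means with a pseudocount of 1) is an assumption, as the exact formula used by miRMaster is not given here.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
fold = log2_fold_change([7, 7], [1, 1])
```

The unadjusted p-values themselves would come from a Kruskal-Wallis test on the per-sample expression values of each miRNA, for example with scipy.stats.kruskal if SciPy is available.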

The detailed sample expression shows the expression of each miRNA in each analyzed sample. The expression can either be in RPM or alternatively the raw counts.

6 - Prediction

The prediction page shows all predicted precursors and their miRNAs. Each predicted precursor belongs to one of six categories:

  • Known: when the prediction is overlapping with a miRBase entry and both miRNAs are overlapping with known miRNAs by at least 75%.
  • Shifted known: when the prediction is only partially overlapping with miRBase and only one miRNA is overlapping by at least 75% with a known miRNA.
  • One annotated: when the prediction is overlapping with a miRBase entry, but only one miRNA is annotated for that entry and this one is overlapping by at least 75% too.
  • Dissimilar overlapping: when the prediction is overlapping with a miRBase entry, but the miRNAs are not overlapping with the annotated ones.
  • Half novel: when the prediction is not overlapping with any miRBase entry, but contains at least 75% of one known miRNA.
  • Novel: when the prediction is not overlapping with any miRBase entry and does not contain any known miRNA.

Per default only novel precursors are shown, but others can be displayed by selecting the corresponding category in the toolbar at the top of the page.

If groups are given the overview table contains the false discovery rate adjusted [6] and unadjusted Kruskal-Wallis p-values, and the (log) fold changes of all group comparisons in addition to the usual measures.

The detailed sample expression shows the expression of each miRNA in each analyzed sample. The expression can either be in RPM or alternatively the raw counts.

The detailed prediction information shows multiple properties for each predicted precursor.

  • The precursor name
  • The probability of being a true precursor
  • The number of samples in which it was predicted (this can be different from the number of samples in which it is expressed)
  • The mean RPM count of the 3p and 5p miRNA
  • The novoMiRank score of the precursor
  • The precursor sequence, secondary structure in bracket notation, 3p and 5p miRNA sequence and the genomic location
  • The motifs found in the precursor. We search five different miRNA motifs: UG, UGU/GUG, CNNC [7], GHG [8], GGAC [9], while allowing matching up to two nucleotides upstream or downstream of the expected motif position.
  • The precursor category
  • The columns blacklist_5p and blacklist_3p show all entries from the Ensembl human non-coding RNA database (release 85) [2] that are not marked as novel, and all entries from NONCODE 2016 with which the 5' and 3' miRNA overlap by at least 90% and differ in at most one position.
  • The columns blacklist_5p_novel and blacklist_3p_novel show all entries from the Ensembl human non-coding RNA database (release 85) [2] that are marked as novel and with which the 5' and 3' miRNA overlap by at least 90% and differ in at most one position.
  • The columns mirbase_5p and mirbase_3p show all miRNAs of miRBase v21 [1] with which the 5' and 3' miRNA overlap by at least 90% and differ in at most one position.
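
The secondary structure in bracket notation can be processed with a small stack-based parser, for example to list the base pairs or to compute the fraction of paired bases within a miRNA. This is an illustrative sketch with hypothetical function names; it handles only plain dot-bracket strings made of '(', ')' and '.'.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs of a dot-bracket string."""
    stack = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def paired_fraction(structure, start, end):
    """Fraction of positions in [start, end) that take part in a base pair."""
    paired = {i for pair in parse_dot_bracket(structure) for i in pair}
    region = range(start, end)
    return sum(1 for i in region if i in paired) / len(region)


pairs = parse_dot_bracket("((..))")
ratio = paired_fraction("((..))", 0, 3)
```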

Each precursor is clickable and additional information is provided for the selected prediction. The sequence and the minimum free energy are reported, as well as the secondary structure. The additional information can be shown for the precursor sequence (determined by the start of the 5' miRNA and the end of the 3' miRNA) and also for the precursor stem-loop (15 nucleotides extension on the 5' and 3' ends).

The secondary structure nucleotides are per default colored according to the type of structure they belong to:

Green: Stems (canonical helices)
Red: Multiloops (junctions)
Yellow: Interior loops
Blue: Hairpin loops
Orange: 5' and 3' unpaired region

The 3' and 5' miRNA are highlighted specifically as well.

The pileup plot of the reads mapping to the precursor/stem-loop can be inspected. It can be filtered according to the minimum reads per million (RPM) expression of reads mapping to a position. Further, the plot can be dragged and zoomed to visualize the exact sequences.

The read distribution on the precursor/stem-loop is visualized showing the total RPM of all sequences mapping at each position. The distribution plot can be displayed either as bar plot or as area plot.

Bibliography

  1. Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 42, D68-73 (2014).
  2. Yates, A. et al. Ensembl 2016. Nucleic Acids Res. 44, D710–D716 (2016).
  3. Pruitt, K., Brown, G., Tatusova, T. & Maglott, D. The Reference Sequence (RefSeq) Database. NCBI Handb. 1–24 (2002).
  4. Backes, C., Khaleeq, Q. T., Meese, E. & Keller, A. miEAA: microRNA enrichment analysis and annotation. Nucleic Acids Res. 44, W110–W116 (2016).
  5. Hamberg, M. et al. miRTargetLink: miRNAs, Genes and Interaction Networks. Int. J. Mol. Sci. 17, 564 (2016).
  6. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
  7. Auyeung, V. C., Ulitsky, I., McGeary, S. E. & Bartel, D. P. Beyond secondary structure: primary-sequence determinants license pri-miRNA hairpins for processing. Cell 152, 844–858 (2013).
  8. Fang, W. & Bartel, D. P. The Menu of Features that Define Primary MicroRNAs and Enable De Novo Design of MicroRNA Genes. Mol. Cell 60, 131–145 (2015).
  9. Alarcón, C. R., Lee, H., Goodarzi, H., Halberg, N. & Tavazoie, S. F. N6-methyladenosine marks primary microRNAs for processing. Nature 519, 482–485 (2015).