Introduction

Built environments impact human health and disease, especially in countries where people spend a major part of the day indoors1. The indoor microbiome originates from many different sources, such as the communities of microbes that reside in/on the human body, from building components such as plumbing and ventilation, as well as from outdoor environmental sources that are brought inside2. Studying the indoor microbiome may help us understand how the indoor environment affects human health3,4,5.

Several studies have investigated the taxonomic diversity of bacterial communities in dust samples from buildings6,7,8,9,10. Amplification of the 16S rRNA gene coupled with high-throughput sequencing (HTS) allows for deep investigations into microbial communities. Technological advances continue to drive down costs, making HTS affordable and available for use in a wide range of novel research areas. Although several sequencing platforms and standardized protocols are available for HTS analysis11, there are differences between them, and results may therefore diverge. Illumina sequencing platforms, which produce very high-quality but short (~300 bp) reads, have been widely employed in the field of 16S rRNA amplicon sequencing11. This approach only permits analysis of a subregion of the 16S rRNA gene, and taxonomic assignment of reads at the species level may be elusive.

In 2015, Oxford Nanopore Technologies (ONT) made the ultraportable mobile phone-sized MinION platform based on the ONT single molecule sequencing technology commercially available. The nearly unrestricted read length possible with the MinION sequencer allows for sequencing of full-length 16S rRNA gene amplicons, albeit with a slightly lower per read accuracy than many other HTS platforms. Despite the higher error rate, the increased sequence length provided by MinION might make possible the identification of bacterial taxa to the species level12.

Although the potential of using the MinION platform to analyze the bacterial composition at the species level is promising, this has not been comprehensively explored. The major aim of the present study, although restricted to a relatively small number of samples, was to investigate whether the ONT MinION sequencing platform might offer promise for investigating the structure of the microbiota in dust collected from kindergartens and nursing homes. We consider how long-read sequences (ca. 1400 bp) obtained from the MinION sequencer compare to short-read sequences (ca. 300 bp) obtained from Illumina MiSeq for classification of bacteria present in the indoor environment.

Results

Generation of 16S rRNA gene amplicon sequences

Illumina 300-bp paired-end sequencing generated a total of 2203794 sequence reads, with on average 183650 sequence reads per dust sample. After quality filtering, a total of 582032 sequence reads, with on average 48503 amplicon sequence variants (ASVs) per dust sample, were kept for analysis (Table 1).

Table 1 Sequence reads generated per sample for both short-read and long-read amplicons.

Sequencing of long-read 16S rRNA amplicons on Nanopore MinION generated a total of 2408076 sequence reads after basecalling, with on average 200673 sequence reads per sample. After quality filtering of the basecalled sequences, 1156807 sequence reads were retained with an average of 96401 sequence reads per sample (Table 1).
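The per-sample averages and the share of reads surviving quality filtering follow directly from the totals reported above. The short Python sketch below recomputes them from those totals; the dictionary layout and the function name `summarize` are illustrative choices, not part of the study's pipeline.

```python
# Read totals as reported in the text, summed over the 12 dust samples.
N_SAMPLES = 12

counts = {
    "Illumina MiSeq": {"raw": 2_203_794, "filtered": 582_032},
    "Nanopore MinION": {"raw": 2_408_076, "filtered": 1_156_807},
}


def summarize(raw, filtered, n_samples=N_SAMPLES):
    """Return per-sample means and the fraction of reads kept after filtering."""
    return {
        "mean_raw": raw / n_samples,
        "mean_filtered": filtered / n_samples,
        "retained_fraction": filtered / raw,
    }


summary = {platform: summarize(**c) for platform, c in counts.items()}

for platform, s in summary.items():
    print(
        f"{platform}: {s['mean_raw']:.0f} raw reads/sample, "
        f"{s['mean_filtered']:.0f} filtered reads/sample, "
        f"{s['retained_fraction']:.1%} retained"
    )
```

On these totals, roughly a quarter of the Illumina reads and just under half of the Nanopore reads pass filtering, which is worth keeping in mind when comparing relative abundances between the platforms.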

Taxonomic assignment of 16S rRNA gene amplicon sequences

For the short-read sequences, 582032 ASVs were taxonomically assigned using vsearch against Greengenes (GG) and SILVA. The DADA2 pipeline uses an ASV approach where the sequences themselves function as the unique identifier for taxa, rather than grouping reads into operational taxonomic units (OTUs). 1156751 long-read sequence reads were passed from quality control to taxonomic assignment and aligned using LAST against GG and SILVA. The full SILVA and Greengenes databases contain approximately 190 000 and 99 000 sequences, respectively9.

The degree of assignment of long and short read sequences at different taxonomic levels, obtained when using GG and SILVA reference databases, is shown in Table 2. With respect to short read sequences, SILVA achieved a higher degree of identification at all taxonomic levels. However, for long read amplicons there was more variation in the performance of the databases. SILVA performed better at the species level and GG was able to assign more taxa at the higher levels, particularly at the order level (Table 2).

Table 2 Taxonomic assignment of short-read (Illumina MiSeq) and long-read (Nanopore MinION) amplicons against the Greengenes (GG) and SILVA 16S rRNA gene reference databases.

Efficiency of taxonomic assignments based on long- and short-reads

When using GG, a total of 732 taxa were identified at the species level based on long- and short-reads. Of these, 91.7% could only be assigned based on long-reads generated by the MinION platform (Table 3). When using SILVA, 10475 bacterial species were identified. Of these, 99.5% were only found by analysis of long-read sequences.
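The percentages above are set comparisons: the taxa found by either platform form the total, and the taxa found by long reads but not short reads form the "long-read only" share. A minimal Python sketch of that bookkeeping is given below. The function name and the toy taxon sets are hypothetical and serve only to illustrate the calculation; they are not the study's data.

```python
def platform_overlap(long_read_taxa, short_read_taxa):
    """Compare the taxa detected by two platforms using set operations."""
    long_set, short_set = set(long_read_taxa), set(short_read_taxa)
    total = long_set | short_set          # detected by either platform
    only_long = long_set - short_set      # detected by long reads only
    only_short = short_set - long_set     # detected by short reads only
    shared = long_set & short_set         # detected by both platforms
    pct_only_long = 100.0 * len(only_long) / len(total) if total else 0.0
    return {
        "total": len(total),
        "only_long": len(only_long),
        "only_short": len(only_short),
        "shared": len(shared),
        "pct_only_long": pct_only_long,
    }


# Toy input (illustrative only, not the study's results).
toy_long = {"Micrococcus luteus", "Haemophilus influenzae",
            "Stenotrophomonas maltophilia", "Streptococcus salivarius"}
toy_short = {"Micrococcus luteus", "Pseudomonas sp."}

overlap = platform_overlap(toy_long, toy_short)
print(overlap)
```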

Table 3 Number of taxa identified at the different taxonomic levels using GG and SILVA.

Bacterial taxa in dust samples revealed by short and long-read 16S rRNA gene sequencing

Both short-read amplicons sequenced by Illumina MiSeq and long-read amplicons sequenced by Nanopore MinION were taxonomically assigned against the GG and SILVA databases. The microbial classifications obtained were compared at different taxonomic levels (order, family, genus, and species) for all 12 samples. The relative abundances of the 15 most abundant taxa determined at genus and species level with each platform are shown using heatmaps in Figs. 1, 2, 3 and 4. Heatmaps for order and family level are shown in Supplement 1-4.

Figure 1

Heatmap of the 15 most abundant genera identified by mapping 16S rRNA gene amplicons sequenced on Illumina MiSeq and Nanopore MinION against the Greengenes reference database.

Figure 2

Heatmap of the 15 most abundant genera identified by mapping 16S rRNA gene amplicons sequenced on Illumina MiSeq and Nanopore MinION against the SILVA reference database.

Figure 3

Heatmap of the 15 most abundant species identified by mapping 16S rRNA gene amplicons sequenced on Illumina MiSeq and Nanopore MinION against the Greengenes reference database.

Figure 4

Heatmap of the 15 most abundant species identified by mapping 16S rRNA gene amplicons sequenced on Illumina MiSeq and Nanopore MinION against the SILVA reference database.

At the species level, only a few taxa were identified by both long-read and short-read sequences (Table 3, Figs. 3 and 4). This is most notable for alignments against the SILVA database, where most of the taxa were identified only by the long-read sequencing platform, e.g. Micrococcus luteus, Streptococcus salivarius subsp. thermophilus and Haemophilus influenzae. The opportunistic pathogen Stenotrophomonas maltophilia was identified at low relative abundance across all samples but only when using the SILVA database. Species-level assignments also reveal signature differences between intake and indoor samples. The commensal M. luteus, although identified in all samples using both databases, is indicated at consistently higher relative abundances in samples originating in the indoor space, particularly floor dust. A somewhat similar trend, again only revealed by long-read sequencing, was found for the nasopharynx commensal Haemophilus influenzae. In almost every instance, only long-read sequences were able to indicate the presence of these species (Figs. 3 and 4).

At the genus level, both GG and SILVA alignments showed that the short-read Illumina amplicons gave higher relative abundances of Pseudomonas in all samples. Samples from outdoor sources (BC01, BC02, BC04, and BC06) showed the largest differences between long and short reads.

As SILVA and GG performed somewhat differently in assigning long-read amplicons, the dataset was also analysed using BLAST against the NCBI 16S rDNA database. The curated NCBI 16S database contains approximately 20 000 sequences, compared to the 190 000 and 99 000 sequences in the full SILVA and Greengenes databases, respectively. Table 4 illustrates the most abundant taxa at all sample sites using all three databases, GG, SILVA and NCBI. The results with NCBI were most similar to those obtained with GG. Samples from heating, ventilation and air conditioning (HVAC) exhaust filter dust (BC03, BC05, BC07, and BC12) and floor dust (BC08-BC11) had a higher abundance of genera associated with human activity (e.g. Streptococcus, Micrococcus, Staphylococcus, Corynebacterium) (Table 4, Figs. 1 and 2). Conversely, genera commonly found in soil and water (e.g. Janthinobacterium, Hymenobacter, Pedobacter) were generally abundant in samples BC01, BC02, BC04, and BC06, which were intake air dust samples originating from outdoor sources (Table 4, Figs. 1 and 2).

Table 4 The most abundant taxa at the genus level identified from the three different sample types using Illumina short-read sequences and Nanopore long-read sequences and three different databases.

Long-read and short-read sequencing correlation

Spearman’s rank correlation illustrated that the sequencing platforms revealed similar bacterial composition at the level of order and family, while the results at the genus and species levels differed to a higher degree for some samples (Fig. 5, Supplement 5-12).

Figure 5

Correlation of identified taxa at (a) the genus level against GG, (b) genus level against SILVA, (c) species level against GG, and (d) species level against SILVA between sequencing platforms for all 12 samples. The dashed lines mark a 0.01% relative abundance threshold for each taxon for Nanopore and Illumina sequence data.

Analysis of individual samples showed a strong or moderate positive correlation between the sequencing platforms at the order level for all samples (Supplement 5 and 9). At the family level, eight samples had a strong positive correlation between the sequencing platforms when aligned against GG, whereas eight samples had a moderate positive correlation (Supplement 6). When aligned against SILVA, two samples had a moderate positive correlation (BC08 and BC12), and six samples had a weak positive correlation. The remaining samples had either a negligible or non-significant correlation (Supplement 10). At the genus level, the results obtained with long and short reads against GG showed a moderate positive correlation for samples BC03 and BC07. For the remaining samples, the correlations were either negligible or non-significant (Supplement 6). All samples had either a negligible or non-significant correlation at the genus level when aligned against SILVA (Supplement 11). At the species level, all samples had either a negligible or non-significant correlation between the sequencing platforms when aligned against both GG and SILVA (Supplement 8 and 11).

In the correlation plot of the identified taxa (Fig. 5) it can be seen that a larger proportion of the Nanopore sequences fall below 0.01% abundance compared to Illumina sequences. This is seen at both the genus and species level for identifications against both GG and SILVA.
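The 0.01% threshold in Fig. 5 applies to relative abundance, i.e. a taxon's read count as a percentage of all assigned reads in the sample. The sketch below shows one way to compute relative abundances and the proportion of taxa falling below such a threshold. The function names and the toy counts are hypothetical, chosen only to illustrate the calculation.

```python
def relative_abundance(read_counts):
    """Convert per-taxon read counts into percentages of the sample total."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample contains no assigned reads")
    return {taxon: 100.0 * n / total for taxon, n in read_counts.items()}


def fraction_below(rel_abund, threshold_pct=0.01):
    """Return the share of taxa whose relative abundance is below the threshold."""
    if not rel_abund:
        return 0.0
    below = sum(1 for value in rel_abund.values() if value < threshold_pct)
    return below / len(rel_abund)


# Toy counts (illustrative only): taxon_C accounts for 0.005% of the reads.
toy_counts = {"taxon_A": 90_000, "taxon_B": 9_995, "taxon_C": 5}
toy_rel = relative_abundance(toy_counts)
toy_fraction = fraction_below(toy_rel)
print(toy_rel, toy_fraction)
```

Because the filtered Nanopore dataset contains about twice as many reads per sample as the Illumina dataset, a taxon represented by only a handful of reads sits at a lower relative abundance in the former, which may contribute to the pattern described above.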

Discussion

We analyzed 16S rRNA gene amplicons generated from 12 dust samples collected from kindergartens and nursing homes in Norway. Two types of sequencing libraries were prepared: Short-read amplicons for sequencing on Illumina MiSeq were prepared by amplifying the V3-V4 hypervariable regions (approximately 464 bp) of the 16S rRNA gene. Long-read amplicons for sequencing on Nanopore MinION covered the V1-V9 hypervariable regions (approximately 1465 bp), making up nearly the full length of the 16S rRNA gene.

Because of the different read length capabilities of the two sequencing platforms, different regions and different primer pairs were used for Nanopore MinION and Illumina MiSeq sequencing. The 16S rRNA regions are variably informative, and the region analyzed is, therefore, likely to affect the taxonomic outcome. Soergel et al.13 computed the classification rate for 374 pairings of 22 forward primers and 22 reverse primers for 16S rRNA and read lengths across different environments. They found that primer choices greatly affect taxonomic informativeness and that the most informative primers differed with respect to the material under investigation. For dust and skin samples, primer 1492R combined with 341F, was shown to produce robust predictions at the genus level13. In the present study, the primer pair 1492R/27F was used for the MinION procedure. The Illumina analyses were performed by a commercial laboratory which routinely uses the primer pair 341F/805R.
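The approximate amplicon lengths quoted for the two libraries are consistent with the primer names, which by convention refer to positions in the 16S rRNA gene. The sketch below derives the expected span from a primer pair; the function name is illustrative, and the result is only an approximation because it ignores primer length and natural length variation between taxa.

```python
def expected_amplicon_length(forward_primer, reverse_primer):
    """Approximate amplicon span from primer names such as '341F' and '805R'."""
    start = int(forward_primer.rstrip("Ff"))   # position of the forward primer
    end = int(reverse_primer.rstrip("Rr"))     # position of the reverse primer
    if end <= start:
        raise ValueError("reverse primer must lie downstream of forward primer")
    return end - start


short_read_span = expected_amplicon_length("341F", "805R")   # V3-V4 region
long_read_span = expected_amplicon_length("27F", "1492R")    # V1-V9 region
print(short_read_span, long_read_span)
```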

An additional factor long known to affect taxonomic classifications is the choice of reference databases, as the number and origins of reference sequences included in different databases varies greatly14.

Since few microbiome studies exist with full-length 16S rRNA sequences, the genus level is commonly used for comparison of samples or environments. The major genera identified in the present study are in general agreement with previously published works on indoor dust microbiomes15,16,17. Both long- and short-read sequences, when matched against the databases used in this study, revealed the same signature differences between the bacterial content of outdoor and indoor samples, i.e. a relative preponderance of taxa associated with human activity in the latter. Furthermore, both sequencing platforms (including here primer choice) resulted in similar taxonomic classifications for all samples at the order and family level. Both platforms performed similarly for samples originating from the indoor environment (i.e. HVAC exhaust and floor dust samples), whereas samples of outdoor origin (i.e. HVAC intake samples) manifested greater differences between the sequencing platforms. Thus, either approach could be used where the aim is to reveal the major structural differences in bacterial content of the indoor and outdoor spaces.

However, at the genus and particularly species levels, some key differences emerge in the datasets with respect to the sequencing technologies used and the databases accessed. The MinION platform, which provided nearly full-length 16S rRNA gene sequences, gave a significantly higher resolution at the species level (Table 3). A number of species were identified only with long-read sequences (Figs. 3 and 4), suggesting that a partial sequence region of the 16S rRNA gene cannot provide the same taxonomic resolution as full-length sequences18. This is in line with Shin et al. who compared the mouse microbiome as revealed by the same two sequencing platforms19. Taken together, these two studies suggest that MinION may be able to provide high taxonomic resolution of fundamentally different microbiomes. However, some studies show that analysis of the whole rrn operon (16S rRNA–ITS–23S rRNA) represents a more powerful tool than analysis of merely the 16S rDNA gene for resolution of taxa at the species level20. Basing their analyses on the rrn operon, Cusco et al.16 were able to delineate a greater number of species in the sequence data, further illustrating the limitations of the 16S rDNA alone in species allocation20,21. Identification to the species level is important not only because it provides a more detailed description of the microbial communities of interest, but also because pathogenicity is usually a species or strain level phenomenon22. For example, some species of potentially medical importance were only identified using long read sequences and only with one or another database. S. maltophilia was only detected when matching long, and for some samples short sequences, against the SILVA database (Fig. 4). S. maltophilia is an environmental opportunistic pathogen. The incidence of nosocomial and community-acquired infections (particularly respiratory) of immunocompromised individuals caused by this species, is an increasing concern23. 
Furthermore, only short-read Illumina sequences, when matched against the SILVA database, produced a species-level identification for a member of the genus Pseudomonas. The genus Pseudomonas houses some opportunistic human-pathogenic species, most notably P. aeruginosa. However, particularly when drawing conclusions concerning genus and species level identification using sequencing, one has to consider the risk of wrongly assigned taxonomies. The use of reference databases that contain larger numbers of sequences could increase the risk of false positive identifications. The most widely used databases in similar studies are Greengenes and SILVA, as these are included in many of the commonly used pipelines for analysis of 16S rRNA sequencing data. Therefore, although more limited in terms of the number of sequences, the highly curated NCBI 16S rRNA database was also included to assign taxonomies at the genus level (Table 4). The results with NCBI are most similar to those obtained with Greengenes, providing support for the continued use of the latter.

Conclusion

Results for 16S rRNA amplicon analysis obtained with MinION are promising. Oxford Nanopore’s long-read chemistry could make species level identification of the bacteria comprising building-dust microbiomes more accessible, thus improving classifications of these bacterial communities. The present study is to our knowledge the first attempt to investigate the indoor microbiome using the Nanopore MinION sequencing technology. We demonstrate that species level identification may be possible, which could be useful when studying potential routes of disease transmission in the indoor space. However, more comprehensive analyses using a larger number of replicates are required to confirm the suggestions put forth in this paper. The low sampling volume provides an insufficient number of biological replicates to make accurate profiles of the dust microbiomes. Following on, it would also be useful to analyze larger data sets with additional, curated rRNA genes databases to see if these reveal similar structures to those presented here, or if new details emerge.

Methods

Samples

Building dust samples were collected from kindergartens and nursing homes in Norway. Samples BC01-BC05 (Table 5) are dust samples collected from HVAC filters from HVAC units located in nursing homes. Samples BC06, BC07, and BC12 were collected from HVAC filters in kindergartens. Samples BC08-BC11 are floor dust samples collected from a kindergarten. HVAC filter dust samples were collected as described in Nygaard and Charnock15. Procedures for sampling of floor dust samples were as given in Nygaard and Charnock24.

Table 5 Sample identification, description and origin.

DNA extraction

DNA was extracted from approximately 100 mg dust from each sample using the PowerWater DNA isolation kit (MO BIO, CA, USA) as previously described by Nygaard et al.15. DNA concentrations were measured using a Qubit 3.0 fluorometer and the Qubit dsDNA HS Assay kit (Thermo Fisher Scientific, Waltham, MA, USA).

Sequencing

Long-read 16S Nanopore sequencing

Five ng DNA from each sample were used in PCR reactions with 16S primers 27F and 1492R (MWG Eurofins, GmbH) for amplification of the near full-length bacterial 16S rRNA gene (Table 6). Amplicons (800 ng) from each sample were end repaired and dA-tailed using NEBNext End-Repair and NEBNext dA-Tailing modules (New England Biolabs) according to the manufacturer’s instructions. Using the 1D Native barcoding genomic DNA kit EXP-NBD103, R9 version (Oxford Nanopore Technologies, Oxford, UK), barcodes were ligated to the dA-tailed DNA using Blunt/TA Ligase Master Mix (New England Biolabs). Sequencing adapters were then ligated to the pooled barcoded reads according to the manufacturer’s instructions using sequencing kit 1D SQK-LSK108, R9 version (Oxford Nanopore Technologies) to complete the library building. Sequencing was performed using a FLO-MAP R7.3 flowcell for 48 hours on the MinION portable sequencer (Oxford Nanopore Technologies). Nanopore sequence data are deposited in the European Nucleotide Archive (ENA) and are available through accession numbers ERS2702700-ERS2702711.

Short-read 16S Illumina MiSeq sequencing

DNA from the same samples was sent to a commercial laboratory, Omega Bioservices (Atlanta, Georgia, USA), for 2 × 300 bp paired-end sequencing. The libraries were prepared using Illumina 16S Metagenomic Sequencing kit (Illumina, Inc., San Diego, CA, USA) according to the manufacturer’s protocol. The V3-V4 region of the bacterial 16S rRNA gene sequences was amplified using the primer pair 341F-805R, containing the gene‐specific sequences and Illumina adapter overhang nucleotide sequences. Primer sequences are shown in Table 6. Illumina sequence data has been deposited in the ENA and is available through accession numbers ERS2702688-ERS2702699.

Table 6 Primers used for generating short-read and long-read amplicons.

Sequence analysis

Taxonomic reference databases

After sequence data processing (described below) both long- and short-read amplicons were taxonomically assigned using the GG 13_8 97% reference sequences25 and the SILVA 132 99% reference sequences. In addition, long-read amplicons were taxonomically assigned using the NCBI 16S rDNA database.

Long-read 16S sequencing data processing, taxonomic assignment and analysis

Raw fast5 reads were basecalled, sorted by their respective barcodes and converted to fastq files using Albacore (version 2.1.10). Sequencing adapters were removed using Porechop (version 0.2.3) (https://github.com/rrwick/Porechop) and the trimmed sequences were quality filtered using NanoFilt (version 1.8.0) (https://github.com/wdecoster/nanofilt). Sequences were filtered on a minimum average read quality score, and only sequences with an average quality score of 9 or above were retained. Resulting fastq files were converted to fasta using Fastx-Toolkit. The trimmed and quality filtered reads were then aligned against the GG 13_8 97% reference sequences25 and the SILVA 132 99% reference sequences using the LAST aligner (v.921) (http://last.cbrc.jp/) with the following parameters: -r 1 -q 1 -a 1 -b 1 (match score of 1, mismatch cost of 1, gap opening cost of 1, and gap extension cost of 1). For each read, the highest scoring alignment was retained and assigned the taxonomic ID of the corresponding reference sequence. Taxonomic IDs with only one aligned sequence read were discarded from the sample.
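The filtering and assignment rules described above (average quality of 9 or above, best-scoring alignment per read, removal of taxonomic IDs supported by a single read) can be sketched in a few lines of Python. This is a simplified illustration, not the code used in the study. The function names and toy alignment tuples are hypothetical, and the choice to average quality through error probabilities rather than raw Phred values is an assumption about how a mean read quality is best computed.

```python
import math
from collections import Counter


def mean_quality(phred_scores):
    """Mean read quality, averaged over error probabilities (assumed method)."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def passes_filter(phred_scores, min_quality=9.0):
    """Keep a read only if its mean quality is at or above the cutoff."""
    return mean_quality(phred_scores) >= min_quality


def assign_taxa(alignments, min_reads=2):
    """Count reads per taxonomic ID from (read_id, taxon_id, score) tuples.

    Only the highest-scoring alignment of each read is used (the first one
    seen wins ties), and taxonomic IDs with fewer than `min_reads` reads are
    discarded, which removes IDs supported by a single read.
    """
    best = {}
    for read_id, taxon_id, score in alignments:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (taxon_id, score)
    counts = Counter(taxon_id for taxon_id, _ in best.values())
    return {taxon: n for taxon, n in counts.items() if n >= min_reads}


# Toy alignments (illustrative only).
toy_alignments = [
    ("read1", "taxon_A", 900),
    ("read1", "taxon_B", 850),   # lower score for read1, ignored
    ("read2", "taxon_A", 700),
    ("read3", "taxon_C", 650),   # taxon_C ends up with one read, discarded
]
toy_assignment = assign_taxa(toy_alignments)
print(toy_assignment)
```

Averaging through error probabilities penalises reads with a few very poor bases: a read with Phred scores 20 and 5 has an arithmetic mean of 12.5 but a probability-based mean below 9, so it would be removed under this sketch.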

The basecalled long-read 16S-sequences were also taxonomically assigned using the cloud-based EPI2ME Fastq 16S workflow provided by Nanopore. Here, basecalled sequences are mapped against the NCBI 16S bacterial database using BLAST. After that, each read is classified based on % coverage and identity.

Short-read 16S sequencing data processing, taxonomic assignment and analysis

Demultiplexed paired-end fastq files and a mapping file were used as input files. Sequences were pre-processed, quality filtered and analyzed using QIIME2 (2018.2 release) (https://qiime2.org/). DADA226 in QIIME2 was used for sequence correction and removal of chimeras. Paired sequence reads were joined and quality-filtered using the paired-end DADA2 pipeline, using default settings. Primers were trimmed using the --p-trim-left function. The forward reads were truncated to 290 bases and the reverse reads to 200 bases, allowing for an overlap of 25 bases in merged sequences. To generate taxonomy tables, sequences were assigned taxonomies using vsearch27 on the GG 13_8 97% reference database25 and the SILVA 132 99% reference database. The QIIME2 taxa barplot command was used for viewing the taxonomic composition of the samples and generating abundance data.

Statistical analysis

Spearman rank correlation was used to compare the samples' microbial community compositions as revealed by the sequencing platforms. Correlations between sequencing platforms were considered to be very strong if Spearman's rho (rs) was +/−0.9 to 1, strong if rs was +/−0.7 to 0.9, moderate if rs was +/−0.5 to 0.7, weak if rs was +/−0.3 to 0.5, or negligible if rs was +/−0.0 to 0.3, provided that p < 0.05 (refs. 28,29).
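To make the interpretation scheme concrete, the sketch below computes Spearman's rho as the Pearson correlation of average ranks and maps the result onto the strength categories listed above. It is a minimal pure-Python illustration and not the statistical software used in the study. The function names are hypothetical, the p-value is taken as an input rather than computed, and the handling of values falling exactly on a category boundary (assigned here to the stronger category) is an assumption, since the stated ranges share their end points.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_strength(rho, p_value, alpha=0.05):
    """Map rho and its p-value onto the categories used in the text."""
    if p_value >= alpha:
        return "non-significant"
    magnitude = abs(rho)
    if magnitude >= 0.9:
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    return "negligible"


# Toy relative abundances for the same taxa on two platforms (illustrative).
toy_rho = spearman_rho([0.5, 12.0, 3.1, 40.2], [0.7, 9.5, 2.0, 35.0])
print(toy_rho, correlation_strength(toy_rho, p_value=0.01))
```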