The analysis pipeline is detailed in the paper (preprint). Below follows a brief summary of the main points.

Initial steps and filtering

Chromium samples were typically sequenced one sample per lane, run through the Chromium cellranger pipeline and output manually inspected. Samples that showed no obvious structure in their t-SNE plots were excluded from further analysis. All subsequent analyses were automated using the adolescent mouse analysis pipeline based on the cytograph library.

Quality controls

Cells with less than 600 detected molecules (UMIs), or less than 1.2-fold molecule to gene ratio, were marked invalid. Genes detected in fewer than 20 cells or more than 60% of all cells were marked invalid. These filters were applied separately to each input file.

Preliminary clustering and classification

We trained a classifier to automatically detect the major class of each single cell, as well as classes representing doublets. We first manually annotated clusters to indicate major classes of cells: Neurons, Oligodendrocytes, Astrocytes, Bergman glia, Olfactory ensheathing cells, Satellite glia, Schwann cells, Ependymal, Choroid, Immune, and Vascular. For some of these classes, we distinguished proliferating cells (e.g. Cycling oligodendrocytes, i.e. OPCs). We also manually identified clusters that were clearly doublets between these major classes (e.g. Vascular-Neurons) as well as clusters that were of poor quality. We then trained a support vector classifier to discriminate all of these labels, and used this classifier to individually assess the class identity of each cell in each dataset, and to pool cells by major class into new files (with neurons further separated by tissue).

Removing doublets

To eliminate many doublets, we (1) removed clusters classified with ambiguous labels; (2) removed cells classified with a different label from the majority of cells in its clusters; (3) removed outliers when clustering, typically on the fringes of clusters in t-SNE space.

Level 1 analysis

We pooled samples by tissue and performed manifold learning, clustering, classification, gene enrichment, and marker gene detection (see below for details on these procedures).

Level 2 analysis

We split cells by major class according to the class assignment probability. For each cluster at level 1, we removed cells with conflicting classification (i.e. cells classified as Neuron in a cluster where the majority of cells are classified as Vascular). We performed the same analysis steps as for Level 1.

Level 3 analysis (neurons)

Because of the way we had dissected the brain, we would expect some clusters to appear in multiple tissues. For example, our olfactory sample included the anterior-most part of the cortex and underlying tissue (intended to cover the anterior olfactory nucleus), and could overlap with the cortex samples. In order to allow clusters to merge across such boundaries, and in order to improve resolution in clustering, we pooled cells in broader categories, and split them by (mostly) neurotransmitter.

Level 4 analysis

We manually curated all clusters, merging some and eliminating others. We then recomputed the manifolds, but did not recluster.

Level 5 analysis

To create the final consolidated dataset, we extensively annotated and named each cluster. We pooled all cells into a single file along with all metadata and annotations, and performed gene enrichment analysis and marker gene set discovery on this dataset.

Level 6 analysis

Level 6 is identical to level 5, but organized into subsets according to the taxonomy. This provides gene enrichment analysis and marker gene set discovery, individually for each taxon.

Manifold learning

We first selected 1000 informative genes with higher-than-expected variance, and used principal component analysis (PCA) to both reduce noise and to reduce the gene expression space further. We next sought to learn the shape of the manifold of cells (that is, the underlying lower-dimensional gene expression space on which cells are preferentially located) by constructing a balanced mutual k nearest-neighbor graph, performing Jaccard multilevel community clustering on this graph to define a preliminary set of cell types/states, allowing for selection of an even more informative set of genes, and then repeating the procedure (PCA, mutual KNN, clustering). We projected the multiscale KNN graph to two dimensions using a modified t-SNE algorithm by directly projecting the multiscale KNN. The result was a more accurate projection of the graph itself then standard tSNE, with more compact and well-defined neighborhoods. We stored the embedding as column attributes _X and _Y in the Loom files.

Clustering

We performed clustering on the multiscale KNN graph. We used Louvain multilevel community clustering. We first used Louvain clustering on the graph to find most clusters, and then isolated and re-clustered each cluster using DBSCAN in the low-dimensional space.

Gene enrichment

To aid interpretation of the data (and for gene selection, as noted above), we computed a set of genes enriched in each cluster. Enrichment scores are available as matrix layer enrichment in the aggregated Loom files (named “…agg.loom”). We also computed an enrichment q value by shuffling the expression matrix, available as layer enrichment_q. To find genes enriched at a 10% false discovery rate, for example, simply select genes with q scores below 0.1.

Trinarization

We used a Bayesian beta-binomial model to trinarize the raw data into calls “expressed”, “not expressed”, and “indeterminate”. A gene was considered expressed if it was estimated to be present in at least 20% of the cells with no more than 5% posterior error probability in our model. Trinarization scores are available in layer trinaries in the aggregated loom files.

Marker gene set discovery

We designed an algorithm to automatically propose marker sets for all clusters. Here, we define a marker gene set as a set of genes that are all expressed in a given cluster, but not all expressed in any other cluster. We used trinarization to judge if a gene is expressed or not in each cluster. Given a cluster, we first selected the most highly enriched gene, which would often not be unique to that cluster, but highly selective for a small number of closely related clusters. Next, we added the most specific gene, based on trinarization with a FDR of 0.05. This gene was very often specific to a very small number of clusters, and using the first two genes together would often lead to fully specific marker combinations. However, sometimes adding more genes would be necessary. We added genes one at a time by picking the most selective gene, in combination with the previously selected genes. When more than one gene was equally selective, we picked the one that was most highly enriched. We defined selectivity as the reciprocal of the number of clusters that would be assigned auto-annotation given the current gene set and the trinarization scores. That is, an annotation that would apply to k clusters would have selectivity 1/k. Adding more genes rapidly drove selectivity towards 1. We generated gene sets in this manner for all clusters, with up to six genes per cluster. We also calculated the the cumulative selectivity, specificity (difference between the posterior probability for the best cluster and that of the second-best cluster), and robustness (the posterior probability that all genes would be detected in the cluster, based on trinarization scores). We reported these statistics cumulatively for n = 1, 2, 3, 4, 5 and 6 genes. Generally, robustness drops as more genes are added, while selectivity increases. Specificity tends to increase as the gene set becomes more selective, but then decrease as it becomes less robust. We note that marker gene sets are excellent candidates to use for experimentally identifying cell types, e.g. based on genetic or antibody labelling. Marker gene sets and associated statistics for all clusters are provided in the wiki, and in the Loom files under column attributes MarkerGenes, MarkerSelectivity, MarkerSpecificity and MarkerRobustness.

Dendrogram construction

The starting point of the dendrogram construction was the 265 clusters. For each gene, we computed average expression, trinarization with f = 0.2, trinarization with f = 0.05 and enrichment score. For each cluster we also know the number of cells, annotations, tissue distribution and samples of origin. We defined major classes of cell types based on prior knowledge: neurons, astroependymal, oligodendrocytes, vascular (without VLMC), immune cells and neural crest-like. For each class, we defined pan-enriched genes based on the trinarization 5% score. Each class (except neurons) was tested against neurons, to find all the genes where the fraction of clusters with trinarization score = 1 in the class was greater than the fraction of clusters with trinarization score > 0.9 among neurons. We bounded the number of detected genes in each cluster to the top 5000 genes expressed, followed by scaling the total sum of each cluster profile to 10,000. Next, we selected genes for linkage analysis: from each cluster select the top N=28 enriched genes (based on pre-calculated enrichment score), perform initial clustering using linkage (Euclidean distance, Ward in Matlab), and cut the tree based on distance criterion 50. This clustering aimed to capture the coarse structure of the hierarchy. For each of the resulting clusters, we calculated the enrichment score as the mean over the cluster divided by the total sum and selected the 1.5N top genes. These were added to the previously selected genes. Finally, we built the dendrogram using linkage (correlation distance and Ward method).

Spatial correlation analysis

The Allen brain institute atlas was summarize into a 200um cube voxel dataset, providing the gene expression profile of all genes for each voxel. In this analysis we use simple correlation between the voxel gene expression (from in-situ hybridization) and the cluster gene expression profile (from scRNAseq). Each gene has an “energy” value representing the expression in each voxel. In order to achieve finer resolution and smoother images we used linear interpolation of the coarse (200um) in-situ data into the Allen brain reference atlas finer 25um grid. For annotation we always kept the color code of the Allen reference atlas. Genes in each domain with high quality in-situ were selected using a thresholding procedure. From these were selected the genes that also were selected during dendrogram construction, and the remaining were used to calculate correlation between each and voxel and each cell type. Finally a region fold enrichment was calculated: for each cell-types take the top 100 pixels and calculate the fold enrichment of the region ID that are among them by normalizing to frequency within the 100 to the overall frequency of each region ID.