🔧 Extended Preprocessing Guidance for CIRCE

Here are some tips and observations from our benchmarks on preprocessing choices for CIRCE (Trimbour et al., 2026).

1. Count treatment: raw counts, binarization, count correction

Use raw counts (peak-by-cell matrix) by default. This preserves quantitative information about accessibility that seems to help CIRCE’s co-accessibility inference.

Binarization can be tried, especially if you want to reduce biases from highly accessible peaks or differences in coverage, but in our CIRCE benchmarks it did not yield any benefit.

Avoid applying Cicero’s “count correction” (or other heavy total-count normalization) before co-accessibility inference with CIRCE: in the published report it degraded results. If you have a very strong reason to use it, benchmark its performance against raw counts first.
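If you want to compare the three count treatments on your own data, a minimal NumPy sketch is given below. It assumes a small dense matrix with cells as rows and peaks as columns (AnnData's obs-by-var layout; transpose if yours is stored the other way). The function names are illustrative and belong to neither CIRCE nor Cicero, and the total-count scaling is only a simplified stand-in for Cicero's correction, not a reimplementation of it.

```python
import numpy as np

def binarize(counts):
    """Return 1 where a peak has at least one count in a cell, else 0."""
    return (counts > 0).astype(counts.dtype)

def total_count_scale(counts):
    """Divide each cell (row) by its total count; all-zero cells stay zero.

    Simplified stand-in for a total-count correction (an assumption,
    not Cicero's actual procedure).
    """
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    out = np.zeros(counts.shape, dtype=float)
    return np.divide(counts, totals, out=out, where=totals > 0)

# Toy matrix: 3 cells (rows) x 4 peaks (columns); the last cell is empty.
raw = np.array([[0, 2, 1, 0],
                [3, 0, 0, 1],
                [0, 0, 0, 0]])

# One entry per treatment, so each can be fed to the same downstream run.
treatments = {
    "raw": raw,
    "binarized": binarize(raw),
    "total_count_scaled": total_count_scale(raw),
}
```

Keeping the treatments side by side in one dictionary makes it easy to run the same co-accessibility inference on each and compare the results on equal footing.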

Additional nuance / recommendation: Before analyzing your dataset, consider quality filtering and cell QC (removing low-quality cells and potential doublets), which is standard in scATAC-seq preprocessing workflows. Several scATAC best-practice resources recommend filtering on metrics such as fragment counts, fraction of reads in peaks, and TSS enrichment. You can check guidelines here
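As a rough illustration of metric-based cell filtering, the sketch below builds a boolean mask from three per-cell metrics. It assumes the metrics have already been computed by your QC tool of choice; the function name and every threshold are placeholders chosen for the toy data, not recommended values, and doublet detection is not covered.

```python
import numpy as np

def qc_mask(n_fragments, frip, tss_enrichment,
            min_fragments=1000, min_frip=0.2, min_tss=4.0):
    """Boolean mask of cells passing all three thresholds.

    All thresholds are hypothetical defaults; tune them per dataset.
    """
    n_fragments = np.asarray(n_fragments)
    frip = np.asarray(frip)
    tss_enrichment = np.asarray(tss_enrichment)
    return ((n_fragments >= min_fragments)
            & (frip >= min_frip)
            & (tss_enrichment >= min_tss))

# Toy per-cell metrics for 4 cells.
n_fragments = [5000, 300, 12000, 2500]   # fragments per cell
frip = [0.45, 0.30, 0.10, 0.35]          # fraction of reads in peaks
tss = [8.0, 6.5, 7.0, 2.0]               # TSS enrichment score

keep = qc_mask(n_fragments, frip, tss)
# Only the first cell passes: the others fail on fragments, FRiP and TSS respectively.
```

The resulting mask can be used to subset the rows of a cells-by-peaks matrix before running co-accessibility inference.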

Notes on Cicero-Style Preprocessing and Why We Avoid It

Cicero’s recommended preprocessing pipeline historically includes binarization and a normalization step based on total accessibility counts per cell. Our evaluations, supported by published analyses, show that neither step is beneficial for CIRCE.

Binarization

Previous studies demonstrate that binarizing scATAC-seq counts does not improve data quality, statistical fit, or downstream biological interpretation.

Martens et al., 2023
“Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration.”
DOI: https://doi.org/10.1038/s41592-023-02112-6

This aligns with our benchmarks: binarized counts did not outperform raw counts for co-accessibility inference.

Total-count normalization

Cicero-style correction also divides accessibility by a per-cell total count. This implicitly assumes that every cell should have the same total accessibility, an assumption that is unsupported both biologically and methodologically.

Kwok et al., 2025
“Dividing by total count is a sound strategy for bulk sequencing […]. However, in scATAC-seq data […] the total count of each cell is different. Therefore, after TF (Ed.: Term Frequency transformation) transformation, the largest variation between cells will naturally be due to their denominators, that is, the total counts per cell or sequencing depth.”
DOI: https://doi.org/10.1186/s13059-025-03735-y
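One way to read the quoted point is with a toy example, sketched here under strong simplifying assumptions: counts are near-binary because of sparsity, and a more deeply covered cell simply has more peaks detected. The matrix is invented for illustration and is not taken from either paper.

```python
import numpy as np

# Cells A and B share open peaks 0 and 1. B is covered more deeply,
# so two additional peaks (2 and 3) are detected in it.
counts = np.array([[1, 1, 0, 0, 0],    # cell A, total count 2
                   [1, 1, 1, 1, 0]])   # cell B, total count 4

totals = counts.sum(axis=1, keepdims=True)
tf = counts / totals   # term-frequency style scaling: divide by per-cell total

# On the shared peak 0, both cells had the same raw count (1),
# but after scaling A holds 0.5 and B holds 0.25.
shared_peak_values = tf[:, 0]
```

The raw counts on the shared peaks are identical, so the difference that appears after scaling comes only from the denominators, that is, from depth.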

2. Input type: single-cell vs metacells / pseudobulk

Single-cell inputs: In the CIRCE benchmarks, single-cell resolution yielded better validation against promoter-capture Hi-C (PC-HiC) interactions than metacell inputs (Trimbour et al., 2026).

Metacells: Aggregating cells into metacells can reduce computational cost and mitigate sparsity, but you still need to generate enough of them; too few metacells may lead to a decreased AUC.

When to use metacells:
a. If the dataset is extremely large (many tens or hundreds of thousands of cells) and memory or compute is limiting, or if individual cells are very sparse, metacells may be acceptable. You should still re-evaluate performance (e.g., link recovery, network robustness) against a single-cell run.
b. If you need both negative and positive co-accessibility scores but observe a very high proportion of positive scores and very few negative ones. This typically indicates a very high level of sparsity, and metacells help correct this bias.
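To check for the sign imbalance described in case b, a small helper such as the following could be used. It assumes you have extracted the co-accessibility scores into a flat array yourself; the function name is hypothetical, and no cutoff for "very high" is given here because none is defined in the benchmark.

```python
import numpy as np

def sign_balance(scores):
    """Return (positive fraction, negative fraction) among non-zero scores."""
    scores = np.asarray(scores, dtype=float)
    nonzero = scores[scores != 0]
    if nonzero.size == 0:
        return 0.0, 0.0
    pos = float((nonzero > 0).mean())
    return pos, 1.0 - pos

# Toy scores: four positive, one negative, one exactly zero (ignored).
scores = [0.8, 0.3, 0.0, 0.5, -0.1, 0.2]
pos_frac, neg_frac = sign_balance(scores)
```

Comparing these fractions between a single-cell run and a metacell run shows whether aggregation actually rebalances the scores on your data.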

Metacell algorithms: Dimensionality reduction & neighbor graph construction (“neighbors space”)

To identify neighboring cells (i.e. to define cell proximity or similarity), use the LSI (latent semantic indexing) space rather than low-dimensional nonlinear embeddings such as UMAP or t-SNE: it better preserves relative distances.

Rationale: Nonlinear embeddings (UMAP/t-SNE) are optimized for visualization and can distort distances in a way that may not reflect true similarity structure relevant for co-accessibility. LSI (or other linear/distance-preserving dimensionality reduction) tends to be more robust for neighbor-graph building.

Recommendation: Use LSI (or PCA / other linear methods) for constructing the neighbor graph. Avoid using UMAP or t-SNE for that purpose.
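The recommendation above can be sketched with plain NumPy: compute an LSI embedding (TF-IDF followed by a truncated SVD), find each cell's nearest neighbors in that space, and sum raw counts over each neighborhood. This is a generic illustration on simulated counts, not CIRCE's own metacell implementation; the function names, the particular TF-IDF variant, and the values of `n_components` and `k` are assumptions. Many pipelines also drop the first LSI component because it often tracks sequencing depth; that step is omitted here.

```python
import numpy as np

def lsi_embedding(counts, n_components):
    """TF-IDF followed by truncated SVD; returns cells x n_components."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    cells_per_peak = (counts > 0).sum(axis=0)
    idf = np.log1p(counts.shape[0] / np.maximum(cells_per_peak, 1))
    u, s, _ = np.linalg.svd(tf * idf, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def knn_indices(embedding, k):
    """Indices of the k nearest cells (Euclidean); each cell is its own first neighbor."""
    sq = (embedding ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist2, -np.inf)   # guarantee the cell itself sorts first
    return np.argsort(dist2, axis=1)[:, :k]

def aggregate_metacells(counts, neighbors):
    """Sum raw counts over each neighborhood: one metacell per seed cell."""
    return np.asarray(counts)[neighbors].sum(axis=1)

# Simulated sparse counts: 60 cells (rows) x 40 peaks (columns).
rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(60, 40))

emb = lsi_embedding(counts, n_components=10)       # neighbor space
nbrs = knn_indices(emb, k=5)                       # 5 cells per metacell
metacells = aggregate_metacells(counts, nbrs)      # still cells x peaks
```

Note that TF-IDF is used here only to define neighbors; the aggregated values are sums of raw counts, in line with the count-treatment advice in section 1.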

Tip / extension: Depending on dataset size and complexity, you might consider exploring different numbers of LSI components (e.g. carry out a small sweep: 50, 100, 200 LSI dims) to see how stable the inferred networks / CCANs are. This can help choose a setting robust to technical noise and overfitting.

3. AnnData matrix format choice

CIRCE can handle both sparse and dense matrices. A dense matrix will be faster to process, since the graphical lasso model ultimately needs dense matrix chunks. On large atlases, you should store your peak-by-cell matrix in CSC format (Compressed Sparse Column) to accelerate column extraction (i.e. per-cell operations) in CIRCE.
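A minimal sketch of the conversion with SciPy follows. It works on a simulated matrix rather than a real AnnData object; with AnnData, the same conversion would typically be applied to the matrix stored in `adata.X` (an assumption about your setup, not a CIRCE requirement stated elsewhere). What the columns represent depends on how your matrix is oriented.

```python
import numpy as np
from scipy import sparse

# Simulated sparse count matrix (1000 rows x 200 columns).
rng = np.random.default_rng(0)
dense = rng.poisson(0.05, size=(1000, 200))

csr = sparse.csr_matrix(dense)   # row-compressed storage
csc = csr.tocsc()                # same values, column-compressed storage

# Extract a few columns and densify them, as a model needing dense chunks would.
cols = [3, 10, 42]
chunk = csc[:, cols].toarray()
```

Column slicing gives the same values in either format; CSC only changes how quickly those columns can be retrieved.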

4. Benchmarking new metacells/preprocessing strategies

Our benchmark was limited to a comparison of Cicero and CIRCE’s standard practices.

If you want to test your own preprocessing or metacell strategy, have a look at the benchmark pipeline we used to compare the different methods. :)

You can then simply add the Snakemake rule corresponding to your own method, and compare it to the other methods and the PC-HiC data considered there as the ground truth. Don’t hesitate to open a GitHub issue there if you need additional guidance.

Summary

Binarization: no demonstrated benefit for scATAC or CIRCE.
Total-count normalization: methodologically unsound for sparse single-cell chromatin data.
Recommended: use raw counts, without Cicero-style correction.
Metacells: only if you need to correct for extremely sparse data or if memory/compute is limiting on very large datasets.