# 🔧 Extended Preprocessing Guidance for CIRCE

Here are some tips and observations from our benchmarks on preprocessing choices for CIRCE ([Trimbour et al., 2026](https://doi.org/10.1093/bioinformatics/btag092)).

## 1. Count treatment: raw counts, binarization, count correction

Use **raw counts** (peak-by-cell matrix) by default. This preserves quantitative information about accessibility that seems to help CIRCE’s co-accessibility inference.

**Binarization can be tried**, especially if you want to reduce biases from highly accessible peaks or differences in coverage, but be aware that it did not yield any benefit in our CIRCE benchmarks.

**Avoid applying Cicero’s “count correction”** (or other heavy total-count normalization) prior to co-accessibility inference with CIRCE, unless you have a very strong reason and you benchmark performance: in the published report, it degraded results.

**Additional nuance / recommendation:** Before analyzing your dataset, consider doing quality filtering and cell QC (removing low-quality cells and potential doublets), which is standard in scATAC-seq preprocessing workflows. Several scATAC best-practice resources recommend filtering on metrics such as fragment counts, fraction of reads in peaks, and TSS enrichment. [You can check guidelines here](https://www.sc-best-practices.org/chromatin_accessibility/introduction.html).

### Notes on Cicero-Style Preprocessing and Why We Avoid It

Cicero’s recommended preprocessing pipeline historically includes **binarization** and a **normalization step** based on total accessibility counts per cell. Our evaluations, supported by published analyses, show that neither step is beneficial for CIRCE.

#### Binarization

Previous studies demonstrate that binarizing scATAC-seq counts does **not** improve data quality, statistical fit, or downstream biological interpretation.
- **Martens et al., 2023**
  *“Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration.”*
  DOI: https://doi.org/10.1038/s41592-023-02112-6

This aligns with our benchmarks: **binarized counts did not outperform raw counts** for co-accessibility inference.

#### Total-count normalization

Cicero-style correction also divides accessibility by a per-cell total count. This implicitly assumes that every cell should have the same total accessibility, an assumption that is unsupported both biologically and methodologically.

- **Kwok et al., 2025**
  *“Dividing by total count is a sound strategy for bulk sequencing [...]. However, in scATAC-seq data [...] the total count of each cell is different. Therefore, after TF (Ed.: Term Frequency transformation) transformation, the largest variation between cells will naturally be due to their denominators, that is, the total counts per cell or sequencing depth.”*
  DOI: https://doi.org/10.1186/s13059-025-03735-y

## 2. Input type: single-cell vs metacells / pseudobulk

**Single-cell inputs:** In the CIRCE benchmarks, single-cell resolution yielded better validation against promoter-capture Hi-C (PC-HiC) interactions ([Trimbour et al., 2026](https://doi.org/10.1093/bioinformatics/btag092)).

**Metacells:** Aggregating cells into metacells can reduce computational cost and mitigate sparsity, but you still need to generate enough of them; too few metacells may lead to a decreased AUC.

**When to use metacells:**
a. If the dataset is extremely large (tens to hundreds of thousands of cells) and memory/compute is limiting, or if individual cells are very sparse, metacells may be acceptable. You should still re-evaluate performance (e.g., link recovery, network robustness) against a single-cell run.
b. If you need both **negative and positive co-accessibility scores**, and you observe a **very high proportion of positive scores** with very few negative ones. This typically indicates a very high level of sparsity, and metacells help correct this bias.

#### Metacell algorithms: Dimensionality reduction & neighbor graph construction (“neighbors space”)

To identify neighboring cells (i.e. to define cell proximity / similarity), an LSI (latent semantic indexing) space preserves relative distances better than low-dimensional nonlinear embeddings such as UMAP or t-SNE.

**Rationale:** Nonlinear embeddings (UMAP/t-SNE) are optimized for visualization and can distort distances in a way that may not reflect the similarity structure relevant for co-accessibility. LSI (or another linear / distance-preserving dimensionality reduction) tends to be more robust for building the neighbor graph.

**Recommendation:** Use LSI (or PCA / other linear methods) to construct the neighbor graph. Avoid using UMAP or t-SNE for that purpose.

**Tip / extension:** Depending on dataset size and complexity, consider a small sweep over the number of LSI components (e.g. 50, 100, 200) to see how stable the inferred networks / CCANs are. This can help you choose a setting that is robust to technical noise and overfitting.

## 3. AnnData matrix format choice

CIRCE can handle both sparse and dense matrices. A dense matrix will be **faster to process**, since the graphical lasso model ultimately needs dense matrix chunks. On large atlases, where a dense matrix may not fit in memory, store your count matrix in **CSC format** ([Compressed Sparse Column](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html)) to accelerate column extraction in CIRCE (in AnnData's cells-by-peaks layout, columns are peaks).

## 4. Benchmarking new metacells/preprocessing strategies

Our benchmark was limited to a comparison of Cicero's and CIRCE's standard practices.
If you want to test your own preprocessing or metacell strategy, have a look at the [benchmark pipeline](https://www.github.com/cantinilab/circe_reproducibility) that we used to compare the different methods. :) You can simply add the Snakemake rule corresponding to your own method, then compare it to the other methods, using the PC-HiC data considered there as the ground truth.

_Don't hesitate to open a GitHub issue there if you need additional guidance._

## Summary

- Binarization: **no demonstrated benefit** for scATAC or CIRCE.
- Total-count normalization: **methodologically unsound** for sparse single-cell chromatin data.
- Recommended: **use raw counts, without Cicero-style correction**.
- Metacells: only when memory/compute is limiting or you need to correct for extremely sparse data, and after re-evaluating performance against a single-cell run.
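
As an illustration of the neighbor-space recommendation in section 2, the sketch below computes an LSI embedding (TF-IDF followed by a truncated SVD) and searches for nearest neighbors in that space, using plain NumPy. It is a minimal sketch and not CIRCE's implementation: the helper names `lsi_embedding` and `knn_indices`, the particular TF-IDF variant, the `1e4` scale factor, and the toy matrix are all assumptions made for illustration. The TF-IDF values are used only to find neighbors; the counts themselves are left untouched.

```python
import numpy as np


def lsi_embedding(counts, n_components=2, scale=1e4):
    """Hypothetical helper: TF-IDF followed by truncated SVD (LSI).

    counts: dense (n_cells, n_peaks) array of raw counts.
    Returns an (n_cells, n_components) embedding.
    """
    counts = np.asarray(counts, dtype=float)
    # Term frequency: each cell's counts divided by that cell's total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(cell_totals, 1.0)
    # Inverse document frequency: n_cells / number of cells where the peak is open.
    cells_per_peak = (counts > 0).sum(axis=0)
    idf = counts.shape[0] / np.maximum(cells_per_peak, 1)
    # One common log-scaled TF-IDF variant (assumed here, others exist).
    tfidf = np.log1p(tf * idf * scale)
    # Truncated SVD: keep the leading components, scaled by singular values.
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def knn_indices(embedding, k):
    """Hypothetical helper: brute-force k nearest neighbors (Euclidean)."""
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


# Toy data: cells 0-2 open peaks 0-2, cells 3-5 open peaks 3-5,
# with different sequencing depths within each group.
counts = np.array([
    [1, 1, 1, 0, 0, 0],
    [2, 2, 2, 0, 0, 0],
    [3, 3, 3, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 4, 4, 4],
])
embedding = lsi_embedding(counts, n_components=2)
neighbors = knn_indices(embedding, k=2)
print(neighbors)
```

On real data you would typically use a sparse truncated SVD and an approximate neighbor search rather than the dense all-pairs computation shown here, which only scales to small matrices.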