3. Building CellOracle base-GRN with CIRCE

# pip celloracle
# pip muon
# pip circe

The base GRN of CellOracle are based on cis-accesibility networks obtained from scATAC-seq data. CIRCE now allows users to build the base GRN using a single Python environment.
When CellOracle got published, Cicero, implemented in R, was the current state-of-the-art method to obtain these networks. (Tutorial: base GRN CellOracle¹)

import celloracle as co
which: no R in (/opt/gensoft/exe/Jupyter-Notebook/7.4.4/venv/bin:/opt/gensoft/exe/bedtools/2.31.1/scripts:/opt/gensoft/exe/bedtools/2.31.1/bin:/opt/gensoft/exe/Jupyter-Notebook/7.4.4/bin:/opt/gensoft/adm/Modules/bin:/opt/gensoft/adm/Modules/4.4.0/bin:/opt/hpc/slurm/current/bin:/opt/hpc/slurm/current/sbin:/opt/gensoft/exe/apptainer/1.3.5/scripts:/opt/gensoft/exe/apptainer/1.3.5/bin:/opt/gensoft/exe/apptainer/1.4.1/scripts:/opt/gensoft/exe/apptainer/1.4.1/bin:/opt/gensoft/exe/gcc/10.4.0/bin:/pasteur/appa/homes/rtrimbou/miniconda3/envs/snakemake/bin:/pasteur/appa/homes/rtrimbou/miniconda3/envs/snakemake/condabin:/pasteur/appa/homes/rtrimbou/.local/bin:/pasteur/appa/homes/rtrimbou/bin:/opt/gensoft/adm/Modules/bin:/opt/hpc/slurm/current/bin:/opt/hpc/slurm/current/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)

1. Importing scATAC-seq data

import muon as mu
import circe as ci

The data of this tutorial can be downloaded from this Zenodo repository If you have any trouble to install both CellOracle and CIRCE in the same environement, you try this conda env configuration : CIRCE+CellOracle

data = mu.read_h5mu("pbmc10x/pbmc10x.h5mu")

2. Inferring DNA regions network from CIRCE

atac = data["atac"]
atac, atac.var_names[:3]
atac.var_names = atac.var_names.str.replace('-', '_')
atac = ci.add_region_infos(atac, sep=('_', '_'))

2.a. Computing CIRCE metacells

#atac = ci.metacells.compute_metacells(atac)

2.b. Inferring DNA region interactions

ci.compute_atac_network(atac)
Calculating alpha ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:45 0:00:00

2025-09-19 09:12:45,763 - INFO - Remove client Client-e3ecd7f5-9527-11f0-a7a7-3ceceff9f2fe
2025-09-19 09:12:45,767 - INFO - Received 'close-stream' from tcp://127.0.0.1:48784; closing.
2025-09-19 09:12:45,768 - INFO - Remove client Client-e3ecd7f5-9527-11f0-a7a7-3ceceff9f2fe
2025-09-19 09:12:45,769 - INFO - Close client connection: Client-e3ecd7f5-9527-11f0-a7a7-3ceceff9f2fe
2025-09-19 09:12:45,771 - INFO - Retire worker addresses (stimulus_id='retire-workers-1758265965.7718186') (0,)
2025-09-19 09:12:45,781 - INFO - Received 'close-stream' from tcp://127.0.0.1:48782; closing.
2025-09-19 09:12:45,781 - INFO - Remove worker addr: tcp://127.0.0.1:36253 name: 0 (stimulus_id='handle-worker-cleanup-1758265965.7818978')
2025-09-19 09:12:45,782 - INFO - Lost all workers
2025-09-19 09:12:46,305 - INFO - Closing scheduler. Reason: unknown
2025-09-19 09:12:46,306 - INFO - Scheduler closing all comms
Calculating co-accessibility scores...
Concatenating results from all chromosomes...

2.c. Formating network output

circe_network = ci.extract_atac_links(atac)
circe_network = circe_network.rename(
    columns = {"score": "coaccess"}
)
circe_network
Peak1 Peak2 coaccess
0 chr16_54012525_54013025 chr16_54013534_54014034 0.669024
1 chr3_133264089_133264589 chr3_133264946_133265446 0.602914
2 chr2_109479794_109480294 chr2_109481034_109481534 0.467789
3 chr5_49599487_49599987 chr5_49600285_49600785 0.452878
4 chr2_88192716_88193216 chr2_88193738_88194238 0.446659
... ... ... ...
6129848 chr9_129777081_129777581 chr9_129869007_129869507 -0.146761
6129849 chr22_38570096_38570596 chr22_38953902_38954402 -0.148798
6129850 chr18_48896769_48897269 chr18_48940694_48941194 -0.148916
6129851 chr22_50469821_50470321 chr22_50542441_50542941 -0.150171
6129852 chr17_82126673_82127173 chr17_82216598_82217098 -0.162922

6129853 rows × 3 columns

3. Connecting DNA region to gene bodies

This notebook section correspond to the CellOracle documentation: Annotate Transcription Starting Sites²

3.1. TSS informations

!module load bedtools

⚠️You need to have bedtools available !
Otherwise, you should encounter

NotImplementedError: "intersectBed" does not appear to be installed or on the path, so this method is disabled.  Please install a more recent version of BEDTools and re-import to use this method.

If you’re on a HPC, you might just have to load the package before launching the notebook: module load bedtools

from celloracle import motif_analysis as ma
# Make sure to specify the correct reference genome here
tss_annotated = ma.get_tss_info(peak_str_list=atac.var_names.values, ref_genome="hg19")

# Check results
tss_annotated.tail()
que bed peaks: 215676
tss peaks in que: 6780
chr start end gene_short_name strand
6775 chr17 4834120 4834620 GP1BA +
6776 chr17 4835384 4835884 GP1BA +
6777 chr19 35484635 35485135 GRAMD1A +
6778 chr7 100545748 100546248 MUC3A +
6779 chr7 100546509 100547009 MUC3A +

3.2. Integrating TSS informations with cis-coaccessible network

integrated = ma.integrate_tss_peak_with_cicero(
    tss_peak=tss_annotated,
    cicero_connections=circe_network)
print(integrated.shape)
integrated.head()
(173536, 3)
peak_id gene_short_name coaccess
0 chr10_101291541_101292041 LINC01475 1.000000
1 chr10_101291541_101292041 NKX2-3 1.000000
2 chr10_101292051_101292551 LINC01475 0.145072
3 chr10_101292051_101292551 NKX2-3 1.000000
4 chr10_101297817_101298317 LINC01475 0.000032

3.3. Filtering peak connections on a specific threshold

⚠️ As noted in Nourisa et al, 2024 - bioXriv, the threshold of 0.8 might be too high to keep any enhancer - promoter connections.
You can plot the distribution of your score, to realise how many connections you’re keeping.

import matplotlib.pyplot as plt
%matplotlib inline
default_threshold = 0.8

threshold = circe_network.coaccess.quantile(0.95) # TO CHOOSE, e.g. 5% of the connections

fig, ax = plt.subplots(1)
circe_network.coaccess.plot.hist(ax=ax, bins=50, label="DNA co-accessibility scores")  # score distribution
ax.vlines(threshold, 0, ax.get_ylim()[1], colors="green", label="DNA co-accessiblity scores - top 5%")  # threshold line
ax.vlines(default_threshold, 0, ax.get_ylim()[1], colors="red", label="DNA co-accessiblity scores - above 0.8)")  # threshold line
plt.legend()
<matplotlib.legend.Legend at 0x7f40c5d5ca30>
../_images/17af7461cb6c4e2baccd78cb769319f7e9682bbcc1f3a4635f4e823258206f71.png
peaks = integrated[integrated.coaccess >= threshold]
peaks = peaks[["peak_id", "gene_short_name"]].reset_index(drop=True)

4. Finding TF binding sites

import celloracle as co
from celloracle import motif_analysis as ma
from celloracle.utility import save_as_pickled_object
co.__version__
'0.20.0'

4.1. Downloading genome - UCSC version

You need a genome sequence in order to identify the TF motifs present in the DNA regions. Genomepy manage the installation by default, but you can also

import genomepy

# This link should be your genome of interest location.
fa = "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/latest/hg19.fa.gz"

genomepy.install_genome(
    name=fa,
    provider="URL",
    localname="hg19",
)
Fasta("/pasteur/appa/homes/rtrimbou/.local/share/genomes/hg19/hg19.fa")

⚠️ If your reference genome file are installed in non-default location, please speficy the location using the parameter genomes_dir in the following functions.

Check that the genome has been well downloaded.

# Make sure reference genome has been correctly installed.
ref_genome = "hg19"

genome_installation = ma.is_genome_installed(ref_genome=ref_genome,
                                             genomes_dir=None)
print(ref_genome, "installation: ", genome_installation)
hg19 installation:  True
peaks = ma.check_peak_format(peaks, ref_genome, genomes_dir=None)
Peaks before filtering:  20089
Peaks with invalid chr_name:  0
Peaks with invalid length:  0
Peaks after filtering:  20089

4.2. Scanning for TF motifs

# Instantiate TFinfo object
tfi = ma.TFinfo(peak_data_frame=peaks,
                ref_genome=ref_genome,
                genomes_dir=None)
%%time
# Scan motifs. !!CAUTION!! This step may take several hours if you have many peaks!
tfi.scan(fpr=0.02,
         motifs=None,  # If you enter None, default motifs will be loaded.
         verbose=True)

# Save tfinfo object
tfi.to_hdf5(file_path="test1.celloracle.tfinfo")
No motif data entered. Loading default motifs for your species ...
 Default motif for vertebrate: gimme.vertebrate.v5.0. 
 For more information, please see https://gimmemotifs.readthedocs.io/en/master/overview.html 

Initiating scanner... 
2025-09-19 09:28:31,050 - DEBUG - using background: genome hg19 with size 200
Calculating FPR-based threshold. This step may take substantial time when you load a new ref-genome. It will be done quicker on the second time. 

Motif scan started .. It may take long time.
CPU times: user 7min 1s, sys: 5.29 s, total: 7min 7s
Wall time: 7min 14s
# Check motif scan results
tfi.scanned_df.head()
seqname motif_id factors_direct factors_indirect score pos strand
0 chr10_101291541_101292041 GM.5.0.RFX.0001 Rfx5 ARID2, Rfx6, Rfx8, RFX6, Rfx3, RFX5, RFX8, Rfx... 10.484102 260 1
1 chr10_101291541_101292041 GM.5.0.Nuclear_receptor.0005 NR4A2, NR1A4, RXRA, NR1H4, FXR Nr1h5, Nr1h4, Nr1h3, NR1H3 8.070402 454 -1
2 chr10_101291541_101292041 GM.5.0.Nuclear_receptor.0005 NR4A2, NR1A4, RXRA, NR1H4, FXR Nr1h5, Nr1h4, Nr1h3, NR1H3 7.774184 454 1
3 chr10_101291541_101292041 GM.5.0.Mixed.0008 ZBTB33, NR2C2 9.322807 448 1
4 chr10_101291541_101292041 GM.5.0.C2H2_ZF.0020 Egr1, EGR4, EGR2, Egr3, EGR3, EGR1 Egr1, Egr4, EGR4, EGR2, Egr3, EGR3, EGR1 5.976515 124 1

4.4. Transforming results into a TF x (peak, gene) dataframe

df = tfi.to_dataframe()
df.head()
peak_id gene_short_name 9430076C15RIK AC002126.6 AC012531.1 AC226150.2 AFP AHR AHRR AIRE ... ZNF784 ZNF8 ZNF816 ZNF85 ZSCAN10 ZSCAN16 ZSCAN22 ZSCAN26 ZSCAN31 ZSCAN4
0 chr10_101291541_101292041 LINC01475 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 chr10_101291541_101292041 NKX2-3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 chr10_101292051_101292551 LINC01475 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 chr10_101292051_101292551 NKX2-3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 chr10_101380430_101380930 SLC25A28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1095 columns

5. Saving the CellOracle base GRN

df.to_parquet("base_GRN_dataframe.parquet")

6. Fitting CellOracle GRN

This notebook section follows the CellOracle documentation: GRN model construction and network analysis³

⚠️ We shortened this part of the workflow to demonstrate the use of CIRCE for CellOracle GRN.
For all details on scRNA-seq data preprocessing, please refer to CellOracle - scRNA-seq data preparation.

During both data preprocessing and GRN inference, we make use of an Oracle object. This object leverages its built-in functions to calculate and store all required information for these steps. To start, we create a new Oracle instance and provide it with our gene expression dataset anndata along with the transcription factor information (the base GRN).

rna = data["rna"]
# Random downsampling into 30K cells if the anndata object include more than 30 K cells.
n_cells_downsample = 30000
if rna.shape[0] > n_cells_downsample:
    # Let's dowmsample into 30K cells
    sc.pp.subsample(rna, n_obs=n_cells_downsample, random_state=123)

6.1. Creating the CellOracle object

# Instantiate Oracle object
oracle = co.Oracle()

6.2. Preprocessing RNA data

import scanpy as sc
import numpy as np
from scipy import sparse

# 1) Keep a copy of the *raw counts* in a layer
rna.layers["raw_count"] = rna.X.copy()

# 2) (Optional) Now you can normalize/log for your own analyses without touching raw_count
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)

# 2.a) Keep the normalized counts in a layer
rna.layers["normalized_count"] = rna.X.copy()

# 3) Neighbors + clustering for 'louvain_annot'
sc.pp.pca(rna)
sc.pp.neighbors(rna)
sc.tl.louvain(rna, key_added="louvain_annot")  # or sc.tl.leiden(..., key_added="louvain_annot")
rna.obs["louvain_annot"] = rna.obs["louvain_annot"].astype("category")

# 4) ForceAtlas2 embedding (Scanpy's draw_graph)
sc.tl.draw_graph(rna)  # creates rna.obsm["X_draw_graph_fa"]

# 5) Point .X to the unscaled counts for Oracle (optional, Oracle will read from layers below)
rna.X = rna.layers["raw_count"].copy()

# 6) Import to Oracle
oracle.import_anndata_as_raw_count(
    adata=rna,
    cluster_column_name="louvain_annot",
    embedding_name="X_draw_graph_fr",
)
WARNING: adata.X seems to be already log-transformed.
WARNING: Package 'fa2-modified' is not installed, falling back to layout 'fr'.To use the faster and better ForceAtlas2 layout, install package 'fa2-modified' (`pip install fa2-modified`).
18410 genes were found in the adata. Note that Celloracle is intended to use around 1000-3000 genes, so the behavior with this number of genes may differ from what is expected.
WARNING: adata.X seems to be already log-transformed.
../_images/6fe3eff5b04cb1422b5f26d043515cff72552e72689a3a4340283a40d830553b.png

6.3. Loading prior base GRN

# You can load TF info dataframe with the following code.
oracle.import_TF_data(TF_info_matrix=df)
TF dict already exists. The old TF dict data was deleted. 

6.4. Calculating KNN for cell proximity and trajectories

Compute PCA and select a small number of components.

# Perform PCA
oracle.perform_PCA()

# Select important PCs
plt.plot(np.cumsum(oracle.pca.explained_variance_ratio_)[:100])
n_comps = np.where(np.diff(np.diff(np.cumsum(oracle.pca.explained_variance_ratio_))>0.002))[0][0]
plt.axvline(n_comps, c="k")
plt.show()
print(n_comps)
n_comps = min(n_comps, 50)
../_images/68533c3b9c31950438149e322716eb923e0e2c961d409b890ea5cf134be9ea3b.png
6

Defining a number of neighbors for KNN imputation.

n_cell = oracle.adata.shape[0]
print(f"cell number is :{n_cell}")

k = int(0.025*n_cell)
print(f"Auto-selected k is :{k}")
cell number is :9631
Auto-selected k is :240

Computing k-nearest neighbors.

oracle.knn_imputation(n_pca_dims=n_comps, k=k, balanced=True, b_sight=k*8,
                      b_maxl=k*4, n_jobs=4)
# Check clustering data
sc.pl.draw_graph(oracle.adata, color="louvain_annot")

6.5. Inferring cluster-specific GRN

%%time
# Calculate GRN for each population in "louvain_annot" clustering unit.
# This step may take some time.(~30 minutes)
links = oracle.get_links(cluster_name_for_GRN_unit="louvain_annot", alpha=10,
                         verbose_level=10)
Inferring GRN for 0...
Inferring GRN for 1...
Inferring GRN for 10...
Inferring GRN for 11...
Inferring GRN for 12...
Inferring GRN for 2...
Inferring GRN for 3...
Inferring GRN for 4...
Inferring GRN for 5...
Inferring GRN for 6...
Inferring GRN for 7...
Inferring GRN for 8...
Inferring GRN for 9...
CPU times: user 1h 22min 15s, sys: 4min 40s, total: 1h 26min 55s
Wall time: 1h 48min 31s
!conda env export > circe_celloracle_env.yml