Infer network

circe.circe.average_alpha(adata, window_size=500000, unit_distance=1000, n_samples=100, n_samples_maxtry=500, max_alpha_iteration=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, chromosomes_sizes=None, init_method='precomputed', seed=42, verbose=False, *, client: Client | None = None, n_workers: int = 1, threads_per_worker: int = 1)

Estimate the global sparsity-penalty coefficient α used by sliding graphical lasso on scATAC-seq data.

The function samples n_samples genomic windows, fits an “individual” α for each window (via local_alpha()) and returns their average. Windows that do not satisfy quality criteria (size < max_elements, <5 % long-range edges, <20 % co-accessible regions) are skipped.

Parallel implementation

One genomic window = one task executed in a Dask cluster.
The full anndata.AnnData object is broadcast to workers.
Tasks stream back through dask.distributed.as_completed(); as soon as n_samples usable α’s are collected, the remaining tasks are cancelled.
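The "collect until enough, then cancel the rest" pattern described above can be sketched with the standard library. This is only an illustration: it uses concurrent.futures as a stand-in for dask.distributed, and fit_window and collect_alphas are hypothetical names, not part of CIRCE.

```python
import math
from concurrent.futures import ThreadPoolExecutor, as_completed


def fit_window(i):
    """Hypothetical per-window fit: returns an alpha, or NaN for a rejected window."""
    return float("nan") if i % 3 == 0 else 0.1 * i


def collect_alphas(n_candidates, n_samples, max_workers=4):
    """Stream results back and stop as soon as n_samples usable values exist."""
    alphas = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fit_window, i) for i in range(n_candidates)]
        for fut in as_completed(futures):
            alpha = fut.result()
            if not math.isnan(alpha):
                alphas.append(alpha)
            if len(alphas) >= n_samples:
                # Enough usable values: cancel every task that has not started yet.
                for other in futures:
                    other.cancel()
                break
    return alphas


alphas = collect_alphas(n_candidates=50, n_samples=5)
print(len(alphas))
```

Tasks that are already running when the loop stops are allowed to finish; only pending ones are cancelled.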

Parameters:

adata (anndata.AnnData) – Input accessibility matrix with var containing at least the columns chromosome, start and end (0-based, half-open).
window_size (int, optional) – Genomic size (bp) of the sliding window. The default is 500_000.
unit_distance (int, optional) – Unit (bp) used to rescale genomic distances prior to penalty weighting. The default is 1_000.
n_samples (int, optional) – Number of windows retained to compute the average α. The default is 100.
n_samples_maxtry (int, optional) – Maximum number of candidate windows to inspect in order to obtain n_samples valid ones. The default is 500.
max_alpha_iteration (int, optional) – Maximum iterations in the fixed-point search performed by local_alpha(). The default is 100.
s (float, optional) – Long-range penalty exponent (organism specific). The default is 0.75.
distance_constraint (int, optional) – Threshold (bp) above which an edge is considered long-range. The default is 250_000.
distance_parameter_convergence (float, optional) – Convergence criterion for α. The default is 1e-22.
max_elements (int, optional) – Upper bound on the number of regions (columns) allowed in a window. The default is 200.
chromosomes_sizes (dict, optional) – Mapping {chromosome: size_in_bp}. By default the maximum end coordinate found in adata.var is used for each chromosome.
init_method ({“precomputed”, …}, optional) – Initialisation method forwarded to local_alpha(). The default is “precomputed”.
seed (int, optional) – Random seed used for the window shuffle. The default is 42.
verbose (bool, optional) – Emit warnings when fewer than n_samples usable windows are found. The default is False.

Parallel-execution options:

client (dask.distributed.Client, optional) – Existing Dask client / cluster. When None (default), a local cluster is started with the resources below and shut down on exit.
n_workers (int, optional) – Number of worker processes in the auto-started local cluster (ignored if client is provided). The default is 1.
threads_per_worker (int, optional) – Number of OS threads per worker process. The default is 1.

Returns:

alpha – Mean sparsity-penalty coefficient across the selected windows. nan if no window satisfied the criteria.

Return type:

float

Warning

A UserWarning is issued (when verbose=True) if fewer than n_samples windows pass the filters.

dask[distributed], rich and anndata >= 0.9 must be available.
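The documented contract (seeded shuffle, at most n_samples_maxtry candidates, mean of the usable values, nan when none pass, optional UserWarning) could look roughly like the serial sketch below. The name average_valid and the fit callback, which returns None for a rejected window, are assumptions made for illustration; the real function runs its windows through Dask.

```python
import math
import random
import warnings


def average_valid(candidates, fit, n_samples=100, n_samples_maxtry=500,
                  seed=42, verbose=False):
    """Average the first n_samples usable alphas among shuffled candidates."""
    order = list(candidates)
    random.Random(seed).shuffle(order)  # reproducible window order
    alphas = []
    for window in order[:n_samples_maxtry]:
        alpha = fit(window)  # hypothetical: None means "window rejected"
        if alpha is not None:
            alphas.append(alpha)
        if len(alphas) == n_samples:
            break
    if verbose and len(alphas) < n_samples:
        warnings.warn(
            f"only {len(alphas)} of {n_samples} windows passed the filters",
            UserWarning,
        )
    return sum(alphas) / len(alphas) if alphas else float("nan")


mean_alpha = average_valid(range(20), lambda w: 0.5 if w % 2 == 0 else None,
                           n_samples=4)
print(mean_alpha)
```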

circe.circe.calc_penalty(alpha, distance, unit_distance=1000, s=0.75)

Calculate distance penalties for graphical lasso, based on the formula from Cicero’s paper: alpha * (1 - (unit_distance / distance) ** s), where s defaults to 0.75.

Non-finite and negative values are replaced by 0.

Parameters:

alpha (float) – Penalty coefficient.
distance (array) – Distance between regions.
unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000 for 1kb (as in Cicero’s paper).
s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.

Returns:

penalties – Penalty coefficients for graphical lasso.

Return type:

np.array
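As a rough NumPy sketch of the documented behaviour (not CIRCE's actual code), the penalty can be written with the exponent taken from the s parameter, which is assumed here to be what the formula's 0.75 stands for. penalty_sketch is a hypothetical name.

```python
import numpy as np


def penalty_sketch(alpha, distance, unit_distance=1000, s=0.75):
    """Distance-dependent penalty: alpha * (1 - (unit_distance / distance) ** s)."""
    distance = np.asarray(distance, dtype=float)
    # Zero or negative distances produce inf/nan; silence the warnings and
    # clean those entries up afterwards.
    with np.errstate(divide="ignore", invalid="ignore"):
        penalties = alpha * (1 - (unit_distance / distance) ** s)
    # Non-finite and negative values are replaced by 0, as documented.
    return np.where(np.isfinite(penalties) & (penalties > 0), penalties, 0.0)


penalties = penalty_sketch(2.0, [0, 500, 1000, 16000])
print(penalties)
```

Pairs closer than unit_distance get no penalty, and the penalty approaches alpha as the distance grows.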

circe.circe.chr_batch_graphical_lasso(chr_X, chr_var, chromosome, alpha, unit_distance, window_size, init_method, max_elements, n=0, njobs=1, disable_tqdm=False)

circe.circe.compute_atac_network(adata, window_size=None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, key='atac_network', seed=42, njobs=1, threads_per_worker=1, verbose=0, chromosomes_sizes=None)

Compute co-accessibility scores between regions in a sparse matrix, stored in the varp slot of the passed anndata object. Scores are computed using ‘sliding_graphical_lasso’.

1. First, the function calculates the optimal penalty coefficient alpha. Alpha is the average of the alpha values fitted on ‘n_samples’ windows, where each window’s alpha is fitted so that fewer than 5% of possible long-range edges (> distance_constraint) and fewer than 20% co-accessible regions (regardless of distance constraint) remain in that window.

2. Then, it calculates co-accessibility scores between regions in a sliding window of size ‘window_size’ and step ‘window_size/2’.

3. Finally, it averages co-accessibility scores across windows.

Results should be very similar to Cicero’s: the co-accessibility scores correlate strongly with those computed by Cicero. However, absolute values are not the same, because Cicero uses a different method to apply Graphical Lasso.
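To make the half-window step concrete, here is a small sketch of how windows of size window_size and step window_size/2 tile a chromosome. The function name and the choice to clamp the last windows at the chromosome end are assumptions for illustration only.

```python
def sliding_windows(chrom_size, window_size=500_000):
    """Return (start, end) pairs with a step of half a window."""
    step = window_size // 2
    windows = []
    start = 0
    while start < chrom_size:
        # Assumption: windows running past the chromosome end are clamped.
        windows.append((start, min(start + window_size, chrom_size)))
        start += step
    return windows


windows = sliding_windows(1_000_000)
print(windows)
```

With this tiling, most positions fall into two overlapping windows, which is why scores need to be averaged across windows afterwards.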

Parameters:

adata (anndata object) – anndata object with var_names as region names.
window_size (int, optional) – Size of sliding window, in which co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.
unit_distance (int, optional) – Distance between two regions in the matrix, in base pairs. The default is 1000.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.
s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.
organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.
max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.
n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.
key (str, optional) – Key to store the results in adata.varp. The default is “atac_network”.
seed (int, optional) – Seed for random number generator. The default is 42.
njobs (int, optional) – Number of jobs to run in parallel. The default is 1.
threads_per_worker (int, optional) – Number of threads per worker. The default is 1.
verbose (int, optional) – Verbosity level: 0, no output at all; 1, tqdm progress bar; 2, detailed output. The default is 0.
chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.

Return type:

None.

circe.circe.get_distances_regions(adata)

Get the distances between the regions (var_names) of an anndata object. ‘add_region_infos’ should be run before this function.

Parameters:: adata (anndata object) – anndata object with var_names as region names.
Returns:: distance – Distance between regions.
Return type:: np.array

circe.circe.get_distances_regions_from_dataframe(df)

Get the distances between the regions (var_names) of a dataframe object. ‘add_region_infos’ should be run before this function.

Parameters:: df (pd.DataFrame) – Dataframe with var_names as region names.
Returns:: distance – Distance between regions.
Return type:: np.array
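The text above does not say how the distance between two regions is measured. One plausible reading, shown here purely as an assumption, is the absolute distance between region midpoints, computed for every pair at once with NumPy broadcasting. pairwise_midpoint_distances is a hypothetical helper, not a CIRCE function.

```python
import numpy as np


def pairwise_midpoint_distances(starts, ends):
    """Pairwise |midpoint_i - midpoint_j| for regions given by start/end arrays."""
    mid = (np.asarray(starts, dtype=float) + np.asarray(ends, dtype=float)) / 2
    # Broadcasting a column against a row yields the full n x n matrix.
    return np.abs(mid[:, None] - mid[None, :])


dist = pairwise_midpoint_distances([0, 1000, 5000], [500, 1500, 6000])
print(dist)
```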

circe.circe.local_alpha(X, zrow, distances, maxit=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, unit_distance=1000, init_method='precomputed')

Calculate optimal penalty coefficient alpha for a given window. The alpha coefficient is fitted based on the number of long-range edges (> distance_constraint) and short-range edges in the window.

Parameters:

X (np.array) – Matrix of regions in a window.
zrow (int) – Number of rows removed from X (0 if no rows filled with zeros). It will be used to correct covariance matrix once calculated from only non-zero rows.
distances (np.array) – Distance between regions in the window.
maxit (int, optional) – Maximum number of iterations to converge alpha. The default is 100.
s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is 250000 (Human/Mouse value). This parameter is organism specific and usually half of window_size.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000.
init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.

Returns:

distance_parameter – Optimal penalty coefficient alpha.

Return type:

float
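The two quantities that drive the fit, the share of long-range edges and the share of co-accessible pairs, can be measured on a fitted matrix roughly as follows. This is a sketch under assumptions: edge_fractions is a hypothetical helper, and "edge" is taken to mean a non-zero off-diagonal entry.

```python
import numpy as np


def edge_fractions(precision, distances, distance_constraint=250_000):
    """Return (fraction of long-range pairs with an edge, fraction of all pairs with an edge)."""
    precision = np.asarray(precision, dtype=float)
    distances = np.asarray(distances, dtype=float)
    iu = np.triu_indices_from(precision, k=1)  # each pair counted once
    nonzero = precision[iu] != 0
    long_range = distances[iu] > distance_constraint
    n_long = long_range.sum()
    frac_long = (nonzero & long_range).sum() / n_long if n_long else 0.0
    frac_all = nonzero.mean() if nonzero.size else 0.0
    return float(frac_long), float(frac_all)


precision = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
distances = [[0, 100_000, 300_000], [100_000, 0, 200_000], [300_000, 200_000, 0]]
frac_long, frac_all = edge_fractions(precision, distances)
print(frac_long, frac_all)
```

A search over alpha would then raise or lower the penalty until both fractions fall under their targets (5% and 20% according to the descriptions in this section).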

circe.circe.quiet_dask(verbose: int)

verbose = 0 → completely mute WARNINGS from Dask
verbose = 1 → keep WARNINGS (default)
verbose ≥ 2 → keep everything (INFO / DEBUG)
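A mapping like this one can be expressed with the standard logging module. The sketch below is not the library's implementation: quiet_logs is a hypothetical name, and both the use of the "distributed" logger and the choice of ERROR as the "mute warnings" level are assumptions.

```python
import logging


def quiet_logs(verbose, logger_name="distributed"):
    """Map a verbosity integer to a logging level and apply it to one logger."""
    if verbose <= 0:
        level = logging.ERROR    # hides WARNING and below
    elif verbose == 1:
        level = logging.WARNING  # keep warnings
    else:
        level = logging.DEBUG    # keep everything (INFO / DEBUG)
    logging.getLogger(logger_name).setLevel(level)
    return level


applied = quiet_logs(0)
print(logging.getLevelName(applied))
```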

circe.circe.reconcile(results_gl, idx_gl, idy_gl)

circe.circe.single_graphical_lasso(idx, X, zrow, anndata_var, alpha, unit_distance, init_method, map_indices)

circe.circe.sliding_graphical_lasso(adata, window_size: int | None = None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, init_method='precomputed', verbose=0, seed=42, njobs=1, threads_per_worker=1, chromosomes_sizes: dict | None = None)

Estimate co-accessibility scores between regions penalized on distance. The function uses graphical lasso to estimate the precision matrix of the co-accessibility scores. The function uses a sliding window approach.

The function calculates an optimal penalty coefficient alpha for each window, based on the distance between regions in the window. The function then calculates co-accessibility scores between regions in each window using graphical lasso. The results are averaged across windows.

WARNING: the code may look generalised to any number of window overlaps, but it is not yet; this is why ‘start_sliding’ is hard-coded as a list of 2 values.

Parameters:

adata (AnnData object) – AnnData object with var_names as region names.
window_size (int, optional) – Size of the sliding window, where co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.
unit_distance (int, optional) – Distance between two regions in the matrix, in base pairs. The default is 1000.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.
s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.
organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.
max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.
n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.
init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.
verbose (int, optional) – Verbosity level: 0, no output at all; 1, tqdm progress bar; 2, detailed output. The default is 0.
seed (int, optional) – Seed for random number generator. The default is 42.
njobs (int, optional) – Number of jobs to run in parallel. The default is 1.
threads_per_worker (int, optional) – Number of threads per worker. The default is 1.
chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.

Returns:

results – Dictionary with keys as window names and values as sparse matrices (csr) of co-accessibility scores.

Return type:

dict
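Averaging scores across overlapping windows can be illustrated with dense NumPy arrays. This is a simplified sketch: the real results are sparse matrices keyed by window name, and the input layout used here (a list of index-array/score-matrix pairs) as well as the name average_windows are assumptions.

```python
import numpy as np


def average_windows(n_regions, window_results):
    """Average per-window score matrices into one n_regions x n_regions matrix."""
    total = np.zeros((n_regions, n_regions))
    count = np.zeros((n_regions, n_regions))
    for idx, scores in window_results:
        idx = np.asarray(idx)
        block = np.ix_(idx, idx)  # rows and columns of this window's regions
        total[block] += np.asarray(scores, dtype=float)
        count[block] += 1
    # Pairs never seen together in a window keep a score of 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)


avg = average_windows(3, [
    ([0, 1], [[1.0, 0.4], [0.4, 1.0]]),
    ([1, 2], [[1.0, 0.2], [0.2, 1.0]]),
])
print(avg)
```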