Infer network

circe.circe.average_alpha(adata, window_size=500000, unit_distance=1000, n_samples=100, n_samples_maxtry=500, max_alpha_iteration=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, chromosomes_sizes=None, init_method='precomputed', seed=42, verbose=False, *, client: Client | None = None, n_workers: int = 1, threads_per_worker: int = 1)

Estimate the global sparsity-penalty coefficient α used by _sliding graphical lasso_ on scATAC-seq data.

The function samples n_samples genomic windows, fits an “individual” α for each window (via local_alpha()) and returns their average. Windows that do not satisfy quality criteria (size < max_elements, <5 % long-range edges, <20 % co-accessible regions) are skipped.

Parallel implementation

  • One genomic window = one task executed in a Dask cluster.

  • The full anndata.AnnData object is broadcast to workers.

  • Tasks stream back through dask.distributed.as_completed(); as soon as n_samples usable α’s are collected, the remaining tasks are cancelled.
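The collect-until-enough pattern described above can be illustrated with a minimal sketch. It swaps the Dask cluster for the standard library's `concurrent.futures` so that it runs without extra dependencies. `fit_window` and `average_alpha_sketch` are hypothetical names, and the rejection rule inside `fit_window` is an arbitrary stand-in for the real quality filters, not the library's logic.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def fit_window(window_id):
    """Hypothetical per-window fit: returns an alpha, or None if rejected."""
    # Arbitrary stand-in for the quality filters: reject every third window.
    if window_id % 3 == 0:
        return None
    return 0.1 + 0.01 * (window_id % 5)


def average_alpha_sketch(n_windows, n_samples, n_samples_maxtry,
                         seed=42, n_workers=2):
    # Shuffle candidate windows reproducibly and cap the number of attempts.
    candidates = list(range(n_windows))
    random.Random(seed).shuffle(candidates)
    candidates = candidates[:n_samples_maxtry]

    alphas = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(fit_window, w) for w in candidates]
        # Results stream back in completion order, not submission order.
        for future in as_completed(futures):
            alpha = future.result()
            if alpha is not None:
                alphas.append(alpha)
            if len(alphas) >= n_samples:
                # Enough usable windows: cancel whatever has not started yet.
                for pending in futures:
                    pending.cancel()
                break

    # Mirror the documented behaviour: nan when no window was usable.
    return sum(alphas) / len(alphas) if alphas else float("nan")


if __name__ == "__main__":
    print(average_alpha_sketch(n_windows=50, n_samples=10, n_samples_maxtry=30))
```

Because results arrive in completion order, the exact set of windows entering the average can vary from run to run in this sketch even with a fixed seed. The seed only fixes which candidates are submitted.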

param adata:

Input accessibility matrix with var containing at least the columns chromosome, start and end (0-based, half-open).

type adata:

anndata.AnnData

param window_size:

Genomic size (bp) of the sliding window.

type window_size:

int, default 500_000

param unit_distance:

Unit (bp) used to rescale genomic distances prior to penalty weighting.

type unit_distance:

int, default 1_000

param n_samples:

Number of windows retained to compute the average α.

type n_samples:

int, default 100

param n_samples_maxtry:

Maximum number of candidate windows to inspect in order to obtain n_samples valid ones.

type n_samples_maxtry:

int, default 500

param max_alpha_iteration:

Maximum iterations in the fixed-point search performed by local_alpha().

type max_alpha_iteration:

int, default 100

param s:

Long-range penalty exponent (organism specific).

type s:

float, default 0.75

param distance_constraint:

Threshold (bp) above which an edge is considered long-range.

type distance_constraint:

int, default 250_000

param distance_parameter_convergence:

Convergence criterion for α.

type distance_parameter_convergence:

float, default 1e-22

param max_elements:

Upper bound on the number of regions (columns) allowed in a window.

type max_elements:

int, default 200

param chromosomes_sizes:

Mapping {chromosome: size_in_bp}. By default the maximum end coordinate found in adata.var is used for each chromosome.

type chromosomes_sizes:

dict, optional

param init_method:

Initialisation method forwarded to local_alpha().

type init_method:

{“precomputed”, …}, default “precomputed”

param seed:

Random seed used for the window shuffle.

type seed:

int, default 42

param verbose:

Emit warnings when fewer than n_samples usable windows are found.

type verbose:

bool, default False

Parallel-execution options

param client:

Existing Dask client / cluster. When None (default) a local cluster is started with the resources below and shut down on exit.

type client:

dask.distributed.Client, optional

param n_workers:

Number of worker processes in the auto-started local cluster (ignored if client is provided).

type n_workers:

int, default 1

param threads_per_worker:

Number of OS threads per worker process.

type threads_per_worker:

int, default 1

returns:

alpha – Mean sparsity-penalty coefficient across the selected windows. nan if no window satisfied the criteria.

rtype:

float

Warning

A UserWarning is raised (when verbose=True) if fewer than n_samples windows pass the filters.

dask[distributed], rich and anndata >= 0.9 must be available.

circe.circe.calc_penalty(alpha, distance, unit_distance=1000, s=0.75)

Calculate distance penalties for graphical lasso, based on the formula from Cicero’s paper: alpha * (1 - (unit_distance / distance) ** s).

Non-finite and negative values are replaced by 0.
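As an illustration of the formula and of the clean-up rule, here is a minimal pure-Python sketch. `penalty_sketch` is a hypothetical name and works on plain lists, whereas the documented function returns an array. Treating non-positive distances as yielding a zero penalty is an assumption about how the "non-finite" case arises.

```python
import math


def penalty_sketch(alpha, distances, unit_distance=1000, s=0.75):
    """Distance penalty alpha * (1 - (unit_distance / d) ** s), floored at 0."""
    penalties = []
    for d in distances:
        if d <= 0:
            # Assumption: a zero (or negative) distance has no finite penalty,
            # so it is mapped to 0 like any other non-finite value.
            penalties.append(0.0)
            continue
        value = alpha * (1 - (unit_distance / d) ** s)
        if not math.isfinite(value) or value < 0:
            # Regions closer than unit_distance give a negative value.
            value = 0.0
        penalties.append(value)
    return penalties


if __name__ == "__main__":
    # 0 bp, 500 bp, 1 kb and 16 kb apart, with alpha = 1.
    print(penalty_sketch(1.0, [0, 500, 1000, 16000]))
```

With alpha = 1 and the default s = 0.75, two regions 16 kb apart receive a penalty of 1 − 16^(−0.75) = 0.875, while regions at or below 1 kb receive 0.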

Parameters:
  • alpha (float) – Penalty coefficient.

  • distance (array) – Distance between regions.

  • unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000 for 1kb (as in Cicero’s paper).

  • s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.

Returns:

penalties – Penalty coefficients for graphical lasso.

Return type:

np.array

circe.circe.chr_batch_graphical_lasso(chr_X, chr_var, chromosome, alpha, unit_distance, window_size, init_method, max_elements, n=0, njobs=1, disable_tqdm=False)
circe.circe.compute_atac_network(adata, window_size=None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, key='atac_network', seed=42, njobs=1, threads_per_worker=1, verbose=0, chromosomes_sizes=None)

Compute co-accessibility scores between regions in a sparse matrix, stored in the varp slot of the passed anndata object. Scores are computed using ‘sliding_graphical_lasso’.

  1. First, the function calculates the optimal penalty coefficient alpha.

    Alpha is calculated by averaging the alpha values from ‘n_samples’ windows, such that there are fewer than 5% of possible long-range edges (> distance_constraint) and fewer than 20% co-accessible regions (regardless of the distance constraint) in each window.

  2. Then, it calculates co-accessibility scores between regions in a sliding window of size ‘window_size’ and step ‘window_size/2’.

  3. Finally, it averages co-accessibility scores across windows.

Results should be very similar to Cicero’s results. There is a strong correlation between Cicero’s co-accessibility scores and the ones calculated by this function. However, absolute values are not the same, because Cicero uses a different method to apply Graphical Lasso.
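The window layout of the sliding step (size ‘window_size’, step ‘window_size/2’) might be sketched as follows. `sliding_windows` is a hypothetical helper, not part of the library. Building the windows as two passes of non-overlapping windows, with the second shifted by half a window, is an assumption about the implementation. It yields the same set of windows as a single pass with a half-window step.

```python
def sliding_windows(chromosome_size, window_size):
    """Return (start, end) bounds of half-overlapping windows on a chromosome."""
    half = window_size // 2
    windows = []
    # Two passes of non-overlapping windows: one starting at 0, one shifted
    # by half a window, so consecutive windows overlap by window_size / 2.
    for offset in (0, half):
        for start in range(offset, chromosome_size, window_size):
            end = min(start + window_size, chromosome_size)  # clip at the end
            windows.append((start, end))
    return sorted(windows)


if __name__ == "__main__":
    for start, end in sliding_windows(1_000_000, 500_000):
        print(start, end)
```

Apart from the regions in the first and last half-window of a chromosome, every region falls inside exactly two windows in this layout, which is why the scores need to be combined across windows afterwards.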

Parameters:
  • adata (anndata object) – anndata object with var_names as region names.

  • window_size (int, optional) – Size of sliding window, in which co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.

  • unit_distance (int, optional) – Unit distance (in base pairs) by which distances between regions are divided. The default is 1000.

  • distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.

  • s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.

  • organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.

  • max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.

  • distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.

  • max_elements (int, optional) – Maximum number of regions in a window. The default is 200.

  • n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.

  • n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.

  • key (str, optional) – Key to store the results in adata.varp. The default is “atac_network”.

  • seed (int, optional) – Seed for random number generator. The default is 42.

  • njobs (int, optional) – Number of jobs to run in parallel. The default is 1.

  • threads_per_worker (int, optional) – Number of threads per worker. The default is 1.

  • verbose (int, optional) –

    Verbose level.

    0: no output at all
    1: tqdm progress bar
    2: detailed output

    The default is 0.

  • chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.

Return type:

None.

circe.circe.get_distances_regions(adata)

Get the distances between the regions (var_names) of an anndata object. ‘add_region_infos’ should be run before this function.

Parameters:

adata (anndata object) – anndata object with var_names as region names.

Returns:

distance – Distance between regions.

Return type:

np.array

circe.circe.get_distances_regions_from_dataframe(df)

Get the distances between the regions (var_names) of a dataframe object. ‘add_region_infos’ should be run before this function.

Parameters:

df (pd.DataFrame) – Dataframe with var_names as region names.

Returns:

distance – Distance between regions.

Return type:

np.array

circe.circe.local_alpha(X, zrow, distances, maxit=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, unit_distance=1000, init_method='precomputed')

Calculate optimal penalty coefficient alpha for a given window. The alpha coefficient is fitted based on the number of long-range edges (> distance_constraint) and short-range edges in the window.

Parameters:
  • X (np.array) – Matrix of regions in a window.

  • zrow (int) – Number of rows removed from X (0 if no rows filled with zeros). It will be used to correct covariance matrix once calculated from only non-zero rows.

  • distances (np.array) – Distance between regions in the window.

  • maxit (int, optional) – Maximum number of iterations to converge alpha. The default is 100.

  • s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.

  • distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is 250000 (Human/Mouse value). This parameter is organism specific and usually half of window_size.

  • distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.

  • max_elements (int, optional) – Maximum number of regions in a window. The default is 200.

  • unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000.

  • init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.

Returns:

distance_parameter – Optimal penalty coefficient alpha.

Return type:

float

circe.circe.quiet_dask(verbose: int)

  • verbose = 0 → completely mute WARNINGS from Dask

  • verbose = 1 → keep WARNINGS (default)

  • verbose ≥ 2 → keep everything (INFO / DEBUG)
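One way such a verbosity mapping can be expressed with the standard `logging` module is sketched below. `quiet_loggers_sketch` is a hypothetical name. The logger names "distributed" and "dask", and the choice of ERROR as the muting level, are assumptions and are not taken from the library's source.

```python
import logging


def quiet_loggers_sketch(verbose, logger_names=("distributed", "dask")):
    """Map a verbosity level onto a logging threshold for the given loggers."""
    if verbose <= 0:
        level = logging.ERROR      # hide WARNING and below
    elif verbose == 1:
        level = logging.WARNING    # keep WARNING, hide INFO / DEBUG
    else:
        level = logging.DEBUG      # keep everything
    for name in logger_names:
        logging.getLogger(name).setLevel(level)
    return level


if __name__ == "__main__":
    quiet_loggers_sketch(0)
    print(logging.getLevelName(logging.getLogger("distributed").level))
```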

circe.circe.reconcile(results_gl, idx_gl, idy_gl)
circe.circe.single_graphical_lasso(idx, X, zrow, anndata_var, alpha, unit_distance, init_method, map_indices)
circe.circe.sliding_graphical_lasso(adata, window_size: int | None = None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, init_method='precomputed', verbose=0, seed=42, njobs=1, threads_per_worker=1, chromosomes_sizes: dict | None = None)

Estimate co-accessibility scores between regions, penalized on distance. The function uses graphical lasso with a sliding-window approach to estimate the precision matrix of the co-accessibility scores.

The function calculates an optimal penalty coefficient alpha for each window, based on the distance between regions in the window. The function then calculates co-accessibility scores between regions in each window using graphical lasso. The results are averaged across windows.

WARNING: the implementation might look generalised to any number of window overlaps, but it is not yet. That is why ‘start_sliding’ is hard-coded as a list of 2 values.
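The final averaging step can be pictured with a small sketch. The actual reconciliation rule is not documented here, so the plain arithmetic mean used below is an assumption, and `average_across_windows` is a hypothetical helper that works on dictionaries rather than on the sparse matrices the function returns.

```python
from collections import defaultdict


def average_across_windows(window_scores):
    """Average the scores of each region pair over the windows containing it.

    window_scores: iterable of dicts mapping (i, j) region-index pairs to a
    co-accessibility score, one dict per window.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in window_scores:
        for (i, j), value in scores.items():
            key = (min(i, j), max(i, j))  # treat (i, j) and (j, i) as one pair
            totals[key] += value
            counts[key] += 1
    # Assumption: a plain mean over the windows in which the pair was scored.
    return {key: totals[key] / counts[key] for key in totals}


if __name__ == "__main__":
    windows = [
        {(0, 1): 0.4, (1, 2): 0.2},   # first window
        {(2, 1): 0.6, (2, 3): 0.1},   # overlapping window, shares pair (1, 2)
    ]
    print(average_across_windows(windows))
```

With two hard-coded sliding offsets, a pair of regions is scored in at most two windows, so each average in this sketch involves one or two values.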

Parameters:
  • adata (AnnData object) – AnnData object with var_names as region names.

  • window_size (int, optional) – Size of the sliding window, where co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.

  • unit_distance (int, optional) – Unit distance (in base pairs) by which distances between regions are divided. The default is 1000.

  • distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.

  • s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.

  • organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.

  • max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.

  • distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.

  • max_elements (int, optional) – Maximum number of regions in a window. The default is 200.

  • n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.

  • n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.

  • init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.

  • verbose (int, optional) –

    Verbose level.

    0: no output at all
    1: tqdm progress bar
    2: detailed output

  • seed (int, optional) – Seed for random number generator. The default is 42.

  • njobs (int, optional) – Number of jobs to run in parallel. The default is 1.

  • threads_per_worker (int, optional) – Number of threads per worker. The default is 1.

  • chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.

Returns:

results – Dictionary with keys as window names and values as sparse matrices (csr) of co-accessibility scores.

Return type:

dict