Infer network
- circe.circe.average_alpha(adata, window_size=500000, unit_distance=1000, n_samples=100, n_samples_maxtry=500, max_alpha_iteration=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, chromosomes_sizes=None, init_method='precomputed', seed=42, verbose=False, *, client: Client | None = None, n_workers: int = 1, threads_per_worker: int = 1)
Estimate the global sparsity-penalty coefficient α used by the sliding graphical lasso on scATAC-seq data.
The function samples n_samples genomic windows, fits an “individual” α for each window (via local_alpha()) and returns their average. Windows that do not satisfy the quality criteria (size < max_elements, < 5 % long-range edges, < 20 % co-accessible regions) are skipped.
Parallel implementation
One genomic window = one task executed in a Dask cluster.
The full anndata.AnnData object is broadcast to workers.
Tasks stream back through dask.distributed.as_completed(); as soon as n_samples usable α’s are collected, the remaining tasks are cancelled.
- param adata:
Input accessibility matrix with var containing at least the columns chromosome, start and end (0-based, half-open).
- type adata:
anndata.AnnData
- param window_size:
Genomic size (bp) of the sliding window.
- type window_size:
int, default 500_000
- param unit_distance:
Unit (bp) used to rescale genomic distances prior to penalty weighting.
- type unit_distance:
int, default 1_000
- param n_samples:
Number of windows retained to compute the average α.
- type n_samples:
int, default 100
- param n_samples_maxtry:
Maximum number of candidate windows to inspect in order to obtain n_samples valid ones.
- type n_samples_maxtry:
int, default 500
- param max_alpha_iteration:
Maximum iterations in the fixed-point search performed by local_alpha().
- type max_alpha_iteration:
int, default 100
- param s:
Long-range penalty exponent (organism specific).
- type s:
float, default 0.75
- param distance_constraint:
Threshold (bp) above which an edge is considered long-range.
- type distance_constraint:
int, default 250_000
- param distance_parameter_convergence:
Convergence criterion for α.
- type distance_parameter_convergence:
float, default 1e-22
- param max_elements:
Upper bound on the number of regions (columns) allowed in a window.
- type max_elements:
int, default 200
- param chromosomes_sizes:
Mapping {chromosome: size_in_bp}. By default the maximum end coordinate found in adata.var is used for each chromosome.
- type chromosomes_sizes:
dict, optional
- param init_method:
Initialisation method forwarded to local_alpha().
- type init_method:
{“precomputed”, …}, default “precomputed”
- param seed:
Random seed used for the window shuffle.
- type seed:
int, default 42
- param verbose:
Emit warnings when fewer than n_samples usable windows are found.
- type verbose:
bool, default False
Parallel-execution options
- param client:
Existing Dask client / cluster. When None (default) a local cluster is started with the resources below and shut down on exit.
- type client:
dask.distributed.Client, optional
- param n_workers:
Number of worker processes in the auto-started local cluster (ignored if client is provided).
- type n_workers:
int, default 1
- param threads_per_worker:
Number of OS threads per worker process.
- type threads_per_worker:
int, default 1
- returns:
alpha – Mean sparsity-penalty coefficient across the selected windows. nan if no window satisfied the criteria.
- rtype:
float
Warning
A UserWarning is emitted (when verbose=True) if fewer than n_samples windows pass the filters.
dask[distributed], rich and anndata >= 0.9 must be available.
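The streaming pattern described for average_alpha (collect results as tasks finish, stop once n_samples usable values are in, cancel the rest) can be sketched with the standard library. This is only an illustration: it uses concurrent.futures rather than Dask, and fit_window and average_first_usable are hypothetical names that stand in for the real per-window fit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import math
import random


def fit_window(window_id):
    """Hypothetical stand-in for the per-window fit: returns an alpha or None."""
    rng = random.Random(window_id)
    # Pretend roughly one window in three fails the quality filters.
    if rng.random() < 0.33:
        return None
    return rng.uniform(0.1, 0.5)


def average_first_usable(window_ids, n_samples, max_workers=4):
    """Average the first n_samples usable results and cancel the rest."""
    alphas = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fit_window, w) for w in window_ids]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                alphas.append(result)
            if len(alphas) >= n_samples:
                # Enough usable windows: cancel tasks that have not started.
                for pending in futures:
                    pending.cancel()
                break
    # Mirror the documented behaviour: nan when nothing usable was found.
    return sum(alphas) / len(alphas) if alphas else math.nan


alpha = average_first_usable(range(50), n_samples=10)
```

With Dask, the same loop shape would presumably iterate over dask.distributed.as_completed() and cancel the outstanding futures through the client.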
- circe.circe.calc_penalty(alpha, distance, unit_distance=1000, s=0.75)
Calculate distance penalties for graphical lasso, based on the formula from Cicero’s paper: alpha * (1 - (unit_distance / distance) ** s).
Non-finite and negative values are replaced by 0.
- Parameters:
alpha (float) – Penalty coefficient.
distance (array) – Distance between regions.
unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000 for 1kb (as in Cicero’s paper).
s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.
- Returns:
penalties – Penalty coefficients for graphical lasso.
- Return type:
np.array
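As a rough illustration of the penalty formula, the sketch below evaluates it on a plain Python list, with the exponent written as the parameter s. The name distance_penalty is hypothetical, and treating distances at or below zero as "no penalty" is an assumption made so that the pure-Python version matches the documented replacement of non-finite and negative values by 0.

```python
import math


def distance_penalty(alpha, distances, unit_distance=1000, s=0.75):
    """Sketch of alpha * (1 - (unit_distance / distance) ** s), clipped at 0."""
    penalties = []
    for d in distances:
        if d <= 0:
            # Assumption: a zero or negative distance has no finite penalty.
            penalties.append(0.0)
            continue
        value = alpha * (1 - (unit_distance / d) ** s)
        # Replace non-finite and negative values by 0.
        penalties.append(value if math.isfinite(value) and value > 0 else 0.0)
    return penalties


penalties = distance_penalty(alpha=0.2, distances=[0, 500, 1000, 16000, 250000])
```

Regions closer than unit_distance get no penalty, and the penalty approaches alpha as the distance grows.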
- circe.circe.chr_batch_graphical_lasso(chr_X, chr_var, chromosome, alpha, unit_distance, window_size, init_method, max_elements, n=0, njobs=1, disable_tqdm=False)
- circe.circe.compute_atac_network(adata, window_size=None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, key='atac_network', seed=42, njobs=1, threads_per_worker=1, verbose=0, chromosomes_sizes=None)
Compute co-accessibility scores between regions and store them as a sparse matrix in the varp slot of the passed anndata object. Scores are computed using ‘sliding_graphical_lasso’.
1. First, the function calculates the optimal penalty coefficient alpha. Alpha is calculated by averaging alpha values from ‘n_samples’ windows, chosen such that each window has less than 5% of possible long-range edges (> distance_constraint) and less than 20% co-accessible regions (regardless of distance constraint).
2. Then, it calculates co-accessibility scores between regions in a sliding window of size ‘window_size’ and step ‘window_size/2’.
3. Finally, it averages co-accessibility scores across windows.
Results should be very similar to Cicero’s: the co-accessibility scores calculated by this function correlate strongly with Cicero’s. However, absolute values are not the same, because Cicero uses a different method to apply Graphical Lasso.
- Parameters:
adata (anndata object) – anndata object with var_names as region names.
window_size (int, optional) – Size of sliding window, in which co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.
unit_distance (int, optional) – Unit distance (in base pairs) by which genomic distances are divided before the penalty is computed. The default is 1000.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.
s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.
organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.
max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.
n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.
key (str, optional) – Key to store the results in adata.varp. The default is “atac_network”.
seed (int, optional) – Seed for random number generator. The default is 42.
njobs (int, optional) – Number of jobs to run in parallel. The default is 1.
threads_per_worker (int, optional) – Number of threads per worker. The default is 1.
verbose (int, optional) – Verbose level. The default is 0.
- 0: no output at all
- 1: tqdm progress bar
- 2: detailed output
chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.
- Return type:
None.
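The sliding scheme above (windows of size window_size advancing by window_size/2) can be illustrated with a small generator. This is a sketch of the window geometry only. The name window_bounds is hypothetical, and how CIRCE enumerates windows internally is not shown in this reference.

```python
def window_bounds(chromosome_size, window_size=500_000):
    """Yield (start, end) for windows of window_size advancing by half a window."""
    step = window_size // 2
    start = 0
    while start < chromosome_size:
        # Clip the last windows to the chromosome end.
        yield start, min(start + window_size, chromosome_size)
        start += step


windows = list(window_bounds(1_200_000))
```

Because consecutive windows overlap by half, most region pairs fall in two windows, which is why scores are averaged across windows in the final step.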
- circe.circe.get_distances_regions(adata)
Get distances between the regions named in var_names of an anndata object. ‘add_region_infos’ should be run before this function.
- Parameters:
adata (anndata object) – anndata object with var_names as region names.
- Returns:
distance – Distance between regions.
- Return type:
np.array
- circe.circe.get_distances_regions_from_dataframe(df)
Get distances between the regions named in var_names of a dataframe object. ‘add_region_infos’ should be run before this function.
- Parameters:
df (pd.DataFrame) – Dataframe with var_names as region names.
- Returns:
distance – Distance between regions.
- Return type:
np.array
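For intuition about the distance matrix these two functions return, here is a minimal sketch. It assumes that distance is measured between region midpoints, which this reference does not specify, and it ignores chromosomes entirely. The function name pairwise_midpoint_distances is hypothetical.

```python
def pairwise_midpoint_distances(starts, ends):
    """Absolute distance between region midpoints, as a nested list."""
    # Assumption: one midpoint per region, all on the same chromosome.
    mids = [(s + e) / 2 for s, e in zip(starts, ends)]
    return [[abs(a - b) for b in mids] for a in mids]


dist = pairwise_midpoint_distances([0, 1000, 5000], [500, 1500, 6000])
```

The result is symmetric with a zero diagonal. The actual functions may treat the diagonal or pairs on different chromosomes differently.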
- circe.circe.local_alpha(X, zrow, distances, maxit=100, s=0.75, distance_constraint=250000, distance_parameter_convergence=1e-22, max_elements=200, unit_distance=1000, init_method='precomputed')
Calculate optimal penalty coefficient alpha for a given window. The alpha coefficient is fitted based on the number of long-range edges (> distance_constraint) and short-range edges in the window.
- Parameters:
X (np.array) – Matrix of regions in a window.
zrow (int) – Number of rows removed from X (0 if no rows filled with zeros). It will be used to correct covariance matrix once calculated from only non-zero rows.
distances (np.array) – Distance between regions in the window.
maxit (int, optional) – Maximum number of iterations to converge alpha. The default is 100.
s (float, optional) – Parameter for penalizing long-range edges. The default is 0.75 (Human/Mouse value). This parameter is organism specific.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is 250000 (Human/Mouse value). This parameter is organism specific and usually half of window_size.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
unit_distance (int, optional) – Unit distance (in base pair) to divide distance by. The default is 1000.
init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.
- Returns:
distance_parameter – Optimal penalty coefficient alpha.
- Return type:
float
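The edge counts that drive the alpha fit can be made concrete with a short sketch. It uses the 5 % and 20 % thresholds quoted in the average_alpha and compute_atac_network entries. The choice of denominators (all long-range pairs, and all pairs) is an assumption, and edge_fractions and passes_filters are hypothetical names.

```python
def edge_fractions(scores, distances, distance_constraint=250_000):
    """Return (long-range non-zero fraction, overall non-zero fraction)."""
    n = len(scores)
    long_total = long_nonzero = total = nonzero = 0
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each pair once
            total += 1
            is_edge = scores[i][j] != 0
            nonzero += is_edge
            if distances[i][j] > distance_constraint:
                long_total += 1
                long_nonzero += is_edge
    long_frac = long_nonzero / long_total if long_total else 0.0
    all_frac = nonzero / total if total else 0.0
    return long_frac, all_frac


def passes_filters(scores, distances, distance_constraint=250_000):
    """Check the assumed criteria: < 5 % long-range and < 20 % overall edges."""
    long_frac, all_frac = edge_fractions(scores, distances, distance_constraint)
    return long_frac < 0.05 and all_frac < 0.20


scores = [[1, 0.3, 0], [0.3, 1, 0], [0, 0, 1]]
distances = [[0, 1000, 300000], [1000, 0, 299000], [300000, 299000, 0]]
fractions = edge_fractions(scores, distances)
```

In this toy window, one of the three pairs is connected, so the overall fraction exceeds 20 % and the window would not pass under these assumptions.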
- circe.circe.quiet_dask(verbose: int)
- verbose = 0 → completely mute WARNINGS from Dask
- verbose = 1 → keep WARNINGS (default)
- verbose ≥ 2 → keep everything (INFO / DEBUG)
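A mapping of this kind can be sketched with the standard logging module. The function set_dask_verbosity is a hypothetical stand-in, not the body of quiet_dask, and the logger names it touches are assumptions. "distributed" is the root logger used by dask.distributed.

```python
import logging


def set_dask_verbosity(verbose):
    """Map a verbose level to a logging level on Dask-related loggers."""
    if verbose <= 0:
        level = logging.ERROR      # hide warnings
    elif verbose == 1:
        level = logging.WARNING    # keep warnings
    else:
        level = logging.DEBUG      # keep info and debug messages too
    # Assumed logger names; the real function may target others.
    for name in ("distributed", "dask"):
        logging.getLogger(name).setLevel(level)
    return level


quiet_level = set_dask_verbosity(0)
```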
- circe.circe.reconcile(results_gl, idx_gl, idy_gl)
- circe.circe.single_graphical_lasso(idx, X, zrow, anndata_var, alpha, unit_distance, init_method, map_indices)
- circe.circe.sliding_graphical_lasso(adata, window_size: int | None = None, unit_distance=1000, distance_constraint=None, s=None, organism=None, max_alpha_iteration=100, distance_parameter_convergence=1e-22, max_elements=200, n_samples=100, n_samples_maxtry=500, init_method='precomputed', verbose=0, seed=42, njobs=1, threads_per_worker=1, chromosomes_sizes: dict | None = None)
Estimate co-accessibility scores between regions, with a penalty that depends on the distance between them. The function uses graphical lasso to estimate the precision matrix of the co-accessibility scores, following a sliding window approach.
The function calculates an optimal penalty coefficient alpha for each window, based on the distance between regions in the window. The function then calculates co-accessibility scores between regions in each window using graphical lasso. The results are averaged across windows.
WARNING: the code might look generalised to any number of overlapping window offsets, but it is not yet; that is why ‘start_sliding’ is hard-coded as a list of 2 values.
- Parameters:
adata (AnnData object) – AnnData object with var_names as region names.
window_size (int, optional) – Size of the sliding window, where co-accessible regions can be found. The default is None and will be set to 500000 if organism is None. This parameter is organism specific.
unit_distance (int, optional) – Unit distance (in base pairs) by which genomic distances are divided before the penalty is computed. The default is 1000.
distance_constraint (int, optional) – Distance threshold for defining long-range edges. It is used to fit the penalty coefficient alpha. The default is None and will be set to 250000 if organism is None. This parameter is organism specific.
s (float, optional) – Parameter for penalizing long-range edges. The default is None and will be set to 0.75 if organism is None. This parameter is organism specific.
organism (str, optional) – Organism name. The default is None. If s, window_size and distance_constraint are None, will use organism-specific values. Otherwise, will use the values passed as arguments.
max_alpha_iteration (int, optional) – Maximum number of iterations to calculate optimal penalty coefficient. The default is 100.
distance_parameter_convergence (float, optional) – Convergence parameter for alpha (penalty) coefficient calculation. The default is 1e-22.
max_elements (int, optional) – Maximum number of regions in a window. The default is 200.
n_samples (int, optional) – Number of windows used to calculate optimal penalty coefficient alpha. The default is 100.
n_samples_maxtry (int, optional) – Maximum number of windows to try to calculate optimal penalty coefficient alpha. Should be higher than n_samples. The default is 500.
init_method (str, optional) – Method to use to compute initial covariance matrix. The default is “precomputed”. SHOULD BE CHANGED CAREFULLY.
verbose (int, optional) – Verbose level.
- 0: no output at all
- 1: tqdm progress bar
- 2: detailed output
seed (int, optional) – Seed for random number generator. The default is 42.
njobs (int, optional) – Number of jobs to run in parallel. The default is 1.
threads_per_worker (int, optional) – Number of threads per worker. The default is 1.
chromosomes_sizes (dict, optional) – Dictionary with chromosome sizes. If None, will use the maximum end position of each chromosome in adata.var. The default is None.
- Returns:
results – Dictionary with keys as window names and values as sparse matrices (csr) of co-accessibility scores.
- Return type:
dict
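Averaging scores across overlapping windows can be illustrated as follows. The sketch uses plain dictionaries keyed by region pairs in place of sparse matrices. The window names and peak names are invented for the example, and a simple mean is an assumption about how overlapping estimates are combined.

```python
from collections import defaultdict


def average_across_windows(window_scores):
    """Average each region pair's score over the windows that contain it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for scores in window_scores.values():
        for pair, value in scores.items():
            key = tuple(sorted(pair))  # treat (a, b) and (b, a) as one pair
            sums[key] += value
            counts[key] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


# Hypothetical window names and scores.
window_scores = {
    "chr1_0_500000": {("peak1", "peak2"): 0.4, ("peak2", "peak3"): 0.2},
    "chr1_250000_750000": {("peak3", "peak2"): 0.6},
}
averaged = average_across_windows(window_scores)
```

Here the pair (peak2, peak3) appears in both windows and receives the mean of its two scores, while (peak1, peak2) keeps its single value.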