API reference¶
Core¶
llm_sc_curator.core.LLMscCurator ¶
LLMscCurator(
api_key=None,
model_name=None,
backend=None,
allow_internal_normalization=False,
normalization_target_sum=10000.0,
)
Initialize LLM-scCurator.
Parameters:
-
api_key(str, default:None) –(Legacy / convenience) API key used to instantiate the default Gemini backend. Prefer passing an explicit backend instance for full control.
-
model_name(str, default:None) –(Legacy / convenience) Model identifier for the default Gemini backend. If not provided, a reasonable default is used.
-
backend(BaseLLMBackend, default:None) –Dependency-injected backend instance (e.g., GeminiBackend, OpenAIBackend). Must implement a generate(prompt: str, json_mode: bool) -> str interface.
-
allow_internal_normalization(bool, default:False) –If True, LLM-scCurator will internally normalize and log-transform inputs that look like raw counts (UMI-like). Internal normalization is performed on a COPY and recorded in adata.uns['llm_sc_curator'] for provenance.
-
normalization_target_sum(float, default:1e4) –Target total counts per cell used when allow_internal_normalization=True.
Notes
- This initializer enforces a backend-agnostic design: users may plug in cloud APIs or local models via the BaseLLMBackend interface.
- If neither backend nor api_key is provided, initialization fails fast.
set_global_context ¶
set_global_context(
adata,
balance_by=None,
max_cells_per_group="auto",
min_cells_per_group=50,
random_state=42,
)
Set the global dataset context used for specificity-aware feature distillation.
LLM-scCurator uses a global context to (i) estimate global Gini/sparsity trends,
(ii) detect ubiquitous housekeeping/stress programs, and (iii) support optional
cross-lineage leakage checks. When balance_by is provided, context statistics
are computed on a balanced subsample to prevent dominant groups from skewing
percentiles and thresholds.
Parameters:
-
adata(AnnData) –AnnData containing all cells. Can be raw counts or log1p-normalized; normalization is validated via _check_normalization.
-
balance_by(str or None, default:None) –Column in adata.obs defining groups for balanced subsampling (e.g., broad lineage labels or sample/patient IDs). If None, all cells are used.
-
max_cells_per_group(int or {"auto"}, default:"auto") –Target number of cells retained per group. If "auto", the per-group target is set to the median group size with a floor at min_cells_per_group.
-
min_cells_per_group(int, default:50) –Groups with fewer cells than this threshold are excluded from the balanced subsample to avoid unstable global statistics.
-
random_state(int, default:42) –Seed for subsampling.
Returns:
-
None–Initializes self.masker (a FeatureDistiller) and records configuration in self._global_context_config.
Notes
This method never modifies adata in-place. Any normalization or subsampling
occurs on internal copies.
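The "auto" rule described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name balanced_subsample and the plain list of labels are hypothetical stand-ins for adata.obs[balance_by], and it assumes small groups are dropped before the median is taken.

```python
import random
from collections import defaultdict
from statistics import median


def balanced_subsample(labels, max_cells_per_group="auto",
                       min_cells_per_group=50, random_state=42):
    """Return sorted cell indices for a balanced subsample (hypothetical helper)."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    # Exclude groups below the minimum size (assumed to happen before the median).
    kept = {g: idx for g, idx in groups.items() if len(idx) >= min_cells_per_group}
    if not kept:
        return []

    if max_cells_per_group == "auto":
        # Median group size, floored at min_cells_per_group.
        target = max(int(median(len(idx) for idx in kept.values())),
                     min_cells_per_group)
    else:
        target = int(max_cells_per_group)

    rng = random.Random(random_state)
    selected = []
    for g in sorted(kept, key=str):
        idx = kept[g]
        # Keep small groups whole; downsample large groups to the target.
        selected.extend(idx if len(idx) <= target else rng.sample(idx, target))
    return sorted(selected)


# 1000 T cells, 200 B cells, and 10 mast cells (too few, so excluded).
labels = ["T"] * 1000 + ["B"] * 200 + ["Mast"] * 10
picked = balanced_subsample(labels)
```

With these inputs the per-group target is the median of the retained group sizes (1000 and 200), so the dominant T-cell group is downsampled while the B-cell group is kept whole.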
curate_features ¶
curate_features(
target_adata,
group_col,
target_group,
reference="rest",
n_top=50,
use_statistics=True,
use_hvg=True,
coarse_col=None,
whitelist=None,
batch_key=None,
n_candidates=500,
min_target_mean=0.02,
min_delta_mean=0.02,
min_logfc=0.2,
min_target_pct=0.02,
min_delta_pct=0.02,
)
Perform 4-stage feature distillation for LLM prompting.
The goal is to produce a compact marker list that is enriched for lineage/state identity signals while suppressing ubiquitous biological programs (e.g., stress, housekeeping, cell cycle) and technical confounders. Candidate genes are drawn from HVGs (optionally batch-aware) with additional "high-Gini rescue" to retain lineage-restricted markers that may not be highly variable.
Parameters:
-
target_adata(AnnData) –AnnData object containing the target clusters (subset or full dataset).
-
group_col(str) –Column in target_adata.obs containing cluster labels.
-
target_group(str or int) –Cluster identifier to analyze.
-
reference(str, default:"rest") –Reference group used for differential expression. If "rest", the target cluster is contrasted against all other cells.
-
n_top(int, default:50) –Number of final markers to return.
-
use_statistics(bool, default:True) –If True, apply biological noise detection modules (regex-based) to mask confounder programs and rescue sentinel markers.
-
use_hvg(bool, default:True) –If True, restrict candidates to HVGs plus rescue sets (high-Gini + lineage markers).
-
coarse_col(str or None, default:None) –Optional column (in the global context) used for cross-lineage leakage checks.
-
whitelist(list[str] or None, default:None) –Optional list of genes exempt from masking (augmented internally with proliferation sentinels and lineage markers, excluding confounded TCR/Ig/Hb patterns).
-
batch_key(str or None, default:None) –Optional key for batch-aware HVG selection.
-
n_candidates(int, default:500) –Number of top-ranked DE candidates considered before masking/filtering.
-
min_target_mean(float, default:0.02) –Minimum mean expression (log1p) in the target cluster.
-
min_delta_mean(float, default:0.02) –Minimum mean difference (target - reference).
-
min_logfc(float, default:0.2) –Minimum log fold-change threshold if available in the DE table.
-
min_target_pct(float, default:0.02) –Minimum detection fraction in the target cluster.
-
min_delta_pct(float, default:0.02) –Minimum detection fraction difference (target - reference).
Returns:
-
list[str]–Distilled marker genes (length up to n_top). If all candidates are masked, the method falls back to unmasked DE rankings to avoid empty prompts.
Notes
Input data are validated for log1p normalization via _check_normalization.
For rigorous benchmarking and lineage leakage checks, call set_global_context()
on a broad atlas before per-cluster distillation.
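To make the threshold parameters concrete, here is a minimal sketch of how they could combine. It assumes the thresholds are applied conjunctively to a ranked DE table after masking; the function filter_candidates and the row keys (gene, target_mean, ref_mean, target_pct, ref_pct, logfc) are hypothetical and may not match the library's internal table.

```python
def filter_candidates(de_rows, masked, n_top=50, min_target_mean=0.02,
                      min_delta_mean=0.02, min_logfc=0.2,
                      min_target_pct=0.02, min_delta_pct=0.02):
    """Filter DE rows (already ordered by rank) into a marker list."""

    def passes(row):
        return (
            row["target_mean"] >= min_target_mean
            and row["target_mean"] - row["ref_mean"] >= min_delta_mean
            and row["target_pct"] >= min_target_pct
            and row["target_pct"] - row["ref_pct"] >= min_delta_pct
            # logFC is only checked when the DE table provides it.
            and ("logfc" not in row or row["logfc"] >= min_logfc)
        )

    kept = [r["gene"] for r in de_rows if r["gene"] not in masked and passes(r)]
    if not kept:
        # Fall back to the unmasked DE ranking to avoid an empty prompt.
        kept = [r["gene"] for r in de_rows]
    return kept[:n_top]


rows = [
    {"gene": "MT-CO1", "target_mean": 3.0, "ref_mean": 2.9,
     "target_pct": 0.99, "ref_pct": 0.99, "logfc": 0.05},
    {"gene": "CD3E", "target_mean": 1.5, "ref_mean": 0.1,
     "target_pct": 0.80, "ref_pct": 0.05, "logfc": 2.0},
    {"gene": "TRAC", "target_mean": 1.2, "ref_mean": 0.2,
     "target_pct": 0.70, "ref_pct": 0.10},
]
markers = filter_candidates(rows, masked={"MT-CO1"})
```

Here the mitochondrial gene is masked, and the two T-cell genes pass every threshold, so both are returned in rank order.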
annotate ¶
annotate(
gene_list,
cell_type="",
context=None,
use_auto_context=True,
max_retries=3,
retry_sleep=1.0,
)
Query the configured LLM backend with a distilled marker list and return a structured label.
The backend is called in JSON mode. Outputs are parsed into a stable dictionary schema to support reproducible downstream evaluation and logging.
Parameters:
-
gene_list(list[str]) –Distilled marker genes for a single cluster.
-
cell_type(str, default:"") –Optional parent lineage label to bias the prompt (e.g., coarse lineage).
-
context(dict or None, default:None) –Optional additional context (tissue, condition, dataset notes).
-
use_auto_context(bool, default:True) –If True and cell_type is not provided, use the last inferred lineage context when available.
-
max_retries(int, default:3) –Maximum number of attempts to obtain a valid JSON object.
-
retry_sleep(float, default:1.0) –Sleep duration (seconds) between retry attempts.
Returns:
-
dict–A dictionary with keys:
- cell_type : str. Predicted lineage/state label.
- confidence : {"High", "Medium", "Low"}. Self-reported confidence bucket.
- reasoning : str. Brief, marker-based justification.
On failure, returns a soft error object with cell_type set to "Error" or "ParseError" (rather than raising), so callers may decide how to handle it.
Notes
This method is intentionally tolerant of transient backend failures and occasional JSON formatting issues (e.g., fenced code blocks).
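The retry-and-parse behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the method's actual code: parse_json_reply and annotate_with_retries are hypothetical names, and using "ParseError" for every failure is a simplification of the documented "Error"/"ParseError" distinction.

```python
import json
import time

FENCE = "`" * 3  # Markdown code-fence marker that some models wrap JSON in


def parse_json_reply(text):
    """Parse a reply as a JSON object, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then a trailing fence if present.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    obj = json.loads(cleaned)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj


def annotate_with_retries(generate, prompt, max_retries=3, retry_sleep=1.0):
    """Call generate(prompt, json_mode=True) until a JSON object parses."""
    last_error = ""
    for attempt in range(max_retries):
        try:
            return parse_json_reply(generate(prompt, json_mode=True))
        except Exception as exc:  # backend failure or malformed JSON
            last_error = str(exc)
            if attempt < max_retries - 1:
                time.sleep(retry_sleep)
    # Soft error object instead of raising, so batch loops keep running.
    return {"cell_type": "ParseError", "confidence": "Low",
            "reasoning": last_error}
```

Callers can then branch on the returned cell_type value rather than wrapping each call in try/except.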
run_hierarchical_discovery ¶
run_hierarchical_discovery(
adata,
coarse_res=0.2,
fine_res=0.5,
n_top=50,
batch_key=None,
global_context=None,
random_state=42,
)
Run coarse-to-fine hierarchical annotation on an AnnData object.
The pipeline performs a coarse Leiden clustering to infer major lineages, then reclusters within each coarse group to infer finer subtypes. Distilled marker lists are generated per cluster and passed to the injected LLM backend for structured JSON outputs.
Parameters:
-
adata(AnnData) –Input dataset. .X must be log1p-normalized (validated via _check_normalization).
-
coarse_res(float, default:0.2) –Leiden resolution for the coarse (major lineage) clustering.
-
fine_res(float, default:0.5) –Leiden resolution for the fine (subtype) clustering within each coarse group.
-
n_top(int, default:50) –Marker list size passed to the LLM per cluster.
-
batch_key(str or None, default:None) –Optional key for batch-aware HVG selection.
-
global_context(dict or None, default:None) –Optional user-provided context to embed in the prompt (tissue/condition).
-
random_state(int, default:42) –Random seed used for PCA/neighbors/Leiden and subsampling where applicable.
Returns:
-
AnnData–A COPY/view with added annotations:
- adata.obs['major_type']: coarse lineage labels
- adata.obs['fine_type']: fine subtype labels
Reasoning logs are stored in adata.uns['llm_reasoning'].
Notes
HVGs are precomputed once when missing, to avoid redundant work across clusters.
Backends¶
llm_sc_curator.backends ¶
BaseLLMBackend ¶
Bases: ABC
Abstract base class for LLM backends.
This interface enforces a backend-agnostic contract:
- Input: a prompt string (optionally requesting strict JSON output).
- Output: a single string (either free-form text or a JSON object encoded as text).
Implementations may wrap cloud APIs (e.g., Gemini/OpenAI) or local models, but must
expose a stable generate() method to keep downstream annotation code unchanged.
generate abstractmethod ¶
generate(prompt, json_mode=False)
Generate a completion for the given prompt.
Parameters:
-
prompt(str) –Prompt text to send to the backend.
-
json_mode(bool, default:False) –If True, the backend should attempt to return a valid JSON object encoded as a string (not fenced code blocks).
Returns:
-
str–Model output as a string. In JSON mode, this should be a JSON object encoded as a string (e.g., '{"cell_type": "...", "confidence": "...", "reasoning": "..."}').
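As an illustration of the contract above, a custom backend only needs a generate(prompt, json_mode) method that returns a string. The sketch below defines a local stand-in for the abstract base class so that it runs without the package installed; in real use you would subclass the library's BaseLLMBackend instead. CannedBackend is a hypothetical name.

```python
import json
from abc import ABC, abstractmethod


class BaseLLMBackend(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def generate(self, prompt, json_mode=False):
        ...


class CannedBackend(BaseLLMBackend):
    """Hypothetical offline backend returning a fixed answer, e.g. for tests."""

    def __init__(self, label="Unknown"):
        self.label = label

    def generate(self, prompt, json_mode=False):
        if json_mode:
            # JSON mode: return a JSON object encoded as text, not fenced.
            return json.dumps({"cell_type": self.label,
                               "confidence": "Low",
                               "reasoning": "canned response"})
        return self.label


backend = CannedBackend("T cell")
reply = backend.generate("CD3D, CD3E, TRAC", json_mode=True)
```

A deterministic backend like this can be handy for exercising downstream table-building code without network access.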
GeminiBackend ¶
GeminiBackend(
api_key,
model_name="models/gemini-2.0-flash",
temperature=0.0,
)
Bases: BaseLLMBackend
Google Gemini backend for LLM-scCurator.
This backend wraps google-generativeai and supports an optional JSON mode by
setting the response MIME type to application/json.
Parameters:
-
api_key(str) –Gemini API key.
-
model_name(str, default:"models/gemini-2.0-flash") –Gemini model identifier.
-
temperature(float, default:0.0) –Sampling temperature. Use 0.0 to maximize determinism (recommended for JSON mode).
Raises:
-
ImportError–If google-generativeai is not installed.
Notes
On exceptions, this backend returns a soft error payload (string or JSON string) rather than raising, to keep downstream annotation pipelines robust.
Initialize the Gemini backend.
Parameters:
-
api_key(str) –Gemini API key.
-
model_name(str, default:"models/gemini-2.0-flash") –Gemini model identifier.
-
temperature(float, default:0.0) –Sampling temperature (0.0 recommended for deterministic JSON responses).
generate ¶
generate(prompt, json_mode=False)
Generate a completion from Gemini.
Parameters:
-
prompt(str) –Prompt text to send to Gemini.
-
json_mode(bool, default:False) –If True, request JSON output via response_mime_type="application/json".
Returns:
-
str–Model output string. If an exception occurs:
- In JSON mode: returns a JSON string with keys {cell_type, confidence, reasoning}.
- Otherwise: returns a human-readable error string prefixed with "Gemini Error:".
OpenAIBackend ¶
OpenAIBackend(
api_key, model_name="gpt-4o", temperature=0.0, seed=42
)
Bases: BaseLLMBackend
OpenAI Chat Completions backend for LLM-scCurator.
This backend uses the official openai Python package and supports JSON mode via
response_format={"type": "json_object"}.
Parameters:
-
api_key(str) –OpenAI API key.
-
model_name(str, default:"gpt-4o") –Model identifier.
-
temperature(float, default:0.0) –Sampling temperature (0.0 recommended for structured outputs).
-
seed(int, default:42) –Seed forwarded to the API when supported by the selected model. If the request fails with seed enabled, the backend retries once without the seed.
Raises:
-
ImportError–If the openai package is not installed.
Notes
In contrast to GeminiBackend, OpenAIBackend attempts a seed-based call first and falls back to a seed-free call if needed. On errors it returns a soft error payload (string or JSON string), which enables batch pipelines to continue.
Initialize the OpenAI backend.
Parameters:
-
api_key(str) –OpenAI API key.
-
model_name(str, default:"gpt-4o") –Model identifier.
-
temperature(float, default:0.0) –Sampling temperature (0.0 recommended for structured outputs).
-
seed(int, default:42) –Seed forwarded to the API when supported.
generate ¶
generate(prompt, json_mode=False)
Generate a completion from OpenAI Chat Completions.
Parameters:
-
prompt(str) –Prompt text to send to the OpenAI API.
-
json_mode(bool, default:False) –If True, request a JSON object response format.
Returns:
-
str–Model output string. If an exception occurs:
- In JSON mode: returns a JSON string with keys {cell_type, confidence, reasoning}.
- Otherwise: returns a human-readable error string prefixed with "OpenAI Error:".
Notes
The backend first attempts a request with seed enabled. If that fails (e.g.,
due to model/endpoint incompatibility), it retries once without seed before
returning a soft error payload.
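The seed-then-fallback pattern described in these notes can be sketched as follows. This is not the backend's code: call_api is a hypothetical callable standing in for the actual API request, and the "Error"/"Low" values in the JSON payload are assumed (the documentation only specifies the keys).

```python
import json


def generate_with_seed_fallback(call_api, prompt, json_mode=False, seed=42):
    """Try a seeded request first, then retry once without the seed.

    call_api(prompt, json_mode, seed) is a hypothetical wrapper around the
    real request; it should return the model text or raise on failure.
    """
    last_error = None
    for attempt_seed in (seed, None):
        try:
            return call_api(prompt, json_mode, attempt_seed)
        except Exception as exc:
            last_error = exc
    # Both attempts failed: return a soft error instead of raising.
    if json_mode:
        return json.dumps({"cell_type": "Error", "confidence": "Low",
                           "reasoning": str(last_error)})
    return f"OpenAI Error: {last_error}"
```

Returning a payload rather than raising keeps the caller's loop over clusters simple, at the cost of requiring callers to check for the error marker.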
OllamaBackend ¶
OllamaBackend(
host=None,
model_name=None,
temperature=None,
timeout=None,
)
Bases: BaseLLMBackend
Ollama local backend for LLM-scCurator.
This backend sends prompts to a locally running Ollama server via the REST
Chat API and returns the assistant response as a single string. It is a
drop-in implementation of BaseLLMBackend.
The primary use case is institutional / on-prem environments where outbound calls to cloud LLM APIs are restricted and inference must run locally.
Parameters:
-
host(str, default:None) –Base URL of the Ollama server. If not provided, the value is resolved in the following order:
1) Environment variable LLMSC_OLLAMA_HOST
2) Default: "http://ollama:11434" (Docker Compose friendly)
Trailing slashes are removed automatically.
-
model_name(str, default:None) –Ollama model name/tag (e.g., "llama3.1:8b" or "qwen2.5:7b-instruct"). If not provided, the value is resolved in the following order:
1) Environment variable LLMSC_OLLAMA_MODEL
2) Default: "llama3.1:8b"
-
temperature(float, default:None) –Sampling temperature forwarded to Ollama as options.temperature. Use 0.0 for more deterministic classification-style behavior. If not provided, defaults to environment variable LLMSC_OLLAMA_TEMPERATURE (default 0.0).
-
timeout(float, default:None) –Per-request timeout (seconds) for the HTTP call to Ollama. CPU-only inference can be slow; raise this value for large prompts. If not provided, defaults to environment variable LLMSC_OLLAMA_TIMEOUT (default 120).
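The resolution order for these four settings (explicit argument, then environment variable, then default) can be sketched as below. The helper resolve_ollama_settings is hypothetical and only mirrors the documented behavior; the environment variable names and defaults are taken from the parameter descriptions above.

```python
import os


def resolve_ollama_settings(host=None, model_name=None,
                            temperature=None, timeout=None):
    """Resolve settings as: explicit argument, environment variable, default."""
    host = host or os.environ.get("LLMSC_OLLAMA_HOST", "http://ollama:11434")
    model_name = model_name or os.environ.get("LLMSC_OLLAMA_MODEL", "llama3.1:8b")
    if temperature is None:
        temperature = float(os.environ.get("LLMSC_OLLAMA_TEMPERATURE", "0.0"))
    if timeout is None:
        timeout = float(os.environ.get("LLMSC_OLLAMA_TIMEOUT", "120"))
    return {
        "host": host.rstrip("/"),  # trailing slashes are removed
        "model_name": model_name,
        "temperature": temperature,
        "timeout": timeout,
    }


settings = resolve_ollama_settings(host="http://localhost:11434/")
```

Note that temperature and timeout are checked against None rather than truthiness, so an explicit 0.0 is respected.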
Notes
API endpoint
This backend uses the Ollama Chat endpoint:
- POST {host}/api/chat
JSON mode
If json_mode=True, the request includes format="json", which asks
Ollama to return a single JSON object (not fenced). The returned value is
still a string (JSON text), consistent with the BaseLLMBackend contract.
This backend performs a lightweight validation by attempting json.loads
on the returned content; on failure it returns a "soft error" JSON payload.
Failure behavior
On exceptions, the backend returns a soft error:
- In JSON mode: JSON text with keys cell_type, confidence, reasoning.
- Otherwise: a string prefixed with "Ollama Error:".
When wrapped by retry_with_backoff, these failures can be retried.
Examples:
Docker Compose (default host):
# export LLMSC_OLLAMA_MODEL=llama3.1:8b
from llm_sc_curator.backends import OllamaBackend
backend = OllamaBackend()
# Single quotes outside so the JSON example can use double quotes inside.
out = backend.generate('Return a JSON object: {"x": 1}', json_mode=True)
Local host:
backend = OllamaBackend(host="http://localhost:11434", model_name="llama3.1:8b")
out = backend.generate("Hello", json_mode=False)
generate ¶
generate(prompt, json_mode=False)
Generate a completion from Ollama.
Parameters:
-
prompt(str) –Prompt text to send to Ollama.
-
json_mode(bool, default:False) –If True, request a single JSON object response (returned as JSON text).
Returns:
-
str–Model output as a string.
- If json_mode=False: free-form assistant text.
- If json_mode=True: a JSON object encoded as text.
On failure:
- If json_mode=True: returns JSON text with keys cell_type="Error", confidence="Low", and reasoning.
- Otherwise: returns a string prefixed with "Ollama Error:".
LocalLLMBackend ¶
Bases: BaseLLMBackend
Placeholder backend for future local model integrations.
This backend currently returns a fixed JSON payload indicating that local inference is not implemented. It exists to document the intended extension point and to keep the public API stable.
Notes
Use this class as a template when integrating a local LLM runner (e.g., llama.cpp,
vLLM, or an on-premise service) into the BaseLLMBackend interface.
generate ¶
generate(prompt, json_mode=False)
Return a placeholder response (not implemented).
Parameters:
-
prompt(str) –Unused placeholder parameter.
-
json_mode(bool, default:False) –If True, returns a JSON object string.
Returns:
-
str–A JSON string with cell_type="Local_Pending", indicating that the local backend is not yet implemented.
Masking¶
llm_sc_curator.masking.FeatureDistiller ¶
FeatureDistiller(global_adata)
Initialize a feature distiller using a GLOBAL reference atlas.
The global atlas provides background distributions required for: (i) global Gini-based low-specificity detection, and (ii) module-based masking using regex/gene lists.
Parameters:
-
global_adata(AnnData) –Global AnnData containing all cells/lineages used as background context. .X is expected to be log1p-normalized expression.
Notes
This class does not modify global_adata in-place. Computed statistics are
stored in self.gene_stats as a pandas DataFrame indexed by gene name.
calculate_gene_stats ¶
calculate_gene_stats()
Compute global gene statistics used by downstream masking steps.
This method computes, for each gene in the global atlas:
- mean expression (on .X)
- Gini coefficient (a global specificity proxy)
Returns:
-
DataFrame–DataFrame indexed by gene name with columns:
- "mean": global mean expression
- "gini": global Gini coefficient
Notes
- For sparse matrices (common in Scanpy), column-wise slicing in Python loops is slow. We convert to CSC once for efficient column access.
- On very large atlases (e.g., >2e5 cells × >1e4 genes), this step can be computationally heavy. Consider constructing a balanced global context upstream (e.g., via set_global_context(..., balance_by=...)) to limit the number of cells used for global statistics.
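For readers unfamiliar with the Gini coefficient as a specificity proxy, here is a small pure-Python sketch using the standard sorted-rank formula. It is illustrative only; the library's own computation (vectorized over a sparse matrix) may differ in detail, and the function name gini is an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means expression is spread evenly across cells; values near 1 mean
    expression is concentrated in a few cells.
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


housekeeping_like = gini([1.0, 1.0, 1.0, 1.0])  # evenly expressed
marker_like = gini([0.0, 0.0, 0.0, 1.0])        # restricted to one cell
```

An evenly expressed gene scores 0, while a gene detected in a single cell out of four scores 0.75, which is why low Gini is used to flag housekeeping-like genes.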
detect_biological_noise ¶
detect_biological_noise(
gini_threshold=None,
gini_q=0.01,
mean_floor=0.05,
whitelist=None,
rescue_mean_floor=0.1,
low_gini_cap=0.15,
rescue_modules=None,
)
Stage 2: Detect globally low-specificity genes and module-defined noise programs.
Noise genes are flagged using two complementary mechanisms:
1) Global low-Gini housekeeping detection (data-driven).
2) Regex- and list-based biological modules (e.g., mitochondrial, stress, cell cycle).
Global low-specificity genes are defined using either:
- an absolute Gini threshold (gini_threshold), if provided, or
- the lower gini_q quantile among genes with mean >= mean_floor,
optionally capped by low_gini_cap.
Parameters:
-
gini_threshold(float or None, default:None) –Absolute Gini cutoff for housekeeping detection. If provided, quantile-based cutoff is not used.
-
gini_q(float, default:0.01) –Quantile defining the low-Gini band among genes passing mean_floor.
-
mean_floor(float, default:0.05) –Mean-expression floor used to exclude extremely lowly expressed genes when estimating the quantile cutoff.
-
whitelist(list[str] or None, default:None) –User-specified genes that should not be masked even if matched by a rule.
-
rescue_mean_floor(float, default:0.1) –Minimum global mean required to keep a rescued sentinel gene for rescue-enabled modules.
-
low_gini_cap(float or None, default:0.15) –Optional upper bound for the quantile-derived cutoff. The effective cutoff is min(quantile_cutoff, low_gini_cap).
-
rescue_modules(tuple[str, ...] or None, default:None) –Module names for which a top-expressed sentinel is rescued while masking the remaining matched genes. If None, uses RESCUE_MODULES_DEFAULT.
Returns:
-
dict[str, str]–Mapping from gene symbol to a masking reason string (e.g., "Module_Mito", "Low_Gini_Housekeeping(...)", "CrossLineage_Leak(...)", etc.).
Notes
- This function ensures that "sentinel rescue" only occurs for modules explicitly listed in rescue_modules (default: LINC and hemoglobin contamination).
- If self.gene_stats is missing, it is computed on demand via calculate_gene_stats().
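The cutoff selection for the low-Gini band can be illustrated with a short sketch. It assumes a linear-interpolation quantile (the library may use a different quantile method) and takes gene statistics as a plain dict of gene to (mean, gini); low_gini_cutoff is a hypothetical helper name.

```python
def low_gini_cutoff(gene_stats, gini_threshold=None, gini_q=0.01,
                    mean_floor=0.05, low_gini_cap=0.15):
    """Return the Gini value below which genes count as low-specificity."""
    if gini_threshold is not None:
        return gini_threshold  # an absolute cutoff overrides the quantile
    # Only genes above the mean floor contribute to the quantile estimate.
    ginis = sorted(g for mean, g in gene_stats.values() if mean >= mean_floor)
    if not ginis:
        return None
    # Linear-interpolation quantile (an assumption, not the library's method).
    pos = gini_q * (len(ginis) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ginis) - 1)
    cutoff = ginis[lo] + (pos - lo) * (ginis[hi] - ginis[lo])
    # Effective cutoff is min(quantile_cutoff, low_gini_cap) when a cap is set.
    return cutoff if low_gini_cap is None else min(cutoff, low_gini_cap)


stats = {
    "ACTB": (3.0, 0.10),
    "GAPDH": (2.5, 0.20),
    "CD3E": (0.4, 0.90),
    "RARE1": (0.01, 0.02),  # hypothetical gene below mean_floor, ignored
}
capped = low_gini_cutoff(stats, gini_q=0.5)
uncapped = low_gini_cutoff(stats, gini_q=0.5, low_gini_cap=None)
```

The cap guards against datasets where even the lowest Gini quantile is fairly high, which would otherwise mask genes that are not truly ubiquitous.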
calculate_lineage_specificity ¶
calculate_lineage_specificity(
target_genes,
target_adata,
target_group,
group_col,
coarse_col,
expr_percentile=90,
tail_percentile=95,
abs_threshold=0.1,
)
Stage 3: Cross-lineage specificity check using global lineage context.
This step identifies candidate markers that appear disproportionately high in
non-target major lineages (i.e., potential leakage markers), using percentile-
based comparisons between:
- local target cluster expression (within target_adata), and
- global major lineage expression (within the global atlas self.adata).
Parameters:
-
target_genes(list[str]) –Candidate marker genes for the target cluster.
-
target_adata(AnnData) –Local AnnData containing the target clustering.
-
target_group(str or int) –Cluster identifier within target_adata.obs[group_col].
-
group_col(str) –Column name in target_adata.obs containing cluster labels.
-
coarse_col(str) –Column name in the GLOBAL atlas self.adata.obs containing major lineage labels. This column must also be present in the local subset for lineage assignment.
-
expr_percentile(int, default:90) –Expression percentile used for robust per-gene comparison.
-
tail_percentile(int, default:95) –Percentile applied to expression ratios to derive a dynamic leakage threshold.
-
abs_threshold(float, default:0.1) –Absolute expression floor to avoid unstable ratios in near-zero regimes.
Returns:
-
dict[str, str]–Mapping from gene symbol to a leakage reason string if flagged as cross-lineage high expression.
Notes
If required metadata are missing (e.g., coarse_col not found), the method returns
an empty dict and logs a warning rather than raising, to keep batch runs robust.
Utils¶
Helpers for converting per-cluster LLM outputs into tidy tables (CSV/DataFrame) and per-cell labels.
Output table contract¶
export_cluster_annotation_table() produces a cluster summary table intended to be stable across versions.
Required columns:
- {cluster_col} (e.g., seurat_clusters)
- n_cells
- {prefix}_CellType, {prefix}_Confidence, {prefix}_ConfidenceScore, {prefix}_Reasoning, {prefix}_Genes
Extra keys returned by LLM backends may be exported as {prefix}_<UpperCamelCaseKey> columns.
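As a rough illustration of the extra-key naming, a snake_case key could be turned into a prefixed UpperCamelCase column as shown below. The exact conversion rule used by the library is not stated here, so treat this helper (extra_key_to_column) as an assumption.

```python
def extra_key_to_column(key, prefix="Curated"):
    """Convert a result key such as 'unknown_score' to '<prefix>_UnknownScore'."""
    parts = [p for p in key.replace("-", "_").split("_") if p]
    # Capitalize only the first character so existing inner capitals survive.
    camel = "".join(p[:1].upper() + p[1:] for p in parts)
    return f"{prefix}_{camel}"


column = extra_key_to_column("unknown_score")
```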
llm_sc_curator.utils ¶
ensure_json_result ¶
ensure_json_result(x)
Normalize LLM outputs into a stable dictionary schema.
LLM backends may return a dict (JSON), a plain string, or unexpected types. This helper enforces a minimal, consistent schema so downstream code can reliably build tables, logs, and per-cell annotations.
Parameters:
-
x(Any) –Raw output from an LLM backend. Expected types include:
- dict : a JSON object with keys like "cell_type", "confidence", "reasoning"
- str : a plain label string (treated as low-confidence)
Other types are treated as errors and stringified into the "reasoning" field.
Returns:
-
dict–A dictionary guaranteed to contain:
- cell_type : str. Predicted label, or "Unknown"/"Error" on fallback.
- confidence : {"High", "Medium", "Low"}. Confidence bucket; defaults to "Low" if missing/invalid.
- reasoning : str. Brief explanation (may be empty).
Any additional keys found in the input dict are preserved to support future extensions (e.g., "candidate_labels", "unknown_score").
Notes
This function is intentionally permissive: it never raises, and it prefers returning a best-effort normalized object to keep cluster loops running.
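A permissive normalizer following this contract might look like the sketch below. It is not the library's ensure_json_result; normalize_result is a hypothetical name, and details such as using repr() for unexpected types are assumptions.

```python
VALID_CONFIDENCE = ("High", "Medium", "Low")


def normalize_result(x):
    """Normalize a raw backend result into the documented minimal schema."""
    if isinstance(x, dict):
        out = dict(x)  # copy, preserving extra keys such as "unknown_score"
        out["cell_type"] = str(out.get("cell_type") or "Unknown")
        if out.get("confidence") not in VALID_CONFIDENCE:
            out["confidence"] = "Low"  # missing or invalid bucket
        out["reasoning"] = str(out.get("reasoning") or "")
        return out
    if isinstance(x, str):
        # A bare label string is accepted but treated as low-confidence.
        return {"cell_type": x.strip() or "Unknown",
                "confidence": "Low", "reasoning": ""}
    # Anything else is an error; keep a printable trace in the reasoning.
    return {"cell_type": "Error", "confidence": "Low", "reasoning": repr(x)}


fixed = normalize_result({"cell_type": "NK cell", "confidence": "certain",
                          "unknown_score": 0.1})
```

In this example the invalid confidence value is replaced with "Low", an empty reasoning string is added, and the extra unknown_score key is carried through unchanged.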
export_cluster_annotation_table ¶
export_cluster_annotation_table(
adata,
cluster_col,
cluster_results,
genes_by_cluster=None,
prefix="Curated",
)
Build a cluster-level annotation table (DataFrame) from per-cluster LLM outputs.
This utility converts JSON-like outputs (one result per cluster) into a tidy cluster summary table suitable for:
- CSV export and sharing with non-bioinformatics users
- downstream plotting and reporting (e.g., cluster composition summaries)
- reproducible logging of label, confidence, reasoning, and the gene list used
Parameters:
-
adata(AnnData-like) –Object with adata.obs containing a cluster column. Only adata.obs[cluster_col] is required.
-
cluster_col(str) –Column name in adata.obs containing cluster IDs (e.g., "seurat_clusters"). Cluster IDs are coerced to string.
-
cluster_results(dict) –Mapping: cluster_id (str) -> result dict. Each result dict is expected to include at least:
- "cell_type"
- "confidence"
- "reasoning"
Extra keys are allowed and will be appended as additional columns.
-
genes_by_cluster(dict, default:None) –Mapping: cluster_id (str) -> list/sequence of genes used for that cluster. If provided, genes are serialized as a semicolon-separated string.
-
prefix(str, default:"Curated") –Prefix used to namespace output columns, e.g.:
- Curated_CellType
- Curated_Confidence
- Curated_Reasoning
- Curated_Genes
Returns:
-
DataFrame–A DataFrame with one row per cluster and at minimum:
- {cluster_col} : str
- n_cells : int
- {prefix}_CellType : str
- {prefix}_Confidence : str
- {prefix}_Reasoning : str
- {prefix}_Genes : str
If cluster_results contains extra keys beyond {"cell_type","confidence","reasoning"}, they are included as {prefix}_<UpperCamelCaseKey> columns.
Notes
- This function does not call the LLM. It only formats and aggregates results.
- Stability: missing clusters in cluster_results are filled with "Unknown"/"Low".
- Extensibility: extra keys (e.g., V2 "unknown_score") are kept as columns.
apply_cluster_map_to_cells ¶
apply_cluster_map_to_cells(
adata,
cluster_col,
df_cluster,
label_col,
new_col="Curated_CellType",
unknown="Unknown",
)
Add per-cell labels to adata.obs by mapping cluster IDs to cluster-level labels.
This is the standard "cluster → cell" propagation step:
- users annotate each cluster once
- the annotation is assigned to all cells in that cluster
Parameters:
-
adata(AnnData-like) –Object with adata.obs containing a cluster column. The function modifies adata.obs in-place by adding new_col.
-
cluster_col(str) –Column name in adata.obs containing cluster IDs (e.g., "seurat_clusters"). Cluster IDs are coerced to string.
-
df_cluster(DataFrame) –Cluster-level annotation table (typically produced by export_cluster_annotation_table). Must contain columns: [cluster_col, label_col].
-
label_col(str) –Column in df_cluster containing the label to map (e.g., "Curated_CellType").
-
new_col(str, default:"Curated_CellType") –Name of the new per-cell column to create in adata.obs.
-
unknown(str, default:"Unknown") –Fallback label used when a cluster has no entry in df_cluster.
Returns:
-
AnnData-like–The input adata with adata.obs[new_col] added/updated.
Notes
- This function is intentionally minimal: it performs a direct mapping.
- If you need label harmonization (synonyms / wording), run harmonize_labels after mapping, to keep ontology decisions explicit and user-controlled.
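The mapping step, followed by an optional explicit harmonization pass, can be sketched with plain Python containers. These helpers are illustrative stand-ins (the real functions operate on adata.obs and a DataFrame); the names map_clusters_to_cells and harmonize are hypothetical, and leaving unmapped labels unchanged is an assumption.

```python
def map_clusters_to_cells(cell_clusters, cluster_labels, unknown="Unknown"):
    """Assign each cell the label of its cluster.

    cell_clusters: per-cell cluster IDs (stand-in for adata.obs[cluster_col]).
    cluster_labels: cluster ID -> label (stand-in for the cluster table).
    """
    # Compare IDs as strings, mirroring the documented coercion.
    lookup = {str(cid): label for cid, label in cluster_labels.items()}
    return [lookup.get(str(cid), unknown) for cid in cell_clusters]


def harmonize(labels, mapping=None):
    """Apply an explicit raw -> standardized mapping; no fuzzy matching."""
    if not mapping:
        return list(labels)  # no-op when no mapping is given
    return [mapping.get(label, label) for label in labels]


per_cell = map_clusters_to_cells([0, 1, 1, 2], {"0": "CD8 T", "1": "B cells"})
clean = harmonize(per_cell, {"B cells": "B cell", "CD8 T": "CD8+ T cell"})
```

Cluster 2 has no entry in the label table, so its cell receives the fallback label, and harmonization leaves that fallback untouched.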
harmonize_labels ¶
harmonize_labels(
adata, col, mapping=None, new_col=None, keep_raw=True
)
Harmonize label wording using an explicit user-provided mapping dictionary.
LLM-generated labels may vary in wording (synonyms, punctuation, long descriptors). This function applies an explicit mapping to standardize labels for plots and downstream summaries, while keeping the raw labels for transparency.
Parameters:
-
adata(AnnData-like) –Object with adata.obs[col] present. The function updates adata.obs.
-
col(str) –Column name in adata.obs containing raw labels to harmonize.
-
mapping(dict or None, default:None) –Mapping dict: raw_label -> standardized_label. If None, this function is a no-op and returns adata.
-
new_col(str or None, default:None) –Output column name. If None, defaults to f"{col}_harmonized".
-
keep_raw(bool, default:True) –If True, store the original labels in adata.obs[f"{col}_raw"] before harmonization.
Returns:
-
AnnData-like–The input adata with harmonized labels added as a categorical column.
Notes
- This is intentionally a thin wrapper: ontology decisions stay outside the core engine and are encoded explicitly in mapping.
- For safety and reproducibility, this function does not attempt fuzzy matching. If desired, fuzzy suggestions can be implemented as a separate helper that proposes candidates without auto-replacing.
Noise modules¶
llm_sc_curator.noise_lists ¶
Noise module definitions for LLM-scCurator.
This module centralizes regex patterns and curated gene lists that represent biological/technical programs which commonly dominate naive marker rankings and confuse LLM-based interpretation (e.g., ribosomal/mitochondrial, stress, cell cycle, TCR/Ig clonotypes, uninformative locus IDs).
These definitions are used by the masking/distillation stage to:
- suppress ubiquitous programs in prompt marker lists, and
- optionally rescue sentinel markers (e.g., proliferation) to avoid over-filtering.
Notes
- Patterns are written to be Human/Mouse compatible when possible (case-aware).
- This file is intentionally dependency-free and safe to import.
- Edit conservatively: changes may affect benchmarking reproducibility.