Skip to content

API reference

Core

llm_sc_curator.core.LLMscCurator

LLMscCurator(
    api_key=None,
    model_name=None,
    backend=None,
    allow_internal_normalization=False,
    normalization_target_sum=10000.0,
)

Initialize LLM-scCurator.

Parameters:

  • api_key (str, default: None ) –

    (Legacy / convenience) API key used to instantiate the default Gemini backend. Prefer passing an explicit backend instance for full control.

  • model_name (str, default: None ) –

    (Legacy / convenience) Model identifier for the default Gemini backend. If not provided, a reasonable default is used.

  • backend (BaseLLMBackend, default: None ) –

    Dependency-injected backend instance (e.g., GeminiBackend, OpenAIBackend). Must implement a generate(prompt: str, json_mode: bool) -> str interface.

  • allow_internal_normalization (bool, default: False ) –

    If True, LLM-scCurator will internally normalize and log-transform inputs that look like raw counts (UMI-like). Internal normalization is performed on a COPY and recorded in adata.uns['llm_sc_curator'] for provenance.

  • normalization_target_sum (float, default: 1e4 ) –

    Target total counts per cell used when allow_internal_normalization=True.

Notes
  • This initializer enforces a backend-agnostic design: users may plug in cloud APIs or local models via the BaseLLMBackend interface.
  • If neither backend nor api_key is provided, initialization fails fast.

set_global_context

set_global_context(
    adata,
    balance_by=None,
    max_cells_per_group="auto",
    min_cells_per_group=50,
    random_state=42,
)

Set the global dataset context used for specificity-aware feature distillation.

LLM-scCurator uses a global context to (i) estimate global Gini/sparsity trends, (ii) detect ubiquitous housekeeping/stress programs, and (iii) support optional cross-lineage leakage checks. When balance_by is provided, context statistics are computed on a balanced subsample to prevent dominant groups from skewing percentiles and thresholds.

Parameters:

  • adata (AnnData) –

    AnnData containing all cells. Can be raw counts or log1p-normalized; normalization is validated via _check_normalization.

  • balance_by (str or None, default: None ) –

    Column in adata.obs defining groups for balanced subsampling (e.g., broad lineage labels or sample/patient IDs). If None, all cells are used.

  • max_cells_per_group (int or {auto}, default: "auto" ) –

    Target number of cells retained per group. If "auto", the per-group target is set to the median group size with a floor at min_cells_per_group.

  • min_cells_per_group (int, default: 50 ) –

    Groups with fewer cells than this threshold are excluded from the balanced subsample to avoid unstable global statistics.

  • random_state (int, default: 42 ) –

    Seed for subsampling.

Returns:

  • None

    Initializes self.masker (a FeatureDistiller) and records configuration in self._global_context_config.

Notes

This method never modifies adata in-place. Any normalization or subsampling occurs on internal copies.

curate_features

curate_features(
    target_adata,
    group_col,
    target_group,
    reference="rest",
    n_top=50,
    use_statistics=True,
    use_hvg=True,
    coarse_col=None,
    whitelist=None,
    batch_key=None,
    n_candidates=500,
    min_target_mean=0.02,
    min_delta_mean=0.02,
    min_logfc=0.2,
    min_target_pct=0.02,
    min_delta_pct=0.02,
)

Perform 4-stage feature distillation for LLM prompting.

The goal is to produce a compact marker list that is enriched for lineage/state identity signals while suppressing ubiquitous biological programs (e.g., stress, housekeeping, cell cycle) and technical confounders. Candidate genes are drawn from HVGs (optionally batch-aware) with additional "high-Gini rescue" to retain lineage-restricted markers that may not be highly variable.

Parameters:

  • target_adata (AnnData) –

    AnnData object containing the target clusters (subset or full dataset).

  • group_col (str) –

    Column in target_adata.obs containing cluster labels.

  • target_group (str or int) –

    Cluster identifier to analyze.

  • reference (rest, default: "rest" ) –

    Reference group used for differential expression. If "rest", the target cluster is contrasted against all other cells.

  • n_top (int, default: 50 ) –

    Number of final markers to return.

  • use_statistics (bool, default: True ) –

    If True, apply biological noise detection modules (regex-based) to mask confounder programs and rescue sentinel markers.

  • use_hvg (bool, default: True ) –

    If True, restrict candidates to HVGs plus rescue sets (high-Gini + lineage markers).

  • coarse_col (str or None, default: None ) –

    Optional column (in the global context) used for cross-lineage leakage checks.

  • whitelist (list[str] or None, default: None ) –

    Optional list of genes exempt from masking (augmented internally with proliferation sentinels and lineage markers, excluding confounded TCR/Ig/Hb patterns).

  • batch_key (str or None, default: None ) –

    Optional key for batch-aware HVG selection.

  • n_candidates (int, default: 500 ) –

    Number of top-ranked DE candidates considered before masking/filtering.

  • min_target_mean (float, default: 0.02 ) –

    Minimum mean expression (log1p) in the target cluster.

  • min_delta_mean (float, default: 0.02 ) –

    Minimum mean difference (target - reference).

  • min_logfc (float, default: 0.2 ) –

    Minimum log fold-change threshold if available in the DE table.

  • min_target_pct (float, default: 0.02 ) –

    Minimum detection fraction in the target cluster.

  • min_delta_pct (float, default: 0.02 ) –

    Minimum detection fraction difference (target - reference).

Returns:

  • list[str]

    Distilled marker genes (length up to n_top). If all candidates are masked, the method falls back to unmasked DE rankings to avoid empty prompts.

Notes

Input data are validated for log1p normalization via _check_normalization. For rigorous benchmarking and lineage leakage checks, call set_global_context() on a broad atlas before per-cluster distillation.

annotate

annotate(
    gene_list,
    cell_type="",
    context=None,
    use_auto_context=True,
    max_retries=3,
    retry_sleep=1.0,
)

Query the configured LLM backend with a distilled marker list and return a structured label.

The backend is called in JSON mode. Outputs are parsed into a stable dictionary schema to support reproducible downstream evaluation and logging.

Parameters:

  • gene_list (list[str]) –

    Distilled marker genes for a single cluster.

  • cell_type (str, default: "" ) –

    Optional parent lineage label to bias the prompt (e.g., coarse lineage).

  • context (dict or None, default: None ) –

    Optional additional context (tissue, condition, dataset notes).

  • use_auto_context (bool, default: True ) –

    If True and cell_type is not provided, use the last inferred lineage context when available.

  • max_retries (int, default: 3 ) –

    Maximum number of attempts to obtain a valid JSON object.

  • retry_sleep (float, default: 1.0 ) –

    Sleep duration (seconds) between retry attempts.

Returns:

  • dict

    A dictionary with keys: - cell_type : str Predicted lineage/state label. - confidence : {"High", "Medium", "Low"} Self-reported confidence bucket. - reasoning : str Brief, marker-based justification.

    On failure, returns a soft error object with cell_type set to "Error" or "ParseError" (rather than raising), so callers may decide how to handle it.

Notes

This method is intentionally tolerant to transient backend failures and occasional JSON formatting issues (e.g., fenced code blocks).

run_hierarchical_discovery

run_hierarchical_discovery(
    adata,
    coarse_res=0.2,
    fine_res=0.5,
    n_top=50,
    batch_key=None,
    global_context=None,
    random_state=42,
)

Run coarse-to-fine hierarchical annotation on an AnnData object.

The pipeline performs a coarse Leiden clustering to infer major lineages, then reclusters within each coarse group to infer finer subtypes. Distilled marker lists are generated per cluster and passed to the injected LLM backend for structured JSON outputs.

Parameters:

  • adata (AnnData) –

    Input dataset. .X must be log1p-normalized (validated via _check_normalization).

  • coarse_res (float, default: 0.2 ) –

    Leiden resolution for the coarse (major lineage) clustering.

  • fine_res (float, default: 0.5 ) –

    Leiden resolution for the fine (subtype) clustering within each coarse group.

  • n_top (int, default: 50 ) –

    Marker list size passed to the LLM per cluster.

  • batch_key (str or None, default: None ) –

    Optional key for batch-aware HVG selection.

  • global_context (dict or None, default: None ) –

    Optional user-provided context to embed in the prompt (tissue/condition).

  • random_state (int, default: 42 ) –

    Random seed used for PCA/neighbors/Leiden and subsampling where applicable.

Returns:

  • AnnData

    A COPY/view with added annotations: - adata.obs['major_type'] : coarse lineage labels - adata.obs['fine_type'] : fine subtype labels Reasoning logs are stored in adata.uns['llm_reasoning'].

Notes

HVGs are precomputed once when missing, to avoid redundant work across clusters.

Backends

llm_sc_curator.backends

BaseLLMBackend

Bases: ABC

Abstract base class for LLM backends.

This interface enforces a backend-agnostic contract:

  • Input: a prompt string (optionally requesting strict JSON output).
  • Output: a single string (either free-form text or a JSON object encoded as text).

Implementations may wrap cloud APIs (e.g., Gemini/OpenAI) or local models, but must expose a stable generate() method to keep downstream annotation code unchanged.

generate abstractmethod

generate(prompt, json_mode=False)

Generate a completion for the given prompt.

Parameters:

  • prompt (str) –

    Prompt text to send to the backend.

  • json_mode (bool, default: False ) –

    If True, the backend should attempt to return a valid JSON object encoded as a string (not fenced code blocks).

Returns:

  • str

    Model output as a string. In JSON mode, this should be a JSON object encoded as a string (e.g., '{"cell_type": "...", "confidence": "...", "reasoning": "..."}').

GeminiBackend

GeminiBackend(
    api_key,
    model_name="models/gemini-2.0-flash",
    temperature=0.0,
)

Bases: BaseLLMBackend

Google Gemini backend for LLM-scCurator.

This backend wraps google-generativeai and supports an optional JSON mode by setting the response MIME type to application/json.

Parameters:

  • api_key (str) –

    Gemini API key.

  • model_name (str, default: "models/gemini-2.0-flash" ) –

    Gemini model identifier.

  • temperature (float, default: 0.0 ) –

    Sampling temperature. Use 0.0 to maximize determinism (recommended for JSON mode).

Raises:

  • ImportError

    If google-generativeai is not installed.

Notes

On exceptions, this backend returns a soft error payload (string or JSON string) rather than raising, to keep downstream annotation pipelines robust.

Initialize the Gemini backend.

Parameters:

  • api_key (str) –

    Gemini API key.

  • model_name (str, default: "models/gemini-2.0-flash" ) –

    Gemini model identifier.

  • temperature (float, default: 0.0 ) –

    Sampling temperature (0.0 recommended for deterministic JSON responses).

generate

generate(prompt, json_mode=False)

Generate a completion from Gemini.

Parameters:

  • prompt (str) –

    Prompt text to send to Gemini.

  • json_mode (bool, default: False ) –

    If True, request JSON output via response_mime_type="application/json".

Returns:

  • str

    Model output string. If an exception occurs: - In JSON mode: returns a JSON string with keys {cell_type, confidence, reasoning}. - Otherwise: returns a human-readable error string prefixed with "Gemini Error:".

OpenAIBackend

OpenAIBackend(
    api_key, model_name="gpt-4o", temperature=0.0, seed=42
)

Bases: BaseLLMBackend

OpenAI Chat Completions backend for LLM-scCurator.

This backend uses the official openai Python package and supports JSON mode via response_format={"type": "json_object"}.

Parameters:

  • api_key (str) –

    OpenAI API key.

  • model_name (str, default: "gpt-4o" ) –

    Model identifier.

  • temperature (float, default: 0.0 ) –

    Sampling temperature (0.0 recommended for structured outputs).

  • seed (int, default: 42 ) –

    Seed forwarded to the API when supported by the selected model. If the request fails with seed enabled, the backend retries once without the seed.

Raises:

  • ImportError

    If the openai package is not installed.

Notes

In contrast to GeminiBackend, OpenAIBackend attempts a seed-based call first and falls back to a seed-free call if needed. On errors it returns a soft error payload (string or JSON string), which enables batch pipelines to continue.

Initialize the OpenAI backend.

Parameters:

  • api_key (str) –

    OpenAI API key.

  • model_name (str, default: "gpt-4o" ) –

    Model identifier.

  • temperature (float, default: 0.0 ) –

    Sampling temperature (0.0 recommended for structured outputs).

  • seed (int, default: 42 ) –

    Seed forwarded to the API when supported.

generate

generate(prompt, json_mode=False)

Generate a completion from OpenAI Chat Completions.

Parameters:

  • prompt (str) –

    Prompt text to send to the OpenAI API.

  • json_mode (bool, default: False ) –

    If True, request a JSON object response format.

Returns:

  • str

    Model output string. If an exception occurs: - In JSON mode: returns a JSON string with keys {cell_type, confidence, reasoning}. - Otherwise: returns a human-readable error string prefixed with "OpenAI Error:".

Notes

The backend first attempts a request with seed enabled. If that fails (e.g., due to model/endpoint incompatibility), it retries once without seed before returning a soft error payload.

OllamaBackend

OllamaBackend(
    host=None,
    model_name=None,
    temperature=None,
    timeout=None,
)

Bases: BaseLLMBackend

Ollama local backend for LLM-scCurator.

This backend sends prompts to a locally running Ollama server via the REST Chat API and returns the assistant response as a single string. It is a drop-in implementation of :class:BaseLLMBackend.

The primary use case is institutional / on-prem environments where outbound calls to cloud LLM APIs are restricted and inference must run locally.

Parameters:

  • host (str, default: None ) –

    Base URL of the Ollama server.

    If not provided, the value is resolved in the following order:

    1) Environment variable LLMSC_OLLAMA_HOST 2) Default: "http://ollama:11434" (Docker Compose friendly)

    Trailing slashes are removed automatically.

  • model_name (str, default: None ) –

    Ollama model name/tag (e.g., "llama3.1:8b" or "qwen2.5:7b-instruct").

    If not provided, the value is resolved in the following order:

    1) Environment variable LLMSC_OLLAMA_MODEL 2) Default: "llama3.1:8b"

  • temperature (float, default: None ) –

    Sampling temperature forwarded to Ollama as options.temperature. Use 0.0 for more deterministic classification-style behavior.

    If not provided, defaults to environment variable LLMSC_OLLAMA_TEMPERATURE (default 0.0).

  • timeout (float, default: None ) –

    Per-request timeout (seconds) for the HTTP call to Ollama. CPU-only inference can be slow; raise this value for large prompts.

    If not provided, defaults to environment variable LLMSC_OLLAMA_TIMEOUT (default 120).

Notes

API endpoint This backend uses the Ollama Chat endpoint:

- ``POST {host}/api/chat``

JSON mode If json_mode=True, the request includes format="json" which asks Ollama to return a single JSON object (not fenced). The returned value is still a string (JSON text), consistent with the BaseLLMBackend contract.

This backend performs a lightweight validation by attempting ``json.loads``
on the returned content; on failure it returns a "soft error" JSON payload.

Failure behavior On exceptions, the backend returns a soft error: - In JSON mode: JSON text with keys cell_type, confidence, reasoning. - Otherwise: a string prefixed with "Ollama Error:".

When wrapped by :func:`retry_with_backoff`, these failures can be retried.

Examples:

Docker Compose (default host)::

# export LLMSC_OLLAMA_MODEL=llama3.1:8b
backend = OllamaBackend()
out = backend.generate("Return a JSON object: {"x": 1}", json_mode=True)

Local host::

backend = OllamaBackend(host="http://localhost:11434", model_name="llama3.1:8b")
out = backend.generate("Hello", json_mode=False)

generate

generate(prompt, json_mode=False)

Generate a completion from Ollama.

Parameters:

  • prompt (str) –

    Prompt text to send to Ollama.

  • json_mode (bool, default: False ) –

    If True, request a single JSON object response (returned as JSON text).

Returns:

  • str

    Model output as a string.

    • If json_mode=False: free-form assistant text.
    • If json_mode=True: a JSON object encoded as text.

    On failure: - If json_mode=True: returns JSON text with keys cell_type="Error", confidence="Low", and reasoning. - Otherwise: returns a string prefixed with "Ollama Error:".

LocalLLMBackend

Bases: BaseLLMBackend

Placeholder backend for future local model integrations.

This backend currently returns a fixed JSON payload indicating that local inference is not implemented. It exists to document the intended extension point and to keep the public API stable.

Notes

Use this class as a template when integrating a local LLM runner (e.g., llama.cpp, vLLM, or an on-premise service) into the BaseLLMBackend interface.

generate

generate(prompt, json_mode=False)

Return a placeholder response (not implemented).

Parameters:

  • prompt (str) –

    Unused placeholder parameter.

  • json_mode (bool, default: False ) –

    If True, returns a JSON object string.

Returns:

  • str

    A JSON string with cell_type="Local_Pending", indicating that the local backend is not yet implemented.

Masking

llm_sc_curator.masking.FeatureDistiller

FeatureDistiller(global_adata)

Initialize a feature distiller using a GLOBAL reference atlas.

The global atlas provides background distributions required for: (i) global Gini-based low-specificity detection, and (ii) module-based masking using regex/gene lists.

Parameters:

  • global_adata (AnnData) –

    Global AnnData containing all cells/lineages used as background context. .X is expected to be log1p-normalized expression.

Notes

This class does not modify global_adata in-place. Computed statistics are stored in self.gene_stats as a pandas DataFrame indexed by gene name.

calculate_gene_stats

calculate_gene_stats()

Compute global gene statistics used by downstream masking steps.

This method computes, for each gene in the global atlas: - mean expression (on .X) - Gini coefficient (a global specificity proxy)

Returns:

  • DataFrame

    DataFrame indexed by gene name with columns: - "mean": global mean expression - "gini": global Gini coefficient

Notes
  • For sparse matrices (common in Scanpy), column-wise slicing in Python loops is slow. We convert to CSC once for efficient column access.
  • On very large atlases (e.g., >2e5 cells × >1e4 genes), this step can be computationally heavy. Consider constructing a balanced global context upstream (e.g., via set_global_context(..., balance_by=...)) to limit the number of cells used for global statistics.

detect_biological_noise

detect_biological_noise(
    gini_threshold=None,
    gini_q=0.01,
    mean_floor=0.05,
    whitelist=None,
    rescue_mean_floor=0.1,
    low_gini_cap=0.15,
    rescue_modules=None,
)

Stage 2: Detect globally low-specificity genes and module-defined noise programs.

Noise genes are flagged using two complementary mechanisms: 1) Global low-Gini housekeeping detection (data-driven). 2) Regex- and list-based biological modules (e.g., mitochondrial, stress, cell cycle).

Global low-specificity genes are defined using either: - an absolute Gini threshold (gini_threshold), if provided, or - the lower gini_q quantile among genes with mean >= mean_floor, optionally capped by low_gini_cap.

Parameters:

  • gini_threshold (float or None, default: None ) –

    Absolute Gini cutoff for housekeeping detection. If provided, quantile-based cutoff is not used.

  • gini_q (float, default: 0.01 ) –

    Quantile defining the low-Gini band among genes passing mean_floor.

  • mean_floor (float, default: 0.05 ) –

    Mean-expression floor used to exclude extremely lowly expressed genes when estimating the quantile cutoff.

  • whitelist (list[str] or None, default: None ) –

    User-specified genes that should not be masked even if matched by a rule.

  • rescue_mean_floor (float, default: 0.1 ) –

    Minimum global mean required to keep a rescued sentinel gene for rescue-enabled modules.

  • low_gini_cap (float or None, default: 0.15 ) –

    Optional upper bound for the quantile-derived cutoff. The effective cutoff is min(quantile_cutoff, low_gini_cap).

  • rescue_modules (tuple[str, ...] or None, default: None ) –

    Module names for which a top-expressed sentinel is rescued while masking the remaining matched genes. If None, uses RESCUE_MODULES_DEFAULT.

Returns:

  • dict[str, str]

    Mapping from gene symbol to a masking reason string (e.g., "Module_Mito", "Low_Gini_Housekeeping(...)", "CrossLineage_Leak(...)", etc.).

Notes
  • This function ensures that "sentinel rescue" only occurs for modules explicitly listed in rescue_modules (default: LINC and hemoglobin contamination).
  • If self.gene_stats is missing, it is computed on demand via calculate_gene_stats().

calculate_lineage_specificity

calculate_lineage_specificity(
    target_genes,
    target_adata,
    target_group,
    group_col,
    coarse_col,
    expr_percentile=90,
    tail_percentile=95,
    abs_threshold=0.1,
)

Stage 3: Cross-lineage specificity check using global lineage context.

This step identifies candidate markers that appear disproportionately high in non-target major lineages (i.e., potential leakage markers), using percentile- based comparisons between: - local target cluster expression (within target_adata), and - global major lineage expression (within the global atlas self.adata).

Parameters:

  • target_genes (list[str]) –

    Candidate marker genes for the target cluster.

  • target_adata (AnnData) –

    Local AnnData containing the target clustering.

  • target_group (str or int) –

    Cluster identifier within target_adata.obs[group_col].

  • group_col (str) –

    Column name in target_adata.obs containing cluster labels.

  • coarse_col (str) –

    Column name in the GLOBAL atlas self.adata.obs containing major lineage labels. This column must also be present in the local subset for lineage assignment.

  • expr_percentile (int, default: 90 ) –

    Expression percentile used for robust per-gene comparison.

  • tail_percentile (int, default: 95 ) –

    Percentile applied to expression ratios to derive a dynamic leakage threshold.

  • abs_threshold (float, default: 0.1 ) –

    Absolute expression floor to avoid unstable ratios in near-zero regimes.

Returns:

  • dict[str, str]

    Mapping from gene symbol to a leakage reason string if flagged as cross-lineage high expression.

Notes

If required metadata are missing (e.g., coarse_col not found), the method returns an empty dict and logs a warning rather than raising, to keep batch runs robust.

Utils

Helpers for converting per-cluster LLM outputs into tidy tables (CSV/DataFrame) and per-cell labels.

Output table contract

export_cluster_annotation_table() produces a cluster summary table intended to be stable across versions.

Required columns: - {cluster_col} (e.g., seurat_clusters) - n_cells - {prefix}_CellType, {prefix}_Confidence, {prefix}_ConfidenceScore, {prefix}_Reasoning, {prefix}_Genes

Extra keys returned by LLM backends may be exported as {prefix}_<UpperCamelCaseKey> columns.

llm_sc_curator.utils

ensure_json_result

ensure_json_result(x)

Normalize LLM outputs into a stable dictionary schema.

LLM backends may return a dict (JSON), a plain string, or unexpected types. This helper enforces a minimal, consistent schema so downstream code can reliably build tables, logs, and per-cell annotations.

Parameters:

  • x (Any) –

    Raw output from an LLM backend. Expected types include: - dict : a JSON object with keys like "cell_type", "confidence", "reasoning" - str : a plain label string (treated as low-confidence) Other types are treated as errors and stringified into the "reasoning" field.

Returns:

  • dict

    A dictionary guaranteed to contain: - cell_type : str Predicted label, or "Unknown"/"Error" on fallback. - confidence : {"High", "Medium", "Low"} Confidence bucket; defaults to "Low" if missing/invalid. - reasoning : str Brief explanation (may be empty).

    Any additional keys found in the input dict are preserved to support future extensions (e.g., "candidate_labels", "unknown_score").

Notes

This function is intentionally permissive: it never raises, and it prefers returning a best-effort normalized object to keep cluster loops running.

export_cluster_annotation_table

export_cluster_annotation_table(
    adata,
    cluster_col,
    cluster_results,
    genes_by_cluster=None,
    prefix="Curated",
)

Build a cluster-level annotation table (DataFrame) from per-cluster LLM outputs.

This utility converts JSON-like outputs (one result per cluster) into a tidy cluster summary table suitable for: - CSV export and sharing with non-bioinformatics users - downstream plotting and reporting (e.g., cluster composition summaries) - reproducible logging of label, confidence, reasoning, and the gene list used

Parameters:

  • adata (AnnData - like) –

    Object with adata.obs containing a cluster column. Only adata.obs[cluster_col] is required.

  • cluster_col (str) –

    Column name in adata.obs containing cluster IDs (e.g., "seurat_clusters"). Cluster IDs are coerced to string.

  • cluster_results (dict) –

    Mapping: cluster_id (str) -> result dict. Each result dict is expected to include at least: - "cell_type" - "confidence" - "reasoning" Extra keys are allowed and will be appended as additional columns.

  • genes_by_cluster (dict, default: None ) –

    Mapping: cluster_id (str) -> list/sequence of genes used for that cluster. If provided, genes are serialized as a semicolon-separated string.

  • prefix (str, default: "Curated" ) –

    Prefix used to namespace output columns, e.g.: - Curated_CellType - Curated_Confidence - Curated_Reasoning - Curated_Genes

Returns:

  • DataFrame

    A DataFrame with one row per cluster and at minimum: - {cluster_col} : str - n_cells : int - {prefix}_CellType : str - {prefix}_Confidence : str - {prefix}_Reasoning : str - {prefix}_Genes : str

    If cluster_results contains extra keys beyond {"cell_type","confidence","reasoning"}, they are included as: - {prefix}_

Notes
  • This function does not call the LLM. It only formats and aggregates results.
  • Stability: missing clusters in cluster_results are filled with "Unknown"/"Low".
  • Extensibility: extra keys (e.g., V2 "unknown_score") are kept as columns.

apply_cluster_map_to_cells

apply_cluster_map_to_cells(
    adata,
    cluster_col,
    df_cluster,
    label_col,
    new_col="Curated_CellType",
    unknown="Unknown",
)

Add per-cell labels to adata.obs by mapping cluster IDs to cluster-level labels.

This is the standard "cluster → cell" propagation step: - users annotate each cluster once - the annotation is assigned to all cells in that cluster

Parameters:

  • adata (AnnData - like) –

    Object with adata.obs containing a cluster column. The function modifies adata.obs in-place by adding new_col.

  • cluster_col (str) –

    Column name in adata.obs containing cluster IDs (e.g., "seurat_clusters"). Cluster IDs are coerced to string.

  • df_cluster (DataFrame) –

    Cluster-level annotation table (typically produced by export_cluster_annotation_table). Must contain columns: [cluster_col, label_col].

  • label_col (str) –

    Column in df_cluster containing the label to map (e.g., "Curated_CellType").

  • new_col (str, default: "Curated_CellType" ) –

    Name of the new per-cell column to create in adata.obs.

  • unknown (str, default: "Unknown" ) –

    Fallback label used when a cluster has no entry in df_cluster.

Returns:

  • AnnData - like

    The input adata with adata.obs[new_col] added/updated.

Notes
  • This function is intentionally minimal: it performs a direct mapping.
  • If you need label harmonization (synonyms / wording), run harmonize_labels after mapping, to keep ontology decisions explicit and user-controlled.

harmonize_labels

harmonize_labels(
    adata, col, mapping=None, new_col=None, keep_raw=True
)

Harmonize label wording using an explicit user-provided mapping dictionary.

LLM-generated labels may vary in wording (synonyms, punctuation, long descriptors). This function applies an explicit mapping to standardize labels for plots and downstream summaries, while keeping the raw labels for transparency.

Parameters:

  • adata (AnnData - like) –

    Object with adata.obs[col] present. The function updates adata.obs.

  • col (str) –

    Column name in adata.obs containing raw labels to harmonize.

  • mapping (dict or None, default: None ) –

    Mapping dict: raw_label -> standardized_label. If None, this function is a no-op and returns adata.

  • new_col (str or None, default: None ) –

    Output column name. If None, defaults to f"{col}_harmonized".

  • keep_raw (bool, default: True ) –

    If True, store the original labels in adata.obs[f"{col}_raw"] before harmonization.

Returns:

  • AnnData - like

    The input adata with harmonized labels added as a categorical column.

Notes
  • This is intentionally a thin wrapper: ontology decisions stay outside the core engine and are encoded explicitly in mapping.
  • For safety and reproducibility, this function does not attempt fuzzy matching. If desired, fuzzy suggestions can be implemented as a separate helper that proposes candidates without auto-replacing.

Noise modules

llm_sc_curator.noise_lists

Noise module definitions for LLM-scCurator.

This module centralizes regex patterns and curated gene lists that represent biological/technical programs which commonly dominate naive marker rankings and confuse LLM-based interpretation (e.g., ribosomal/mitochondrial, stress, cell cycle, TCR/Ig clonotypes, uninformative locus IDs).

These definitions are used by the masking/distillation stage to: - suppress ubiquitous programs in prompt marker lists, and - optionally rescue sentinel markers (e.g., proliferation) to avoid over-filtering.

Notes
  • Patterns are written to be Human/Mouse compatible when possible (case-aware).
  • This file is intentionally dependency-free and safe to import.
  • Edit conservatively: changes may affect benchmarking reproducibility.