LLM-scCurator


LLM-scCurator is a Python framework for noise-aware marker distillation that improves the robustness of zero-shot cell-type annotation with LLMs across single-cell and spatial transcriptomics.

What it does

LLM-scCurator adds a feature distillation layer that runs before prompting. It suppresses recurrent biological and technical programs (e.g., ribosomal/mitochondrial, stress, cell cycle, TCR/Ig), rescues lineage- and state-defining markers, and then applies leakage-safe lineage filters before the marker list reaches the LLM.
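To make the masking and rescue idea concrete, here is a minimal sketch in plain Python. It is not LLM-scCurator's API or its actual rule set: the regex patterns, the `mask_markers` helper, and the choice of `TRDV2` as a rescued gene are all assumptions made for illustration, using human gene symbol conventions.

```python
import re

# Illustrative patterns only (human gene symbols); the package's real rules may differ.
NOISE_PATTERNS = {
    "ribosomal": re.compile(r"^RP[SL]\d"),
    "mitochondrial": re.compile(r"^MT-"),
    "tcr_ig": re.compile(r"^(?:TR[ABGD][VDJ]|IG[HKL][VDJ])"),
}


def mask_markers(ranked_genes, rescue=frozenset()):
    """Split a ranked marker list into kept genes and masked genes.

    A gene is masked when it matches a noise pattern, unless it appears in
    the caller-supplied `rescue` set (a stand-in for specificity-aware rescue).
    Returns (kept, masked), where `masked` maps gene -> matching program name.
    """
    kept, masked = [], {}
    for gene in ranked_genes:
        program = next(
            (name for name, pat in NOISE_PATTERNS.items() if pat.match(gene)),
            None,
        )
        if program is not None and gene not in rescue:
            masked[gene] = program
        else:
            kept.append(gene)  # original rank order is preserved
    return kept, masked


# Hypothetical ranked markers for one cluster.
markers = ["RPL13", "MT-CO1", "CD3E", "TRBV7-2", "TRDV2", "MS4A1", "RPS27"]
kept, masked = mask_markers(markers, rescue={"TRDV2"})
print(kept)
print(masked)
```

The rescue set is passed in explicitly here to keep the sketch short; in practice the decision of what to rescue would come from a specificity measure rather than a hand-written list.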

Why it helps

Stabilizes inputs: reduces “garbage-in” marker lists by masking clonotype and ubiquitous programs.
Preserves biology: specificity-aware rescue retains informative, lineage-restricted markers.
Scales to discovery: supports hierarchical, coarse-to-fine annotation for complex tissues, including spatial modalities (Visium/Xenium).

Quick start

Start here: Getting started
Key ideas (masking, rescue, leakage filter, hierarchical inference): Concepts
Practical workflows and recipes: User guide
Tutorials: Tutorials
Full API docs: API reference

Privacy

We respect the sensitivity of clinical and biological data. LLM-scCurator is designed so that raw expression matrices and cell-level metadata can remain within your local environment.

Local execution: Preprocessing, confounder masking, and feature ranking run locally on your machine.
Minimal transmission (optional): If you choose to use an external LLM API, only anonymized, cluster-level marker lists (e.g., top 50 gene symbols) and minimal tissue context are sent.
User control: You decide what additional context (e.g., disease state, treatment, platform) to include. Always follow institutional policy and the LLM provider’s terms before sharing any information.
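As a rough illustration of what "cluster-level marker lists and minimal tissue context" could look like in practice, the sketch below builds a payload that contains only cluster labels, truncated gene-symbol lists, and a tissue string. The `build_prompt_payload` function and the payload layout are hypothetical and are not taken from the LLM-scCurator codebase.

```python
import json


def build_prompt_payload(cluster_markers, tissue, top_n=50):
    """Assemble the only information that would leave the local machine.

    cluster_markers: dict mapping a cluster label to a ranked list of gene symbols.
    No expression values, cell barcodes, or sample metadata are included.
    """
    return {
        "tissue": tissue,
        "clusters": {
            str(cluster): [str(gene) for gene in genes[:top_n]]
            for cluster, genes in cluster_markers.items()
        },
    }


# Hypothetical curated markers for two clusters.
cluster_markers = {
    "0": ["CD3E", "CD2", "IL7R"],
    "1": ["MS4A1", "CD79A", "CD19"],
}
payload = build_prompt_payload(cluster_markers, tissue="peripheral blood", top_n=2)
print(json.dumps(payload, indent=2))
```

Inspecting a payload like this before sending it is a simple way to confirm that nothing beyond gene symbols and the context you chose is being shared.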

Example workflows (institutional-policy friendly)

Many institutions restrict which AI tools can be used with internal clinical or research datasets. To support these real-world constraints, we provide two end-to-end workflows that keep raw matrices and cell-level metadata local and avoid external LLM API calls unless explicitly permitted:

Fully local LLM (Ollama): Curate features and optionally annotate clusters using a local LLM backend (no external transmission). Notebook: examples/local/local_quickstart_ollama.ipynb
Local feature distillation → Approved chat LLM annotation (no external LLM API calls): Curate features locally, export a curated cluster→genes table, then annotate it via an institution-approved chat interface (e.g., Microsoft Copilot “Work”) by uploading the CSV/Excel or pasting markers. Notebook: examples/local/local_quickstart_approved_ai_workflow.ipynb
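For the second workflow, the exported cluster→genes table can be as simple as a two-column CSV. The sketch below uses only the Python standard library; the `cluster_table_csv` helper, the column names, and the semicolon separator are assumptions for illustration, and the notebook may export a different layout.

```python
import csv
import io


def cluster_table_csv(cluster_markers):
    """Render a cluster -> genes mapping as CSV text, one row per cluster.

    Genes are joined with ';' so each cluster stays on a single row that is
    easy to upload or paste into a chat interface.
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["cluster", "genes"])
    for cluster, genes in cluster_markers.items():
        writer.writerow([cluster, ";".join(genes)])
    return buffer.getvalue()


# Hypothetical curated markers for two clusters.
cluster_markers = {
    "0": ["CD3E", "CD2", "IL7R"],
    "1": ["MS4A1", "CD79A", "CD19"],
}
csv_text = cluster_table_csv(cluster_markers)
print(csv_text)
```

To save the table for upload, the returned text can be written to disk with `open(path, "w", newline="", encoding="utf-8")`.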

Links

Project home and issue tracker:

Source code and releases: GitHub
Report issues / questions: GitHub Issues