
Spatial domains labeling

novae.label_domains(adata, obs_key=None, tissue='unknown', species=None, n_genes=15, cell_type_key=None, pathways=None, spatial_context=None, provider='openai', model='gpt-4.1', api_key=None, max_tokens=1024, seed=None, return_prompt=False)

While the model.assign_domains function provides domain IDs, this function assigns biologically meaningful labels (or names) to those Novae spatial domain IDs.

Internally, it uses an LLM that is prompted with descriptive information: differentially expressed genes (DEGs) per domain, domain sizes, and, optionally, pathway enrichment scores and cell-type proportions.
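To make the kind of summary mentioned above concrete, here is a minimal sketch of how domain sizes and cell-type proportions could be computed from per-cell annotations. It is an illustration only, not Novae's implementation: the `cells` data and the `summarize_domains` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell annotations: (domain ID, cell type) for each cell.
cells = [
    ("D1", "hepatocyte"),
    ("D1", "hepatocyte"),
    ("D1", "Kupffer cell"),
    ("D2", "endothelial"),
    ("D2", "hepatocyte"),
]


def summarize_domains(cells):
    """Return (domain sizes, cell-type proportions per domain)."""
    # Number of cells in each domain.
    sizes = Counter(domain for domain, _ in cells)

    # Count each cell type within each domain.
    counts = defaultdict(Counter)
    for domain, cell_type in cells:
        counts[domain][cell_type] += 1

    # Normalize the counts by the domain size to get proportions.
    proportions = {
        domain: {ct: n / sizes[domain] for ct, n in type_counts.items()}
        for domain, type_counts in counts.items()
    }
    return dict(sizes), proportions


sizes, proportions = summarize_domains(cells)
print(sizes)
print(proportions["D2"])
```

Compact per-domain summaries like these are what make it feasible to describe a whole dataset in a single prompt.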

API key

An API key is required to use this function. You can either provide it directly as an api_key argument, or set it as an environment variable (OPENAI_API_KEY for OpenAI, ANTHROPIC_API_KEY for Anthropic). If you just want to generate the prompt without making an API call, set return_prompt=True and no API key will be required. You can then copy/paste the generated messages and output_schema into your preferred LLM playground.
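The lookup described above can be sketched as follows. This is a hedged illustration, assuming that an explicit argument takes precedence over the environment; `resolve_api_key` is a hypothetical helper and not part of Novae.

```python
import os

# Environment variable names as listed in the note above.
ENV_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def resolve_api_key(provider, api_key=None):
    """Hypothetical helper: use the explicit key, else the provider's env variable."""
    if provider not in ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    if api_key is not None:
        return api_key
    key = os.environ.get(ENV_VARS[provider])
    if key is None:
        raise RuntimeError(f"Set {ENV_VARS[provider]} or pass api_key explicitly.")
    return key
```

Setting the variable in your shell profile or job environment keeps the key out of notebooks and scripts that might be shared.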

Parameters:

Name Type Description Default
adata AnnData

An AnnData object containing the spatial domains assigned by Novae.

required
obs_key str | None

Key in adata.obs containing domain IDs to label. By default, it uses the last available Novae domain key.

None
tissue str

Tissue name (for example, "liver").

'unknown'
species str | None

Species name (for example, "human" or "mouse").

None
n_genes int

Number of marker genes per domain passed to the LLM prompt.

15
cell_type_key str | None

Optional key in adata.obs containing cell-type annotations. When provided, cell-type composition per domain is added to the LLM input.

None
pathways dict[str, list[str]] | str | None

Either a dictionary of pathways (keys are pathway names, values are lists of gene names), or a path to a GSEA JSON file. When provided, pathway enrichment scores per domain are added to the LLM input.

None
spatial_context str | None

Optional extra biological or spatial context to include in the prompt.

None
provider str

LLM provider to use. Either 'openai' or 'anthropic'.

'openai'
model str

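Name of the model used for labeling. It should be a model offered by the selected provider (the default, 'gpt-4.1', is an OpenAI model).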

'gpt-4.1'
api_key str | None

API key for the selected provider. If None, it is read from the environment (OPENAI_API_KEY for OpenAI, ANTHROPIC_API_KEY for Anthropic).

None
max_tokens int

Maximum number of tokens the model is allowed to generate for the labeling response (only used with the Anthropic provider).

1024
seed int | None

Optional random seed passed to the LLM API request.

None
return_prompt bool

If True, returns only the generated request payload (messages and output_schema) so you can copy/paste it into an LLM manually. No LLM request is made, and no API key is required.

False
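As an illustration of the parameters above, the sketch below builds a `pathways` dictionary in the documented shape (pathway names mapped to lists of gene names) and collects keyword arguments for a call. The gene sets and the `check_pathways` helper are hypothetical examples, not part of Novae.

```python
def check_pathways(pathways):
    """Hypothetical check: keys are pathway names, values are lists of gene names."""
    for name, genes in pathways.items():
        if not isinstance(name, str):
            raise TypeError(f"Pathway name must be a string, got {name!r}")
        if not isinstance(genes, list) or not all(isinstance(g, str) for g in genes):
            raise TypeError(f"Genes of {name!r} must be a list of strings")
    return pathways


# Example gene sets, chosen only for illustration.
pathways = {
    "Glycolysis": ["HK2", "PFKL", "ALDOA"],
    "Hypoxia": ["VEGFA", "CA9", "SLC2A1"],
}

# Keyword arguments mirroring the documented parameter names.
label_kwargs = {
    "tissue": "liver",
    "species": "human",
    "n_genes": 15,
    "pathways": check_pathways(pathways),
    "return_prompt": True,  # build the prompt only; no API key needed
}
```

With Novae installed and an `adata` object that already contains Novae domains, these arguments could then be passed as `novae.label_domains(adata, **label_kwargs)`.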

Returns:

Type Description
DataFrame | dict[str, dict[str, Any]]

A DataFrame with domain labels. If return_prompt=True, returns a dictionary containing messages and output_schema.
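When `return_prompt=True`, the returned dictionary can be serialized for pasting into an LLM playground. The sketch below uses a stand-in payload with the same top-level keys (`messages` and `output_schema`); the placeholder contents and the `payload_to_text` helper are assumptions for illustration, not the real prompt or schema.

```python
import json

# Stand-in for the dictionary returned with return_prompt=True.
payload = {
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "Label the following domains.\n\n<description>"},
    ],
    "output_schema": {"type": "object"},  # placeholder, not the real schema
}


def payload_to_text(payload):
    """Serialize the messages and schema as readable JSON text."""
    return json.dumps(payload, indent=2, ensure_ascii=False)


text = payload_to_text(payload)
print(text)
```

Keeping the schema alongside the messages matters because the model's answer is expected to follow that structure.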

Source code in novae/label/label.py
def label_domains(
    adata: AnnData,
    obs_key: str | None = None,
    tissue: str = "unknown",
    species: str | None = None,
    n_genes: int = 15,
    cell_type_key: str | None = None,
    pathways: dict[str, list[str]] | str | None = None,
    spatial_context: str | None = None,
    provider: str = "openai",
    model: str = "gpt-4.1",
    api_key: str | None = None,
    max_tokens: int = 1024,
    seed: int | None = None,
    return_prompt: bool = False,
) -> pd.DataFrame | dict[str, dict[str, Any]]:
    """While the [`model.assign_domains`](../Novae/#novae.Novae.assign_domains) function provides domain IDs, this function assigns biologically meaningful labels (or names) to those Novae spatial domain IDs.

    Internally, it uses an LLM that is prompted with descriptive information: differentially expressed genes (DEGs) per domain, domain sizes, and, optionally, pathway enrichment scores and cell-type proportions.

    !!! info "API key"
        An API key is required to use this function. You can either provide it directly as an `api_key` argument, or set it as an environment variable (`OPENAI_API_KEY` for OpenAI, `ANTHROPIC_API_KEY` for Anthropic).
        If you just want to generate the prompt without making an API call, set `return_prompt=True` and no API key will be required. You can then copy/paste the generated `messages` and `output_schema` into your preferred LLM playground.

    Args:
        adata: An `AnnData` object containing the spatial domains assigned by Novae.
        obs_key: Key in `adata.obs` containing domain IDs to label. By default, it uses the last available Novae domain key.
        tissue: Tissue name (for example, `"liver"`).
        species: Species name (for example, `"human"` or `"mouse"`).
        n_genes: Number of marker genes per domain passed to the LLM prompt.
        cell_type_key: Optional key in `adata.obs` containing cell-type annotations. When provided, cell-type composition per domain is added to the LLM input.
        pathways: Either a dictionary of pathways (keys are pathway names, values are lists of gene names), or a path to a [GSEA](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) JSON file. When provided, pathway enrichment scores per domain are added to the LLM input.
        spatial_context: Optional extra biological or spatial context to include in the prompt.
        provider: LLM provider to use. Either `'openai'` or `'anthropic'`.
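        model: Name of the model used for labeling. It should be a model offered by the selected provider (the default, `'gpt-4.1'`, is an OpenAI model).
        api_key: API key for the selected provider. If `None`, it is read from the environment (`OPENAI_API_KEY` for OpenAI, `ANTHROPIC_API_KEY` for Anthropic).
        max_tokens: Maximum number of tokens the model is allowed to generate for the labeling response (only used with the Anthropic provider).
        seed: Optional random seed passed to the LLM API request.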
        return_prompt: If `True`, returns only the generated request payload (`messages` and `output_schema`) so you can copy/paste it into an LLM manually. No LLM request is made, and no API key is required.

    Returns:
        A `DataFrame` with domain labels. If `return_prompt=True`, returns a dictionary containing `messages` and `output_schema`.
    """

    obs_key = utils.check_available_domains_key([adata], obs_key)
    domain_ids = adata.obs[obs_key].dropna().unique().tolist()

    description = domains_description(
        adata=adata,
        obs_key=obs_key,
        domain_ids=domain_ids,
        cell_type_key=cell_type_key,
        pathways=pathways,
        n_genes=n_genes,
    )

    messages = [
        {
            "role": "system",
            "content": _get_system_prompt(tissue=tissue, species=species, spatial_context=spatial_context),
        },
        {
            "role": "user",
            "content": f"Label the following domains.\n\n{description}",
        },
    ]

    output_schema = _get_output_schema(domain_ids=domain_ids, domain_key=obs_key)

    if return_prompt:
        return {"messages": messages, "output_schema": output_schema}

    result = api_request(
        api_key=api_key,
        provider=provider,
        model=model,
        messages=messages,
        output_schema=output_schema,
        max_tokens=max_tokens,
        seed=seed,
    )

    return pd.DataFrame(result[Keys.LABEL_SUFFIX]).set_index(obs_key)