Metadata Prediction
===================

Overview
--------

Metadata prediction models **infer biological metadata from observed expression data**. Given a gene expression profile, the model predicts the likely biological characteristics such as cell type, tissue, disease state, and more.

This is useful when you want to:

- Annotate samples of unknown origin
- Validate sample labels against expression patterns
- Discover potential mislabeled or contaminated samples
- Understand the biological characteristics captured in expression data

Available Models
----------------

- **gem-1-bulk_predict-metadata**: Bulk RNA-seq metadata prediction model
- **gem-1-sc_predict-metadata**: Single-cell RNA-seq metadata prediction model

.. note::
   These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.

.. code-block:: python

    import pysynthbio

How It Works
------------

Metadata prediction encodes your expression data into the model's latent space and then uses classifiers to predict the most likely metadata values for each sample. The model returns:

1. **Classifier probabilities**: For each categorical metadata field, the probability distribution over possible values
2. **Predicted labels**: The most likely value for each metadata field
3. **Latent representations**: The biological, technical, and perturbation latent vectors

Creating a Query
----------------

Metadata prediction queries are simpler than other model types—you only need to provide expression counts:

.. code-block:: python

    # Get the example query structure
    example_query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]

    # Inspect the query structure
    print(example_query)

The query structure includes:

1. **inputs**: A list of count vectors, where each element is a dictionary with a ``counts`` field containing expression values

2. **seed** (optional): Random seed for reproducibility

Example: Predicting Sample Metadata
-----------------------------------

Here's a complete example predicting metadata for expression samples:

.. code-block:: python

    # Start with example query structure
    query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]

    # Replace with your actual expression counts
    # Each input should be a dictionary with a counts list
    query["inputs"] = [
        {"counts": sample1_counts},
        {"counts": sample2_counts},
        {"counts": sample3_counts}
    ]

    # Optional: set seed for reproducibility
    query["seed"] = 42

    # Submit the query
    result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")

Example: Single Sample Prediction
---------------------------------

For predicting metadata of a single sample:

.. code-block:: python

    query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]

    # Single sample
    query["inputs"] = [
        {"counts": my_sample_counts}
    ]

    result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")

    # Access the predictions for the first (and only) sample
    print(result[0]["metadata"])

Query Parameters
----------------

inputs (list, required)
^^^^^^^^^^^^^^^^^^^^^^^

A list of expression count vectors. Each element should be a dictionary containing:

- **counts**: A list of non-negative integers representing gene expression counts

.. code-block:: python

    query["inputs"] = [
        {"counts": [0, 12, 5, 0, 33, 7, ...]},  # Sample 1
        {"counts": [3, 0, 0, 7, 1, 0, ...]}     # Sample 2
    ]

seed (int, optional)
^^^^^^^^^^^^^^^^^^^^

Random seed for reproducibility.

.. code-block:: python

    query["seed"] = 123

Understanding the Results
-------------------------

The results from metadata prediction are returned as a **list of output dictionaries**, one per input sample. Each output dictionary contains:

- ``metadata``: Predicted metadata values for the sample
- ``classifier_probs``: Probability distributions over possible values for each metadata field
- ``latents``: Latent representations (biological, technical, perturbation)

.. code-block:: python

    # result is a list of outputs, one per input sample
    print(f"Number of outputs: {len(result)}")

    # Access the first sample's output
    first_output = result[0]
    print(first_output.keys())  # dict_keys(['metadata', 'classifier_probs', 'latents'])

Predicted Metadata
^^^^^^^^^^^^^^^^^^

Each output's ``metadata`` field contains the predicted values for that sample:

.. code-block:: python

    # Access predictions for each sample
    for i, output in enumerate(result):
        print(f"Sample {i}: {output['metadata']}")

    # Access specific predictions for first sample
    first_sample = result[0]["metadata"]
    print(first_sample.get("cell_type_ontology_id"))
    print(first_sample.get("tissue_ontology_id"))
    print(first_sample.get("disease_ontology_id"))

Classifier Probabilities
^^^^^^^^^^^^^^^^^^^^^^^^

For categorical metadata fields, the model returns probability distributions over all possible values. These are useful for understanding prediction confidence:

.. code-block:: python

    # Access cell type probabilities for first sample
    first_output = result[0]
    cell_type_probs = first_output["classifier_probs"]["cell_type"]
    sorted_probs = sorted(cell_type_probs.items(), key=lambda x: x[1], reverse=True)
    print("Top predicted cell types:", sorted_probs[:5])

Latent Representations
^^^^^^^^^^^^^^^^^^^^^^

The model also returns latent vectors that capture biological, technical, and perturbation characteristics:

.. code-block:: python

    # Access latent representations for first sample
    first_output = result[0]
    biological_latents = first_output["latents"]["biological"]
    technical_latents = first_output["latents"]["technical"]

Use Cases
---------

Sample Annotation
^^^^^^^^^^^^^^^^^

Annotate unlabeled samples with predicted metadata:

.. code-block:: python

    import pandas as pd

    # Load your unlabeled samples
    unlabeled_counts = pd.read_csv("unlabeled_samples.csv", index_col=0)

    # Create query
    query = pysynthbio.get_example_query(model_id="gem-1-bulk_predict-metadata")["example_query"]
    query["inputs"] = [
        {"counts": unlabeled_counts.iloc[:, i].tolist()}
        for i in range(unlabeled_counts.shape[1])
    ]

    # Predict metadata
    result = pysynthbio.predict_query(query, model_id="gem-1-bulk_predict-metadata")

    # Combine with sample IDs - result is a list of outputs
    annotations = pd.DataFrame([output["metadata"] for output in result])
    annotations["sample_id"] = unlabeled_counts.columns.tolist()

Quality Control
^^^^^^^^^^^^^^^

Validate existing sample labels against predicted metadata:

.. code-block:: python

    # Compare predicted vs. provided labels
    provided_labels = ["UBERON:0002107", "UBERON:0002107", "UBERON:0000955", "UBERON:0000955"]
    predicted_labels = [output["metadata"].get("tissue_ontology_id") for output in result]

    # Identify potential mismatches
    mismatches = [
        i for i, (p, pred) in enumerate(zip(provided_labels, predicted_labels))
        if p != pred
    ]
    if mismatches:
        print(f"Potential mislabeled samples: {mismatches}")

Batch Characterization
^^^^^^^^^^^^^^^^^^^^^^

Understand batch-specific technical characteristics:

.. code-block:: python

    import numpy as np

    # Group samples by batch
    batch_labels = ["batch1", "batch1", "batch2", "batch2"]

    # Check if technical predictions cluster by batch
    # This can help identify batch effects
    # Extract technical latents from each sample's output
    technical_latents = [output["latents"]["technical"] for output in result]
    for batch in set(batch_labels):
        batch_indices = [i for i, b in enumerate(batch_labels) if b == batch]
        batch_mean = np.mean([technical_latents[i][0] for i in batch_indices])
        print(f"{batch} technical latent mean: {batch_mean}")

Important Notes
---------------

Counts Vector Length
^^^^^^^^^^^^^^^^^^^^

The counts vector for each sample must match the model's expected number of genes. If the length doesn't match, the API will return a validation error.

Use ``get_example_query()`` to see the expected structure.

Gene Order
^^^^^^^^^^

Ensure your counts are in the same gene order expected by the model. The gene order should match what the baseline model expects—you can retrieve this from any prediction result's ``gene_order`` field.

Non-Negative Counts
^^^^^^^^^^^^^^^^^^^

All count values must be non-negative integers. Floats that are whole numbers (like ``10.0``) are accepted, but negative values will cause validation errors.