Reference Conditioning
======================

Overview
--------

Reference conditioning models generate expression data **conditioned on a real reference sample**. This lets you "anchor" to an existing expression profile while applying perturbations or modifications.

This is useful when you want to:

- Simulate the effect of a perturbation on a specific sample
- Generate expression profiles that preserve the biological and technical characteristics of a reference
- Create synthetic "treated vs. control" pairs

Available Models
----------------

- **gem-1-bulk_reference-conditioning**: Bulk RNA-seq reference conditioning model
- **gem-1-sc_reference-conditioning**: Single-cell RNA-seq reference conditioning model

.. note::

    These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.

.. code-block:: python

    import pysynthbio

How It Works
------------

Reference conditioning encodes the biological and technical characteristics of a real expression sample, then generates new expression data that:

1. Preserves the biological/technical latent space of the reference
2. Applies any perturbation metadata you specify
3. Returns synthetic expression that reflects the perturbation effect on that specific sample

Creating a Query
----------------

Reference conditioning queries require different inputs than baseline models:

.. code-block:: python

    # Get the example query structure
    example_query = pysynthbio.get_example_query(
        model_id="gem-1-bulk_reference-conditioning"
    )["example_query"]

    # Inspect the query structure
    print(example_query)

The query structure includes:

1. **inputs**: A list where each input contains:

   - **counts**: The reference expression counts (a dictionary with a ``counts`` list)
   - **metadata**: Perturbation-only metadata (see below)
   - **num_samples**: How many samples to generate

2. **conditioning**: Which latent spaces to condition on (typically ``["biological", "technical"]``)
3. **sampling_strategy**: ``"mean estimation"`` or ``"sample generation"``

Perturbation-Only Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^

Unlike baseline models, reference conditioning queries only accept perturbation metadata fields:

- ``perturbation_ontology_id``
- ``perturbation_type``
- ``perturbation_time``
- ``perturbation_dose``

All other biological and technical metadata is inferred from the reference expression.

Example: Simulating a Drug Treatment
------------------------------------

Here's a complete example simulating a drug treatment effect on a reference sample:

.. code-block:: python

    # Start with example query structure
    query = pysynthbio.get_example_query(
        model_id="gem-1-bulk_reference-conditioning"
    )["example_query"]

    # Replace with your actual reference counts
    # The counts list must match the model's expected gene order and length
    query["inputs"][0]["counts"] = {"counts": your_reference_counts}

    # Specify the perturbation
    query["inputs"][0]["metadata"] = {
        "perturbation_ontology_id": "CHEMBL25",  # Aspirin (ChEMBL ID)
        "perturbation_type": "compound",
        "perturbation_time": "24h",
        "perturbation_dose": "10uM",
    }
    query["inputs"][0]["num_samples"] = 3

    # Set the sampling strategy
    query["sampling_strategy"] = "mean estimation"

    # Submit the query
    result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")

Example: CRISPR Knockout Simulation
-----------------------------------

Simulate the effect of knocking out a specific gene:

.. code-block:: python

    query = pysynthbio.get_example_query(
        model_id="gem-1-bulk_reference-conditioning"
    )["example_query"]

    # Your reference sample counts
    query["inputs"][0]["counts"] = {"counts": control_sample_counts}

    # CRISPR knockout of TP53
    query["inputs"][0]["metadata"] = {
        "perturbation_ontology_id": "ENSG00000141510",  # TP53 Ensembl ID
        "perturbation_type": "crispr",
    }
    query["inputs"][0]["num_samples"] = 5

    result = pysynthbio.predict_query(query, model_id="gem-1-bulk_reference-conditioning")

Query Parameters
----------------

conditioning (list, optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Controls which latent spaces are conditioned on the reference. Default is ``["biological", "technical"]``.

When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.

sampling_strategy (str, required)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Controls the type of prediction:

- **"sample generation"**: Generates realistic-looking synthetic data with measurement error. **(Bulk only)**
- **"mean estimation"**: Provides stable mean estimates. **(Bulk and single-cell)**

.. code-block:: python

    query["sampling_strategy"] = "mean estimation"

fixed_total_count (bool, optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Controls whether the output uses the reference's library size or a fixed one:

- **False** (default): The output's total count is taken from the reference expression (sum of its counts). Use this when you want the synthetic sample to preserve the reference's library size.
- **True**: Forces the model to use the ``total_count`` parameter value (or its default) instead of the reference's library size.

.. code-block:: python

    # Preserve reference library size (default)
    query["fixed_total_count"] = False

    # Or force a specific library size
    query["fixed_total_count"] = True
    query["total_count"] = 10000000

total_count (int, optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Library size used when converting predicted log CPM back to raw counts. Only effective when ``fixed_total_count = True``.

- Default: 10,000,000 for bulk; 10,000 for single-cell

deterministic_latents (bool, optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If ``True``, the model uses the mean of each latent distribution (``p(z|metadata)`` for perturbation, ``q(z|x)`` for conditioned components) instead of sampling. This produces deterministic, reproducible outputs.

- Default: ``False``

.. code-block:: python

    query["deterministic_latents"] = True

seed (int, optional)
^^^^^^^^^^^^^^^^^^^^

Random seed for reproducibility.

.. code-block:: python

    query["seed"] = 42

Valid Perturbation Metadata
---------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Field
     - Description / Format
   * - ``perturbation_ontology_id``
     - Ensembl gene ID (e.g., ``ENSG00000141510``), ChEBI ID, ChEMBL ID, or NCBI Taxonomy ID
   * - ``perturbation_type``
     - One of: "coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna"
   * - ``perturbation_time``
     - Time since perturbation (e.g., "24h", "48h")
   * - ``perturbation_dose``
     - Dose of perturbation (e.g., "10uM", "1mg/kg")

Working with Results
--------------------

The result structure is similar to baseline models:

.. code-block:: python

    # Access metadata and expression matrices
    metadata = result["metadata"]
    expression = result["expression"]

    # Compare to your reference
    print(expression.shape)
    print(metadata.head())

Differential Expression
^^^^^^^^^^^^^^^^^^^^^^^

When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:

.. code-block:: python

    import numpy as np

    # Your reference (input) counts, converted to an array so that
    # element-wise division also works when the counts are a plain list
    reference_counts = np.asarray(your_reference_counts, dtype=float)
    reference_cpm = reference_counts / np.sum(reference_counts) * 1e6

    # Generated (perturbed) counts from the first synthetic sample
    generated_counts = expression.iloc[0].to_numpy(dtype=float)
    generated_cpm = generated_counts / np.sum(generated_counts) * 1e6

    # Log fold change (pseudocount of 1 avoids log2(0))
    log2fc = np.log2(generated_cpm + 1) - np.log2(reference_cpm + 1)

    # Identify the 20 most upregulated genes, largest change first
    gene_names = expression.columns
    top_indices = np.argsort(log2fc)[-20:][::-1]
    print("Top upregulated genes:", gene_names[top_indices].tolist())

Important Notes
---------------

Counts Vector Length
^^^^^^^^^^^^^^^^^^^^

The reference counts vector must match the model's expected number of genes. If the length doesn't match, the API will return a validation error.

Use ``get_example_query()`` to see the expected structure and ensure your counts vector has the correct length.

Gene Order
^^^^^^^^^^

Ensure your reference counts follow the gene order expected by the model. The response includes a ``gene_order`` field that specifies the expected order.
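
The length and gene-order requirements above can be checked locally before a query is submitted. The following is a minimal sketch and not part of ``pysynthbio``: ``align_counts`` and ``check_length`` are hypothetical helper names, ``gene_order`` is assumed to be a list of gene identifiers such as the one in the response's ``gene_order`` field, and filling genes that are absent from your data with zero is an assumption you should confirm suits your analysis.

.. code-block:: python

    def align_counts(counts_by_gene, gene_order, fill_value=0):
        """Reorder a {gene_id: count} mapping to follow gene_order.

        Hypothetical helper. Genes in gene_order that are missing from
        counts_by_gene receive fill_value (an assumption, not documented
        API behavior). Genes not present in gene_order are dropped and
        returned separately so they can be inspected.
        """
        expected = set(gene_order)
        unknown = sorted(gene for gene in counts_by_gene if gene not in expected)
        aligned = [counts_by_gene.get(gene, fill_value) for gene in gene_order]
        return aligned, unknown


    def check_length(counts, expected_length):
        """Raise ValueError if the counts vector has the wrong length."""
        if len(counts) != expected_length:
            raise ValueError(
                f"counts has {len(counts)} entries, expected {expected_length}"
            )


    # Toy data for illustration only; real gene lists are much longer.
    gene_order = ["ENSG_A", "ENSG_B", "ENSG_C"]
    my_counts = {"ENSG_B": 5, "ENSG_A": 2, "ENSG_Z": 9}

    aligned, unknown = align_counts(my_counts, gene_order)
    check_length(aligned, len(gene_order))
    print(aligned)   # counts in model order, missing genes filled with 0
    print(unknown)   # genes in your data that the model does not expect

For a real query, the expected length could be read from the example query, for instance ``len(example_query["inputs"][0]["counts"]["counts"])``, assuming the example counts vector has the full length the model expects.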