Overview
Reference conditioning models generate expression data conditioned on a real reference sample. This allows you to “anchor” to an existing expression profile while applying perturbations or modifications.
This is useful when you want to:
- Simulate the effect of a perturbation on a specific sample
- Generate expression profiles that preserve the biological and technical characteristics of a reference
- Create synthetic “treated vs. control” pairs
Available Models
-
gem-1-bulk_reference-conditioning: Bulk RNA-seq reference conditioning model -
gem-1-sc_reference-conditioning: Single-cell RNA-seq reference conditioning model
Note: These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.
How It Works
Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that:
- Preserves the biological/technical latent space of the reference
- Applies any perturbation metadata you specify
- Returns synthetic expression that reflects the perturbation effect on that specific sample
Creating a Query
Reference conditioning queries require different inputs than baseline models:
# Get the example query structure
example_query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query
# Inspect the query structure
str(example_query)The query structure includes:
-
inputs: A list where each input contains:-
counts: The reference expression counts (a named list with acountsvector) -
metadata: Perturbation-only metadata (see below) -
num_samples: How many samples to generate
-
conditioning: Which latent spaces to condition on (typically["biological", "technical"])sampling_strategy:"mean estimation"or"sample generation"
Example: Simulating a Drug Treatment
Here’s a complete example simulating a drug treatment effect on a reference sample:
# Start with example query structure
query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query
# Replace with your actual reference counts
# The counts vector must match the model's expected gene order and length
query$inputs[[1]]$counts <- list(counts = your_reference_counts)
# Specify the perturbation
query$inputs[[1]]$metadata <- list(
perturbation_ontology_id = "CHEMBL25", # Aspirin (ChEMBL ID)
perturbation_type = "compound",
perturbation_time = "24h",
perturbation_dose = "10uM"
)
query$inputs[[1]]$num_samples <- 3
# Set the sampling strategy
query$sampling_strategy <- "mean estimation"
# Submit the query
result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")Example: CRISPR Knockout Simulation
Simulate the effect of knocking out a specific gene:
query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query
# Your reference sample counts
query$inputs[[1]]$counts <- list(counts = control_sample_counts)
# CRISPR knockout of TP53
query$inputs[[1]]$metadata <- list(
perturbation_ontology_id = "ENSG00000141510", # TP53 Ensembl ID
perturbation_type = "crispr"
)
query$inputs[[1]]$num_samples <- 5
result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")Query Parameters
conditioning (list, optional)
Controls which latent spaces are conditioned on the reference.
Default is ["biological", "technical"].
When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.
sampling_strategy (character, required)
Controls the type of prediction:
- “sample generation”: Generates realistic-looking synthetic data with measurement error. (Bulk only)
- “mean estimation”: Provides stable mean estimates. (Bulk and single-cell)
query$sampling_strategy <- "mean estimation"fixed_total_count (logical, optional)
Controls whether to preserve the reference’s library size:
-
FALSE(default): The output’s total count is taken from the reference expression (sum of its counts). Use this when you want the synthetic sample to preserve the reference’s library size. -
TRUE: Forces the model to use thetotal_countparameter value (or default) instead of the reference’s library size.
# Preserve reference library size (default)
query$fixed_total_count <- FALSE
# Or force a specific library size
query$fixed_total_count <- TRUE
query$total_count <- 10000000total_count (integer, optional)
Library size used when converting predicted log CPM back to raw
counts. Only effective when fixed_total_count = TRUE.
- Default: 10,000,000 for bulk; 10,000 for single-cell
Valid Perturbation Metadata
| Field | Description / Format |
|---|---|
perturbation_ontology_id |
Ensembl gene ID (e.g., ENSG00000141510), ChEBI ID, ChEMBL ID, or NCBI Taxonomy ID
|
perturbation_type |
One of: “coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna” |
perturbation_time |
Time since perturbation (e.g., “24h”, “48h”) |
perturbation_dose |
Dose of perturbation (e.g., “10uM”, “1mg/kg”) |
Working with Results
The result structure is similar to baseline models:
# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression
# Compare to your reference
dim(expression)
head(metadata)Differential Expression
When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:
# Your reference (input) counts
reference_cpm <- your_reference_counts / sum(your_reference_counts) * 1e6
# Generated (perturbed) counts
generated_cpm <- expression[1, ] / sum(expression[1, ]) * 1e6
# Log fold change
log2fc <- log2(generated_cpm + 1) - log2(reference_cpm + 1)
# Identify top changed genes
head(sort(log2fc, decreasing = TRUE), 20)Important Notes
Counts Vector Length
The reference counts vector must match the model’s expected number of genes. If the length doesn’t match, the API will return a validation error.
Use get_example_query() to see the expected structure
and ensure your counts vector has the correct length.
