Overview
Metadata prediction models infer biological metadata from observed expression data. Given a gene expression profile, the model predicts the likely biological characteristics such as cell type, tissue, disease state, and more.
This is useful when you want to:
- Annotate samples of unknown origin
- Validate sample labels against expression patterns
- Discover potentially mislabeled or contaminated samples
- Understand the biological characteristics captured in expression data
Available Models
- gem-1-bulk_predict-metadata: Bulk RNA-seq metadata prediction model
- gem-1-sc_predict-metadata: Single-cell RNA-seq metadata prediction model
Note: These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.
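For scripted use, one way to absorb the startup delay is to retry the call a few times. The sketch below is a minimal illustration, not part of the package: with_retry() is a hypothetical helper, and it assumes that a call to an endpoint that is still starting up signals an ordinary R error. A stand-in function is used here so the example runs on its own.

```r
# Hypothetical helper: retry a call that may fail while an endpoint warms up.
# Assumption: the call signals an R error when the endpoint is not ready.
with_retry <- function(fn, attempts = 5, wait_seconds = 30) {
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (i < attempts) {
      message("Attempt ", i, " failed: ", conditionMessage(result),
              "; retrying in ", wait_seconds, "s")
      Sys.sleep(wait_seconds)
    }
  }
  stop("All ", attempts, " attempts failed: ", conditionMessage(last_error))
}

# Stand-in for a slow-starting endpoint: fails twice, then succeeds.
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("endpoint not ready")
  "ok"
}

retry_result <- with_retry(flaky_call, attempts = 5, wait_seconds = 0)
```

With a real query, the wrapped call would look like with_retry(function() predict_query(query, model_id = "gem-1-bulk_predict-metadata")), with wait_seconds chosen to cover the 1-2 minute startup window.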
How It Works
Metadata prediction encodes your expression data into the model’s latent space and then uses classifiers to predict the most likely metadata values for each sample. The model returns:
- Classifier probabilities: For each categorical metadata field, the probability distribution over possible values
- Predicted labels: The most likely value for each metadata field
- Latent representations: The biological, technical, and perturbation latent vectors
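The relationship between classifier probabilities and predicted labels can be illustrated with a toy example. The probabilities below are invented for illustration, and the sketch assumes the predicted label is simply the highest-probability value for a field; the actual response format may differ.

```r
# Toy probability distribution for one sample and one categorical field.
# The numbers are invented for illustration only.
tissue_probs <- c(
  "UBERON:0002107" = 0.72,
  "UBERON:0000955" = 0.20,
  "UBERON:0002048" = 0.08
)

# Assumption: the predicted label is the value with the highest probability.
predicted_label <- names(tissue_probs)[which.max(tissue_probs)]

# Two simple confidence summaries: the top probability and its margin
# over the runner-up.
ranked <- sort(tissue_probs, decreasing = TRUE)
top_probability <- unname(ranked[1])
margin <- unname(ranked[1] - ranked[2])
```

A small margin between the top two values suggests the prediction is ambiguous and may be worth reviewing by hand.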
Creating a Query
Metadata prediction queries are simpler than those for other model types: you only need to provide expression counts.
# Get the example query structure
example_query <- get_example_query(model_id = "gem-1-bulk_predict-metadata")$example_query
# Inspect the query structure
str(example_query)
The query structure includes:
- inputs: A list of count vectors, where each element is a named list with a counts field containing expression values
- seed (optional): Random seed for reproducibility
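To make the structure concrete, the sketch below builds a query by hand as a plain R list. The counts are toy values, far shorter than a real expression profile, and whether the API expects any fields beyond inputs and seed is not stated here, so get_example_query() remains the reference for the exact shape.

```r
# Hand-built query mirroring the inputs/seed structure.
# The counts are toy values; a real query needs one value per gene
# the model expects.
toy_query <- list(
  inputs = list(
    list(counts = c(0, 12, 3, 0, 45)),
    list(counts = c(7, 0, 0, 19, 2))
  ),
  seed = 42  # optional
)

# Basic structural checks before submitting.
stopifnot(
  is.list(toy_query$inputs),
  all(vapply(toy_query$inputs, function(x) is.numeric(x$counts), logical(1)))
)
n_samples <- length(toy_query$inputs)
```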
Example: Predicting Sample Metadata
Here’s a complete example predicting metadata for expression samples:
# Start with example query structure
query <- get_example_query(model_id = "gem-1-bulk_predict-metadata")$example_query
# Replace with your actual expression counts
# Each input should be a list with a counts vector
query$inputs <- list(
  list(counts = sample1_counts),
  list(counts = sample2_counts),
  list(counts = sample3_counts)
)
# Optional: set seed for reproducibility
query$seed <- 42
# Submit the query
result <- predict_query(query, model_id = "gem-1-bulk_predict-metadata")
Example: Single Sample Prediction
For predicting metadata of a single sample:
query <- get_example_query(model_id = "gem-1-bulk_predict-metadata")$example_query
# Single sample
query$inputs <- list(
  list(counts = my_sample_counts)
)
result <- predict_query(query, model_id = "gem-1-bulk_predict-metadata")
# Access the predictions
print(result$outputs$metadata)
Query Parameters
Understanding the Results
The results from metadata prediction include several components:
Predicted Metadata
The metadata data frame contains the predicted values for each sample:
# View predicted metadata
head(result$outputs$metadata)
# Access specific predictions
result$outputs$metadata$cell_type_ontology_id
result$outputs$metadata$tissue_ontology_id
result$outputs$metadata$disease_ontology_id
Classifier Probabilities
For categorical metadata fields, the model returns probability distributions over all possible values. These are useful for understanding prediction confidence:
# If probabilities are included in the output
# Access cell type probabilities for first sample
# The exact structure depends on the API response format
# Example: viewing top predicted cell types
cell_type_probs <- result$outputs$classifier_probs$cell_type[[1]]
head(sort(cell_type_probs, decreasing = TRUE))
Use Cases
Sample Annotation
Annotate unlabeled samples with predicted metadata:
# Load your unlabeled samples
unlabeled_counts <- read.csv("unlabeled_samples.csv", row.names = 1)
# Create query
query <- get_example_query(model_id = "gem-1-bulk_predict-metadata")$example_query
# One input per sample (column); rows are assumed to be genes
query$inputs <- lapply(seq_len(ncol(unlabeled_counts)), function(i) {
  list(counts = unlabeled_counts[, i])
})
# Predict metadata
result <- predict_query(query, model_id = "gem-1-bulk_predict-metadata")
# Combine with sample IDs
annotations <- result$outputs$metadata
annotations$sample_id <- colnames(unlabeled_counts)
Quality Control
Validate existing sample labels against predicted metadata:
# Compare predicted vs. provided labels
provided_labels <- c("UBERON:0002107", "UBERON:0002107", "UBERON:0000955", "UBERON:0000955")
predicted_labels <- result$outputs$metadata$tissue_ontology_id
# Identify potential mismatches
mismatches <- which(provided_labels != predicted_labels)
if (length(mismatches) > 0) {
  message("Potential mislabeled samples: ", paste(mismatches, collapse = ", "))
}
Important Notes
Counts Vector Length
The counts vector for each sample must match the model’s expected number of genes. If the length doesn’t match, the API will return a validation error.
Use get_example_query() to see the expected structure.
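A length check before submission can catch this locally instead of waiting for the API's validation error. The sketch below is illustrative only: check_counts_length() is a hypothetical helper, the data are toy values, and it assumes the counts vector in the example query has the length the model expects.

```r
# Hypothetical helper: check every sample's counts vector against an
# expected length and report the offending sample positions.
check_counts_length <- function(inputs, expected_length) {
  lengths_found <- vapply(inputs, function(x) length(x$counts), integer(1))
  bad <- which(lengths_found != expected_length)
  if (length(bad) > 0) {
    stop("Samples with unexpected counts length: ",
         paste(bad, collapse = ", "),
         " (expected ", expected_length, ")")
  }
  invisible(TRUE)
}

# Toy data standing in for the example query and your own inputs.
example_inputs <- list(list(counts = c(1, 2, 3, 4)))
my_inputs <- list(
  list(counts = c(5, 0, 2, 9)),
  list(counts = c(3, 3, 1))  # one gene short
)

# Assumption: the example query's counts vector has the expected length.
expected_length <- length(example_inputs[[1]]$counts)
check <- tryCatch(
  check_counts_length(my_inputs, expected_length),
  error = function(e) conditionMessage(e)
)
```

With a real query, expected_length could be taken from length(example_query$inputs[[1]]$counts) under the same assumption.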
