rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq, single-cell RNA-seq, microarray data, and more.
Alternatively, you can generate datasets directly on our platform website.
How to install
You can install rsynthbio from CRAN:
install.packages("rsynthbio")
If you want the development version, you can install it from GitHub using the remotes package:
# Install remotes first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("synthesizebio/rsynthbio")
Once installed, load the package:
library(rsynthbio)
Authentication
Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication:
# Securely prompt for and store your API token
# The token will not be visible in the console
set_synthesize_token()
# You can also store the token in your system keyring for persistence
# across R sessions (requires the 'keyring' package)
set_synthesize_token(use_keyring = TRUE)
Loading your API token for a session
# In future sessions, load the stored token
load_synthesize_token_from_keyring()
# Check if a token is already set
has_synthesize_token()
You can obtain an API token by registering at Synthesize Bio.
Security Best Practices
For security reasons, remember to clear your token when you’re done:
# Clear token from current session
clear_synthesize_token()
# Clear token from both session and keyring
clear_synthesize_token(remove_from_keyring = TRUE)
Never hard-code your token in scripts that will be shared or committed to version control.
Basic Usage
Creating a Query
The first step in generating gene expression data is to create a query. The package provides a sample query that you can modify:
# Get a sample query
query <- get_valid_query()
# Inspect the query structure
str(query)
The query consists of:
- output_modality: The type of gene expression data to generate (see get_valid_modalities)
- mode: The prediction mode (e.g., “mean estimation” or “sample generation”)
- inputs: A list of biological conditions to generate data for
We train our models on diverse multi-omics datasets. Two prediction modes are available today:
- Sample generation: This runs in “diffusion” mode and generates different results for each sample requested. Use this mode to understand the distribution of expression across sample groups.
- Mean estimation: This is deterministic. For a given metadata specification, you will always get the same values.
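As a minimal sketch of switching between the two modes, the snippet below edits the mode field of a query list. It assumes the query is a plain R list with the three fields listed above and that the mode strings match the quoted names exactly. The helper set_query_mode and the mock query are hypothetical and not part of rsynthbio; in practice you would start from get_valid_query().

```r
# Hypothetical helper (not part of rsynthbio): set the prediction mode on a
# query list, rejecting anything other than the two modes described above.
set_query_mode <- function(query,
                           mode = c("mean estimation", "sample generation")) {
  mode <- match.arg(mode)  # errors if the mode is not one of the two choices
  query$mode <- mode
  query
}

# Mock query so the example runs offline. The modality value and the empty
# inputs list are placeholders, not real API values.
mock_query <- list(
  output_modality = "placeholder_modality",
  mode = "mean estimation",
  inputs = list()
)

# Switch the mock query to sample generation and inspect the field
sampled_query <- set_query_mode(mock_query, "sample generation")
sampled_query$mode
```

Returning a modified copy rather than changing the query in place keeps the original available, which is convenient when you want to compare both modes for the same inputs.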
# Request raw counts data
result <- predict_query(query, as_counts = TRUE)
The result is a list of two data frames: metadata and expression.
Making Predictions
Once your query is ready, you can send it to the API to generate gene expression data.
# Request raw counts data
result <- predict_query(query, as_counts = TRUE)
If you want the full API response rather than just the metadata and expression data frames, set raw_response = TRUE.
Working with Results
# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression
# Check dimensions
dim(expression)
# View metadata sample
head(metadata)
You may want to process the data in chunks or save it for later use:
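A minimal sketch of both options, using only base R. The mock result below stands in for the output of predict_query() so the example runs without an API token. Its column names, values, and the layout with samples in rows are assumptions for illustration, not the actual API output.

```r
# Mock result with the same two components as described above.
# All names and values here are made up for illustration.
set.seed(1)
result <- list(
  metadata = data.frame(sample_id = paste0("sample_", 1:6)),
  expression = as.data.frame(
    matrix(rpois(6 * 4, lambda = 10), nrow = 6,
           dimnames = list(NULL, paste0("gene_", 1:4)))
  )
)

# Save the whole result to disk and read it back in a later session
rds_path <- file.path(tempdir(), "synthesize_result.rds")
saveRDS(result, rds_path)
restored <- readRDS(rds_path)

# Process the expression data in chunks of rows (2 rows per chunk here)
chunk_size <- 2
row_ids <- seq_len(nrow(result$expression))
chunks <- split(row_ids, ceiling(row_ids / chunk_size))

# Example per-chunk computation: total counts in each chunk
chunk_totals <- vapply(
  chunks,
  function(rows) {
    as.numeric(sum(as.matrix(result$expression[rows, , drop = FALSE])))
  },
  numeric(1)
)
```

saveRDS() keeps the list structure and column types intact, so the restored object can be used exactly like the original. For sharing with non-R tools, write.csv() on each data frame is a reasonable alternative.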
Custom Validation
You can validate your queries before sending them to the API:
# Validate structure
validate_query(query)
# Validate modality
validate_modality(query)