Scientific Intent, Precisely Matched: BioMate's Hybrid Workflow Search

Natural language is ambiguous. Scientific analysis is not. Bridging that gap — reliably, at scale — is one of the harder engineering problems in building a bioinformatics platform that actually works for researchers who are not bioinformaticians.

Why Routing Is Hard

When a researcher says "analyze gene expression changes in my treated samples," they might mean bulk RNA-seq with differential expression analysis, single-cell transcriptomics with pseudotime trajectory inference, or spatial transcriptomics with region-specific contrasts. They might be working with mouse or human, with fresh frozen or FFPE tissue, with a time-series experiment or a simple two-condition comparison. The same words map to dozens of scientifically distinct workflows, and choosing the wrong one is not a cosmetic error.

Keyword matching fails here because it compares strings, not meaning: given a query about "mutations," it has no way to tell from context whether the researcher means somatic mutations, germline variants, or copy number alterations. Simple similarity search fails because biological terminology is dense with synonyms, abbreviations, and discipline-specific conventions that general-purpose embeddings represent poorly.
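
To make the keyword failure concrete, here is a small sketch that scores workflows by token overlap (Jaccard similarity). The workflow names and descriptions are invented for illustration and are not BioMate's actual catalog. The query uses the abbreviation "scRNA-seq", which shares no tokens with a description written as "single-cell transcriptomics", so the overlap score favors the wrong workflow.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into word-like tokens (hyphens kept)."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical workflow descriptions, for illustration only.
workflows = {
    "scrnaseq_trajectory": "single-cell transcriptomics trajectory inference",
    "bulk_rnaseq_de": "bulk RNA-seq differential expression analysis",
}

query = "scRNA-seq pseudotime analysis"

scores = {
    name: jaccard(tokens(query), tokens(description))
    for name, description in workflows.items()
}
best = max(scores, key=scores.get)

print(scores)
print("keyword match picks:", best)
```

The only shared token is the generic word "analysis", so the bulk workflow wins even though the query plainly describes a single-cell trajectory analysis.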

A Three-Layer Approach

BioMate's routing system combines three signals. The first is dense semantic search: query and workflow descriptions are both encoded into a shared vector space using embeddings trained on biomedical literature, capturing relationships between concepts that keyword matching misses. The second is structured domain scoring: the query is parsed for biological signals — organism, data type, analytical goal, disease context — and these are matched against the structured metadata of candidate workflows. The third is a reasoning layer that resolves remaining ambiguity by reading the full query in context and applying biological domain knowledge to break ties.
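
The first two signals can be sketched roughly as follows. This is an illustration under stated assumptions, not BioMate's implementation: the three-dimensional vectors stand in for real biomedical embeddings, the metadata keys, workflow names, and equal weights are hypothetical, and the reasoning layer is left out entirely.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Workflow:
    name: str
    embedding: list[float]  # toy stand-in for a biomedical text embedding
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def domain_score(signals: dict[str, str], metadata: dict[str, str]) -> float:
    """Fraction of parsed query signals that agree with the workflow metadata."""
    if not signals:
        return 0.0
    hits = sum(1 for key, value in signals.items() if metadata.get(key) == value)
    return hits / len(signals)


def rank(query_embedding, signals, workflows, w_dense=0.5, w_domain=0.5):
    """Combine both signals with assumed weights; return (score, name), best first."""
    scored = [
        (
            w_dense * cosine(query_embedding, wf.embedding)
            + w_domain * domain_score(signals, wf.metadata),
            wf.name,
        )
        for wf in workflows
    ]
    return sorted(scored, reverse=True)


# Hypothetical catalog entries, for illustration only.
catalog = [
    Workflow("bulk_rnaseq_de", [0.9, 0.1, 0.0],
             {"organism": "human", "data_type": "bulk_rnaseq",
              "goal": "differential_expression"}),
    Workflow("scrnaseq_trajectory", [0.8, 0.3, 0.1],
             {"organism": "human", "data_type": "scrnaseq",
              "goal": "trajectory"}),
    Workflow("somatic_variant_calling", [0.1, 0.2, 0.9],
             {"organism": "human", "data_type": "wgs",
              "goal": "variant_calling"}),
]

# Signals a query parser might extract (assumed output format).
signals = {"organism": "human", "data_type": "bulk_rnaseq",
           "goal": "differential_expression"}
query_embedding = [0.85, 0.2, 0.05]

ranked = rank(query_embedding, signals, catalog)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data the two transcriptomics workflows are nearly indistinguishable on the dense signal alone; the structured metadata match is what separates them. Cases where both signals stay close are the ones a reasoning layer would have to resolve.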

"Getting the workflow right is the first responsibility of the platform. Every subsequent step depends on it."

Graceful Disambiguation

When the routing system identifies genuine ambiguity — two workflows that are both plausible given the query, with meaningful scientific differences between them — it does not guess. It surfaces a disambiguation card that presents the top candidates with plain-language descriptions of what each one does, what data it expects, and what question it is best suited to answer. The researcher makes the call; the system records the choice and uses it to improve future routing for similar queries.
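
One way to picture the "does not guess" behavior is a score-margin check. The sketch below is an assumption made for illustration: the article does not say how ambiguity is detected, and the margin value, field names, and in-memory choice log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str       # hypothetical workflow identifier
    score: float    # combined routing score, higher is better
    does: str       # plain-language description of the workflow
    expects: str    # input data the workflow expects
    best_for: str   # question the workflow is best suited to answer


choice_log = []  # (query, chosen workflow) pairs kept for later routing


def route_or_ask(candidates, margin=0.10, top_k=2):
    """Route directly when one candidate clearly leads; otherwise build a card."""
    if not candidates:
        raise ValueError("no candidate workflows")
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    clear_winner = len(ranked) == 1 or ranked[0].score - ranked[1].score >= margin
    if clear_winner:
        return {"action": "route", "workflow": ranked[0].name}
    card = [
        {"workflow": c.name, "does": c.does,
         "expects": c.expects, "best_for": c.best_for}
        for c in ranked[:top_k]
    ]
    return {"action": "disambiguate", "card": card}


def record_choice(query, workflow):
    """Remember what the researcher picked for this query."""
    choice_log.append((query, workflow))


bulk = Candidate("bulk_rnaseq_de", 0.71,
                 "Compares average expression between conditions",
                 "Bulk RNA-seq reads or a count matrix",
                 "Which genes change between treated and control samples?")
single_cell = Candidate("scrnaseq_trajectory", 0.68,
                        "Orders individual cells along a pseudotime trajectory",
                        "Single-cell RNA-seq count matrix",
                        "How do cell states shift in response to treatment?")

query = "analyze gene expression changes in my treated samples"
decision = route_or_ask([bulk, single_cell])
if decision["action"] == "disambiguate":
    # In a real interface the researcher would choose; here the choice is simulated.
    record_choice(query, "bulk_rnaseq_de")

print(decision["action"], choice_log)
```

A fixed margin is the simplest possible trigger. It does not capture the second condition in the text, that the candidates differ in scientifically meaningful ways, which would need workflow-level knowledge beyond the scores.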

What this means for you

You do not need to know the workflow name to find the right one. Describe your experiment and your question, and BioMate finds the match — explaining its reasoning and asking for clarification only when the ambiguity is real and consequential.