evolvers
evolvers compiles your intent and taste into self-improving artifacts. Provide criteria, examples, or both — the LLM iterates toward what you meant.
Install
```bash
uv add evolvers lurkers
```

`lurkers` is optional; only the Quick start uses it. For your own data, `uv add evolvers` is enough.
Quick start
evolvers is async-primary — train, evaluate, and calling an Evolvable are coroutines.
End-to-end runnable example: fetch a small dataset with lurkers, then train a TLDR program against it.
```python
import asyncio

import evolvers as ev
import lurkers


def tldr(input_text: str, llm) -> str:
    """Summarize input_text as a TLDR (~140 chars)."""
    return input_text[:130] + "..."


async def main():
    # Bring your own data; here, three arXiv abstracts.
    docs = await asyncio.gather(
        lurkers.afetch("https://arxiv.org/abs/1706.03762"),  # Attention Is All You Need
        lurkers.afetch("https://arxiv.org/abs/2005.14165"),  # GPT-3
        lurkers.afetch("https://arxiv.org/abs/2310.06825"),  # Mistral 7B
    )
    dataset = [d.content for d in docs]

    llm = ev.LLM(model="claude-opus-4-7")
    evo = ev.Evolvable(
        tldr,
        criteria=[
            ev.judge("Does it directly summarize the main points as a TLDR?"),
            ev.code(
                lambda output_text:
                    max(-1.0, 1 - 2 * max(0, (len(output_text) - 140) / 140))
            ),
        ],
        llm=llm,
    )

    await evo.train(dataset, num_train_epochs=10)
    print(evo.source)  # the function body the optimizer settled on

    evo.save("you/tldr-v1:claude-opus-4-7")
    reloaded = ev.Evolvable.load("you/tldr-v1:claude-opus-4-7")
    print(await reloaded(dataset[0]))


asyncio.run(main())
```

Sync wrappers (`evo.train_sync`, `evo.evaluate_sync`, `evo.call_sync`) exist for non-async codebases.
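The length criterion in the Quick start is a linear penalty on characters past 140. The sketch below pulls that same formula into a named function so its shape is easier to see; `length_score` and its `limit` parameter are illustrative names, not part of evolvers.

```python
def length_score(output_text: str, limit: int = 140) -> float:
    """Same formula as the Quick start lambda, written out step by step."""
    # Fraction of the limit by which the output overshoots (0.0 if it fits).
    overshoot = max(0.0, (len(output_text) - limit) / limit)
    # Full marks when it fits, losing 2 points per 100% overshoot, floored at -1.
    return max(-1.0, 1.0 - 2.0 * overshoot)


for n in (100, 140, 175, 210, 280, 500):
    print(n, length_score("x" * n))
```

Anything at or under 140 characters scores 1.0, 210 characters scores 0.0, and the score bottoms out at -1.0 from 280 characters onward.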
Concepts
Evolvable
Wraps a function, its criteria, and an LLM. Calling it runs the function; if the function has an llm parameter, the bound LLM is passed in. Your function can be sync or async — both work.
After training, evo.source is the function body the optimizer produced — read it to see what it wrote.
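One way to picture the calling convention described above (inject the LLM only when the function declares an `llm` parameter, and accept sync or async functions alike) is the following standard-library sketch. `call_program` is a hypothetical helper written for illustration; it is not evolvers' actual implementation.

```python
import asyncio
import inspect


async def call_program(fn, input_value, llm):
    """Run fn on one input, injecting llm only if fn asks for it."""
    kwargs = {}
    if "llm" in inspect.signature(fn).parameters:
        kwargs["llm"] = llm  # pass the bound LLM by keyword
    result = fn(input_value, **kwargs)
    if inspect.isawaitable(result):  # async functions hand back a coroutine
        result = await result
    return result


def shout(text):  # sync, no llm parameter
    return text.upper()


async def tag(text, llm):  # async, wants the llm
    return f"{llm}:{text}"


print(asyncio.run(call_program(shout, "hi", llm="fake-llm")))
print(asyncio.run(call_program(tag, "hi", llm="fake-llm")))
```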
| Method | What it does |
|---|---|
| `await evo.train(dataset, num_train_epochs=N)` | Propose-test-accept-or-revert loop. Each epoch proposes a new function body, scores it against the dataset, and keeps it only if the score improves. `num_train_epochs` defaults to 20. Returns a dict with `best_score`, `best_source`, and `history`. |
| `await evo.evaluate(dataset)` | Scores the current version without changing it. Returns a dict with the aggregate score and a `per_criterion` breakdown. |
| `await evo(input)` | Runs the current best version on one input. |
| `evo.save("owner/name:variant")` | Saves the program and its criteria under `~/.cache/evolvers/` (override the location with the `EVOLVERS_CACHE` env var). |
| `Evolvable.load("owner/name:variant")` | Loads a saved program. |
| `evo.clone().set_llm(other_llm)` | A copy bound to a different LLM. |
| `evo.train_sync(...)` / `evo.evaluate_sync(...)` / `evo.call_sync(...)` | Sync wrappers for non-async codebases. |
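The propose-test-accept-or-revert loop behind `train` can be sketched as a plain hill climb. This is a toy under stated assumptions: `train_sketch`, `propose`, and `score` are hypothetical stand-ins (the real loop is async and asks an LLM to propose function bodies), and the exact contents of `history` are a guess. Only the three dict keys come from the table above.

```python
import random


def train_sketch(initial, propose, score, num_train_epochs=20, seed=0):
    """Keep a candidate only if it scores strictly better; otherwise revert."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(num_train_epochs):
        candidate = propose(best, rng)         # propose
        candidate_score = score(candidate)     # test
        if candidate_score > best_score:       # accept ...
            best, best_score = candidate, candidate_score
        history.append(best_score)             # ... or revert (best unchanged)
    return {"best_score": best_score, "best_source": best, "history": history}


# Toy stand-ins: the "program" is just a truncation length, ideal value 140.
def score(cutoff):
    return max(-1.0, 1.0 - abs(cutoff - 140) / 140)


def propose(cutoff, rng):
    return cutoff + rng.choice([-20, -5, 5, 20])


result = train_sketch(60, propose, score, num_train_epochs=50)
print(result["best_source"], round(result["best_score"], 3))
```

Because a worse candidate is never kept, the recorded best score can only stay flat or rise from one epoch to the next.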
Criterion
Two kinds of criterion. Mix them freely — each has a weight, and the score is the weighted mean.
| Factory | What it scores |
|---|---|
| `ev.judge(question)` | Natural-language LLM-as-judge. Sees the program's input and output; returns a score in [-1, 1] plus reasoning. |
| `ev.code(callable)` | A plain Python function. Takes one argument (the output) or two (input, output); returns a number in [-1, 1]. |
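As a rough illustration of the two ideas above, the sketch below dispatches a code criterion by its argument count and combines per-criterion scores into a weighted mean. `run_code_criterion` and `weighted_mean` are hypothetical helpers, and how evolvers actually inspects callables or stores weights is not specified here.

```python
import inspect


def run_code_criterion(fn, input_value, output_value):
    """Call fn(output) or fn(input, output), depending on its arity."""
    n_params = len(inspect.signature(fn).parameters)
    if n_params == 1:
        return fn(output_value)
    return fn(input_value, output_value)


def weighted_mean(scores, weights):
    """Combine per-criterion scores in [-1, 1] into one aggregate score."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total


# One-argument criterion: is the output short?
short = lambda output: 1.0 if len(output) <= 5 else -1.0
# Two-argument criterion: does the output appear in the input?
grounded = lambda inp, output: 1.0 if output in inp else -1.0

scores = [
    run_code_criterion(short, "hello world", "world"),
    run_code_criterion(grounded, "hello world", "mars"),
]
print(scores, weighted_mean(scores, weights=[3.0, 1.0]))
```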
LLM
One wrapper for Anthropic and any OpenAI-compatible endpoint. Same interface for both:
```python
opus = ev.LLM(model="claude-opus-4-7")
local = ev.LLM(model="deepkek", base_url="http://localhost:8001/v1")
```

Credentials come from the standard provider env vars (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`); pass `api_key=` to override.
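The override rule amounts to "an explicit key wins, otherwise fall back to the environment". A minimal sketch of that precedence, assuming a hypothetical `resolve_api_key` helper (how evolvers picks which env var applies to a given model is not covered here):

```python
import os


def resolve_api_key(env_var, api_key=None, environ=None):
    """Return the explicit api_key if given, else the value of env_var."""
    environ = os.environ if environ is None else environ
    if api_key is not None:
        return api_key  # explicit argument overrides the environment
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} or pass api_key=")
    return key


fake_env = {"ANTHROPIC_API_KEY": "key-from-env"}
print(resolve_api_key("ANTHROPIC_API_KEY", environ=fake_env))
print(resolve_api_key("ANTHROPIC_API_KEY", api_key="explicit-key", environ=fake_env))
```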
| Method | What it does |
|---|---|
| `await llm(prompt, *, schema=..., system=...)` | Single call. Returns `str`, or a parsed pydantic instance if `schema` is given. |
| `await llm.batch(prompts, **kwargs)` | Runs many prompts concurrently. |
| `llm.call_sync(...)` / `llm.batch_sync(...)` | Sync wrappers. |
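For intuition, concurrent batching and a sync wrapper can both be built from the standard library. The sketch below is an assumption-laden stand-in, not the library's code: `fake_llm`, `batch`, `batch_sync`, and the `max_concurrency` parameter are all hypothetical names.

```python
import asyncio


async def fake_llm(prompt):
    """Stand-in for a network call; just upper-cases the prompt."""
    await asyncio.sleep(0)
    return prompt.upper()


async def batch(call, prompts, max_concurrency=4):
    """Run call(prompt) for every prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await call(prompt)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


def batch_sync(call, prompts, **kwargs):
    """Blocking wrapper: starts an event loop, runs the batch, returns results."""
    return asyncio.run(batch(call, prompts, **kwargs))


print(batch_sync(fake_llm, ["first prompt", "second prompt"]))
```

A wrapper built on `asyncio.run` like this one cannot be called from code that is already running inside an event loop; in async code, await the coroutine directly.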
Source
github.com/tiramisu-sh/evolvers · Apache-2.0