Author: Tracer-Cloud
Description: Build your own AI SRE agents. The open source toolkit for the AI era ✨
The open-source framework for AI SRE agents, and the training and evaluation environment they need to improve. Connect the 60+ tools you already run, define your own workflows, and investigate incidents on your own infrastructure.
Quickstart · Docs · FAQ · Security
🚧 Public Alpha: Core workflows are usable for early exploration, though not yet fully stable. The project is in active development, and APIs and integrations may evolve.
When something breaks in production, the evidence is scattered across logs, metrics, traces, runbooks, and Slack threads. OpenSRE is an open-source framework for AI SRE agents that resolve production incidents, built to run on your own infrastructure.
We build this because SWE-bench¹ gave coding agents scalable training data and clear feedback. Production incident response still lacks an equivalent.
Distributed failures are slower, noisier, and harder to simulate and evaluate than local code tasks, which is why AI SRE, and AI for production debugging more broadly, remains unsolved.
OpenSRE is building that missing layer: an open reinforcement learning environment for agentic infrastructure incident response, with end-to-end tests and synthetic incident simulations of realistic production failures.
Our mission is to build AI SRE agents on top of this environment, scale it to thousands of realistic infrastructure failure scenarios, and establish OpenSRE as the benchmark and training ground for AI SRE.
¹ https://arxiv.org/abs/2310.06770
```bash
curl -fsSL https://raw.githubusercontent.com/Tracer-Cloud/opensre/main/install.sh | bash
```
```bash
brew install Tracer-Cloud/opensre/opensre
```
```powershell
irm https://raw.githubusercontent.com/Tracer-Cloud/opensre/main/install.ps1 | iex
```
```bash
opensre onboard
opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
opensre update
```
Before running opensre deploy railway, make sure the target Railway project has
both Postgres and Redis services, and that your OpenSRE service has DATABASE_URI
and REDIS_URI set to those connection strings. The containerized LangGraph
runtime will not boot without those backing services wired in.
```bash
opensre deploy railway --project
```
If the deploy starts but the service never becomes healthy, verify that
DATABASE_URI and REDIS_URI are present on the Railway service and point to the
project Postgres and Redis instances.
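As a concrete sketch, the two required variables might look like this. The hostnames and credentials below are placeholders for shape only, not real defaults; use the connection strings Railway generates for your project's Postgres and Redis services:

```bash
# Placeholder values — substitute the connection strings from your
# Railway Postgres and Redis services before deploying.
export DATABASE_URI="postgresql://user:password@postgres.railway.internal:5432/opensre"
export REDIS_URI="redis://default:password@redis.railway.internal:6379"
```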
After deploying a hosted service, you can run post-deploy operations from the CLI:
```bash
opensre remote ops --provider railway --project
```
OpenSRE saves your last used provider/project/service, so you can run:
```bash
opensre remote ops status
opensre remote ops logs --follow
```
New to OpenSRE? See SETUP.md for detailed platform-specific setup instructions, including Windows setup, environment configuration, and more.
```bash
git clone https://github.com/Tracer-Cloud/opensre
cd opensre
make install
opensre onboard
opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
```
When an alert fires, OpenSRE automatically begins a structured investigation across your connected signals.
Generate the benchmark report:
```shell
make benchmark
```
| Feature | Description |
| --- | --- |
| 🔍 Structured incident investigation | Correlated root-cause analysis across all your signals |
| 📋 Runbook-aware reasoning | OpenSRE reads your runbooks and applies them automatically |
| 🔮 Predictive failure detection | Catch emerging issues before they page you |
| 🔗 Evidence-backed root cause | Every conclusion is linked to the data behind it |
| 🤖 Full LLM flexibility | Bring your own model — Anthropic, OpenAI, Ollama, Gemini, OpenRouter, NVIDIA NIM |
OpenSRE connects to 60+ tools and services across the modern cloud stack, from LLM providers and observability platforms to infrastructure, databases, and incident management.
| Category | Integrations | Roadmap |
| --- | --- | --- |
| AI / LLM Providers | Anthropic · OpenAI · Ollama · Google Gemini · OpenRouter · NVIDIA NIM · Bedrock | |
| Observability | Grafana (Loki · Mimir · Tempo) · Datadog · Honeycomb · Coralogix · CloudWatch · Sentry · Elasticsearch | Splunk · New Relic · Victoria Logs |
| Infrastructure | Kubernetes · AWS (S3 · Lambda · EKS · EC2 · Bedrock) · GCP · Azure | Helm · ArgoCD |
| Database | MongoDB · ClickHouse | PostgreSQL · MySQL · MariaDB · MongoDB Atlas · Azure SQL · RDS · Snowflake |
| Data Platform | Apache Airflow · Apache Kafka · Apache Spark · Prefect | RabbitMQ |
| Dev Tools | GitHub · GitHub MCP · Bitbucket | GitLab |
| Incident Management | PagerDuty · Opsgenie · Jira | ServiceNow · incident.io · Alertmanager · Linear · Trello |
| Communication | Slack · Google Docs | Discord · Teams · WhatsApp · Confluence · Notion |
| Agent Deployment | Vercel · LangSmith · EC2 · ECS | Railway |
| Protocols | MCP · ACP · OpenClaw | |
OpenSRE is community-built. Every integration, improvement, and bug fix makes it better for thousands of engineers. We actively review PRs and welcome contributors of all experience levels.
Good first issues are labeled `good first issue`. See CONTRIBUTING.md for the full guide and the ways to contribute.
Thanks go to our amazing contributors.
OpenSRE is designed with production environments in mind.
See SECURITY.md for responsible disclosure.
opensre collects anonymous usage statistics with PostHog to help us understand adoption and demonstrate traction to the sponsors and investors who fund the project.
What we collect: command name, success/failure, rough runtime, CLI version,
Python version, OS family, machine architecture, and a small amount of
command-specific metadata such as which subcommand ran. For opensre onboard
and opensre investigate, we may also collect the selected model/provider and
whether the command used flags such as --interactive or --input.
A randomly generated anonymous ID is created on first run and stored in
~/.config/opensre/. We never collect alert contents, file contents,
hostnames, credentials, or any personally identifiable information.
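The first-run ID flow described above could be sketched roughly like this. This is an illustration of the mechanism only, not the actual opensre implementation; `uuidgen` (with an `od`-based fallback) stands in for whatever generator the CLI uses:

```bash
# Illustrative sketch of a first-run anonymous ID (not the real
# opensre code). The ID is created once and reused on later runs;
# it contains no machine-identifying data.
CONFIG_DIR="$HOME/.config/opensre"
ID_FILE="$CONFIG_DIR/anonymous_id"
mkdir -p "$CONFIG_DIR"
if [ ! -s "$ID_FILE" ]; then
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen > "$ID_FILE"                       # random UUID
    else
        od -An -N16 -tx1 /dev/urandom | tr -d ' \n' > "$ID_FILE"  # random hex fallback
    fi
fi
cat "$ID_FILE"
```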
Telemetry is automatically disabled in GitHub Actions and pytest runs.
To opt out locally, set the environment variable before running:
```bash
export OPENSRE_NO_TELEMETRY=1
```
The legacy alias OPENSRE_ANALYTICS_DISABLED=1 also still works.
To inspect the payload locally without sending anything, use:
```bash
export OPENSRE_TELEMETRY_DEBUG=1
```
Apache 2.0 - see LICENSE for details.