Prior to this change, every alert instance write and delete happened
individually, in its own database transaction. This change batches all
of the writes or deletes from a given rule's evaluation loop into a
single transaction before applying them.
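As a sketch of the idea (not the actual store code; the table columns and the helper name are invented for illustration), batching turns N single-row transactions into one multi-row statement inside one transaction:
```
package store

import (
	"context"
	"database/sql"
	"strings"
)

// AlertInstance is a pared-down stand-in for ngalert's model type.
type AlertInstance struct {
	RuleOrgID  int64
	RuleUID    string
	LabelsHash string
	State      string
}

// saveAlertInstancesBatch writes every instance from one evaluation loop in a
// single transaction, instead of opening one transaction per instance.
func saveAlertInstancesBatch(ctx context.Context, db *sql.DB, instances []AlertInstance) error {
	if len(instances) == 0 {
		return nil
	}
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	placeholders := make([]string, 0, len(instances))
	args := make([]interface{}, 0, len(instances)*4)
	for _, inst := range instances {
		placeholders = append(placeholders, "(?, ?, ?, ?)")
		args = append(args, inst.RuleOrgID, inst.RuleUID, inst.LabelsHash, inst.State)
	}
	query := "INSERT INTO alert_instance (rule_org_id, rule_uid, labels_hash, current_state) VALUES " +
		strings.Join(placeholders, ", ")
	if _, err := tx.ExecContext(ctx, query, args...); err != nil {
		return err
	}
	return tx.Commit()
}
```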
Before:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 398 2991381 ns/op 1133537 B/op 27703 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: FovKXiRVzm} with title: "an alert definition FTvFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: foDFXmRVkm} with title: "an alert definition fovFXmRVkz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: VQvFuigVkm} with title: "an alert definition VwDKXmR4kz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.619s
```
After:
```
goos: darwin
goarch: arm64
pkg: github.com/grafana/grafana/pkg/services/ngalert/store
BenchmarkAlertInstanceOperations-8 1440 816484 ns/op 352297 B/op 6529 allocs/op
--- BENCH: BenchmarkAlertInstanceOperations-8
util.go:127: alert definition: {orgID: 1, UID: 302r_igVzm} with title: "an alert definition q0h9lmR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: 71hrlmR4km} with title: "an alert definition nJ29_mR4zz" interval: 60 created
util.go:127: alert definition: {orgID: 1, UID: Cahr_mR4zm} with title: "an alert definition ja2rlmg4zz" interval: 60 created
PASS
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.383s
```
So, for storing and deleting 100 instances, we cut time by about 73%,
memory use by about 69%, and allocations by about 76%.
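To reproduce, run the benchmark with allocation reporting (standard `go test` flags; `-run=^$` skips the regular tests):
```
go test -run=^$ -bench=BenchmarkAlertInstanceOperations -benchmem ./pkg/services/ngalert/store/
```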
This change also updates some of our tests so that they run successfully against PostgreSQL: we were generating random int64 IDs, but the `integer` columns our tables use max out at 2^31-1 in Postgres.
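A sketch of the kind of helper that fix implies (the name is hypothetical, not the actual test util):
```
package util

import (
	"math"
	"math/rand"
)

// randomID returns an ID that fits in a Postgres "integer" column.
// A raw rand.Int63() can exceed 2^31-1 and overflow that column type.
// Hypothetical helper, for illustration only.
func randomID() int64 {
	return rand.Int63n(math.MaxInt32 + 1) // 0 .. 2^31-1
}
```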
In the state manager's flush loop, each state was previously saved in its own call:
```
for _, state := range ruleStates {
	err := st.saveState(ctx, state)
	totalStates++
	if err != nil {
		st.log.Error("failed to save alert state", append(state.GetRuleKey().LogContext(), "labels", state.Labels.String(), "state", state.State.String(), "err", err.Error())...)
		errorsCnt++
	}
}
```
Now states accumulate into a slice that is flushed per rule:
```
// Flush the state batch if we would go over the batch size when adding the next loop.
if len(ruleStates)+len(stateSlice) > stateBatchSize {
	writeBatch()
}
for _, state := range ruleStates {
	totalStates++
	// Take everything for a single rule, even if it exceeds the batch
	// size. We'd like to flush everything for one rule as a unit.
	stateSlice = append(stateSlice, state)
}
```
Once the org and rule loops finish, one last flush writes any remainder:
```
writeBatch()
st.log.Info("the state has been flushed", "total_instances", totalStates, "errors", errorsCnt, "took", st.clock.Since(t))
```
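`writeBatch` itself doesn't appear in these excerpts. A minimal sketch of how such a closure could sit at the top of the method, assuming a batched store call named `saveAlertStates` (that name, its signature, and the error handling are assumptions, not the actual code):
```
stateSlice := make([]*State, 0, stateBatchSize)
writeBatch := func() {
	if len(stateSlice) == 0 {
		return
	}
	// One store call, so one transaction covers the whole batch.
	// saveAlertStates is an assumed helper, not the real store API.
	if err := st.saveAlertStates(ctx, stateSlice...); err != nil {
		st.log.Error("failed to save alert states", "count", len(stateSlice), "err", err)
		errorsCnt += len(stateSlice)
	}
	stateSlice = stateSlice[:0] // keep the backing array, drop the contents
}
```
Truncating with `stateSlice[:0]` reuses the buffer between flushes instead of allocating a fresh slice each time.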
addToQuery("SELECT alert_instance.*, alert_rule.title AS rule_title FROM alert_instance LEFT JOIN alert_rule ON alert_instance.rule_org_id = alert_rule.org_id AND alert_instance.rule_uid = alert_rule.uid WHERE rule_org_id = ?",cmd.RuleOrgID)
addToQuery("SELECT * FROM alert_instance WHERE rule_org_id = ?",cmd.RuleOrgID)
Finally, the batched upsert checks that the argument list it built lines up with the row width before executing; a guard along these lines (the surrounding condition is inferred from the error message):
```
// args grows by fieldsPerRow values per instance, so any remainder
// means the statement and its arguments are out of sync.
if len(args)%fieldsPerRow != 0 {
	return fmt.Errorf("failed to upsert alert instances. Last statement had %v fields, which is not a multiple of the number of fields per row, %v", len(args), fieldsPerRow)
}
```