Commit b0a7e00
[BREAKING] Only store hash of distributions over time (#12)
The Evaluation resource has been storing the full list of replicas in each history record. With a large number of replicas, this still blows up the size of the object. This change removes the distribution from history records on evaluations. The idea is that we store the last known projected winning distribution and compare it to the current one; if the current one becomes the winner, we store it instead. At no point should we need to know any other non-winning distribution, other than its hash, how many times we've seen it, and when we last saw it. There is an edge case: a while ago there was a highly wanted distribution that is no longer wanted, but its total seen count is still higher than any of the current distributions. When that large part of the history goes out of bounds and gets erased, we might not have a current projection anymore. In that case the evaluator will go into a not-ready state for a while, until the new projected winning distribution is clear. This change updates CRDs in place by removing fields, which is ok because this is still beta.
1 parent 05125ce commit b0a7e00

7 files changed: +198 −137 lines changed

DESIGN.md

Lines changed: 10 additions & 0 deletions
@@ -329,6 +329,16 @@ This will guarantee that we will use an up to date configuration once every pred
 It may not be the most efficient accordingly to the latest data, but it's reflecting well the most recent history,
 without risking to fall behind too much.
 
+There is an edge case that needs to be explained in a little more details.
+The `MostWantedTwoPhaseHysteresisEvaluation` CRD only stores hashes of the recent history for repeated evaluation.
+It cannot store full distribution plans as it would quickly get out of etcd object size limits.
+Therefore, there could be a scenario when the most-wanted but old record is about to be erased,
+while the new projected winner is not the current partitioner output.
+Thus - the current projection is effectively unknown (only its hash and last seen time and total seen count is known).
+When that happens - evaluator will enter a not-ready state.
+Eventually, either another partitioning plan will surpass this currently unknown projection, or -
+partitioner will give us another sample with the current projection details that evaluator will be able to remember.
+
 ## Scaler
 
 Using input from `.spec.partitionProviderRef` - Scaler can ensure that intention becomes reality.
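To make the edge case concrete, here is a self-contained sketch of the selection logic using simplified stand-in types; `record`, `hashOf`, and the serialized strings are illustrative only, not the project's actual API. When the most-seen hash matches neither the current partitioner output nor the remembered projection, the evaluator can only report not-ready.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// record mirrors the idea of a history entry after this change:
// only a hash and a seen counter, never the full distribution.
type record struct {
	hash      string
	seenTimes int32
}

func hashOf(serialized string) string {
	sum := sha256.Sum256([]byte(serialized))
	return hex.EncodeToString(sum[:])
}

func main() {
	// An old, much-wanted plan dominates the history by seen count,
	// but its full distribution was never remembered.
	history := []record{
		{hash: hashOf("replica-0:shard-a;replica-1:shard-b"), seenTimes: 5},
		{hash: hashOf("replica-0:shard-a,shard-b"), seenTimes: 3},
	}

	current := "replica-0:shard-a,shard-b" // partitioner's latest output
	projection := ""                       // last remembered full distribution (empty = unknown)

	// Pick the most-seen record.
	top := history[0]
	for _, r := range history[1:] {
		if r.seenTimes > top.seenTimes {
			top = r
		}
	}

	if top.hash == hashOf(current) {
		projection = current // winner is observable right now: remember it
	}
	if projection == "" || top.hash != hashOf(projection) {
		// Winner is neither the current output nor the remembered projection:
		// only its hash is known, so the evaluator must go not-ready until the
		// partitioner emits that plan again or another plan overtakes it.
		fmt.Println("not ready: awaiting new projection")
		return
	}
	fmt.Println("projection:", projection)
}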

api/autoscaler/v1alpha1/mostwantedtwophasehysteresisevaluation_types.go

Lines changed: 2 additions & 2 deletions
@@ -36,9 +36,9 @@ type MostWantedTwoPhaseHysteresisEvaluationStatusHistoricalRecord struct {
 	// +kubebuilder:validation:Required
 	Timestamp metav1.Time `json:"timestamp,omitempty"`
 
-	// Replicas is the partition as it was seen at this moment in time.
+	// ReplicasHash is the hash of serialized replicas string.
 	// +kubebuilder:validation:Required
-	Replicas common.ReplicaList `json:"replicas,omitempty"`
+	ReplicasHash string `json:"replicasHash,omitempty"`
 
 	// SeenTimes is the counter of how many times have this record been seen.
 	// +kubebuilder:validation:Required
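The controller changes further down compare these hashes via a `Sha256` helper that is not part of this diff. A plausible implementation, assuming it simply hex-encodes the SHA-256 digest of the serialized replica list (package placement and encoding are assumptions), would look like:

// Sketch of the Sha256 helper referenced by the controller below.
// Assumption: it returns the hex-encoded SHA-256 digest of its input;
// the actual helper in the repository may differ.
package autoscaler

import (
	"crypto/sha256"
	"encoding/hex"
)

func Sha256(data string) string {
	digest := sha256.Sum256([]byte(data))
	return hex.EncodeToString(digest[:])
}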

api/autoscaler/v1alpha1/zz_generated.deepcopy.go

Lines changed: 0 additions & 7 deletions
Some generated files are not rendered by default.

config/crd/bases/autoscaler.argoproj.io_mostwantedtwophasehysteresisevaluations.yaml

Lines changed: 5 additions & 99 deletions
@@ -150,104 +150,10 @@ spec:
                 changes over time.
               items:
                 properties:
-                  replicas:
-                    description: Replicas is the partition as it was seen at this
-                      moment in time.
-                    items:
-                      description: Replica is a representation of the replica for
-                        sharding
-                      properties:
-                        id:
-                          description: ID of the replica, starting from 0 and onward.
-                          format: int32
-                          type: integer
-                        loadIndexes:
-                          description: LoadIndexes shards assigned to this replica
-                            wrapped into their load index.
-                          items:
-                            description: LoadIndex is a representation of a shard
-                              with calculated load index to it.
-                            properties:
-                              displayValue:
-                                description: |-
-                                  DisplayValue is the string representation of the without precision guarantee.
-                                  This is meaningless and exists purely for convenience of someone who is looking at the kubectl get output.
-                                type: string
-                              shard:
-                                description: Shard is the shard that this load index
-                                  is calculated for.
-                                properties:
-                                  id:
-                                    description: |-
-                                      ID of this shard. It may or may not be unique, depending on the discoverer.
-                                      For a secret with type=cluster label, this would be the name of the secret.
-                                    type: string
-                                  name:
-                                    description: |-
-                                      Name of this shard.
-                                      This must be the same as the name of this destination cluster as seen by Application Controller.
-                                    type: string
-                                  namespace:
-                                    description: |-
-                                      Namespace of this shard.
-                                      For a secret with type=cluster label, this would be the namespace of the secret.
-                                      If shard is managed externally - it is expected to be set to some value.
-                                      Same as the Application Controller is in - would be a logical choice.
-                                    type: string
-                                  server:
-                                    description: |-
-                                      Server of this shard.
-                                      This must be the same as the server URL of this destination cluster as seen by Application Controller.
-                                    type: string
-                                  uid:
-                                    description: |-
-                                      UID unique identifier of this shard.
-                                      There is multiple seemingly duplicative fields here, but this is the only one that is unique.
-                                      For example, when the shard is represented as a secret with type=cluster label,
-                                      the UID of the secret is a UID of the shard.
-                                      Meaning that it would change if the secret is re-created.
-                                      That's what is meant by "unique" in this context.
-                                      When the discovery was external - this may be arbitrary string unique to that shard.
-                                    type: string
-                                required:
-                                - id
-                                - name
-                                - namespace
-                                - server
-                                - uid
-                                type: object
-                              value:
-                                anyOf:
-                                - type: integer
-                                - type: string
-                                description: Value is a value of this load index.
-                                pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
-                                x-kubernetes-int-or-string: true
-                            required:
-                            - displayValue
-                            - shard
-                            - value
-                            type: object
-                          type: array
-                        totalLoad:
-                          anyOf:
-                          - type: integer
-                          - type: string
-                          description: TotalLoad is the sum of all load indexes
-                            assigned to this replica.
-                          pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
-                          x-kubernetes-int-or-string: true
-                        totalLoadDisplayValue:
-                          description: |-
-                            DisplayValue is the string representation of the total load without precision guarantee.
-                            This is meaningless and exists purely for convenience of someone who is looking at the kubectl get output.
-                          type: string
-                      required:
-                      - loadIndexes
-                      - totalLoad
-                      - totalLoadDisplayValue
-                      type: object
-                    type: array
+                  replicasHash:
+                    description: ReplicasHash is the hash of serialized replicas
+                      string.
+                    type: string
                   seenTimes:
                     description: SeenTimes is the counter of how many times have
                       this record been seen.
@@ -259,7 +165,7 @@ spec:
                     format: date-time
                     type: string
                 required:
-                - replicas
+                - replicasHash
                 - seenTimes
                 - timestamp
                 type: object

internal/controller/autoscaler/mostwantedtwophasehysteresisevaluation_controller.go

Lines changed: 50 additions & 29 deletions
@@ -133,6 +133,8 @@ func init() {
 
 // For more details, check Reconcile and its Result here:
 // - https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.19.4/pkg/reconcile
+//
+//nolint:gocyclo
 func (r *MostWantedTwoPhaseHysteresisEvaluationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
 	log := log.FromContext(ctx)
 	log.V(2).Info("Received reconcile request")
@@ -202,12 +204,12 @@ func (r *MostWantedTwoPhaseHysteresisEvaluationReconciler) Reconcile(ctx context
 		return ctrl.Result{}, nil
 	}
 
-	replicas := partitionProvider.GetPartitionProviderStatus().Replicas
-	log.V(2).Info("Currently reported replicas by partition provider", "count", len(replicas))
+	currentReplicas := partitionProvider.GetPartitionProviderStatus().Replicas
+	log.V(2).Info("Currently reported replicas by partition provider", "count", len(currentReplicas))
 
 	seenBefore := false
 	for i, record := range evaluation.Status.History {
-		if record.Replicas.SerializeToString() == replicas.SerializeToString() {
+		if record.ReplicasHash == Sha256(currentReplicas.SerializeToString()) {
 			seenBefore = true
 			evaluation.Status.History[i].SeenTimes++
 			evaluation.Status.History[i].Timestamp = metav1.Now()
@@ -217,9 +219,9 @@ func (r *MostWantedTwoPhaseHysteresisEvaluationReconciler) Reconcile(ctx context
 	if !seenBefore {
 		evaluation.Status.History = append(evaluation.Status.History,
 			autoscalerv1alpha1.MostWantedTwoPhaseHysteresisEvaluationStatusHistoricalRecord{
-				Timestamp: metav1.Now(),
-				Replicas:  replicas,
-				SeenTimes: 1,
+				Timestamp:    metav1.Now(),
+				ReplicasHash: Sha256(currentReplicas.SerializeToString()),
+				SeenTimes:    1,
 			},
 		)
 	}
@@ -237,45 +239,65 @@ func (r *MostWantedTwoPhaseHysteresisEvaluationReconciler) Reconcile(ctx context
 		evaluation.Status.History = cleanHistory
 	}
 
-	historyRecords := map[string]autoscalerv1alpha1.MostWantedTwoPhaseHysteresisEvaluationStatusHistoricalRecord{}
-	historyRecordsLastSeen := map[string]metav1.Time{}
-	historyRecorsSeenTimes := map[string]int32{}
+	historyRecordsByHash := map[string]autoscalerv1alpha1.MostWantedTwoPhaseHysteresisEvaluationStatusHistoricalRecord{}
+	historyRecordsByHashLastSeen := map[string]metav1.Time{}
+	historyRecorsByHashSeenTimes := map[string]int32{}
 	for _, record := range evaluation.Status.History {
-		serializedRecord := record.Replicas.SerializeToString()
-		log.V(2).Info("Noticing record", "record", serializedRecord)
-		if _, ok := historyRecordsLastSeen[serializedRecord]; !ok {
-			historyRecords[serializedRecord] = record
-			historyRecordsLastSeen[serializedRecord] = record.Timestamp
-			historyRecorsSeenTimes[serializedRecord] = record.SeenTimes
-		} else if record.Timestamp.After(historyRecordsLastSeen[serializedRecord].Time) {
-			historyRecordsLastSeen[serializedRecord] = record.Timestamp
-			historyRecorsSeenTimes[serializedRecord] += record.SeenTimes
+		hash := record.ReplicasHash
+		log.V(2).Info("Noticing record", "record", hash)
+		if _, ok := historyRecordsByHashLastSeen[hash]; !ok {
+			historyRecordsByHash[hash] = record
+			historyRecordsByHashLastSeen[hash] = record.Timestamp
+			historyRecorsByHashSeenTimes[hash] = record.SeenTimes
+		} else if record.Timestamp.After(historyRecordsByHashLastSeen[hash].Time) {
+			historyRecordsByHashLastSeen[hash] = record.Timestamp
+			historyRecorsByHashSeenTimes[hash] += record.SeenTimes
 		} else {
-			historyRecorsSeenTimes[serializedRecord] += record.SeenTimes
+			historyRecorsByHashSeenTimes[hash] += record.SeenTimes
 		}
 	}
 
 	topSeenRecord := autoscalerv1alpha1.MostWantedTwoPhaseHysteresisEvaluationStatusHistoricalRecord{}
 	maxSeenCount := int32(0)
-	for serializedRecord, seenTimes := range historyRecorsSeenTimes {
-		log.V(2).Info("Evaluating records", "record", serializedRecord, "seenTimes", seenTimes)
+	for hash, seenTimes := range historyRecorsByHashSeenTimes {
+		log.V(2).Info("Evaluating records", "record", hash, "seenTimes", seenTimes)
 		if seenTimes > maxSeenCount {
 			maxSeenCount = seenTimes
-			topSeenRecord = historyRecords[serializedRecord]
+			topSeenRecord = historyRecordsByHash[hash]
 		} else if seenTimes == maxSeenCount &&
-			historyRecords[serializedRecord].Timestamp.After(topSeenRecord.Timestamp.Time) {
-			log.V(2).Info("Tie breaker", "left", topSeenRecord, "right", serializedRecord)
-			topSeenRecord = historyRecords[serializedRecord]
+			historyRecordsByHash[hash].Timestamp.After(topSeenRecord.Timestamp.Time) {
+			log.V(2).Info("Tie breaker", "left", topSeenRecord.ReplicasHash, "right", hash)
+			topSeenRecord = historyRecordsByHash[hash]
 		}
 	}
-	log.V(2).Info("Top seen record", "record", topSeenRecord.Replicas.SerializeToString())
+	log.V(2).Info("Top seen record", "record", topSeenRecord.ReplicasHash)
+	if topSeenRecord.ReplicasHash == Sha256(currentReplicas.SerializeToString()) {
+		log.V(1).Info("A new election projection updated", "record", topSeenRecord.ReplicasHash)
+		evaluation.Status.Projection = currentReplicas
+	}
+	if topSeenRecord.ReplicasHash != Sha256(evaluation.Status.Projection.SerializeToString()) {
+		err := fmt.Errorf("top seen record is neither current projection nor current partition")
+		log.Error(err, "Failed to evaluate")
+		meta.SetStatusCondition(&evaluation.Status.Conditions, metav1.Condition{
+			Type:    StatusTypeReady,
+			Status:  metav1.ConditionFalse,
+			Reason:  "AwaitingNewProjection",
+			Message: err.Error(),
+		})
+		if err := r.Status().Update(ctx, evaluation); err != nil {
+			log.V(1).Info("Failed to update resource status", "err", err)
			return ctrl.Result{}, err
+		}
+		// In this case re-queuing will change nothing
+		return ctrl.Result{}, nil
+	}
 	mostWantedTwoPhaseHysteresisEvaluationProjectedShardsGauge.DeletePartialMatch(prometheus.Labels{
 		"evaluation_ref": req.NamespacedName.String(),
 	})
 	mostWantedTwoPhaseHysteresisEvaluationProjectedReplicasTotalLoadGauge.DeletePartialMatch(prometheus.Labels{
 		"evaluation_ref": req.NamespacedName.String(),
 	})
-	for _, replica := range topSeenRecord.Replicas {
+	for _, replica := range evaluation.Status.Projection {
 		for _, li := range replica.LoadIndexes {
 			mostWantedTwoPhaseHysteresisEvaluationProjectedShardsGauge.WithLabelValues(
 				req.NamespacedName.String(),
@@ -293,10 +315,9 @@ func (r *MostWantedTwoPhaseHysteresisEvaluationReconciler) Reconcile(ctx context
 		).Set(replica.TotalLoad.AsApproximateFloat64())
 	}
 
-	evaluation.Status.Projection = topSeenRecord.Replicas
 	if evaluation.Status.LastEvaluationTimestamp == nil ||
 		time.Since(evaluation.Status.LastEvaluationTimestamp.Time) >= evaluation.Spec.StabilizationPeriod.Duration {
-		evaluation.Status.Replicas = topSeenRecord.Replicas
+		evaluation.Status.Replicas = evaluation.Status.Projection
 		evaluation.Status.LastEvaluationTimestamp = ptr.To(metav1.Now())
 		log.Info("New partitioning has won",
 			"replicas", len(evaluation.Status.Replicas), "lastEvaluationTimestamp", evaluation.Status.LastEvaluationTimestamp)
