Discussion: Refactoring the Heartbeat & Config Sync Flow #1491

PapaPiya · 2024-05-16T11:51:37Z

PapaPiya
May 16, 2024

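Background

Based on #1481, this is a simple design for the current heartbeat & config sync flow.

Design goals:
- Simple, efficient, reliable, and easy to extend.
- Operations should be initiated by the server wherever possible, rather than by the client.
Current problems:
- The heartbeat interface carries too much information: metadata, pipeline configs, and instance configs.
- Every heartbeat request requires the server to do a full comparison of the configs, which is computationally expensive.
- The heartbeat interface is coupled to the config-fetching logic, which makes later extensions (upgrade, preview, etc.) difficult.

Overall design

The plan is to split the flow into two main interfaces: registration and heartbeat. The registration interface confirms connectivity and reports basic metadata; the heartbeat interface reports status and receives commands. Later extensions can all be implemented as different operations selected by command type.

Communication design

Interface design

1. Registration

Registration is where communication between the agent and the server begins. It performs one full sync of the local configs and local pipelines with the server.

Registration succeeds: a full config sync is performed and the agent starts reporting heartbeats.
Registration fails: the agent retries periodically until it succeeds.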

// Method: POST 
// Path: /Agent/Register?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message RegisterRequest {
    string agent_type = 1;                     // Required, Agent's type(ilogtail, ..)
    AgentAttributes attributes = 2;            // Agent's basic attributes
    repeated string tags =  3;                  // Agent's tags
    string running_status = 4;                  // Required, Agent's running status
    int64 startup_time = 5;                     // Required, Agent's startup time
    int32 interval = 6;                         // Agent's heartbeat interval
    repeated ConfigInfo pipeline_configs = 7;   // Information about the current PIPELINE_CONFIG held by the Agent
    repeated ConfigInfo agent_configs = 8;     // Information about the current AGENT_CONFIG held by the Agent
}

// Header: X-Req-Id:<reqId>
message RegisterResponse {
    RespCode code = 1;
    string message = 2;

    repeated Command custom_commands = 3;     // Agent received commands
}
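
The "retry periodically until it succeeds" rule for registration can be sketched as a small loop. This is only an illustration: `send_register`, `retry_interval_s`, and the injectable `sleep` are hypothetical names, not part of the proposal, and the actual HTTP call to /Agent/Register is left to the caller.

```python
import time
from typing import Callable


def register_until_success(send_register: Callable[[], bool],
                           retry_interval_s: float = 5.0,
                           sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send_register until it reports success; return the number of attempts.

    send_register is a hypothetical callable that performs the Register
    request and returns True when the server answered with a success code.
    """
    attempts = 0
    while True:
        attempts += 1
        if send_register():
            return attempts
        # Registration failed: wait, then try again.
        sleep(retry_interval_s)
```

Injecting `sleep` keeps the loop testable without real delays; a production agent would probably also add jitter or backoff, which the proposal does not specify.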

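2. Heartbeat

Main functions of the heartbeat: 1. report the Agent's running status; 2. pull the commands that need to be executed.

Report succeeds: the agent keeps running normally.
Report fails: the server marks an agent whose reports have failed for a long time (5 min) as offline and clears its related commands.
- When a later report succeeds: a full pipeline sync is performed.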

// Method: POST
// Path: /Agent/HeartBeat?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message HeartBeatRequest {
    string running_status = 1;                  // Required, Agent's running status
}

// Header: X-Req-Id:<reqId>
message HeartBeatResponse {
    RespCode code = 1;
    string message = 2;

    repeated Command custom_commands = 3;     // Agent received commands
}
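
A minimal server-side sketch of the offline rule described for heartbeat failures. The in-memory `AgentRecord` store and the function names are assumptions made for illustration; treating an unknown agent like an offline one is also an assumption, since the proposal does not cover that case.

```python
from dataclasses import dataclass, field

OFFLINE_AFTER_S = 5 * 60  # the 5-minute threshold mentioned in the proposal


@dataclass
class AgentRecord:
    last_heartbeat: float
    online: bool = True
    pending_commands: list = field(default_factory=list)


def sweep_offline(agents: dict, now: float) -> list:
    """Mark agents that have been silent too long as offline and clear their commands."""
    newly_offline = []
    for agent_id, rec in agents.items():
        if rec.online and now - rec.last_heartbeat > OFFLINE_AFTER_S:
            rec.online = False
            rec.pending_commands.clear()
            newly_offline.append(agent_id)
    return newly_offline


def on_heartbeat(agents: dict, agent_id: str, now: float) -> bool:
    """Record a heartbeat; return True when a full pipeline sync should follow."""
    rec = agents.get(agent_id)
    needs_full_sync = rec is None or not rec.online  # unknown agents: assumption
    if rec is None:
        agents[agent_id] = AgentRecord(last_heartbeat=now)
    else:
        rec.last_heartbeat, rec.online = now, True
    return needs_full_sync
```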

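3. Pipeline sync

The server generates a command for a full sync of pipeline/instance configs 1. when an agent comes online and 2. periodically. The command tells the agent to report its local configs so the server can compare them with its own.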

// Method: POST 
// Path: /Agent/Sync?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message SyncRequest {
    repeated ConfigInfo pipeline_configs = 1;   // Information about the current PIPELINE_CONFIG held by the Agent
    repeated ConfigInfo agent_configs = 2;     // Information about the current AGENT_CONFIG held by the Agent
}

// Header: X-Req-Id:<reqId>
message SyncResponse {
    RespCode code = 1;
    string message = 2;

    repeated Command custom_commands = 3;     // Agent received commands
}

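4. Fetch detailed pipeline config

Unchanged.

5. Fetch Agent config

Unchanged.

6. Report command execution results

After executing a command, the agent must notify the server of the result.

Report succeeds: the server updates the command's status, and later heartbeat requests no longer return the same command.
Report fails: every heartbeat returns the same command again, and it is executed again.
- Impact of re-executing a command:
  - Pipeline update/create: a pipeline that already exists at the same version is not updated, so there is no impact.
  - Pipeline delete: deleting a config that does not exist is a no-op, so there is no impact.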

// Method: POST
// Path: /Agent/ReportCommandResult?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message ReportCommandResultRequest {
    repeated CommandResult command_results = 1; // Required, results of the executed commands
}

// Define Command Execute Result
message CommandResult {
    string command_id = 1;                // Required, Command id
    ExecCode code = 2;                    // Required, Command execute code
    string message = 3;                   // Optional, Error message
}

// Define Command Execute Code
enum ExecCode {
    SUCCESS = 0;
    FAIL = 1;
}

// Header: X-Req-Id:<reqId>
message ReportCommandResultResponse {
    RespCode code = 1;
    string message = 2;
}
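
The argument that a re-delivered command is harmless depends on the agent applying pipeline commands idempotently. A minimal sketch of that property, assuming local state is a name-to-version map; the `version` argument and the "update" operation are hypothetical, since the example commands in this proposal only carry `operation` and `name`.

```python
def apply_pipeline_command(local: dict, operation: str, name: str, version: str = "") -> bool:
    """Apply one pipeline command to `local` (name -> version).

    Returns True if local state changed, False if the command was a no-op,
    which is what happens when the same command is delivered twice.
    """
    if operation == "remove":
        # Removing a pipeline that does not exist changes nothing.
        return local.pop(name, None) is not None
    if operation in ("create", "update"):
        if local.get(name) == version:
            return False  # same version already present
        local[name] = version
        return True
    raise ValueError(f"unknown operation: {operation}")
```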

Command design

Struct design

// Define command
message Command {
    string type = 1;                // Required, Command type
    string name = 2;                // Required, Command name
    string id = 3;                  // Required, Command id
    map<string, string> args = 4;   // Command's parameter arrays
}

Pipeline command example

[{
    "id": "1",
    "type": "pipeline",
    "name": "",
    "args": {
        "operation": "remove",
        "name": "<pipeline_name>"
    } 
},{
    "id": "2",
    "type": "pipeline",
    "name": "",
    "args": {
        "operation": "create",
        "name": "<pipeline_name>"
    }
}]

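Special cases

An Agent is removed from a group while a Command is executing: removing the Agent from the group creates a Command to delete the pipeline, which is executed on the next heartbeat.
A new Agent joins the group after a Command has executed successfully: likewise, associating the Agent with the group creates a Command to add the pipeline.

Interface compatibility

config_server can keep both the old and the new interfaces at the same time.
The agent can specify in its config which one to use.

Summary

The heartbeat interface has been split up: metadata reporting is handled by the registration interface, and config sync is handled by the Command mechanism (which supports both incremental and full sync).
A Command execution acknowledgment mechanism is added to ensure Commands are executed correctly.

Open issues

Discussion is welcome.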

Takuka0311 · 2024-05-16T12:43:41Z

Takuka0311
May 16, 2024
Maintainer

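A very good design. I have a few ideas to discuss:
1. Is the registration interface needed? The heartbeat could return a full-sync command, with the same effect.
2. For configs, should the server issue a command that triggers a pull, or should the agent pull on its own? Elastic's Fleet Server is a useful reference: there the Agent actively pulls from the server: https://www.elastic.co/guide/en/fleet/current/fleet-server.html . Some other config managers are implemented in much the same way.
3. Would it be better to judge iLogtail's liveness by the config-pulling interface? ConfigServer is at its core a config manager, and the heartbeat is an add-on used for monitoring. The heartbeat is dispensable, but configs must always be delivered.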

0 replies

liuhaoyang · 2024-05-16T13:06:24Z

liuhaoyang
May 16, 2024
Collaborator

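A very good design. I have a few ideas to discuss:
1. Is the registration interface needed? The heartbeat could return a full-sync command, with the same effect.
2. For configs, should the server issue a command that triggers a pull, or should the agent pull on its own? Elastic's Fleet Server is a useful reference: there the Agent actively pulls from the server: https://www.elastic.co/guide/en/fleet/current/fleet-server.html . Some other config managers are implemented in much the same way.
3. Would it be better to judge iLogtail's liveness by the config-pulling interface? ConfigServer is at its core a config manager, and the heartbeat is an add-on used for monitoring. The heartbeat is dispensable, but configs must always be delivered.

Answering from the perspective of this proposal's design goals:

The purpose of the registration interface is to report more of the agent's metadata and pull the full config at startup. Because the metadata hardly changes while the agent is running, the heartbeat interface does not need to carry the extra attributes, which keeps the body small.
I think the heartbeat is quite important too: both config management and upgrades depend on the instance being alive. And in scenarios that collect observability data for core business, users care a lot about the agent's status.
So in this design, config delivery and other commands are issued for execution only after the heartbeat confirms the instance is healthy.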

0 replies

liuhaoyang · 2024-05-16T13:08:06Z

liuhaoyang
May 16, 2024
Collaborator

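A very good design. I have a few ideas to discuss:
1. Is the registration interface needed? The heartbeat could return a full-sync command, with the same effect.
2. For configs, should the server issue a command that triggers a pull, or should the agent pull on its own? Elastic's Fleet Server is a useful reference: there the Agent actively pulls from the server: https://www.elastic.co/guide/en/fleet/current/fleet-server.html . Some other config managers are implemented in much the same way.
3. Would it be better to judge iLogtail's liveness by the config-pulling interface? ConfigServer is at its core a config manager, and the heartbeat is an add-on used for monitoring. The heartbeat is dispensable, but configs must always be delivered.

On point 2, config delivery is also an active pull by the agent, but the agent needs to know when to come and pull the config, which is why the heartbeat returns commands.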

0 replies

PapaPiya · 2024-05-20T14:32:07Z

PapaPiya
May 20, 2024
Author

update

The interfaces are simplified to two: heartbeat and generic command.

1. Heartbeat

Functions: 1. report status; 2. pull commands

// Method: POST
// Path: /Agent/HeartBeat?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message HeartBeatRequest {}

// Header: X-Req-Id:<reqId>
message HeartBeatResponse {
    RespCode code = 1;
    string message = 2;

    repeated Command custom_commands = 3;     // Optional, Agent received commands
}

// Define command
message Command {
    string id = 1;                  // Required, Command id
    string type = 2;                // Required, Command type
    string name = 3;                // Optional, Command name
    map<string, string> args = 4;   // Command's parameter arrays
}

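2. Generic command

A generic command interface that makes it easy to add features. The CommandType value in the Path routes to different interfaces, for example: report metadata (ReportMetadata), sync pipelines (SyncPipeline), fetch pipeline configs (FetchPipelineConfig), report command execution results (ReportCommandResult), and so on.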

// Method: POST
// Path: /Agent/<CommandType>?agentId=<agentId>
// Header: X-Req-Id:<reqId>
message CommandRequest {
    bytes request_body = 1;      // Optional, Request Body
}

// Header: X-Req-Id:<reqId>
message CommandResponse {
    RespCode code = 1;
    string message = 2;

    bytes response_body = 3;     // Optional, Response Body
}
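
Routing on the CommandType path segment could look roughly like the sketch below. The `HANDLERS` registry, the `handles` decorator, the `RECEIVED` log, and the use of 0/1 as success/failure codes are all assumptions for illustration; the proposal only fixes the path shape /Agent/<CommandType>?agentId=<agentId>.

```python
from urllib.parse import urlparse, parse_qs

HANDLERS = {}   # CommandType -> handler; hypothetical registry
RECEIVED = []   # (command_type, agent_id, body), kept only for illustration


def handles(command_type):
    """Register a handler function for one CommandType."""
    def decorator(fn):
        HANDLERS[command_type] = fn
        return fn
    return decorator


@handles("ReportMetadata")
def report_metadata(agent_id, body):
    RECEIVED.append(("ReportMetadata", agent_id, body))
    return b""  # the proposal lists the response body for this command as nil


def route(url, body):
    """Dispatch a POST to /Agent/<CommandType>; returns (code, message, response_body)."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    agent_ids = parse_qs(parsed.query).get("agentId")
    if len(parts) != 2 or parts[0] != "Agent" or parts[1] not in HANDLERS:
        return 1, "unknown command type", b""
    if not agent_ids:
        return 1, "missing agentId", b""
    return 0, "", HANDLERS[parts[1]](agent_ids[0], body)
```

Adding a new feature then means registering one more handler, without touching the heartbeat path.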

2.1 Report metadata

The server issues a ReportMetadata command; on receiving it, the agent reports its own metadata.

Command:

{
    "id": "1",
    "type": "ReportMetadata"
}

Request Body:

{
   "agent_type":"",
    "attributes":{
        "version":"",
        "category":"",
        "ip":"",
        "hostname":"",
        "region":"",
        "zone":"",
        "extras":{}
    },
    "tags":[""],
    "startup_time":0,
    "interval":10
}

Response Body: nil

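2.2 Full pipeline sync

The server issues a full-sync command (SyncPipeline); on receiving it, the agent reports its local pipeline configs, and the server compares them with its own and returns the check results.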

Command:

{
    "id": "1",
    "type": "SyncPipeline"
}

Request Body:

{
    "pipeline_configs": [{
        "name":"", 
        "version":"",  
        "context":""
    }]
}

Response Body:

{
    "check_results": [{
        "name":"",
        "old_version":"",
        "new_version":"",
        "context":"",
        "check_status":""
    }]
}
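
The server-side comparison behind SyncPipeline can be sketched as a diff of two name-to-version maps. The `check_status` values used here ("new", "modified", "deleted") are borrowed from the SyncPipelineChange command and are an assumption; the proposal leaves the field's values unspecified.

```python
def check_pipelines(agent: dict, server: dict) -> list:
    """Compare name -> version maps and return one check result per difference."""
    results = []
    for name in sorted(agent.keys() | server.keys()):
        old, new = agent.get(name), server.get(name)
        if old == new:
            continue  # agent is already up to date for this pipeline
        if old is None:
            status = "new"
        elif new is None:
            status = "deleted"
        else:
            status = "modified"
        results.append({
            "name": name,
            "old_version": old or "",
            "new_version": new or "",
            "check_status": status,
        })
    return results
```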

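2.3 Incremental pipeline sync

The server issues a SyncPipelineChange command, and the agent deletes, creates, or updates each pipeline according to its change type.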

Command:

{
    "id": "1",
    "type": "SyncPipelineChange",
    "args":{
        "<pipeline_name>": "<new/modified/deleted>"
    }
}

deleted needs no extra request to the server; modified/new require fetching the detailed config through the fetch-pipeline-config interface.
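
On the agent side, handling the args of a SyncPipelineChange command amounts to splitting pipeline names into those to delete locally and those whose detail must be fetched. A minimal sketch; the function name is hypothetical.

```python
def plan_pipeline_changes(args: dict) -> tuple:
    """Split SyncPipelineChange args into (names to delete, names to fetch)."""
    to_delete, to_fetch = [], []
    for name, change in args.items():
        if change == "deleted":
            to_delete.append(name)      # no extra request to the server
        elif change in ("new", "modified"):
            to_fetch.append(name)       # needs FetchPipelineConfig
        else:
            raise ValueError(f"unknown change type for {name}: {change}")
    return to_delete, to_fetch
```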

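2.4 Fetch detailed pipeline config

When the agent receives a command for a new/modified pipeline, it requests the detailed config from the server.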

Command Type: FetchPipelineConfig
Request Body:

{
    "pipeline_configs": [{
        "name":"",
        "version":"", 
        "context":""
    }]
}

Response Body:

{
    "pipeline_configs": [{
        "name":"",
        "version":"", 
        "context":"",
        "detail":""
    }]
}

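2.5 Sync instance config

When the server issues the sync-instance-config command, it also puts the instance config in args, so the agent can apply it directly.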

Command:

{
    "id": "1",
    "type": "SyncAgentConfig",
    "args":{
        "name":"",
        "version":"", 
        "context":"",
        "detail":""
    }
}

No extra request to the server is needed.

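2.6 Report command execution results

In principle every command should report its result, so that a failed command can be retried and shown on the page.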

Command Type: ReportCommandResult
Request Body:

{
    "command_results": [{
        "command_id":"",
        "code":0,
        "message":""
    }]
}

Response Body: nil

0 replies

liangry · 2024-05-21T08:57:45Z

liangry
May 21, 2024
Collaborator

Does running_status in HeartbeatRequest take any value other than the running state? If not, why not remove it as well?

1 reply

PapaPiya May 21, 2024
Author

Sure.

liangry · 2024-05-21T08:59:53Z

liangry
May 21, 2024
Collaborator

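Report fails: the server marks an agent whose reports have failed for a long time (5 min) as offline and clears its related commands.

Time-range queries are fairly hard to do in leveldb, and a full scan seems unavoidable. Is there a better way?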

1 reply

PapaPiya May 21, 2024
Author

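For now we are only discussing the feasibility of the protocol, not the actual implementation, storage approach/type, and so on. As the server side evolves, other storage engines will need to be introduced later.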

liangry · 2024-05-21T09:11:44Z

liangry
May 21, 2024
Collaborator

repeated string tags = 3; // Agent's tags

The tags in Agent Group use the AgentGroupTag format; I suggest unifying the two.

1 reply

PapaPiya May 21, 2024
Author

The current agent heartbeat protocol already represents tags as a string array. Is there a particular reason for using AgentGroupTag?

liangry · 2024-05-21T11:00:58Z

liangry
May 21, 2024
Collaborator

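The current agent heartbeat protocol already represents tags as a string array. Is there a particular reason for using AgentGroupTag?

The tags field is planned for Agent grouping; if the two sides are not unified, it doesn't seem workable.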

3 replies

PapaPiya May 21, 2024
Author

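The protocol plans to use google.protobuf.Struct to store metadata, which supports string arrays plus key-value pairs. The concrete type can be decided later when config_server is reworked.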

yyuuttaaoo May 24, 2024
Maintainer

Struct doesn't seem to have any advantage over plain bytes, and it actually rules out a more efficient binary representation.
Ref: https://medium.com/@raj.paani/using-protocol-buffers-protobuf-struct-datatype-for-generic-objects-json-representation-57bc1ba5c248

PapaPiya May 24, 2024
Author

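Struct doesn't seem to have any advantage over plain bytes, and it actually rules out a more efficient binary representation. Ref: https://medium.com/@raj.paani/using-protocol-buffers-protobuf-struct-datatype-for-generic-objects-json-representation-57bc1ba5c248

Yes, replacing it with bytes would be better.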

yyuuttaaoo · 2024-05-24T13:36:06Z

yyuuttaaoo
May 24, 2024
Maintainer

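Why establish a unified management protocol

AMP (Agent Management Protocol) is the communication protocol between ConfigProvider and ConfigServer, and is not coupled to OneAgent itself. However, whether a config defined in the protocol can take effect depends on OneAgent's capabilities. The unified management protocol therefore has two main purposes:

1. Unify the ConfigProvider implementations on OneAgent, so that CommonConfigProvider can serve as the default ConfigProvider while offering some extensibility. In most cases a custom Provider can be built by overriding a few methods, instead of every ConfigProvider starting from scratch.
2. Complete the on-agent config and behavior specification, fully covering common process management and custom command needs.

Unifying only the Protocol does little for code reuse. What definition would actually help?
1. As long as an XxxConfigServer implements the protocol, it can manage an Agent to do Yyy.
2. As long as an Agent implements the protocol, any XxxConfigServer can manage that Agent to do Yyy.

So a unified protocol must define client and server behavior, not just fields.

Management protocol

Enhance the current community-edition protocol, breaking as little as possible.
$endpoint/GetAgentConfig?instance_id=$instance_id&wait_for=(true|false)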

request

// Method: POST 
// Path: /ListenAgentConfig?DeployId=$deploy_id
// Header: X-Req-Id:<reqId>
message Request {
    string request_id = 1;
    string instance_id = 2;                     // Required, Agent's unique identification, agent may have an interface for its generation
    string agent_type = 3;                      // Required, Agent's type(ilogtail, ..); important, it determines which attributes fields the server combines to look up the process-level config
    AgentAttributes attributes = 4;             // Agent's basic attributes
    repeated AgentGroupTag tags = 5;            // Agent's tags
    string running_status = 6;                  // Agent's running status
    int64 startup_time = 7;                     // Required, Agent's startup time
    repeated ConfigInfo pipeline_configs = 9;   // Information about the current PIPELINE_CONFIG held by the Agent
    repeated ConfigInfo process_configs = 10;   // Information about the current PROCESS_CONFIG held by the Agent (formerly agent_configs)
    repeated CommandInfo custom_commands = 11;  // Information about command history
    uint64 sequence_num = 12;                   // Increment every request; if the server caches Agent information, it can use this value to decide whether to ask the client for a full report to refresh it (new field)
    uint64 capabilities = 13;                   // Bitmask of flags defined by AgentCapabilities enum (new field)
    uint64 flags = 14;                          // Predefined command flag (new field)

    // To extend the protocol, use field numbers after 100; numbers < 100 are officially reserved.
}

message AgentAttributes {
    string version = 1;                 // Agent's version
    string ip = 3;                      // Agent's ip
    string hostname = 4;                // Agent's hostname
    map<string, string> extras = 100;   // Agent's other attributes
}

enum ConfigStatus {
    // The value of status field is not set.
    UNSET = 0;
    // Agent is currently applying the remote config that it received earlier.
    APPLYING = 1;
    // Remote config was successfully applied by the Agent.
    APPLIED = 2;
    // Agent tried to apply the config received earlier, but it failed.
    // See error_message for more details.
    FAILED = 3;
}

message ConfigInfo {
    string name = 2;         // Required, Config's unique identification (reused as level for process_config)
    int64 version = 3;       // Required, Config's version number or hash
    ConfigStatus status = 5; // New field
}

message CommandInfo {
    string type = 1;         // Required, Command type
    string name = 2;         // Required, Command name
    ConfigStatus status = 5; // New field
}

enum AgentCapabilities {
    // The capabilities field is unspecified.
    UnspecifiedAgentCapability = 0;
    // The Agent can accept pipeline configuration from the Server.
    AcceptsPipelineConfig          = 0x00000001;
    // The Agent can accept process configuration from the Server.
    AcceptsProcessConfig           = 0x00000002;
    // The Agent can accept custom command from the Server.
    AcceptsCustomCommand           = 0x00000004;

    // Add new capabilities here, continuing with the least significant unused bit.
}

enum RequestFlags {
    FlagsUnspecified = 0;

    // Flags is a bit mask. Values below define individual bits.
}

response

// Header: X-Req-Id:<reqId>
message HeartBeatResponse {
    string request_id = 1;
    RespCode code = 2;
    string message = 3;

    repeated ConfigCheckResult pipeline_check_results = 4;  // Agent's PIPELINE_CONFIG update status
    repeated ConfigCheckResult process_check_results = 5;   // Agent's PROCESS_CONFIG update status (formerly agent_check_results)
    repeated Command custom_commands = 6;                   // Agent received commands
    uint64 capabilities = 7;                                // Bitmask of flags defined by ServerCapabilities enum (new field)
    uint64 flags = 8;                                       // Predefined command flag, e.g. reportFullState, fetchConfigDetail (new field)
}

message ConfigCheckResult {
    string name = 2;                // Required, Config's unique identification
    int64 new_version = 4;          // Required, Config's latest version number
    string context = 5;             // Config's context, e.g. endpoint
    string detail = 6;              // Config's detail
}

message Command {
    string type = 1;                // Required, Command type
    string name = 2;                // Required, Command name
    map<string, string> args = 4;   // Command's parameter arrays
    int64 expire_time = 5;          // After which the command can be safely removed from history
}

enum ServerCapabilities {
    // The capabilities field is unspecified.
    UnspecifiedServerCapability = 0;
    // The Server can remember agent attributes.
    RembersAttribute                   = 0x00000001;
    // The Server can remember pipeline config status.
    RembersPipelineConfigStatus        = 0x00000002;
    // The Server can remember process config status.
    RembersProcessConfigStatus         = 0x00000004;
    // The Server can remember custom command status.
    RembersCustomCommandStatus         = 0x00000008;

    // Add new capabilities here, continuing with the least significant unused bit.
}

enum ResponseFlags {
    ResponseFlagsUnspecified = 0;

    // Flags is a bit mask. Values below define individual bits.

    // ReportFullState flag can be used by the Server if the Client did not include
    // some sub-message in the last AgentToServer message (which is an allowed
    // optimization) but the Server detects that it does not have it (e.g. was
    // restarted and lost state).
    ReportFullState           = 0x00000001;
    FetchPipelineConfigDetail = 0x00000002;
    FetchProcessConfigDetail  = 0x00000004;
}

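Behavior specification

As noted at the start, a management protocol that only covers interface fields without a behavior specification is empty talk. Although many client and server behaviors are optional in the protocol or depend on the specific client type, for OneAgent as one specific Agent the protocol behavior must be deterministic. The Server side may have optional behaviors and different implementations, and OneAgent's implementation must account for all of these differences and stay compatible with them. That way, OneAgent only needs to implement one CommonConfigProvider to be managed by any ConfigServer that conforms to this protocol specification.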

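Capability reporting

Client: should report the Agent's own capabilities through capabilities. If an old client connects to a new ConfigServer, the ConfigServer then knows the client lacks a given capability and will not send it unsupported configs or commands, which would otherwise never get a status report and lead to an infinite loop.

Server: should report the Server's own capabilities through capabilities. If a new client connects to an old ConfigServer, the Agent then knows the server lacks a given capability and will not be misled by its responses. For example, if the server cannot remember Attributes, the Attributes field must never be omitted from the heartbeat.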
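
How a server might honor the agent's capability bitmask can be sketched as follows. The bit values come from the AgentCapabilities enum in this proposal; `has_capability` and `build_response` are hypothetical helpers, and the response is modeled as a plain dict rather than a protobuf message.

```python
# Bit values from the AgentCapabilities enum in the proposal.
ACCEPTS_PIPELINE_CONFIG = 0x00000001
ACCEPTS_PROCESS_CONFIG = 0x00000002
ACCEPTS_CUSTOM_COMMAND = 0x00000004


def has_capability(mask: int, flag: int) -> bool:
    """True when every bit of `flag` is set in `mask`."""
    return (mask & flag) == flag


def build_response(agent_caps: int, pipeline_results: list,
                   process_results: list, commands: list) -> dict:
    """Drop anything the agent has not declared it can accept."""
    return {
        "pipeline_check_results":
            pipeline_results if has_capability(agent_caps, ACCEPTS_PIPELINE_CONFIG) else [],
        "process_check_results":
            process_results if has_capability(agent_caps, ACCEPTS_PROCESS_CONFIG) else [],
        "custom_commands":
            commands if has_capability(agent_caps, ACCEPTS_CUSTOM_COMMAND) else [],
    }
```

An old agent that reports capabilities = 0 therefore receives nothing it cannot acknowledge, which is how the endless re-delivery loop is avoided.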

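Registration

Client: the first time the Agent reports to the Server after startup, it reports its full state and fills in every request field it can. request_id, instance_id, agent_type, startup_time, and sequence_num are required.

Server: the Server returns a response based on the reported information. pipeline_check_results and process_check_results contain the configs the agent needs to sync; each check result always includes name and new_version, and whether it includes the context and detail is up to the server implementation. custom_commands contains the commands the agent is asked to execute; each command always includes type, name, and expire_time. Whether the Server stores Client information is also up to the Server implementation. If the server cannot find the agent, or the stored sequence_num + 1 ≠ the heartbeat's sequence_num, it returns immediately and must set the ReportFullState bit in flags.

The Server looks up the process config by agent_type + attributes, and looks up machine groups and their associated collection configs by ip and tags.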
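
The sequence_num rule for the server can be written down in a few lines. A minimal sketch: `stored_seq` stands for whatever the server has cached for this agent (None when nothing is cached), and the function name is hypothetical.

```python
REPORT_FULL_STATE = 0x00000001  # ResponseFlags bit from the proposal


def response_flags(stored_seq, request_seq: int) -> int:
    """Return the flags to send back for one heartbeat.

    ReportFullState is set when the agent is unknown to the server or when
    the stored sequence_num + 1 does not equal the incoming sequence_num.
    """
    if stored_seq is None or stored_seq + 1 != request_seq:
        return REPORT_FULL_STATE
    return 0
```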

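Heartbeat (heartbeat compression)

Client: if the response received did not contain ReportFullState, and the client's attributes, config status, and command status have not changed since the last report, it may fill in only instance_id and sequence_num, incrementing sequence_num by 1 on every request. If ReportFullState is set, or any attribute or config status has changed, or the Server does not support remembering attributes and config status, it must report its full state.

Server: same as registration.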
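
The client-side decision between a compressed and a full heartbeat can be sketched as below. The request is modeled as a plain dict, the function name is hypothetical, and requiring exactly the attribute and the two config-status memory bits from ServerCapabilities is an assumption about how "supports remembering attributes and config status" would be checked.

```python
REPORT_FULL_STATE = 0x00000001          # ResponseFlags bit
# ServerCapabilities bits assumed necessary for compression:
# attributes (0x1), pipeline config status (0x2), process config status (0x4).
REQUIRED_MEMORY = 0x00000001 | 0x00000002 | 0x00000004


def build_heartbeat(full_state: dict, changed: bool,
                    last_response_flags: int, server_capabilities: int) -> dict:
    """Return the heartbeat body: compressed when allowed, full otherwise."""
    must_send_full = (
        changed
        or bool(last_response_flags & REPORT_FULL_STATE)
        or (server_capabilities & REQUIRED_MEMORY) != REQUIRED_MEMORY
    )
    if must_send_full:
        return dict(full_state)
    # Compressed heartbeat: only the identity and the sequence number.
    return {key: full_state[key] for key in ("instance_id", "sequence_num")}
```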

Heartbeat compression allowed

Heartbeat compression not allowed

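Process config

If the Server's registration/heartbeat response includes process_check_results.detail

Client: takes the detail directly from the response; after applying it successfully, the next heartbeat must report the full state.

If the Server's response does not include detail

Client: builds a FetchProcessConfigRequest (formerly FetchAgentConfigRequest) from the information in process_check_results.

Server: returns a FetchProcessConfigResponse (formerly FetchAgentConfigResponse).

When the Client receives multiple process configs, it merges them by level; the narrower the scope, the higher the priority.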
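
The merge rule (narrower scope wins) can be illustrated with a small sketch. The numeric `scope_rank` and the setting names are hypothetical; the proposal only says that the config name doubles as the level for process configs and does not define the levels themselves.

```python
def merge_process_configs(configs: list) -> dict:
    """Merge (scope_rank, settings) pairs; a larger rank means a narrower scope.

    Broader scopes are applied first, so narrower scopes override them
    key by key.
    """
    merged = {}
    for _, settings in sorted(configs, key=lambda item: item[0]):
        merged.update(settings)
    return merged
```

For example, a group-level `mem_limit` would override a global one, while keys that only the global config sets are kept.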

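Collection config

If the Server's registration/heartbeat response includes pipeline_check_results.detail

Client: takes the detail directly from the response; after applying it successfully, the next heartbeat must report the full state.

If the Server's response does not include detail

Client: builds a FetchPipelineConfigRequest from the information in pipeline_check_results.

Server: returns a FetchPipelineConfigResponse.

The client supports the following two implementations:

Implementation 1: the Detail is returned directly in the heartbeat response (FetchConfigDetail flag is unset)

Implementation 2: only the config name and version are returned, and the Detail is fetched with a separate request (FetchConfigDetail flag is set)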
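
Supporting both implementations on the client comes down to one branch on the response flag. A minimal sketch, with the response modeled as a dict; `fetch_detail` is a hypothetical callable standing in for the separate FetchPipelineConfig request.

```python
FETCH_PIPELINE_CONFIG_DETAIL = 0x00000002  # ResponseFlags bit from the proposal


def resolve_pipeline_details(response: dict, fetch_detail) -> dict:
    """Return {name: detail} for every pipeline check result in a response.

    fetch_detail receives a list of (name, new_version) pairs and returns
    {name: detail}; it is only called when the flag is set.
    """
    results = response.get("pipeline_check_results", [])
    if not results:
        return {}
    if response.get("flags", 0) & FETCH_PIPELINE_CONFIG_DETAIL:
        # Implementation 2: the response only carries names and versions.
        return fetch_detail([(r["name"], r["new_version"]) for r in results])
    # Implementation 1: the detail is inlined in the heartbeat response.
    return {r["name"]: r["detail"] for r in results}
```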

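Config status reporting

Client: this version of config status reporting changes the definition of version: -1 still means delete, 0 is a reserved value, and every other value is a valid version. Whenever the version differs, the Client should treat it as a config update. In addition, following OpAMP, a field for reporting the config apply status is added, which reflects whether a delivered config has taken effect.

Server: this information is part of the Agent's state and storing it is optional. Unlike observability data reported through Events, state information has no time attribute, so users can get the current state through an interface without choosing a time window and merging events.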
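
The version rules for the client can be captured in a small classifier. A sketch under the stated rules only; the function name and the returned labels are hypothetical.

```python
def classify_config(local_version, remote_version: int) -> str:
    """Decide what to do with one config given the local and delivered versions.

    -1 means delete, 0 is reserved, and any other differing value counts as
    an update, even if it is numerically smaller than the local version.
    """
    if remote_version == 0:
        raise ValueError("version 0 is reserved")
    if remote_version == -1:
        return "delete"
    if local_version == remote_version:
        return "unchanged"
    return "update"
```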

Predefined commands

Client: passed through the request's flags; none defined yet.

Server: passed through the response's flags; ReportFullState is defined.

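Custom commands

Client: to prevent the server from re-issuing commands and to let it learn execution results, the Client should remain able to report a command's execution status to the server until the command expires; whether it actually reports is determined by the heartbeat compression mechanism. Once expire_time has passed, the client should no longer report the status of the expired command.

Server: if the reported plus already-known Agent state is missing custom_commands that should have been issued (identified by name), the server should issue the missing custom_commands in its response.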
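
Both halves of the custom command rule are simple filters. A minimal sketch with commands modeled as dicts; the function names are hypothetical and `expire_time` is assumed to be comparable with the `now` value passed in.

```python
def reportable_commands(history: list, now: int) -> list:
    """Client side: command statuses that may still be reported (not yet expired)."""
    return [cmd for cmd in history if cmd["expire_time"] > now]


def missing_commands(expected: list, known_names: set) -> list:
    """Server side: commands that should have been issued but are absent
    from the reported plus already-known agent state, matched by name."""
    return [cmd for cmd in expected if cmd["name"] not in known_names]
```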

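Config consumption and feedback

Currently, all CommonConfigProvider does is serialize configs and save them locally; downstream use goes through ConfigWatcher, which diffs the configs and hands them to consumers. There is also no mechanism for feeding back whether a config took effect. Under the new protocol, the former needs to be extended and the latter needs a new class.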

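ConfigWatcher

ConfigWatcher compares old and new configs and notifies the downstream PipelineManager of config updates. PipelineManager is responsible for actually applying them.

With ProcessConfig and custom commands added:

- What ConfigWatcher monitors becomes more complex: the directory used to contain only Pipeline configs, and now it also contains Process configs and Commands.
- Downstream consumption is no longer handled by PipelineManager alone.

So ConfigWatcher's responsibilities need to be extended:

- It no longer watches only PipelineConfig; ConfigDiffs for ProcessConfig and Command need to be added.
- Once a custom_command is captured by ConfigWatcher it is moved to the history directory; it is returned only by the most recent CheckCommand call and never again after that.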

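ConfigFeedbackReceiver

A new ConfigFeedbackReceiver class is added. It is responsible for feeding the execution result of a Config or Command back to the corresponding Provider.

- It provides a Register interface. After receiving a config, a ConfigProvider keeps status information in memory to send with the next heartbeat; at that point it should also register the mapping between the config name and the Provider with ConfigFeedbackReceiver.
- It exposes FeedbackPipelineConfigStatus(configName: string, status: ConfigInfo.Status), FeedbackProcessConfigStatus, and FeedbackCommandStatus interfaces. After a consumer finishes consuming a Config, it should call the corresponding Feedback interface to return the execution status. The Receiver forwards the Feedback to the corresponding Provider based on the registration information.
- It provides an Unregister interface, which a ConfigProvider should call when it deletes its in-memory status information, so that registrations are not leaked.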
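
The register / feedback / unregister flow can be sketched as a small registry. This is an illustration in Python rather than the agent's own implementation language; the method names follow the interfaces named above, while `RecordingProvider` and its callback are hypothetical, and only the pipeline feedback path is shown.

```python
class ConfigFeedbackReceiver:
    """Maps config names to the provider that delivered them."""

    def __init__(self):
        self._providers = {}

    def register(self, config_name, provider):
        self._providers[config_name] = provider

    def unregister(self, config_name):
        # Safe to call twice; avoids leaking stale registrations.
        self._providers.pop(config_name, None)

    def feedback_pipeline_config_status(self, config_name, status) -> bool:
        """Forward the status to the registered provider; False if none is registered."""
        provider = self._providers.get(config_name)
        if provider is None:
            return False
        provider.on_pipeline_config_status(config_name, status)
        return True


class RecordingProvider:
    """Hypothetical provider that keeps statuses in memory for the next heartbeat."""

    def __init__(self):
        self.statuses = {}

    def on_pipeline_config_status(self, config_name, status):
        self.statuses[config_name] = status
```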

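Custom extension of CommonConfigProvider

Requirements

1. The community's definitions of InstanceId vary widely.
2. The community disagrees on whether fetching configs needs a separate request.
3. The community has different needs for which attributes the Agent should report.
4. Third-party and open-source ConfigServers each have their own authentication methods.
5. Configs and custom commands need return values.

For requirement 1, a new extensible GetInstanceId interface is added, which subclasses can override.

For requirement 2, a new FetchConfig interface is added, taking HbResponse as input and returning processConfigs and pipelineConfigs. By default, if the response has no FetchXxxConfigDetail flag, the details of processConfigs and pipelineConfigs are taken directly from HbResponse; if it does, the FetchConfig interface is called to build a network request from the information in HbResponse and return the result. FetchConfig is also a new extensible interface, whose default implementation sends the request to the open-source ConfigServer.

For requirement 3, a new GetAgentAttributes interface is added, whose input and output are the same map; the previous attributes are passed in. The default implementation fills in os/version information and loads attributes from the local global config.

For requirement 4, CommonConfigProvider is responsible for filling in the Pb protocol fields but leaves SendHeartbeat and FetchConfig as extension points; the default implementation uses curl to send synchronous requests to the open-source ConfigServer.

For requirement 5, FeedbackProcessConfigStatus, FeedbackPipelineConfigStatus, and FeedbackCommandStatus methods are added. When a config has been applied or a command has finished executing, ConfigFeedbackReceiver should be called to give feedback, and ConfigFeedbackReceiver then forwards it to the actual Provider through its mapping.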
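
The "override a few methods" idea behind requirements 1 and 3 can be illustrated with a small class hierarchy. Everything here is a hypothetical Python rendering: the default instance id, the attribute values, and `HostnameProvider` are placeholders, not the behavior of the real CommonConfigProvider.

```python
class CommonConfigProvider:
    """Hypothetical base class; method names mirror the extension points above."""

    def get_instance_id(self) -> str:
        return "default-instance-id"  # placeholder default, not specified in the proposal

    def get_agent_attributes(self, attributes: dict) -> dict:
        """Fill in the map passed in (the previous attributes) and return the same map."""
        attributes.setdefault("os", "linux")       # illustrative values only
        attributes.setdefault("version", "0.0.0")
        return attributes

    def build_heartbeat(self, previous_attributes: dict) -> dict:
        # The base class owns field filling; subclasses only customize the hooks.
        return {
            "instance_id": self.get_instance_id(),
            "attributes": self.get_agent_attributes(previous_attributes),
        }


class HostnameProvider(CommonConfigProvider):
    """Custom provider that derives the instance id from a hostname."""

    def __init__(self, hostname: str):
        self.hostname = hostname

    def get_instance_id(self) -> str:
        return "host-" + self.hostname
```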

5 replies

PapaPiya May 28, 2024
Author

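@yyuuttaaoo Wouldn't reporting pipeline status through the heartbeat interface be too slow? Up to 10s waiting to receive + 10s to report = 20s at most.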

yyuuttaaoo May 28, 2024
Maintainer

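Feedback can also be made extensible, so that a report can optionally be sent immediately after feedback is received. Also, a deleted enum value is added to the feedback interface's status, to make it easy to feed back the event that the pipeline for a config has been deleted.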

PapaPiya May 28, 2024
Author

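@yyuuttaaoo I'd like to discuss the role of sequence_num further. When the server is deployed in a distributed way, if every heartbeat is routed to a different server, heartbeat compression stops being effective. Full sync doesn't need it, and incremental sync needs middleware to store state anyway, so it isn't needed there either?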

yyuuttaaoo May 28, 2024
Maintainer

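A few cases:

- A single configserver, such as the open-source edition.
- A gateway that can split traffic: which gateway a connection lands on may be random, but the gateway can route requests to a fixed machine based on the url. For example, the ingress layer is random, but the servers behind it can be a statefulset.

So it is still useful.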

yyuuttaaoo Jun 26, 2024
Maintainer

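A few additions:

- For capability bits, the open-source protocol reserves 16 bits; bits from 2^16 upward are left for third-party extensions.
- Both request and response get an additional bytes opaque field, left for third-party extensions.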
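
The split between the 16 reserved bits and the third-party range can be expressed as a mask. A minimal sketch; the constant and function names are hypothetical.

```python
OFFICIAL_BITS = 16
OFFICIAL_MASK = (1 << OFFICIAL_BITS) - 1  # bits 0..15, reserved for the open-source protocol


def split_capabilities(mask: int) -> tuple:
    """Return (official_bits, third_party_bits) of a non-negative capability mask."""
    return mask & OFFICIAL_MASK, mask & ~OFFICIAL_MASK


def is_third_party(flag: int) -> bool:
    """True when a single capability flag lies at or above 2^16."""
    return flag >= (1 << OFFICIAL_BITS)
```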

PalanQu · 2024-07-08T08:49:27Z

PalanQu
Jul 8, 2024

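Could the config sync feature support specifying some environment variable keys when the ilogtail agent starts, then uploading each key with its environment variable value as a set of labels to the config server, so that the config server can manage groups by these keys?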

2 replies

liangry Jul 12, 2024
Collaborator

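As far as I know there is no such feature at the moment. I did something similar at the time: configure environment variables in ilogtail_config.json, and ilogtail resolves them to their actual values and maps them to ilogtail_tags.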

PalanQu Jul 12, 2024

Will the ilogtail that pairs with the new version of config server have this feature?

yyuuttaaoo · 2024-08-14T10:57:39Z

yyuuttaaoo
Aug 14, 2024
Maintainer

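Based on recently submitted code and review comments, the term process_config turned out to be highly ambiguous: it can be read as config for the OS process, or as flow/processing config, and it also gets confused with the ProcessConfig struct in the ebpf code. So I plan to replace process_config with instance_config everywhere to reduce the ambiguity.
This change does not break PB binary compatibility; it is only a rename in the programming interface.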

0 replies
