Cannot understand replica_groups = {{0}} in XLA all-reduce operator #17633

DicardoX · 2023-09-17T04:55:32Z

Hi there! I am compiling a DNN model with XLA on two GPUs. After calling xe.run_auto_sharding(hlo, compile_options) and xe.run_spmd_partitioner(hlo, compile_options), I printed the HLO text of the sharded module and got:

...
  maximum.84 = f32[16,14,14,896]{3,2,1,0} maximum(add.298, broadcast.273)
  param.311 = f32[3,3,896,3584]{3,2,1,0} parameter(328), sharding={devices=[1,1,2,1]0,1}
  convolution.44 = f32[16,7,7,3584]{3,2,1,0} convolution(maximum.84, param.311), window={size=3x3 stride=2x2 pad=0_1x0_1}, dim_labels=b01f_01io->b01f
  all-reduce.16 = f32[16,7,7,3584]{3,2,1,0} all-reduce(convolution.44), channel_id=37, replica_groups={{0}}, to_apply=add.16
  reduce.88 = f32[3584]{0} reduce(all-reduce.16, constant.2), dimensions={0,1,2}, to_apply=region_89.5112.0
  constant.485 = f32[] constant(0.000127551015)
  broadcast.182 = f32[3584]{0} broadcast(constant.485), dimensions={}
...

In the HLO text, the replica_groups of all-reduce.16 is {{0}}, i.e. a single group containing only replica 0. According to the description at https://tensorflow.google.cn/xla/operation_semantics?hl=en&authuser=0#allreduce, shouldn't a replica group contain more than one replica for the all-reduce to do anything?

This confuses me a lot, and any advice would be appreciated. Thank you!
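
One possible reading, sketched here in plain Python rather than with any XLA API: because all-reduce.16 carries a channel_id, the ids in replica_groups may be replica ids that each span every partition, so {{0}} with one replica and two partitions would still cover both GPUs. This is an assumption about the grouping mode, not something the HLO text above confirms. The helper names (expand_replica_groups, all_reduce_sum) and the partial values are hypothetical and exist only for illustration.

```python
from itertools import product


def expand_replica_groups(replica_groups, num_partitions):
    """Hypothetical helper: expand replica-id groups into (replica, partition) pairs.

    Assumes each group spans every partition of the replicas it lists
    (an assumed "cross-replica and cross-partition" interpretation).
    """
    return [
        [(r, p) for r, p in product(group, range(num_partitions))]
        for group in replica_groups
    ]


def all_reduce_sum(values, device_groups):
    """Sum the values inside each group and give the sum to every member."""
    out = {}
    for group in device_groups:
        total = sum(values[d] for d in group)
        for d in group:
            out[d] = total
    return out


# One replica, two partitions (the two GPUs), replica_groups={{0}}.
groups = expand_replica_groups([[0]], num_partitions=2)

# Made-up partial results, one per partition, standing in for the
# per-device outputs of convolution.44.
partial = {(0, 0): 1.5, (0, 1): 2.5}
result = all_reduce_sum(partial, groups)
```

Under that assumption the single group {{0}} is not a no-op: it contains both (replica 0, partition 0) and (replica 0, partition 1), and both end up holding the summed value. This would be consistent with param.311 being split across the two devices, so that each device computes only a partial convolution result.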

Replies: 0 comments
