Adding task features in stream dataflow #1

@vidushidadu

Microbenchmarks

  1. Simple task-parallel program
    Copy array A to array B in parallel on three premapped cores, with core ID = 0, 1, and 2.
    Ordering between A's and B's elements is maintained.
    Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}

DFG

Input: A
B = A
Output: B
TaskProp[0]: (coreMask=1110)

Stream code

if tid<3:
	SS_LINEAR_READ(&array[N/3 * tid], N/3, P_A);
	SS_LINEAR_WRITE(P_B, N/3, &output[N/3 * tid]);
SS_GLOBAL_WAIT(3);
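
For reference, a minimal scalar C sketch of the intended semantics (a hypothetical model, not the stream implementation; the name copy_task is mine, and N is assumed divisible by 3):

// Hypothetical scalar model of example 1: each of the three premapped
// cores copies its contiguous N/3 chunk of A into B, preserving order.
void copy_task(const int *A, int *B, int N) {
	for (int tid = 0; tid < 3; ++tid)            // one iteration per core
		for (int i = 0; i < N / 3; ++i)          // that core's N/3 chunk
			B[N / 3 * tid + i] = A[N / 3 * tid + i];
}
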
  2. Heterogeneous cores
      Copy the first half of array A to array B in parallel on two premapped cores, with core ID = 0, 1.
      Copy the second half of array A, with 1 added to each element, to array B in parallel on two premapped cores, with core ID = 2, 3.
      Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}

DFG

Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/4 * (tid-2)], N/4, P_A);
	SS_LINEAR_WRITE(P_B, N/4, &output[N/4 * (tid-2)]);
else:
	SS_LINEAR_READ(&array2[N/4 * tid], N/4, P_A2);
	SS_LINEAR_WRITE(P_B2, N/4, &output2[N/4 * tid]);
SS_GLOBAL_WAIT(4);
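
A scalar C sketch of what the two heterogeneous tasks compute (hypothetical; hetero_copy is my name):

// Hypothetical scalar model of example 2: task 0 copies the first half of
// A, task 1 adds 1 to each element of the second half.
void hetero_copy(const int *A, int *B, int N) {
	for (int i = 0; i < N / 2; ++i) B[i] = A[i];        // task 0: B = A
	for (int i = N / 2; i < N; ++i) B[i] = A[i] + 1;    // task 1: B2 = Add(A2, 1)
}
// A = {0, 1, 2, 3} gives B = {0, 1, 3, 4}, matching the example arrays.
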
  3. Creation dependence
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two premapped cores with core ID = 0, 1. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
else:
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
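
Since ordering is not maintained here, a scalar C sketch can only pin down that B ends up a permutation of A (hypothetical model; forward_unordered and the chosen arrival order are mine):

// Hypothetical model of example 3: each producer core forwards its chunk,
// but the interleaving across producers is unspecified, so B is only
// guaranteed to be a permutation of A.
void forward_unordered(const int *A, int *B, int N) {
	int pos = 0;
	for (int p = 1; p >= 0; --p)                // one possible arrival order
		for (int i = 0; i < N / 2; ++i)
			B[pos++] = A[N / 2 * p + i];
}
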
  4. Near-data scheduling
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two cores with core ID = 0, 1. The destination among core ID = 0, 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
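
A scalar C sketch of the spatial=arg0 routing as I read it (hypothetical; route_by_parity, the buffer sizes, and the final layout are assumptions):

// Hypothetical model of example 4 (spatial=arg0): each value is routed to
// consumer core 0 if even or core 1 if odd; order is preserved per core.
void route_by_parity(const int *A, int *B, int N) {
	int buf[2][64], cnt[2] = {0, 0};           // per-consumer buffers (N <= 64)
	for (int i = 0; i < N; ++i) {
		int c = A[i] & 1;                      // arg0's low bit picks the core
		buf[c][cnt[c]++] = A[i];
	}
	int pos = 0;                               // core 0's region, then core 1's
	for (int c = 0; c < 2; ++c)
		for (int i = 0; i < cnt[c]; ++i)
			B[pos++] = buf[c][i];
}
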
  5. Producer-consumer consumption with a single operand
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two premapped cores with core ID = 0, 1 in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
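
A scalar C sketch of the chunk-granularity transfer (hypothetical; chunked_copy and the particular chunk placement are mine, and N is assumed even):

#include <stdint.h>
// Hypothetical model of example 5: data moves in 2-element (8*2-byte)
// chunks; each chunk's consumer core is chosen dynamically, so ordering
// across chunks is not guaranteed (order within a chunk is).
void chunked_copy(const int64_t *A, int64_t *B, int N) {
	for (int c = 0; c < N / 2; ++c) {
		int dst = (N / 2 - 1) - c;             // one possible chunk placement
		B[2 * dst]     = A[2 * c];
		B[2 * dst + 1] = A[2 * c + 1];
	}
}
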
  6. Producer-consumer consumption with multiple operands (single source -> multiple dest)
      Read array A on two premapped cores with core ID = 2, 3 and send its contents to an array on two cores with core ID = 0, 1 in chunks of two elements per core.
      Read array I on any two cores with core ID = 0, 1 and send its contents to the corresponding dynamic core ID = 0, 1 for the final addition and storage in B. Ordering between A's, B's and C's elements is maintained.
      Is I needed for mapping? That sounds like a limitation.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid>1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD).
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
	// When acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[0], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
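
A scalar C sketch of the DFG's gather-and-add as I read it (hypothetical; gather_add is my name, and the contents of the gathered array are not specified by the example, so no concrete output values are reproduced):

#include <stdint.h>
// Hypothetical model of example 6: C2 is gathered from a second array via
// the index stream I, then added element-wise to A2.
void gather_add(const int64_t *A2, const int64_t *array2,
                const int64_t *I, int64_t *B2, int n) {
	for (int i = 0; i < n; ++i)
		B2[i] = A2[i] + array2[I[i]];          // B2 = Add64(A2, C2), C2 = array2[I]
}
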
  7. Producer-consumer consumption with multiple operands (multiple source -> multiple dest)
      Read array A on either of the two cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any two cores with core ID = 0, 1 in chunks of two elements per core.
      Read array I on any two cores with core ID = 0, 1 and send its contents to the corresponding dynamic core ID = 0, 1 for the final addition and storage in B. Ordering between A's, B's and C's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// DIFFERENCE BETWEEN SCHED AND INDEX: FILTERED??? CHILD is distributed among STRM_CHILD and STRM_PARENT using the dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/,  sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid==-1:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid>1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD).
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
	// When acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[0], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
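
A scalar C sketch of how depDist=8 would throttle the producer (hypothetical; bounded_pipeline is mine and models the handshake as a bounded FIFO):

#include <stdint.h>
// Hypothetical model of example 7's depDist=8: the producer may run at
// most 8 elements ahead of the consumer before the handshake stalls it.
void bounded_pipeline(const int64_t *A, int64_t *B, int n) {
	int64_t fifo[8];                            // capacity = depDist = 8
	int produced = 0, consumed = 0;
	while (consumed < n) {
		if (produced < n && produced - consumed < 8) {
			fifo[produced % 8] = A[produced];   // producer task runs ahead...
			++produced;
		} else {
			B[consumed] = fifo[consumed % 8];   // ...until the consumer catches up
			++consumed;
		}
	}
}
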
  8. Coarse-grained read reuse (batch memory requests and update near-data out of order)
      Read array A on two premapped cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any four cores with core ID = 0, 1, 2, 3.
      Read array A2 on the corresponding dynamic four cores with core ID = 0, 1, 2, 3 and copy it to B2. Ordering between A's, A2's and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)

Stream code

if tid==-1:
	SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
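
A scalar C sketch of the coal-type batching as I read it (hypothetical; coalesced_copy is mine, N is assumed divisible by batch, and the reverse issue order only mimics out-of-order completion):

// Hypothetical model of example 8 (type=coal): requests sharing a COAL_ID
// are merged, each batch is fetched with a single memory request, and
// batches may complete in any order.
void coalesced_copy(const int *A2, int *B2, int N, int batch) {
	for (int b = N - batch; b >= 0; b -= batch)    // one request per batch,
		for (int i = 0; i < batch; ++i)            // issued here in reverse to
			B2[b + i] = A2[b + i];                 // mimic out-of-order completion
}
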
  9. Coarse-grained read reuse (multicast data -> remote destination in order)
      Read array A on two premapped cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any four cores with core ID = 0, 1, 2, 3.
      Read array A2 on the corresponding dynamic four cores with core ID = 0, 1, 2, 3 and send its contents to the corresponding dynamic core ID = 0, 1 for the final copy to B2. Ordering between A's, A2's and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)

Stream code

if tid==-1:
	SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
	// inform the coalescing location about the child location
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
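
A scalar C sketch of the init_order=count reordering (hypothetical; reorder_writeback is mine, and tag is assumed to be a permutation of 0..n-1):

// Hypothetical model of example 9: values arrive tagged with a sequence
// number and are written back in tag order, so the final copy is in order
// even when arrival order is not.
void reorder_writeback(const int *vals, const int *tag, int *B2, int n) {
	for (int i = 0; i < n; ++i)
		B2[tag[i]] = vals[i];                  // arrival order may be arbitrary
}
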

Notes from microbenchmarks for other options

  1. From example 1, data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will hang.
  2. From example 2, I do not know what coreMask contributes when memory must still be distributed explicitly, as with pthreads (it should be like OpenMP).
  3. From example 8, there is no way to define an explicit task and its annotations. Therefore, we need a dummy TaskDep[0:1].
    The ACK edge annotation sounds like "Task 2's property".
    Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
  4. From example 10, can we have a non-stream-based way of handshaking? Kind of like short streams...
  5. From example 11, I am not sure whether multiple data locations add any complexity.

Workloads in the ASPLOS TaskStream paper

Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready

  1. Cholesky
  2. GCN
  3. DB->ML
  4. kNN
  5. SpMM
