Microbenchmarks
- Simple task-parallel program
Copy array A to array B on three premapped cores (core IDs 0, 1, and 2), each core copying its own N/3 slice.
Ordering between A's and B's elements is maintained. A sequential C sketch of the intended result follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}
DFG
Input: A
B = A
Output: B
TaskProp[0]: (coreMask=0111)
Stream code
if tid<3:
    SS_LINEAR_READ(&array[tid*N/3], N/3, P_A);
    SS_LINEAR_WRITE(P_B, N/3, &output[tid*N/3]);
SS_GLOBAL_WAIT(3);
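As a cross-check of the slice arithmetic, here is a minimal sequential C sketch of what the three cores compute; copy_task, N, and the tid*N/3 offsets are my reading of the stream code above, not part of the stream ISA.
#include <stddef.h>

/* Model of the simple task-parallel copy: each premapped core
 * (tid = 0, 1, 2) copies its own N/3-element slice of A into B,
 * preserving element order. */
void copy_task(const int *array, int *output, size_t N, int tid) {
    size_t begin = (size_t)tid * N / 3;        /* slice owned by this core */
    size_t end   = (size_t)(tid + 1) * N / 3;
    for (size_t i = begin; i < end; ++i)
        output[i] = array[i];                  /* B = A, order maintained */
}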
-
- Heterogeneous cores
Copy the first half of array A to array B in parallel on two premapped cores (core IDs 0, 1).
Copy the second half of array A, with 1 added to each element, to array B in parallel on two premapped cores (core IDs 2, 3).
Ordering between A's and B's elements is maintained. A sequential C sketch follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}
DFG
Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
Stream code
if tid>1:
    SS_LINEAR_READ(&array[tid*N/3], N/3, P_A);
    SS_LINEAR_WRITE(P_B, N/3, &output[tid*N/3]);
else:
    SS_LINEAR_READ(&array2[tid*N/3], N/3, P_A2);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
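A minimal sequential C sketch of the intended result; hetero_task and the per-core N/4 slices are illustrative assumptions, while the halves and the +1 come from the description above.
#include <stddef.h>

/* Model of the heterogeneous-cores example: cores 0-1 copy the first
 * half of A, cores 2-3 copy the second half plus one (Add(A2, 1)). */
void hetero_task(const int *A, int *B, size_t N, int tid) {
    size_t begin = (size_t)tid * (N / 4);      /* each core owns N/4 */
    for (size_t i = begin; i < begin + N / 4; ++i)
        B[i] = (tid < 2) ? A[i] : A[i] + 1;    /* second half incremented */
}
/* With A = {0, 1, 2, 3} and tid = 0..3 this yields B = {0, 1, 3, 4}. */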
-
- Creation dependence
Read array A on two premapped cores (core IDs 2, 3) and copy its contents to array B on two premapped cores (core IDs 0, 1). Ordering between A's and B's elements is not maintained. A sequential C sketch of one legal interleaving follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}
DFG
Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)
Stream code
if tid>1:
    SS_LINEAR_READ(&array[tid*N/3], N/3, P_A);
else:
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
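Since ordering is not maintained, any interleaving of the two consumer cores is legal. Below is a minimal sequential C sketch of one such interleaving; the stride-2 split between the consumers is an assumption for illustration.
#include <stddef.h>

/* Model of the creation edge (B:A2): the producer's output spawns
 * consumer work items, and the two consumer cores drain their task
 * buffers independently, so the final placement in B is unordered.
 * This is one legal schedule; the example above shows another
 * ({0, 3, 2, 1}). */
void creation_dep(const int *A, int *B, size_t N) {
    size_t out = 0;
    for (int consumer = 0; consumer < 2; ++consumer)   /* cores 0, 1 */
        for (size_t i = consumer; i < N; i += 2)       /* its work items */
            B[out++] = A[i];
}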
-
- Near-data scheduling
Read array A on two premapped cores (core IDs 2, 3) and copy its contents to array B on two cores (core IDs 0, 1). The destination among core IDs 0, 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained. A sequential C sketch of the parity routing follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}
DFG
Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)
Stream code
if tid>1:
    SS_LINEAR_READ(&array[tid*N/3], N/3, P_A);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
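A minimal sequential C sketch of the parity routing implied by spatial=arg0; the packed even-then-odd layout of B is an assumption (the example output above groups the values differently).
#include <stddef.h>

/* Model of near-data scheduling: each element is routed to consumer
 * core 0 (even value) or core 1 (odd value); each consumer appends to
 * its own region of B, so per-core order is preserved. */
void near_data(const int *A, int *B, size_t N) {
    size_t even_out = 0;        /* core 0's region (assumed first half) */
    size_t odd_out  = N / 2;    /* core 1's region (assumed second half) */
    for (size_t i = 0; i < N; ++i) {
        if (A[i] % 2 == 0) B[even_out++] = A[i];
        else               B[odd_out++]  = A[i];
    }
}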
-
- Producer-consumer consumption with single operand
Read array A on two premapped cores (core IDs 2, 3) and copy its contents to array B on two premapped cores (core IDs 0, 1) in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained. A sequential C sketch of the chunked hand-off follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}
DFG
Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)
Stream code
if tid>1:
    SS_LINEAR_READ(&array[tid*N/2], N/2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
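A minimal sequential C sketch of the chunked hand-off (bytes = 8*2, i.e. two 8-byte elements per chunk); the back-to-front claim order stands in for whichever consumer core grabs a chunk first and is an assumption.
#include <stddef.h>
#include <stdint.h>

#define CHUNK 2   /* bytes = 8*2 in the TaskDep: two 8-byte elements */

/* Model of single-operand producer-consumer consumption: the producer
 * streams A in CHUNK-element units; whichever consumer claims a unit
 * writes it next, so chunks may land out of order while the elements
 * inside a chunk stay together. */
void chunked_copy(const int64_t *A, int64_t *B, size_t N) {
    size_t out = 0;
    for (size_t c = N / CHUNK; c-- > 0; )      /* one legal claim order */
        for (size_t i = 0; i < CHUNK; ++i)
            B[out++] = A[c * CHUNK + i];
}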
-
- Producer-consumer consumption with multiple operands (single source->multiple dest)
Read array A on two premapped cores (core IDs 2, 3) and send its contents to array B on two cores (core IDs 0, 1) in chunks of two elements per core.
Read array I on any two cores (core IDs 0, 1) and send its contents to the corresponding dynamically chosen core (core ID 0 or 1) for the final addition and storage in B. Ordering between A's, B's, and C's elements is maintained. A sketch of the gather-and-add pattern follows the stream code.
Is I needed for mapping? That sounds like a limitation.
Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}
DFG
Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)
Stream code
if tid>1:
    // this will produce data at (B, dummy, count, bytes_out, CHILD).
    // (dummy) will be automatically sent to (CHILD's task buffer)
    SS_LINEAR_READ(&array[tid*N/2], N/2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    // a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
    // When acknowledged, it will read "count" and use it as an index to read data...
    SS_INDIRECT(&array2[tid*N/2], P_I, N/2, P_C2);
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
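A minimal sequential C sketch of the consumer-side pattern that SS_INDIRECT plus Add64(A2, C2) express; the contents of array2 are not given in these notes, so this shows only the gather-and-add access pattern, not the exact values of array_B.
#include <stddef.h>
#include <stdint.h>

/* Model of the indirect operand: C2[i] = array2[I[i]] is gathered
 * through the index stream I, then added to the forwarded A2 value. */
void gather_add(const int64_t *A2, const int64_t *array2,
                const int64_t *I, int64_t *B2, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int64_t C2 = array2[I[i]];   /* indirect (gathered) operand */
        B2[i] = A2[i] + C2;          /* Add64(A2, C2) */
    }
}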
-
- Producer-consumer consumption with multiple operands (multiple source->multiple dest)
Read array A on either of the two cores with core IDs 2, 3 (using a dummy task) and send its contents to array B on any two cores with core IDs 0, 1 in chunks of two elements per core.
Read array I on any two cores (core IDs 0, 1) and send its contents to the corresponding dynamically chosen core (core ID 0 or 1) for the final addition and storage in B. Ordering between A's, B's, and C's elements is maintained. A sketch of the child-ID relay follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}
DFG
Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// Difference between sched and index: is it filtered? CHILD is distributed among STRM_CHILD and STRM_PARENT using the dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/, sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)
Stream code
if tid==-1:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid>1:
    // this will produce data at (B, dummy, count, bytes_out, CHILD).
    // (dummy) will be automatically sent to (CHILD's task buffer)
    SS_LINEAR_READ(&array[tid*N/2], N/2, P_A);
    SS_CONST(P_flag, 1, N);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    // a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
    // When acknowledged, it will read "count" and use it as an index to read data...
    SS_INDIRECT(&array2[tid*N/2], P_I, N/2, P_C2);
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
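A minimal sequential C sketch of what the extra tid == -1 task appears to do: SS_REM_PORT relays each dynamically chosen consumer-core ID from STRM_CHILD back to STRM_PARENT so the producer side learns its destinations. The arrays standing in for the two ports are assumptions, and the depDist-based distribution mentioned above is not modeled.
#include <stddef.h>

/* Model of the child-ID relay set up by SS_REM_PORT: one consumer-core
 * ID is forwarded per token from the child port to the parent port. */
void relay_child_ids(const int *strm_child, int *strm_parent, size_t n) {
    for (size_t i = 0; i < n; ++i)
        strm_parent[i] = strm_child[i];
}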
-
- Coarse-grained read reuse (Batch memory requests and update near-data out-of-order)
Read array A on two premapped cores (core IDs 2, 3, using a dummy task) and send its contents to array B on any four cores (core IDs 0, 1, 2, 3).
Read array A2 on the corresponding dynamically chosen four cores (core IDs 0, 1, 2, 3) and copy it to B2. Ordering between A's, A2's, and B2's elements is maintained. A sketch of the batch-service pattern follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}
DFG
Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)
Stream code
if tid==-1:   // condition missing in the original notes; assumed dummy task as in the previous example
    SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
    SS_LINEAR_READ(&array[tid*N/2], N/2, P_A);
    SS_CONST(P_bytes_in, 8, N/2);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
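A minimal sequential C sketch of servicing one coalesced request; treating BATCH as a list of element indices and SIZE as its length is my reading of the (B:BATCH, bytes_out:SIZE) edge, not a documented interface.
#include <stddef.h>

/* Model of coarse-grained read reuse: producers batch element indices
 * into (BATCH, SIZE) requests; the near-data side services each batch
 * with one indirect pass over A2.  Batches may be serviced out of
 * order, but because B2 = A2 element-wise the result is unaffected. */
void serve_batch(const int *A2, int *B2,
                 const size_t *batch, size_t size) {
    for (size_t i = 0; i < size; ++i)
        B2[batch[i]] = A2[batch[i]];
}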
-
- Coarse-grained read reuse (Multicast data -> remote destination in-order)
Read array A on two premapped cores (core IDs 2, 3, using a dummy task) and send its contents to array B on any four cores (core IDs 0, 1, 2, 3).
Read array A2 on the corresponding dynamically chosen four cores (core IDs 0, 1, 2, 3) and send its contents to the corresponding dynamically chosen core (core ID 0 or 1) for the final copy to B2. Ordering between A's, A2's, and B2's elements is maintained. A sketch of the tag-matched in-order delivery follows the stream code.
Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}
DFG
Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)
Stream code
if tid==-1:   // condition missing in the original notes; assumed dummy task as above
    SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
    SS_LINEAR_READ(&array[tid*N/2], N/2, P_A);
    SS_CONST(P_bytes_in, 8, N/2);
    // inform the coalescing location about the child location
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
    SS_CONFIG_STREAM(/*bypass=1*/);
    SS_LINEAR_WRITE(P_B2, N/3, &output2[tid*N/3]);
SS_GLOBAL_WAIT(4);
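A minimal sequential C sketch of the in-order delivery: each multicast payload carries its init_order=count tag, and the consumer writes into the slot named by the tag, so B2 comes out in order regardless of arrival order. Treating count as a per-element sequence number is an assumption.
#include <stddef.h>

/* Model of tag-matched delivery: payloads may arrive in any order,
 * but the order tag selects the destination slot in B2. */
void deliver_in_order(int *B2, const int *payload,
                      const size_t *tag, size_t n) {
    for (size_t i = 0; i < n; ++i)
        B2[tag[i]] = payload[i];
}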
Notes from microbenchmarks for other options
- From example 1: data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will be left hanging.
- From example 2: I do not know what coreMask contributes when memory distribution still has to be done explicitly, as with pthreads (it should work like OpenMP).
- From example 8: there is no way to define an explicit task and its annotations, so we need a dummy TaskDep[0:1]. The ACK edge annotation sounds like a property of Task 2. Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
- From example 10: can we have a non-stream-based way of handshaking? Something like very short streams...
- From example 11: I am not sure whether multiple data locations add any complexity.
Workloads in the ASPLOS TaskStream paper
Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready
- Cholesky
- GCN
- DB->ML
- kNN
- SpMM