Adding task features in stream dataflow #1

@vidushidadu

Microbenchmarks

  1. Simple task-parallel program
    Copy array A to array B in parallel on three premapped cores, with core ID = 0, 1, and 2.
    Ordering between A's and B's elements is maintained.
    Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 2, 3}

DFG

Input: A
B = A
Output: B
TaskProp[0]: (coreMask=1110)

Stream code

if tid<3:
	SS_LINEAR_READ(&array[N/3 * tid], N/3, P_A);
	SS_LINEAR_WRITE(P_B, N/3, &output[N/3 * tid]);
SS_GLOBAL_WAIT(3);
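
For reference, a minimal scalar C sketch of the intended semantics (a hypothetical model, not the stream implementation; the name copy_task is mine, and N is assumed divisible by 3):

// Hypothetical scalar model of example 1: each of the three premapped
// cores copies its contiguous N/3 chunk of A into B, preserving order.
void copy_task(const int *A, int *B, int N) {
	for (int tid = 0; tid < 3; ++tid)            // one iteration per core
		for (int i = 0; i < N / 3; ++i)          // that core's N/3 chunk
			B[N / 3 * tid + i] = A[N / 3 * tid + i];
}
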
  2. Heterogeneous cores
      Copy the first half of array A to array B in parallel on two premapped cores, with core ID = 0, 1.
      Copy the second half of array A, with 1 added to each element, to array B in parallel on two premapped cores, with core ID = 2, 3.
      Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 1, 3, 4}

DFG

Input: A, A2
B = A
B2 = Add(A2, 1)
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/4 * (tid-2)], N/4, P_A);
	SS_LINEAR_WRITE(P_B, N/4, &output[N/4 * (tid-2)]);
else:
	SS_LINEAR_READ(&array2[N/4 * tid], N/4, P_A2);
	SS_LINEAR_WRITE(P_B2, N/4, &output2[N/4 * tid]);
SS_GLOBAL_WAIT(4);
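
A scalar C sketch of what the two heterogeneous tasks compute (hypothetical; hetero_copy is my name):

// Hypothetical scalar model of example 2: task 0 copies the first half of
// A, task 1 adds 1 to each element of the second half.
void hetero_copy(const int *A, int *B, int N) {
	for (int i = 0; i < N / 2; ++i) B[i] = A[i];        // task 0: B = A
	for (int i = N / 2; i < N; ++i) B[i] = A[i] + 1;    // task 1: B2 = Add(A2, 1)
}
// A = {0, 1, 2, 3} gives B = {0, 1, 3, 4}, matching the example arrays.
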
  3. Creation dependence
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two premapped cores with core ID = 0, 1. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 3, 2, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=creation]: (B:A2)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
else:
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
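
Since ordering is not maintained here, a scalar C sketch can only pin down that B ends up a permutation of A (hypothetical model; forward_unordered and the chosen arrival order are mine):

// Hypothetical model of example 3: each producer core forwards its chunk,
// but the interleaving across producers is unspecified, so B is only
// guaranteed to be a permutation of A.
void forward_unordered(const int *A, int *B, int N) {
	int pos = 0;
	for (int p = 1; p >= 0; --p)                // one possible arrival order
		for (int i = 0; i < N / 2; ++i)
			B[pos++] = A[N / 2 * p + i];
}
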
  4. Near-data scheduling
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two cores with core ID = 0, 1. The destination among core ID = 0, 1 is determined by whether the value in the array is odd (core ID = 1) or even (core ID = 0). Ordering between A's and B's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
B = A
B2 = A2
Output: B, B2
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100, spatial=arg0)
TaskDep[0:1, type=creation]: (producer:consumer)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
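
A scalar C sketch of the spatial=arg0 routing as I read it (hypothetical; route_by_parity, the buffer sizes, and the final layout are assumptions):

// Hypothetical model of example 4 (spatial=arg0): each value is routed to
// consumer core 0 if even or core 1 if odd; order is preserved per core.
void route_by_parity(const int *A, int *B, int N) {
	int buf[2][64], cnt[2] = {0, 0};           // per-consumer buffers (N <= 64)
	for (int i = 0; i < N; ++i) {
		int c = A[i] & 1;                      // arg0's low bit picks the core
		buf[c][cnt[c]++] = A[i];
	}
	int pos = 0;                               // core 0's region, then core 1's
	for (int c = 0; c < 2; ++c)
		for (int i = 0; i < cnt[c]; ++i)
			B[pos++] = buf[c][i];
}
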
  5. Producer-consumer consumption with a single operand
      Read array A on two premapped cores with core ID = 2, 3 and copy its contents to array B on two premapped cores with core ID = 0, 1 in chunks of two elements (bytes = 8*2) per core. Ordering between A's and B's elements is not maintained.
      Example
array_A = {0, 1, 2, 3}
array_B = {0, 2, 3, 1}

DFG

Input: A, A2
Input: flag, bytes_in
B = A
B2 = A2
count = Acc64(flag)
Output: B, B2
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2)

Stream code

if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
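
A scalar C sketch of the chunk-granularity transfer (hypothetical; chunked_copy and the particular chunk placement are mine, and N is assumed even):

#include <stdint.h>
// Hypothetical model of example 5: data moves in 2-element (8*2-byte)
// chunks; each chunk's consumer core is chosen dynamically, so ordering
// across chunks is not guaranteed (order within a chunk is).
void chunked_copy(const int64_t *A, int64_t *B, int N) {
	for (int c = 0; c < N / 2; ++c) {
		int dst = (N / 2 - 1) - c;             // one possible chunk placement
		B[2 * dst]     = A[2 * c];
		B[2 * dst + 1] = A[2 * c + 1];
	}
}
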
  6. Producer-consumer consumption with multiple operands (single source -> multiple dest)
      Read array A on two premapped cores with core ID = 2, 3 and send its contents to an array on two cores with core ID = 0, 1 in chunks of two elements per core.
      Read array I on any two cores with core ID = 0, 1 and send its contents to the corresponding dynamic core ID = 0, 1 for the final addition and storage in B. Ordering between A's, B's and C's elements is maintained.
      Is I needed for mapping? That sounds like a limitation.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid>1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD).
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
	// When acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[0], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
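
A scalar C sketch of the DFG's gather-and-add as I read it (hypothetical; gather_add is my name, and the contents of the gathered array are not specified by the example, so no concrete output values are reproduced):

#include <stdint.h>
// Hypothetical model of example 6: C2 is gathered from a second array via
// the index stream I, then added element-wise to A2.
void gather_add(const int64_t *A2, const int64_t *array2,
                const int64_t *I, int64_t *B2, int n) {
	for (int i = 0; i < n; ++i)
		B2[i] = A2[i] + array2[I[i]];          // B2 = Add64(A2, C2), C2 = array2[I]
}
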
  7. Producer-consumer consumption with multiple operands (multiple source -> multiple dest)
      Read array A on either of the two cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any two cores with core ID = 0, 1 in chunks of two elements per core.
      Read array I on any two cores with core ID = 0, 1 and send its contents to the corresponding dynamic core ID = 0, 1 for the final addition and storage in B. Ordering between A's, B's and C's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_I = {1, 2, 4, 8}
array_B = {0, 2, 8, 24}

DFG

Input: A, A2, C2, I
Input: flag, bytes_in
B = A
dummy = A
B2 = Add64(A2, C2)
count = Acc64(flag)
Output: B, B2, dummy
Output: count, bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// consumers: ACK notation, "Order for tag-matching", producers: child core location
// we need to send CHILD information to where the producer task is scheduled (schedParent)
// DIFFERENCE BETWEEN SCHED AND INDEX: FILTERED??? CHILD is distributed among STRM_CHILD and STRM_PARENT using the dependence distance.
TaskDep[0:1, type=creation, ack=ACK/*sw-state*/, order=count/*sw*/, index=CHILD/*hw*/,  sched=STRM_CHILD/*hw*/, schedParent=STRM_PARENT/*hw*/, depDist=8]: (count, I)
TaskDep[0:1, type=direct, init_order=count, bytes=bytes_out]: (B:A2, I:C2)

Stream code

if tid==-1:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else if tid>1:
	// this will produce data at (B, dummy, count, bytes_out, CHILD).
	// (dummy) will be automatically sent to (CHILD's task buffer)
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_flag, 1, N);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	// a stream will pull (bytes_out) data from (B, count) and send to (CHILD)
	// When acknowledged, it will read "count" and use it as an index to read data...
	SS_INDIRECT(&array2[0], P_I, N/2, P_C2);
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
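
A scalar C sketch of how depDist=8 would throttle the producer (hypothetical; bounded_pipeline is mine and models the handshake as a bounded FIFO):

#include <stdint.h>
// Hypothetical model of example 7's depDist=8: the producer may run at
// most 8 elements ahead of the consumer before the handshake stalls it.
void bounded_pipeline(const int64_t *A, int64_t *B, int n) {
	int64_t fifo[8];                            // capacity = depDist = 8
	int produced = 0, consumed = 0;
	while (consumed < n) {
		if (produced < n && produced - consumed < 8) {
			fifo[produced % 8] = A[produced];   // producer task runs ahead...
			++produced;
		} else {
			B[consumed] = fifo[consumed % 8];   // ...until the consumer catches up
			++consumed;
		}
	}
}
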
  8. Coarse-grained read reuse (batch memory requests and update near-data out of order)
      Read array A on two premapped cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any four cores with core ID = 0, 1, 2, 3.
      Read array A2 on the corresponding dynamic four cores with core ID = 0, 1, 2, 3 and copy it to B2. Ordering between A's, A2's and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out]: (B:BATCH, bytes_out:SIZE)

Stream code

if tid==-1:
	SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
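
A scalar C sketch of the coal-type batching as I read it (hypothetical; coalesced_copy is mine, N is assumed divisible by batch, and the reverse issue order only mimics out-of-order completion):

// Hypothetical model of example 8 (type=coal): requests sharing a COAL_ID
// are merged, each batch is fetched with a single memory request, and
// batches may complete in any order.
void coalesced_copy(const int *A2, int *B2, int N, int batch) {
	for (int b = N - batch; b >= 0; b -= batch)    // one request per batch,
		for (int i = 0; i < batch; ++i)            // issued here in reverse to
			B2[b + i] = A2[b + i];                 // mimic out-of-order completion
}
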
  9. Coarse-grained read reuse (multicast data -> remote destination in order)
      Read array A on two premapped cores with core ID = 2, 3 (using a dummy task) and send its contents to an array on any four cores with core ID = 0, 1, 2, 3.
      Read array A2 on the corresponding dynamic four cores with core ID = 0, 1, 2, 3 and send its contents to the corresponding dynamic core ID = 0, 1 for the final copy to B2. Ordering between A's, A2's and B2's elements is maintained.
      Example
array_A = {0, 1, 2, 3}
array_B2 = {0, 1, 2, 3}

DFG

Input: A, A2
Input: bytes_in
B = A
B2 = A2
Output: B, B2
Output: bytes_out
TaskProp[0]: (coreMask=0011)
TaskProp[1]: (coreMask=1100)
// Merge "memory task-part" by COAL_ID and "ACK" Task with "tag=count".
// When setting streams, send "bytes" from "COAL_ID" to CHILD core.
TaskDep[0:1, type=creation, ack=ACK, index=CHILD, sched=STRM_CHILD]: (dummy:A2, I:A)
TaskDep[0:1, type=coal, id=COAL_ID, bytes=bytes_out, init_order=count, index=CHILD]: (B:BATCH, bytes_out:SIZE)
TaskDep[0:1, type=direct, repeat=]: (DATA:A2)

Stream code

if tid==-1:
	SS_INDIRECT(&array2[0], P_BATCH, P_SIZE, P_A2);
else if tid>1:
	SS_LINEAR_READ(&array[N/2 * (tid-2)], N/2, P_A);
	SS_CONST(P_bytes_in, 8, N/2);
	// inform the coalescing location about the child location
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_REM_PORT(P_STRM_CHILD, N, P_STRM_PARENT);
else:
	SS_CONFIG_STREAM(/*bypass=1*/);
	SS_LINEAR_WRITE(P_B2, N/2, &output2[N/2 * tid]);
SS_GLOBAL_WAIT(4);
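
A scalar C sketch of the init_order=count reordering (hypothetical; reorder_writeback is mine, and tag is assumed to be a permutation of 0..n-1):

// Hypothetical model of example 9: values arrive tagged with a sequence
// number and are written back in tag order, so the final copy is in order
// even when arrival order is not.
void reorder_writeback(const int *vals, const int *tag, int *B2, int n) {
	for (int i = 0; i < n; ++i)
		B2[tag[i]] = vals[i];                  // arrival order may be arbitrary
}
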

Notes from microbenchmarks for other options

  1. From example 1, data is not automatically distributed across the assigned cores. Data sent to unconfigured cores will hang.
  2. From example 2, I do not know what coreMask contributes when memory must still be distributed explicitly, as with pthreads (it should be like OpenMP).
  3. From example 8, there is no way to define an explicit task and its annotations. Therefore, we need a dummy TaskDep[0:1].
    The ACK edge annotation sounds like "Task 2's property".
    Problem: there is no definition of Task[i]; it is implicitly associated with the producer and consumer ports.
  4. From example 10, can we have a non-stream-based way of handshaking? Kind of like short streams...
  5. From example 11, I am not sure whether multiple data locations add any complexity.

Workloads in the ASPLOS TaskStream paper

Code: https://github.com/PolyArch/ss-workloads/tree/polyarch/task-parallel-benchmarks/camera_ready

  1. Cholesky
  2. GCN
  3. DB->ML
  4. kNN
  5. SpMM
