[Pipeline Middleware] Reduce comm redundancy by getting accurate output (#2232)

* move to CPU to avoid deadlock

* get output by offsets

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>

Ziyue Jiang committed 3y ago

8b045b3c1f4e6b8bfae062f0318bd1481c881a10

Parent: 09c0102

Committed by GitHub <noreply@github.com> on 1/3/2023, 5:43:57 AM