Lecture 13 Notes

ECE 571 – Advanced Microprocessor-Based Design Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver [email protected]

16 October 2014

Prefetching
Try to avoid cache misses by bringing values into the cache before they are needed. Caches with a large block size already bring in extra data in advance, but can we do more?


Prefetching Concerns
• When? We want to bring in data before we need it, but not so early that it wastes space in the cache.
• Where? What part of the cache? A dedicated buffer?


Limits of Prefetching
• May kick useful data out of the cache
• Costs energy, especially if we never use the prefetched data


Implementation Issues
• Which level to bring data into? (register, L1, L2)
• Faulting: what happens if the address is invalid?
• Non-cacheable areas (MTRR, PAT) — bad to prefetch memory-mapped registers!


Software Prefetching
• ARM has the PLD instruction
• PREFETCHW for writes (3DNow!, Alpha) — lets the cache protocol acquire the line for writing
• Prefetch, evict next (make it LRU) – Alpha
• Prefetch a stream (AltiVec)
• PREFETCHT0/T1/T2 to the various cache levels (x86 SSE); PREFETCHNTA for non-temporal data
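From C these instructions are usually reached through compiler intrinsics rather than inline assembly; GCC and Clang provide __builtin_prefetch(addr, rw, locality), which emits the target's prefetch instruction (PLD on ARM, PREFETCHT0/NTA on x86). A minimal sketch of a streaming loop that prefetches a fixed distance ahead — the function name and the distance of 16 elements are illustrative choices, not from the slides:

```c
#include <stddef.h>

/* PF_DIST is a tuning knob (illustrative value): too small and the data
   has not arrived by the time it is used; too large and the line may be
   evicted again before use, wasting cache space and energy. */
#define PF_DIST 16

float sum_with_prefetch(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 3); /* read, keep in all levels */
        sum += a[i];
    }
    return sum;
}
```

The prefetch is purely a hint: it never faults, so the bounds check is only there to avoid issuing useless prefetches past the end of the array.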

Hardware Prefetching – icache
• Bring in two cache lines at a time
• The branch predictor can provide hints and targets
• Bring in both targets of a branch


Hardware Prefetching – dcache
• Next-line prefetch: on a miss, bring in lines N and N+1 (or more?)
• Demand-triggered: prefetch only on a miss (with linear access, every other line is still a miss)
• Tagged: bring in N+1 on the first access to a cache line (no misses after the first with linear access)

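The demand vs. tagged distinction can be checked with a toy model of a linear walk over cache lines (a sketch; the 128-line walk length and the one-line-ahead policy are the only assumptions):

```c
#include <stdbool.h>
#include <string.h>

#define WALK_LINES 128   /* length of the simulated linear walk, in lines */

/* Demand-triggered next-line prefetch: on a miss to line n, bring in
   n and n+1. A linear walk still misses on every other line. */
int demand_misses(void)
{
    bool present[WALK_LINES + 1];
    memset(present, 0, sizeof(present));
    int misses = 0;
    for (int n = 0; n < WALK_LINES; n++) {
        if (!present[n]) {
            misses++;
            present[n] = true;
            present[n + 1] = true;  /* prefetch triggered only by the miss */
        }
    }
    return misses;
}

/* Tagged prefetch: on the first access to line n, hit or miss, bring in
   n+1. A linear walk then misses only on the very first line. */
int tagged_misses(void)
{
    bool present[WALK_LINES + 1], accessed[WALK_LINES + 1];
    memset(present, 0, sizeof(present));
    memset(accessed, 0, sizeof(accessed));
    int misses = 0;
    for (int n = 0; n < WALK_LINES; n++) {
        if (!present[n]) { misses++; present[n] = true; }
        if (!accessed[n]) { accessed[n] = true; present[n + 1] = true; }
    }
    return misses;
}
```

Running both over the same walk shows the slide's claim directly: the demand scheme misses on half the lines, the tagged scheme only on the cold start.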

Hardware Prefetching – Stride Prefetching
• Stride predictors – like a branch predictor, but tracking load addresses; keep track of the stride between successive loads
• Separate stream buffer?


Stride Predictor

  0x10004002: ldb r1, 0x00000200

  PC-indexed table entry: last load = 0x00000100, new load = 0x00000200, stride = +0x100
  → Prefetch 0x00000300

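The table entry in the example above can be written out in C — a simplified sketch with a single entry and no confidence bits (real predictors wait for the stride to repeat before issuing prefetches), using the same +0x100 stride as the slide:

```c
#include <stdint.h>

/* One entry of a PC-indexed stride-prediction table: remembers the last
   address this load touched and the stride between its last two
   addresses (simplified: no tag, no confidence state). */
typedef struct {
    uint32_t last_addr;
    int32_t  stride;
    int      valid;
} stride_entry;

/* Update the entry for one dynamic instance of the load and return the
   predicted prefetch address (current address + stride), or 0 if there
   is no prediction yet. */
uint32_t stride_predict(stride_entry *e, uint32_t addr)
{
    uint32_t prefetch = 0;
    if (e->valid) {
        e->stride = (int32_t)(addr - e->last_addr);
        prefetch = addr + (uint32_t)e->stride;
    }
    e->last_addr = addr;
    e->valid = 1;
    return prefetch;
}
```

Feeding it the slide's sequence (0x100 then 0x200) makes it predict a prefetch of 0x300.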

Hardware Prefetching – Correlation/Content-Directed Prefetching
• How to handle things like pointer chasing / linked lists?
• Correlation – records the sequence of misses, then when traversing again prefetches in that order
• Content-directed – recognizes pointers in fetched data and prefetches what they point to

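Software can attack the pointer-chasing case directly, since it knows the next node's address one hop before it is needed. A hedged sketch using the GCC/Clang __builtin_prefetch intrinsic (the function and list layout are illustrative, not from the slides):

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* Traverse a linked list, issuing a software prefetch for the next node
   before working on the current one. Hardware stride prefetchers cannot
   predict these addresses, but software knows them one hop ahead; with
   only one node of lookahead, the benefit depends on how much work is
   done per node. */
long list_sum(const struct node *n)
{
    long sum = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 3);  /* read, high locality */
        sum += n->value;                        /* work on current node */
        n = n->next;
    }
    return sum;
}
```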

Using 2-bit Counters
• Use a 2-bit counter to detect a load causing lots of misses; if so, automatically treat it as a streaming load (Rivers)
• Partitioned cache: cache the stack, heap, etc. (or little/big/huge regions) separately (Lee and Tyson)

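The 2-bit scheme is the same saturating-counter idea used in branch predictors, applied per load. A minimal sketch (treating a saturated counter as "streaming" is the assumed policy):

```c
/* Per-load 2-bit saturating counter: count up on a miss, down on a hit.
   When saturated, treat the load as streaming (e.g. bring its lines in
   non-temporally), sketching the Rivers scheme the slide cites. */
typedef struct { unsigned ctr; } miss_ctr;   /* value 0..3 */

void miss_ctr_update(miss_ctr *c, int was_miss)
{
    if (was_miss) { if (c->ctr < 3) c->ctr++; }
    else          { if (c->ctr > 0) c->ctr--; }
}

int is_streaming(const miss_ctr *c) { return c->ctr == 3; }
```

The hysteresis matters: a single hit drops a streaming load back below the threshold, but one miss does not immediately flag a well-behaved load.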

Cortex A9 Prefetch
• PLD – the prefetch instruction has a dedicated instruction unit
• Optional hardware prefetcher (disabled on the Pandaboard)
• Can prefetch 8 data streams; detects ascending and descending strides of up to 8 cache lines
• Keeps prefetching as long as it is causing hits
• Stops if: it crosses a 4kB page boundary, the context changes, a DSB (barrier) or a PLD instruction executes, or the program does not hit in the prefetched lines
• PLD requests always take precedence


Investigating Prefetching Using Hardware Performance Counters


Quick Look at Core2 HW Prefetch
• Instruction prefetcher
• L1 Data Cache Unit Prefetcher (streaming): ascending data accesses prefetch the next line
• L1 Instruction-Pointer Strided Prefetcher: looks for strided accesses from particular load instructions, forward or backward, up to 2k apart
• L2 Data Prefetch Logic: fetches into L2 based on the L1 DCU


x86 SW Prefetch Instructions (AMD)
• PREFETCHNTA – SSE1, non-temporal (use once)
• PREFETCHT0 – SSE1, prefetch to all levels
• PREFETCHT1 – SSE1, prefetch to L2 and higher
• PREFETCHT2 – SSE1, prefetch to L3 and higher
• PREFETCH – AMD 3DNow!, prefetch to L1
• PREFETCHW – AMD 3DNow!, prefetch for write


Core2
• SSE_PRE_EXEC:NTA – counts NTA
• SSE_PRE_EXEC:L1 – counts T0 (fxsave +2, fxrstor +5)
• SSE_PRE_EXEC:L2 – counts T1/T2
• Problem: only 2 counters available on Core2


AMD (Istanbul and Later)
• PREFETCH_INSTRUCTIONS_DISPATCHED:NTA
• PREFETCH_INSTRUCTIONS_DISPATCHED:LOAD
• PREFETCH_INSTRUCTIONS_DISPATCHED:STORE
• These events appear to be speculative, and won't count SW prefetches that conflict with HW prefetches


Atom
• PREFETCH:PREFETCHNTA
• PREFETCH:PREFETCHT0
• PREFETCH:SW_L2
• These events will count SW prefetches, but the numbers counted vary in complex ways


Does anyone use SW Prefetch?
• gcc by default disables SW prefetch unless you specify -fprefetch-loop-arrays
• icc disables it unless you specify -xsse4.2 -opt-prefetch=4
• glibc has hand-coded SW prefetch in memcpy()
• Prefetch can hurt behavior:
  – can throw out good cache lines
  – can bring lines in too soon
  – can interfere with the HW prefetcher

[Figure: SW Prefetch Distribution — SPEC CPU 2000 on Core2 with gcc -fprefetch-loop-arrays. Two bar charts (integer and FP benchmarks) break total load instructions (y-axis in billions: up to 80B integer, 150B FP) into plain loads and T0, T1/T2, and NTA prefetches; benchmarks without data are marked N/A.]

[Figure: Normalized SW Prefetch Runtime on Core2 (smaller is better) — integer and FP SPEC CPU 2000 runtimes with -fprefetch-loop-arrays enabled, normalized against runs without it; y-axis ticks at 0, 0.5, and 1, with some benchmarks marked N/A.]

The HW Prefetcher on Core2 can be Disabled


[Figure: Runtime with the HW Prefetcher Disabled, normalized against runtime with it enabled, on Core2 (smaller is better) — two series per benchmark, plain and with SW prefetch. Many benchmarks slow down markedly: outliers reach 1.82 and 1.84 on the integer chart and 2.47, 2.58, 3.66, and 3.82 on the FP chart; some benchmarks are marked N/A.]

PAPI_PRF_SW Revisited
• Can multiple machines count SW prefetches? Yes.
• Does the behavior of the events match expectations? Not always.
• Would people use the preset? Maybe.


L1 Data Cache Accesses

  float array[1000], sum = 0.0;

  PAPI_start_counters(events, 1);
  for (int i = 0; i < 1000; i++) {
      sum += array[i];
  }
  PAPI_stop_counters(counts, 1);

[Figure: PAPI_L1_DCA — L1 DCache accesses for the 1000-element loop, normalized against 1000, on PPro, P4, Atom, Core2, Istanbul, Nehalem-EX, and Nehalem with gcc 4.1 and gcc 4.3; y-axis 0 to 6, with "No Counter Available" marked where the event is unsupported.]

PAPI_L1_DCA Expected Code

  * 4020d8: f3 0f 58 00           addss  (%rax),%xmm0
    4020dc: 48 83 c0 04           add    $0x4,%rax
    4020e0: 48 39 d0              cmp    %rdx,%rax
    4020e3: 75 f3                 jne    4020d8

Unexpected Code

  * 401e18: f3 0f 10 44 24 0c     movss  0xc(%rsp),%xmm0
  * 401e1e: f3 0f 58 04 82        addss  (%rdx,%rax,4),%xmm0
    401e23: 48 83 c0 01           add    $0x1,%rax
    401e27: 48 3d e8 03 00 00     cmp    $0x3e8,%rax
  * 401e2d: f3 0f 11 44 24 0c     movss  %xmm0,0xc(%rsp)
    401e33: 75 e3                 jne    401e18

(* marks instructions that access the data cache)

L1 Data Cache Misses
• Allocate an array as big as the L1 DCache
• Walk through the array byte-by-byte
• Count misses with the PAPI_L1_DCM event
• With a 32B line size and a linear walk through memory, the first pass will have a 1/32 miss rate, or 3.125%. A second pass (if the array fits in the cache) should be 0%.

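The expected numbers can be checked against an idealized model — a cache with 32-byte lines, big enough to hold the whole array, and no prefetching (a sketch; the 32 kB array size is an assumed L1 capacity):

```c
#include <stdbool.h>

#define LINE_SIZE  32
#define ARRAY_SIZE (32 * 1024)              /* assumed 32 kB L1 DCache */
#define NLINES     (ARRAY_SIZE / LINE_SIZE)

/* One byte-by-byte pass over the array against an idealized cache that
   holds the whole array; `present` carries cache state between passes.
   Returns the number of misses during this pass. */
long walk_pass(bool present[NLINES])
{
    long misses = 0;
    for (long addr = 0; addr < ARRAY_SIZE; addr++) {
        long line = addr / LINE_SIZE;
        if (!present[line]) { present[line] = true; misses++; }
    }
    return misses;
}
```

The first pass misses once per line (1/32 of accesses, 3.125%); the second pass hits everywhere. Hardware prefetching, OS activity, and non-LRU replacement are exactly what make real counter measurements diverge from this model.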

PAPI_L1_DCM – Forward/Reverse/Random

[Figure: L1 DCache misses for forward, reverse, and random walks, normalized against the expected miss count, on PPro, P4, Atom, Core2, Core2 with prefetch disabled, Istanbul, Nehalem, and Nehalem-EX; y-axis 0.00 to 1.25, with "No Counter Available" marked where the event is unsupported.]

L1D Sources of Divergence
• Hardware prefetching
• PAPI measurement noise
• Operating system activity
• Non-LRU cache replacement


L2 Total Cache Misses
• Allocate an array as big as the L2 cache
• Walk through the array byte-by-byte
• Count misses with the PAPI_L2_TCM event


PAPI_L2_TCM – Forward/Reverse/Random

[Figure: L2 total cache misses for forward, reverse, and random walks, normalized against the expected miss count, on PPro, P4, Atom, Core2, Core2 with prefetch disabled, Istanbul, Nehalem, and Nehalem-EX; y-axis 0.00 to 1.75, with "No Counter Available" marked where the event is unsupported.]

L2 Sources of Divergence
• Hardware prefetching
• PAPI measurement noise
• Operating system activity
• Non-LRU cache replacement
• Cache coherency traffic
