vendredi 29 mai 2015

Measurement of TLB effects on a Cortex-A9

after reading the following paper http://ift.tt/1clYByk ("What every programmer should know about memory") I wanted to try one of the author's test, that is, measuring the effects of TLB on the final execution time.

I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.

According to the documentation:

  • we have two micro TLBs for instruction and data cache in L1 (http://ift.tt/1LR3hrp)

  • The main TLB is located in L2 (http://ift.tt/1LR3hrp)

  • Data micro TLB has 32 entries (instruction micro TLB has either 32 or 64 entries)

  • L1' size == 32 Kbytes
  • L1 cache line == 32 bytes
  • L2' size == 1MB

I wrote a small program that allocates an array of structs with N entries. Each entry's size is == 32 bytes so it fits in a cache line. I perform several read access and I measure the execution time.

typedef struct {
     int elmt; // sizeof(int) == 4 bytes
     char padding[28]; // 4 + 28 = 32B == cache line size
}entry;


volatile entry ** entries = NULL;

//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);

for(i = 0; i < NB_ENTRIES; i++)
{
      entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}


InitWithRandomValues(&entries);
entries[LAST_ELEMENT]->elmt = -1

gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
     j = 0;
     while(j != -1)
     {
          j = entries[j]->elmt
     }
}
gettimeofday(&tEnd, NULL);

time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);

fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);

I have an outter loop that makes NB_ENTRIES vary from 4 to 1024.

As one can see in the figure below, while NB_ENTRIES == 256 entries, executin time is longer.

When NB_ENTRIES == 404 I get an "out of memory" (why? micro TLBs exceeded? main TLBs exceeded? Page Tables exceeded? virtual memory for the process exceeded?)

Can someone explain me please what is really going on from 4 to 256 entries, then from 257 to 404 entries?

Effects of TLB on execution time

Aucun commentaire:

Enregistrer un commentaire