yu's profile表说偶无趣:)PhotosBlogLists Tools Help
    July 21

    The story of "container_of()" and "likely() ,unlikely()"

    (1)container_of()

    What exactly does the weird Linux kernel container_of() macro do? This macro is defined as:

    #define container_of(ptr, type, member) ({ \
                    const typeof( ((type *)0)->member ) *__mptr = (ptr); 
                    (type *)( (char *)__mptr - offsetof(type,member) );})
    

    To help explain pointer mess, let us go through the steps of what it is doing. In the include/linux/i2c.h file, the to_i2c_driver() macro is defined as:

    container_of(d, struct i2c_driver, driver)
    

    and is used in code as:

    i2c_drv = to_i2c_driver(drv);
    

    where dev is a pointer to a struct device_driver.

    Replacing the above code with the first macro expansion produces:

    i2c_drv = container_of(drv, struct i2c_driver, driver);
    

    Then, the next level of expansion creates:

    i2c_drv = ({
            const typeof( ((struct i2c_driver *)0)->driver) *__mptr = drv;
            (struct i2c_driver *)( (char *)__mptr - offsetof(struct i2c_driver, driver));
    })
    

    To simplify this, remember that driver is a variable within the i2c_driver function. The first line of the macro sets up a pointer that points to the struct device_driver passed to the code. The second line of the macro finds the real location in memory of the struct i2c_driver that we want to access.

    So with the type of the driver variable known, the macro can be reduced to:

    i2c_drv = ({
            const struct device driver *__mptr = drv;
            (struct i2c_driver *)( (char *)__mptr - offsetof(struct i2c_driver, driver));
    })
    

    To show this using real numbers, say the i2c_driver structure looks like:

    struct i2c_driver {
            char name[32];
            struct device_driver driver;
    };
    

    The location of the driver variable usually is 32 bytes into the structure, depending on the packing rules of the compiler at the time. So, if the drv pointer is at location 0xc0101020, the __mptr variable is assigned that location in the first line of the macro. Then, the offset of the driver variable is subtracted from this address, giving us the value 0xc0101000. This would be assigned to the variable on the left side of the original assignment, i2c_drv.

    Putting these values into the macro, it then is reduced to:

    i2c_drv = ({
            const struct device_driver *__mptr = drv;
            (struct i2c_driver *)( (char *)__mptr - 0x20);
    })
    

    which is then evaluated at runtime by the processor as a simple subtraction on a pointer, which should be quite fast.

    In order to do this kind of pointer manipulation, the code has to know the type of pointer being passed to it. The driver core only passes in the type of driver structure registered with it, so this type of manipulation is safe. This also prevents other parts of the kernel from modifying the unique fields of the structure used to control the subsystem's code.

     

    (2)likely() ,unlikely()

    What are they ?

    In Linux kernel code, one often find calls to likely() and unlikely(), in conditions, like :

    bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx);
    if (unlikely(!bvl)) {
      mempool_free(bio, bio_pool);
      bio = NULL;
      goto out;
    }
    

    In fact, these functions are hints for the compiler that allows it to correctly optimize the branch, by knowing which is the likeliest one. The definitions of these macros, found in include/linux/compiler.h are the following :

    #define likely(x)       __builtin_expect(!!(x), 1)
    #define unlikely(x)     __builtin_expect(!!(x), 0)
    

    The GCC documentation explains the role of __builtin_expect() :

     -- Built-in Function: long __builtin_expect (long EXP, long C)
         You may use `__builtin_expect' to provide the compiler with branch
         prediction information.  In general, you should prefer to use
         actual profile feedback for this (`-fprofile-arcs'), as
         programmers are notoriously bad at predicting how their programs
         actually perform.  However, there are applications in which this
         data is hard to collect.
    
         The return value is the value of EXP, which should be an integral
         expression.  The value of C must be a compile-time constant.  The
         semantics of the built-in are that it is expected that EXP == C.
         For example:
    
              if (__builtin_expect (x, 0))
                foo ();
    
         would indicate that we do not expect to call `foo', since we
         expect `x' to be zero.  Since you are limited to integral
         expressions for EXP, you should use constructions such as
    
              if (__builtin_expect (ptr != NULL, 1))
                error ();
    
         when testing pointer or floating-point values.
    

    How does it optimize things ?

    It optimizes things by ordering the generated assembly code correctly, to optimize the usage of the processor pipeline. To do so, they arrange the code so that the likeliest branch is executed without performing any jmp instruction (which has the bad effect of flushing the processor pipeline).

    To see how it works, let's compile the following simple C user space program with gcc -O2 :

    #define likely(x)    __builtin_expect(!!(x), 1)
    #define unlikely(x)  __builtin_expect(!!(x), 0)
    
    int main(char *argv[], int argc)
    {
       int a;
    
       /* Get the value from somewhere GCC can't optimize */
       a = atoi (argv[1]);
    
       if (unlikely (a == 2))
          a++;
       else
          a--;
    
       printf ("%d\n", a);
    
       return 0;
    }
    

    Now, disassemble the resulting binary using objdump -S (comments added by me) :

    080483b0 <main>:
     // Prologue
     80483b0:       55                      push   %ebp
     80483b1:       89 e5                   mov    %esp,%ebp
     80483b3:       50                      push   %eax
     80483b4:       50                      push   %eax
     80483b5:       83 e4 f0                and    $0xfffffff0,%esp
     //             Call atoi()
     80483b8:       8b 45 08                mov    0x8(%ebp),%eax
     80483bb:       83 ec 1c                sub    $0x1c,%esp
     80483be:       8b 48 04                mov    0x4(%eax),%ecx
     80483c1:       51                      push   %ecx
     80483c2:       e8 1d ff ff ff          call   80482e4 <atoi@plt>
     80483c7:       83 c4 10                add    $0x10,%esp
     //             Test the value
     80483ca:       83 f8 02                cmp    $0x2,%eax
     //             --------------------------------------------------------
     //             If 'a' equal to 2 (which is unlikely), then jump,
     //             otherwise continue directly, without jump, so that it
     //             doesn't flush the pipeline.
     //             --------------------------------------------------------
     80483cd:       74 12                   je     80483e1 <main+0x31>
     80483cf:       48                      dec    %eax
     //             Call printf
     80483d0:       52                      push   %edx
     80483d1:       52                      push   %edx
     80483d2:       50                      push   %eax
     80483d3:       68 c8 84 04 08          push   $0x80484c8
     80483d8:       e8 f7 fe ff ff          call   80482d4 <printf@plt>
     //             Return 0 and go out.
     80483dd:       31 c0                   xor    %eax,%eax
     80483df:       c9                      leave
     80483e0:       c3                      ret
    

    Now, in the previous program, replace the unlikely() by a likely(), recompile it, and disassemble it again (again, comments added by me) :

    080483b0 <main>:
     //             Prologue
     80483b0:       55                      push   %ebp
     80483b1:       89 e5                   mov    %esp,%ebp
     80483b3:       50                      push   %eax
     80483b4:       50                      push   %eax
     80483b5:       83 e4 f0                and    $0xfffffff0,%esp
     //             Call atoi()
     80483b8:       8b 45 08                mov    0x8(%ebp),%eax
     80483bb:       83 ec 1c                sub    $0x1c,%esp
     80483be:       8b 48 04                mov    0x4(%eax),%ecx
     80483c1:       51                      push   %ecx
     80483c2:       e8 1d ff ff ff          call   80482e4 <atoi@plt>
     80483c7:       83 c4 10                add    $0x10,%esp
     //             --------------------------------------------------
     //             If 'a' equal 2 (which is likely), we will continue
     //             without branching, so without flusing the pipeline. The
     //             jump only occurs when a != 2, which is unlikely.
     //             ---------------------------------------------------
     80483ca:       83 f8 02                cmp    $0x2,%eax
     80483cd:       75 13                   jne    80483e2 <main+0x32>
     //             Here the a++ incrementation has been optimized by gcc
     80483cf:       b0 03                   mov    $0x3,%al
     //             Call printf()
     80483d1:       52                      push   %edx
     80483d2:       52                      push   %edx
     80483d3:       50                      push   %eax
     80483d4:       68 c8 84 04 08          push   $0x80484c8
     80483d9:       e8 f6 fe ff ff          call   80482d4 <printf@plt>
     //             Return 0 and go out.
     80483de:       31 c0                   xor    %eax,%eax
     80483e0:       c9                      leave
     80483e1:       c3                      ret
    

    How should I use it ?

    You should use it only in cases when the likeliest branch is very very very likely, or when the unlikeliest branch is very very very unlikely.

    July 15

    7-15

    放几篇好玩的文章地址吧:)
    http://lwn.net/Articles/187336/
    http://lwn.net/Articles/188056/
    http://lwn.net/Articles/188059/
    第一篇是说ext4的,后2篇是一个系列,承认第3篇我看的时候比较晕

    http://www.ntu.edu.sg/home/ascpfu/veno/veno.html
    http://www-ece.rice.edu/networks/TCP-LP/
    关于tcp的小东西,已经merge进入2.6.18

    http://developer.osdl.org/dev/robustmutexes/
    看标题就知道是什么的吧,MS这个现在还只是patch,不过对于高压力多线程服务器程序还是很有帮助

    http://sourceforge.net/mailarchive/forum.php?thread_id=23836054&forum_id=2697
    woo,一个完美的GPL的NTFS实现,由于采用了FUSE架构,这个patch无需进入内核,不过让人ft的是,它为优化过的效率竟然是这样的...
    > Version  1.03    ------Sequential Create------ --------Random Create--------
     >                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
     >            files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
     > reiserfs          16k 21459  99 +++++ +++ 17856  96 20172  98 +++++ +++ 16414  96
     > jfs                16k  7015  13 +++++ +++  5868  10  3068  14 +++++ +++  1075   3
     > ntfs-3g         16k  3021  99 14291  99  5226  99  3548  99 16149  99  5223  99
     > xfs               16k  2401  17 +++++ +++  2095  15  2301  20 +++++ +++   347   2
     > ext3            16k  1862  96 +++++ +++ +++++ +++  1914  96 +++++ +++  9695  98
     > minix            16k  1450  97 +++++ +++ 18148  94  1694  97 +++++ +++  4847  98
     > fat32           16k   366  97 +++++ +++  1809  97   428  97 +++++ +++  1361  97
     > paragon ntfs 16k    58  98  1259  99   245  99    55  99 +++++ +++   832  99
     > captive ntfs    Always crashes early on file creations and loses files.
     >              This was also confirmed and fix denied by its developer.