The former block size traded away good fit within cache lines in
order to gain faster division in deque_item(). However, compilers
are getting smarter and can now replace the slow division operation
with a fast integer multiply and right shift. Accordingly, it makes
sense to go back to a size that lets blocks neatly fill entire
cache lines.
GCC 4.8 and Clang 4.0 both compile "x // 62" to something
roughly equivalent to "x * 9520900167075897609 >> 69".
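As a sanity check on the constant quoted above, here is a minimal Python sketch (not taken from compiler output or from the CPython source) that compares the multiply-and-shift against ordinary floor division. It assumes non-negative values that fit in a signed 64-bit integer. The names MAGIC, SHIFT, div62_by_multiply, and check are made up for illustration.

```python
import random

MAGIC = 9520900167075897609   # equals ceil(2**69 / 62)
SHIFT = 69

def div62_by_multiply(x):
    """Floor-divide a non-negative x by 62 using a multiply and a right shift."""
    return (x * MAGIC) >> SHIFT

def check(samples=100_000, limit=2**63):
    """Return True if the shortcut matches x // 62 on edge cases and random samples."""
    rng = random.Random(0)  # fixed seed so the check is repeatable
    edge_cases = [0, 1, 61, 62, 63, 123, 124, limit - 1]
    values = edge_cases + [rng.randrange(limit) for _ in range(samples)]
    return all(div62_by_multiply(x) == x // 62 for x in values)
```

Python integers are arbitrary precision, so the product never overflows here. A C compiler would instead keep the high bits of a widening multiply, which is one reason the message says "roughly equivalent".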
* Add comment explaining the endpoint checks.
* Only do the checks in a debug build.
* Simplify newblock() to only require a length argument
  and leave the link updates to the calling code.
* Also add comment for the freelisting logic.
The division and modulo calculations in deque_item() can be compiled
to fast bitwise operations when BLOCKLEN is a power of two.
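The identity behind that claim can be sketched in a few lines of Python. This is an illustration only, not the C code in deque_item(): the value 64 for BLOCKLEN and the helper name locate are assumptions, and indices are taken to be non-negative.

```python
BLOCKLEN = 64                        # assumed value; any power of two works
SHIFT = BLOCKLEN.bit_length() - 1    # log2(BLOCKLEN), 6 for a length of 64
MASK = BLOCKLEN - 1                  # low bits select the slot within a block

def locate(i):
    """Split a non-negative flat index into (block number, offset in block)."""
    return i >> SHIFT, i & MASK
```

For every non-negative i, locate(i) agrees with divmod(i, BLOCKLEN), which is the pair of division and modulo results that the indexing code needs.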
Timing before:
~/cpython $ py -m timeit -r7 -s 'from collections import deque' -s 'd=deque(range(10))' 'd[5]'
10000000 loops, best of 7: 0.0627 usec per loop
Timing after:
~/cpython $ py -m timeit -r7 -s 'from collections import deque' -s 'd=deque(range(10))' 'd[5]'
10000000 loops, best of 7: 0.0581 usec per loop
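The same measurement can be approximated from Python with the standard timeit module. This is a sketch: the function name best_index_time and the default loop count are assumptions, and it has to be run once under each build to get a before/after comparison.

```python
import timeit

def best_index_time(repeat=7, number=1_000_000):
    """Return the best per-loop time, in microseconds, for d[5] on a 10-item deque."""
    runs = timeit.repeat(
        stmt="d[5]",
        setup="from collections import deque; d = deque(range(10))",
        repeat=repeat,   # mirrors the -r7 option used on the command line
        number=number,   # loops per run; the CLI picks this automatically
    )
    return min(runs) / number * 1e6
```

Taking the minimum over the repeats matches the "best of 7" figure that the command-line tool reports.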
* Clarified comment on the impact of BLOCKLEN on deque_index
  (with a power-of-two, the division and modulo
  computations are done with a right-shift and bitwise-and).
* Clarified comment on the overflow check to note that
  it is general and not just applicable to 64-bit builds.
* In deque._rotate(), the "deque->" indirections are
  factored out of the loop (loop-invariant code motion),
  leaving the code cleaner looking and slightly faster.
* In deque._rotate(), replaced the memcpy() with an
  equivalent loop. That saved the memcpy() setup time
  and allowed the pointers to move in their natural
  leftward and rightward directions.
See comparative timings at: http://pastebin.com/p0RJnT5N