Ilya Zakharevich on Fri, 12 Jan 2024 23:27:51 +0100


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Big GOTCHAs WITH forprime()


   [I start with a longread supplying the context, then there is a
    short stopgap-suggestion at the end.]

About 20–25 years ago forprime()

  • was optimized through the teeth (at least for the case when the
    limits are within the size of one word);

  • the optimizations were highly tunable for idiosyncrasies of
    particular processors;

  • the logic of tuneups was documented (at least in the source); and

  • it seems that several people did at least vgrep⸣ped through the code.
    
Later the code was completely rewamped, and it was
replaced/enhanced “by a much better code”.  Today I did some timing on
this “much better code”.  Here are the results (with something like
  my(s=190*10^15); forprime(p=s,s+7*10^6,)
):

  s       len= 6e5      6e6      6e7      6e8
  —————————————————————————————————————————————
  15e16  |             0.999   25.366
  17e16  |             1.000   26.958   267.979
  190e15 |    0.125    0.984   28.361   282.534
  191e15 |    0.125    0.999    9.860    98.734
  
It is obvious that no tuneup was performed on this code.  I would not
be surprised if it can be sped up an order of magnitude…

  (Moreover, it segfaults… — reported, see
      https://pari.math.u-bordeaux.fr/cgi-bin/bugreport.cgi?bug=2520
   .)

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ language/forprime.c

I browsed through the code.  One critical parameter, the chunk size,
is chosen by:

  /* TODO: Optimize size (surely < 512k to stay in L2 cache, but not so large
   * as to force recalculating too often). */
  ulong chunk = 0x80000UL;
  ulong tmp = (b - a) / chunk + 1;

  if (tmp == 1)
    chunk = b - a + 16;
  else
    chunk = (b - a) / tmp + 15;

Hmm, no possibility of tuneup indeed?!

⁜⁜⁜⁜

Another critical parameter: when to switch between “sieving” and the
“nextprime” method:

  maxp2 = (maxp & HIGHMASK)? 0 : maxp*maxp;
  /* FIXME: should sieve as well if q != 1, adapt sieve code */
  if (q != 1 || (maxp2 && maxp2 <= a)
             || T->b - maxuu(a,maxp) < maxp / expu(b))

WHY disable sieving if maxp is wider than half-a-word?!

WHY disable sieving if b-a < primelimit/⌊log₂(b)⌋ ?!

⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜⁜ 1/10000th of a workaround

At some moment, the code which allowed to support arbitrarily large
primelimit (up to a word size) was removed from PARI.  (Compare with
      https://pari.math.u-bordeaux.fr/archives/pari-dev-2401/msg00008.html
.)  So now only the (unoptimized untunable buggy) code above can be
used in the range about [436e6, 436e6²].

This removal has advantages: when I introduced the code in question,
it could give a slowdown of 3–5% to forprime() — due to one extra
(very rarely taken) conditional branch inside a hot loop.

However:

  • With the progress in branch prediction in processors during the
    last 20 years, I would not be surprised that this slowdown is
    removed COMPLETELY nowadays.

  • As the timings above show, the replacement code can be pessimized
    UP TO 3 TIMES.  On this background, even these 3–5% (on the
    architectures of last millennium?!) pale in comparison.

So I think that a small part of the problems indicated at the start of
this message may be alleviated by reinstalling these (few lines of)
code allowing placing more forprime() invocations below the primelimit. 

Thanks,
Ilya

  P.S.  The next (obvious) useful steps seem to be
         • document the current “logic” of “optimizations” in the current
       	   code.
	 • Make ALL the “arbitrary choices” tunable by users.
	 • (Would be nice to) do massive experiments with timings to
  	   make the default tuneup reasonable.

	And MAYBE, when this is covered, more possibilities for
	optimizations would become visible…