Copyright 1999, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.





This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting results.  Therefore, no RAW (read-after-write) register latency
scheduling is necessary (and it might actually be harmful).  RAW memory
scheduling is still necessary.

Two 64-bit loads can be completed per cycle.  One 64-bit store can be
completed per cycle.  A store cannot complete in the same cycle as a load.
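
Read literally, these rules give a simple lower bound on cycles/limb: each
load costs half a cycle and each store a full cycle, with no overlap between
the two.  The sketch below only illustrates that arithmetic.  The function
name is hypothetical, and the per-limb load/store counts in the comments are
assumptions chosen to be consistent with the figures quoted under STATUS, not
counts taken from the source files.

```c
/* Cache-bandwidth lower bound in cycles, under the rules stated above:
   two loads per cycle, one store per cycle, and no store completing in
   the same cycle as a load.  Hypothetical helper, for illustration only.  */
static double
cache_bound_cycles (unsigned loads, unsigned stores)
{
  return loads / 2.0 + stores;
}

/* Assumed memory operations per limb:
     shift:     1 load,  1 store   -> 1.5
     add/sub:   2 loads, 1 store   -> 2.0
     mul_1:     5 loads, 5 stores  -> 7.5  (1 limb + 4 partial products
                                            loaded; 4 partial products
                                            + 1 result stored)
     addmul_1:  6 loads, 5 stores  -> 8.0  (one more load for the limb
                                            being accumulated into)  */
```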

STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth: 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
  allows 7.5 cycles/limb for mul_1 and 8 cycles/limb for addmul_1/submul_1.
  It would surely be possible, using unrolling to allow better RAW memory
  scheduling, to reach the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product.  It
  uses somewhat fewer operations, and keeps the carry flag live across the
  loop boundary.  But it seems hard to make it run more than 1/4 cycle faster
  than the old code.  Perhaps we really ought to unroll this loop by 2x?  2x
  should suffice, since register latency scheduling is never needed and the
  unrolling only has to hide the RAW memory latency.  Here is a sketch:

	1. A multiply and store 64-bit products
	2. B sum 64-bit products into 128-bit product
	3. B load 64-bit products to integer registers
	4. B multiply and store 64-bit products
	5. A sum 64-bit products into 128-bit product
	6. A load 64-bit products to integer registers
	7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc.) will be interleaved
  for better instruction mix.
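
For readers unfamiliar with the "sum 64-bit products" steps, here is a
minimal sketch in portable C of combining four 64-bit partial products into
one 128-bit product.  It assumes the multiplier yields 64-bit products of
32-bit operands, so a 64x64 multiply needs four of them.  The names
u128_sketch and umul_64x64_sketch are hypothetical; this is not the code in
xaddmul_1.S, and it recovers carries with comparisons rather than keeping a
carry flag live.

```c
#include <stdint.h>

/* Hypothetical 128-bit result type, for illustration only.  */
typedef struct { uint64_t hi, lo; } u128_sketch;

static u128_sketch
umul_64x64_sketch (uint64_t u, uint64_t v)
{
  uint64_t ul = u & 0xffffffffu, uh = u >> 32;
  uint64_t vl = v & 0xffffffffu, vh = v >> 32;

  /* Four 32x32->64 partial products ("multiply and store").  */
  uint64_t p0 = ul * vl;        /* weight 2^0  */
  uint64_t p1 = ul * vh;        /* weight 2^32 */
  uint64_t p2 = uh * vl;        /* weight 2^32 */
  uint64_t p3 = uh * vh;        /* weight 2^64 */

  /* Sum the two middle products; a carry out here is worth 2^96.  */
  uint64_t mid = p1 + p2;
  uint64_t mid_carry = (mid < p1);

  /* Low limb: p0 plus the low half of mid, shifted up by 32 bits.  */
  uint64_t lo = p0 + (mid << 32);
  uint64_t lo_carry = (lo < p0);

  /* High limb: p3, the high half of mid, and both carries.  */
  uint64_t hi = p3 + (mid >> 32) + (mid_carry << 32) + lo_carry;

  u128_sketch r = { hi, lo };
  return r;
}
```

In the unrolled loop sketched above, work of this kind for one limb would be
overlapped with the multiplies for the next, so that the stores of the
partial products are well separated from the loads that read them back.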
