Compiling with CUDA support
---------------------------

These tools are intended to allow CP2K to perform portions of the calculation on NVIDIA graphics cards using the CUDA (5) API.
In order to use these features you need a CUDA compatible graphics card (at least "Fermi" architecture). For running multiple
MPI's per GPU one needs a GPU of "Keppler" type.
In addition, you must download and install the CUDA toolkit, CUDA SDK, and a CUDA compatible NVIDIA driver.
These can be obtained from NVIDIA's website.

To compile CP2K with CUDA support, several modifications must be made to the arch file. 
An example is provided in arch/Linux-x86-64-cuda.sopt. The necessary modifications are:

BASIC:

1)  Add a line "NVCC = nvcc" pointing the makefile to the nvcc compiler.

2)  Add -D__PW_CUDA to the DFLAGS and NVFLAGS environmental variables

3)  Add libcudart.so, libcufft.so, and cublas to the LIBS variable.

4)  Consult the input reference manual for input file options that
    modify PW/FFT CUDA behavior (GLOBAL / CUDA).
    At the moment no option is read! 

5)  If using in conjunction with DBCSR (see next section).

DBCSR (experimental, with limitations):

1) Add -D__DBCSR_CUDA to DFLAGS.

2) Add -lcudart and -lrt to the LIBS variable (ensuring the
   libcudart.so is in the library path; use -L/path/to/libcudart.so to
   define it).

3) Consult the input reference manual for input file options that
   modify DBCSR CUDA behavior.

Then, compile as normal.

Alternatively one can compile with support for running DGEMM and DSYMM on the gpu, but nothing else.
This is accomplised as above, but instead of -D__PW_CUDA one should include -D__CUBLASDP.

Features
--------
At the moment, the only portion of the calculation which can be performed on the graphics card is the FFT,
the associated scatter/gather operations, and some linear algebra.
Single precision FFT, as formerly enabled by -D__FFTSGL or -D__SGL, are no longer supported! The
pre-processor flag -D__PW_CUDA implies the former used flag -D__FFTCU.
