I still haven't managed to make it run any faster. It might be as fast as it is going to get with cublas and cusparse, in which case it means I need to find better libraries (I am convinced writing it all myself is not going to be any faster, plus people have already dealt with these problems, so I shouldn't need to reinvent the wheel, but suggestions are welcome).
The slow down is considerable:
For a 10x10x10 grid, the GPU takes around 33ms in total (~7 in the analysis phase, ~24ms in total for the iterations, ~2ms doing other things).
For the same grid, the CPU does it in 7ms (no anlaysis phase, ~5ms in total for the iterations, ~2ms doing other things).
And the GPU just takes longer and longer time as the grid fills up, with as much as 60ms being spent (I didn't wait more to get higher numbers).
I was looking into a library, CUSP, but I couldn't get it to compile with my project, so if there are any CUSP users, please let me know how to do that =)
And that is it for now.