Jove Matrix Performance

Initial problem

  • The matrix multiplication code has poor single-core performance and does not scale beyond 4 threads.

What we did

  • Change data layout and re-order loops to avoid cache misses
  • Use a specialized linear algebra library to generate optimized code
  • Add thread-safe and performant parallelism using OpenMP
  • See for more details
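
As a rough illustration of the loop re-ordering and OpenMP steps listed above, here is a minimal sketch in C. It is not the project's actual code: the function names `matmul_ijk` and `matmul_ikj`, the row-major layout, and the square `n x n` shape are all assumptions made for the example.

```c
#include <stddef.h>

/* Baseline (hypothetical): i-j-k loop order on row-major matrices.
   The inner loop reads B down a column, so consecutive reads are
   n elements apart in memory and tend to miss the cache. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Reordered (hypothetical): i-k-j loop order. The inner loop now
   walks a row of B and a row of C with unit stride. Each outer
   iteration writes only row i of C, so the rows can be split
   across threads without locks. If the compiler is not run with
   OpenMP enabled, the pragma is ignored and the loop runs serially. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {
        /* Clear row i of C before accumulating into it. */
        for (size_t j = 0; j < n; ++j)
            C[i * n + j] = 0.0;

        for (size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];   /* reused across the whole row */
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

With GCC or Clang, OpenMP is enabled with the `-fopenmp` flag. Both functions compute the same product; only the memory access pattern and the outer-loop parallelism differ.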

Results

  • Code runs more than 4x faster on a single core
  • Near-perfect scaling up to 12 cores on a 12-core/24-thread CPU
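
For readers unfamiliar with the terms, the scaling claim above can be read through the usual definitions of speedup and parallel efficiency. The symbols below are introduced here for illustration and do not come from the original measurements: T_1 is the run time on one core and T_p the run time on p cores.

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
```

Under these definitions, near-perfect scaling up to 12 cores corresponds to E(p) staying close to 1 for p up to 12. Beyond 12 threads the additional hardware threads share physical cores, so the same linear trend would not generally be expected.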

[Figures: single-core speedup; parallel scaling]