eigen_chain_multiplication

The idea comes from Dynamic Programming of matrix chain multiplication.

Theory

Clearly, matrix chain multiplication could be optimized depend on the shape of each matrix and associative property of multiple operator. XYZ = (XY)Z = X(YZ) In matrix case, their type and shape are defined below.

X float 200x200 Y float 200x200 Z float 200x1

We will exam the cost of time of different associative orders. Before start, there are several setup steps

Setup

You need install cpu frequency control tool cpufreq

sudo apt install cpufreq

Setup cpu to performance

sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g performance; done'

To best show the differenece between different orders, we need big matrix. By default EIGEN_STACK_ALLOCATION_LIMIT is limit of all memory you could use to store matrix. Its default value is 131072 which could store 131072 / sizeof(float) = 32768 \(\approx\) 181 x 181. So if we wanna 200x200 float matrix, we need set EIGEN_STACK_ALLOCATION_LIMIT to 0 by

#define EIGEN_STACK_ALLOCATION_LIMIT 0

Notice this has to be happen before you include Eigen header files.

Code

Eventually, the test.cc looks

#include <iostream>
#include <chrono>
#define EIGEN_STACK_ALLOCATION_LIMIT 0

#include <Eigen/Dense>
using namespace std;
using namespace Eigen;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::microseconds;

int main() {
  double _total = 0.0;
  Eigen::Matrix<float, 200, 200> X, Y;
  Eigen::Matrix<float, 200, 1> Z, tmp;
  X = Eigen::Matrix<float, 200, 200>::Random();
  Y = Eigen::Matrix<float, 200, 200>::Random();
  Z = Eigen::Matrix<float, 200, 1>::Random();
  for (int i = 0; i < 50; i++) {
    auto t1 = high_resolution_clock::now();
    tmp = (X * Y) * Z;
    auto t2 = high_resolution_clock::now();
    duration<double, std::micro> mc_double = t2 - t1;
    _total += mc_double.count();
  }
  cout  << _total << endl;
  return 0;
}

with makefile

output:test.cc
    g++ -Wall -o output test.cc -I. -std=c++17

Compile

make

Run

./output

Output

1.32664e+06

If we update line 22

tmp = X * (Y * Z);

Output

11270.1

Clean

Setup cpu back to powersave

sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g powersave; done'

Conclusion

In threoy, the first associative order should do 200x200x200 + 200x200x1 mulitplications. The later one should do 200x200x1 + 200x200x1 multiplication.

(200x200x200 + 200x200x1) / (200x200x1 + 200x200x1) = 100.5 1.32664e+06/11270.1 = 132

So the reality are pretty close to theory. Moreover, performance of reality is even better.

[ ]: