eigen_chain_multiplication

The idea comes from Dynamic Programming of matrix chain multiplication.

Theory

Clearly, matrix chain multiplication could be optimized depend on the shape of each matrix and associative property of multiple operator. XYZ = (XY)Z = X(YZ) In matrix case, their type and shape are defined below.

X float 200x200 Y float 200x200 Z float 200x1

We will exam the cost of time of different associative orders. Before start, there are several setup steps

Setup

You need install cpu frequency control tool cpufreq

sudo apt install cpufreq

Setup cpu to performance

sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g performance; done'

To best show the differenece between different orders, we need big matrix. By default EIGEN_STACK_ALLOCATION_LIMIT is limit of all memory you could use to store matrix. Its default value is 131072 which could store 131072 / sizeof(float) = 32768 \(\approx\) 181 x 181. So if we wanna 200x200 float matrix, we need set EIGEN_STACK_ALLOCATION_LIMIT to 0 by

#define EIGEN_STACK_ALLOCATION_LIMIT 0

Notice this has to be happen before you include Eigen header files.

Code

Eventually, the test.cc looks

#include <iostream>
#include <chrono>
#define EIGEN_STACK_ALLOCATION_LIMIT 0

#include <Eigen/Dense>
using namespace std;
using namespace Eigen;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::microseconds;

int main() {
  double _total = 0.0;
  Eigen::Matrix<float, 200, 200> X, Y;
  Eigen::Matrix<float, 200, 1> Z, tmp;
  X = Eigen::Matrix<float, 200, 200>::Random();
  Y = Eigen::Matrix<float, 200, 200>::Random();
  Z = Eigen::Matrix<float, 200, 1>::Random();
  for (int i = 0; i < 50; i++) {
    auto t1 = high_resolution_clock::now();
    tmp = (X * Y) * Z;
    auto t2 = high_resolution_clock::now();
    duration<double, std::micro> mc_double = t2 - t1;
    _total += mc_double.count();
  }
  cout  << _total << endl;
  return 0;
}

with makefile

output:test.cc
    g++ -Wall -o output test.cc -I. -std=c++17

Compile

make

Run

./output

Output

1.32664e+06

If we update line 22

tmp = X * (Y * Z);

Output

11270.1

Clean

Setup cpu back to powersave

sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g powersave; done'

Conclusion

In threoy, the first associative order should do 200x200x200 + 200x200x1 mulitplications. The later one should do 200x200x1 + 200x200x1 multiplication.

(200x200x200 + 200x200x1) / (200x200x1 + 200x200x1) = 100.5 1.32664e+06/11270.1 = 132

So the reality are pretty close to theory. Moreover, performance of reality is even better.

[ ]: