eigen_chain_multiplication
The idea comes from Dynamic Programming of matrix chain multiplication.
Theory
Clearly, matrix chain multiplication could be optimized depend on the shape of each matrix and associative property of multiple operator. XYZ = (XY)Z = X(YZ) In matrix case, their type and shape are defined below.
X float 200x200 Y float 200x200 Z float 200x1
We will exam the cost of time of different associative orders. Before start, there are several setup steps
Setup
You need install cpu frequency control tool cpufreq
sudo apt install cpufreq
Setup cpu to performance
sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g performance; done'
To best show the differenece between different orders, we need big matrix. By default EIGEN_STACK_ALLOCATION_LIMIT is limit of all memory you could use to store matrix. Its default value is 131072 which could store 131072 / sizeof(float) = 32768 \(\approx\) 181 x 181. So if we wanna 200x200 float matrix, we need set EIGEN_STACK_ALLOCATION_LIMIT to 0 by
#define EIGEN_STACK_ALLOCATION_LIMIT 0
Notice this has to be happen before you include Eigen header files.
Code
Eventually, the test.cc looks
#include <iostream>
#include <chrono>
#define EIGEN_STACK_ALLOCATION_LIMIT 0
#include <Eigen/Dense>
using namespace std;
using namespace Eigen;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::microseconds;
int main() {
double _total = 0.0;
Eigen::Matrix<float, 200, 200> X, Y;
Eigen::Matrix<float, 200, 1> Z, tmp;
X = Eigen::Matrix<float, 200, 200>::Random();
Y = Eigen::Matrix<float, 200, 200>::Random();
Z = Eigen::Matrix<float, 200, 1>::Random();
for (int i = 0; i < 50; i++) {
auto t1 = high_resolution_clock::now();
tmp = (X * Y) * Z;
auto t2 = high_resolution_clock::now();
duration<double, std::micro> mc_double = t2 - t1;
_total += mc_double.count();
}
cout << _total << endl;
return 0;
}
with makefile
output:test.cc
g++ -Wall -o output test.cc -I. -std=c++17
Compile
make
Run
./output
Output
1.32664e+06
If we update line 22
tmp = X * (Y * Z);
Output
11270.1
Clean
Setup cpu back to powersave
sudo su
sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g powersave; done'
Conclusion
In threoy, the first associative order should do 200x200x200 + 200x200x1 mulitplications. The later one should do 200x200x1 + 200x200x1 multiplication.
(200x200x200 + 200x200x1) / (200x200x1 + 200x200x1) = 100.5 1.32664e+06/11270.1 = 132
So the reality are pretty close to theory. Moreover, performance of reality is even better.
[ ]: