{ "cells": [ { "cell_type": "markdown", "id": "d13e1bdc", "metadata": {}, "source": [ "## eigen_chain_multiplication\n", "\n", "The idea comes from Dynamic Programming of matrix chain multiplication.\n", "\n", "#### Theory\n", "\n", "Clearly, matrix chain multiplication could be optimized depend on the shape of each matrix and associative property of multiple operator.\n", "XYZ = (XY)Z = X(YZ)\n", "In matrix case, their type and shape are defined below.\n", "\n", "X float 200x200\n", "Y float 200x200\n", "Z float 200x1\n", "\n", "We will exam the cost of time of different associative orders. Before start, there are several setup steps\n", "\n", "#### Setup\n", "\n", "You need install cpu frequency control tool **cpufreq**\n", "\n", " sudo apt install cpufreq\n", "\n", "Setup cpu to **performance**\n", "\n", " sudo su\n", " sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g performance; done'\n", "\n", "To best show the differenece between different orders, we need big matrix. By default *EIGEN_STACK_ALLOCATION_LIMIT* is limit of all memory you could use to store matrix. Its default value is 131072 which could store 131072 / sizeof(float) = 32768 $\\approx$ 181 x 181. So if we wanna 200x200 float matrix, we need set *EIGEN_STACK_ALLOCATION_LIMIT* to 0 by\n", "\n", " #define EIGEN_STACK_ALLOCATION_LIMIT 0\n", "\n", "Notice this has to be happen before you include Eigen header files.\n", "\n", "#### Code\n", "\n", "Eventually, the **test.cc** looks\n", "\n", " #include \n", " #include \n", " #define EIGEN_STACK_ALLOCATION_LIMIT 0\n", " \n", " #include \n", " using namespace std;\n", " using namespace Eigen;\n", " using std::chrono::high_resolution_clock;\n", " using std::chrono::duration_cast;\n", " using std::chrono::duration;\n", " using std::chrono::microseconds;\n", " \n", " int main() {\n", " double _total = 0.0;\n", " Eigen::Matrix X, Y;\n", " Eigen::Matrix Z, tmp;\n", " X = Eigen::Matrix::Random();\n", " Y = Eigen::Matrix::Random();\n", " Z = Eigen::Matrix::Random();\n", " for (int i = 0; i < 50; i++) {\n", " auto t1 = high_resolution_clock::now();\n", " tmp = (X * Y) * Z;\n", " auto t2 = high_resolution_clock::now();\n", " duration mc_double = t2 - t1;\n", " _total += mc_double.count();\n", " }\n", " cout << _total << endl;\n", " return 0;\n", " }\n", "\n", "with **makefile**\n", "\n", " output:test.cc\n", " \tg++ -Wall -o output test.cc -I. -std=c++17\n", "\n", "Compile\n", "\n", " make\n", "\n", "Run\n", "\n", " ./output\n", "\n", "Output\n", "\n", " 1.32664e+06\n", "\n", "If we update line 22\n", "\n", " tmp = X * (Y * Z);\n", "\n", "Output\n", "\n", " 11270.1\n", "\n", "#### Clean\n", "\n", "Setup cpu back to **powersave**\n", "\n", " sudo su\n", " sudo bash -c 'for ((i=0;i<$(nproc);i++)); do cpufreq-set -c $i -g powersave; done'\n", "\n", "#### Conclusion\n", "\n", "In threoy, the first associative order should do 200x200x200 + 200x200x1 mulitplications. The later one should do 200x200x1 + 200x200x1 multiplication.\n", "\n", "(200x200x200 + 200x200x1) / (200x200x1 + 200x200x1) = 100.5\n", "1.32664e+06/11270.1 = 132\n", "\n", "So the reality are pretty close to theory. Moreover, performance of reality is even better." ] }, { "cell_type": "code", "execution_count": null, "id": "02deaefa", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }