Sneak peek at a derivative-free SINDy method | SINDyOC

news
math
Authors

Štěpán Zapadlo

Dr. Sébastien Court

Published

October 2, 2024

$$ \newcommand{\LetThereBe}[2]{\newcommand{#1}{#2}} \newcommand{\letThereBe}[3]{\newcommand{#1}[#2]{#3}} % Declare mathematics (so they can be overwritten for PDF) \newcommand{\declareMathematics}[2]{\DeclareMathOperator{#1}{#2}} \newcommand{\declareMathematicsStar}[2]{\DeclareMathOperator*{#1}{#2}} % striked integral \newcommand{\avint}{\mathop{\mathchoice{\,\rlap{-}\!\!\int} {\rlap{\raise.15em{\scriptstyle -}}\kern-.2em\int} {\rlap{\raise.09em{\scriptscriptstyle -}}\!\int} {\rlap{-}\!\int}}\nolimits} % \d does not work well for PDFs \LetThereBe{\d}{\differential} $$ $$ % Simply for testing \LetThereBe{\foo}{\textrm{FIXME: this is a test!}} % Font styles \letThereBe{\mcal}{1}{\mathcal{#1}} \letThereBe{\chem}{1}{\mathrm{#1}} % Sets \LetThereBe{\C}{\mathbb{C}} \LetThereBe{\R}{\mathbb{R}} \LetThereBe{\Z}{\mathbb{Z}} \LetThereBe{\N}{\mathbb{N}} \LetThereBe{\im}{\mathrm{i}} \LetThereBe{\Im}{\mathrm{Im}} \LetThereBe{\Re}{\mathrm{Re}} % Sets from PDEs \LetThereBe{\boundary}{\partial} \letThereBe{\closure}{1}{\overline{#1}} \letThereBe{\contf}{2}{C^{#2}(#1)} \letThereBe{\compactContf}{2}{C_c^{#2}(#1)} \letThereBe{\ball}{2}{B\brackets{#1, #2}} \letThereBe{\closedBall}{2}{B\parentheses{#1, #2}} \LetThereBe{\compactEmbed}{\subset\subset} \letThereBe{\inside}{1}{#1^o} \LetThereBe{\neighborhood}{\mcal O} \letThereBe{\neigh}{1}{\neighborhood \brackets{#1}} % Basic notation - vectors and random variables \letThereBe{\vi}{1}{\boldsymbol{#1}} %vector or matrix \letThereBe{\dvi}{1}{\vi{\dot{#1}}} %differentiated vector or matrix \letThereBe{\vii}{1}{\mathbf{#1}} %if \vi doesn't work \letThereBe{\dvii}{1}{\vii{\dot{#1}}} %if \dvi doesn't work \letThereBe{\rnd}{1}{\mathup{#1}} %random variable \letThereBe{\vr}{1}{\mathbf{#1}} %random vector or matrix \letThereBe{\vrr}{1}{\boldsymbol{#1}} %random vector if \vr doesn't work \letThereBe{\dvr}{1}{\vr{\dot{#1}}} %differentiated vector or matrix \letThereBe{\vb}{1}{\pmb{#1}} %#TODO \letThereBe{\dvb}{1}{\vb{\dot{#1}}} %#TODO 
\letThereBe{\oper}{1}{\mathsf{#1}} % Basic notation - general \letThereBe{\set}{1}{\left\{#1\right\}} \letThereBe{\seqnc}{4}{\set{#1_{#2}}_{#2 = #3}^{#4}} \letThereBe{\Seqnc}{3}{\set{#1}_{#2}^{#3}} \letThereBe{\brackets}{1}{\left( #1 \right)} \letThereBe{\parentheses}{1}{\left[ #1 \right]} \letThereBe{\dom}{1}{\mcal{D}\, \brackets{#1}} \letThereBe{\complexConj}{1}{\overline{#1}} % Special symbols \LetThereBe{\const}{\mathrm{const}} \LetThereBe{\konst}{\mathrm{konst.}} \LetThereBe{\vf}{\varphi} \LetThereBe{\ve}{\varepsilon} \LetThereBe{\tht}{\theta} \LetThereBe{\Tht}{\Theta} \LetThereBe{\after}{\circ} \LetThereBe{\lmbd}{\lambda} % Shorthands \LetThereBe{\xx}{\vi x} \LetThereBe{\yy}{\vi y} \LetThereBe{\XX}{\vi X} \LetThereBe{\AA}{\vi A} \LetThereBe{\bb}{\vi b} \LetThereBe{\vvf}{\vi \vf} \LetThereBe{\ff}{\vi f} \LetThereBe{\gg}{\vi g} % Basic functions \letThereBe{\absval}{1}{\left| #1 \right|} \LetThereBe{\id}{\mathrm{id}} \letThereBe{\floor}{1}{\left\lfloor #1 \right\rfloor} \letThereBe{\ceil}{1}{\left\lceil #1 \right\rceil} \declareMathematics{\im}{im} %image \declareMathematics{\tg}{tg} \declareMathematics{\sign}{sign} \declareMathematics{\card}{card} %cardinality \letThereBe{\setSize}{1}{\left| #1 \right|} \declareMathematics{\exp}{exp} \letThereBe{\Exp}{1}{\exp\brackets{#1}} \letThereBe{\indicator}{1}{\mathbb{1}_{#1}} \declareMathematics{\arccot}{arccot} \declareMathematics{\complexArg}{arg} \declareMathematics{\gcd}{gcd} % Greatest Common Divisor \declareMathematics{\lcm}{lcm} % Least Common Multiple \letThereBe{\limInfty}{1}{\lim_{#1 \to \infty}} \letThereBe{\limInftyM}{1}{\lim_{#1 \to -\infty}} % Useful commands \letThereBe{\onTop}{2}{\mathrel{\overset{#2}{#1}}} \letThereBe{\onBottom}{2}{\mathrel{\underset{#2}{#1}}} \letThereBe{\tOnTop}{2}{\mathrel{\overset{\text{#2}}{#1}}} \LetThereBe{\EQ}{\onTop{=}{!}} \LetThereBe{\letDef}{:=} %#TODO: change the symbol \LetThereBe{\isPDef}{\onTop{\succ}{?}} \LetThereBe{\inductionStep}{\tOnTop{=}{induct. 
step}} % Optimization \declareMathematicsStar{\argmin}{argmin} \declareMathematicsStar{\argmax}{argmax} \letThereBe{\maxOf}{1}{\max\set{#1}} \letThereBe{\minOf}{1}{\min\set{#1}} \declareMathematics{\prox}{prox} \declareMathematics{\loss}{loss} \declareMathematics{\supp}{supp} \letThereBe{\Supp}{1}{\supp\brackets{#1}} \LetThereBe{\constraint}{\text{s.t.}\;} $$ $$ % Operators - Analysis \LetThereBe{\hess}{\nabla^2} \LetThereBe{\lagr}{\mcal L} \LetThereBe{\lapl}{\Delta} \declareMathematics{\grad}{grad} \declareMathematics{\Dgrad}{D} \LetThereBe{\gradient}{\nabla} \LetThereBe{\jacobi}{\nabla} \LetThereBe{\Jacobi}{\mathrm J} \letThereBe{\jacobian}{2}{D_{#1}\brackets{#2}} \LetThereBe{\d}{\mathrm{d}} \LetThereBe{\dd}{\,\mathrm{d}} \letThereBe{\partialDeriv}{2}{\frac {\partial #1} {\partial #2}} \letThereBe{\npartialDeriv}{3}{\partialDeriv{^{#1} #2} {#3^{#1}}} \letThereBe{\partialOp}{1}{\frac {\partial} {\partial #1}} \letThereBe{\npartialOp}{2}{\frac {\partial^{#1}} {\partial #2^{#1}}} \letThereBe{\pDeriv}{2}{\partialDeriv{#1}{#2}} \letThereBe{\npDeriv}{3}{\npartialDeriv{#1}{#2}{#3}} \letThereBe{\deriv}{2}{\frac {\d #1} {\d #2}} \letThereBe{\nderiv}{3}{\frac {\d^{#1} #2} {\d #3^{#1}}} \letThereBe{\derivOp}{1}{\frac {\d} {\d #1}\,} \letThereBe{\nderivOp}{2}{\frac {\d^{#1}} {\d #2^{#1}}\,} $$ $$ % Linear algebra \letThereBe{\norm}{1}{\left\lVert #1 \right\rVert} \letThereBe{\scal}{2}{\left\langle #1, #2 \right\rangle} \letThereBe{\avg}{1}{\overline{#1}} \letThereBe{\Avg}{1}{\bar{#1}} \letThereBe{\linspace}{1}{\mathrm{lin}\set{#1}} \letThereBe{\algMult}{1}{\mu_{\mathrm A} \brackets{#1}} \letThereBe{\geomMult}{1}{\mu_{\mathrm G} \brackets{#1}} \LetThereBe{\Nullity}{\mathrm{nullity}} \letThereBe{\nullity}{1}{\Nullity \brackets{#1}} \LetThereBe{\nulty}{\nu} % Linear algebra - Matrices \LetThereBe{\tr}{\top} \LetThereBe{\Tr}{^\tr} \LetThereBe{\pinv}{\dagger} \LetThereBe{\Pinv}{^\dagger} \LetThereBe{\Inv}{^{-1}} \LetThereBe{\ident}{\vi{I}} 
\letThereBe{\mtr}{1}{\begin{pmatrix}#1\end{pmatrix}} \letThereBe{\bmtr}{1}{\begin{bmatrix}#1\end{bmatrix}} \declareMathematics{\trace}{tr} \declareMathematics{\diagonal}{diag} $$ $$ % Statistics \LetThereBe{\iid}{\overset{\text{i.i.d.}}{\sim}} \LetThereBe{\ind}{\overset{\text{ind}}{\sim}} \LetThereBe{\condp}{\,\vert\,} \letThereBe{\complement}{1}{\overline{#1}} \LetThereBe{\acov}{\gamma} \LetThereBe{\acf}{\rho} \LetThereBe{\stdev}{\sigma} \LetThereBe{\procMean}{\mu} \LetThereBe{\procVar}{\stdev^2} \declareMathematics{\variance}{var} \letThereBe{\Variance}{1}{\variance \brackets{#1}} \declareMathematics{\cov}{cov} \declareMathematics{\corr}{cor} \letThereBe{\sampleVar}{1}{\rnd S^2_{#1}} \letThereBe{\populationVar}{1}{V_{#1}} \declareMathematics{\expectedValue}{\mathbb{E}} \declareMathematics{\rndMode}{Mode} \letThereBe{\RndMode}{1}{\rndMode\brackets{#1}} \letThereBe{\expect}{1}{\expectedValue #1} \letThereBe{\Expect}{1}{\expectedValue \brackets{#1}} \letThereBe{\expectIn}{2}{\expectedValue_{#1} #2} \letThereBe{\ExpectIn}{2}{\expectedValue_{#1} \brackets{#2}} \LetThereBe{\betaF}{\mathrm B} \LetThereBe{\fisherMat}{J} \LetThereBe{\mutInfo}{I} \LetThereBe{\expectedGain}{I_e} \letThereBe{\KLDiv}{2}{D\brackets{#1 \parallel #2}} \LetThereBe{\entropy}{H} \LetThereBe{\diffEntropy}{h} \LetThereBe{\probF}{\pi} \LetThereBe{\densF}{\vf} \LetThereBe{\att}{_t} %at time \letThereBe{\estim}{1}{\hat{#1}} \letThereBe{\estimML}{1}{\hat{#1}_{\mathrm{ML}}} \letThereBe{\estimOLS}{1}{\hat{#1}_{\mathrm{OLS}}} \letThereBe{\estimMAP}{1}{\hat{#1}_{\mathrm{MAP}}} \letThereBe{\predict}{3}{\estim {\rnd #1}_{#2 | #3}} \letThereBe{\periodPart}{3}{#1+#2-\ceil{#2/#3}#3} \letThereBe{\infEstim}{1}{\tilde{#1}} \letThereBe{\predictDist}{1}{{#1}^*} \LetThereBe{\backs}{\oper B} \LetThereBe{\diff}{\oper \Delta} \LetThereBe{\BLP}{\oper P} \LetThereBe{\arPoly}{\Phi} \letThereBe{\ArPoly}{1}{\arPoly\brackets{#1}} \LetThereBe{\maPoly}{\Theta} \letThereBe{\MaPoly}{1}{\maPoly\brackets{#1}} 
\letThereBe{\ARmod}{1}{\mathrm{AR}\brackets{#1}} \letThereBe{\MAmod}{1}{\mathrm{MA}\brackets{#1}} \letThereBe{\ARMA}{2}{\mathrm{ARMA}\brackets{#1, #2}} \letThereBe{\sARMA}{3}{\mathrm{ARMA}\brackets{#1}\brackets{#2}_{#3}} \letThereBe{\SARIMA}{3}{\mathrm{ARIMA}\brackets{#1}\brackets{#2}_{#3}} \letThereBe{\ARIMA}{3}{\mathrm{ARIMA}\brackets{#1, #2, #3}} \LetThereBe{\pacf}{\alpha} \letThereBe{\parcorr}{3}{\rho_{#1 #2 | #3}} \LetThereBe{\noise}{\mathscr{N}} \LetThereBe{\jeffreys}{\mathcal J} \LetThereBe{\likely}{\mcal L} \letThereBe{\Likely}{1}{\likely\brackets{#1}} \LetThereBe{\loglikely}{\mcal l} \letThereBe{\Loglikely}{1}{\loglikely \brackets{#1}} \LetThereBe{\CovMat}{\Gamma} \LetThereBe{\covMat}{\vi \CovMat} \LetThereBe{\rcovMat}{\vrr \CovMat} \LetThereBe{\AIC}{\mathrm{AIC}} \LetThereBe{\BIC}{\mathrm{BIC}} \LetThereBe{\AICc}{\mathrm{AIC}_c} \LetThereBe{\nullHypo}{H_0} \LetThereBe{\altHypo}{H_1} \LetThereBe{\rve}{\rnd \ve} \LetThereBe{\rtht}{\rnd \theta} \LetThereBe{\rX}{\rnd X} \LetThereBe{\rY}{\rnd Y} \LetThereBe{\rZ}{\rnd Z} \LetThereBe{\rA}{\rnd A} \LetThereBe{\rB}{\rnd B} \LetThereBe{\vrZ}{\vr Z} \LetThereBe{\vrY}{\vr Y} \LetThereBe{\vrX}{\vr X} % Bayesian inference \LetThereBe{\paramSet}{\mcal T} \LetThereBe{\sampleSet}{\mcal Y} \LetThereBe{\bayesSigmaAlg}{\mcal B} % Different types of convergence \LetThereBe{\inDist}{\onTop{\to}{d}} \letThereBe{\inDistWhen}{1}{\onBottom{\onTop{\longrightarrow}{d}}{#1}} \LetThereBe{\inProb}{\onTop{\to}{P}} \letThereBe{\inProbWhen}{1}{\onBottom{\onTop{\longrightarrow}{P}}{#1}} \LetThereBe{\inMeanSq}{\onTop{\to}{L^2}} \letThereBe{\inMeanSqWhen}{1}{\onBottom{\onTop{\longrightarrow}{L^2}}{#1}} \LetThereBe{\convergeAS}{\tOnTop{\to}{a.s.}} \letThereBe{\convergeASWhen}{1}{\onBottom{\tOnTop{\longrightarrow}{a.s.}}{#1}} $$ $$ % Distributions \letThereBe{\WN}{2}{\mathrm{WN}\brackets{#1,#2}} \declareMathematics{\uniform}{Unif} \declareMathematics{\binomDist}{Bi} \declareMathematics{\negbinomDist}{NBi} \declareMathematics{\betaDist}{Beta} 
\declareMathematics{\betabinomDist}{BetaBin} \declareMathematics{\gammaDist}{Gamma} \declareMathematics{\igammaDist}{IGamma} \declareMathematics{\invgammaDist}{IGamma} \declareMathematics{\expDist}{Ex} \declareMathematics{\poisDist}{Po} \declareMathematics{\erlangDist}{Er} \declareMathematics{\altDist}{A} \declareMathematics{\geomDist}{Ge} \LetThereBe{\normalDist}{\mathcal N} %\declareMathematics{\normalDist}{N} \letThereBe{\normalD}{1}{\normalDist \brackets{#1}} \letThereBe{\mvnormalD}{2}{\normalDist_{#1} \brackets{#2}} \letThereBe{\NormalD}{2}{\normalDist \brackets{#1, #2}} \LetThereBe{\lognormalDist}{\log\normalDist} $$ $$ % Game Theory \LetThereBe{\doms}{\succ} \LetThereBe{\isdom}{\prec} \letThereBe{\OfOthers}{1}{_{-#1}} \LetThereBe{\ofOthers}{\OfOthers{i}} \LetThereBe{\pdist}{\sigma} \letThereBe{\domGame}{1}{G_{DS}^{#1}} \letThereBe{\ratGame}{1}{G_{Rat}^{#1}} \letThereBe{\bestRep}{2}{\mathrm{BR}_{#1}\brackets{#2}} \letThereBe{\perf}{1}{{#1}_{\mathrm{perf}}} \LetThereBe{\perfG}{\perf{G}} \letThereBe{\imperf}{1}{{#1}_{\mathrm{imp}}} \LetThereBe{\imperfG}{\imperf{G}} \letThereBe{\proper}{1}{{#1}_{\mathrm{proper}}} \letThereBe{\finrep}{2}{{#2}_{#1{\text -}\mathrm{rep}}} %T-stage game \letThereBe{\infrep}{1}{#1_{\mathrm{irep}}} \LetThereBe{\repstr}{\tau} %strategy in a repeated game \LetThereBe{\emptyhist}{\epsilon} \letThereBe{\extrep}{1}{{#1^{\mathrm{rep}}}} \letThereBe{\avgpay}{1}{#1^{\mathrm{avg}}} \LetThereBe{\succf}{\pi} %successor function \LetThereBe{\playf}{\rho} %player function \LetThereBe{\actf}{\chi} %action function % ODEs \LetThereBe{\timeInt}{\mcal I} \LetThereBe{\stimeInt}{\mcal J} \LetThereBe{\Wronsk}{\mcal W} \letThereBe{\wronsk}{1}{\Wronsk \parentheses{#1}} \LetThereBe{\prufRadius}{\rho} \LetThereBe{\prufAngle}{\vf} \LetThereBe{\weyr}{\sigma} \LetThereBe{\linDifOp}{\mathsf{L}} \LetThereBe{\Hurwitz}{\vi H} \letThereBe{\hurwitz}{1}{\Hurwitz \brackets{#1}} % Cont. 
Models \LetThereBe{\dirac}{\delta} % PDEs % \avint -- defined in format-respective tex files \LetThereBe{\fundamental}{\Phi} \LetThereBe{\fund}{\fundamental} \letThereBe{\normaDeriv}{1}{\partialDeriv{#1}{\vec{n}}} \letThereBe{\volAvg}{2}{\avint_{\ball{#1}{#2}}} \LetThereBe{\VolAvg}{\volAvg{x}{\ve}} \letThereBe{\surfAvg}{2}{\avint_{\boundary \ball{#1}{#2}}} \LetThereBe{\SurfAvg}{\surfAvg{x}{\ve}} \LetThereBe{\corrF}{\varphi^{\times}} \LetThereBe{\greenF}{G} \letThereBe{\reflect}{1}{\tilde{#1}} \letThereBe{\unitBall}{1}{\alpha(#1)} \LetThereBe{\conv}{*} \letThereBe{\dotP}{2}{#1 \cdot #2} \letThereBe{\translation}{1}{\tau_{#1}} \declareMathematics{\dist}{dist} \letThereBe{\regularizef}{1}{\eta_{#1}} \letThereBe{\fourier}{1}{\widehat{#1}} \letThereBe{\ifourier}{1}{\check{#1}} \LetThereBe{\fourierOp}{\mcal F} \LetThereBe{\ifourierOp}{\mcal F^{-1}} \letThereBe{\FourierOp}{1}{\fourierOp\set{#1}} \letThereBe{\iFourierOp}{1}{\ifourierOp\set{#1}} \LetThereBe{\laplaceOp}{\mcal L} \letThereBe{\LaplaceOp}{1}{\laplaceOp\set{#1}} \letThereBe{\Norm}{1}{\absval{#1}} % SINDy \LetThereBe{\Koop}{\mcal K} \letThereBe{\oneToN}{1}{\left[#1\right]} \LetThereBe{\meas}{\mathrm{m}} \LetThereBe{\stateLoss}{\mcal J} \LetThereBe{\lagrm}{p} % Stochastic analysis \LetThereBe{\RiemannInt}{(\mcal R)} \LetThereBe{\RiemannStieltjesInt}{(\mcal {R_S})} \LetThereBe{\LebesgueInt}{(\mcal L)} \LetThereBe{\ItoInt}{(\mcal I)} \LetThereBe{\Stratonovich}{\circ} \LetThereBe{\infMean}{\alpha} \LetThereBe{\infVar}{\beta} % Dynamical systems \LetThereBe{\nUnit}{\mathrm N} \LetThereBe{\timeUnit}{\mathrm T} % BCF \LetThereBe{\lieD}{\oper{L}} \letThereBe{\lieDeriv}{1}{\lieD_{#1}} \letThereBe{\nlieDeriv}{2}{\lieD^{#1}_{#2}} \LetThereBe{\outerProd}{\otimes} \LetThereBe{\wedgeProd}{\wedge} \LetThereBe{\bialtProd}{\odot} % Neural Networks \LetThereBe{\neuralNet}{\mcal N} \letThereBe{\allTo}{1}{#1_{\leftarrow}} \letThereBe{\allFrom}{1}{#1^{\rightarrow}} \LetThereBe{\actF}{\sigma} \letThereBe{\extended}{1}{#1^{+}} 
\declareMathematics{\flatten}{vec} \LetThereBe{\hadamard}{\odot} $$
Heads up

This is only a draft of what’s to be an article… So please excuse any mistakes :)

SINDy

Sparse Identification of Nonlinear Dynamics (SINDy) by Brunton et al. is a method for discovering the unknown dynamics of an autonomous system of ODEs from data. What’s more, a sparsity (or, more precisely, parsimony) constraint is enforced on the recovered dynamics, so that we end up with a sparse system. This makes the result much easier to interpret afterward.

Mathematically speaking, we solve an \(n\)-dimensional regularized linear regression \[ \min_{\vi \Xi} \frac 1 2 \norm{\dvi X - \vi \Tht(\vi X) \vi \Xi}_2^2, \] where we penalize the derivatives predicted by our (parsimonious) combination \(\vi \Xi\) of library functions from \(\vi \Tht\) against the true derivatives. These true derivatives must either be measured or estimated from data. As one can guess, such estimates are a significant source of error, yet measuring the derivatives of state variables directly is often difficult.
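To illustrate why estimated derivatives are problematic, here is a small numerical experiment (not from the article itself; an illustrative setup using NumPy): even tiny measurement noise in the state gets strongly amplified by finite-difference differentiation.

```python
import numpy as np

# Sample x(t) = sin(t) on a uniform grid and add small measurement noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 501)
dt = t[1] - t[0]
x_true = np.sin(t)
x_noisy = x_true + 1e-3 * rng.standard_normal(t.size)

# Estimate the derivative by central finite differences.
dx_est = np.gradient(x_noisy, dt)

err_state = np.max(np.abs(x_noisy - x_true))    # error in the state itself
err_deriv = np.max(np.abs(dx_est - np.cos(t)))  # error in the estimated derivative
print(err_state, err_deriv)
```

The state error stays at the noise level, while the derivative error is amplified by roughly a factor of \(1/\Delta t\); this is exactly the error source the approach below tries to avoid.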

SINDyOC

To get around the problematic derivatives, we can look at the problem from a different perspective. Suppose now that instead of a finite number of measurements \(\vi X\), we have measured the entire trajectory \(\vi x_\meas(t) : \R \to \R^n\).1

Let us first introduce the new problem: \[ \begin{aligned} & \min_{\vi \Xi} \overbrace{\gamma \stateLoss(\vi x; \vi x_\meas) + \alpha R(\vi \Xi)}^{\loss(\vi \Xi; \vi x, \vi x_\meas)} \\ & \constraint \dvi x(t) = \vi \Theta(\vi x(t)) \vi \Xi, \quad t \in (0,T), \end{aligned} \tag{1}\] where \(\stateLoss\) characterizes the loss via the dissimilarity of the state \(\vi x \in \R^n\) (implicitly defined by \(\vi \Xi\)) from the measurements \(\vi x_\meas\), usually chosen as \[ \stateLoss(\vi x; \vi x_\meas) = \frac 1 2 \int_0^T \norm{\vi x(t) - \vi x_\meas(t)}_2^2 \dd t \implies \stateLoss_{\vi x}(\vi x; \vi x_\meas) = \vi x - \vi x_\meas, \] the regularizer of the parameters \(\vi \Xi\) is denoted \(R\), and \(\vi \Theta : \R^n \to \R^m\) stands for the “SINDy library”. One can notice that this is actually a Calculus of Variations (CoV) problem, and as it stands it is almost intractable numerically. Let us also state that it can be viewed as an optimal control problem, hence the name SINDyOC.

Note that the state \(\vi x\) is implicitly defined by the parameters \(\vi \Xi\) (and of course our library \(\vi \Theta\)) as the solution to the ODE \[ \dvi x(t) = \vi \Theta(\vi x(t)) \vi \Xi. \]

This also means that the time derivatives of the state variables are not needed. As before, the SINDy library \(\vi \Tht\) is chosen a priori and is not subject to optimization.

Augmenting search space

Now we can introduce Lagrange multipliers \(\vi \lagrm\), also called the costate. Moreover, untying \(\vi x\) from \(\vi \Xi\), and thus augmenting our search space by the space of all possible trajectories \(\vi x\), transforms our problem into \[ \min_{\vi x, \vi \Xi, \vi \lagrm} \lagr(\vi x, \vi \Xi, \vi \lagrm), \qquad \lagr(\vi x, \vi \Xi, \vi \lagrm) = \loss(\vi \Xi; \vi x, \vi x_{\meas}) + \int_0^T \scal {\vi \lagrm(t)} {\dvi x(t) - \vi \Theta(\vi x(t)) \vi \Xi} \dd t. \tag{2}\]

From my research, I presume this is actually called a Hamiltonian and should be a fairly standard approach in CoV. Notice that by separating \(\vi x\) and \(\vi \Xi\) (and introducing the costate), we have split the problem into the loss part, which we want to minimize, and a penalization of the discrepancy between the derivative \(\dvi x\) of the variable trajectory \(\vi x\) and the derivative dictated by \(\vi \Xi\).2

Optimality conditions

Using integration by parts, the augmented Lagrangian \(\lagr\) (or Hamiltonian) can be rewritten without \(\dvi x\) as follows (first only separating the integral into two parts) \[ \begin{aligned} \lagr(\vi x, \vi \Xi, \vi \lagrm) &= \loss(\vi \Xi; \vi x, \vi x_{\meas}) + \int_0^T \scal {\vi \lagrm(t)} {\dvi x(t)} \dd t - \int_0^T \scal {\vi \lagrm(t)} {\vi \Theta(\vi x(t)) \vi \Xi} \dd t \\ &= \loss(\vi \Xi; \vi x, \vi x_{\meas}) - \int_0^T \scal {\vi \lagrm(t)} {\vi \Theta(\vi x(t)) \vi \Xi} \dd t - \int_0^T \scal {\dvi \lagrm(t)} {\vi x(t)} \dd t + \parentheses{\scal {\vi \lagrm(t)} {\vi x(t)}}_{t = 0}^T \\ &= \loss(\vi \Xi; \vi x, \vi x_{\meas}) - \int_0^T \scal {\vi \lagrm(t)} {\vi \Theta(\vi x(t)) \vi \Xi} \dd t - \int_0^T \scal {\dvi \lagrm(t)} {\vi x(t)} \dd t + \scal {\vi \lagrm(T)} {\vi x(T)} - \scal {\vi \lagrm(0)} {\vi x(0)} \\ &= \loss(\vi \Xi; \vi x, \vi x_{\meas}) - \int_0^T \vi \lagrm(t)\Tr \vi \Theta (\vi x(t)) \vi \Xi \dd t - \int_0^T \dvi \lagrm(t)\Tr \vi x(t) \dd t + \vi \lagrm(T)\Tr \vi x(T) - \vi \lagrm(0)\Tr \vi x(0). \end{aligned} \tag{3}\]

Considering the derivative of \(\lagr\), see (2), w.r.t. the costate \(\vi p\) simply gives us back the state equation \[ \pDeriv \lagr {\vi p} = 0 \implies \dvi x - \vi \Tht(\vi x) \vi \Xi = 0. \tag{4}\]

Furthermore, the stationarity condition of (2) can be satisfied by vanishing derivatives w.r.t. \(\vi x\) and \(\vi \Xi\), i.e., \[\begin{align*} \pDeriv \lagr {\vi x} &= 0 \implies \forall \tilde{\vi x} : \int_0^T \scal{\dvi p(t)} {\tilde{\vi x}(t)} \dd t + \int_0^T \scal {\vi p(t)} {\vi \Tht'(\vi x(t)) \vi \Xi \tilde{\vi x}(t)} \dd t + \int_0^T \scal{\stateLoss_{\vi x} (\vi x; \vi x_\meas)} {\tilde{\vi x}(t)} \dd t - \scal {\vi p(T)} {\tilde{\vi x}(T)} = 0, \end{align*}\] where \(\tilde{\vi x}\) is an arbitrary test function for our solution \(\vi x\). By choosing \(\tilde{\vi x}(t)\) with compact support (in time), this even yields \[ \dvi p + \brackets{\vi \Tht'(\vi x) \vi \Xi}\Tr \vi p = - \stateLoss_{\vi x} (\vi x; \vi x_\meas), \tag{5}\] which is a linear ordinary differential equation that we can solve backwards in time (e.g. with an explicit Euler scheme) from the terminal condition \(\vi p(T) = 0\) (or \(\vi p(T) = \Phi_{\vi x}(\vi x(T))\) when a terminal cost \(\Phi\) is present).

Finally, the derivative of \(\lagr\) w.r.t. \(\vi \Xi\) (which is the gradient we need to use for the optimization) leads to \[ \pDeriv \lagr {\vi \Xi} = R'(\vi \Xi) + \int_0^T \vi \Tht(\vi x(t))\Tr \vi p(t) \dd t. \tag{6}\]

Algorithm

Putting this all together, it results in the following algorithm:

Suppose we have an estimate for \(\vi \Xi\), then

  1. we solve the state equation (4) for \(\vi x\),
  2. once we have \(\vi x\), we can solve the adjoint (costate) equation (5) for \(\vi p\)
  3. and finally, having \(\vi x\) and \(\vi p\), we can express the gradient with (6),
  4. last, but not least, we perform a step of our chosen optimization method (e.g. gradient descent or Barzilai-Borwein), updating the \(\vi \Xi\) estimate (and go back to step 1).
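The four steps can be sketched in code. The following is a minimal, hypothetical Python implementation (plain Euler schemes for (4) and (5), the scalar model \(\dot x = 10\) from (7) below, and an optional \(l_2\)-regularizer, disabled here); all function names, step sizes and iteration counts are illustrative, not those of the actual implementation.

```python
import numpy as np

def theta(x):                     # SINDy library for a scalar state: (1, x)
    return np.array([1.0, x[0]])

def theta_jac(x):                 # Jacobian of the library, shape (m, n)
    return np.array([[0.0], [1.0]])

def loss_gradient(Xi, x_meas, dt, alpha=0.0):
    """One pass of steps 1-3: forward state solve, backward adjoint solve,
    then assemble the gradient (rectangle rule for the integral)."""
    K, n = x_meas.shape
    # step 1: state equation (4), forward in time
    x = np.zeros_like(x_meas)
    x[0] = x_meas[0]
    for k in range(K - 1):
        x[k + 1] = x[k] + dt * theta(x[k]) @ Xi
    # step 2: adjoint equation (5), backward in time from p(T) = 0
    p = np.zeros_like(x_meas)
    for k in range(K - 1, 0, -1):
        fx = Xi.T @ theta_jac(x[k])              # Jacobian of the dynamics
        p[k - 1] = p[k] + dt * (fx.T @ p[k] + (x[k] - x_meas[k]))
    # step 3: gradient (6) = R'(Xi) + integral of theta(x)^T p
    G = alpha * Xi
    for k in range(K):
        G += dt * np.outer(theta(x[k]), p[k])
    return G

# toy problem: xdot = 10 on (0, 1), true coefficients (10, 0)
dt = 0.02
ts = np.arange(0.0, 1.0 + dt / 2, dt)
x_meas = (10.0 * ts).reshape(-1, 1)

Xi = np.array([[8.0], [0.0]])                    # initial estimate
for _ in range(3000):                            # step 4: plain gradient descent
    Xi -= 0.1 * loss_gradient(Xi, x_meas, dt)
print(Xi.ravel())                                # should approach (10, 0)
```

One design note: because the state is solved forward and the costate backward, each gradient evaluation costs only two ODE solves regardless of the number of coefficients in \(\vi \Xi\).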

Numerical results

Choice of optimizer

So far, we have derived the gradient of the loss function of the optimization problem given in (2) and simplified in (3). Using this knowledge, one can use any first-order optimization method to try to reach the minimizer.

More precisely, we have tried basic gradient descent (GD), the quasi-Newton-flavoured Barzilai-Borwein method (BB), and even its more robust variant, Adaptive Barzilai-Borwein (AdaBB).
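For reference, a minimal sketch of the BB step size on a small quadratic problem (illustrative only; the actual experiments use full implementations of BB and AdaBB):

```python
import numpy as np

def bb_descent(grad, x0, n_iter=50, alpha0=1e-3):
    """Gradient descent with the (long) Barzilai-Borwein step size
    alpha_k = <s, s> / <s, y>, where s = x_k - x_{k-1}, y = g_k - g_{k-1}."""
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - alpha0 * g_prev              # bootstrap with a small fixed step
    for _ in range(n_iter):
        g = grad(x)
        s, y = x - x_prev, g - g_prev
        denom = s @ y
        if denom == 0.0:                      # already converged
            break
        x_prev, g_prev = x, g
        x = x - (s @ s) / denom * g
    return x

# quadratic test problem: f(x) = 1/2 x^T A x - b^T x with minimizer A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_bb = bb_descent(lambda x: A @ x - b, np.array([10.0, -10.0]))
print(x_bb, np.linalg.solve(A, b))
```

Note that BB is not a descent method: the loss can increase between iterations, which matches the non-monotonic behaviour observed in the figures below.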

Choice of functions

As one could notice, several functions throughout the process were user-specified. For starters, even in the initial problem (1) we had the state loss function \(\stateLoss\), which we already introduced, and a regularizer \(R\). The choice of regularizer is non-trivial, but most often one of the following functions is used:3

  • \(l_2\)-regularizer \[ R(\vi \Xi) = \frac 1 2 \norm{\vi \Xi}_2^2 = \frac 1 2 \sum_{i = 1}^{a \cdot b} {\Xi_i^2} \implies R'(\vi \Xi) = \vi \Xi; \]
  • smoothed \(l_1\)-regularizer which, using the smoothed \(l_1\)-norm \(\norm{x}_{s_1} = \sqrt{x^2 + \varepsilon}\), computes \[ R(\vi \Xi) = \sum_{i = 1}^{a \cdot b} \norm{\Xi_i}_{s_1} \implies R'(\vi \Xi) = \mtr{ \frac {\Xi_1} {\norm{\Xi_1}_{s_1}} & \dots & \frac {\Xi_{a \cdot b}} {\norm{\Xi_{a \cdot b}}_{s_1}} }\Tr; \]
  • smoothed \(l_0\)-regularizer which for \(\norm{x}_{s_0} = 1 - e^{-\frac 1 2 \sigma x^2}\) yields \[ R(\vi \Xi) = \sum_{i = 1}^{a \cdot b} \norm{\Xi_i}_{s_0} \implies R'(\vi \Xi) = \mtr{ \sigma \cdot \Xi_1 \cdot e^{-\frac 1 2 \sigma \Xi_1^2} & \dots & \sigma \cdot \Xi_{a \cdot b} \cdot e^{-\frac 1 2 \sigma \Xi_{a \cdot b}^2} }\Tr. \]
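All three regularizers and their gradients are one-liners; the sketch below (with illustrative values for the smoothing parameters \(\varepsilon\) and \(\sigma\)) also checks each gradient formula against central finite differences:

```python
import numpy as np

eps, sigma = 1e-4, 10.0   # smoothing parameters (illustrative values)

def r_l2(xi):   return 0.5 * np.sum(xi ** 2)
def dr_l2(xi):  return xi

def r_l1s(xi):  return np.sum(np.sqrt(xi ** 2 + eps))                # smoothed l1
def dr_l1s(xi): return xi / np.sqrt(xi ** 2 + eps)

def r_l0s(xi):  return np.sum(1.0 - np.exp(-0.5 * sigma * xi ** 2))  # smoothed l0
def dr_l0s(xi): return sigma * xi * np.exp(-0.5 * sigma * xi ** 2)

# check each gradient against central finite differences
rng = np.random.default_rng(1)
xi = rng.standard_normal(6)            # a flattened 3x2 coefficient matrix
h = 1e-6
for r, dr in [(r_l2, dr_l2), (r_l1s, dr_l1s), (r_l0s, dr_l0s)]:
    fd = np.array([(r(xi + h * e) - r(xi - h * e)) / (2 * h) for e in np.eye(6)])
    assert np.allclose(fd, dr(xi), atol=1e-4)
print("all regularizer gradients match")
```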

Similarly, we could define a terminal cost \(\Phi\), but as we omitted it in the rest of the text, we shall do so here as well.

Simple 1D example

As a toy model, let us consider a trivial ODE \[ \dot x(t) = 10 \tag{7}\] on the time-span \((0, 1)\) and the library \(\tht_1(x) = 1, \tht_2(x) = x\). We limit the number of iterations of the optimizer to 300.

Figure 1: Stepping through the coefficient space for the simple model (7).

As one can see from Figure 1, the BB method’s convergence is non-monotonic, but incredibly fast (in this nice scenario). The AdaBB method is also significantly faster than GD, but still slower than BB (we have used the \(l_2\)-regularizer here).

Here, we have started very close to the actual solution \(\vi \Xi = \mtr{10 \\ 0}\). The next step is to check whether we can converge from much more distant initial estimates, which can be seen in Figure 2. Notice that the BB method diverged from \([15, -5]\), while AdaBB seems to be on a convergent path. To get the full picture, let us add that the BB method took 200 steps, while AdaBB took 3000 steps.

Figure 2: Enlarged coefficient space for the simple model (7).

2D toy model

Let us move from the trivial example model (7) to a more complicated, but still very simple, 2D toy system \[ \begin{aligned} \dot x &= x - y, \\ \dot y &= y, \end{aligned} \tag{8}\] with the initial condition \([x_0, y_0] = [1, 2]\) and time interval \((0,1)\). The library functions will now be \(\tht_1(x,y) = 1, \tht_2(x,y) = x, \tht_3(x,y) = y\), the correct coefficient matrix thus being \[ \vi \Xi_t = \mtr{ 0 & 0 \\ 1 & 0 \\ -1 & 1 }. \]

Unlike in the previous section, this time we use the smoothed \(l_0\)-regularizer and a one-hot estimate for \(\vi \Xi\) with value \(10^{-4}\) on the first index. This gives us the starting coefficient vector (reshape-able to a matrix) \((10^{-4}, 0, 0, 0, 0, 0)\).4 Using AdaBB with 1000 steps, we almost recover the true dynamics, see Figure 3.

            Ξ                 Ξₜ
  0.00900413   0.250218    0.0  0.0
  0.984739    -0.0382853   1.0  0.0
 -0.99971      0.920509   -1.0  1.0
Figure 3: Comparison of found and true trajectories for the model (8).

This result was almost to be expected, as we used a very small library (giving the method almost no room for error). Let us therefore enlarge the search space for a slightly modified toy model \[ \begin{aligned} \dot x &= x - 2 y, \\ \dot y &= -1.5 y^2. \end{aligned} \tag{9}\]

The initial condition and time-span stay the same, and as the library we use polynomials in \(x,y\) up to 2nd degree, i.e. \(1, x, y, x^2, y^2, xy\). The true coefficient matrix is \[ \vi \Xi_t = \mtr{ 0 & 0 \\ 1 & 0 \\ -2 & 0 \\ 0 & 0 \\ 0 & -1.5 \\ 0 & 0 } \] and the initial estimate for \(\vi \Xi\) stays analogous to the previous case. Using the AdaBB method again with 1000 steps, we get a nice fit, see Figure 4, but a rather dense matrix \(\vi \Xi\).

            Ξ                     Ξₜ
 -0.976115      0.00360563    0.0   0.0
  0.445382     -0.000419411   1.0   0.0
 -0.598879     -0.491281     -2.0   0.0
 -0.00211222   -0.00162035    0.0   0.0
 -0.300793     -0.960423      0.0  -1.5
 -0.000608252  -0.396844      0.0   0.0
Figure 4: Comparison of found and true trajectories for the model (9).

Simple oscillator

Last, but not least, I have tried to discover a model with simple oscillatory dynamics \[ \begin{aligned} \dot x &= - x^3 - y, \\ \dot y &= x, \end{aligned} \tag{10}\] this time initializing from \([1,0]\) and working with the time-span \((0,20)\). The library functions were \(\tht_1(x,y) = x, \tht_2(x,y) = x^3\) and \(\tht_3(x,y) = y\) (thus the smallest possible library), and the initial estimate of \(\vi \Xi\) was \(\vi \Xi_t + 0.1 \cdot \text{rand}(0, 1)^{3 \times 2}\), where \[ \vi \Xi_t = \mtr{ 0 & 1 \\ -1 & 0 \\ -1 & 0 } \] and \(\text{rand}(0,1)^{3 \times 2}\) denotes a random matrix in \(\R^{3 \times 2}\) with entries \(0 \leq \Xi_{i,j} \leq 1\).

Note

The choice of initial estimate as a distorted \(\vi \Xi_t\) was intentional, as I wasn’t able to converge anywhere near the solution with the one-hot estimate introduced earlier.

After 500 steps of the AdaBB method, the fit is somewhat improved, but the convergence is very slow, see Figure 5.

        Ξₜ_distorted               Ξ                Ξₜ
  0.0252214  1.07222     0.0261762  1.07166     0.0  1.0
 -0.949978   0.0569715  -0.949786   0.056621   -1.0  0.0
 -0.944634   0.0213754  -0.944057   0.0203974  -1.0  0.0
Figure 5: Comparison of found, distorted and true trajectories for the model (10).

While this choice of system, see (10), seems fitting due to its simplicity, on further inspection one might find problems. First and foremost, this is actually a damped oscillator, which can be shown e.g. with the following Lyapunov function (think total energy): \[ L(x,y) = x^2 + y^2, \] by differentiating it with respect to time \(t\). Using (10), this yields \[\begin{align*} \dot L(x,y) &= L_x(x,y) \dot x + L_y(x,y) \dot y \\ &= 2x \cdot (-x^3 - y) + 2y \cdot x \\ &=-2x^4 - 2xy + 2xy \\ &= -2x^4. \end{align*}\] In other words, the total energy (here given by the Lyapunov function \(L\)) is decreasing, which implies damped oscillations. Moreover, one can see that the decay is only polynomial, i.e. the system tends to the stable equilibrium \((0,0)\) really slowly.
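The energy computation above is easy to verify numerically; the following snippet checks the identity \(\dot L = -2x^4\) at randomly sampled points of the phase space:

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((1000, 2))       # random phase-space points
x, y = pts[:, 0], pts[:, 1]

xdot, ydot = -x**3 - y, x                  # the oscillator (10)
Ldot = 2 * x * xdot + 2 * y * ydot         # d/dt of L = x^2 + y^2 along (10)

print(np.max(np.abs(Ldot + 2 * x**4)))     # ~0 up to floating-point error
```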

In our context, this means that the ground truth trajectory, i.e. our entire information about the unknown system, steps through its phase space slowly, with a short trajectory covering only a small section of it. Hence, when we try to learn the phase space, i.e. the unknown system, from this data, we have very likely seen far too little of the phase space to identify it correctly.

This whole issue also hints at another possible extension of our method. To cover more of the phase space of the unknown system, we can do the entire process with a group of different trajectories. To be more formal, let us assume we have \(M \in \N\) trajectories \(\vi x_{\meas}^m\) for \(m \in \set{1, \dots, M}\). Then, instead of (1) we can use the following loss function \[ \loss(\vi \Xi; \vi x^1, \dots, \vi x^M, \vi x_\meas^1, \dots, \vi x_\meas^M) = \alpha R(\vi \Xi) + \gamma \sum_{m = 1}^M \stateLoss(\vi x^m, \vi x_\meas^m). \]

Such a modification yields the following Lagrangian \[ \lagr(\vi x^1, \dots, \vi x^M, \vi \Xi, \vi p^1, \dots, \vi p^M) = \loss(\vi \Xi; \vi x^1, \dots, \vi x^M, \vi x_\meas^1, \dots, \vi x_\meas^M) + \sum_{m = 1}^M \int_0^{T^m} \scal {\vi p^m(t)} {\dvi x^m(t) - \vi \Theta(\vi x^m(t)) \vi \Xi} \dd t, \] which is almost a linear combination of the Lagrangians we have used earlier5. This in turn leads to \[ \partialDeriv \lagr {\vi \Xi} = R'(\vi \Xi) + \sum_{m = 1}^M \int_0^{T^m} \vi \Theta(\vi x^m(t))\Tr \vi p^m(t) \dd t. \tag{11}\]

In other words, the algorithm laid out in Section 2.3 changes to:

Suppose we have an estimate for \(\vi \Xi\), then

  1. for each \(m \in \set{1, \dots, M}\) and its input trajectory \(\vi x_\meas^m\):
    1. we solve the state equation (4) for \(\vi x^m\),
    2. once we have \(\vi x^m\), we can solve the adjoint (costate) equation (5) for \(\vi p^m\)
  2. having \(\vi x^m\) and \(\vi p^m\) for all \(m \in \set{1, \dots, M}\), we can express the gradient with (11),
  3. last, but not least, we perform a step of our chosen optimization method (e.g. gradient descent or Barzilai-Borwein), updating the \(\vi \Xi\) estimate (and go back to step 1).

This approach is also beneficial from a computational perspective, because the entirety of step 1 can be run in parallel across all \(m \in \set{1, \dots, M}\), thus making our method more scalable to bigger volumes of data.
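This parallel accumulation can be sketched as follows: each trajectory’s contribution \(\int_0^{T^m} \vi \Theta(\vi x^m(t))\Tr \vi p^m(t) \dd t\) is computed independently and the results are summed, cf. (11). The worker below takes precomputed solutions of the state and adjoint equations (replaced here by randomly generated stand-in arrays); all names and dimensions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def contribution(args):
    """Contribution of one trajectory to the gradient (11):
    the rectangle-rule integral of Theta(x^m(t))^T p^m(t)."""
    theta_vals, p, dt = args      # (K, m) library values, (K, n) costate
    return dt * np.einsum("ki,kj->ij", theta_vals, p)

# Stand-in data: M = 4 trajectories, K = 100 time steps, m = 3 library
# functions, n = 2 states. In the real method, theta_vals and p would
# come from solving the state (4) and adjoint (5) equations.
rng = np.random.default_rng(3)
M, K, m, n, dt = 4, 100, 3, 2, 0.01
data = [(rng.standard_normal((K, m)), rng.standard_normal((K, n)), dt)
        for _ in range(M)]

with ThreadPoolExecutor() as pool:        # step 1 runs in parallel over m
    parts = list(pool.map(contribution, data))

alpha, Xi = 0.1, rng.standard_normal((m, n))
grad = alpha * Xi + sum(parts)            # R'(Xi) appears only once, cf. (11)
print(grad.shape)
```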

Note

Purely for computational purposes, it should be possible to split one long trajectory into multiple parts and perform the discovery process on these shorter segments. Moreover, the error introduced by this change should be fairly negligible.

On the other hand, the reverse procedure seems unlikely to work, as in step 1 of the original algorithm, see Section 2.3, we would have difficulties overcoming the jumps in the composed trajectory. The composed trajectory, constructed by concatenating the trajectory segments \(\vi x_\meas^m\), is not guaranteed to be continuous – something we quite obviously require, see e.g. (1), where \(\dvi x\) is present.

Van der Pol oscillator

Using a similar approach as in the previous section, I have also attempted to discover the Van der Pol model \[ \begin{aligned} \dot x &= \mu \brackets{x - \frac 1 3 x^3 - y}, \\ \dot y &= \frac 1 \mu x, \end{aligned} \tag{12}\] where I chose \(\mu = 1.5\) and, by now traditionally, the initial condition \([1,2]\). A time-span of \((0,10)\) was used. The true coefficient matrix is \[ \vi \Xi_t = \mtr{ \mu & \frac 1 \mu \\ - \frac \mu 3 & 0 \\ - \mu & 0 } \onTop{\implies}{\mu = \frac 3 2} \mtr{ \frac 3 2 & \frac 2 3 \\ - \frac 1 2 & 0 \\ - \frac 3 2 & 0 }. \]

I tweaked the optimization process by setting \(\gamma = 10\), thus increasing the importance of the state loss. Furthermore, I have split the optimization into 2 parts:

  1. we optimize using the BB method for 100 steps (although only 66 steps are taken before reaching numerical instability), using the Rosenbrock23 solver for the state equation with a high tolerance of \(10^{-2}\) (which should handle really stiff ODEs),
  2. starting from the last numerically stable coefficient estimate from the BB method, we continue with AdaBB for up to 1000 steps (though again only 396 steps are taken before reaching numerical instability).

Still, even after this complicated process, nothing close to an optimal fit is reached, see Figure 6.

         Ξ (BB)             Ξ (BB + AdaBB)          Ξₜ
 -0.451973   0.597353   -0.493289    0.64118    1.5  0.666667
  0.121509   0.0823231  -0.0762217   0.166764  -0.5  0.0
 -2.42941   -0.397994   -2.5218     -0.370136  -1.5  0.0
Figure 6: Comparison of BB-found, AdaBB + BB-found and true trajectories for the model (12).

Footnotes

  1. This assumption should improve clarity in the following section. Furthermore, because we use the optimize-then-discretize approach (rather than discretize-then-optimize), we will end up working with finitely many measurements in the end anyway.↩︎

  2. We could have also introduced the notion of terminal cost, to further restrain the variable trajectory from deviating at \(T\) from the measured one.↩︎

  3. Beware that even for a matrix of coefficients \(\vi \Xi \in \R^{a \times b}\), we always treat it as a vector \(\vi \xi \in \R^{c}\), where \(c = a \cdot b\), outside of matrix products.↩︎

  4. Using this initial estimate almost guarantees numerical stability in the beginning, while limiting the effect of favouring a single coefficient.↩︎

  5. Using a simple linear combination of the lagrangians shown earlier would lead to duplication of \(R(\vi \Xi)\), which seems undesirable.↩︎