## Introduction to system identification

As outlined in the Boulder Opal overview, system identification is often the first step in designing new quantum control strategies. It most generally refers to the process by which a user probes their system in order to learn about its characteristics. These characteristics can be highly varied including system responses in the time or frequency domain, or identifying the parameters or terms of a Hamiltonian model of the system. System identification is a complement to first-principles analyses that involves trying to discover the model that actually describes your quantum system, imperfections and all.

## How does system identification work?

System identification is at its heart an input-output process; you subject your system to stimuli and measure the output response. The stimuli you apply must be closely linked to the characteristics you wish to probe.

For instance, if you wish to understand the frequency-domain effect of drives applied to a qubit, you can generate shaped probe pulses modulated with different center frequencies and examine the response of the system. Alternatively, if you're trying to characterize a coupling term between excited levels in your system, you could use pulses that excite those transitions of different strength or duration.

After collecting the input-output relationship, the core task of system identification is to model the process that transforms the input stimulus to the output measurement. In general you can accomplish this by testing a variety of candidate models and choosing the one which gives the response closest to that of the actual system.

## System identification as an optimization problem

In general you can cast the problem of model selection as an optimization problem which is conveniently handled using the graph-based optimization engine in Boulder Opal.

In this approach, your aim is to find the model parameters $\{\lambda_j\}$ that minimize the distance between some measured data points $\{ P_i \}$ and those predicted by a candidate model $\{p_i(\{\lambda_j\})\}$ with the same inputs. This defines a simple cost function that links the measured and predicted data via their residual sum of squares (RSS): $$ C_\text{RSS} (\{\lambda_j\}) = \sum_i [P_i-p_i(\{\lambda_j\})]^2. $$ In this expression, each of the differences $P_i-p_i(\{\lambda_j\})$ is called a residual.

The problem can be translated to Boulder Opal by creating a graph with the relationship between the parameters to be identified (defined as optimization variables) and the measured response. Then, you calculate the cost function by comparing each measured point to the simulated one in the graph. Running the optimization will give you the parameters that minimize the cost function, and thus create the model that most faithfully describes the measured system dynamics.

A similar approach consists of minimizing the negative logarithm of the likelihood that a certain set of parameters $\{\lambda_j\}$ could generate the points $P_i$, for the case where the probability distribution of the errors are independent Gaussians with standard deviation $\Delta P$. In this case, the probability of a certain curve $p_i(\{\lambda_j\})$ generating the points $P_i$ is the product of all the individual Gaussian probabilities: $$ P = \prod_i \frac{1}{\sqrt{2\pi (\Delta P)^2}} \exp \left\{ - \frac{[P_i - p_i(\{\lambda_j\})]^2}{2 (\Delta P)^2} \right\} . $$ The negative log-likelihood of this probability distribution will then be proportional to the following cost function: $$ C_\text{NLL} (\{\lambda_j\}) = \sum_i \frac{[P_i-p_i(\{\lambda_j\})]^2}{2 (\Delta P)^2}. $$ Using either $C_\text{RSS}$ or $C_\text{NLL}$ as the cost is equivalent, their only difference is a constant.

As usual for graph operations, the construction of the cost function benefits from batching.
To make use of it, store the measured data $\{ P_i \}$ in a NumPy array where each axis represents one variation of the experiment.
For example, if you ran experiments with all combinations of 3 different observables and 10 different pulses, you can structure the data into a 2D NumPy array of shape `(3, 10)`

.
By creating the candidate model data $\{ p_i (\{ \lambda_j \}) \}$ as a tensor of the same shape, you can directly make operations between $P_i$ and $p_i$.
Converting the batch into a scalar cost can then be achieved by using the graph method `graph.sum`

.

For an example of how to set up the graph and cost function for system identification, see the user guide How to perform Hamiltonian parameter estimation using a small amount of measured data.

## Estimating the precision of the system identification

You can find out the precision of your parameter estimates by drawing their confidence region, which represents where the correct values are expected to be with a certain probability. The basic assumption here is that the errors in the estimate $\boldsymbol\lambda$ of the $p$ parameters that you want to determine are given by a multivariate normal distribution: $$ P(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^p (\mathrm{det} \boldsymbol\Sigma )}} e^{- \frac{1}{2} (\mathbf{x} - \boldsymbol\lambda)^T \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\lambda)},$$ where $\boldsymbol\Sigma$ is the covariance matrix and $\mathbf{x}$ are a vector of values for the parameters. This assumption is usually valid as long as the residuals are sufficiently small—and can be verified by plotting a histogram of the residuals and verifying that their distribution is a Gaussian.

Given this probability distribution, the confidence region will be a (hyper)ellipsoid in a space that has as many dimensions as the number of parameters that you are estimating. The set of points $\mathbf{x}$ that form the surface of the ellipsoid is obtained by: $$ (\mathbf{x} - \boldsymbol\lambda)^T \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\lambda) = p F_{1- \alpha} \left( \frac{n-p}{2}, \frac{p}{2} \right) \equiv z^2, $$ where $n$ is the number of data points, $\alpha$ is the fraction of confidence that the region represents, and $F_{1-\alpha} (a, b)$ is the point of the F-distribution where the probability in the tail is equal to $1-\alpha$. All these are being included above in the scaling factor $z$.

Note that the following variable transformation turns the ellipsoid equation into an equation for the surface of a (hyper)sphere with unit radius: $$ \mathbf{\tilde x} = \frac{1}{z} \boldsymbol\Sigma^{-1/2} (\mathbf{x} - \boldsymbol\lambda), \;\;\; \| \mathbf{\tilde x} \|^2 = 1. $$ Solving for $\mathbf{x}$, you can draw the confidence region by applying the following transformation to the coordinates $\mathbf{\tilde x}$ of a unit hypersphere: $$ \mathbf{x} = z \boldsymbol\Sigma^{1/2} \mathbf{\tilde x} + \boldsymbol\lambda. $$ As the estimates $\boldsymbol\lambda$ were provided by the optimization and the value of $z$ can be obtained from the inverse cumulative distribution function (CDF) of the F-distribution, the only element missing to draw the confidence region is the covariance matrix $\boldsymbol\Sigma$.

Using the Cramér–Rao bound, you can estimate the covariance matrix $\boldsymbol\Sigma$ as the inverse of the Fisher information matrix $I_\text{F}$.
You can calculate this by taking the Hessian (matrix of second derivatives) of the negative log likelihood cost $C_\text{NLL}$:
$$ \Sigma^{-1}_{i,j} = \left( I_\text{F} \right)_{i,j} = \frac{\partial^2 C_\text{NLL}}{\partial \lambda_i \partial \lambda_j} . $$
Alternatively, you can use the RSS to arrive at the covariance matrix via:
$$ \Sigma^{-1}_{i,j} = \frac{1}{2} \frac{n-p}{C_\text{RSS}} \frac{\partial^2 C_\text{RSS}}{\partial \lambda_i \partial \lambda_j}. $$
Note that the Hessian of the negative log-likelihood is proportional to the Hessian of the residual sum of squares:
$$ \frac{\partial^2 C_\text{NLL}}{\partial \lambda_i \partial \lambda_j} = \frac{1}{2} \frac{1}{( \Delta P)^2} \frac{\partial^2 C_\text{RSS}}{\partial \lambda_i \partial \lambda_j}. $$
In either approach, you can calculate the Hessian of the cost node using the graph method `graph.hessian`

.

It is easier to visualize the confidence region if you take sections of it in a plane, where only two variables are considered at a time. Each of these 2D sections will be an ellipse, which you can draw by generating the coordinates for a unit circle and then multiplying them by $z \boldsymbol\Sigma^{1/2}$. You can see an example of how to do this in the user guide How to perform parameter estimation using a small amount of measured data.

## Selecting a system identification strategy

Choosing the best system identification strategy for your system will be linked to the constraints of your problem, and will ultimately feed into the selection of an optimization method best suited to your needs. Note that these are the same mathematical methods used for control design, but here employed for cost-function minimization in a different setting.

### System identification for simple systems with moderate amounts of measurement data

For most system identification problems, involving a moderate amount of data (of the order of hundreds or fewer data points) and a simple physical system, using a deterministic gradient-based optimizer (such as the one provided by `qctrl.functions.calculate_optimization`

) is a fast and reliable way to obtain the minimum of a cost function.

See the How to perform Hamiltonian parameter estimation using a small amount of measured data or the How to characterize the bandwidth of a transmission line using a qubit as a probe user guides to learn how to perform system identification using Boulder Opal.

### System identification for multiparameter systems with large amounts of measurement data

Optimizations on large systems or involving a large dataset (of the order of thousands of data points and multiple qubits) can be quite demanding computationally; frequently such optimizations can be slow or require large amounts of classical computing memory. In some situations the amount of data that you want to analyze might exceed the memory space available for the computation. In such cases it is generally advantageous to break down the input data into smaller, randomly sampled batches and perform a stochastic optimization.
This is available in Boulder Opal through `qctrl.functions.calculate_stochastic_optimization`

.

In such cases it can be beneficial to split the dataset in small batches, using a different one at each optimization step. This option is particularly useful for problems that involve a large dataset or a complex simulation due to, for example, a fine time discretization. Moreover, using a momentum-based stochastic optimization algorithm such as Adam, as employed by the stochastic optimization engine in Boulder Opal, can provide an advantage in finding local minima in the cost landscape when using noisy datasets.

See the How to perform Hamiltonian parameter estimation using a large amount of measured data user guide to learn how to use stochastic optimization in Boulder Opal to perform system identification.