Machine learning revolves around empirical models, such as kernels or neural networks (NNs), that require large data sets and efficient algorithms for their identification and training. Constraint learning, the incorporation of known constraints into the learning algorithm, is one technique for improving the efficiency of such models, and the treatment of constraints remains an important and active research topic. Noteworthy research is available: approaches such as Bayesian updating provide rational frameworks for integrating data into predictive models and have been successfully adapted to situations where likelihoods are not readily available, either because they are expensive to evaluate or because of the nature of the available information. In practice, relevant information is often available in the form of sample statistics rather than raw data or sample-wise constraints. To circumvent this shortfall of the Bayesian framework, alternative approaches based on the Kullback-Leibler divergence have been reported and used extensively, particularly to impose constraints in the framework of learning with statistical models. In many instances, however, large data sets are simply not available, and neural networks cannot be trained as desired. Researchers have therefore turned to small data sets; nonetheless, small data sets share the conceptual and computational challenges of “big data,” with further complications arising from the scarcity of evidence and the need to extract the most knowledge, with quantifiable confidence, from scarce data.
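For readers unfamiliar with the Kullback-Leibler divergence mentioned above, it quantifies how far one probability distribution is from a reference distribution. A minimal sketch for discrete distributions (the distributions here are purely illustrative):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    probability distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # target distribution (illustrative values)
q = [0.4, 0.4, 0.2]   # reference distribution (illustrative values)

d = kl_divergence(p, q)
# D(p || q) >= 0, with equality only when p and q coincide
```

Minimizing such a divergence subject to constraints is the minimum cross-entropy idea that the approach discussed below builds on.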
Therefore, a need arises to learn from high-dimensional small data sets without invoking the Gaussian assumption. Recently, probabilistic learning on manifolds (PLoM) has been reported, along with complementary developments and validated applications. This technique can improve the knowledge one has from only a small number of expensive evaluations of a computational model, making it possible to solve problems, such as nonconvex optimization under uncertainties with nonlinear constraints, that would otherwise require a prohibitively large number of expensive evaluations. In this view, Professor Christian Soize from the University of Paris, France, in collaboration with Professor Roger Ghanem at the University of Southern California, Los Angeles, California, proposed an extension of PLoM in which not only an initial data set is given but, in addition, constraints are specified in the form of statistics synthesized from experimental data, from theoretical considerations, or from numerical simulations. Their work is currently published in the research journal International Journal for Numerical Methods in Engineering.
The two researchers considered a non-Gaussian random vector whose unknown probability distribution had to satisfy specified constraints. The technique involved constructing a generator using PLoM together with the classical Kullback-Leibler minimum cross-entropy principle. The resulting optimization problem was then reformulated using Lagrange multipliers associated with the constraints, and the optimal values of the multipliers were computed with an efficient iterative algorithm. At each iteration, the Markov chain Monte Carlo (MCMC) algorithm developed for PLoM was used, which consists of solving an Itô stochastic differential equation projected on a diffusion-maps basis.
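The following is not the authors' PLoM algorithm, but a toy sketch of the underlying Lagrange-multiplier idea: minimizing the Kullback-Leibler cross-entropy relative to an empirical sample, subject to a mean constraint, yields exponentially tilted weights whose multiplier can be found by a simple Newton iteration (all data and parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=2000)   # stand-in for a data set of model evaluations
target_mean = 0.5                            # constraint: reweighted sample must have E[X] = 0.5

# Minimizing KL cross-entropy subject to the mean constraint gives
# weights w_i proportional to exp(lam * x_i); the Lagrange multiplier
# lam is found by Newton iteration on the constraint residual.
lam = 0.0
for _ in range(50):
    w = np.exp(lam * samples)
    w /= w.sum()                             # normalize to a probability vector
    mean = np.sum(w * samples)               # current constrained moment
    var = np.sum(w * samples**2) - mean**2   # d(mean)/d(lam), always positive here
    lam += (target_mean - mean) / var        # Newton step on the residual

# After convergence, the reweighted sample satisfies the constraint.
```

In the paper, the analogue of each Newton step requires regenerating samples with the PLoM MCMC generator rather than reweighting a fixed sample, but the role of the multipliers is the same.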
The authors reported that the method and the algorithm were efficient and allowed for the construction of probabilistic models for high-dimensional problems from small initial data sets and with an arbitrary number of specified constraints. Of the two sample applications presented, the first was sufficiently simple and easy to reproduce, while the second concerned a stochastic elliptic boundary value problem in high dimension.
In summary, the study introduced a methodology that extends probabilistic learning on manifolds from a small data set to the case in which constraints are imposed, during the learning process, on a subset of quantities of interest (QoI). Interestingly, the researchers noted that the methodology can accommodate constraints more general than second-order statistical moments. Of much significance, the proposed approach allows for the analysis of non-Gaussian cases in high dimension related to functional inputs and outputs. In a statement to Advances in Engineering, Professor Christian Soize further pointed out that their iterative algorithm was very robust and appeared to converge exponentially with respect to the number of iterations.
Probabilistic machine learning for the small-data challenge in computational sciences
Illustration of the loss of concentration with a classical MCMC generator, and of the efficiency of the probabilistic learning on manifolds, which preserves the concentration and avoids scattering. Figure 1 (left) displays the N = 400 points of the initial dataset, for which the realizations of the random variable X = (X1, X2, X3) are concentrated around a helix. Figure 1 (center) shows M = 8,000 additional realizations of X generated with a classical MCMC, for which the concentration is lost. Figure 1 (right) shows M = 8,000 additional realizations of X generated with the probabilistic learning on manifolds, for which the concentration is preserved.
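A dataset of the kind shown in Figure 1 (left), points concentrated around a helix in three dimensions, can be mimicked with a short sketch; the noise level and helix parameterization below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400                                  # size of the initial dataset, as in Figure 1
t = rng.uniform(0.0, 4 * np.pi, N)       # parameter along the helix (illustrative range)
noise = 0.05 * rng.normal(size=(3, N))   # small scatter around the curve (illustrative)

# Realizations of X = (X1, X2, X3) concentrated around a helix in R^3
X = np.vstack([np.cos(t), np.sin(t), t / (4 * np.pi)]) + noise
```

A generator that preserves concentration should produce additional realizations that stay near this curve, which is what the diffusion-maps projection in PLoM is designed to achieve.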
Christian Soize, Roger Ghanem. Physics-constrained non-Gaussian probabilistic learning on manifolds. International Journal for Numerical Methods in Engineering 2020; 121: 110–145.