Безградиентные методы выпуклой оптимизации в условиях шума / Gradient-Free Methods for Convex Optimization under Noise Conditions
Candidate of Sciences thesis by Aleksandr Vladimirovich Lobanov
- VAK RF speciality code: 00.00.00
- Number of pages: 161
Contents
1 Introduction
1.1 Relevance and Importance
1.2 Contribution and Scientific Novelty
1.3 Presentations and Validation of Research Results
1.4 Publications
1.5 Thesis Structure
2 Non-smooth Optimization Problems with Adversarial Deterministic Noise
2.1 Introduction
2.2 Main Contributions and Structure
2.3 Smoothing Schemes
2.3.1 Notations and Assumptions
2.3.2 Smooth Approximation
2.3.3 Randomization with Two-Point Feedback
2.3.4 Randomization with One-Point Feedback
2.3.5 Smoothing Algorithm
2.4 Federated Learning
2.4.1 Optimal First-Order Algorithms
2.4.2 Optimal Zeroth-Order Algorithms
2.5 Proof Schemes
2.5.1 Proof of Lemma
2.5.2 Proof of Lemma
2.5.3 Proof of Lemma
2.5.4 Noise Level Estimates
2.6 Numerical Experiments
3 Non-smooth Optimization Problems with Adversarial Stochastic Noise
3.1 Introduction
3.1.1 Related Works
3.1.2 Chapter Organization
3.2 Setting and Assumptions
3.3 Main Result
3.3.1 Smoothing scheme via l1 randomization
3.3.2 Smoothing scheme via l2 randomization
3.3.3 Maximum level of adversarial stochastic noise
3.4 Discussion
4 Smooth Optimization Problems under "Overparametrization" Condition
4.1 Introduction
4.1.1 Related works
4.1.2 Notations
4.1.3 Chapter Organization
4.2 Technical Preliminaries
4.2.1 Assumptions on the Objective Function
4.2.2 Assumptions on the Gradient Oracle
4.3 Accelerated SGD with Biased Gradient
4.4 Main Result
4.5 Experiments
5 Highly Smooth Optimization Problems under the PL Condition
5.1 Introduction
5.1.1 Contribution
5.1.2 Chapter organization
5.2 Related Work
5.3 Setup
5.3.1 Notation
5.3.2 Assumptions on the objective function
5.3.3 Assumptions on the biased gradient oracle
5.4 Zero-Order Mini-batch SGD
5.5 Extended Analysis of Convergence via Stochastic Zero-Order Oracle
5.5.1 Stochastic zero-order oracle with additive adversarial stochastic noise
5.5.2 Stochastic zero-order oracle with mixed adversarial noise
5.6 Improved estimates of Table
5.7 Discussion
5.8 Experiments
6 Conclusion
References
A Appendix for Chapter
A.1 Upper Bounds
A.2 Proof of Theorem
A.3 Proof of Theorem
A.4 Single-Point Zeroth-Order Algorithms
B Appendix for Chapter
B.1 Auxiliary Facts and Results
B.1.1 Squared norm of the sum
B.1.2 L-smoothness of the function
B.1.3 Wirtinger-Poincaré inequality
B.2 Proof of Theorem
B.3 Proof of Theorem on the convergence of AZO-SGD
C Appendix for Chapter
C.1 Analyzing the effect of problem parameters on algorithms
C.2 Auxiliary Facts and Results
C.2.1 Squared norm of the sum
C.2.2 Fact from concentration of measure
C.2.3 Gaussian smoothing: upper bounds for the moments
C.2.4 Fact from Gaussian approximation
C.2.5 Taylor expansion
C.2.6 Kernel property
C.2.7 Bounds of the Weighted Sum of Legendre Polynomials
C.3 The convergence rate of SGD with biased gradient
C.4 Proof of Theorem
C.4.1 Kernel approximation
C.4.2 Bias square
C.4.3 Second moment
C.4.4 Convergence rate
C.5 Proof of convergence of Algorithm 2 with two-point feedback
C.5.1 Kernel approximation
C.5.2 Bias square
C.5.3 Second moment
C.5.4 Convergence rate
C.6 Proof of Theorem
C.6.1 Kernel approximation
C.6.2 Bias square
C.6.3 Second moment
C.6.4 Convergence rate
C.7 Proof of Theorem
C.7.1 Kernel approximation
C.7.2 Bias square
C.7.3 Second moment
C.7.4 Convergence rate
C.8 Gaussian smoothing approach for Theorem
C.8.1 Definition of Gaussian approximation
C.8.2 Bias square
C.8.3 Second moment
C.8.4 Convergence rate
C.9 Gaussian smoothing approach with two-point feedback
C.9.1 Definition of Gaussian approximation
C.9.2 Bias square
C.9.3 Second moment
C.9.4 Convergence rate
C.10 Gaussian smoothing approach for Theorem
C.10.1 Definition of Gaussian approximation
C.10.2 Bias square
C.10.3 Second moment
C.10.4 Convergence rate
C.11 Gaussian smoothing approach for Theorem
C.11.1 Definition of Gaussian approximation
C.11.2 Bias square
C.11.3 Second moment
C.11.4 Convergence rate
List of Figures
List of Tables
Introduction to the dissertation (part of the abstract) on the topic "Безградиентные методы выпуклой оптимизации в условиях шума / Gradient-Free Methods for Convex Optimization under Noise Conditions"
Introduction
This chapter provides an overview of the results in the thesis.
1.1 Relevance and Importance
This work addresses a standard optimization problem defined as follows:
$$ f^* = \min_{x \in Q \subseteq \mathbb{R}^d} \Big\{ f(x) := \mathbb{E}_{\xi} \big[ f(x, \xi) \big] \Big\}, \qquad (1.1) $$
where f : Q → ℝ is the objective function to be minimized and f* denotes the optimal value we aim to find. When there are no constraints on computing the gradient or higher-order derivatives of the objective function, optimal first- or higher-order optimization algorithms [122, 99, 64, 30, 176, 12, 100, 90, 161] are typically employed to solve problem (1.1). However, if computing the gradient ∇f(x) is infeasible for any reason, the only viable approach may be to utilize gradient-free (zero-order) optimization algorithms [38, 144]. Among the situations in which information about the derivatives of the objective function is unavailable are the following:
a) Non-smoothness of the objective function. This scenario is perhaps the most prevalent in theoretical studies [124, 44, 153, 59, 10, 32, 137, 97, 95, 114].
b) Desire to save computational resources. In some cases, computing the gradient ∇f(x) can be significantly more computationally intensive than evaluating the objective function f(x). This situation is particularly common and highly relevant in real-world applications [26].
c) Inaccessibility of the function gradient. A notable example of this situation is the problem of designing an ideal product tailored to a specific individual [121, 149, 162, 112, 35, 155, 63].
Like first-order optimization algorithms, gradient-free algorithms are characterized by the following optimality criteria: #N, the number of consecutive iterations required to achieve the desired solution accuracy ε, and #T, the total number of calls to the gradient-free oracle. In this context, a gradient-free (derivative-free) oracle refers to access only to values of the objective function f(x) corrupted by some bounded noise. Since the objective function is observed only through noise, the gradient-free oracle effectively acts as a black box; consequently, in the literature the original problem (1.1) with a gradient-free oracle is often referred to as a black-box optimization problem [94]. However, unlike higher-order algorithms, gradient-free algorithms have a third optimality criterion: the maximum noise level Δ at which the algorithm still converges "well". Here, by "good convergence", we mean convergence comparable to the case Δ = 0.
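To make the oracle model concrete, the following minimal sketch wraps an objective function into a zeroth-order oracle that returns only function values corrupted by bounded noise of level Δ and counts the number of oracle calls #T. It assumes a simple uniform noise model purely for illustration; the names `ZeroOrderOracle` and `delta` are hypothetical and not taken from the thesis.

```python
import numpy as np

class ZeroOrderOracle:
    """Black-box access to f: returns f(x) plus bounded (adversarial) noise."""

    def __init__(self, f, delta, rng=None):
        self.f = f                      # true objective, never exposed directly
        self.delta = delta              # maximum noise level Delta
        self.rng = rng or np.random.default_rng(0)
        self.calls = 0                  # counts #T, the number of oracle calls

    def __call__(self, x):
        self.calls += 1
        noise = self.rng.uniform(-self.delta, self.delta)  # |noise| <= Delta
        return self.f(x) + noise

# Example: a noisy quadratic black box in dimension d = 10.
d = 10
oracle = ZeroOrderOracle(lambda x: 0.5 * np.dot(x, x), delta=1e-3)
print(oracle(np.ones(d)), oracle.calls)
```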
Figure 1.1: Motivation for finding the maximum noise level Δ: (a) resource saving, (b) robustness to attacks, (c) confidentiality.
The emergence of this seemingly unconventional criterion can be explained by the following motivational examples (see Figure 1.1¹). Resource Saving (Figure 1.1a): the more accurately the objective function value is calculated, the higher the computational cost. Robustness to Attacks (Figure 1.1b): raising the maximum noise level increases the algorithm's robustness to adversarial attacks. Confidentiality (Figure 1.1c): certain companies, due to confidentiality concerns, cannot disclose all information. Thus, it is crucial to address the question:
How noisy can the objective function be?
Among zero-order methods in the literature, three approaches to constructing algorithms are distinguished, clustering them into the following classes. The first is the class of evolutionary algorithms [160, 79, 17, 78, 171, 128, 130, 131, 132, 133], which can often demonstrate their efficiency only empirically. Evolutionary algorithms are commonly used for non-convex multimodal problems, where the primary objective is to find a global rather than a local optimum. The second is the class built on the advantages of first-order algorithms [93, 59, 1, 111, 110], whose efficiency is supported by theoretical estimates. In the class of convex problems, it is more reasonable to use these theoretically grounded algorithms, as they often guarantee faster convergence by leveraging the strengths of first-order methods. The third is the class of Bayesian optimization methods (e.g., GP-UCB, TuRBO) [87, 92, 172, 51, 41, 52]. These methods are primarily useful for problems where function evaluations are expensive: Bayesian optimization builds a probabilistic model (typically a Gaussian process) of the objective function to make efficient decisions about where to sample next. However, despite their effectiveness in high-cost settings, they often fall short of first-order-based methods in convergence rate, particularly in high-dimensional convex optimization problems.
The fundamental idea of the second approach to creating algorithms with a gradient-free oracle that are efficient according to the three aforementioned criteria is to leverage first-order algorithms, substituting a gradient approximation for the true gradient [57]. The choice of the first-order optimization algorithm depends on the formulation of the original problem (i.e., the assumptions on the function and the gradient oracle), whereas the choice of gradient approximation is determined by the smoothness of the function. For instance, if the function is non-smooth, a smoothing scheme with either l1 randomization [10, 105] or l2 randomization [47, 107, 113] should be employed to solve the original problem. If the function is smooth, it is sufficient to use either l1 randomization [5] or l2 randomization [68, 109, 157]. However, if the objective function is not only smooth but also possesses a higher order of smoothness (β > 2), the so-called Kernel approximation [6, 58, 148, 31, 108, 106] should be utilized as the gradient approximation; it leverages the additional information about the higher order of smoothness by employing two-point feedback.

¹ The images are taken from the following resource: https://ru.freepik.com/
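As a concrete illustration of this recipe, the sketch below implements the two randomized two-point gradient approximations mentioned above and plugs one of them into plain gradient descent in place of ∇f(x). It is a minimal sketch under the following assumed forms (the exact constants and assumptions are given in Chapter 2): the l2 estimate g = d/(2h)·(f(x + he) − f(x − he))·e with e uniform on the unit Euclidean sphere, and the l1 estimate g = d/(2h)·(f(x + he) − f(x − he))·sign(e) with e uniform on the unit l1 sphere; all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_l2_sphere(d):
    """Uniform direction on the unit l2 sphere: normalize a standard Gaussian."""
    e = rng.standard_normal(d)
    return e / np.linalg.norm(e)

def sample_l1_sphere(d):
    """Uniform point on the unit l1 sphere: normalize i.i.d. Laplace samples by their l1 norm."""
    e = rng.laplace(size=d)
    return e / np.abs(e).sum()

def l2_two_point_grad(oracle, x, h):
    """Two-point gradient approximation with l2 randomization."""
    d = x.size
    e = sample_l2_sphere(d)
    return d / (2.0 * h) * (oracle(x + h * e) - oracle(x - h * e)) * e

def l1_two_point_grad(oracle, x, h):
    """Two-point gradient approximation with l1 randomization."""
    d = x.size
    e = sample_l1_sphere(d)
    return d / (2.0 * h) * (oracle(x + h * e) - oracle(x - h * e)) * np.sign(e)

def zo_gradient_descent(oracle, x0, step, h, n_iters, grad_estimator=l2_two_point_grad):
    """A first-order template (gradient descent) with the true gradient replaced
    by a randomized zeroth-order approximation."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - step * grad_estimator(oracle, x, h)
    return x

# Usage on a smooth convex black box: a quadratic observed with tiny bounded noise.
d = 10
oracle = lambda x: 0.5 * np.dot(x, x) + 1e-6 * rng.uniform(-1.0, 1.0)
x_hat = zo_gradient_descent(oracle, x0=np.ones(d), step=0.1, h=1e-2, n_iters=2000)
print(np.linalg.norm(x_hat))   # the residual norm should be small
```

For a non-smooth objective, these estimates are interpreted as stochastic gradients of a smoothed surrogate of f, and the parameter h controls the trade-off between the smoothing bias and the amplification of the oracle noise.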
1.2 Contribution and Scientific Novelty
The results presented in this work are entirely novel and make significant contributions to the theoretical foundations of optimization in machine learning. The key contributions can be summarized as follows:
• For the first time, a detailed parallel analysis of two smoothing schemes, based on l1 and l2 randomization, has been carried out. For l1 randomization, the Lipschitz constant of the smoothed gradient has been computed. In the case of l2 smoothing with a two-point feedback oracle, a previously unavailable constant c has been derived for estimating the second moment of the gradient approximation; its exact value was not provided in the original work [153]. Additionally, the variance of the one-point l1 randomization scheme has been evaluated. It has been established that, even with a one-point oracle, l1 randomization offers an advantage over l2 randomization of up to a logarithmic factor in the dimension d, similar to the two-point oracle case. Optimal upper bounds on the convergence rates of first-order batched methods, Minibatch SMP and Single-Machine SMP, have been obtained in the context of saddle-point optimization within a federated architecture. A new methodology for designing gradient-free algorithms has been proposed, optimized with respect to three key parameters: the number of communication rounds N, the allowable noise level Δ, and the oracle complexity T. A comparative analysis of l1 and l2 randomizations in the federated setting further confirmed the advantage of the former. It was shown that, to achieve ε-accuracy in the objective function, one-point schemes require O(d/ε²) more oracle calls than two-point schemes. Finally, the empirical performance of Minibatch Accelerated SGD under various smoothing strategies was investigated on an applied problem.
• The problem of stochastic convex optimization under limited information access — using only a zeroth-order oracle in the presence of adversarial stochastic noise — has been studied. Both smooth and non-smooth objective functions are considered. For each setting, upper bounds on the allowable noise level ensuring convergence have been derived. A new algorithm, optimal with respect to three criteria, has been proposed and theoretically justified. Additionally, the behavior of the proposed method when the noise level exceeds the allowable threshold has been analyzed.
• For the problem of stochastic smooth convex optimization in the overparameterized regime (where the number of variables exceeds the number of observations), convergence rates have been derived for an accelerated stochastic gradient method with a biased gradient oracle. A new gradient-free method, AZO-SGD (an accelerated zeroth-order algorithm for black-box optimization), has been developed, exhibiting optimal oracle complexity within the considered problem class. The algorithm's robustness to adversarial noise has been established, with a precise estimate of the maximum allowable noise level that still ensures the desired accuracy. These results are supported by experiments on the minimization of a finite sum of functions in which the number of summands is significantly smaller than the dimensionality of the space, reflecting the overparameterized setting.
• A Zero-Order Mini-batch SGD method has been developed, which significantly improves existing convergence estimates, particularly regarding the error floor, for gradient-free algorithms applied to smooth nonconvex problems under the Polyak-Lojasiewicz (PL) condition. An extended theoretical analysis has been carried out in a stochastic setting using one-point and two-point gradient approximations, with the zeroth-order oracle corrupted by bounded adversarial deterministic and stochastic noise, demonstrating that the proposed method outperforms existing counterparts. The theoretical results have been empirically validated on representative PL-type problems, such as solving systems of nonlinear equations, and the practical effectiveness of randomized gradient approximations has been demonstrated, highlighting the advantages of kernel-based schemes over l2 and Gaussian alternatives (a schematic sketch of such a kernel-based estimator is given below).
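As a schematic illustration of the kernel-based gradient approximation referred to in the last item, the sketch below implements one simple instance: g = d/(2h)·(f(x + hre) − f(x − hre))·e·K(r), with e uniform on the unit Euclidean sphere, r uniform on [−1, 1], and the kernel K(r) = 3r (the lowest-order member of the Legendre-polynomial family of kernels); the estimate is averaged over a mini-batch inside an SGD loop and run on a PL-type least-squares problem. This is a sketch under these assumptions, not the Zero-Order Mini-batch SGD algorithm analyzed in Chapter 5, and all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def kernel_two_point_grad(oracle, x, h, kernel=lambda r: 3.0 * r):
    """Kernel-based two-point gradient approximation (illustrative form):
    g = d/(2h) * (f(x + h*r*e) - f(x - h*r*e)) * e * K(r),
    with e uniform on the unit l2 sphere and r uniform on [-1, 1].
    K(r) = 3r is the simplest admissible kernel; higher smoothness orders use
    weighted sums of Legendre polynomials (cf. Appendix C)."""
    d = x.size
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)
    r = rng.uniform(-1.0, 1.0)
    return d / (2.0 * h) * (oracle(x + h * r * e) - oracle(x - h * r * e)) * e * kernel(r)

def zo_minibatch_sgd(oracle, x0, step, h, batch, n_iters):
    """Mini-batch SGD template: the kernel-based zeroth-order estimate is
    averaged over a batch to reduce variance (illustration only)."""
    x = x0.copy()
    for _ in range(n_iters):
        g = np.mean([kernel_two_point_grad(oracle, x, h) for _ in range(batch)], axis=0)
        x = x - step * g
    return x

# A PL-type example: f(x) = 0.5*||Ax - b||^2 satisfies the PL condition
# (with constant equal to the smallest nonzero eigenvalue of A A^T), even though
# it is not strongly convex here because A has more columns than rows.
d = 20
A = rng.standard_normal((5, d))
b = rng.standard_normal(5)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
x_hat = zo_minibatch_sgd(f, x0=np.zeros(d), step=0.002, h=1e-3, batch=8, n_iters=5000)
print(f(x_hat))   # the residual should approach the minimum value 0
```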
1.3 Presentations and Validation of Research Results
The results of this thesis were presented at the following conferences and seminars.
1. "Two-Point and One-Point Gradient-Free Methods for Federated Learning". Quasilinear Equations, Inverse Problems and Their Applications. Sochi, Russia (August 22, 2022).
2. "Is a smoothing scheme with l1 randomization better?". 65th Russian Scientific Conference of MIPT. Dolgoprudny, Russia (April 5, 2023).
3. "Stochastic Adversarial Noise in the «Black Box» Optimization Problem". WAIT: Workshop on Artificial Intelligence Trustworthiness. Yerevan, Armenia (April 22, 2023).
4. "Survey of modern randomized gradient-free algorithms for convex optimization problems". All-Russian seminar on optimization named after B.T. Polyak. Dolgoprudny, Russia (online, September 15, 2023).
5. "Stochastic Adversarial Noise in the "Black Box" Optimization Problem". XIV International Conference Optimization and Applications (OPTIMA-2023). Petrovac, Montenegro (September 20, 2023).
6. "Accelerated Zero-Order SGD Method for Solving the Black Box Optimization Problem under "Overparametrization" Condition". XIV International Conference Optimization and
Applications (OPTIMA-2023). Petrovac, Montenegro (September 20, 2023).
7. "The "Overbatching" Effect? Yes, or How to Improve Error Floor in Black-Box Optimization Problems". Device Algorithm Summit 2024 (Huawei). Moscow, Russia (June 25, 2024).
8. "State-of-the-art Gradient-Free Randomized Convex Optimization Algorithms". Main scientific seminar of the Innopolis University «Innopolis. Science». Innopolis, Russia (June 27, 2024).
9. "State-of-the-art Zero-Order Randomized Algorithms for Convex Optimization". Research and Practice Workshop: At the center of AI. Saint Petersburg, Russia (September 25, 2024).
10. "Gradient-free oracle as a black box: how to solve optimization problems when the oracle has no access to derivatives". Optimization & Decision Intelligence Group Internal Seminar. Zurich, Switzerland (online, September 27, 2024).
1.4 Publications
Chapters 2-5 are based on the following papers, respectively:
[10] Belal Alashqar, Alexander Gasnikov, Darina Dvinskikh, Aleksandr Lobanov. Gradient-free federated learning methods with l1 and l2 randomization for non-smooth convex stochastic optimization problems // Computational Mathematics and Mathematical Physics. 2023. Vol. 63, no. 9. P. 1600-1653.
[105] Aleksandr Lobanov. Stochastic adversarial noise in the "black box" optimization problem // International Conference on Optimization and Applications. Cham: Springer Nature Switzerland, 2023. P. 60-71.
[109] Aleksandr Lobanov, Alexander Gasnikov. Accelerated zero-order SGD method for solving the black box optimization problem under "overparametrization" condition // International Conference on Optimization and Applications. Cham: Springer Nature Switzerland, 2023. P. 72-83.
[58] Alexander Gasnikov, Aleksandr Lobanov, Fedor Stonyakin. Highly smooth zeroth-order methods for solving optimization problems under the PL condition // Computational Mathematics and Mathematical Physics. 2024. Vol. 64, no. 4. P. 739-770.
1.5 Thesis Structure
The thesis consists of an introduction, four main chapters, a list of 184 references, and three appendix chapters containing technical details, proofs, and auxiliary results.