[A12] 有偏期望的性质及其在随机梯度法平均平方梯度范数估计中的应用

\quad\quad 先定义一个函数 \varphi_s(x). 对 s,x \in \mathbb{R},当 s \ne 0 时,\varphi_s(x)=\dfrac{\mathrm{e}^{sx}-1}{s};当 s = 0 时,\varphi_0(x)=x. 对给定的 s\varphi_s(x) 都是单调上升的连续函数. 接下来借助 \varphi_s(x) 来定义有偏期望(biased expectation).
\quad\quad定义 1 (有偏期望)X 为实值随机变量,I_X 为使得 \varphi_s(X) 可积的全体实数 s 构成的集合. 对任意 s \in I_X,记 \mu_s(X) = \varphi_{s}^{-1}(\mathbb{E}[\varphi_s(X)]),称 \mu_s(X)X 的有偏期望.
\quad\quad注 1 \quad\varphi_s 代入 \mu_s(X) 的定义式可知对于 s \ne 0,有 \mu_s(X) = \dfrac{1}{s} \log \mathbb{E}[\mathrm{e}^{sX}],而 s = 0\mu_s(X) 退化为数学期望 \mathbb{E}[X]. 利用这种改写可以直接得到对任意 a \in \mathbb{R}\mu_s(aX)=a\mu_{as}(X). 另外,还可以进行进一步地改写 \mu_s(X) = \log \|\mathrm{e}^{X}\|_s,那么由 L^s 范数的性质可知\ \displaystyle \lim_{s \to +\infty}\mu_s(X) = \mathrm{ess \ sup}X.
\quad\quad性质 1 \quadX,Y 为独立随机变量, s \in I_X \cap I_Y,则 \mu_s(X+Y) = \mu_{s}(X)+\mu_{s}(Y). \quad\quad证明\quad s = 0 时是平凡的. s \ne 0 时,由 X,Y 独立可知 \mathrm{e}^{sX}\mathrm{e}^{sY} 独立,因此 \mu_s(X+Y) = \dfrac{1}{s}\log \mathbb{E}[\mathrm{e}^{s(X+Y)}]=\dfrac{1}{s} \log \left( \mathbb{E}[\mathrm{e}^{sX}]\mathbb{E}[\mathrm{e}^{sY}]\right) = \mu_{s}(X)+\mu_{s}(Y). \quad\quad再定义有偏条件期望,它在随后对随机梯度法平均平方梯度范数的估计中起到了重要的作用.
\quad\quad定义 2 (有偏条件期望)X 为随机变量,\mathcal{F}X 所在概率空间中的子 \sigma-代数,记 \mu_s(X|\mathcal{F}) = \varphi_{s}^{-1}(\mathbb{E}[\varphi_s(X)|\mathcal{F}]),称 \mu_s(X)X 关于 \mathcal{F} 的有偏条件期望.
\quad\quad注 2 \quad\varphi_s 代入 \mu_s(X|\mathcal{F}) 的定义式可知对于 s \ne 0,有 \mu_s(X|\mathcal{F}) = \dfrac{1}{s} \log \mathbb{E}[\mathrm{e}^{sX}|\mathcal{F}],而 s = 0\mu_s(X|\mathcal{F}) 退化为条件期望 \mathbb{E}[X|\mathcal{F}].
\quad\quad性质 2 \quad有偏条件期望满足全期望公式 \mu_s(\mu_s(X|\mathcal{F})) = \mu_s(X).
\quad\quad证明\quad由定义 \mu_s(\mu_s(X|\mathcal{F})) = \varphi_s^{-1}\left(\mathbb{E}\left[\varphi_{s}\circ\varphi_{s}^{-1}(\mathbb{E}[\varphi_s(X)|\mathcal{F}])\right]\right)=\varphi_s^{-1}\left(\mathbb{E}\left[\mathbb{E}[\varphi_s(X)|\mathcal{F}]\right]\right),再由条件期望的全期望公式便有\mu_s(\mu_s(X|\mathcal{F})) = \varphi_s^{-1}\left(\mathbb{E}\left[\varphi_s(X)\right]\right)=\mu_s(X).\quad\quad性质 3 \quad \displaystyle \lim_{s \to +\infty}\mu_s(X|\mathcal{F}) \leqslant \mathrm{ess \ sup}X 在 a.s. 意义下成立.
\quad\quad证明\quad对于 s > 0, a.s. -\mathbb{P}\mu_s(X|\mathcal{F}) = \dfrac{1}{s} \log \mathbb{E}\left[\mathrm{e}^{sX}|\mathcal{F}\right] \leqslant \dfrac{1}{s} \log \mathbb{E}\left[\mathrm{ess \ sup}(\mathrm{e}^{sX})|\mathcal{F}\right] = \mathrm{ess \ sup} X,左侧令 s \to +\infty 即得证.
\quad\quad性质 4 \quad设两个随机变量 X,Ys \in \mathbb{R},若 s \geqslant 0,则\mu_s(X+Y | \mathcal{F}) \leqslant \mu_{2s}(X | \mathcal{F})+\mu_{2s}(Y| \mathcal{F}),而当 s < 0 时有\mu_s(X+Y | \mathcal{F}) \leqslant \mu_{s}(X | \mathcal{F})+\dfrac{\mathbb{E}[Y\mathrm{e}^{sX} | \mathcal{F}]}{\mathbb{E}[\mathrm{e}^{sX} | \mathcal{F}]}, 其中 s 的选取可使得以上两式有意义. 这两式均是 a.s. -\mathbb{P} 成立.
\quad\quad证明\quad s = 0 时是平凡的. s > 0 时,由条件 Cauchy-Schwartz 不等式,a.s. -\mathbb{P} 成立\mu_s(X+Y) = \dfrac{1}{s}\log \mathbb{E}[\mathrm{e}^{s(X+Y)}| \mathcal{F}] \leqslant \dfrac{1}{2s} \log \left( \mathbb{E}[\mathrm{e}^{2sX}| \mathcal{F}]\mathbb{E}[\mathrm{e}^{2sY}| \mathcal{F}]\right) = \mu_{2s}(X| \mathcal{F})+\mu_{2s}(Y| \mathcal{F}), s < 0 时,首先有\begin{aligned} \mu_s(X+Y| \mathcal{F}) & = \dfrac{1}{s} \log \mathbb{E}[\mathrm{e}^{sX}| \mathcal{F}] + \dfrac{1}{s} \log \mathbb{E}\left[\dfrac{\mathrm{e}^{s(X+Y)}}{\mathbb{E}[\mathrm{e}^{sX} | \mathcal{F}]} \Big| \mathcal{F} \right] \\ & = \mu_{s}(X | \mathcal{F})+\dfrac{1}{s} \log \mathbb{E}\left[\mathrm{e}^{sY}\dfrac{\mathrm{e}^{sX}}{\mathbb{E}[\mathrm{e}^{sX} | \mathcal{F}]} \Big| \mathcal{F} \right], \end{aligned}对右侧第二项使用条件 Jensen 不等式,a.s. -\mathbb{P} 成立\begin{aligned} \log \mathbb{E}\left[\mathrm{e}^{sY}\dfrac{\mathrm{e}^{sX}}{\mathbb{E}[\mathrm{e}^{sX} | \mathcal{F} ]} \Big| \mathcal{F} \right] \geqslant \mathbb{E}\left[sY\dfrac{\mathrm{e}^{sX}}{\mathbb{E}[\mathrm{e}^{sX} | \mathcal{F}]} \Big| \mathcal{F} \right], \end{aligned}从而由条件期望的性质有\mu_s(X+Y| \mathcal{F}) \leqslant \mu_{s}(X| \mathcal{F})+\dfrac{\mathbb{E}[Y\mathrm{e}^{sX} | \mathcal{F}]}{\mathbb{E}[\mathrm{e}^{sX}| \mathcal{F}]}. \quad\quad定义 3 (条件亚指数分布) X 为随机变量,\nu, \alpha>0. 称 X 在条件 \mathcal{F} 下服从 (\nu^2, \alpha)-亚指数分布,当且仅当对任意满足 |\lambda|<\dfrac{1}{\alpha}\lambda 都有 \mathbb{E}\left[\mathrm{e}^{\lambda(X-\mathbb{E}[X | \mathcal{F}])} | \mathcal{F}\right] \leqslant \mathrm{e}^{\frac{\nu^2 \lambda^2}{2}} 在 a.s. -\mathbb{P} 意义下成立.
\quad\quad性质 5 \quad设随机变量在条件 \mathcal{F} 下服从 (\sigma^2, c)-亚指数分布,且 \mathbb{E}[X | \mathcal{F}]=0,则对任意 s \in \left(0,\dfrac{1}{c}\right]都有 \mu_s(X | \mathcal{F}) \leqslant \dfrac{s \sigma^2}{2} 在 a.s. -\mathbb{P} 意义下成立.
\quad\quad证明\quad由定义,对任意 s \in \left(0,\dfrac{1}{c}\right]都 a.s. -\mathbb{P} 成立 \mathbb{E}\left[\mathrm{e}^{sX} | \mathcal{F}\right] \leqslant \mathrm{e}^{\frac{s^2 \sigma^2}{2}},即 \mu_s(X | \mathcal{F}) = \dfrac{1}{s} \log \mathbb{E}[\mathrm{e}^{sX} | \mathcal{F}] \leqslant \dfrac{s \sigma^2}{2}.\quad\quad引理 1 (条件 Markov 不等式)X 为非负随机变量,\mathcal{F}X 所在概率空间中的子 \sigma-代数,a>0 为实数,那么 a.s. -\mathbb{P} 成立 \mathbb{P}(X \geqslant a| \mathcal{F}) \leqslant \dfrac {\mathbb{E}[X|\mathcal{F}]}{a}.
\quad\quad证明\quad利用 a \mathbf{1}_{\{X \geqslant a\}} \leqslant X,在任意 B \in \mathcal{F} 上对此式两边积分得到 a \int_{B} \mathbf{1}_{\{X \geqslant a\}} \mathrm{d} \mathbb{P} \leqslant \int_{B} X \mathrm{d} \mathbb{P},由条件期望的定义就有 a \int_{B} E\left[\mathbf{1}_{\{X \geqslant a\}}|\mathcal{F}\right] \mathrm{d} \mathbb{P} \leqslant \int_{B} E[X|\mathcal{F}] \mathrm{d} \mathbb{P},因此 a.s. -\mathbb{P}E\left[\mathbf{1}_{\{X \geqslant a\}}|\mathcal{F}\right] = \mathbb{P}(X \geqslant a| \mathcal{F}) \leqslant \dfrac {\mathbb{E}[X|\mathcal{F}]}{a}. \quad\quad注 2 \quadf(0, +\infty) 上非负、上升,则条件 Markov 不等式可写为 \mathbb{P}(X \geqslant a| \mathcal{F}) \leqslant \dfrac {\mathbb{E}[f(X)|\mathcal{F}]}{f(a)}.
\quad\quad性质 6(集中不等式) 设随机变量 X,对任意 x, s >0s \in I_X\mathbb{P}\left(X \geqslant \mu_{s}(X|\mathcal{F})+x | \mathcal{F}\right) \leqslant \mathrm{e}^{-sx}. 若又有 X 非负,(-\infty, 0) \subset I_X,则对任意 s>0,又有 \mathbb{P}\left(X \geqslant x| \mathcal{F}\right) \leqslant \dfrac{1+s x}{x} \mu_{-s}(X| \mathcal{F}),这两式均是 a.s. -\mathbb{P} 成立.
\quad\quad证明\quad\mathbb{P}(X \geqslant a| \mathcal{F}) \leqslant \dfrac {\mathbb{E}\left[\mathrm{e}^{sX}|\mathcal{F}\right]}{\mathrm{e}^{sa}} 中,取 a = \mu_{s}(X|\mathcal{F}) + x 便得到 \mathbb{P}\left(X \geqslant \mu_{s}(X|\mathcal{F})+x | \mathcal{F}\right) \leqslant \mathrm{e}^{-sx} \dfrac{\mathbb{E}\left[\mathrm{e}^{sX}|\mathcal{F}\right]}{\mathrm{e}^{s \mu_s(X|\mathcal{F})}}= \mathrm{e}^{-sx},\mathbb{P}(X \geqslant x| \mathcal{F}) \leqslant \dfrac {\mathbb{E}[f(X)|\mathcal{F}]}{f(x)} 中取 f = \varphi_{-s} 便得到 \mathbb{P}\left(X \geqslant x| \mathcal{F}\right) \leqslant \dfrac {\mathbb{E}[\varphi_{-s}(X)|\mathcal{F}]}{\varphi_{-s}(x)} = \dfrac {\varphi_{-s}\left(\mu_{-s}(X)|\mathcal{F}\right)}{\varphi_{-s}(x)} \leqslant \dfrac{\mu_{-s}(X|\mathcal{F})}{\varphi_{-s}(x)} \leqslant \dfrac{1+s x}{x} \mu_{-s}(X| \mathcal{F}). \quad\quad3\quad\mathcal{F} 取成 \{\varnothing, \varOmega\} 就得到这两个不等式的无条件版本.
\quad\quad接下来利用有偏期望来研究与 SGD 收敛性分析相关的理论分析. 设整数 d \geqslant 1f: \mathbb{R}^{d} \to \mathbb{R}\beta-光滑的实函数,即 \nabla f\beta-Lipschitz 的,这意味着对任意 x,y 都有 \left|f(y)-f(x)-\langle\nabla f(x), y-x\rangle \right| \leqslant \frac{\beta}{2}\|y-x\|^{2} ,在 SGD 中,迭代次数 T、梯度步长 \eta 和初始状态 x_0 是给定的,每一步的输出由 x_{t+1}=x_{t} - \eta G_{t} 确定,其中 G_{t} = \nabla f(x_t) + X_tX_t 为随机噪声. 用 \Delta = \displaystyle f(x_0) - \min_{x \in \mathbb{R}^d}f(x) 表示初始函数值与最小值间的差. 下面给出有关 SGD 中平均平方梯度范数估计的重要结果.
\quad\quad定理 1 \quad设随机过程 \left(x_t\right)_{t \geqslant 0} 适应于滤 \left(\mathcal{F}_t\right)_{t \geqslant 0},对任意 s \in \mathbb{R},存在 \sigma_s \in \mathbb{R}_+ \cap \{+ \infty\},使得对任意的 t \geqslant 0 都有 \mu_{s}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right) \leqslant \sigma_{s}^{2};对任意 s \geqslant 0,存在 m_s \in \mathbb{R}_+ \cap \{+ \infty\},使得对任意的 t \geqslant 0,都有 \mu_{s}\left(-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle | \mathcal{F}_{t}\right) \leqslant m_{s},而对任意 s < 0,存在 m_s \in \mathbb{R}_+ \cap \{+ \infty\},使得对任意的 t \geqslant 0,都有 \dfrac{\mathbb{E}\left[-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle \mathrm{e}^{s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right]}{\mathbb{E}\left[\mathrm{e}^{s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right]} \leqslant m_{s}. 固定 \eta 使得 0 < \eta \leqslant \dfrac{1}{\beta},则 \mu_{s}\left(\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \dfrac{2 \Delta}{\eta T}+2(1-\beta \eta) m_{u}+\beta \eta \sigma_{v}^{2},s \geqslant 0u = \dfrac{4s}{T}v = \dfrac{2 \beta \eta s}{T},而 s < 0u = v = \dfrac{\beta \eta s}{T}.
\quad\quad证明\quad首先利用 f\beta-光滑的,有\begin{aligned} f\left(x_{t+1}\right) & \leqslant f\left(x_{t}\right)+\left\langle\nabla f\left(x_{t}\right), x_{t+1}-x_{t}\right\rangle+\dfrac{\beta}{2}\left\|x_{t+1}-x_{t}\right\|^{2} \\ & \leqslant f\left(x_{t}\right)-\eta\left\langle\nabla f\left(x_{t}\right), G_{t}\right\rangle+\dfrac{\beta \eta^{2}}{2}\left\|G_{t}\right\|^{2} \\ & \leqslant f\left(x_{t}\right)-\eta\left\|\nabla f\left(x_{t}\right)\right\|^{2}-\eta\left\langle\nabla f\left(x_{t}\right), X_{t}\right\rangle+\dfrac{\beta \eta^{2}}{2}\left\|\nabla f\left(x_{t}\right)+X_{t}\right\|^{2} \\ & \leqslant f\left(x_{t}\right)-\eta\left(1-\dfrac{\beta \eta}{2}\right)\left\|\nabla f\left(x_{t}\right)\right\|^{2}-\eta(1-\beta \eta)\left\langle\nabla f\left(x_{t}\right), X_{t}\right\rangle+\dfrac{\beta \eta^{2}}{2}\left\|X_{t}\right\|^{2},\end{aligned}\eta\left(1-\dfrac{\beta \eta}{2}\right)\left\|\nabla f\left(x_{t}\right)\right\|^{2}\leqslant f(x_t)-f(x_{t+1})-\eta(1-\beta \eta)\left\langle\nabla f\left(x_{t}\right), X_{t}\right\rangle+\dfrac{\beta \eta^{2}}{2}\left\|X_{t}\right\|^{2},0T-1 对上式两边求和得到\begin{aligned} \eta\left(1-\frac{\beta \eta}{2}\right) \sum_{t<T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} & \leqslant f(x_0)-f(x_T)-\eta(1-\beta \eta) \sum_{t<T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle+\frac{\beta \eta^{2}}{2} \sum_{t<T}\left\|X_{t}\right\|^{2}\\ & \leqslant \Delta-\eta(1-\beta \eta) \sum_{t<T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle+\frac{\beta \eta^{2}}{2} \sum_{t<T}\left\|X_{t}\right\|^{2}, \end{aligned}0 < \eta \leqslant \dfrac{1}{\beta}\dfrac{\eta}{2} \leqslant \eta \left(1-\dfrac{\beta \eta}{2}\right),在上式两段除以 \dfrac{\eta T}{2} 得到 \frac{1}{T} \sum_{t< T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \frac{2 \Delta}{\eta T}-\frac{2(1-\beta \eta)}{T} \sum_{t<T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle+\frac{\beta \eta}{T} \sum_{t<T}\left\|X_{t}\right\|^{2},两边取有偏期望,利用性质 1 得到 \begin{aligned} \mu_{s}\left(\dfrac{1}{T} \sum_{t<T}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) & \leqslant \mu_{s}\left(\dfrac{2 \Delta}{\eta T}-\dfrac{2(1-\beta \eta)}{T} \sum_{t<T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle+\dfrac{\beta \eta}{T} \sum_{t<T}\left\|X_{t}\right\|^{2}\right) \\ & = \dfrac{2 \Delta}{\eta T}+\mu_{s}\left(\sum_{t<T} A_{t}\right), \end{aligned}其中 A_t=-\dfrac{2(1-\beta \eta)}{T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle+\dfrac{\beta \eta}{T}\left\|X_{t}\right\|^{2}. 利用性质 2,当 s \geqslant 0 时,\begin{aligned} \mu_{s}\left(A_{t} | \mathcal{F}_{t}\right) \leqslant & \mu_{2s}\left(-\dfrac{2(1-\beta \eta)}{T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle | \mathcal{F}_t\right)+\mu_{2s}\left(\dfrac{\beta \eta}{T}\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right)\\ \leqslant & \dfrac{2(1-\beta \eta)}{T} \mu_{\frac{4s(1-\beta \eta)}{T}}\left(-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle | \mathcal{F}_t\right) + \dfrac{\beta \eta}{T} \mu_{\frac{2\beta \eta s}{T}}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right)\\ \leqslant & \dfrac{2(1-\beta \eta)}{T} m_{u}+\dfrac{\beta \eta}{T} \sigma_{v}^{2}, \end{aligned}而当 s < 0 时,\begin{aligned} \mu_{s}\left(A_{t} | \mathcal{F}_{t}\right) \leqslant & \mu_{s}\left(\dfrac{\beta \eta}{T}\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right) +\dfrac{\mathbb{E}\left[-\dfrac{2(1-\beta \eta)}{T}\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle \mathrm{e}^{\frac{\beta \eta s}{T}\left\|X_{t}\right\|^{2}} \big| \mathcal{F}_{t}\right]}{\mathbb{E}\left[\mathrm{e}^{\frac{\beta \eta s}{T}\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right]}\\ \leqslant & \dfrac{2(1-\beta \eta)}{T} \mu_{\frac{\beta \eta s}{T}}\left(-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle | \mathcal{F}_t\right) + \dfrac{\beta \eta}{T} \mu_{\frac{\beta \eta s}{T}}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right)\\ \leqslant & \dfrac{2(1-\beta \eta)}{T} m_{u}+\dfrac{\beta \eta}{T} \sigma_{v}^{2}, \end{aligned}其中 uv 按照定理叙述定义. 注意,由 \left(x_t\right)_{t \geqslant 0} 适应于 \left(\mathcal{F}_t\right)_{t \geqslant 0} 可知对任意 t \geqslant 0 x_t\mathcal{F}_t 可测的,那么由 x_1=x_0-\eta(X_0+\nabla f(x_0)) X_0\mathcal{F}_1 可测的,由 x_2=x_1-\eta(X_1+\nabla f(x_1))\mathcal{F}_1 \subset \mathcal{F}_2 X_1\mathcal{F}_2 可测的,以此类推, X_{t-1}\mathcal{F}_t 可测的,从而 A_{t-1}\mathcal{F}_t 可测的,那么由性质 1、性质 2 和条件期望的性质,最终有 \begin{aligned} \mu_{s}\left(\sum_{t<T}A_{t} \right) & = \mu_{s} \left(\mu_{s}\left(\sum_{t<T}A_{t} \Big| \mathcal{F}_{T-1}\right) \right)\\ &= \mu_{s} \left(\dfrac{1}{s} \log \mathbb{E}\left[\exp\left(s\sum_{t<T-1}A_{t}\right)\mathrm{e}^{sA_{T-1}} \Big| \mathcal{F}_{T-1}\right] \right)\\ &= \mu_{s} \left(\sum_{t<T-1}A_{t} + \mu_{s}\left(A_{T-1} | \mathcal{F}_{T-1}\right) \right) \\ & \leqslant \mu_{s} \left(\sum_{t<T-1}A_{t} + \dfrac{2(1-\beta \eta)}{T} m_{u}+\dfrac{\beta \eta}{T} \sigma_{v}^{2} \right)\\ & = \mu_{s} \left(\sum_{t<T-1}A_{t} \right) + \dfrac{2(1-\beta \eta)}{T} m_{u}+\dfrac{\beta \eta}{T} \sigma_{v}^{2} \\ & \leqslant 2(1-\beta \eta) m_{u}+\beta \eta \sigma_{v}^{2},\end{aligned} 从而 \mu_{s}\left(\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \dfrac{2 \Delta}{\eta T}+2(1-\beta \eta) m_{u}+\beta \eta \sigma_{v}^{2}. \quad\quad引理 2 \quada,b,c,p > 0f(x)=ax^p+\dfrac{b}{x},记 x^{*} = \min \left\{ \left(\dfrac{b}{pa}\right)^{\frac{1}{1+p}},c\right\},则 f\left(x^{*}\right) \leqslant \left(1+p^{-1}\right) b c^{-1}+(1+p) p^{\frac{-p}{1+p}} a^{\frac{1}{1+p}} b^{\frac{p}{1+p}}. \quad\quad证明\quad事实上,f\left(x^{*}\right) \leqslant f(c)+f\left(\left(\dfrac{b}{pa}\right)^{\frac{1}{1+p}}\right)=\left(1+p^{-1}\right) b c^{-1}+(1+p) p^{\frac{-p}{1+p}} a^{\frac{1}{1+p}} b^{\frac{p}{1+p}}. \quad\quad下面给出定理 1 在不同类型噪声下的推论.
\quad\quad推论 1 (无偏且方差有界的噪声) 假设 E\left[X_t | \mathcal{F}_t\right] = 0\mathrm{var}\left[X_t | \mathcal{F}_t\right] \leqslant \sigma^2 对所有 t<T 成立,那么取定 \eta=\min \left\{\sqrt{\dfrac{2 \Delta}{T \beta \sigma^{2}}}, \dfrac{1}{\beta}\right\},就有 \mathbb{E}\left[\frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right] \leqslant \frac{4 \beta \Delta}{T}+\sqrt{\frac{8 \beta \Delta \sigma^{2}}{T}}.\quad\quad证明\quad在定理 1 中取 s=0,则 m_0=0\sigma_0^2=\sigma^2,于是 \mathbb{E}\left[\frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right]=\mu_{0}\left(\frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \dfrac{2 \Delta}{\eta T}+\beta \eta \sigma^{2},那么对 \eta 应用引理 2 并在其中令 p=1 就得到\mathbb{E}\left[\frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right] \leqslant \frac{4 \beta \Delta}{T}+\sqrt{\frac{8 \beta \Delta \sigma^{2}}{T}}. \quad\quad推论 2 (固定或有界噪声) 假设 \|X_t\| \leqslant B 对所有 t<T 成立,那么取定 \eta=\dfrac{1}{\beta},就 a.s. -\mathbb{P}\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \dfrac{2 \beta \Delta}{T}+B^{2}. \quad\quad证明\quad 首先由性质 3 有\lim_{s \to +\infty}\mu_s(\|X_t\|^2|\mathcal{F_t}) \leqslant \mathrm{ess \ sup}\|X_t\|^2 \leqslant B^2 a.s. -\mathbb{P} 成立,再在定理 1 中取 \eta=\dfrac{1}{\beta} 并令 s \to +\infty就得到\mathrm{ess \ sup}\left(\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \dfrac{2 \beta \Delta}{T}+B^{2},即 a.s. -\mathbb{P}\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \dfrac{2 \beta \Delta}{T}+B^{2}. \quad\quad推论 3 (无偏、方差有界且服从亚指数分布的噪声) 假设 E\left[X_t | \mathcal{F}_t\right] = 0\mathrm{var}\left[X_t | \mathcal{F}_t\right] \leqslant \sigma^2 对所有 t<T 成立,且存在常数 a,b,c>0 使得在 \mathcal{F}_t 条件下,\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle 服从 (a\sigma^2, c)-亚指数分布,\|X_t\|^2 服从 (b\sigma^2, c)-亚指数分布. 记 \kappa_{1}=1+\dfrac{b}{2c},则取定 \eta=\min \left\{\sqrt{\dfrac{2 \Delta}{\kappa_{1} \sigma^{2} \beta T}}, \dfrac{1}{\beta}\right\} 时,至少以概率 1-\delta 成立 \frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \frac{4 \beta \Delta}{T}+\frac{8 c}{T}\log \dfrac{1}{\delta}+\sqrt{\frac{8 \kappa_{1} \sigma^2 \beta \Delta}{T}}+\sqrt{\frac{32 a \sigma^{2}}{T}\log \dfrac{1}{\delta}}. \quad\quad证明\quad由性质 5,对任意 s \in \left(0,\dfrac{1}{c}\right],有 \mu_{s}\left(-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle | \mathcal{F}_{t}\right) \leqslant \dfrac{a \sigma^{2} s}{2},且 \mu_{s}\left(\|X_t\|^2 | \mathcal{F}_{t}\right) \leqslant \dfrac{b \sigma^{2} s}{2} \leqslant \left(1+\dfrac{b}{2c}\right) \sigma^{2} . 那么对任意 x,s > 0,由定理 1 得 \mu_{s}\left(\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \dfrac{2 \Delta}{\eta T}+2(1-\beta \eta) m_{u}+\beta \eta \sigma_{v}^{2},其中 u = \dfrac{4s}{T} \leqslant \dfrac{1}{c}v = \dfrac{2 \beta \eta s}{T} \leqslant \dfrac{1}{c}m_u = \dfrac{a \sigma^{2} u}{2}\sigma_v^2 = \kappa_1\sigma^{2} . 再由性质 6 便有\mathbb{P}\left(\dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \geqslant \frac{2 \Delta}{\eta T}+2(1-\beta \eta) m_{u}+\beta \eta \sigma_{v}^{2}+x\right) \leqslant \mathrm{e}^{-sx},\delta = \mathrm{e}^{-sx},则至少以概率 1-\delta 成立 \dfrac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \kappa_1 \beta \eta \sigma^{2}+ \dfrac{2 \Delta}{\eta T}+\dfrac{8 a \sigma^{2} s}{T}+\dfrac{1}{s} \log \dfrac{1}{\delta},那么固定 \eta=\min \left\{\sqrt{\dfrac{2 \Delta}{\kappa_{1} \sigma^{2} \beta T}}, \dfrac{1}{\beta}\right\}s=\min\left\{\dfrac{1}{\sigma}\sqrt{\dfrac{T}{8a} \log \dfrac{1}{\delta}},\dfrac{T}{4c}\right\},分别对 \etas 应用引理 2 并在其中令 p=1 就得到至少以概率 1-\delta\frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \frac{4 \beta \Delta}{T}+\sqrt{\frac{8 \kappa_{1} \beta \Delta \sigma^2}{T}}+\frac{8 c}{T}\log \dfrac{1}{\delta}+4 \sigma \sqrt{\frac{ 2a }{T}\log \dfrac{1}{\delta}}. \quad\quad引理 3 \quadX 为非负随机变量,则 a.s. -\mathbb{P} 成立 \mathbb{E}[X | \mathcal{F}]= \displaystyle \int_{0}^{+\infty} \mathbb{P}(X \geqslant x | \mathcal{F}) \mathrm{d} x.
\quad\quad证明\quadX = \displaystyle \int_0^{+\infty}\mathbf{1}_{\{X \geqslant x\}} \mathrm{d} x 在任意 B \in \mathcal{F} 上两边积分得到\ \displaystyle \int_B X \mathrm{d} \mathbb{P} = \int_B \int_0^{+\infty}\mathbf{1}_{\{X \geqslant x\}} \mathrm{d} x \mathrm{d} \mathbb{P},那么由条件概率和条件期望的定义和 Fubini 定理就有 \begin{aligned} \int_B \mathbb{E}[X | \mathcal{F}] \mathrm{d} \mathbb{P} & = \int_B X \mathrm{d} \mathbb{P} = \int_B \int_0^{+\infty}\mathbf{1}_{\{X \geqslant x\}} \mathrm{d} x \mathrm{d} \mathbb{P} = \int_0^{+\infty} \int_B \mathbf{1}_{\{X \geqslant x\}} \mathrm{d} \mathbb{P} \mathrm{d} x \\ & = \int_0^{+\infty} \int_B \mathbb{E}\left[\mathbf{1}_{\{X \geqslant x\}} | \mathcal{F} \right] \mathrm{d} \mathbb{P} \mathrm{d} x = \int_0^{+\infty} \int_B \mathbb{P}(X\geqslant x | \mathcal{F}) \mathrm{d} \mathbb{P} \mathrm{d} x\\ & = \int_B \int_{0}^{+\infty} \mathbb{P}(X\geqslant x | \mathcal{F}) \mathrm{d} x \mathrm{d} \mathbb{P}, \end{aligned} 因此 a.s. -\mathbb{P} 成立 \mathbb{E}[X | \mathcal{F}]= \displaystyle \int_{0}^{+\infty} \mathbb{P}(X \geqslant x | \mathcal{F}) \mathrm{d} x.
\quad\quad引理 4 \quad 假设 E\left[X_t | \mathcal{F}_t\right] = 0,且存在常数 a,b,c>01<b \leqslant 2 使得对任意 s \in \left(0,\dfrac{1}{c}\right],都有 \mu_{-s}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right) \leqslant a \cdot s^{\frac{b-2}{2}},那么对任意 x \geqslant \sqrt{c}\mathbb{P}\left(\left\|X_{t}\right\| \geqslant x | \mathcal{F}_t \right) \leqslant 2ax^{-b}.
\quad\quad证明\quad在性质 6 中将 s 取成 x^{-2},就有 \mathbb{P}\left(\left\|X_{t}\right\| \geqslant x \big| \mathcal{F}_{t}\right)=\mathbb{P}\left(\left\|X_{t}\right\|^{2} \geqslant x^{2} \big| \mathcal{F}_{t}\right) \leqslant \frac{2 \mu_{-x^{-2}}\left(\left\|X_{t}\right\|^{2} \big| \mathcal{F}_{t}\right)}{x^{2}} \leqslant 2 a x^{-b}. \quad\quad推论 4 (无偏的重尾噪声) 假设 f 既是 \beta-光滑的,还是 L-Lipschitz 的,E\left[X_t | \mathcal{F}_t\right] = 0,且存在常数 a,b,c>01<b \leqslant 2 使得对任意 s \in \left(0,\dfrac{1}{c}\right] 都有 \mu_{-s}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right) \leqslant a \cdot s^{\frac{b-2}{2}}. 那么存在仅依赖于 a,b,c 的常数 \kappa_{2},\kappa_{3},\kappa_{4},\kappa_{5} 使得取定 \eta=\min \left\{\dfrac{T \varepsilon}{\beta}\left(\dfrac{4 \beta \Delta}{a b T^{3} \varepsilon^{2}}\right)^{\frac{2}{2+b}}, \dfrac{1}{\beta}\right\} 时,至少以概率 1-\delta 成立 \frac{1}{T} \sum_{t=0}^{T-1}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \frac{\kappa_{2} \beta \Delta}{T \delta}+\frac{\kappa_{3} \sqrt{\beta \Delta}}{T^{\frac{4-b}{4}} \delta^{\frac{2+b}{4}}}+\frac{\kappa_{4} L^{\frac{2+b}{3 b}}(\beta \Delta)^{\frac{b-1}{3 b}}}{T^{\frac{b-1}{b}} \delta^{\frac{2+b}{3 b}}}+\frac{\kappa_{5} \sqrt{\beta \Delta}}{T^{\frac{b-1}{b}} \delta^{\frac{2+b}{2 b}}}. \quad\quad证明\quad 对任意 s \in \left(0,\dfrac{1}{c}\right],首先有 \mathbb{E}\left[\mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right] =\mathrm{e}^{-s \mu_{-s}\left(\left\|X_{t}\right\|^{2} \mid \mathcal{F}_{t}\right)} \geqslant 1-s \mu_{-s}\left(\left\|X_{t}\right\|^{2} | \mathcal{F}_{t}\right) \geqslant 1-a s^{\frac{b}{2}} \geqslant 1-a c^{-\frac{b}{2}},Y = - \left\langle\nabla f\left(x_{t}\right), X_{t}\right\rangle,由 E\left[Y | \mathcal{F}_t\right] = 0fL-Lipschitz 的,有 \begin{aligned} \mathbb{E}\left[Y \mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right] & =\mathbb{E}\left[Y_{+} \mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right]-\mathbb{E}\left[Y_{-} \mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right] \\ & \leqslant \mathbb{E}\left[Y_{+} | \mathcal{F}_{t}\right]-\mathbb{E}\left[Y_{-} \mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right] \\ & =\mathbb{E}\left[Y_{-}\left(1-\mathrm{e}^{-s\left\|X_{t}\right\|^{2}}\right) \big| \mathcal{F}_{t}\right] \\ & \leqslant L \mathbb{E}\left[\left\|X_{t}\right\|\left(1-\mathrm{e}^{-s\left\|X_{t}\right\|^{2}}\right) \big| \mathcal{F}_{t}\right],\end{aligned} 最后一个不等号是利用了 Cauchy-Schwartz 不等式. 考虑函数 g(x)=x\left(1-\mathrm{e}^{-sx^2}\right),它在 g(x)=\left(0,+\infty\right) 上单调上升,且 g(x) \leqslant \min\left\{x, sx^3\right\},因此 g^{-1}(x) \geqslant \min\left\{x, \left(\dfrac{x}{s}\right)^{\frac{1}{3}}\right\}. 由引理 3、引理 4 就有 \begin{aligned} \mathbb{E}\left[\left\|X_{t}\right\|\left(1-\mathrm{e}^{-s\left\|X_{t}\right\|^{2}}\right) \mid \mathcal{F}_{t}\right] & =\mathbb{E}\left[g\left(\left\|X_{t}\right\|\right)| \mathcal{F}_t\right] \\ & =\int_{0}^{+\infty} \mathbb{P}(g(\left\|X_{t}\right\|) \geqslant x | \mathcal{F}_t ) \mathrm{d} x \\ & =\int_{0}^{+\infty} \mathbb{P}\left(\left\|X_{t}\right\| \geqslant g^{-1}(x) | \mathcal{F}_t\right) \mathrm{d} x \\ & =\int_{0}^{g(\sqrt{c})} \mathbb{P}\left(\left\|X_{t}\right\| \geqslant g^{-1}(x) | \mathcal{F}_t \right) \mathrm{d} x+\int_{g(\sqrt{c})}^{+\infty} \mathbb{P}\left(\left\|X_{t}\right\| \geqslant g^{-1}(x) | \mathcal{F}_t \right) \mathrm{d} x \\ & \leqslant g(\sqrt{c})+2 a \int_{0}^{+\infty}\left(g^{-1}(x)\right)^{-b} \mathrm{d} x \\ & \leqslant s c^{\frac{3}{2}}+2 a \int_{0}^{+\infty} \min\left\{x, \left(\dfrac{x}{s}\right)^{\frac{1}{3}}\right\}^{-b} \mathrm{d} x \\ & = s c^{\frac{3}{2}}+2 a \left( \int_{0}^{s^{-\frac{1}{2}}} \left(\dfrac{x}{s}\right)^{-\frac{b}{3}} \mathrm{d} x + \int_{s^{-\frac{1}{2}}}^{+\infty} x^{-b} \mathrm{d} x\right) \\ & = s c^{\frac{3}{2}}+\dfrac{4 a b}{(b-1)(3-b)} s^{\frac{b-1}{2}} \\ & \leqslant \left(c^{\frac{b}{2}}+\dfrac{4 a b}{(b-1)(3-b)}\right) s^{\frac{b-1}{2}}, \end{aligned}\kappa_{6}=\left(1-a c^{-\frac{b}{2}}\right)^{-1}\left(c^{\frac{b}{2}}+\dfrac{4 a b}{(b-1)(3-b)}\right),那么就有 \frac{\mathbb{E}\left[-\left\langle X_{t}, \nabla f\left(x_{t}\right)\right\rangle \mathrm{e}^{-s\left\|X_{t}\right\|^{2}} \big| \mathcal{F}_{t}\right]}{\mathbb{E}\left[\mathrm{e}^{-s\left\|X_{t}\right\|^{2}} | \mathcal{F}_{t}\right]} \leqslant \kappa_{6} L s^{\frac{b-1}{2}}, 接下来由定理 1,对任意 s \in \left(0,\dfrac{T}{\beta \eta c}\right]\mu_{-s}\left(\frac{1}{T} \sum_{t=1}^{T}\left\|\nabla f\left(x_{t}\right)\right\|^{2}\right) \leqslant \frac{2 \Delta}{\eta T}+\beta \eta a\left(\frac{\beta \eta s}{T}\right)^{\frac{b-2}{2}}+2\kappa_{6} L\left(\frac{\beta \eta s}{T}\right)^{\frac{b-1}{2}},再由性质 6,对任意 \varepsilon>0\mathbb{P}\left(\frac{1}{T} \sum_{t=1}^{T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \geqslant \varepsilon\right) \leqslant \frac{1+s \varepsilon}{\varepsilon}\left[\frac{2 \Delta}{\eta T}+\beta \eta a\left(\frac{\beta \eta s}{T}\right)^{\frac{b-2}{2}}+ 2 \kappa_{6} L\left(\frac{\beta \eta s}{T}\right)^{\frac{b-1}{2}}\right] ,s = \min \left\{\dfrac{1}{\varepsilon}, \dfrac{T}{\beta \eta c}\right\},令 y = \dfrac{\beta \eta}{T\varepsilon} 就有 \mathbb{P}\left(\frac{1}{T} \sum_{t=1}^{T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \geqslant \varepsilon\right) \leqslant \frac{2}{\varepsilon}\left[\frac{2 \beta \Delta}{T^{2} \varepsilon y}+a T \varepsilon y^{\frac{b}{2}}+a c^{\frac{2-b}{2}} T \varepsilon y+2\kappa_{6} L y^{\frac{b-1}{2}}\right],y = \min \left\{\left(\dfrac{4 \beta \Delta}{a b T^{3} \varepsilon^{2}}\right)^{\frac{2}{2+b}}, \dfrac{1}{T \varepsilon}\right\} 并对右侧前两项中的 y 应用引理 2,得到 \mathbb{P}\left(\frac{1}{T} \sum_{t=1}^{T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \geqslant \varepsilon\right) \leqslant A+B+C+D,其中\begin{aligned} A & =\frac{4(2+b) \beta \Delta}{b T \varepsilon}, \\ B & =(2+b)\left(\frac{b}{2}\right)^{-\frac{b}{2+b}} \varepsilon^{-1}(a T \varepsilon)^{\frac{2}{2+b}}\left(\frac{2 \beta \Delta}{T^{2} x}\right)^{\frac{b}{2+b}} = a^{\frac{2}{2+b}}(2+b)b^{-\frac{b}{2+b}}(\beta \Delta)^{\frac{b}{2+b}}T^{\frac{2(1-b)}{2+b}}\varepsilon^{-\frac{2b}{2+b}}, \\ C & =2 a c^{\frac{2-b}{2}} T\left(\frac{4 \beta \Delta}{a b T^{3} \varepsilon^{2}}\right)^{\frac{2}{2+b}} = 2^{\frac{6-b}{2+b}} a^{\frac{b}{2+b}} b^{-\frac{2}{2+b}} c^{\frac{2-b}{2}}(\beta \Delta)^{\frac{2}{2+b}}T^{\frac{b-4}{2+b}}\varepsilon^{-\frac{4}{2+b}},\\ D & =\frac{4 \kappa_{6} L}{\varepsilon}\left(\frac{4 \beta \Delta}{a b T^{3} \varepsilon^{2}}\right)^{\frac{b-1}{2+b}} =4^{\frac{2b+1}{2+b}}(ab)^{\frac{1-b}{2+b}}(\beta \Delta)^{\frac{b-1}{2+b}}\kappa_{6} L T^{-\frac{3(b-1)}{2+b}}\varepsilon^{-\frac{3b}{2+b}}, \end{aligned}\delta = \max\{4A,4B,4C,4D\} 便得到至少以概率 1-\delta 成立 \frac{1}{T} \sum_{t=1}^{T}\left\|\nabla f\left(x_{t}\right)\right\|^{2} \leqslant \frac{\kappa_{2} \beta \Delta}{T \delta}+\frac{\kappa_{3} \sqrt{\beta \Delta}}{T^{\frac{4-b}{4}} \delta^{\frac{2+b}{4}}}+\frac{\kappa_{4} L^{\frac{2+b}{3 b}}(\beta \Delta)^{\frac{b-1}{3 b}}}{T^{\frac{b-1}{b}} \delta^{\frac{2+b}{3 b}}}+\frac{\kappa_{5} \sqrt{\beta \Delta}}{T^{\frac{b-1}{b}} \delta^{\frac{2+b}{2 b}}}, 其中 \kappa_{2} =\dfrac{16(2+b)}{b},\ \kappa_{3} =4^{\frac{10+3b}{4}} a^{\frac{b}{4}} b^{-\frac{1}{2}} c^{\frac{4-b^{2}}{8}},\ \kappa_{4} =4^{\frac{1+b}{b}} \kappa_{6}^{\frac{2+b}{3 b}}(a b)^{\frac{1-b}{3 b}},\ \kappa_{5} =a^{\frac{1}{b}}(4(2+b))^{\frac{2+b}{2 b}} b^{-\frac{1}{2}}.

发表评论

您的电子邮箱地址不会被公开。