Read the full article: http://tecdat.cn/?p=17375
Recently a client asked us to write a research report on extreme value analysis, including some graphical and statistical output.
To help readers use the POT model, this guide contains practical examples of its use. The article briefly introduces extreme value theory (EVT), works through some basic examples, and finally carries out a concrete statistical analysis of river flood extremes through a case study.
Introduction to EVT
Univariate case
Suppose there exist normalizing constants a_n > 0 and b_n such that, for the sample maximum M_n = max(X_1, ..., X_n),

Pr[(M_n − b_n)/a_n ≤ x] → G(x), as n → ∞.
According to the extreme value type theorem (Fisher and Tippett, 1928), G must be a Fréchet, Gumbel or negative Weibull distribution. Jenkinson (1955) pointed out that these three distributions can be combined into a single parametric family: the generalized extreme value (GEV) distribution, with distribution function

G(x) = exp{ −[1 + ξ(x − µ)/σ]^(−1/ξ) }, for 1 + ξ(x − µ)/σ > 0,

where µ is a location parameter, σ > 0 a scale parameter and ξ a shape parameter.
From this result, Pickands (1975) showed that the limiting distribution of the standardized excesses over a threshold u is the generalized Pareto distribution (GPD) as the threshold approaches the endpoint µ_end of the target variable. That is, if X is a random variable,

Pr[X − u ≤ y | X > u] → H(y) = 1 − (1 + ξy/σ)^(−1/ξ), as u → µ_end,

for y ≥ 0 such that 1 + ξy/σ > 0.
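Before turning to the package itself, here is a minimal numerical sketch of Pickands' result: for a standard exponential sample, the excesses over a high threshold are again exponential, i.e. GPD with shape 0. All names and values below are illustrative, not taken from the original analysis.

> library(POT)
> set.seed(1)
> x <- rexp(1e5)
> u <- quantile(x, probs = 0.99)        # a high threshold
> excess <- x[x > u] - u                # threshold excesses
> ## empirical vs theoretical exceedance probabilities of the excesses
> y <- c(0.5, 1, 2)
> cbind(empirical = sapply(y, function(z) mean(excess > z)),
+       gpd = pgpd(y, loc = 0, scale = 1, shape = 0, lower.tail = FALSE))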
Basic usage
Random Numbers and Distribution Functions
First, let's start with the basics. Use R for random number generation and distribution functions.
> rgpd(5, loc = 1, scale = 2, shape = -0.2)
[1] 1.523393 2.946398 2.517602 1.199393 2.541937
> rgpd(6, c(1, -5), 2, -0.2)
[1]  1.3336965 -4.6504749  3.1366697 -0.9330325  3.5152161 -4.4851408
> rgpd(6, 0, c(2, 3), 0)
[1] 3.1139689 6.5900384 0.1886106 0.9797699 3.2638614 5.4755026
> pgpd(c(9, 15, 20), 1, 2, 0.25)
[1] 0.9375000 0.9825149 0.9922927
> qgpd(c(0.25, 0.5, 0.75), 1, 2, 0)
[1] 1.575364 2.386294 3.772589
> dgpd(c(9, 15, 20), 1, 2, 0.25)
[1] 0.015625000 0.003179117 0.001141829
Use the option lower.tail = TRUE or lower.tail = FALSE to compute non-exceedance or exceedance probabilities, respectively;
specify whether quantiles correspond to non-exceedance or exceedance probabilities with lower.tail = TRUE or lower.tail = FALSE, respectively;
specify whether to compute the density or the log-density with log = FALSE or log = TRUE, respectively.
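As a short illustration of these options (the parameter values are chosen arbitrarily):

> pgpd(5, 1, 2, 0.25)                        # non-exceedance probability Pr[X <= 5]
> pgpd(5, 1, 2, 0.25, lower.tail = FALSE)    # exceedance probability Pr[X > 5]
> qgpd(0.1, 1, 2, 0.25, lower.tail = FALSE)  # level exceeded with probability 0.1
> dgpd(5, 1, 2, 0.25, log = TRUE)            # log-density at 5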
Threshold Selection Plots
Additionally, the Fisher information can be used to calculate confidence intervals.
> x <- runif(10000)
> par(mfrow = c(1, 2))
> tcplot(x, u.range = c(0.9, 0.995))   # threshold choice plot; u.range assumed, lost in the original
The result is shown in the figure. We can clearly see that a threshold of 0.98 is a reasonable choice.
Confidence intervals can be added to this plot, since the empirical mean can be considered normally distributed (central limit theorem). However, for high thresholds normality no longer holds and, moreover, by construction the plot always converges to the point (x_max, 0).
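The following hand-rolled sketch shows what such a mean residual life plot computes internally: the empirical mean excess with normal (CLT-based) confidence bands. It is illustrative only; the package's plotting function should be preferred.

> set.seed(1)
> x <- rnorm(10000)
> u <- seq(min(x), quantile(x, probs = 0.995), length.out = 100)
> mrl <- sapply(u, function(ui) {
+   exc <- x[x > ui] - ui                 # excesses over threshold ui
+   se  <- sd(exc) / sqrt(length(exc))    # CLT-based standard error
+   mean(exc) + c(-1.96, 0, 1.96) * se    # lower band, mean, upper band
+ })
> matplot(u, t(mrl), type = "l", lty = c(2, 1, 2),
+         col = c("green", "black", "green"),
+         xlab = "Threshold", ylab = "Mean excess")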
Here's another comprehensive example.
> x <- rnorm(10000)
> mrlplot(x, u.range = c(1, quantile(x, probs = 0.995)),
+         col = c("green", "black", "green"))   # mean residual life plot; the col argument was truncated in the original, completed with assumed values
L-moment diagram
L-moments are summary statistics for probability distributions and data samples. They are analogous to ordinary moments (they provide measures of location, dispersion, skewness, kurtosis, and other aspects of the shape of a probability distribution or data sample) but are computed from linear combinations of the ordered data values (hence the prefix L).
Here is a simple example.
> x <- c(1 - abs(rnorm(200, 0, 0.2)), rgpd(100, 1, 2, 0.25))
> lmomplot(x, identify = FALSE)   # L-moment plot; plotting call reconstructed, arguments assumed
We have found that this graph often performs poorly on real data.
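For reference, the sample L-moments underlying this diagnostic can also be inspected directly with samlmu (presented again later in this guide); the nmom value here is arbitrary:

> samlmu(x, nmom = 4)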
Dispersion Index Diagram
Dispersion index plots are especially useful when working with time series. EVT states that excesses over a threshold can be approximated by a GPD. However, EVT also requires the occurrences of these excesses to be represented by a Poisson process.
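As a reminder of what the dispersion index measures, here is a hand-rolled sketch: for a Poisson process, the number of exceedances per block (e.g. per year) has variance equal to its mean, so the ratio should be close to 1. All names and values are illustrative.

> set.seed(1)
> obs  <- rexp(3650)                    # 10 "years" of daily data
> year <- rep(1:10, each = 365)
> counts <- tapply(obs > quantile(obs, 0.95), year, sum)  # yearly exceedance counts
> var(counts) / mean(counts)            # dispersion index, close to 1 here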
For the next example, we use the dataset included in the POT package. Also, since the flood data is a time series and thus highly autocorrelated, we must "extract" extreme events while maintaining independence between events.
> events <- clust(dat, u = 2, tim.cond = 8/365, clust.max = TRUE)   # call reconstructed from the surviving fragment; "dat" holds the flood series
> diplot(events, u.range = c(2, 20))
The dispersion index diagram is shown in the figure. As can be seen from the plot, a threshold of around 5 is reasonable.
Fitting GPD
Univariate case
The GPD can be fitted with several estimators.
The MLE is a special case, as it is the only one that allows varying thresholds.
We give some illustrative examples here.
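A sketch of code that could produce a comparison table like the one below; the estimator labels and the $param component are assumptions about the POT API (see ?fitgpd for the exact names):

> set.seed(1)
> x <- rgpd(500, loc = 1, scale = 2, shape = 0.25)
> ests <- c("mom", "mle", "pwmu", "pwmb", "pickands", "med", "mdpd", "mple")
> t(sapply(ests, function(e) fitgpd(x, thresh = 1, est = e)$param))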
                scale      shape
mom      1.9538076495  0.2423393
mle      2.0345084386  0.2053905
pwmu     2.0517348996  0.2043644
pwmb     2.0624399910  0.2002131
pickands 2.3693985422 -0.0708419
med      2.2194363549  0.1537701
mdpd     2.0732577511  0.1809110
mple     2.0499646631  0.1960452
ad2r     0.0005539296 27.5964097
The MLE, MPLE and MGF estimators allow the scale or the shape parameter to be held fixed. For example, if we want to fit an exponential distribution (i.e. shape = 0):
> fitgpd(x, thresh = 1, shape = 0, est = "mle")
Estimator: MLE
 Deviance: 322.686
      AIC: 324.686

Varying Threshold: FALSE

  Threshold Call: 1
    Number Above: 100
Proportion Above: 1

Estimates
 scale
 1.847

Standard Error Type: observed

Standard Errors
 scale
0.1847

Asymptotic Variance Covariance
      scale
scale 0.03410

Optimization Information
  Convergence: successful
  Function Evaluations: 7
  Gradient Evaluations: 1

> fitgpd(x, thresh = 1, scale = 2, est = "mle")
Estimator: MLE
 Deviance: 323.3049
      AIC: 325.3049

Varying Threshold: FALSE

  Threshold Call: 1
    Number Above: 100
Proportion Above: 1

Estimates
    shape
0.0003398

Standard Error Type: observed

Standard Errors
  shape
0.06685

Asymptotic Variance Covariance
         shape
shape 0.004469

Optimization Information
  Convergence: successful
  Function Evaluations: 5
  Gradient Evaluations: 1

If we now want to fit a GPD with a varying threshold, just do:

> x <- rgpd(500, 1:2, 0.3, 0.01)
> fitgpd(x, 1:2, est = "mle")
Estimator: MLE
 Deviance: -176.1669
      AIC: -172.1669

Varying Threshold: TRUE

  Threshold Call: 1:2
    Number Above: 500
Proportion Above: 1

Estimates
  scale   shape
 0.3261 -0.0556

Standard Error Type: observed

Standard Errors
  scale   shape
0.02098 0.04632

Asymptotic Variance Covariance
           scale      shape
scale  0.0004401 -0.0007338
shape -0.0007338  0.0021451

Optimization Information
  Convergence: successful
  Function Evaluations: 62
  Gradient Evaluations: 11
Bivariate case

We now fit a bivariate POT model. All of these models are fitted by maximum likelihood.

> Mlog <- fitbvgpd(cbind(x, y), c(0, 2), model = "log")
> Mlog
Estimator: MLE
Dependence Model and Strenght:
   Model : Logistic
   lim_u Pr[ X_1 > u | X_2 > u] = NA
 Deviance: 1313.260
      AIC: 1323.260

   Marginal Threshold: 0 2
Marginal Number Above: 500 500
Marginal Proportion Above: 1 1
    Joint Number Above: 500
Joint Proportion Above: 1
Number of events such as {Y1 > u1} U {Y2 > u2}: 500

Estimates
 scale1  shape1  scale2  shape2   alpha
 0.9814  0.2357  0.5294 -0.2835  0.9993

Standard Errors
   scale1    shape1    scale2    shape2     alpha
6.784e-02 5.303e-02 2.993e-02 3.718e-02 2.001e-06

Asymptotic Variance Covariance
           scale1     shape1     scale2     shape2      alpha
scale1  4.602e-03 -2.285e-03  1.520e-06 -1.145e-06 -3.074e-11
shape1 -2.285e-03  2.812e-03 -1.337e-07  4.294e-07 -1.843e-11
scale2  1.520e-06 -1.337e-07  8.956e-04 -9.319e-04  8.209e-12
shape2 -1.145e-06  4.294e-07 -9.319e-04  1.382e-03  5.203e-12
alpha  -3.074e-11 -1.843e-11  8.209e-12  5.203e-12  4.003e-12

Optimization Information
  Convergence: successful
  Function Evaluations: 150
  Gradient Evaluations: 21
In the summary we can see that lim_u Pr[X_1 > u | X_2 > u] = 0.02. This is the χ statistic of Coles et al. (1999). For parametric bivariate extreme value models it satisfies χ = 2(1 − A(1/2)), where A is the Pickands dependence function.
For independent variables χ = 0, while under perfect dependence χ = 1. In our application the value 0.02 indicates that the variables are nearly independent, as expected. In view of this, some parameters can be fixed.
> fitbvgpd(cbind(x, y), c(0, 2), model = "log", alpha = 1)   # alpha fixed at 1 (independence); argument reconstructed from context
Estimator: MLE
Dependence Model and Strenght:
   Model : Logistic
   lim_u Pr[ X_1 > u | X_2 > u] = NA
 Deviance: 1312.842
      AIC: 1320.842

   Marginal Threshold: 0 2
Marginal Number Above: 500 500
Marginal Proportion Above: 1 1
    Joint Number Above: 500
Joint Proportion Above: 1
Number of events such as {Y1 > u1} U {Y2 > u2}: 500

Estimates
 scale1  shape1  scale2  shape2
 0.9972  0.2453  0.5252 -0.2857

Standard Errors
 scale1  shape1  scale2  shape2
0.07058 0.05595 0.02903 0.03490

Asymptotic Variance Covariance
           scale1     shape1     scale2     shape2
scale1  4.982e-03 -2.512e-03 -3.004e-13  3.544e-13
shape1 -2.512e-03  3.130e-03  1.961e-13 -2.239e-13
scale2 -3.004e-13  1.961e-13  8.427e-04 -8.542e-04
shape2  3.544e-13 -2.239e-13 -8.542e-04  1.218e-03

Optimization Information
  Convergence: successful
  Function Evaluations: 35
  Gradient Evaluations: 9
Note that, since all (non-independent) bivariate extreme value distributions are asymptotically dependent, the χ̄ statistic of Coles et al. (1999) is always equal to 1.
Another way to assess the strength of dependence is to plot the Pickands dependence function:
> pickdep(Mlog)
The horizontal line corresponds to independence, while the lower border corresponds to perfect dependence. Note that, by construction, the mixed and asymmetric mixed models cannot model perfectly dependent variables.
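For reference, any Pickands dependence function A is convex and satisfies the bounds

max(w, 1 − w) ≤ A(w) ≤ 1, for w in [0, 1],

where the upper bound A ≡ 1 corresponds to independence and the lower bound to perfect dependence.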
Modelling the dependence structure with Markov chains
The classical approach to peaks-over-threshold analysis is to fit the GPD to cluster maxima. However, this wastes data, since only the cluster maxima are considered. The main idea here is instead to model the dependence structure with a Markov chain whose joint distribution is a multivariate extreme value distribution. This idea was first proposed by Smith et al. (1997). In the remainder of this section we consider only first-order Markov chains, so the likelihood of all exceedances is

L(y_1, ..., y_n; θ) = [ ∏_{i=2..n} f(y_{i−1}, y_i; θ) ] / [ ∏_{i=2..n−1} f(y_i; θ) ],

where f(·, ·; θ) is the joint density and f(·; θ) the marginal density.
For our application, we simulate a first-order Markov chain with an extreme value dependence structure.
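A sketch of how such a chain can be simulated with the package: simmc draws a first-order Markov chain with the chosen extreme value dependence structure and uniform margins, which are then transformed to GPD margins (the parameter values below are illustrative assumptions):

> set.seed(1)
> mc <- simmc(1000, alpha = 0.5, model = "log")    # logistic dependence
> mc <- qgpd(mc, loc = 2, scale = 1, shape = 0.15) # GPD margins; values assumed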
> fitmcgpd(mc, 2, "log")
Estimator: MLE
Dependence Model and Strenght:
   Model : Logistic
   lim_u Pr[ X_1 > u | X_2 > u] = NA
 Deviance: 1334.847
      AIC: 1340.847

  Threshold Call:
    Number Above: 998
Proportion Above: 1

Estimates
 scale  shape  alpha
0.8723 0.1400 0.5227

Standard Errors
  scale   shape   alpha
0.08291 0.04730 0.02345

Asymptotic Variance Covariance
           scale      shape      alpha
scale  0.0068737 -0.0016808 -0.0009066
shape -0.0016808  0.0022369 -0.0004412
alpha -0.0009066 -0.0004412  0.0005501

Optimization Information
  Convergence: successful
  Function Evaluations: 70
  Gradient Evaluations: 15
Confidence intervals
Confidence intervals are usually wanted after fitting a statistical model. Currently, confidence intervals can be computed only for the mle, pwmu, pwmb and moment estimators.
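For example, Fisher-information based intervals for a fitted model can be obtained as follows (a sketch; gpd.fiscale and gpd.fishape are the POT helpers we believe produce output of the form shown below, see the package manual):

> set.seed(1)
> x   <- rgpd(200, loc = 1, scale = 2, shape = 0.25)
> mle <- fitgpd(x, thresh = 1, est = "mle")
> gpd.fiscale(mle, conf = 0.95)   # interval for the scale parameter
> gpd.fishape(mle, conf = 0.95)   # interval for the shape parameter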
conf.inf.scale conf.sup.scale
      2.110881       2.939282

conf.inf.scale conf.sup.scale
      1.633362       3.126838

conf.inf.scale conf.sup.scale
      2.138851       3.074945

conf.inf.scale conf.sup.scale
      2.149123       3.089090
For shape parameter confidence intervals:
conf.inf conf.sup
2.141919       NA

  conf.inf   conf.sup
0.05757576 0.26363636
Confidence intervals for quantiles are also available.
conf.inf conf.sup
 8.64712 12.22558

 conf.inf  conf.sup
 8.944444 12.833333
The profile confidence interval function both returns the confidence interval and plots the profile log-likelihood function.
Model checking
To check the fitted model, call the plot function on the fitted object:
> plot(fitted, npy = 1)
The figure shows the graphics window obtained by executing this call.
Clustering techniques
Selecting peaks over a threshold can be problematic when working with time series. Indeed, since time series are usually highly autocorrelated, simply taking all values above the threshold may yield correlated events.
The clust function attempts to identify peaks over a threshold while meeting an independence criterion. For this, it requires at least two parameters: the threshold u and a time condition, tim.cond, for independence. Clusters are identified as follows:
1. The first exceedance of the threshold starts the first cluster;
2. the first observation below the threshold "ends" the cluster, unless tim.cond is not satisfied;
3. the next exceedance that satisfies tim.cond starts a new cluster;
4. iterate as needed.
A preliminary study showed that two flood events can be considered independent if they are separated by more than 8 days. Note that the units of tim.cond must match those of the data being analysed.
The function returns a list containing the identified clusters.
> clust(dat, u = 2, tim.cond = 8/365)
Other
Return periods
The functions rp2prob and prob2rp convert return periods to non-exceedance probabilities and vice versa. They require either a return period or a non-exceedance probability, together with npy, the mean number of events per year; the relation is prob = 1 − 1/(npy × retper).
> rp2prob(retper = 50, npy = 1.8)     # calls reconstructed from the printed output
  npy retper      prob
1 1.8     50 0.9888889
> prob2rp(prob = 0.6, npy = 2.2)
  npy   retper prob
1 2.2 1.136364  0.6
Unbiased sample L-moments
The function samlmu computes unbiased sample L moments.
        l_1         l_2          t_3          t_4          t_5
0.455381591 0.170423740  0.043928262 -0.005645249 -0.009310069
River Threshold Analysis
In this section, we provide a comprehensive and detailed analysis of river thresholds.
Moving Average Window for Time Series
The function ts2tsd computes an "averaged" time series from the initial time series. This is achieved by sliding a moving-average window of length d over the initial series.
> tsd <- ts2tsd(dat, 3 / 365)   # moving-average series; window length assumed
> plot(dat, type = "l", col = "blue")
> lines(tsd, col = "green")
The result is shown in the figure: the blue line depicts the instantaneous flood discharge and the green line the averaged series.
Since this is a time series, we have to select independent events above a threshold. First, we fix a relatively low threshold in order to "extract" more events; as a consequence, some of them are ordinary rather than extreme events. This is necessary for subsequently choosing a reasonable threshold for the GPD asymptotic approximation.
> events0 <- clust(dat, u = 2, tim.cond = 8/365, clust.max = TRUE)   # low-threshold events; call assumed from the text
> par(mfrow = c(2, 2))
> mrlplot(events0[, "obs"])   # upper-left panel; the remaining panel calls were lost in extraction
> abline(v = 6, col = "green")
According to the figure, a threshold of 6 m³/s should be reasonable. The mean residual life plot (upper-left panel) indicates that a threshold of around 10 m³/s would also be suitable. However, the chosen threshold must be low enough to leave enough events above it, to reduce the variance, but not too low, which would increase the bias.
Therefore, we can now re-extract the events above the threshold of 6 m³/s to obtain the object events1. Note that an estimate of the extreme value index can also be used.
> events1 <- clust(dat, u = 6, tim.cond = 8/365, clust.max = TRUE)   # call reconstructed; arguments taken from the text
> summary(dat)                                                       # command assumed from the printed summary
      time            obs
 Min.   :1970   Min.   : 0.022
 1st Qu.:1981   1st Qu.: 0.236
 Median :1991   Median : 0.542
 Mean   :1989   Mean   : 1.024
 3rd Qu.:1997   3rd Qu.: 1.230
 Max.   :2004   Max.   :44.200
                NA's   : 1.000
The fit results give the name of the estimator, the threshold, the number and proportion of observations above the threshold, the parameter estimates, the standard errors and their type, the asymptotic variance-covariance matrix, and convergence diagnostics.
The plot shows the graphical diagnostics of the fitted model. As can be seen, the fitted "mle" model seems suitable. Suppose we now want to know the return level associated with the 100-year return period.
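For completeness, the relation used here: with on average npy events per year, a T-year return period corresponds to the non-exceedance probability

prob = 1 − 1 / (npy × T),

and the return level is the GPD quantile at that probability, qgpd(prob, u, σ, ξ) = u + (σ/ξ) [(1 − prob)^(−ξ) − 1].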
> rp2prob(retper = 100, npy = 1.707897)   # call reconstructed from the printed output
       npy retper      prob
1 1.707897    100 0.9941448
   scale
36.44331
The graph depicts the confidence interval for the quantile associated with the 100-year return period, taking the uncertainty into account.
conf.inf conf.sup
25.56533 90.76633
Sometimes it is necessary to know the estimated return period of a given observed event. Let us work with the largest event.
> maxEvent
[1] 44.2
    shape
0.9974115
> prob2rp(prob, npy = 1.707897)   # "prob" holds the fitted non-exceedance probability of maxEvent
       npy   retper      prob
1 1.707897 226.1982 0.9974115
Thus, the return period of the largest event, observed in June 2000, is roughly 226 years.
This article is adapted from "Extreme Value Analysis of Flood by Markov Chain (MC) with Extreme Value (EVT) Dependent Structure in R Language".