# Extreme value theory (EVT) and Markov chain (MC) dependence structure analysis of flood extremes in R | code and data attached

Recently we were asked by a client to write a research report on extreme value analysis, including some graphical and statistical output.

To help clients use the POT (peaks-over-threshold) model, this guide contains practical examples of its use. The article gives a quick introduction to extreme value theory (EVT), works through some basic examples, and finally carries out a full statistical analysis of river flood extremes as a case study.

### Univariate case

Suppose there exist normalizing constants a_n > 0 and b_n such that, for the maximum M_n = max(X_1, …, X_n) of n iid variables,

Pr[(M_n − b_n)/a_n ≤ x] → G(x) as n → ∞,

for some non-degenerate distribution G.

According to the extreme value type theorem (Fisher and Tippett, 1928), G must be a Fréchet, Gumbel or reversed Weibull distribution. Jenkinson (1955) pointed out that these three distributions can be combined into a single parametric family: the generalized extreme value (GEV) distribution, with distribution function

G(x) = exp{ −[1 + ξ(x − µ)/σ]^(−1/ξ) }, defined where 1 + ξ(x − µ)/σ > 0.

From this result, Pickands (1975) showed that the limiting distribution of the normalized excesses over a threshold µ, as the threshold approaches the endpoint µ_end of the variable of interest, is the generalized Pareto distribution (GPD). That is, if X is a random variable, then

Pr[X − µ ≤ y | X > µ] → H(y) = 1 − (1 + ξy/σ)^(−1/ξ) as µ → µ_end.
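As a quick illustration of these definitions, the GEV distribution function can be written down directly in base R (a minimal sketch, independent of the POT package; the function name is ours), and the Gumbel case can be checked for max-stability:

```r
# GEV distribution function, straight from the definition (illustrative sketch)
pgev_manual <- function(x, loc = 0, scale = 1, shape = 0) {
  if (shape == 0) {                       # Gumbel limit case
    exp(-exp(-(x - loc) / scale))
  } else {
    z <- 1 + shape * (x - loc) / scale
    # outside the support: 0 below a lower endpoint, 1 above an upper endpoint
    ifelse(z > 0, exp(-z^(-1 / shape)), as.numeric(shape < 0))
  }
}

# Max-stability of the Gumbel case: the maximum of n iid Gumbel variables
# is again Gumbel, with the location shifted by scale * log(n)
n <- 100
pgev_manual(3.7, 0, 2, 0)^n              # distribution of the maximum at x = 3.7
pgev_manual(3.7, 2 * log(n), 2, 0)       # same value, by max-stability
```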

## Basic usage

### Random Numbers and Distribution Functions

First, let's start with the basics. Use R for random number generation and distribution functions.

```
> rgpd(5, loc = 1, scale = 2, shape = -0.2)
[1] 1.523393 2.946398 2.517602 1.199393 2.541937
> rgpd(6, c(1, -5), 2, -0.2)
[1]  1.3336965 -4.6504749  3.1366697 -0.9330325  3.5152161 -4.4851408
> rgpd(6, 0, c(2, 3), 0)
[1] 3.1139689 6.5900384 0.1886106 0.9797699 3.2638614 5.4755026
> pgpd(c(9, 15, 20), 1, 2, 0.25)
[1] 0.9375000 0.9825149 0.9922927
> qgpd(c(0.25, 0.5, 0.75), 1, 2, 0)
[1] 1.575364 2.386294 3.772589
> dgpd(c(9, 15, 20), 1, 2, 0.25)
[1] 0.015625000 0.003179117 0.001141829
```

Several options are worth noting:
- use `lower.tail = TRUE` or `lower.tail = FALSE` to compute non-exceedance or exceedance probabilities, respectively;
- for quantiles, `lower.tail = TRUE` or `lower.tail = FALSE` specifies whether the supplied probabilities are non-exceedance or exceedance probabilities;
- use `log = FALSE` or `log = TRUE` to compute the density or the log-density, respectively.
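The numbers above are easy to verify by hand, since the GPD distribution and quantile functions have closed forms. The following base-R sketch (our own helper names, not the package API) reproduces them:

```r
# GPD cdf and quantile function written from the closed-form definitions
pgpd_manual <- function(q, loc = 0, scale = 1, shape = 0) {
  z <- pmax((q - loc) / scale, 0)        # values below loc have probability 0
  if (shape == 0) 1 - exp(-z) else 1 - pmax(1 + shape * z, 0)^(-1 / shape)
}
qgpd_manual <- function(p, loc = 0, scale = 1, shape = 0) {
  if (shape == 0) loc - scale * log(1 - p)
  else loc + scale * ((1 - p)^(-shape) - 1) / shape
}

pgpd_manual(c(9, 15, 20), 1, 2, 0.25)    # matches pgpd(): 0.9375000 0.9825149 0.9922927
qgpd_manual(c(0.25, 0.5, 0.75), 1, 2, 0) # matches qgpd(): 1.575364 2.386294 3.772589
```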

### Threshold Selection Plots

Additionally, the Fisher information can be used to calculate confidence intervals.

```
> x <- runif(10000)
> par(mfrow = c(1, 2))
> tcplot(x, u.range = c(0.9, 0.995))  # parameter-stability plots; u.range is an illustrative choice
```

The result is shown in the figure. We can clearly see that a threshold of 0.98 is a reasonable choice.

Confidence intervals can be added to this plot, since the empirical mean can be considered normally distributed (central limit theorem). However, for high thresholds normality no longer holds, and moreover, by construction, the plot always converges to the point (x_max, 0).
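The idea behind these confidence bands can be sketched in a few lines of base R (illustrative only, not the package's plotting code): for each threshold, the mean excess and a normal-approximation interval are computed from the exceedances. For exponential data the true mean excess is constant (here 1), which makes the sketch easy to check:

```r
set.seed(1)
x <- rexp(10000)                          # memoryless: true mean excess is 1 at any threshold
u <- seq(0, 3, by = 0.5)
mrl <- sapply(u, function(ui) {
  exc <- x[x > ui] - ui                   # excesses over threshold ui
  m  <- mean(exc)
  se <- sd(exc) / sqrt(length(exc))       # CLT-based standard error
  c(mean = m, low = m - 1.96 * se, high = m + 1.96 * se)
})
round(t(mrl), 3)                          # one row per threshold
```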
Here's another comprehensive example.

```
> x <- rnorm(10000)
> mrlplot(x, u.range = c(1, quantile(x, probs = 0.995)))
```

#### L-moment diagram

L-moments are summary statistics for probability distributions and data samples. They are analogous to ordinary moments, providing measures of location, dispersion, skewness, kurtosis, and other aspects of the shape of a distribution or sample, but they are computed from linear combinations of the ordered data values (hence the prefix L).
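For intuition, the first three sample L-moments can be computed directly from probability-weighted moments of the ordered sample. This is a minimal base-R sketch (the helper name is ours, not the samlmu API):

```r
# Unbiased sample L-moments l_1, l_2 and the L-skewness t_3, computed from
# probability-weighted moments of the order statistics (illustrative sketch)
lmoments_manual <- function(x) {
  xs <- sort(x); n <- length(xs); i <- seq_len(n)
  b0 <- mean(xs)
  b1 <- sum((i - 1) / (n - 1) * xs) / n
  b2 <- sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * xs) / n
  c(l_1 = b0, l_2 = 2 * b1 - b0, t_3 = (6 * b2 - 6 * b1 + b0) / (2 * b1 - b0))
}

lmoments_manual(c(2, 3, 4, 2.5, 10))
```

Note that l_1 is simply the sample mean, and l_2 is half the Gini mean difference of the sample.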

Here is a simple example.

```
> x <- c(1 - abs(rnorm(200, 0, 0.2)), rgpd(100, 1, 2, 0.25))
> lmomplot(x, identify = FALSE)
```

We have found that this graph often performs poorly on real data.

#### Dispersion index plot

Dispersion index plots are especially useful when working with time series. EVT states that excesses over a threshold can be approximated by a GPD. However, EVT also requires that the occurrences of these excesses be represented by a Poisson process.
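The dispersion index itself is just the ratio of the variance to the mean of the number of events per block; for a Poisson process it equals 1. A base-R sketch (illustrative, unrelated to the diplot internals):

```r
set.seed(42)
counts <- rpois(50, lambda = 3)          # 50 "years" of Poisson event counts
di <- var(counts) / mean(counts)         # dispersion index; near 1 for Poisson data
di
```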

For the next example, we use a flood dataset included in the POT package. Since the flood data form a time series and are therefore highly autocorrelated, we must "extract" extreme events while maintaining independence between events.

```
> events <- clust(dat, u = 2, tim.cond = 8 / 365, clust.max = TRUE)
> diplot(events, u.range = c(2, 20))
```

The dispersion index plot is shown in the figure. As can be seen from the plot, a threshold of around 5 is reasonable.

## Fitting GPD

### Univariate case

The GPD can be fitted with several estimators.
The MLE is a special case, as it is the only one that allows varying thresholds.
We give some teaching examples here.
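Before turning to the package's estimators, here is a hedged base-R sketch of what the MLE does under the hood: a hand-written GPD negative log-likelihood minimized with optim (not the POT fitting code; exponential data are used so the true parameters are scale = 1, shape = 0):

```r
# Negative log-likelihood of the GPD for excesses over a threshold
# (illustrative sketch; par = c(scale, shape))
nllh_gpd <- function(par, exc) {
  sigma <- par[1]; xi <- par[2]
  if (sigma <= 0) return(1e10)                       # invalid scale
  if (abs(xi) < 1e-8)                                # exponential limit case
    return(length(exc) * log(sigma) + sum(exc) / sigma)
  z <- 1 + xi * exc / sigma
  if (any(z <= 0)) return(1e10)                      # outside the support
  length(exc) * log(sigma) + (1 / xi + 1) * sum(log(z))
}

set.seed(123)
x <- rexp(2000)
u <- qexp(0.9)                                       # threshold at the 90% quantile
exc <- x[x > u] - u                                  # excesses over the threshold
fit <- optim(c(1, 0.1), nllh_gpd, exc = exc)
fit$par                                              # (scale, shape); true values are (1, 0)
```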

```
              scale      shape
mom      1.9538076495  0.2423393
mle      2.0345084386  0.2053905
pwmu     2.0517348996  0.2043644
pwmb     2.0624399910  0.2002131
pickands 2.3693985422 -0.0708419
med      2.2194363549  0.1537701
mdpd     2.0732577511  0.1809110
mple     2.0499646631  0.1960452
```

The MLE, MPLE and MGF estimators allow the scale or the shape parameter to be fixed. For example, to fit an exponential distribution (shape fixed at 0):

```
> fitgpd(x, thresh = 1, shape = 0, est = "mle")
Estimator: MLE
Deviance: 322.686
AIC: 324.686
Varying Threshold: FALSE

Threshold Call: 1
Number Above: 100
Proportion Above: 1
Estimates
scale
1.847
Standard Error Type: observed
Standard Errors
scale
0.1847
Asymptotic Variance Covariance
scale
scale 0.03410
Optimization Information
Convergence: successful
Function Evaluations: 7
> fitgpd(x, thresh = 1, scale = 2, est = "mle")
Estimator: MLE
Deviance: 323.3049
AIC: 325.3049
Varying Threshold: FALSE
Threshold Call: 1
Number Above: 100
Proportion Above: 1
Estimates
shape
0.0003398
Standard Error Type: observed
Standard Errors
shape
0.06685
Asymptotic Variance Covariance
shape
shape 0.004469
Optimization Information
Convergence: successful
Function Evaluations: 5
```

If we now want to fit a GPD with a varying threshold, we just do:

```
> x <- rgpd(500, 1:2, 0.3, 0.01)
> fitgpd(x, 1:2, est = "mle")
Estimator: MLE
Deviance: -176.1669
AIC: -172.1669
Varying Threshold: TRUE
Threshold Call: 1:2
Number Above: 500
Proportion Above: 1
Estimates
scale shape
0.3261 -0.0556
Standard Error Type: observed
Standard Errors
scale shape
0.02098 0.04632
Asymptotic Variance Covariance
scale shape
scale 0.0004401 -0.0007338
shape -0.0007338 0.0021451
Optimization Information
Convergence: successful
Function Evaluations: 62

```

```
Standard Errors
    scale1    shape1    scale2    shape2     alpha
 6.784e-02 5.303e-02 2.993e-02 3.718e-02 2.001e-06
Asymptotic Variance Covariance
           scale1     shape1     scale2     shape2      alpha
scale1  4.602e-03 -2.285e-03  1.520e-06 -1.145e-06 -3.074e-11
shape1 -2.285e-03  2.812e-03 -1.337e-07  4.294e-07 -1.843e-11
scale2  1.520e-06 -1.337e-07  8.956e-04 -9.319e-04  8.209e-12
shape2 -1.145e-06  4.294e-07 -9.319e-04  1.382e-03  5.203e-12
alpha  -3.074e-11 -1.843e-11  8.209e-12  5.203e-12  4.003e-12
Optimization Information
Convergence: successful
Function Evaluations: 150
```

### Bivariate case

We now fit a bivariate POT model. All of these models are fitted by maximum likelihood.

```
> Mlog <- fitbvgpd(cbind(x, y), c(0, 2), model = "log")
> Mlog
Estimator: MLE
Dependence Model and Strenght:
Model : Logistic
lim_u Pr[ X_1 > u | X_2 > u] = NA
Deviance: 1313.260
AIC: 1323.260
Marginal Threshold: 0 2
Marginal Number Above: 500 500
Marginal Proportion Above: 1 1
Joint Number Above: 500
Joint Proportion Above: 1
Number of events such as {Y1 > u1} U {Y2 > u2}: 500
Estimates
scale1 shape1 scale2 shape2 alpha
0.9814 0.2357 0.5294 -0.2835 0.9993
Standard Errors```

In the summary, we can see that lim_u Pr[X_1 > u | X_2 > u] = 0.02. This is the χ statistic of Coles et al. (1999); for parametric models a closed-form expression is available. For independent variables χ = 0, while under perfect dependence χ = 1. In our application, a value of 0.02 indicates that the variables are independent, which is obvious here. In light of this, some parameters can be fixed.

```
> fitbvgpd(cbind(x, y), c(0, 2), model = "log", alpha = 1)
Dependence Model and Strenght:
Model : Logistic
lim_u Pr[ X_1 > u | X_2 > u] = NA
Deviance: 1312.842
AIC: 1320.842
Marginal Threshold: 0 2
Marginal Number Above: 500 500
Marginal Proportion Above: 1 1
Joint Number Above: 500
Joint Proportion Above: 1
Number of events such as {Y1 > u1} U {Y2 > u2}: 500
Estimates
scale1 shape1 scale2 shape2
0.9972 0.2453 0.5252 -0.2857
Standard Errors
scale1 shape1 scale2 shape2
0.07058 0.05595 0.02903 0.03490
Asymptotic Variance Covariance
           scale1     shape1     scale2     shape2
scale1 4.982e-03 -2.512e-03 -3.004e-13 3.544e-13
shape1 -2.512e-03 3.130e-03 1.961e-13 -2.239e-13
scale2 -3.004e-13 1.961e-13 8.427e-04 -8.542e-04
shape2 3.544e-13 -2.239e-13 -8.542e-04 1.218e-03
Optimization Information
Convergence: successful
Function Evaluations: 35
```

Note that, since all bivariate extreme value distributions are asymptotically dependent, the χ statistic of Coles et al. (1999) is always equal to 1.
Another way to assess the strength of dependence is to plot Pickands' dependence function:

`> pickdep(Mlog)`

The horizontal line corresponds to independence, while the lower border corresponds to perfect dependence. Note that, by construction, the mixed and asymmetric mixed models cannot model perfectly dependent variables.

#### Modeling the dependence structure with Markov chains

The classical approach to peaks-over-threshold analysis is to fit the GPD to cluster maxima. However, this wastes data, since only the cluster maxima are considered. The main idea here is to model the dependence structure with a Markov chain whose joint distribution of consecutive observations is a bivariate extreme value distribution. This idea was first proposed by Smith et al. (1997). In the remainder of this section we focus only on first-order Markov chains, so the likelihood of all the exceedances is

L(y_1, …, y_n; θ, ψ) = ∏_{i=2}^{n} f(y_{i−1}, y_i; θ, ψ) / ∏_{i=2}^{n−1} f(y_i; θ),

where f(·, ·; θ, ψ) is the joint density, f(·; θ) is the marginal density, θ collects the marginal GPD parameters and ψ the dependence parameter.
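This first-order Markov likelihood, the product of the pairwise joint densities of consecutive observations divided by the marginal densities of the interior points, can be assembled generically. The sketch below (our own helper names, not the POT API) checks the bookkeeping in the independent case, where it must collapse to the ordinary iid log-likelihood:

```r
# Log-likelihood of a first-order Markov chain from a pairwise density f2
# and a marginal density f1 (illustrative helper)
mc_loglik <- function(y, f2, f1) {
  n <- length(y)
  sum(log(f2(y[-n], y[-1]))) - sum(log(f1(y[2:(n - 1)])))
}

# Sanity check under independence, f2(a, b) = f1(a) * f1(b):
# the Markov log-likelihood collapses to the iid log-likelihood
set.seed(7)
y  <- rexp(50)
f1 <- dexp
f2 <- function(a, b) dexp(a) * dexp(b)
mc_loglik(y, f2, f1)                     # equals sum(dexp(y, log = TRUE))
```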

For our application, we simulate a first-order Markov chain with an extreme value dependence structure.

```
> fitmcgpd(mc, 2, "log")
Estimator: MLE
Dependence Model and Strenght:
Model : Logistic
lim_u Pr[ X_1 > u | X_2 > u] = NA
Deviance: 1334.847
AIC: 1340.847
Threshold Call:
Number Above: 998
Proportion Above: 1
Estimates
scale shape alpha
0.8723 0.1400 0.5227
Standard Errors
scale shape alpha
0.08291 0.04730 0.02345
Asymptotic Variance Covariance
scale shape alpha
scale 0.0068737 -0.0016808 -0.0009066
shape -0.0016808 0.0022369 -0.0004412
alpha -0.0009066 -0.0004412 0.0005501
Optimization Information
Convergence: successful
Function Evaluations: 70
```

#### Confidence intervals

Confidence intervals are usually given after fitting a statistical model. Currently, confidence intervals can be computed only for the mle, pwmu, pwmb and moment estimators.

```conf.inf.scale conf.sup.scale
2.110881 2.939282

conf.inf.scale conf.sup.scale
1.633362 3.126838

conf.inf.scale conf.sup.scale
2.138851 3.074945

conf.inf.scale conf.sup.scale
2.149123 3.089090```

For shape parameter confidence intervals:

```conf.inf conf.sup
2.141919 NA```
```conf.inf conf.sup
0.05757576 0.26363636```

Confidence intervals for quantiles are also available.

```
conf.inf conf.sup
 8.64712 12.22558
```
```
conf.inf conf.sup
8.944444 12.833333
```

The profile confidence interval function both returns the confidence interval and plots the profile log-likelihood function.

#### Model checking

To check the fitted model, the user can invoke the plot function:

`> plot(fitted, npy = 1)`

The figure shows the graphics window obtained by executing this call.

## Clustering techniques

Peaks over a threshold can be problematic when dealing with time series. Indeed, since time series are usually highly autocorrelated, simply taking all values above the threshold may lead to correlated events.
The clust function attempts to identify peaks over a threshold while meeting an independence criterion. For this, the function requires at least two parameters: the threshold u and a time condition tim.cond for independence. Clusters are identified as follows:
1. the first exceedance starts the first cluster;
2. the first observation below the threshold "ends" the cluster, unless tim.cond does not hold;
3. the next exceedance that satisfies tim.cond starts a new cluster;
4. and so on, iterating as needed.
A preliminary study shows that two flood events can be considered independent if they are separated by more than 8 days. Note that the units in which tim.cond is defined must be the same as those of the data being analyzed.
The function returns a list containing the identified clusters.

`> clust(dat, u = 2, tim.cond = 8/365)`
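A simplified variant of these identification steps, separating clusters purely by the time gap between successive exceedances and keeping each cluster maximum, can be sketched in base R (illustrative only; the real clust function also applies the below-threshold rule):

```r
# Toy declustering: a new cluster starts whenever two exceedances are more
# than tim.cond apart; returns the cluster maxima (not POT's clust)
decluster_max <- function(time, obs, u, tim.cond) {
  exc <- which(obs > u)
  if (length(exc) == 0) return(numeric(0))
  new_clust <- c(TRUE, diff(time[exc]) > tim.cond)
  id <- cumsum(new_clust)                # cluster label for each exceedance
  as.numeric(tapply(obs[exc], id, max))  # maximum of each cluster
}

t_days <- c(1, 2, 3, 10, 11, 30) / 365   # event times in years
flows  <- c(5, 8, 6, 7, 9, 12)
decluster_max(t_days, flows, u = 4, tim.cond = 8 / 365)  # two clusters: 9 and 12
```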

## Other features

#### Return periods

The functions rp2prob and prob2rp convert return periods to non-exceedance probabilities and vice versa. They require either a return period or a non-exceedance probability, and the average number of events per year (npy) must also be specified.
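The underlying arithmetic is elementary: with npy events per year on average, a return period of T years corresponds to the non-exceedance probability 1 − 1/(npy · T). A two-line sketch (our own function names, not the rp2prob/prob2rp API):

```r
# Return period <-> non-exceedance probability (illustrative helpers)
retper2prob <- function(npy, retper) 1 - 1 / (npy * retper)
prob2retper <- function(npy, prob) 1 / (npy * (1 - prob))

retper2prob(1.8, 50)                     # 0.9888889, as in the output below
prob2retper(2.2, 0.6)                    # 1.136364
```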

```
  npy retper      prob
1 1.8     50 0.9888889
```
```
  npy   retper prob
1 2.2 1.136364  0.6
```

#### Unbiased sample L-moments

The function `samlmu` computes unbiased sample L-moments.

```
         l_1          l_2          t_3          t_4          t_5
 0.455381591  0.170423740  0.043928262 -0.005645249 -0.009310069
```

## River Threshold Analysis

In this section, we provide a comprehensive and detailed analysis of river thresholds.

#### Moving average window for time series

The ts2tsd function computes an "averaged" time series tsd from the initial time series ts. This is achieved by running a moving-average window of length d over the initial series.

```
> tsd <- ts2tsd(dat, 3 / 365)   # 3-day moving-average window (illustrative choice)
> plot(dat, type = "l", col = "blue")
> lines(tsd, col = "green")
```

The result is shown in the figure. The graph depicts the instantaneous flood discharges (blue line) and the averaged series (green line).
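The smoothing itself can be mimicked with base R's stats::filter (a sketch of the idea only; the package's own function operates on its flood time-series objects):

```r
# Centred moving average of window length d (in observations)
moving_avg <- function(x, d) stats::filter(x, rep(1 / d, d), sides = 2)

as.numeric(moving_avg(c(1, 2, 3, 4, 5, 6), 3))  # interior points are 3-point centred means
```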

Since this is a time series, we have to select independent events above a threshold. First, we fix a relatively low threshold in order to "extract" more events; some of them will not be extreme but merely routine events. This is necessary to later choose a reasonable threshold for the GPD asymptotic approximation.

```
> par(mfrow = c(2, 2))
> mrlplot(events[, "obs"])
> abline(v = 6, col = "green")
```

According to the figure, a threshold of 6 m³/s should be reasonable. The mean residual life plot (upper left panel) indicates that a threshold of around 10 m³/s should be sufficient. However, the chosen threshold must be low enough that enough events remain above it to reduce the variance, but not so low as to increase the bias.
Therefore, we can now re-extract the events above the threshold 6 m³/s to obtain the object events1. Note that an estimate of the extremal index can also be used.

```
> events1 <- clust(tsd, u = 6, tim.cond = 8 / 365, clust.max = TRUE)
> summary(dat)
      time           obs
 Min.   :1970   Min.   : 0.022
 1st Qu.:1981   1st Qu.: 0.236
 Median :1991   Median : 0.542
 Mean   :1989   Mean   : 1.024
 3rd Qu.:1997   3rd Qu.: 1.230
 Max.   :2004   Max.   :44.200
                NA's   : 1.000
```

The fit results give the name of the estimator, the threshold, the number and proportion of observations above it, the parameter estimates, the standard errors and their type, the asymptotic variance-covariance matrix and convergence diagnostics.
The plot shows graphical diagnostics of the fitted model. As can be seen, the fitted "mle" model seems suitable. Suppose we now want to know the return level associated with the 100-year return period.

```
> rp2prob(100, npy = 1.707897)
npy retper prob
1 1.707897 100 0.9941448
>
scale
36.44331```

The graph depicts distribution confidence intervals for the quantiles associated with the 100-year return period, taking into account uncertainties.

```
conf.inf conf.sup
25.56533 90.76633
```

Sometimes it is necessary to know the estimated return period of a given event. Let us work with the largest event.

```> maxEvent
 44.2

shape
0.9974115
> prob
npy retper prob
1 1.707897 226.1982 0.9974115```

Thus, the return period of the largest event, which occurred in June 2000, is about 226 years.
