::: CEE 700/800 SAS Sources - Descriptive Statistics (Univariate) :::

Descriptive Statistics (Univariate)

SAS Source: UNIV.SAS

Description

You can use the SAS source shown below as a template for calculating descriptive statistics for a univariate sample.

The UNIVARIATE procedure produces descriptive statistics for numeric variable(s), including quantiles. It provides great detail on the distribution of a variable. UNIVARIATE can provide: (1) details on the extreme values of a variable, (2) quantiles, (3) frequency tables, (4) several plots to illustrate the distribution, (5) tests of central location, and (6) a test to determine whether the data are normally distributed.

Test of Normality

When the NORMAL option is specified, UNIVARIATE produces a test statistic for the null hypothesis that the input data values are a random sample from a normal distribution.

H₀: Data values are a random sample from a normal distribution
H_a: Data values are NOT a random sample from a normal distribution

or, in brief,

H₀: X ~ N(.)
H_a: X != N(.)

To decide whether to reject the null hypothesis of normality, you only need to examine the probability associated with the test statistic (i.e., the p-value). For the Shapiro-Wilk test, the p-value is labeled Pr<W in the listing. If this value is less than the level of significance you choose (such as 0.05 for a 95% confidence level), the null hypothesis is rejected, and you can conclude that the data do not come from a normal distribution.

The Shapiro-Wilk test is designed for sample sizes less than or equal to 2000, and for such samples UNIVARIATE computes the Shapiro-Wilk statistic, W. W ranges between 0 and 1, and its distribution is highly skewed, with most values piling up near 1, so even a seemingly large W can lead to rejection. For a sample size greater than 2000, the Kolmogorov test (D statistic) is used instead.

For example, from UNIVARIATE SAS Listing,

	W:Normal    0.97542  Pr<W         0.8474

The p-value for the W statistic is 0.8474, which is greater than the conventional level of significance, alpha=0.05. This means that, at the 95% confidence level, there is insufficient evidence to conclude that the data are not normal (i.e., you do not reject the null hypothesis of normality), so we proceed as if the sample came from a normal distribution, N(mean, variance).
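If you would rather not hunt for Pr<W in the listing, the decision rule above can be sketched in SAS itself. This is only a rough sketch under a few assumptions: it needs a SAS release with ODS, it assumes the normality tests arrive in an ODS table named TestsForNormality with columns VarName, Test, Stat and pValue (run PROC CONTENTS on the output data set to confirm the names in your release), and the data set names NormTest and Decision as well as the variables Alpha and Verdict are made up for illustration.

```sas
/* Same ChemHC data as in the listing further down this page */
DATA ChemHC;
INPUT Purity H_Carbon @@;
CARDS;
90.01 0.99   89.05 1.02   91.43 1.15
93.74 1.29   96.73 1.46   94.45 1.36
87.59 0.87   91.77 1.23   99.42 1.55
93.65 1.40   93.54 1.19   92.52 1.15
90.56 0.98   89.54 1.01   89.85 1.11
90.39 1.20   93.25 1.26   93.41 1.32
94.98 1.43   87.33 0.95
;
RUN;

/* Assumption: the ODS table is called TestsForNormality. */
/* NormTest is a hypothetical name for the output data set */
ODS OUTPUT TestsForNormality=NormTest;
PROC UNIVARIATE DATA=ChemHC NORMAL;
  VAR Purity;
RUN;

/* Apply the rule: reject H0 when the p-value < alpha      */
DATA Decision;
  SET NormTest;
  WHERE Test = 'Shapiro-Wilk';
  LENGTH Verdict $ 40;
  Alpha = 0.05;   /* level of significance you choose      */
  IF pValue < Alpha THEN Verdict = 'Reject H0 (not normal)';
  ELSE Verdict = 'Do not reject H0';
RUN;

PROC PRINT DATA=Decision;
  VAR VarName Test Stat pValue Alpha Verdict;
RUN;
```

Change Alpha to 0.10 or 0.01 if you are testing at 90% or 99% confidence instead.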

Layman's Interpretation on Shapiro-Wilk p-value

Let's say that you're playing a game of Blackjack with a dealer. The prize is the normality of the sample data (= truth of the null hypothesis). The rule is rather simple -- whoever holds the higher odds wins. If you win, you can say "Hurray, the sample data is normal!"

If you lose (by having smaller odds than the dealer), the winner, the dealer, takes away the prize (normality of the sample data), and you are stuck with the second prize, the alternative hypothesis, which says the sample data are non-normal. After all, it is as if the dealer were saying, "Pity, the sample data is *NOT* normal for *YOU*."

Now, the dealer's odds of winning this game are fixed at 0.05 (if testing at 95% confidence, which is the case most of the time), and your odds are the p-value of the Shapiro-Wilk W statistic, which depends on the sample data.

Using the p-value from the example above, 0.8474, there is no doubt: you win the game quite comfortably and get the prize (= null hypothesis) of saying "the sample data is normal." (Your odds of 0.8474 are way higher than the dealer's 0.05 -- no contest here.)

O.K., let's think about another scenario. This time, you end up with a Shapiro-Wilk p-value of 0.001. The dealer's odds remain 0.05 (unless you're testing at a different level of confidence, such as 90% or 99%).

This time the dealer wins the game hands down. As a result, the dealer can say with 95% confidence, "the sample data is *NOT* normal. Forget about t-test, regression, ANOVA and all the goodies. You'd better look around for nonparametric statistics for analyzing the sample data, because normality is not applicable here."

One last word of advice -- even if the p-value for the Shapiro-Wilk W statistic leaves you pretty confident about the normality of the sample data, always make a habit of confirming it by double-checking the normal probability plot [in the UNIVARIATE procedure output].
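As a rough sketch of that double-check, the QQPLOT statement of PROC UNIVARIATE can draw a normal quantile-quantile plot with a reference line. The assumptions here are that your SAS release supports the QQPLOT statement with the NORMAL(MU=EST SIGMA=EST) option, and that Purity from the ChemHC data on this page is the variable of interest. Points that hug the reference line support normality; systematic curvature argues against it.

```sas
/* Same ChemHC data as in the listing on this page         */
DATA ChemHC;
INPUT Purity H_Carbon @@;
CARDS;
90.01 0.99   89.05 1.02   91.43 1.15
93.74 1.29   96.73 1.46   94.45 1.36
87.59 0.87   91.77 1.23   99.42 1.55
93.65 1.40   93.54 1.19   92.52 1.15
90.56 0.98   89.54 1.01   89.85 1.11
90.39 1.20   93.25 1.26   93.41 1.32
94.98 1.43   87.33 0.95
;
RUN;

/* NORMAL requests the normality tests; QQPLOT adds a      */
/* quantile-quantile plot with a reference line based on   */
/* the sample mean and standard deviation (MU=EST,         */
/* SIGMA=EST).                                             */
PROC UNIVARIATE DATA=ChemHC NORMAL;
  VAR Purity;
  QQPLOT Purity / NORMAL(MU=EST SIGMA=EST);
RUN;
```

The PLOT option used in the listing on this page produces a normal probability plot as well, so this sketch is just another way of getting the same check.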

SAS Listing

/* Set the max. column of the output to 72.        */
/* If not explicitly defined, 132 column will be   */
/* used as a default which will make it difficult  */
/* to print and read in 8.5"x11" paper. Use 80 or  */
/* less.                                           */

OPTIONS LINESIZE=72;

TITLE1 'Descriptive Statistics Example';
TITLE2 '** UNIVARIATE PROCEDURE **';

/* Name the data set, let's use Chemical Hydro-    */
/* carbon or 'ChemHC' - the data set represents    */
/* a relationship between the purity of oxygen     */
/* produced in a chemical distillation process     */
/* and the percentage of hydrocarbons that are     */
/* present in the main condenser of the            */
/* distillation unit.                              */

/* Data set name could be anything you want, Max.  */
/* 8 characters, and it is case-insensitive        */
DATA ChemHC;

/* Define the order of data variables to be read   */
/* in the data set -- 'Purity' first, then         */
/* 'H_Carbon', repeated until the data run out.    */
/* The trailing @@ holds the input line so that    */
/* several observations are read from each line.   */
INPUT Purity H_Carbon @@;
CARDS;
90.01 0.99   89.05 1.02   91.43 1.15
93.74 1.29   96.73 1.46   94.45 1.36
87.59 0.87   91.77 1.23   99.42 1.55
93.65 1.40   93.54 1.19   92.52 1.15
90.56 0.98   89.54 1.01   89.85 1.11
90.39 1.20   93.25 1.26   93.41 1.32
94.98 1.43   87.33 0.95
;
RUN;

/* Print the original data set for your records    */
PROC PRINT;
RUN;


/* Define Descriptive Statistics to be performed   */
/* NORMAL = normality                              */
/* FREQ   = frequency table                        */
/* PLOT   = stem-and-leaf, box, normal prob. plots */

PROC UNIVARIATE FREQ PLOT NORMAL;
/* Designate a target variable for Descriptive     */
/* Statistics - in this case, the target variable  */
/* is Purity                                       */
  VAR Purity;
RUN;

Oy, a bit overwhelmed? If you remove all the comments from the SAS source code above, it looks quite simple, as shown below, and it still does exactly the same analysis as the (hairy) commented version.

Remember, comments in any source code are there for the sake of your later-day sanity. Please do make a habit of commenting your code.

OPTIONS LINESIZE=72;
TITLE1 'Descriptive Statistics Example';
TITLE2 '** UNIVARIATE PROCEDURE **';
DATA ChemHC;
INPUT Purity H_Carbon @@;
CARDS;
90.01 0.99   89.05 1.02   91.43 1.15
93.74 1.29   96.73 1.46   94.45 1.36
87.59 0.87   91.77 1.23   99.42 1.55
93.65 1.40   93.54 1.19   92.52 1.15
90.56 0.98   89.54 1.01   89.85 1.11
90.39 1.20   93.25 1.26   93.41 1.32
94.98 1.43   87.33 0.95
;
RUN;

PROC PRINT;
RUN;

PROC UNIVARIATE FREQ PLOT NORMAL;
  VAR Purity;
RUN;

Also, if you would like to run the UNIVARIATE procedure for multiple variables (i.e., Purity and H_Carbon in this case) all at once, all you have to do is add VAR statement(s) for the additional variable(s); there is no need to create separate SAS file(s). (See below for examples.)

OPTIONS LINESIZE=72;
TITLE1 'Descriptive Statistics Example';
TITLE2 '** UNIVARIATE PROCEDURE **';
DATA ChemHC;
INPUT Purity H_Carbon @@;
CARDS;
90.01 0.99   89.05 1.02   91.43 1.15
93.74 1.29   96.73 1.46   94.45 1.36
87.59 0.87   91.77 1.23   99.42 1.55
93.65 1.40   93.54 1.19   92.52 1.15
90.56 0.98   89.54 1.01   89.85 1.11
90.39 1.20   93.25 1.26   93.41 1.32
94.98 1.43   87.33 0.95
;
RUN;

PROC PRINT;
RUN;

PROC UNIVARIATE FREQ PLOT NORMAL;
  VAR Purity;
  VAR H_Carbon;
RUN;


/* The VAR statements above can also be merged     */
/* into a single line:                             */
/*                                                 */
/*   VAR Purity H_Carbon;                          */
/*                                                 */

OPTIONS LINESIZE=72;
TITLE1 'Descriptive Statistics/Normality';
TITLE2 '** UNIVARIATE PROCEDURE / Fruits **';
DATA Fruits;
INPUT Apple Grape Orange Pear @@;
CARDS;  
15 16 73 46   28 18 68 56   39 16 74 47
40 12 71 52   59 18 67 34   13 23 73 44
23 22 67 43   32 29 75 55   42 27 72 34
53 23 70 39   13 33 75 47   25 36 68 58
33 30 78 38   45 33 73 38   52 32 68 58 
11 44 73 31   27 49 71 37   37 41 75 39
42 49 75 41   59 40 69 30 
;
RUN;

PROC UNIVARIATE FREQ PLOT NORMAL;
  VAR Apple;
  VAR Grape;
  VAR Orange;
  VAR Pear;
RUN;


/* The VAR statements above can also be merged     */
/* into a single line:                             */
/*                                                 */
/*   VAR Apple Grape Orange Pear;                  */
/*                                                 */

SAS User Guide (SUG) for Statements and Procedures (PROC) used in the Source

	OPTIONS statement
	TITLE statement
	DATA statement
	INPUT statement
	PRINT procedure
	UNIVARIATE procedure
