The Statistical Analysis System (SAS) offers: a powerful, high-level
programming language; a large set of built-in mathematical and
statistical functions (many more than Fortran provides); a
state-of-the-art statistical package; an interactive matrix language;
time series analysis; model fitting by OLS, GLS, and MLE for many
problems, both linear and nonlinear; graphics; and report writing. In
most IBM environments (SAS was initially written in PL/1, a language
for IBM machines), SAS quickly led to the demise of SPSS and BMD
because it was not simply a statistical package; it was also a
high-level programming language like FORTRAN. Consequently, it was
possible to manipulate and manage data prior to an analysis without
writing several different programs to get the data into the proper
format for SPSS. SAS can also produce tables that are manuscript ready.
A SAS beginner, however, pays for the flexibility by confronting such a large series of options and operations that the language appears almost impossible to learn. This document is meant as an introduction that highlights the most important commands for those interested in using SAS more for number crunching than for fancy report writing. For the most part, the statements included here will get you through SAS with no problem. It is suggested that you go over this document to become familiar with the SAS statements that will be most useful to you, and then consult the SAS manuals for further information about these statements.
SAS statements may be in upper or lower case and may begin on any
column. SAS statements may also extend across lines, and more than
one SAS statement may appear on a single line. SAS statements always
end with a semicolon (;). Thus, the following two sets of statements
are equivalent:
IF SEX=3 THEN DO;
PUT 'BAD SEX FOR ID NUMBER' ID;
DELETE;
END;
/* the following statements are equivalent to those given above */
IF
SEX=3
THEN DO ;
PUT 'BAD SEX FOR ID NUMBER' ID; DELETE; END;
There are two general steps in SAS. The first is the DATA step, in which data are read in, manipulated, edited, etc. The second is the PROC or procedure step, in which some statistical procedure (e.g., MEANS, ANOVA) is performed on the data. Any number of DATA and PROC steps can occur in a single program. For example, one can read a set of data in the first DATA step, perform a regression (PROC REG) that outputs predicted values and standardized residuals to the data, use a second DATA step to remove outliers, do another PROC REG without the outliers, and merge the full data set with an existing SAS data file in a third DATA step. The two steps are discussed in turn.
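The sequence just described might look roughly like the sketch below. The file name RAWDATA.DAT, the variable names (IDNUM, X, Y, PRED, ZRESID), the cutoff of 3, and the data set MASTER (standing in for an existing SAS data set already sorted by IDNUM) are all assumptions made for illustration.

```sas
/* FIRST DATA STEP: READ THE RAW DATA (HYPOTHETICAL FILE AND VARIABLES) */
DATA RAW;
  INFILE 'RAWDATA.DAT';
  INPUT IDNUM X Y;
RUN;

/* FIRST REGRESSION: SAVE PREDICTED VALUES AND STUDENTIZED RESIDUALS */
PROC REG DATA=RAW;
  MODEL Y = X;
  OUTPUT OUT=FITTED P=PRED STUDENT=ZRESID;
RUN;
QUIT;

/* SECOND DATA STEP: REMOVE OUTLIERS (CUTOFF OF 3 IS ARBITRARY) */
DATA CLEAN;
  SET FITTED;
  IF ABS(ZRESID) GT 3 THEN DELETE;
RUN;

/* SECOND REGRESSION: REFIT WITHOUT THE OUTLIERS */
PROC REG DATA=CLEAN;
  MODEL Y = X;
RUN;
QUIT;

/* THIRD DATA STEP: MERGE THE FULL DATA SET WITH AN EXISTING ONE.
   MASTER IS ASSUMED TO EXIST AND TO BE SORTED BY IDNUM */
PROC SORT DATA=FITTED; BY IDNUM; RUN;
DATA COMBINED;
  MERGE MASTER FITTED;
  BY IDNUM;
RUN;
```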
There are 4 basic uses of the SAS DATA step: (1) getting data into
a SAS data set; (2) manipulating the data; (3) managing the data set;
(4) creating data. Each is discussed below:
(1) DATA STEP: Getting data into a SAS data set:
(1.a) There are three ways of getting data into a SAS data set.
They are:
1. Including the data in the SAS command stream. In this
sense the data are like a card deck placed into the stream of SAS
commands. Use an INPUT command to list the variables and a
CARDS statement right before the data to be read in.
DATA CARDSIN;
INPUT IDNUM SEX AGE;
CARDS;
1 1 25
2 2 33
4 1 55
;
2. Read the data in from a disk file. Here one uses the
INFILE command to name the disk area with the data and the
INPUT command to list the variables.
DATA DISKIN;
INFILE 'RAWDATA.DAT';
INPUT IDNUM SEX AGE;
3. Create a new data set from an existing SAS data set.
Here, the SET command is used to name the existing SAS data
set. The following example creates two new SAS data sets from an
existing SAS data set:
DATA FATHERS MOTHERS; SET PARENTS;
IF SEX=1 THEN OUTPUT FATHERS;
ELSE OUTPUT MOTHERS;
(1.b) INPUT statement. There are three forms of the INPUT statement used for reading in raw data: (1) free format input; (2) columnwise input; and (3) formatted input.
1. Free formatted input: Just list the variables. On the
raw data file, each case must start on a separate record and each
variable must be separated by at least one space. For example,
INPUT IDNUM SEX AGE;
2. Columnwise input: The INPUT statement lists each variable followed by its inclusive column numbers. For example, INPUT IDNUM 1-4 SEX 6-6 AGE 8-9;
3. Formatted input: This lists the variables in parentheses, then the format, also in parentheses. For example, INPUT (IDNUM SEX AGE) (4. +1 1. +1 2.);
SAS does not distinguish between integers and real numbers. It
treats all numeric variables as real. Thus, a "4." is equivalent to
an "F4.0" in FORTRAN, a "4.3" to an "F4.3", etc. In formatted input,
SAS uses a "+n" to skip n columns, a "/" to skip to the next record,
an "@n" to move to column n, and a "#n" to move to record n. Thus the
following FORTRAN statement and the two SAS statements are equivalent:
READ(1, '(I2,F8.3,T20,F10.6/T10,I8)') ITEM,SCORE,X,ISEX
INPUT (ITEM SCORE X SEX) (2. 8.3 @20 10.6 / @10 8.);
or
INPUT (ITEM SCORE X SEX) (#1 2. 8.3 @20 10.6 #2 @10 8.);
By using the INFILE N=n statement, one can also skip around on
formatted input. For example, you can read one variable from columns
6-10 on the first record, read the second variable from columns 77-78
of the fourth record, go back to the first record and read columns
1-2, etc.
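A minimal sketch of the reading pattern just described, assuming each case occupies four records. The file name and the variable names SCORE1, SCORE2, and GROUP are hypothetical.

```sas
DATA SKIP;
  /* N=4 MAKES FOUR RECORDS AVAILABLE TO THE INPUT POINTER AT ONCE */
  INFILE 'RAWDATA.DAT' N=4;
  INPUT #1 @6  SCORE1 5.     /* RECORD 1, COLUMNS 6-10        */
        #4 @77 SCORE2 2.     /* RECORD 4, COLUMNS 77-78       */
        #1 @1  GROUP  2.;    /* BACK TO RECORD 1, COLUMNS 1-2 */
RUN;
```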
NOTE WELL: In SAS formatted input, the format list is processed
only until every variable in the variable list has been read.
Thus, slashes ("/") at the end of a format are NOT processed
and SAS will not skip records the way some FORTRAN dialects will. The
following statements may NOT be equivalent:
READ(1,'(I2, F8.3/)') ITEM,SCORE
INPUT (ITEM SCORE) (2. 8.3 /);
(2) DATA STEP: Manipulating data
SAS is a very rich language for data manipulation. The most
useful commands are given here. See the SAS manual for their
syntax.
ARRAY Works like a DIMENSION statement in Fortran. For
example,
ARRAY A [0:50] A0-A50; /* SUBSCRIPTS START AT 1 UNLESS BOUNDS ARE GIVEN */
ARRAY B [0:50] B0-B50;
DO I=0 TO 50 BY 2;
A[I] = B[I];
END;
DELETE Deletes the observation from a data set.
IF SEX NE 'MALE' AND SEX NE 'FEMALE' THEN DELETE;
DO ... END Groups statements so that they are executed together; with an index variable, it works like a DO loop in Fortran.
IF SEX NE 'MALE' AND SEX NE 'FEMALE' THEN DO;
PUT 'VALUE OF SEX IN ERROR FOR ID NUMBER' ID;
DELETE;
RETURN;
END;
OUTPUT Outputs the observation to the current data set.
MISSING Defines missing values. By default, SAS uses a period (.) as a missing value.
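As an illustration of the MISSING statement, the sketch below declares two letters as special missing values in the raw data. The codes R and N, their meanings, and the variable names are made up for the example.

```sas
DATA SURVEY;
  /* R = REFUSED, N = NOT ASKED (HYPOTHETICAL CODES) */
  MISSING R N;
  INPUT IDNUM AGE INCOME;
  CARDS;
1 25 31000
2 R  42000
3 40 N
4 .  27500
;
RUN;
```

Without the MISSING statement, SAS would treat the letters as invalid numeric data and set the values to the ordinary missing value (.).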
GO TO Like a GOTO in BASIC or FORTRAN, but SAS statements
are given name labels that end with a colon.
IF RESID GT 3 THEN GOTO OUTLIER;
<other SAS statements in here>
RETURN; /* KEEPS OTHER OBSERVATIONS FROM FALLING THROUGH TO OUTLIER */
OUTLIER:
PUT 'ID NUMBER' ID 'IS AN OUTLIER. ';
RESID= .;
DELETE;
RETURN;
LINK...RETURN This is analogous but not identical to a
subroutine call. Processing jumps to the labeled statements and then
returns to the SAS statement after the LINK statement. Be careful
about the RETURN statement. A RETURN statement that is not part of a
LINK series begins processing of a new observation. A RETURN statement
after a LINK returns to the statement following its call.
RETAIN SUMX 0 LASTID; /* CARRY VALUES FROM ONE OBSERVATION TO THE NEXT */
IF IDNUM NE LASTID THEN LINK NEWID;
SUMX=SUMX+X;
RETURN;
NEWID:
OUTPUT;
SUMX=0;
LASTID=IDNUM;
RETURN;
SELECT An economical way to write a number of IF ... THEN statements.
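A short sketch of a SELECT group, using the PARENTS data set and the SEX codes from the SET example earlier. The variable SEXLABEL and its values are hypothetical.

```sas
DATA CODED;
  SET PARENTS;
  LENGTH SEXLABEL $ 7;
  /* ONE WHEN CLAUSE PER VALUE REPLACES A CHAIN OF IF ... THEN ... ELSE */
  SELECT (SEX);
    WHEN (1) SEXLABEL = 'FATHER';
    WHEN (2) SEXLABEL = 'MOTHER';
    OTHERWISE SEXLABEL = 'UNKNOWN';
  END;
RUN;
```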
PUT Equivalent of WRITE or PRINT in Fortran. If a
FILE statement is used, the output goes to the named file;
otherwise, it goes to the print file. The PUT statement has the same
3 syntax forms as the INPUT statement: free form, columnwise, and
formatted.
FILE 'SASOUT.DAT';
PUT (ID SEX AGE) (4. 2. 3.);
(3) DATA STEP: Data Management
SAS data sets can exist only for the duration of a single job or
they can be stored on a more permanent basis like an SPSS system
file. Permanent SAS data sets are very easy to manipulate, update,
etc. But when a SAS data set is first made permanent or later
updated, it is a good idea to manage the data to reduce storage
costs. The following statements are useful in this regard:
DROP Drop the variables from the data set. Useful for
variables only used for temporary programming convenience.
DROP I J TEST1-TEST5 ABX--XXY;
KEEP Keeps the listed variables. Opposite of DROP.
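For example, assuming the PARENTS data set contains IDNUM, SEX, and AGE among other variables, the following sketch writes only those three variables to a new data set. The name SMALL is hypothetical.

```sas
DATA SMALL;
  SET PARENTS;
  /* ONLY THE LISTED VARIABLES ARE WRITTEN TO SMALL */
  KEEP IDNUM SEX AGE;
RUN;
```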
LENGTH Sets the storage size for variables. If you are
creating or analyzing large SAS data sets, then it is highly
recommended that a length statement be used for integer type data
because all SAS variables default to real variable type storage (and
thus take up more room). The following statement sets the 566
Minnesota Multiphasic Personality Inventory items to three bytes of
storage each:
LENGTH MMPI1-MMPI566 3;
(4) DATA STEP: Data creation
The SAS DATA step can also be used for general programming when
no "real" data are used. SAS has a wide variety of built-in functions
that can generate Monte Carlo data or compute a statistic that is hard
to calculate by hand. See the SAS manual for the functions.
DATA _NULL_;
/* THIS GIVES THE PROBABILITY THAT A CHI SQUARE WITH 36 DEGREES OF
FREEDOM IS LESS THAN OR EQUAL TO 27.4275 */
P1=PROBCHI(27.4275,36); PUT P1=;
/* THIS GIVES THE SAME, BUT WITH A NONCENTRALITY PARAMETER OF 10 */
P2=PROBCHI(27.4275,36,10); PUT P2=;
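Monte Carlo data can be generated in the same way. The sketch below, which is only one of many possible approaches, uses the RANNOR function to create 1000 pairs of standard normal scores with a population correlation of 0.5. The seed, the sample size, the correlation, and the names MONTE, REP, X, and Y are arbitrary choices for the example.

```sas
DATA MONTE;
  DO REP = 1 TO 1000;
    /* RANNOR RETURNS A STANDARD NORMAL DEVIATE; 12345 IS AN ARBITRARY SEED */
    X = RANNOR(12345);
    /* Y SHARES VARIANCE WITH X SO THAT THE POPULATION CORRELATION IS 0.5 */
    Y = 0.5*X + SQRT(1 - 0.5**2)*RANNOR(12345);
    OUTPUT;
  END;
RUN;

/* THE SAMPLE CORRELATION SHOULD BE CLOSE TO 0.5 */
PROC CORR DATA=MONTE;
  VAR X Y;
RUN;
```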
The PROC step calls up a SAS procedure. Some SAS procedures are
for data management, others are for statistical analysis. Several SAS
specialty packages exist for specialized graphics, data management
(along the lines of dBASE-III), communications, econometric &
time series analysis, computer-based training, census tract and SMSAs
data bases and directories, etc. SAS can convert SPSS, BMD-P, OSIRIS
and other data sets into SAS format with a one line command. It can
call up a BMDP program to analyze data in a SAS data set. It will
even draw you a map of the wine growing regions of Portugal, if you
so desire. Only the basic PROCs are listed here. SAS often has
several PROCs for most types of analyses; one is a general purpose
PROC, the others are more efficient subtypes. PROC steps may also
produce specialized data sets that can be input into other SAS
procedures (e.g., using PROC FACTOR to produce a factor data set that
then can be used to try different rotational strategies, different
numbers of extracted factors, etc.).
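The PROC FACTOR case mentioned above might be sketched as follows. The data set SCORES, the variables TEST1-TEST5, the choice of two factors, and the particular rotations are assumptions for illustration; check the PROC FACTOR documentation for the options available in your release.

```sas
/* EXTRACT THE FACTORS ONCE AND SAVE THE RESULTS IN DATA SET FACT */
PROC FACTOR DATA=SCORES METHOD=PRIN NFACTORS=2 OUTSTAT=FACT;
  VAR TEST1-TEST5;
RUN;

/* READ THE SAVED FACTOR DATA SET BACK IN TO TRY DIFFERENT ROTATIONS */
PROC FACTOR DATA=FACT ROTATE=VARIMAX;
RUN;

PROC FACTOR DATA=FACT ROTATE=PROMAX;
RUN;
```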
PROC SORT         Sorts a data set by one or more variables. PROC SORT;
                  BY ID; will sort the data set by the values of the
                  variable ID.
PROC CONTENTS     Displays the contents of the data set.
PROC DATASETS     Manages SAS data set libraries.
PROC RANK         Rank orders one or more variables.
PROC STANDARD     Rescales variables to a specified mean and/or standard
                  deviation.
PROC SCORE        Generates linear scores for certain procedures like
                  factor analysis and discriminant analysis.
PROC TRANSPOSE    Transposes a data set.

PROC CHART        Pie, bar, and star charts.
PROC PLOT         Two dimensional plots.

PROC FREQ         Simple frequencies and contingency tables for
                  categorical variables.
PROC MEANS        Number of observations, mean, standard deviation, and
                  minimum and maximum values for continuous variables.
PROC UNIVARIATE   More detailed descriptive statistics for continuous
                  variables.
PROC TABULATE     Produces tables of frequencies and/or descriptive
                  statistics.
PROC SUMMARY      Descriptive statistics broken down by groups;
                  particularly useful for generating a data set of
                  descriptive statistics for input into other procedures.
PROC CORR         Parametric and nonparametric correlations.

PROC REG          General purpose linear regression and multivariate
                  regression.
PROC GLM          General linear models, including regression, analysis
                  of variance/covariance, and multivariate analysis of
                  variance/covariance.
PROC RSQUARE      All possible subsets regression.
PROC RSREG        Quadratic response surface regression.
PROC LOGISTIC     Logistic regression.
PROC PROBIT       Probit regression.

PROC ANOVA        Analysis of variance for orthogonal data.
PROC GLM          General linear models, including regression, analysis
                  of variance/covariance, and multivariate analysis of
                  variance/covariance.
PROC NESTED       Nested analysis of variance.
PROC VARCOMP      Variance components.

PROC DISCRIM      General purpose parametric and nonparametric
                  discriminant analysis.
PROC CANDISC      Canonical discriminant analysis.

PROC PRINCOMP     Principal components.
PROC FACTOR       Factor analysis.

PROC LIFETEST     Nonparametric survival estimates and life tables.
PROC LIFEREG      Parametric survival analysis.

PROC CLUSTER      Clustering observations.
PROC FASTCLUS     Disjoint clustering for large data sets.
PROC VARCLUS      Clustering variables.
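As a small example of a PROC that produces a data set for later use, the sketch below has PROC SUMMARY write group means and standard deviations to a new data set. It assumes the PARENTS data set from the DATA step examples contains SEX and AGE; the names AGESTATS, MEANAGE, and SDAGE are hypothetical.

```sas
/* ONE OUTPUT OBSERVATION PER VALUE OF SEX (NWAY DROPS THE OVERALL ROW) */
PROC SUMMARY DATA=PARENTS NWAY;
  CLASS SEX;
  VAR AGE;
  OUTPUT OUT=AGESTATS MEAN=MEANAGE STD=SDAGE;
RUN;

/* THE RESULT IS AN ORDINARY SAS DATA SET THAT OTHER STEPS CAN READ */
PROC PRINT DATA=AGESTATS;
RUN;
```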
To illustrate the utility of the SAS data step used in conjunction
with various SAS procedures, consider the problem of getting the
correlation matrices for a multivariate twin analysis of the National
Merit Twin data on the National Merit test. SAS is used to "double
enter" the twins, creating an intraclass matrix. The data are then
sorted by sex and zygosity, the correlation matrices are calculated,
and output to a raw data file so that model fitting can be done. (SAS
can also efficiently do model fitting with IML, but that is too
advanced for our purposes here). Finally, other MZ and DZ matrices
are computed by pooling males with females and output.
DATA NMTWINS;
INFILE 'NMTDIR\NMTNMT.DAT' N=2;
/* A PAIR OF TWINS IS READ IN. THE NATIONAL MERIT SCORES FOR TWIN A
ARE MNEMONIC, THOSE FOR TWIN B ARE CALLED NMT1 THROUGH NMT5 */
INPUT (SEX ZYG ENGLISH MATH SOCSCI NATSCI VOCAB NMT1-NMT5)
(@6 2*2. @10 5*2. #2 @10 5*2.);
ARRAY TA [5] ENGLISH--VOCAB;
ARRAY TB [5] NMT1-NMT5;
OUTPUT;
/* REVERSE TWIN SCORES */
DO I=1 TO 5;
TEMP=TA[I];
TA[I]=TB[I];
TB[I]=TEMP;
END;
DROP TEMP I;
OUTPUT;
/* SORT THE DATA SET BY SEX AND ZYGOSITY */
PROC SORT; BY SEX ZYG;
/* GET THE CORRELATION MATRICES FOR EACH SEX & ZYGOSITY COMBINATION
AND SAVE THE MATRICES IN DATA SET TWINCORR. OUTP= NAMES THE DATA SET
THAT RECEIVES THE PEARSON CORRELATIONS */
PROC CORR OUTP=TWINCORR;
BY SEX ZYG;
VAR ENGLISH--NMT5;
/* WRITE A VECTOR OF STANDARD DEVIATIONS FOLLOWED BY THE CORRELATION
MATRIX ONTO DISKFILE NMTNMT.COR */
DATA _NULL_;
SET TWINCORR;
IF _TYPE_='STD' OR _TYPE_='CORR';
FILE 'C:\NMTDIR\NMTNMT.COR';
PUT (ENGLISH--NMT5) (5*12.8/5*12.8);
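/* POOL MALES & FEMALES. STANDARDIZE BY SEX TO REMOVE MEAN GENDER
DIFFERENCES. MEAN= AND STD= GIVE THE TARGET MEAN AND STANDARD
DEVIATION; WITHOUT THEM PROC STANDARD LEAVES THE DATA UNCHANGED */
PROC STANDARD DATA=NMTWINS MEAN=0 STD=1;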
BY SEX;
VAR ENGLISH--NMT5;
PROC SORT;
BY ZYG;
PROC CORR OUT=TWINCOR2;
BY ZYG;
VAR ENGLISH--NMT5;
DATA _NULL_;
SET TWINCOR2;
IF _TYPE_='STD' OR _TYPE_='CORR';
FILE 'NMTDIR\NMTNMT2.COR';
PUT (ENGLISH--NMT5) (5*12.8/5*12.8);