Data Analysis with Stata

Stata is a popular statistical package for data analysis, used by researchers and practitioners alike. Its most notable users come from epidemiology, econometrics, and sociology. Many renowned academic institutions have adopted Stata as teaching software, including the Harvard Kennedy School, Johns Hopkins Bloomberg School of Public Health, London School of Hygiene & Tropical Medicine, Kellogg School of Management, Northwestern University, and the School of International and Public Affairs, Columbia University (Wikipedia, accessed April 13, 2011).

If you are a beginner, start here!

Yes, every programming language or data analysis package has its own style of writing code (called syntax), and Stata is no exception. If you think Stata is weird, you are not alone; at first, it will feel that way. However, you will likely come to understand Stata commands sooner than those of most other data analysis software, because Stata commands read very naturally, almost like conversational English.

How to read and understand commands here

When Stata commands are reported or published on web pages or in books, it is common practice to begin each command with a dot (.). When you try the command in a Stata console, do not type that dot. There is already a dot in Stata near the blinking cursor.

For example, the following command is shown with a leading dot. Type only the part after the dot in Stata and press Enter to get the output.

. sysuse auto

Simple Summaries

Summarizing

. sysuse auto
. summarize price mpg rep78
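
For a closer look at a single variable, the detail option of summarize adds percentiles, variance, skewness, and kurtosis to the basic summary. A minimal sketch, assuming the auto data are already loaded as above:

. summarize price, detail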

Conditional selection using if

. sysuse auto
. summarize mpg rep78 if price>=10000
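
One point to watch when writing conditions: Stata treats a numeric missing value as larger than any number, so a condition such as rep78 >= 4 is also true for cars whose rep78 is missing. The sketch below, which is an illustration and not part of the original example, first runs the condition as is and then excludes the missing values explicitly with the missing() function.

. sysuse auto, clear
. * rep78 has missing values, which count as larger than any number
. summarize price if rep78 >= 4
. * Exclude the missing values explicitly
. summarize price if rep78 >= 4 & !missing(rep78)

Comparing the number of observations reported by the two commands shows how many cases were included only because rep78 was missing.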

Merging or Linking two Data Sets

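Two data sets can be merged on a common variable. Suppose we have two data sets, data1.dta and data2.dta, and we wish to merge them by a common variable id. For a one-to-one merge, id must uniquely identify the observations in each data set. Older versions of Stata also required both data sets to be sorted by id (or whichever variable you are merging by); with the merge 1:1 syntax Stata sorts the data as needed, so sorting beforehand is harmless but optional.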

. * Sort the using data set by id and save it
. use data2.dta, clear
. sort id
. save data2.dta, replace
 
. * Load the master data set, sort it, and merge
. use data1.dta, clear
. sort id
. merge 1:1 id using data2.dta
 
	Result                   # of obs.
	----------------------------------
	not matched                     0
	matched                        27 (_merge==3)
	----------------------------------

Note that a new variable, _merge, is generated to record the status of the merge for each observation. It takes the value 1 if the observation comes from the master data only, 2 if it comes from the using data set only, and 3 if both data sets contribute. After tabulating _merge, you can safely drop the variable from the newly created (merged) data set.
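
As a minimal sketch of that last step, assuming the merged data from the example above are still in memory:

. * Check how many observations fall into each merge status
. tabulate _merge
. * Remove the indicator once you are satisfied with the result
. drop _merge

Dropping _merge also avoids an error on a later merge, since merge refuses to run if a variable named _merge already exists.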

Save and Export Data

Saving the data

To save the data after modification, use the save command. It is best not to overwrite any system data set; save the modified data under a new name instead.

. sysuse auto
. save auto2.dta, replace
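
For exporting, a minimal sketch, assuming a recent version of Stata in which the export delimited command is available: the data in memory can be written to a comma-separated text file that other programs can read. The file name auto2.csv is only an illustration.

. sysuse auto, clear
. * Write the data in memory to a CSV file in the current working directory
. export delimited using auto2.csv, replace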

Missing Data Handling

Missing values play a crucial role in any data analysis. One must make every effort to check whether the presence of missing values has any impact on the results. I cannot emphasize enough how important it is to read the relevant sections of the help file or manual on how to read and interpret missing values in Stata. Every statistical package has its own way of handling missing values.

Consider the following example. We are using a data set available on the Stata website, which is why we load it with the command webuse.

. webuse studentsurvey.dta, clear
(Student Survey)
 
. misstable patterns, frequency
 
   Missing-value patterns
     (1 means complete)
 
              |   Pattern
    Frequency |  1  2  3
  ------------+-------------
          116 |  1  1  1
              |
            6 |  1  1  0
            3 |  0  0  0
  ------------+-------------
          125 |
 
  Variables are  (1) age  (2) female  (3) dept

The table shows that there are 125 observations (cases) in the studentsurvey data set. The pattern columns correspond to the variables listed at the bottom of the table. That is,

Pattern 1 corresponds to the variable age
Pattern 2 corresponds to the variable female
Pattern 3 corresponds to the variable dept

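The Frequency column has four rows: a first row, a blank row, a third row, and a fourth row.

The first row shows that 116 cases are complete on all three variables. The third row shows 6 cases in which only dept is missing, and the fourth row shows 3 cases in which all three variables are missing. Together, the third and fourth rows tell us that there are 9 cases with missing values in one or more of the three variables.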
MORE...
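
As a complementary check, and only as a sketch, misstable summarize reports how many values are missing in each variable, and count with the missing() function counts the cases that are missing on at least one of the listed variables. With the studentsurvey data still in memory, the count should agree with the 9 cases identified above.

. * Number of missing values, variable by variable
. misstable summarize
. * Cases with a missing value in at least one of the three variables
. count if missing(age, female, dept)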

Using conditions within misstable

The following command was run on a local data set that cannot be released, but the same syntax applies to your own data.

. misstable patterns if (cyear>=2001 & cyear<2008), freq
 
   Missing-value patterns
     (1 means complete)
 
              |   Pattern
    Frequency |  1  2
  ------------+-------------
            9 |  1  1
              |
            9 |  1  0
            2 |  0  0
            1 |  0  1
  ------------+-------------
           21 |
 
  Variables are  (1) ta45_49  (2) ta50up

Miscellaneous Tips

Copy or export a table from Stata to Excel

You can produce output in tabular form and transfer it to Excel or a word processor while preserving its structure. Follow the example below, which uses the auto data.

. sysuse auto, clear
(1978 Automobile Data)
 
. tabulate foreign
 
   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

Select the output and use Copy Table from the Edit menu, or press Shift+Ctrl+C if you prefer the keyboard. Then paste it into Excel. This preserves the table structure, which you can then copy to your word processor.


More to come

Comments

Submitted by Anonymous (not verified) on

In the above-mentioned models, if the response is recorded as a number of successes out of a total number of trials, how can we use Stata for these models? I saw in the menu that, to run a probit, the dependent variable must take the value 1 for success and 0 for failure. The analysis I want is available in SPSS 16.0.

I hope you will solve my problem.

Name: Rajendra

Author information

Enayetur Raheem

Biography

The author is a PhD candidate majoring in Statistics at the University of Windsor, Windsor, Ontario, Canada. He is the founder of statlter.com.
