This post uses Griliches (1976) data and formulas in Baum’s “An Introduction to Modern Econometrics Using Stata” to compute the 2SLS estimator manually. The goal is to show detailed examples of the elements in the matrices used for estimation.
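The matrix algebra behind the manual 2SLS computation can be sketched in a few lines. The data below are simulated (not the Griliches data), and the variable names are illustrative; the point is the estimator formula itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: x is endogenous (correlated with the error e),
# z is an instrument correlated with x but not with e.
z = rng.normal(size=n)
e = rng.normal(size=n)
x = 0.8 * z + 0.5 * e + rng.normal(size=n)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])   # regressors (with constant)
Z = np.column_stack([np.ones(n), z])   # instruments (with constant)

# 2SLS: b = (X'Pz X)^{-1} X'Pz y, where Pz = Z(Z'Z)^{-1}Z' projects onto Z
Pz = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
b_2sls = np.linalg.inv(X.T @ Pz @ X) @ (X.T @ Pz @ y)

# Equivalent "two-stage" version: regress X on Z, then y on the fitted values
X_hat = Pz @ X
b_two_stage = np.linalg.inv(X_hat.T @ X_hat) @ (X_hat.T @ y)
```

Because the projection matrix Pz is symmetric and idempotent, the one-step matrix formula and the literal two-stage regression produce identical estimates.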

Stata has various routines for panel and time-series data. To use this built-in functionality, the researcher must first declare the data as either panel or time-series using xtset or tsset, respectively. The xtset command requires that the firm identifier and time period together uniquely identify each observation. If this condition is not met, the following error is returned:

. xtset gvkey fyear

repeated time values within panel

r(451);

This post uses data from a web query pulldown from Compustat for all firms in the Fundamentals Annual table between 1980 and 2016 to explore various sources of duplicate observations in the panel dataset.
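The check that xtset performs can be replicated outside Stata. Below is a small pandas sketch using a toy dataset (the gvkey/fyear names follow the Compustat convention above) that flags the rows which would trigger the r(451) error.

```python
import pandas as pd

# Toy panel with one duplicated firm-year, mirroring the r(451) situation.
df = pd.DataFrame({
    "gvkey": [1001, 1001, 1002, 1002, 1002],
    "fyear": [2015, 2016, 2015, 2016, 2016],
    "at":    [10.0, 12.0, 55.0, 60.0, 61.0],
})

# xtset fails when (gvkey, fyear) does not uniquely identify observations;
# keep=False marks every copy of a duplicated firm-year, not just the extras.
dups = df.duplicated(subset=["gvkey", "fyear"], keep=False)
print(df[dups])
```

Inspecting the flagged rows (rather than silently dropping them) is usually the right first step, since different sources of duplication call for different fixes.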

In an ordinary least squares (OLS) regression model, the marginal effect of an independent variable on the dependent variable is simply the regression coefficient estimate reported by the statistical software package. Assume a simple model where y is regressed on x, x takes on values from 1 to 100, and the estimate of B1 is 2 (i.e., y = B0 + B1x + e, where B1 = 2). OLS gives us an average marginal effect that is constant across all values of x. It doesn't matter whether we are predicting y at an x value of 1 or an x value of 100: we use the same marginal effect of 2 times the value of x to predict y in this simple model.
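The constancy of the linear marginal effect is easy to verify numerically; the coefficients below are the hypothetical values from the example, not estimates.

```python
# In the linear model y = b0 + b1*x with b1 = 2, the marginal effect
# dy/dx is the same at every value of x.
b0, b1 = 1.0, 2.0

def predict(x):
    return b0 + b1 * x

# A one-unit change in x moves y by exactly b1, whether x is 1 or 100.
effect_at_1 = predict(2) - predict(1)
effect_at_100 = predict(101) - predict(100)
```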

Various model specifications and functional forms may be used to relax this assumption. For example, specifying ln(y) rather than y as the dependent variable (which estimates a constant **percentage** change in y per unit change in x), including an interaction term in a multivariate model, or including both x and x^2 in a multivariate model all allow for non-constant marginal effects.
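The quadratic case makes the point concretely: with x and x^2 in the model, the marginal effect depends on where x is evaluated. The coefficients below are illustrative, not estimates from any dataset.

```python
# With y = b0 + b1*x + b2*x^2 + e, the marginal effect is
# dy/dx = b1 + 2*b2*x, which varies with x.
b1, b2 = 3.0, -0.01

def marginal_effect(x):
    return b1 + 2 * b2 * x

me_low = marginal_effect(1)     # effect of a unit change near x = 1
me_high = marginal_effect(100)  # effect of a unit change near x = 100
```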

Returning to the simple OLS model, the marginal effect of x on y is a derivative. The model tells us what a one-unit change in x does to y. Since y = B0 + B1x + e, dy/dx = B1. However, for probit and logit models we can't simply look at the regression coefficient estimate and immediately know the marginal effect of a one-unit change in x on y. These are nonlinear models in which different values of x have different marginal effects on y. In the example above where x goes from 1 to 100, the impact on y when x equals 1 will differ from the impact on y when x equals 100. This is completely different from the simple OLS example, where the underlying values of x did not matter and the marginal effect of x on y was always 2.

The sign of the impact x has on y can be read from the statistical software package output for probit and logit models, but the marginal effect cannot. The coefficient estimate is important, but it is only one piece of the marginal effect. This is because the probit model uses the cumulative distribution function (CDF) of the standard normal distribution evaluated at the linear prediction (i.e., B0 + B1x1, commonly referred to as "**XB**" in econometrics texts), and the logit model uses the CDF of the standard logistic distribution evaluated at **XB**. Calculating the same derivative dy/dx for a probit or logit model now requires the chain rule from calculus. The derivative of the CDF of the relevant distribution evaluated at **XB** is 1) the probability density function (PDF) of the relevant distribution (standard normal or standard logistic) at **XB** times 2) the derivative of **XB** with respect to x, which is the coefficient estimate B1 in this simple example. Note that the first factor arises because the PDF is the derivative of the CDF, and the second factor comes from the chain rule.

For example, in the case of the probit model, the marginal effect of x on y is the probability density function (PDF) of the standard normal distribution evaluated at **XB**, multiplied by B1. What follows is a Stata .do file that does the following for both probit and logit models: 1) illustrates that the coefficient estimate is not the marginal effect, 2) calculates the predicted probability "by hand" based on **XB**, 3) calculates the marginal effect at the mean of x "by hand," and 4) calculates the mean marginal effect of x "by hand."
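The probit chain-rule calculation described above can be sketched outside Stata as well. The coefficients b0 and b1 below are hypothetical (in practice they would come from a probit fit); the sketch shows why the marginal effect at the mean of x and the mean of the marginal effects generally differ.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """PDF of the standard normal distribution."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

# Hypothetical probit estimates; the point is the chain rule, not the fit.
b0, b1 = -1.0, 0.5
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

xb = b0 + b1 * x                                 # the linear index "XB"
p_hat = np.array([norm_cdf(v) for v in xb])      # predicted probabilities
me = np.array([norm_pdf(v) for v in xb]) * b1    # marginal effects: pdf(XB) * b1

# Marginal effect at the mean of x vs. the mean of the marginal effects:
me_at_mean = norm_pdf(b0 + b1 * x.mean()) * b1
mean_me = me.mean()
```

Because the PDF is largest near XB = 0 and shrinks in the tails, the marginal effect varies across observations, and averaging the effects is not the same as evaluating the effect at the average x.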

This post contains an example which shows why a degree of freedom is lost each time a regressor is added to an OLS model. The OLS first order conditions, and thinking about OLS as a series of partial derivatives which minimize the sum of squared residuals, are the foundations behind the posted Stata code.
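The intuition can be sketched numerically: the first-order conditions force the residuals to be orthogonal to every regressor, so each added regressor imposes one more linear restriction on the residuals, leaving only n - k of them free. The data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Three regressors (constant, x1, x2), so k = 3.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# The OLS first-order conditions force X'e = 0: one linear restriction
# on the residuals per regressor (k restrictions), so only n - k
# residuals are "free" -- hence n - k degrees of freedom.
restrictions = X.T @ e
```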

An alternative title for this post could easily be “Merging CRSP and COMPUSTAT: Date Considerations.” Say your research question involves how the stock market reacts to new information about particular financial statement data or disclosures. Linking accounting variables to stock returns, stock prices, or perhaps trading volume is necessary for this examination. Multiple date variables are included in various COMPUSTAT tables, but which date should be used to construct the merges and event windows? The short answer is that it depends on your research question and the assumptions behind the research design. I am not advocating for any one date variable. The goal here is to discuss possible choices, where these variables are, and the differences between them. A detailed example is given using an earnings announcement from AAPL.

This post is motivated by a lecture I gave in ACCT 3100: Financial Statement Analysis. The first half of the semester covers basic journal entries, drafting financial statements, and the articulation among various financial statements. The second half of the semester covers the behavior of returns, fundamentals of bonds, and equity valuation.

The textbook I use is Stephen Penman's Financial Statement Analysis and Security Valuation (5th ed). On page 51, Figure 2.3 shows various percentiles of Price-Earnings (P/E) ratios from 1963 to 2010. The students in the course told me at the beginning of the semester that they wanted more practice working with data, and I decided recreating Figure 2.3 would be a good way to accomplish this while also bringing the textbook to life. During a lecture the class compared historical prices from Yahoo! Finance and Google Finance to price data from COMPUSTAT and CRSP. At first glance, there appear to be discrepancies between the various providers of historical price data, but this is often explained by prices that are retroactively adjusted for certain corporate actions (e.g., stock splits). This is shown using CME Group as a detailed example.

Everyone makes mistakes. I saw the same blunders I made on my first few research papers repeated in the PhD student cohorts that followed me. As a committee member on MS Economics and Finance theses, I have noticed that strikingly similar problems tend to surface. These mistakes are not limited to students. I think anyone who has worked with data for a substantial amount of time has had that sinking feeling in the pit of the stomach upon realizing that a programming, data management, or organization problem exists. An old adage in boxing is that the punch a boxer doesn't see coming is the one that does the most damage, and in my opinion this also applies to working with data.

Behavioral researchers in accounting regularly design experiments with two primary independent variables of interest. The researcher creates four separate written cases with various combinations of these two independent variables, and each participant in the study receives one of the four cases for the experiment.

Recent papers (e.g., Bills, Jeter, and Stein (2015); Reichelt and Wang (2010)) have included measures which identify some audit firms as "specialist" or "dominant" auditors. This post includes an example of how to calculate these indicator variables using data obtained from the Audit Analytics Audit Fees online menu query within WRDS.
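One common way to construct such an indicator is by fee-based market share within an industry-year. The pandas sketch below uses toy data with illustrative column names (not Audit Analytics field names), and the "largest fee share" definition is only one of several used in the literature.

```python
import pandas as pd

# Toy audit-fee data; column names are illustrative.
fees = pd.DataFrame({
    "auditor":   ["A", "A", "B", "B", "C"],
    "industry":  [10, 10, 10, 20, 20],
    "year":      [2015, 2015, 2015, 2015, 2015],
    "audit_fee": [100.0, 50.0, 30.0, 80.0, 20.0],
})

# Each auditor's share of total fees within an industry-year.
grp = fees.groupby(["industry", "year", "auditor"], as_index=False)["audit_fee"].sum()
grp["share"] = grp["audit_fee"] / grp.groupby(["industry", "year"])["audit_fee"].transform("sum")

# One common definition: the auditor with the largest fee share in the
# industry-year is flagged as the specialist.
max_share = grp.groupby(["industry", "year"])["share"].transform("max")
grp["specialist"] = (grp["share"] == max_share).astype(int)
```

Papers vary in the grouping (two-digit vs. three-digit industry, national vs. city level) and in the cutoff (largest share vs. a fixed threshold), so the definition should follow the specific study being replicated.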

This post manually calculates standard errors under a variety of assumptions. The example dataset purposely includes a dummy variable that is nonzero in only one cluster, which causes a "." to be reported for the model's F statistic. The post concludes with a manual calculation of the outer product of gradients (OPG) variance estimator.
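A minimal sketch of the manual calculations, for the OLS case with simulated data: the classical variance estimator, the heteroskedasticity-robust (HC1) sandwich, and an OPG-style estimator built from per-observation scores of the OLS log-likelihood. The post's Stata example also covers clustering, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
n, k = X.shape

# Classical (homoskedastic) VCE: s^2 (X'X)^{-1}
s2 = e @ e / (n - k)
V_classical = s2 * XtX_inv

# Heteroskedasticity-robust (HC1) sandwich:
# (n/(n-k)) * (X'X)^{-1} X'diag(e^2)X (X'X)^{-1}
meat = (X * (e**2)[:, None]).T @ X
V_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv

# OPG estimator: with per-observation scores g_i = x_i * e_i / s2
# (from the normal log-likelihood, with s2 plugged in),
# V_opg = (sum_i g_i g_i')^{-1} = (G'G)^{-1}
G = X * (e / s2)[:, None]
V_opg = np.linalg.inv(G.T @ G)
```

Under homoskedasticity all three estimators target the same quantity; they diverge when the error variance is not constant, which is what makes the comparison instructive.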