I’ve been working on an ETL project that needs to receive Excel files from multiple companies, process them, and then hand them off to a SQL server. Each company has a separate directory where an Excel file will land at an unspecified time each month. The listener program will watch the folder, and once an Excel file that has not been previously processed lands in the folder, the listener will hand the file off to a second program that processes the file and gets it ready for the SQL import.
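
The listener step described above can be sketched as a simple polling function. This is a minimal illustration, not the code from the project: the function names, the use of file names to track what has been processed, and the `handler` callback are all assumptions made for the example.

```python
from pathlib import Path
import tempfile

EXCEL_SUFFIXES = {".xlsx", ".xls"}  # assumption: only these extensions count as Excel files


def find_new_excel_files(folder, processed):
    """Return Excel files in `folder` whose names are not yet in `processed`."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in EXCEL_SUFFIXES and p.name not in processed
    )


def poll_once(folder, processed, handler):
    """Hand each unprocessed file to `handler` (hypothetical second program), then record it."""
    handed_off = []
    for path in find_new_excel_files(folder, processed):
        handler(path)
        processed.add(path.name)
        handed_off.append(path.name)
    return handed_off


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "january.xlsx").write_text("placeholder")
    seen = set()
    print(poll_once(demo_dir, seen, lambda p: None))  # first pass picks up the file
    print(poll_once(demo_dir, seen, lambda p: None))  # second pass finds nothing new
```

Tracking by file name alone would skip a file that reuses last month's name, so a real listener would probably need a stronger key, such as a content hash or the modification time.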


A common question in business schools is how to best prepare students for the data analysis skills that are increasingly important to employers. Faculty meetings often revolve around what skills are actually needed, which tools should be taught, which departments or professors should do the teaching, and whether the optimal approach is to require separate courses or embed analytics in existing courses.

Related discussions include whether teaching students to code is necessary or whether GUI-based solutions are sufficient, the extent to which courses should focus on visualizations, how deeply to go into statistics, whether the current math prerequisites are doing much of anything, how much math is really even necessary for analytics, and where database management skills fit into all of this.


Friday night I gave a talk to business school students about machine learning. The goal of the talk was to put some context around the topic using detailed examples. The talk began with exploratory data analysis, examining summary statistics, and checking the dataset for erroneous observations (e.g., negative prices). The dataset contains housing prices and the characteristics of each house: size, age, etc. I also included two irrelevant variables: final grades from my undergraduate auditing courses and a randomly generated variable.


Benford’s Law describes an expected frequency distribution of the digits in naturally occurring numbers based on their position. For example, the number 2934 has a “2” in the first position, a “9” in the second position, etc. Benford’s Law states that a “1” in the first position occurs 30.1% of the time, a “2” in the first position occurs 17.6% of the time, and a “3” in the first position occurs 12.5% of the time. The probabilities decrease steadily as the digit in the first position increases, ending with a “9” in the first position occurring 4.6% of the time.
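
These percentages come from the standard first-digit formula, P(d) = log10(1 + 1/d) for d = 1 through 9. A short sketch that reproduces them (the function name is just for illustration):

```python
import math


def benford_first_digit(d):
    """Expected share of numbers whose first digit is d under Benford's Law."""
    return math.log10(1 + 1 / d)


expected = {d: benford_first_digit(d) for d in range(1, 10)}

for digit, share in expected.items():
    print(f"{digit}: {share:.1%}")
```

The nine probabilities sum to one because the product of the (1 + 1/d) terms telescopes to 10.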

In “Fraud Examination,” a textbook by Albrecht et al. published by Cengage, the authors write, “According to Benford’s Law, the first digit of random data sets will begin with a 1 more often than with a 2, a 2 more often than with a 3, and so on. In fact, Benford’s Law accurately predicts for many kinds of financial data that the first digits of each group of numbers in a set of random numbers will conform to the predicted distribution pattern.” Albrecht et al. go on to explain that Benford’s Law does not apply to assigned or sequentially generated numbers (e.g., invoice numbers, SSNs). The authors also write that Benford’s Law applies to lake sizes, stock prices, and accounting numbers.

In this post we will:

  1. Generate random numbers from three different distributions using Excel and see how closely they follow Benford’s Law.
  2. Discuss the underlying distributions of random numbers that we would and would not expect to follow Benford’s Law.
  3. Use CRSP and COMPUSTAT data to see if the following data conform to Benford’s Law: stock price, total assets, revenue, and total liabilities.
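
The first item can be sketched in Python rather than Excel. The two distributions below (a uniform on 0 to 1 and a wide lognormal) are assumptions chosen for illustration and are not necessarily the ones used in the post. The idea is to tabulate first digits and measure the largest gap from the Benford probabilities.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def first_digit_freq(values):
    """Relative frequency of first digits 1..9 among the positive values."""
    counts = [0] * 9
    n = 0
    for v in values:
        if v > 0:
            counts[first_digit(v) - 1] += 1
            n += 1
    return [c / n for c in counts]


def max_gap(freq):
    """Largest absolute difference between observed and Benford frequencies."""
    return max(abs(f - b) for f, b in zip(freq, BENFORD))


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 50_000
uniform_draws = [rng.uniform(0, 1) for _ in range(n)]
lognormal_draws = [rng.lognormvariate(0, 3) for _ in range(n)]  # spans many orders of magnitude

gap_uniform = max_gap(first_digit_freq(uniform_draws))
gap_lognormal = max_gap(first_digit_freq(lognormal_draws))
print(f"uniform max gap:   {gap_uniform:.3f}")
print(f"lognormal max gap: {gap_lognormal:.3f}")
```

Data spread across several orders of magnitude tend to land close to Benford's Law, while a uniform draw on a single interval gives each first digit roughly the same share.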


With the exception of 2018, the S&P 500 has performed well over recent years. My question is: how representative is this of the overall market? Daniel Kahneman talks about “base rates” in his book Thinking, Fast and Slow and the mistakes that people make when they do not correctly consider base rates in decision making.

Assuming most investors buy shares of companies they believe will increase in value, how often does this happen on average? What percentage of stocks have share prices that have increased in value each year over a given period? Naturally, this assumption ignores dividends and diversification considerations, but neither of those affects the base rate calculation.
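
As a rough illustration of the base rate being asked about, the calculation for a single year is the share of stocks whose ending price exceeds the starting price. The tickers and prices below are made up for the example, and the function name is hypothetical.

```python
def share_of_gainers(prices_by_ticker):
    """Fraction of stocks whose end-of-year price is above the start-of-year price.

    prices_by_ticker maps a ticker to a (start_price, end_price) pair.
    """
    gainers = sum(1 for start, end in prices_by_ticker.values() if end > start)
    return gainers / len(prices_by_ticker)


# Hypothetical prices for one year; a flat price does not count as an increase.
sample_year = {
    "AAA": (10.0, 12.5),
    "BBB": (40.0, 38.0),
    "CCC": (5.0, 5.0),
    "DDD": (22.0, 30.0),
}

base_rate = share_of_gainers(sample_year)
print(f"{base_rate:.0%} of stocks rose")
```

Repeating this for each year in the sample period would give the series of annual base rates.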


There are 10 quizzes throughout the semester in my undergraduate Auditing course. I “drop” the lowest quiz score before calculating the final grades for my students because I realize that people get sick, personal issues arise, cars break down, etc. Additionally, my policy is that if a student misses the midterm for any reason, then the final exam grade counts twice: once for the final exam and once for the missed midterm.

This link contains Python code to 1) import grades from Blackboard, 2) identify and subsequently drop the lowest quiz score, 3) replace missing midterm grades with the final exam score, and 4) calculate the final grades for the semester.
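
The grading rules can be sketched for a single student as follows. This is not the linked code: the category weights are assumptions invented for the example, the Blackboard import is left out, and a missed quiz is assumed to be recorded as 0.

```python
# Assumed weights for illustration only; the actual course weights are not stated here.
DEFAULT_WEIGHTS = {"quiz": 0.2, "midterm": 0.4, "final": 0.4}


def final_grade(quizzes, midterm, final_exam, weights=DEFAULT_WEIGHTS):
    """Semester grade with the lowest quiz dropped and a missing midterm replaced.

    quizzes: list of quiz scores (a missed quiz is assumed to be recorded as 0)
    midterm: midterm score, or None if the student missed it
    final_exam: final exam score
    """
    kept = sorted(quizzes)[1:]  # drop the single lowest quiz score
    quiz_avg = sum(kept) / len(kept)
    if midterm is None:
        midterm = final_exam  # the final exam counts twice
    return (
        weights["quiz"] * quiz_avg
        + weights["midterm"] * midterm
        + weights["final"] * final_exam
    )


# A student who missed one quiz and the midterm.
print(final_grade([100] * 9 + [0], None, 80))
```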

I teach undergraduate and graduate auditing and have some older examples of audit reports with going concern modifications or dual dating. The examples are now obsolete since the audit report format for public companies has changed. I believe that showing students actual audit reports is more useful than sticking to textbook examples, and I wanted to update the examples I use in my lectures.

This Python code identifies recent going concern auditor report modifications in the Audit Analytics Audit Opinions dataset. There are several variables in the data dictionaries associated with dual dating, but I was unable to locate these variables in the Audit Opinions data. This could be due to errors on my end, errors in the Audit Analytics data dictionaries, the inclusion of these variables in another dataset, or the particulars of the subscription that the university had to Audit Analytics when I pulled the opinions data.



The code provided pulls the 12/31/2017 Balance Sheet for FCCY into a pandas DataFrame. The ultimate goal of this project is to automate this process for all SEC filings of a given type and time period, but doing this for a given firm and year is the first step.

Scaling or deflating variables is common in accounting and finance research. It is often done to mitigate heteroskedasticity or the influence of firm size on parameter estimates. However, using analytic results and Monte Carlo simulations, we show that common forms of scaling induce substantial spurious correlation via biased parameter estimates. Researchers are typically better off dealing with both heteroskedasticity and the influence of large firms using techniques other than scaling.
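
A small simulation can illustrate the general phenomenon. This sketch is not the paper's simulation design: it shows the classic case in which two independent variables become strongly correlated once both are divided by a common, independent size measure. The distributions and parameters are assumptions chosen for the example.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000
x = [rng.gauss(10, 1) for _ in range(n)]            # independent of y and size
y = [rng.gauss(10, 1) for _ in range(n)]            # independent of x and size
size = [rng.lognormvariate(0, 1) for _ in range(n)] # hypothetical deflator, e.g. firm size

raw_corr = pearson(x, y)
scaled_corr = pearson(
    [xi / s for xi, s in zip(x, size)],
    [yi / s for yi, s in zip(y, size)],
)
print(f"correlation of x and y:           {raw_corr:.3f}")
print(f"correlation of x/size and y/size: {scaled_corr:.3f}")
```

The unscaled variables are essentially uncorrelated, while the scaled versions move together because both inherit the variation in 1/size.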

The full paper is here:

code in .txt form

example data

capture log close
log using cost.log, replace
import excel data.xlsx, first

/* Sort the data and take a look at the scatter
to see if there appears to be an obvious
relationship between y and x. */

gsort x
list y x
scatter y x

/* The general pattern is that when x <= 10, y=0
and when x >= 18, y=1. There are two exceptions
to this: (x=7, y=1) and (x=20, y=0) */

logit y x
