Archives for posts with tag: data science

Part 1 and Part 2 show the history and evolution of my attempts to web scrape EDGAR files from the SEC. You can see detailed discussions of the approaches in the previous posts. The implementation issues with Part 2 caused me to rethink my approach to the problem.

I was hesitant to use XBRL tags initially because 1) this data isn’t currently audited and 2) my dissertation used both XBRL data and hand-collected data, and the two differed around 30% of the time. But my opinion now is that using the XBRL tags is the only viable solution to the problem. Previously, I was able to scrape the actual titles used by the company for each financial statement line item, but the substantial variation in the titles used for the same accounts made appending this data together into one DataFrame problematic: slight differences in naming conventions cause the same underlying variable to be spread out among multiple columns for different firms. This could be coded around in theory, but in practice it would be a nightmare.
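The column-fragmentation problem can be illustrated with a small, hypothetical example (the account titles and mapping below are illustrative, not the actual scraped data): when two firms' statements are concatenated, two spellings of the same account become two half-empty columns.

```python
import pandas as pd

# Two firms report the same account under slightly different titles.
firm_a = pd.DataFrame({"Additional paid in capital": [1000]}, index=["FirmA"])
firm_b = pd.DataFrame({"Additional paid-in capital": [2500]}, index=["FirmB"])

# Concatenating produces two separate columns, each half-full of NaN.
combined = pd.concat([firm_a, firm_b])
print(combined.columns.tolist())

# One workaround: map known title variants to a canonical name first.
canonical = {
    "Additional paid in capital": "additional_paid_in_capital",
    "Additional paid-in capital": "additional_paid_in_capital",
}
fixed = pd.concat(
    [firm_a.rename(columns=canonical), firm_b.rename(columns=canonical)]
)
print(fixed.columns.tolist())
```

The catch, as noted above, is that the mapping has to enumerate every variant in advance, which is what makes this approach impractical at scale.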

An automated approach to web scraping publicly traded companies’ financial statements is something that I’ve been working on for a while. My first post identified the balance sheet using a firm-specific characteristic of FCCY’s 12/31/2017 balance sheet, namely a row titled “ASSETS.” Of course, not every firm is going to have this header in all caps to identify which table in the .html file is the balance sheet. But it was a start. The next, currently unpublished, step pulled the annual report and then identified the balance sheet by summing the number of times accounts commonly found on balance sheets appeared in each table.

Some of the issues I ran into while programming this:

  - the amount of account variation between balance sheets of companies in different industries;
  - extra spaces or characters in the .html file that are not readily apparent to human eyes;
  - inconsistent capitalization of account titles across firms;
  - variation in what different companies call the same account (e.g., additional paid in capital vs. additional paid-in capital, and stockholder’s equity (deficiency) vs. shareholders’ equity vs. stockholders’ equity);
  - financial statements split in two across separate pages and identified in the file as two separate tables;
  - notes at the bottom of the page with the financial statements also being tagged as tables by the issuer;
  - substantial variation in the exact titles firms use for the various financial statements; and
  - the layout of the tables after they have been scraped (e.g., multiple columns for a given year with data spread across the columns).

All of these can be programmed around, and we will see some of these issues later in the post with FCCY’s 12/31/2017 10-K after we scrape it.
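Several of the issues above — extra whitespace, invisible characters, inconsistent capitalization, and hyphen/apostrophe variants — can be attacked by normalizing scraped account titles before comparing them. The rules below are a minimal sketch, not an exhaustive treatment:

```python
import re
import unicodedata

def normalize_title(title):
    """Reduce a scraped account title to a canonical comparison form."""
    # NFKC maps non-breaking spaces and similar invisible variants
    # to their plain equivalents.
    title = unicodedata.normalize("NFKC", title)
    # Lowercase and drop punctuation that varies between firms.
    title = re.sub(r"[-'’()]", " ", title.lower())
    # Collapse any runs of whitespace left behind.
    return re.sub(r"\s+", " ", title).strip()
```

With this, “Additional paid-in capital” and “additional paid in capital” compare equal, though truly different wordings (shareholders’ vs. stockholders’ equity) still need an explicit synonym map.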

A common question in business schools is how best to prepare students with the data analysis skills that are increasingly important to employers. Faculty meetings often revolve around what skills are actually needed, which tools should be taught, which departments or professors should do the teaching, and whether the optimal approach is to require separate courses or embed analytics in existing courses.

Related discussions include whether teaching students to code is necessary or GUI-based solutions are sufficient, the extent to which courses should focus on visualizations, how deeply to go into statistics, whether the current math prerequisites are doing much of anything, how much math is really even necessary for analytics, and where database management skills fit into all of this.

Benford’s Law describes the expected frequency distribution of digits in many naturally occurring sets of numbers, based on each digit’s position. For example, the number 2934 has a “2” in the first position, a “9” in the second position, etc. Benford’s Law states that a “1” occurs in the first position 30.1% of the time, a “2” 17.6% of the time, and a “3” 12.5% of the time. The probabilities continually decrease as the digit in the first position increases, ending with a “9” in the first position occurring 4.6% of the time.
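These percentages follow from Benford’s first-digit formula, P(d) = log10(1 + 1/d). A quick check of the values quoted above:

```python
import math

def benford_p(d):
    """Probability that d (1-9) appears as the leading digit."""
    return math.log10(1 + 1 / d)

for d in (1, 2, 3, 9):
    print(d, round(benford_p(d) * 100, 1))
# 1 → 30.1, 2 → 17.6, 3 → 12.5, 9 → 4.6
```

Note that the nine probabilities sum to 1, since log10(2/1) + log10(3/2) + … + log10(10/9) telescopes to log10(10).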

In “Fraud Examination,” a textbook by Albrecht et al. published by Cengage, the authors write, “According to Benford’s Law, the first digit of random data sets will begin with a 1 more often than with a 2, a 2 more often than with a 3, and so on. In fact, Benford’s Law accurately predicts for many kinds of financial data that the first digits of each group of numbers in a set of random numbers will conform to the predicted distribution pattern.” Albrecht et al. go on to explain that Benford’s Law does not apply to assigned or sequentially generated numbers (e.g., invoice numbers, SSNs). The authors also write that Benford’s Law applies to lake sizes, stock prices, and accounting numbers.

In this post we will:

  1. Generate random numbers from three different distributions using Excel and see how closely they follow Benford’s Law.
  2. Discuss the underlying distributions of random numbers that we would and would not expect to follow Benford’s Law.
  3. Use CRSP and COMPUSTAT data to see if the following data conform to Benford’s Law: stock price, total assets, revenue, and total liabilities.
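The post carries out step 1 in Excel; as a rough Python analogue (the distribution parameters and sample size here are illustrative choices), the sketch below draws random numbers from a uniform and a lognormal distribution and compares the observed first-digit frequencies with Benford’s expected values. Lognormal data, which span several orders of magnitude, should track Benford far more closely than uniform data.

```python
import math
import random

def first_digit(x):
    """Leading nonzero digit of a positive number."""
    # Scientific notation puts the leading digit first: 0.0023 -> "2.30...e-03".
    return int(f"{abs(x):.10e}"[0])

def digit_freqs(values):
    """Observed frequency of each first digit 1-9."""
    counts = [0] * 9
    for v in values:
        counts[first_digit(v) - 1] += 1
    return [c / len(values) for c in counts]

random.seed(0)
uniform_data = [random.uniform(1, 1000) for _ in range(100_000)]
lognormal_data = [random.lognormvariate(0, 3) for _ in range(100_000)]

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
for name, data in [("uniform", uniform_data), ("lognormal", lognormal_data)]:
    max_gap = max(abs(f - b) for f, b in zip(digit_freqs(data), benford))
    print(f"{name}: max deviation from Benford = {max_gap:.3f}")
```

This previews the point in step 2: whether a set of random numbers follows Benford’s Law depends heavily on the underlying distribution, not on randomness per se.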
