An automated approach to web scraping publicly traded companies’ financial statements is something I’ve been working on for a while. My first post identified the balance sheet using a firm-specific characteristic of FCCY’s 12/31/2017 balance sheet: namely, that it had a row titled “ASSETS.” Of course, not every firm is going to have this header in all caps to identify which table in the .html file is the balance sheet. But it was a start. The next, currently unpublished, step pulled the annual report and then identified the balance sheet by counting how many accounts commonly found on balance sheets appeared in each table.

Some of the issues I ran into while programming this (a normalization sketch follows the list):

- variation in accounts between the balance sheets of companies in different industries;
- extra spaces or characters in the .html file that are not readily apparent to human eyes;
- inconsistent capitalization of account titles across firms;
- variation in what different companies call the same account (e.g., additional paid in capital vs. additional paid-in capital, and stockholder’s equity (deficiency) vs. shareholders’ equity vs. stockholders’ equity);
- financial statements split in two across separate pages and identified in the file as two separate tables;
- notes at the bottom of the page with the financial statements also being tagged as tables by the issuer;
- substantial variation in the exact titles firms use for the various financial statements; and
- the actual layout of the tables after they have been scraped (e.g., multiple columns for a given year, with data spread across the columns).
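To give a flavor of the title normalization this requires, here is a minimal sketch in Python; the helper name and the variant map are illustrative, not the actual code:

```python
import re

# Hypothetical variant map; a real one would be much longer.
CANONICAL_TITLES = {
    "additional paid in capital": "additional paid-in capital",
    "shareholders' equity": "stockholders' equity",
    "stockholder's equity (deficiency)": "stockholders' equity",
}

def normalize_title(raw: str) -> str:
    """Lowercase, collapse whitespace (including the non-breaking
    spaces that hide in .html files), and map known variants to a
    single canonical account title."""
    cleaned = raw.replace("\xa0", " ")
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    return CANONICAL_TITLES.get(cleaned, cleaned)
```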

All of these can be programmed around, and we’ll see some of these issues later in the post when we scrape FCCY’s 12/31/2017 10-K.

The code I wrote took around 114 hours to run and pulled around 60% of the firms that filed a Q1 2020 10-K with EDGAR. The accuracy for the firms that were pulled was reasonable; the speed was not, and missing around 40% of the firms that filed convinced me to modify the approach. One problem I noticed was that a 10-K can contain supplemental schedules that are not part of the basic financial statements but still score well on the cumulative count of account titles. So when my code thought it was pulling a balance sheet, it may actually have pulled, for example, an ASC 820 fair value disclosure table.
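The cumulative-count idea, in sketch form (the term list and function name are illustrative, not my production code):

```python
# Illustrative term list; the real one is longer and tuned per statement.
BALANCE_SHEET_TERMS = [
    "total assets", "total liabilities", "retained earnings",
    "accounts receivable", "accounts payable", "common stock",
]

def score_table(table_text: str) -> int:
    """Count how many common balance sheet terms appear in a table."""
    text = table_text.lower()
    return sum(term in text for term in BALANCE_SHEET_TERMS)
```

Picking the highest-scoring table is exactly where a fair value schedule, which repeats many of the same account names, can beat out the real balance sheet.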

My current solution uses the structure of the 10-K to limit the number of tables iterated through, which should improve both the speed and the accuracy of the process. The financial statements typically sit below the auditor’s reports and above the notes to the financial statements. The current code finds those two headings in the 10-K, uses them as boundaries for the search, and then performs similar cumulative counts to identify each of the basic financial statements based on the number of hits returned.
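In sketch form with BeautifulSoup (the marker phrases are illustrative; they vary across filings, and real matching has to be fuzzier to avoid false hits like the table of contents):

```python
from bs4 import BeautifulSoup

AUDITOR_MARKER = "report of independent registered public accounting firm"
NOTES_MARKER = "notes to consolidated financial statements"

def candidate_tables(html: str) -> list:
    """Collect only the tables that sit between the auditor's report
    and the notes to the financial statements."""
    soup = BeautifulSoup(html, "html.parser")
    tables, in_statements = [], False
    for element in soup.find_all(["p", "table"]):
        text = element.get_text(" ", strip=True).lower()
        if not in_statements:
            in_statements = AUDITOR_MARKER in text
            continue
        if NOTES_MARKER in text and element.name != "table":
            break  # reached the notes; the search window is closed
        if element.name == "table":
            tables.append(element)
    return tables
```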

FCCY’s balance sheet conceptually contains one column of account titles, one column for 2017 data, and one column for 2016 data. But the table in the .html file actually has eight columns. Some columns are near-duplicates of others, holding a leading dollar sign where the amount would be; some columns are full of null values; and the accounting convention of wrapping negative numbers in parentheses has also been split across the columns.
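To illustrate the flavor of the cleanup, here is a simplified sketch; the real logic depends on the actual column layout:

```python
import pandas as pd

def parse_amount(cell: str) -> float | None:
    """Convert cells like '$ 1,234' or '(567)' to signed floats;
    return None for blanks and stray symbols."""
    text = str(cell).replace("$", "").replace(",", "").strip()
    if not text or text in {"-", "—"}:  # filings mark blank cells with dashes
        return None
    negative = text.startswith("(") or text.endswith(")")
    text = text.strip("()")
    try:
        value = float(text)
    except ValueError:
        return None
    return -value if negative else value

def drop_junk_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop all-null columns and columns holding only '$' or blanks."""
    df = raw.dropna(axis=1, how="all")
    keep = [c for c in df.columns
            if not df[c].astype(str).str.strip().isin(["", "$", "nan"]).all()]
    return df[keep]
```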

The code deals with all of these issues and pulls the table into a cleaned and formatted DataFrame with a multi-index based on CIK and Year. I have not written code to process the other financial statements yet. That is a work in progress, and I’ll turn to it after I see how the new automation routine performs. If I can get a higher percentage of filers into the DataFrame, then I will post the automated code, which loops through all 10-Ks filed with the SEC for a given quarter. I’m just not there yet.
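Keying each cleaned statement conceptually looks something like this (the CIK below is a placeholder, not FCCY’s):

```python
import pandas as pd

def add_keys(statement: pd.DataFrame, cik: str, year: int) -> pd.DataFrame:
    """Index a cleaned statement by filer and fiscal year so that
    statements from many filers can live in one DataFrame."""
    keyed = statement.copy()
    keyed["CIK"], keyed["Year"] = cik, year
    return keyed.set_index(["CIK", "Year"])

# e.g., combined = pd.concat([add_keys(bs_2017, "0000000000", 2017), ...])
```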

You can find the code posted to my GitHub.