Part 1 and Part 2 show the history and evolution of my attempts to web scrape EDGAR files from the SEC. You can see detailed discussions of the approaches in the previous posts. The implementation issues with Part 2 caused me to rethink my approach to the problem.

I was hesitant to use XBRL tags initially because 1) this data isn’t currently audited and 2) my dissertation used both XBRL data and hand-collected data, and the hand-collected values differed from the XBRL tags around 30% of the time. But my opinion now is that using the XBRL tags is the only viable solution to the problem. Previously, I was able to scrape the actual titles used by the company for each financial statement line item, but the substantial variation in the titles used for the same accounts made appending this data into one DataFrame problematic: slight differences in naming conventions spread the same underlying variable across multiple columns for different firms. This could be coded around in theory, but in practice it would be a nightmare.

Rather than finding each financial statement and then scraping the entire statement, as in Part 2, Part 3 tries to scrape only Net Income from companies that filed with the SEC in QTR1 of 2020.

The big picture is:

Step 1) Download the company.idx file from EDGAR, a fixed-width text file that contains data for each firm that filed.
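Step 1 can be sketched as below. The index-URL layout follows EDGAR’s public full-index directory structure, and the function names (`build_index_url`, `download_index`) are my own; EDGAR also rejects requests without a descriptive User-Agent, so one is set here:

```python
from urllib.request import Request, urlopen

def build_index_url(year: int, quarter: int) -> str:
    """EDGAR full-index URL layout; year 2020, quarter 1 matches the post."""
    return (f"https://www.sec.gov/Archives/edgar/full-index/"
            f"{year}/QTR{quarter}/company.idx")

def download_index(year: int = 2020, quarter: int = 1) -> str:
    """Fetch the fixed-width company index as text."""
    # EDGAR expects a User-Agent identifying the requester (sample value here).
    req = Request(build_index_url(year, quarter),
                  headers={"User-Agent": "Sample Name sample@example.com"})
    with urlopen(req) as resp:
        return resp.read().decode("latin-1")
```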

Step 2) Clean up the text file to remove the header information, so that the DataFrame contains a list of CIKs and URLs pointing to the .txt file of each filing.

Step 3) Retain observations related to the annual report (i.e., 10-Ks).
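Steps 2 and 3 might look like the following sketch. The column names and the split-on-two-or-more-spaces rule are assumptions about the company.idx layout (company names contain single spaces, columns are separated by wider runs), so verify them against the actual file:

```python
import re
import pandas as pd

COLUMNS = ["company_name", "form_type", "cik", "date_filed", "file_name"]

def parse_company_idx(text: str) -> pd.DataFrame:
    """Drop the header block above the dashed separator, then split each
    body row on runs of two or more spaces."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith("---")) + 1
    rows = [re.split(r"\s{2,}", line.strip())
            for line in lines[start:] if line.strip()]
    return pd.DataFrame(rows, columns=COLUMNS)

def keep_annual_reports(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: keep 10-K filings. The exact match excludes amendments
    such as 10-K/A, which is a judgment call on my part."""
    return df[df["form_type"] == "10-K"].reset_index(drop=True)
```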

Step 4) Initialize an empty list of exceptions so that any firms that don’t successfully pull may be investigated. The code can be revised based on what is found by manually inspecting the exceptions.
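The pattern in Step 4 can be sketched as follows; `scrape_net_income` is a hypothetical stand-in for the per-filing scraping logic, not a function from the original code:

```python
def scrape_all(filings, scrape_net_income):
    """Loop over (cik, url) pairs, collecting failures instead of
    halting, so the problem filings can be inspected afterwards."""
    results, exceptions = [], []
    for cik, url in filings:
        try:
            results.append(scrape_net_income(cik, url))
        except Exception as exc:
            # Keep enough context to revisit the filing manually later.
            exceptions.append((cik, url, repr(exc)))
    return results, exceptions
```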

Step 5) Use BeautifulSoup to find all of the tables in the text file, then search the tables for the various ways firms title their income statements. Note there is a lot of variation here, and I wasn’t able to find a specific XBRL tag that would identify the income statement, though this may exist and I may have missed it. This step converts everything to lower case and removes spaces and newline characters when searching for the income statements. You can see some common income statement titles as a series of “if” statements during the iterations.
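A minimal sketch of the Step 5 search, assuming a hand-picked set of common titles (the actual post uses a longer series of “if” statements) and Python’s built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Assumed examples of common titles, lower-cased with whitespace removed.
INCOME_TITLES = {
    "consolidatedstatementsofoperations",
    "consolidatedstatementsofincome",
    "consolidatedincomestatements",
}

def normalize(text: str) -> str:
    """Lower-case and strip spaces/newlines, as described in Step 5."""
    return "".join(text.lower().split())

def candidate_tables(html: str) -> list:
    """Return tables whose normalized text contains an income-statement title."""
    soup = BeautifulSoup(html, "html.parser")
    return [table for table in soup.find_all("table")
            if any(title in normalize(table.get_text())
                   for title in INCOME_TITLES)]
```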

Step 6) Once the tables that look like they may be consolidated income statements have been retained, the next step is to search those tables for the XBRL tags firms use when tagging Net Income. Some firms used the _ProfitLoss tag and some used _NetIncomeLoss. Firms also seemed a little inconsistent in whether they tagged Net Income attributable to common shareholders or Net Income before NCIs (non-controlling interests); you can see examples of that in the Jupyter Notebook posted to my GitHub.
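Step 6 could be sketched like this. Matching the tag substrings against each table’s raw markup is a simplification of searching the actual XBRL elements, but it captures the idea of checking both tags the post mentions:

```python
# The two Net Income tag variants named in the post, checked in order.
NET_INCOME_TAGS = ("_ProfitLoss", "_NetIncomeLoss")

def tables_with_net_income_tag(tables) -> list:
    """Return (tag, table) pairs for tables whose markup mentions one
    of the Net Income XBRL tags."""
    hits = []
    for table in tables:
        markup = str(table)
        for tag in NET_INCOME_TAGS:
            if tag in markup:
                hits.append((tag, table))
                break  # record each table once, under the first tag found
    return hits
```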

Step 7) Find the “tr” and “td” elements with the XBRL tag and retain rows with lengths of 4 or less. The usual case is that a firm that has been in operation for multiple years will file three years of income statement data; the length is four because the tag itself counts as one. Of course, the code also has to keep firms that may be in their first or second year of operations. These firms would have lengths of 3 if they filed two years, 2 if they filed one year, etc. Some firms have XBRL-tagged statements with lengths greater than 4 when the tagged statement contains monthly or quarterly data in addition to the annual data. My code doesn’t process these instances, but it is something you should be aware of.
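The Step 7 length filter might look like the following sketch, where `table` is a BeautifulSoup element:

```python
from bs4 import BeautifulSoup

def annual_rows(table) -> list:
    """Keep <tr> rows with 4 or fewer <td> cells: the tag cell plus up
    to three fiscal years. Longer rows, which usually carry quarterly
    or monthly columns, are skipped, matching the caveat in the post."""
    kept = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if 0 < len(cells) <= 4:
            kept.append([td.get_text(strip=True) for td in cells])
    return kept
```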

Step 8) Create a DataFrame with the scraped values. The variables are: CIK, XBRL tag, current-year net income, lagged net income, and the second lag of net income.
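The Step 8 DataFrame could be assembled as in this sketch; the column names are my own, and missing lags for first- or second-year firms are left empty:

```python
import pandas as pd

def build_frame(records) -> pd.DataFrame:
    """records: iterable of (cik, xbrl_tag, values), where values holds
    up to three years of net income, current year first."""
    cols = ["cik", "xbrl_tag", "ni_current", "ni_lag1", "ni_lag2"]
    rows = []
    for cik, tag, values in records:
        # Pad with None so one- and two-year firms still fit the layout.
        padded = list(values) + [None] * (3 - len(values))
        rows.append([cik, tag, *padded[:3]])
    return pd.DataFrame(rows, columns=cols)
```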

Step 9) Clean out the other text that was pulled along with the net income number, which at this point still exists as a string.
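Step 9’s string cleanup might look like the following sketch; the regex and the parentheses-as-negative convention are assumptions about typical 10-K formatting, not the post’s exact code:

```python
import re

def clean_amount(raw: str):
    """Extract a numeric value from scraped cell text: drop currency
    symbols, commas, and footnote text, and treat parentheses as a
    negative sign. Returns None when no number is found."""
    match = re.search(r"\(?-?[\d,]+(?:\.\d+)?\)?", raw)
    if not match:
        return None
    token = match.group()
    negative = token.startswith("(") or "-" in token
    value = float(token.strip("()").lstrip("-").replace(",", ""))
    return -value if negative else value
```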

Step 10) Go back to the actual 10-K’s to audit the results.

The gist of the above is that the script successfully pulls around 88 of the first 100 firms. However, the code is slow due to the abundance of “if” statements. More detailed discussion of the results and caveats, along with the output and code, is available in my GitHub repository.