Part 1 and Part 2 trace the history of my attempts to scrape EDGAR files from the SEC, and the previous posts discuss each approach in detail. The implementation issues with Part 2 caused me to rethink my approach to the problem.

I was initially hesitant to use XBRL tags for two reasons: 1) this data isn’t currently audited, and 2) my dissertation used both XBRL data and hand-collected data, and what I hand-collected differed from the XBRL tags around 30% of the time. My opinion now, however, is that using the XBRL tags is the only viable solution to the problem. Previously, I was able to scrape the actual titles each company used for its financial statement line items, but the substantial variation in the titles used for the same accounts made appending this data together into one DataFrame problematic. For example, slight differences in naming conventions cause the same underlying variable to be spread across multiple columns for different firms. This could be coded around in theory, but in practice it would be a nightmare.
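
To make the column-spread problem concrete, here is a minimal sketch in plain Python rather than my actual scraping code. The two firms, their line-item titles, and the values are all made up for illustration; `Revenues` and `NetIncomeLoss` are used as example tag names. The helper functions `to_row` and `columns` are hypothetical and only mimic how appending rows into one DataFrame takes the union of column names.

```python
# Hypothetical line items for two made-up firms (illustrative values only).
firm_a = [
    {"title": "Total revenues", "tag": "Revenues", "value": 1000},
    {"title": "Net income", "tag": "NetIncomeLoss", "value": 120},
]
firm_b = [
    {"title": "Revenues, net", "tag": "Revenues", "value": 800},
    {"title": "Net income (loss)", "tag": "NetIncomeLoss", "value": -40},
]


def to_row(items, key):
    """Collapse one firm's line items into a single row keyed by `key`."""
    return {item[key]: item["value"] for item in items}


def columns(rows):
    """Union of column names across rows, as appending into one table would give."""
    return sorted({name for row in rows for name in row})


# Keyed by company-specific title: the same account lands in different columns.
by_title = [to_row(firm_a, "title"), to_row(firm_b, "title")]
# Keyed by tag: the same account lands in the same column for both firms.
by_tag = [to_row(firm_a, "tag"), to_row(firm_b, "tag")]

title_columns = columns(by_title)
tag_columns = columns(by_tag)

print(title_columns)  # one column per distinct title, each only partly filled
print(tag_columns)    # one column per tag, filled for both firms
```

With only two firms and two accounts, keying on titles already doubles the number of columns. Keying on a shared tag keeps one column per account, which is the appeal of the XBRL approach despite the data quality concerns above.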
