Scraping EDGAR with Python
This article presents Python code that can be used to extract data from Securities and Exchange Commission (SEC) filings. The program crawls the web to obtain the URL paths of required company filings, such as Form 10-K. It then performs a textual analysis, counting the occurrences of words in the filing that reflect, for example, uncertainty (or any other quality specified by the researcher). The program is easily modified to run other searches by changing the word list, the company names, or the SEC filing type. It could be used in an introductory graduate data analytics course in finance that has a web crawling or textual analysis component.
Keywords: Computer programs; data collection; education; higher education
Scraping filings from the Securities and Exchange Commission (SEC) Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) with tools such as Python (Python Software Foundation, Wilmington, DE), R (R Project for Statistical Computing, Vienna, Austria), or SAS (SAS Institute, Cary, NC) has become a widely used way for researchers and practitioners to extract data and other information that are not readily available in databases such as COMPUSTAT, Execucomp, or SDC. García and Norli ([1]) described using the Perl programming language (The Perl Foundation, Holland, MI) to crawl EDGAR. Engelberg and Sankaraguruswamy ([2]) documented how to crawl EDGAR using SAS. In this article I document and explain basic Python code designed to extract data and information from SEC filings. I provide the steps to obtain the URL address of a company's filing (e.g., Form 10-K) from the SEC EDGAR Master Index. To illustrate how to obtain information from any filing, I show the steps to find key words that reflect uncertainty, following Loughran and McDonald's ([3]) financial dictionary.
Textual analysis is becoming a primary research field in many areas, including finance and accounting. Studies have documented, for example, whether negative words used by popular news outlets affect individual stocks and aggregate market returns (Tetlock, [5]; Tetlock, Saar-Tsechansky, & Macskassy, [6]). Loughran and McDonald ([3]) argued that some positive and negative words from the Harvard Psychosociological Dictionary, which is a commonly used source for word classifications, lead to misleading interpretations in the financial context.[1] They created a list of finance-related words in six categories: negative, positive, uncertainty, litigious, strong modal, and weak modal. They showed a significant relation between these words and 10-K filing date return, trading volume, subsequent return volatility, unexpected earnings, and fraud. As a simple illustration of how to extract data from SEC filings using the Python program, I select 10 words representing uncertainty from the Loughran and McDonald ([3]) financial sentiment dictionary and use them to scrape the 10-K filings of 16 firms.
I present algorithms to extract textual data from SEC EDGAR filings using Python programs. First, I show how to use a Python program to count the number of occurrences of the selected uncertainty words in a 10-K filing for a given company and year when the URL address of the filing is known. Next, I present the algorithm and Python program for the case in which the URL address of the filing is not known in advance and has to be extracted from the SEC Master Index. The SEC's computer system uses a Central Index Key (CIK) to identify companies that have filed a disclosure with the SEC. Using the CIK, the Python program searches the EDGAR Master Index for the specific quarter in which a company filed the 10-K, obtains the URL path of the file, and then opens and loads the file to compute the frequency of the chosen uncertainty words. Finally, I present the algorithm of a Python program that performs this task for a list of firms and corresponding years: it extracts each firm's 10-K filing and performs a word count for each company. As an example, I search the 2015 10-K filings of Microsoft and its 15 peers for words indicating uncertainty, following Loughran and McDonald's ([3]) financial dictionary, and report the results. I do not perform any analysis with the words extracted. My main objective is to explain to researchers and practitioners how to use the Python code, which they can later change to address their own questions.
Extracting uncertainty words from 10-K filings
There is an abundance of information in SEC filings such as the 10-K and 8-K. One can investigate whether a positive outlook or sentiment, or a company's risk-taking attitude, is expressed predominantly in annual reports and affects the future performance of the firm, returns to its stock, the growth of the firm, or its future risk-taking behavior. In this article I do not present any hypotheses related to textual analysis with EDGAR data. I only explain how Python code can be used to count the frequency of certain words that appear in SEC filings. The code can be expanded to extract other key words relevant to any research project.
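I present a simple Python program that extracts key words from an SEC filing in .txt format. Given the URL of the filing, the program searches each line for the key words provided by the researcher. The code is titled Program1.py, and the full listing is provided in the Appendix. As that listing shows, the algorithm of Program1.py is as follows: (1) form the URL of the filing from its path and open it; (2) set the count of each key word to zero; (3) read the filing line by line, split each line into words on whitespace, and add the number of exact matches of each key word to its count; and (4) print the CIK, the year, the URL, and the word counts.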
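The counting step can also be sketched in Python 3 as a small function that works on any iterable of text lines. This is an illustrative variant rather than the Appendix listing: the function name `count_words` is hypothetical, and unlike Program1.py it lowercases the text and strips punctuation before matching, so its counts will generally differ from those reported in the Appendix.

```python
import re

# The 10 uncertainty words used in the article's example.
UNCERTAINTY_WORDS = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
                     'likelihood', 'possible', 'predict', 'risk', 'uncertain']


def count_words(lines, words):
    """Count whole-word matches of each key word across an iterable of text lines."""
    counts = {word: 0 for word in words}
    for line in lines:
        # Lowercase the line and keep alphabetic tokens only, so "Risk," matches "risk".
        # Matching is still exact: "risks" is not counted as "risk".
        for token in re.findall(r"[a-z]+", line.lower()):
            if token in counts:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    sample_lines = ["We believe the Risk is possible.",
                    "Results may fluctuate; risk, risk."]
    print(count_words(sample_lines, UNCERTAINTY_WORDS))
```

Because the function takes lines rather than a URL, it can be tried on a locally saved filing before any web access is involved.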
The URL path of the SEC filing may not be known in advance, in which case it must be looked up in the SEC EDGAR Master Index. In the next section I describe how to search for the URL path in the Master Index and then use it in the logic of Program1.py.
EDGAR indexes
The SEC's HTTPS file system allows comprehensive access to the SEC's EDGAR filings by corporations, funds, and individuals. EDGAR Full Indexes list all public SEC filings for each quarter starting in the third quarter of 1994 to the present (U.S. Securities and Exchange Commission, [7]).
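The Appendix programs fetch pages with the Python 2 module `urllib2`. A minimal Python 3 sketch of the same HTTPS access, using the standard library's `urllib.request`, is given here. The helper names `build_request` and `fetch_text_lines` are hypothetical, the User-Agent header reflects an assumption that automated clients should identify themselves (the SEC's current access guidance should be checked), and the Latin-1 decoding is an assumption chosen only because it never raises on arbitrary bytes.

```python
import urllib.request


def build_request(url, user_agent):
    """Build an HTTPS request that identifies the client (assumed to be expected by the server)."""
    if not user_agent.strip():
        raise ValueError("user_agent must not be empty")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text_lines(url, user_agent):
    """Yield the lines of a remote text file as str, one line at a time."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        for raw_line in response:
            # Assumption: Latin-1 is used because it decodes any byte sequence.
            yield raw_line.decode("latin-1")
```

The lines yielded by `fetch_text_lines` are ordinary strings, so they can be passed to any line-based word counter or index parser.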
There are four types of EDGAR indexes that list company filings for all four quarters of each year. Three of the main indexes sort on the basis of (a) company name, (b) form type, and (c) CIK number. For each filing the EDGAR indexes provide CIK, company name, form type, date filed, and file path (URL address). Using the following link it is possible to access the EDGAR Full Index listed by year: https://www.sec.gov/Archives/edgar/full-index/. Figure 1 shows archives of the SEC EDGAR Full Index year by year from 1993 to the current year. After selecting a specific year, such as 2013, the following link directs to the index of the four quarters for the year: https://www.sec.gov/Archives/edgar/full-index/2013/. Figure 2 shows the SEC filing indexes for the four quarters of 2014.
Graph: Figure 1. Archives of the Securities and Exchange Commission EDGAR Full Index for 1993 to present.
Graph: Figure 2. Securities and Exchange Commission EDGAR Index for four quarters in 2014.
Choosing a particular quarter will direct you to four different types of indexes: company, form, master, and XBRL, as shown in Figure 3. The first three indexes show the same information, sorted by company name, form type, and CIK number, respectively. XBRL includes voluntary filer program submissions and is sorted by CIK number. Our program will search the Master Index to obtain the path for a company's filing. A typical Master Index is shown in Figure 4.
Graph: Figure 3. Securities and Exchange Commission EDGAR archives for four different indexes: Company, Form, Master, and XBRL in the third quarter of 2014.
Graph: Figure 4. A typical Master Index File.
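The directory layout described above maps directly onto a URL pattern for the Master Index of any year and quarter. The following Python 3 sketch builds that URL; the function name `master_index_url` is hypothetical, and the pattern is the one used for the first quarter in Program2.py, generalized to the other quarters.

```python
def master_index_url(year, quarter):
    """Return the URL of the EDGAR Master Index for a given year and quarter (1 to 4)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    return ("https://www.sec.gov/Archives/edgar/full-index/%d/QTR%d/master.idx"
            % (year, quarter))


if __name__ == "__main__":
    # List the four quarterly Master Index URLs for one year.
    for q in (1, 2, 3, 4):
        print(master_index_url(2013, q))
```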
Opening the SEC filing for a company using the Master Index
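Suppose we want the URL path of a particular SEC filing, such as a 10-K, for a given CIK in a given quarter and year. For example, for AMAZON COM INC (CIK: 1018724) in the first quarter of 2013, I search the Master Index for the URL address of the 10-K filing, record it, and open the filing to perform the textual analysis described in the section titled "Extracting uncertainty words from 10-K filings." I use Program2.py (provided in the Appendix) to perform this task. As the Appendix listing shows, its algorithm is as follows: (1) open the Master Index for the first quarter of the given year; (2) read the index line by line and select the line that contains both the given CIK and the form type (10-K); (3) from that line, extract the pipe-delimited field that begins with edgar/data/, which is the file path; (4) append the path to https://www.sec.gov/Archives/ to form the URL of the filing; and (5) open the filing and count the key words as in Program1.py.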
Graph: Figure 5. URL for Form 10-K filing is obtained from the Master Index for AMAZON COM in first quarter of 2013.
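The Master Index search can also be sketched in Python 3 with field-by-field matching instead of the substring tests used in Program2.py. The sketch assumes each data line has five pipe-delimited fields in the order CIK, company name, form type, date filed, and file path; the function name `find_filing_paths` is hypothetical, and the dates and the non-Amazon rows in the sample are made-up placeholders.

```python
def find_filing_paths(index_lines, cik, form_type):
    """Return the file paths of all filings matching a CIK and an exact form type."""
    wanted_cik = str(int(cik))  # normalize '0001018724' to '1018724'
    paths = []
    for line in index_lines:
        fields = line.strip().split("|")
        if len(fields) != 5:
            continue  # skip header text, separators, and blank lines
        line_cik, _company, line_form, _date_filed, path = fields
        # Exact comparison: '10-K/A' does not match '10-K', and CIK '11018724'
        # does not match '1018724', unlike a substring test.
        if line_cik == wanted_cik and line_form == form_type:
            paths.append(path)
    return paths


if __name__ == "__main__":
    sample_index = [
        "CIK|Company Name|Form Type|Date Filed|Filename",
        # Path taken from the article; the date is a placeholder.
        "1018724|AMAZON COM INC|10-K|2013-01-01|edgar/data/1018724/0001193125-13-028520.txt",
        # Made-up rows for illustration only.
        "1018724|AMAZON COM INC|8-K|2013-01-02|edgar/data/1018724/0000000000-13-000000.txt",
        "11018724|OTHER CO|10-K|2013-01-03|edgar/data/11018724/0000000000-13-000001.txt",
    ]
    for path in find_filing_paths(sample_index, "0001018724", "10-K"):
        print("https://www.sec.gov/Archives/" + path)
```

Returning a list rather than a single path makes it visible when a company has more than one matching filing in the quarter.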
If the CIK of the company is not known, the previous program is modified to incorporate a search in the Master Index for the CIK for a specific company name (discussed in the next section).
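A name lookup of this kind might look like the following Python 3 sketch. It is not the author's program, which is not reproduced in the article: the function name `find_ciks_by_name` is hypothetical, the five-field pipe-delimited line layout is an assumption, and the company names and paths in the usage example are invented.

```python
def find_ciks_by_name(index_lines, company_name):
    """Return the sorted CIKs whose company name equals company_name (ignoring case)."""
    wanted = company_name.strip().upper()
    matches = set()
    for line in index_lines:
        fields = line.strip().split("|")
        # Assumed layout: CIK | company name | form type | date filed | file path
        if len(fields) == 5 and fields[1].strip().upper() == wanted:
            matches.add(fields[0])
    return sorted(matches)


if __name__ == "__main__":
    sample_index = [
        "1111111|EXAMPLE CORP|10-K|2015-02-01|edgar/data/1111111/0000000000-15-000001.txt",
        "2222222|EXAMPLE CORP HOLDINGS|10-K|2015-02-15|edgar/data/2222222/0000000000-15-000003.txt",
    ]
    print(find_ciks_by_name(sample_index, "Example Corp"))
```

The comparison is on the whole name, so a partial name such as a ticker or an abbreviation returns no match.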
Search the SEC filings of multiple companies
It is possible to search for SEC filings by multiple companies with a single program and perform the word count for each company filing. This search uses the following logic:[2]
Results
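I present a simple exercise to illustrate how to access and extract information or data from EDGAR filings. I construct a list of 10 words that reflect uncertainty following Loughran and McDonald's ([3]) financial dictionary. The sample list of 10 key words, as used in the Appendix programs, is the following: anticipate, believe, depend, fluctuate, indefinite, likelihood, possible, predict, risk, and uncertain.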
In Program1.py (presented in the Appendix), using the URL path of the filing, I measure the frequency of the key words listed above in the 10-K filing for AMAZON COM INC in 2013. The program prints the frequency of each word in the "uncertainty" word list. If the URL path of the 10-K filing is not known, Program2.py (presented in the Appendix) can crawl through the EDGAR Master Index and obtain it before performing the word count. In this program the inputs are the CIK, the year, and the SEC form to be retrieved. The output of the program is the URL path and the word count in Amazon's 10-K filing in 2013. These outputs are shown in the Appendix after each program.
Next, I expand the previous programs to search for SEC filings of multiple companies and perform the word count for each company filing.[3] I look for the words indicating uncertainty in the 10-K filings of Microsoft and its peers in 2015 as reported in the proxy statement of Microsoft (SEC filing DEF 14A). Table 1 provides the names of the companies and the corresponding years for which the program searches the 10-K filings for the key words. One limitation is that the company names must be provided exactly as they appear in the SEC filings for the match to succeed.
Table 1. List of company names and corresponding years.
The output of the program is constructed in the format of Table 2, where it reports the CIK, the quarter for which the 10-K is obtained in 2015, and the number of occurrences of each key word. If the company does not report filing a 10-K for the specified year for any of the four quarters, the program will output the symbol (*) in the corresponding cell(s). For example, HEWLETT PACKARD CO did not report a 10-K filing in 2015 for any of the four quarters, so the report outputs (*) for each quarter. The program will also output (*) if it is unable to detect any company with the specified name.
Table 2. Count of uncertainty words extracted from the form 10-K filings.
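The multi-company program itself is not reproduced in the article, so the following Python 3 sketch only illustrates the reporting behavior described in this section: for each company name, it looks through the quarterly indexes for a filing of the requested form and records "*" when none is found. The function name `locate_filings`, the dictionary of quarterly index lines, the five-field line layout, and the company names in the usage example are all assumptions made for illustration.

```python
def locate_filings(index_by_quarter, company_names, form_type="10-K"):
    """For each company name, report the CIK, quarter, and path of its first
    matching filing, or '*' in each cell when no filing is found."""
    rows = []
    for name in company_names:
        row = {"company": name, "cik": "*", "quarter": "*", "path": "*"}
        for quarter in sorted(index_by_quarter):
            for line in index_by_quarter[quarter]:
                fields = line.strip().split("|")
                if len(fields) != 5:
                    continue  # skip header text and blank lines
                cik, company, form, _date_filed, path = fields
                # The name must match exactly, as noted in the text.
                if company == name and form == form_type:
                    row.update(cik=cik, quarter=quarter, path=path)
                    break
            if row["path"] != "*":
                break  # stop at the first quarter with a match
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Invented index lines for two hypothetical companies.
    sample = {
        1: ["9000001|EXAMPLE CORP A|8-K|2015-01-15|edgar/data/9000001/a.txt"],
        3: ["9000001|EXAMPLE CORP A|10-K|2015-07-31|edgar/data/9000001/b.txt"],
    }
    for row in locate_filings(sample, ["EXAMPLE CORP A", "EXAMPLE CORP B"]):
        print(row)
```

Each returned row can then be extended with the word counts for the located filing, leaving "*" in place where no filing exists.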
Summary
With the advancement of computational power and machine learning tools, gathering, processing, and analyzing data efficiently has become essential to progress in both research and practice. Traditionally, researchers and practitioners have relied on COMPUSTAT, Execucomp, or SDC for data on companies' performance metrics, executive compensation, and mergers and acquisitions deals. With languages such as Python and R it is now easy to crawl the web, gather specific or unique data from SEC EDGAR filings, news articles, and social media such as Twitter, and analyze or predict economic, social, or consumer behavior. In this article I describe Python programs for beginners that allow data to be scraped from SEC filings for further analysis. The code can be modified to meet the needs of the researcher and obtain data from any SEC filing. The techniques described in this article could be used in a graduate-level course on data analytics that has a web-crawling component and to explore the abundance of data in SEC filings.
Appendix
Program 1
# Given the URL path of an EDGAR 10-K filing in .txt format for a company (CIK)
# in a given year, this code performs the word count.
# Written for Python 2: urllib2 is a Python 2 module for fetching URLs
# (its Python 3 counterpart is urllib.request).
import urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1

# Load the page for the given URL into response3
response3 = urllib2.urlopen(url3)

# words: list of uncertainty words from Loughran and McDonald ([3])
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite', 'likelihood', 'possible', 'predict', 'risk', 'uncertain']

count = {}  # a dictionary mapping each word to its number of occurrences
for elem in words:
    count[elem] = 0

# split() returns a list of all the words in the string, using the separator given.
# If no separator is defined, whitespace is used.
# The match is exact: it is case-sensitive, and a word with attached
# punctuation (e.g., 'risk,') is not counted.
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)
Program1.py output:
# URL address of 10-K filing
https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt
# Number of occurrences of uncertainty words
{'believe': 28, 'depend': 2, 'risk': 25, 'predict': 5, 'likelihood': 15, 'anticipate': 0, 'possible': 24, 'uncertain': 10, 'indefinite': 0, 'fluctuate': 3}
Program 2
# Given the EDGAR Master Index URL, find the path of the raw text filing
# (10-K) for AMAZON COM (CIK 1018724) in year 2013.
# Written for Python 2 (urllib2).
import urllib2
import time
import csv
import sys

CIK = '1018724'  # AMAZON COM
Year = '2013'    # GIVEN
FILE = '10-K'    # GIVEN

# Get the Master Index file for the first quarter of the given Year
url = 'https://www.sec.gov/Archives/edgar/full-index/%s/QTR1/master.idx' % (Year)
response = urllib2.urlopen(url)

string_match1 = 'edgar/data/'
element2 = None
element3 = None
element4 = None

# Go through each line of the Master Index file, find the given CIK and
# FILE (10-K), and extract the text file path.
# Note: these are substring tests, so the last matching line wins.
for line in response:
    if CIK in line and FILE in line:
        for element in line.split(' '):
            if string_match1 in element:
                element2 = element.split('|')
                for element3 in element2:
                    if string_match1 in element3:
                        element4 = element3

# Stop with a clear message if no matching filing was found
if element4 is None:
    sys.exit('No %s filing found for CIK %s' % (FILE, CIK))

# The path of the 10-K filing (strip the trailing newline read from the index)
url3 = 'https://www.sec.gov/Archives/' + element4.strip()
response3 = urllib2.urlopen(url3)

words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite', 'likelihood', 'possible', 'predict', 'risk', 'uncertain']

count = {}
for elem in words:
    count[elem] = 0

# Count exact whitespace-delimited matches of each word, as in Program 1
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)
Program2.py output:
# URL address of 10-K filing
https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt
# Number of occurrences of uncertainty words
{'believe': 28, 'depend': 2, 'risk': 25, 'predict': 5, 'likelihood': 15, 'anticipate': 0, 'possible': 24, 'uncertain': 10, 'indefinite': 0, 'fluctuate': 3}
Footnotes
1 See Loughran and McDonald ([4]) for a survey on textual analysis.
2 Due to the length of Program 3, these codes are not provided in the article and are available upon request.
3 These codes are not provided in the article and are available upon request.
4 Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/vjeb.
References
1 García, D., & Norli, Ø. (2012). Crawling EDGAR. The Spanish Review of Financial Economics, 10(1), 1–10.
2 Engelberg, J., & Sankaraguruswamy, S. (2007). How to gather data using a web crawler: An application using SAS to search EDGAR. Working paper, Northwestern University and National University of Singapore.
3 Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66, 35–65.
4 Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54, 1187–1230.
5 Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62, 1139–1168.
6 Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. The Journal of Finance, 63, 1437–1467.
7 U.S. Securities and Exchange Commission. (2017). Accessing EDGAR data. Retrieved from https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm
By Rasha Ashraf