Result: Scraping EDGAR with Python

Title:

Scraping EDGAR with Python

Language:

English

Authors:

Ashraf, Rasha

Source:

Journal of Education for Business. 2017 92(4):179-185.

Availability:

Routledge. Available from: Taylor & Francis, Ltd. 530 Walnut Street Suite 850, Philadelphia, PA 19106. Tel: 800-354-1420; Tel: 215-625-8900; Fax: 215-207-0050; Web site: http://www.tandf.co.uk/journals

Peer Reviewed:

Page Count:

Publication Date:

2017

Document Type:

Academic Journal
Journal Articles
Reports - Descriptive

Education Level:

Higher Education
Postsecondary Education

Descriptors:

Information Retrieval, Search Engines, Search Strategies, Online Searching, Electronic Libraries, Business Administration Education, Graduate Study, Data Analysis, Data Processing, Educational Technology, Technology Uses in Education

DOI:

10.1080/08832323.2017.1323720

ISSN:

0883-2323

Number of References:

Entry Date:

2017

Accession Number:

EJ1143599

Database:

ERIC

Further Information

This article presents Python codes that can be used to extract data from Securities and Exchange Commission (SEC) filings. The Python program web crawls to obtain URL paths for company filings of required reports, such as Form 10-K. The program then performs a textual analysis and counts the number of occurrences of words in the filing that reflect, for example, uncertainty (or any other quality specified by the researcher). The program can be easily modified to conduct other searches by changing the word list, company names, or SEC filings. The Python program could be used in an introductory graduate data analytics course in finance that has a web crawling or textual analysis component.

As Provided

Scraping EDGAR with Python.

Keywords: Computer programs; data collection; education; higher education

Scraping the Securities and Exchange Commission (SEC) Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) filings using programs such as Python (Python Software Foundation, Wilmington, DE), R (R Project for Statistical Computing, Vienna, Austria), or SAS (SAS Institute, Cary, NC) has become a widely used way for researchers and practitioners to extract data and other information that are not readily available in databases such as COMPUSTAT, Execucomp, or SDC. García and Norli ([1]) described using the Perl programming language (The Perl Foundation, Holland, MI) to crawl EDGAR. Engelberg and Sankaraguruswamy ([2]) documented how to crawl EDGAR using SAS. In this article I document and explain basic Python code that is designed to extract data and information from SEC filings. I provide the steps to obtain the URL address for a filing (e.g., Form 10-K) by a company from the SEC EDGAR Master Index. To illustrate how to obtain information from any filing, I show the steps to find key words that reflect uncertainty, following Loughran and McDonald's ([3]) financial dictionary.

Textual analysis is becoming a primary research field in many areas, including finance and accounting. Studies have documented, for example, whether negative words used by popular news outlets affect individual stocks and aggregate market returns (Tetlock, [5]; Tetlock, Saar-Tsechansky, & Macskassy, [6]). Loughran and McDonald ([3]) argued that some positive and negative words from the Harvard Psychosociological Dictionary, which is a commonly used source for word classifications, lead to misleading interpretations in the financial context.[1] The authors created a list of words related to finance in six categories: negative, positive, uncertainty, litigious, strong modal, and weak modal. They showed a significant relation between the words and 10-K filing date return, trading volume, subsequent return volatility, unexpected earnings, and fraud. As a simple illustration of how to extract data from SEC filings using the Python program, I selected 10 words representing uncertainty from the Loughran and McDonald ([3]) financial sentiment dictionary and used them to scrape through the 10-K filings of 16 firms.

I present algorithms to extract textual data from SEC EDGAR filings using Python programs. First, I show how to use a Python program to count the number of occurrences of the selected words signaling uncertainty in a 10-K filing for a given company in a given year when the URL address of the filing is known. Next, I present the algorithm and Python program for the case in which the URL address of the company filing is not known in advance and has to be extracted from the SEC Master Index. The SEC's computer system uses a Central Index Key (CIK) to identify companies that have filed a disclosure with the SEC. Using the CIK, the Python program searches the EDGAR Master Index for the specific quarter in which a company filed the 10-K, obtains the URL path for the file, and then opens and loads the file to compute the frequency of the chosen words indicating uncertainty. Finally, I present the algorithm of a Python program that performs the aforementioned task for a list of firms and corresponding years: the 10-K filing is extracted and a word count is performed for each company. As an example, I search the 2015 10-K filings of Microsoft and its 15 peers for the words indicating uncertainty, following Loughran and McDonald's ([3]) financial dictionary, and report the results. I do not perform any analysis with the words extracted. My main objective is to explain to researchers and practitioners how to use the Python code, which they can later change to address their own questions.

Extracting uncertainty words from 10-K filings

There is an abundance of information in SEC filings such as the 10-K and 8-K. One can investigate whether a positive outlook or sentiment, or a company's risk-taking attitude, is expressed predominantly in annual reports and affects the future performance of the firm, returns to its stock, the growth of the firm, or its future risk-taking behavior. In this article I do not present any hypotheses related to textual analysis with EDGAR data. I only explain how Python code can be used to count the frequency of certain words that appear in SEC filings. The code can be expanded to extract other key words relevant to any research project.

I present simple Python code to extract key words from an SEC filing in .txt format. Using the URL for the filing, this program searches each line of the filing for key words provided by the researcher. The code is titled Program1.py, and the details of the code are provided in the Appendix. The algorithm of Program1.py is as follows:

1. Open the page and load using the URL for the 10-K filing of a company in a year.

2. The program searches each line of the page and splits each line into elements (words), with whitespace as the separator.

3. For a given word list, the program searches each element in each line of the page and counts the frequency of the specified words.

4. The program provides a final count of the list of specified words. It also outputs the CIK, year, and the file path URL address. The CIK is a unique identifier for each company (or any other entity that files with the SEC) in the EDGAR system.
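The counting logic in steps 2 and 3 can be sketched on its own, without any network access. The following is a minimal Python 3 illustration, not the Appendix program itself; the function name `count_words` and the sample sentences are hypothetical and stand in for the lines of a real filing.

```python
def count_words(lines, words):
    """Count exact, case-sensitive matches of each word among the
    whitespace-separated tokens of every line."""
    counts = {word: 0 for word in words}
    for line in lines:
        tokens = line.split()  # no separator given, so whitespace is used
        for word in words:
            counts[word] += tokens.count(word)
    return counts


# Hypothetical sample text standing in for the lines of a filing.
sample_lines = [
    "We believe the risk is possible to predict",
    "Results may fluctuate and depend on risk factors.",
]
sample_counts = count_words(
    sample_lines, ["believe", "risk", "fluctuate", "uncertain"]
)
print(sample_counts)
```

Because tokens are compared for exact equality, variants such as "Risk", "risks", or "risk." (with attached punctuation) are not counted under this approach. A researcher who wants them included would need to normalize the tokens first.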

The URL path of the SEC filing may not be known in advance, in which case it has to be looked up in the SEC EDGAR Master Index. In the next section I describe how to search for the URL path in the Master Index and then use it in the logic of Program1.py.

EDGAR indexes

The SEC's HTTPS file system allows comprehensive access to the SEC's EDGAR filings by corporations, funds, and individuals. EDGAR Full Indexes list all public SEC filings for each quarter starting in the third quarter of 1994 to the present (U.S. Securities and Exchange Commission, [7]).

There are four types of EDGAR indexes that list company filings for all four quarters of each year. Three of the main indexes sort on the basis of (a) company name, (b) form type, and (c) CIK number. For each filing the EDGAR indexes provide CIK, company name, form type, date filed, and file path (URL address). Using the following link it is possible to access the EDGAR Full Index listed by year: https://www.sec.gov/Archives/edgar/full-index/. Figure 1 shows archives of the SEC EDGAR Full Index year by year from 1993 to the current year. After selecting a specific year, such as 2013, the following link directs to the index of the four quarters for the year: https://www.sec.gov/Archives/edgar/full-index/2013/. Figure 2 shows the SEC filing indexes for the four quarters of 2014.

Graph: Figure 1. Archives of the Securities and Exchange Commission EDGAR Full Index for 1993 to present.

Graph: Figure 2. Securities and Exchange Commission EDGAR Index for four quarters in 2014.

Choosing a particular quarter will direct you to four different types of indexes: company, form, master, and XBRL, as shown in Figure 3. The first three indexes show the same information, sorted by company name, form type, and CIK number, respectively. XBRL includes voluntary filer program submissions and is sorted by CIK number. Our program will search the Master Index to obtain the path for a company's filing. A typical Master Index is shown in Figure 4.
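The directory layout described above lends itself to building index URLs programmatically. The sketch below is an illustration in Python 3 under stated assumptions: the helper name `index_url` is hypothetical, the `master.idx` pattern follows the URL given in this article, and the file names for the other three index types are assumed to follow the same `<type>.idx` convention.

```python
BASE = "https://www.sec.gov/Archives/edgar/full-index"
INDEX_TYPES = ("company", "form", "master", "xbrl")


def index_url(year, quarter, index_type="master"):
    """Build the URL of a quarterly EDGAR full-index file."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    if index_type not in INDEX_TYPES:
        raise ValueError("index_type must be one of %s" % (INDEX_TYPES,))
    return "%s/%d/QTR%d/%s.idx" % (BASE, year, quarter, index_type)


# Master Index for the third quarter of 2014 (the quarter shown in Figure 3).
print(index_url(2014, 3))
```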

Graph: Figure 3. Securities and Exchange Commission EDGAR archives for four different indexes: Company, Form, Master, and XBRL in the third quarter of 2014.

Graph: Figure 4. A typical Master Index File.

Opening the SEC filing for a company using the Master Index

Say we want to find the URL path for a particular SEC filing, such as a 10-K, for a given CIK in a quarter-year. For example: for AMAZON COM INC. (CIK: 1018724) in the first quarter of 2013, I want to search the Master Index for the URL address of the 10-K filing. I record the URL of the filing and open it to perform a textual analysis as described in the section titled "Extracting uncertainty words from 10-K filings." I use Program2.py (provided in the Appendix) to perform this task using the following algorithm:

1. Access the Master Index for a given quarter-year using the URL from the SEC EDGAR Archives. For example, for access to the Master Index for the first quarter of 2013, the URL is as follows: url1 = https://www.sec.gov/Archives/edgar/full-index/2013/QTR1/master.idx.

2. Open and load the page with the url1.

3. Go through each line of the Master Index and find the CIK (1018724) and filing (e.g., 10-K), extract the text file path highlighted in Figure 5, and store it as follows: element4 = 'edgar/data/1018724/0001193125-13-028520.txt'.

Graph: Figure 5. URL for Form 10-K filing is obtained from the Master Index for AMAZON COM in the first quarter of 2013.

4. The path of the 10-K filing is stored as url2 = 'https://www.sec.gov/Archives/' + element4, which gives url2 = https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt.

5. Open the page at url2, then perform word count using Program1.py.

If the CIK of the company is not known, the previous program is modified to incorporate a search in the Master Index for the CIK for a specific company name (discussed in the next section).
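Steps 3 and 4 of the Program2.py algorithm amount to parsing the Master Index, whose fields (CIK, company name, form type, date filed, file path) are separated by the `|` character. The following Python 3 sketch is one possible way to do this and is not the Appendix code: the function name `find_filing_path` is hypothetical, and the sample lines are illustrative stand-ins (the Amazon path comes from this article; the dates and the second company are placeholders).

```python
def find_filing_path(index_lines, cik, form_type):
    """Return the file path of the first filing whose CIK and form type
    match exactly, or None if there is no such line."""
    wanted_cik = str(int(cik))  # drop leading zeros, e.g. '0001018724'
    for line in index_lines:
        fields = line.strip().split("|")
        if len(fields) != 5:
            continue  # skip descriptive header and separator lines
        line_cik, _name, line_form, _date_filed, path = fields
        if line_cik == wanted_cik and line_form == form_type:
            return path
    return None


# Illustrative index lines; dates and the second entry are placeholders.
sample_index = [
    "CIK|Company Name|Form Type|Date Filed|Filename",
    "----------------------------------------",
    "9999999|EXAMPLE CO|10-K/A|2013-02-01|edgar/data/9999999/placeholder.txt",
    "1018724|AMAZON COM INC|10-K|2013-01-30|edgar/data/1018724/0001193125-13-028520.txt",
]
path = find_filing_path(sample_index, "0001018724", "10-K")
url2 = "https://www.sec.gov/Archives/" + path
print(url2)
```

Comparing whole fields, rather than testing whether the CIK and form name occur anywhere in the line, is a design choice of this sketch: it keeps amended forms such as 10-K/A, or a CIK that happens to appear inside another number, from being matched by accident.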

Search the SEC filings of multiple companies

It is possible to search for SEC filings by multiple companies with a single program and perform the word count for each company filing. This search uses the following logic:[2]

1. This program first opens a file that contains the names of the companies and the corresponding year for which the filing (e.g., 10-K) needs to be searched. The filing name (10-K) is hard coded in the program but can be changed according to the needs of the researcher.

2. After reading the company names and corresponding years, the program then searches the Master Index for all quarters for the CIK for each company and the quarter in which the filing is available.

3. Then the program opens the Master Index again for the selected quarter and looks for the CIK and filing (e.g., 10-K) and obtains the URL path for the filing (similar to the logic in Program2.py).

4. Then the page of the filing (10-K) is loaded using the URL obtained in step 3.

5. The program then goes over each line of the filing page (.txt format) looking for the key words and counting the frequency of the words (similar to the logic in Program1.py).

6. For each company the program outputs the number of occurrences of each key word. If a filing for the company is not found, then the code reports an empty word count.
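Since the multi-company program is not reproduced in the article, the following Python 3 sketch only illustrates the search logic of steps 2 and 3 under stated assumptions. The function name `find_company_filing`, the in-memory `indexes_by_quarter` dictionary, and the fictitious sample companies are all hypothetical. The sketch also merges the CIK lookup and the path lookup into a single pass, whereas the steps above describe two passes over the Master Index.

```python
def find_company_filing(indexes_by_quarter, company_name, form_type):
    """Search the quarterly Master Index lines for a company's filing.

    indexes_by_quarter maps a quarter number (1-4) to a list of index lines.
    Returns (cik, quarter, path) for the first exact match, else None.
    """
    for quarter in sorted(indexes_by_quarter):
        for line in indexes_by_quarter[quarter]:
            fields = line.strip().split("|")
            if len(fields) != 5:
                continue  # skip header and separator lines
            cik, name, form, _date_filed, path = fields
            # The name must match the index entry exactly.
            if name == company_name and form == form_type:
                return cik, quarter, path
    return None


# Fictitious index data for two quarters of one year.
indexes_by_quarter = {
    1: ["1111111|ALPHA CORP|8-K|2015-02-01|edgar/data/1111111/a.txt"],
    3: [
        "1111111|ALPHA CORP|10-K|2015-08-01|edgar/data/1111111/b.txt",
        "2222222|BETA INC|10-Q|2015-08-05|edgar/data/2222222/c.txt",
    ],
}
for company in ["ALPHA CORP", "BETA INC"]:
    print(company, find_company_filing(indexes_by_quarter, company, "10-K"))
```

A company without a matching filing yields `None`, which a calling program can translate into an empty word count as described in step 6.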

Results

I present a simple exercise to illustrate how to access and extract information or data from EDGAR filings. I construct a list of 10 words that reflect uncertainty following Loughran and McDonald's ([3]) financial dictionary. The sample list of 10 key words is the following: anticipate, believe, depend, fluctuate, indefinite, likelihood, possible, predict, risk, uncertain. These words are hard coded in the program. The list can be expanded or modified to omit some words or include others.

In Program1.py (presented in the Appendix), using the URL path for the filing, I measure the frequency of the previously mentioned key words in the 10-K filing for AMAZON COM INC in 2013. The program prints out the frequency of each word in the "uncertainty" word list. If the URL path of the 10-K filing is not known, Program2.py (presented in the Appendix) can crawl through the EDGAR Master Index and obtain it before performing the word count. In this program the inputs are the CIK, the year, and the SEC form to be retrieved. The output of the program is the URL path and the word count in Amazon's 10-K filing in 2013. These outputs are shown in the Appendix after each program.

Next, I expand the previous programs to search for SEC filings of multiple companies and perform the word count for each company filing.[3] I look for the words indicating uncertainty in the 10-K filings of Microsoft and its peers in 2015, as reported in the proxy statement of Microsoft (SEC filing DEF 14A). Table 1 provides the names of the companies and the corresponding years for which the program searches the 10-K filings for the key words. One limitation is that each name must be provided in the exact format in which it appears in the SEC filing in order to be matched.

Table 1. List of company names and corresponding years.

<table><thead><tr><td>Name</td><td>Year</td></tr></thead><tbody><tr><td>Microsoft Corp.</td><td>2015</td></tr><tr><td>Accenture PLC</td><td>2015</td></tr><tr><td>Adobe Systems</td><td>2015</td></tr><tr><td>Amazon.com</td><td>2015</td></tr><tr><td>Apple Inc.</td><td>2015</td></tr><tr><td>Cisco Systems</td><td>2015</td></tr><tr><td>EMC Corp.</td><td>2015</td></tr><tr><td>Facebook Inc.</td><td>2015</td></tr><tr><td>Google Inc.</td><td>2015</td></tr><tr><td>Hewlett Packard Co.</td><td>2015</td></tr><tr><td>International Business Machines Corp.</td><td>2015</td></tr><tr><td>Intel Corp.</td><td>2015</td></tr><tr><td>Oracle Corp.</td><td>2015</td></tr><tr><td>Qualcomm Inc.</td><td>2015</td></tr><tr><td>Symantec Corp.</td><td>2015</td></tr><tr><td>Yahoo Inc.</td><td>2015</td></tr></tbody></table>

Note. The table reports Microsoft Corp. and its 2015 technology peer group as reported in the proxy statement (DEF 14A SEC filing). It provides the list of company names and corresponding years for which I look up the company Central Index Key and then the URL path for the Form 10-K filing.

The output of the program is constructed in the format of Table 2, where it reports the CIK, the quarter for which the 10-K is obtained in 2015, and the number of occurrences of each key word. If the company does not report filing a 10-K for the specified year for any of the four quarters, the program will output the symbol (*) in the corresponding cell(s). For example, HEWLETT PACKARD CO did not report a 10-K filing in 2015 for any of the four quarters, so the report outputs (*) for each quarter. The program will also output (*) if it is unable to detect any company with the specified name.

Table 2. Count of uncertainty words extracted from the form 10-K filings.

<table><thead><tr><td>Name</td><td>CIK</td><td>Quarter</td><td>Anticipate</td><td>Believe</td><td>Depend</td><td>Fluctuate</td><td>Indefinite</td><td>Likelihood</td><td>Possible</td><td>Predict</td><td>Risk</td><td>Uncertain</td></tr></thead><tbody><tr><td>Microsoft Corp.</td><td>789019</td><td>3</td><td>12</td><td>31</td><td>4</td><td>0</td><td>0</td><td>2</td><td>9</td><td>1</td><td>53</td><td>16</td></tr><tr><td>Accenture PLC</td><td>1467373</td><td>4</td><td>2</td><td>23</td><td>6</td><td>1</td><td>6</td><td>0</td><td>21</td><td>4</td><td>55</td><td>24</td></tr><tr><td>Adobe Systems</td><td>796343</td><td>1</td><td>14</td><td>80</td><td>3</td><td>5</td><td>1</td><td>5</td><td>27</td><td>6</td><td>92</td><td>9</td></tr><tr><td>Amazon.com</td><td>1018724</td><td>1</td><td>0</td><td>25</td><td>2</td><td>4</td><td>0</td><td>15</td><td>19</td><td>2</td><td>31</td><td>11</td></tr><tr><td>Apple Inc.</td><td>320193</td><td>4</td><td>3</td><td>7</td><td>11</td><td>1</td><td>12</td><td>16</td><td>14</td><td>0</td><td>113</td><td>11</td></tr><tr><td>Cisco Systems</td><td>858877</td><td>3</td><td>5</td><td>47</td><td>7</td><td>3</td><td>24</td><td>9</td><td>30</td><td>6</td><td>110</td><td>20</td></tr><tr><td>EMC Corp.</td><td>790070</td><td>1</td><td>4</td><td>46</td><td>2</td><td>7</td><td>0</td><td>14</td><td>9</td><td>6</td><td>82</td><td>10</td></tr><tr><td>Facebook Inc.</td><td>1326801</td><td>1</td><td>11</td><td>76</td><td>5</td><td>3</td><td>2</td><td>6</td><td>30</td><td>3</td><td>25</td><td>23</td></tr><tr><td>Google Inc.</td><td>1288776</td><td>1</td><td>6</td><td>37</td><td>2</td><td>7</td><td>0</td><td>6</td><td>36</td><td>3</td><td>66</td><td>10</td></tr><tr><td>Hewlett Packard Co.</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td><td>*</td></tr><tr><td>International Business Machines Corp.</td><td>51143</td><td>1</td><td>3</td><td>3</td><td>6</td><td>0</td><td>0</td><td>11</td><td>29</td><td>11</td><td>148</td><td>11</td></tr><tr><td>Intel Corp.</td><td>50863</td><td>1</td><td>2</td><td>46</td><td>0</td><td>6</td><td>2</td><td>15</td><td>20</td><td>0</td><td>131</td><td>25</td></tr><tr><td>Oracle Corp.</td><td>1341439</td><td>2</td><td>4</td><td>64</td><td>13</td><td>3</td><td>20</td><td>13</td><td>44</td><td>2</td><td>74</td><td>44</td></tr><tr><td>Qualcomm Inc.</td><td>804328</td><td>4</td><td>5</td><td>27</td><td>8</td><td>2</td><td>2</td><td>21</td><td>22</td><td>20</td><td>68</td><td>14</td></tr><tr><td>Symantec Corp.</td><td>849399</td><td>2</td><td>3</td><td>26</td><td>1</td><td>5</td><td>1</td><td>22</td><td>43</td><td>1</td><td>37</td><td>29</td></tr><tr><td>Yahoo Inc.</td><td>1011006</td><td>1</td><td>8</td><td>26</td><td>3</td><td>3</td><td>3</td><td>0</td><td>39</td><td>1</td><td>113</td><td>34</td></tr></tbody></table>

Note. The table displays the counts of the uncertainty words listed in the Results section as they appear in the 10-K filings of MICROSOFT CORP and its 2015 technology peer group as reported in the proxy statement (DEF 14A SEC filing). The table also provides the Central Index Key (CIK) number of each company and the quarter of 2015 in which its 10-K filing is available.
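The layout of Table 2, including the (*) placeholder for a company without a 10-K filing, could be produced along the following lines. This is a hedged Python 3 sketch rather than the author's output routine: the helper names `result_row` and `write_table` and the comma-separated output format are assumptions, while the Microsoft counts are taken from Table 2.

```python
import csv
import io

WORDS = ["anticipate", "believe", "depend", "fluctuate", "indefinite",
         "likelihood", "possible", "predict", "risk", "uncertain"]


def result_row(name, result, words=WORDS):
    """Build one table row; result is None or a (cik, quarter, counts) tuple."""
    if result is None:
        # No CIK or no filing found: fill every remaining cell with '*'.
        return [name] + ["*"] * (2 + len(words))
    cik, quarter, counts = result
    return [name, cik, quarter] + [counts[word] for word in words]


def write_table(rows, words=WORDS):
    """Return the header and rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "CIK", "Quarter"] + [w.capitalize() for w in words])
    writer.writerows(rows)
    return buffer.getvalue()


# Microsoft's counts as reported in Table 2.
microsoft_counts = dict(zip(WORDS, [12, 31, 4, 0, 0, 2, 9, 1, 53, 16]))
rows = [
    result_row("Microsoft Corp.", ("789019", 3, microsoft_counts)),
    result_row("Hewlett Packard Co.", None),
]
table_text = write_table(rows)
print(table_text)
```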

Summary

With the advancement of computational power and machine learning tools, gathering, processing, and analyzing data efficiently has become one of the essential elements for advancement both in research and in practice. Traditionally, researchers and practitioners would rely on data from COMPUSTAT, Execucomp, or SDC to get valuable data on companies' performance metrics, executive compensation, and mergers and acquisitions deals. With programs such as Python and R it is now easy to web crawl and gather specific or unique data from SEC EDGAR filings, news articles, and social media such as Twitter and analyze or predict economic, social, or consumer behavior. In this article I describe Python programs for beginners that allow data to be scraped from SEC filings for further analysis. The codes can be modified to meet the needs of the researcher to obtain data from any SEC filings. The techniques described in this article could be used in a graduate-level course on data analytics that has a web-crawling component and to explore the abundance of data in SEC filings.

Appendix

<h31 id="AN0123351974-10">Program 1</h31>

# Given a URL path for an EDGAR 10-K file in .txt format for a company (CIK) in a year, this code performs a word count

import urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/'+string_match1

#load the page for the given url into response3
#urllib2 is a Python 2 module for fetching URLs (urllib.request in Python 3)
response3 = urllib2.urlopen(url3)

#words: list of uncertainty words from Loughran and McDonald ([3])
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite', 'likelihood', 'possible', 'predict', 'risk', 'uncertain']

count = {} # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0

#The method split() returns a list of all the words in the string
#The split function splits a single string into a string array using
#the separator defined.
#If no separator is defined, whitespace is used.
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word]+elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)

<h31 id="AN0123351974-11">Program1.py output:</h31>

0001018724 #CIK of AMAZON COM

2013 #Given YEAR

### URL address of 10-K filing

https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

# Number of occurrences of uncertainty words

{'believe': 28, 'depend': 2, 'risk': 25, 'predict': 5, 'likelihood': 15, 'anticipate': 0, 'possible': 24, 'uncertain': 10, 'indefinite': 0, 'fluctuate': 3}

<h31 id="AN0123351974-12">Program 2</h31>

# Given a Master Index URL of EDGAR, find the path of the raw text filing
# (10-K) for AMAZON COM (CIK 1018724) in year 2013

import urllib2
import time
import csv
import sys

CIK = '1018724' #AMAZON COM
Year = '2013' #GIVEN
FILE = '10-K' #GIVEN

#####Get the Master Index File for the given Year
url = 'https://www.sec.gov/Archives/edgar/full-index/%s/QTR1/master.idx' %(Year)
response = urllib2.urlopen(url)

string_match1 = 'edgar/data/'
element2 = None
element3 = None
element4 = None

###Go through each line of the master index file and find given CIK
###and FILE (10-K) and extract the text file path
#Note: this is a substring test, so lines for forms such as 10-K/A also
#match, and the last matching line determines element4
for line in response:
    if CIK in line and FILE in line:
        for element in line.split(' '):
            if string_match1 in element:
                element2 = element.split('|')
                for element3 in element2:
                    if string_match1 in element3:
                        #strip() removes the newline at the end of the index line
                        element4 = element3.strip()

# The path of the 10-K filing
url3 = 'https://www.sec.gov/Archives/'+element4
response3 = urllib2.urlopen(url3)

words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite', 'likelihood', 'possible', 'predict', 'risk', 'uncertain']

count = {}
for elem in words:
    count[elem] = 0

for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word]+elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)

<h31 id="AN0123351974-13">Program2.py output:</h31>

0001018724 #CIK of AMAZON COM

2013 #Given YEAR

### URL address of 10-K filing

https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

# Number of occurrences of uncertainty words

{'believe': 28, 'depend': 2, 'risk': 25, 'predict': 5, 'likelihood': 15, 'anticipate': 0, 'possible': 24, 'uncertain': 10, 'indefinite': 0, 'fluctuate': 3}
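The Appendix programs rely on urllib2, which exists only in Python 2. For readers on Python 3, the fetch step might be sketched as follows. This is an assumption-laden illustration, not part of the published programs: the helper names `build_request` and `fetch_lines` are hypothetical, the contact string is a placeholder, and the explicit User-Agent header reflects the SEC's request that automated tools identify themselves, which readers should check against the current EDGAR access guidance.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that identifies the caller via a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_lines(url, user_agent):
    """Download a text resource and return its lines as str objects."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response:
        # Python 3 returns bytes; decode before splitting into words.
        text = response.read().decode("utf-8", errors="replace")
    return text.splitlines()


# Placeholder contact string; replace with your own name and e-mail address.
USER_AGENT = "Example Researcher researcher@example.com"
request = build_request(
    "https://www.sec.gov/Archives/edgar/full-index/2013/QTR1/master.idx",
    USER_AGENT,
)
print(request.full_url)
# To download the index (requires network access):
# lines = fetch_lines(request.full_url, USER_AGENT)
```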

Footnotes

1 See Loughran and McDonald ([4]) for a survey on textual analysis.

2 Due to the length of Program 3, these codes are not provided in the article and are available upon request.

3 These codes are not provided in the article and are available upon request.

4 Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/vjeb.

References

1 García, D., & Norli, Ø. (2012). Crawling EDGAR. The Spanish Review of Financial Economics, 10(1), 1–10.

2 Engelberg, J., & Sankaraguruswamy, S. (2007). How to gather data using a web crawler: An application using SAS to search EDGAR. Working paper, Northwestern University and National University of Singapore.

3 Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66, 35–65.

4 Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54, 1187–1230.

5 Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62, 1139–1168.

6 Tetlock, P. C., Saar-Tsechansky, M., & Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. The Journal of Finance, 63, 1437–1467.

7 U.S. Securities and Exchange Commission. (2017). Accessing EDGAR data. Retrieved from https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm

By Rasha Ashraf

Reported by Author