Many people doing research in finance use WRDS as a gateway to high-quality financial data. Think CRSP for stock prices, Compustat for accounting info, or IBES for stock analyst forecasts.
For some time now, it has been possible to download data from WRDS using Python. This approach has many benefits:
writing SQL queries from within Python;
downloading data into pandas dataframes;
The last point is a big one. With Python, it becomes very easy to write papers that others can reproduce with a single push of a button—from downloading data to running models.
To use the WRDS-Python magic, first download the WRDS package from GitHub (link) that is kindly provided by the folks at Wharton. The package mostly just connects you to the WRDS PostgreSQL database.
To illustrate the workflow, let me show how you may use Python to grab some data from WRDS and run a simple analysis.
First, set up some packages:
Here, matplotlib is for graphs, pandas is for dataframes, and seaborn makes the graphs just a bit more better looking.
Now, connect to WRDS (you will be prompted for your login details):
Say you want to calculate total earnings of US publicly traded companies relative to their market cap, a measure of how expensive the overall stock market is. To do this, grab some data from Compustat:
The query is a basic PostgreSQL statement. We select a few variables from the Compustat annual table (subject to a few conditions: total assets and liabilities are positive, no duplicates, and only US-incorporated firms). Then, we ask pandas to retrieve the data.
Now, the downloaded dataframe contains data on earnings, stock prices and shares outstanding for every publicly traded firm in the US. Since we are only interested in the aggregate stock market, we need to add up the numbers for every firm:
Here is the resulting picture:
The stock market has evidently become much more expensive since 1980.
Now, if we are only interested in the graph above, it is a bit silly to download the data for every firm in the Compustat database. Instead, why not ask PostgreSQL to do the calculations on the WRDS server? That is typically a much more faster and efficient approach.
Here is the code to do just that:
The difference from the previous snippet is that now all of the aggregation is done on the WRDS server via PosgreSQL. The downloaded dataframe is already aggregated.
If you plot the result, here is what you see:
Phew, the graph looks identical to the previous one, as it should.
If you use WRDS in your work, see if the Python-WRDS combo makes sense for you. For me personally, it has been a real game changer. Some shameless self-promotion: My recent paper on dividends gives a full example of the workflow.
To get the full Jupyter notebook, click here.
(By the way, I used the awesome hilite service to get the nicely formated Python code above. Wix, which I am using for this website, really does not shine when it comes to things like this.)