How to access WRDS in Python
In this page, I explain how to work with the WRDS database using Python.
Setup
The first thing we need to do, is to set up a connection to the WRDS database. I am assuming you have credentials to log in. Check the log in page to make sure.
The second requirement is the wrds Python package.
pip3 install wrds
Now, in order to connect to the WRDS database, you just need to run the following commang in Python.
import wrds
db = wrds.Connection()
Then, you will be propted to input your WRDS username and password.
However, if you are using a Python IDE such as PyCharm, you cannot run the command from the Python Console. Moreover, you might want to save your credentials once and for all, so that you don’t have to log in every time.
First, walk to your home directory from the Terminal (/Users/username
).
cd
Now create an empty .pgpass
file.
touch .pgpass
Now you write your_username
and your_password
into the .pgpass
file.
echo "wrds-pgdata.wharton.upenn.edu:9737:wrds:your_username:your_password" >> .pgpass
You also need to restrict permissions to the file.
chmod 600 ~/.pgpass
Now you can go back to your Python IDE and access the database by just inputing your username.
import wrds
db = wrds.Connection(wrds_username='your_username')
If everything works, you should see the following output.
Loading library list...
Done
Query
The available functions are:
db.connection()
db.list_libraries()
db.list_tables()
db.get_table()
db.describe_table()
db.raw_sql()
db.close()
I make a simple example of how they work. Suppose first you want to list all the libraries in the WRDS database.
db.list_libraries()
Then you can list all the datasets within a given library.
db.list_tables(library='comp')
Before downloading a table, you can describe it.
df = db.describe_table(library='comp', table='funda')
To download the dataset you can use the get_table()
function.
df = db.get_table(library='comp', table='funda')
You can restrict both the rows and the columns you want to query.
df_short = db.get_table(library='comp', table='funda', columns = ['conm', 'gvkey', 'cik'], obs=5)
You can also query the database directly using SQL.
df_sql = db.raw_sql('''select conm, gvkey, cik FROM comp.funda WHERE fyear>2010 AND (indfmt='INDL')''')