Data & Teaching‎ > ‎

Linking CRSP and Compustat

Drafted by Rui Dai in Jun 2016 (Revised at Oct 2017)
In empirical finance and accounting literature, CRSP and Compustat have been the two most widely used database since early 90’s. From 2001 to 2015, the combined web queries of those two databases together occupy more than 80% web traffic at WRDS. In this WRDS Research Application, we provide a general guidance to link those two databases at security level. This rest of this application is organized as following:
  1. Preliminary: security identifiers
  2. Data Structure: CRSP v.s. Compustat
  3. Matching Implementation:
    • Linking by CUSIP
    • Linking by CCM Linktable
1. Preliminary: Security Identifiers
Among all commonly used security/company identifiers, TICKER and CUSIP are probably the most often mentioned ones. A stock TICKER is an abbreviation used to uniquely identify publicly traded shares of a particular stock, while CUSIP is a nine-character alphanumeric code assigned by CUSIP bureau to identify various securities. There are a few problems associated with those two identifiers for researchers to match historical records. First of all, neither identifier is permanent. A change in a security’s CUSIP, for example, could be triggered by any change in the security, including non-fundamental events such as splits and name changes. In addition, retired stock TICKER are often reused. As shown in Table below, both TICKER and CUSIP are changed 3 times, while the name of the company is also changed once. 

Date

CUSIP

Ticker

Name

CRSP Security Identifier

(PERMNO)

S&P Identifier

(GVKEY)

S&P Security-level

Sub-Identifier (IID)

20060630

67461T10

RHEO

Occulogix Inc

90488

161071

01

20060901

67461T10

OCCX

Occulogix Inc

90488

161071

01

20081106

67461T20

OCCX

Occulogix Inc

90488

161071

01

20090116

67461T20

TEAR

Occulogix Inc

90488

161071

01

20100521

87819310

TEAR

Tearlab Corp

90488

161071

01


For those reasons, most databases create permanent identifiers to indicate the same securities and companies. For example, CRSP employs PERMNO to track stock, while Compustat uses GVKEY to follow companies and a combination of GVKEY and IID to track security. As indicated in Table above, both PERMNO and GVKEY (+IID) remain the same, regardless of changes in TICKER, CUSIP, and Company Names. 

2. Data Structure:
To understand the matching mechanism and results, researchers should keep the structural differences of two databases in mind. The Table below demonstrates four major structural differences between CRSP and Compustat: 

 

CRSP

Compustat North American

Time

From 1925[1]

From 1950

Universe

Firms listed in USA Major Exchanges

  • NYSE, NASDAQ, AMEX, and ARCA

USA (10K filers) and Canada Firms

  • listed in major USA or Canada Exchanges,
  • listed in regional exchanges,
  • with shares traded as OTC, or
  • with certain amount of public bonds

Frequency

Daily or Monthly

Quarterly and Annually

Inclusion

Securities are included since IPO

Traditional, Compustat includes a firm only after a few years the firm publicly disclose its financial statement.[2]

Those differences typically create three kinds of unmatchable scenarios.

  1. Prior to 1950, trading records (CRSP) cannot be matched to fundamental items (Compustat).
  2. The fundamental items cannot be matched to trading records, when firms are listed in regional exchanges, are traded as OTC, and/or only have bonds publicly traded.
  3. The trading records cannot be matched to fundamental items, when those stocks did not survive long enough (e.g. M&A) for Compustat to include its fundamental items.

3. Matching Implementation:

Merge by CUSIP

CUSIP is a nine-character alphanumeric identifier for publicly traded securities, and is created by CUSIP Service Bureau in 1968.Unlike TICKER, CUSIP is not a reusable security identifier, so that it has been widely used to match different databases. CRSP and Compustat used slightly different approach to keep CUSIP records. CRSP drops the 9th checksum digit in CUSIP and keeps historical CUSIPs, while Compustat use full CUSIP and keep only the most current CUSIP.[3] The following is a SAS sample code to match CRSP and Compustat at security level by CUSIP:

proc sql;

      create table cusip_link as

      select distinct a.permno, intnx('month',date,0,'e') as date format=date9.,

      round(abs(prc), 0.01) as prc, c.gvkey, c.iid

      from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b

      /*Keep All Possible Common Stocks and ARDs*/

      on a.permno=b.permno

      inner join comp.security as c

      on substr(c.cusip,1,8)=substr(b.ncusip,1,8) and lengthn(strip(ncusip))>6

      inner join (

            select distinct gvkey, datadate from comp.funda

            where sale>. or at>.  /*Keep company with valid Total Asset or Sales Figure*/

      ) as d

      on d.gvkey=c.gvkey and

      intnx('month', datadate, -24, 'e')<a.date<=intnx('month', datadate, 12, 'e');

      /*Keep valid CUSIP linked PERMNO and GVKEY pairs around fiscal year ends*/

quit;

 There are three major limitations to match CRSP and Compustat by CUSIP:

  1.  Prior to 1968, CUSIP does not exist, so that the CUSIP-based match is only reliable from 1968 onward.
  2. CRSP only keep historical CUSIPs when the securities are traded in major exchanges. If the most recent CUSIP in Compustat is from regional or OTC market, CRSP and Compustat records would not be matched properly.
  3. CRSP and Compustat has different data refreshing frequency. The most recent CUSIP in one databases may not be synchronized with the most recent CUSIP in the other, which may in term creates certain miss matchings as well.

Merge by CCM Linktable

The CRSP/Compustat Merged Database (CCM) is comprised of CRSP and Compustat data together with the link and link-history references between these two databases. The key product of CCM is merely a permanent identifier linking table (PERMNO to GVKEY+IID), though CCM is often wrongly considered as merged product of CRSP stock market data with Compustat accounting data. From an end user’s perspective, CCM only adds a link of PERMNO and PERMCO to Compustat database, so that Compustat items can be searched by CRSP's identifiers. In the link table (Ccmxpf_linktable), there are a few indicators for the quality of matching:

 LinkType: A code describing the connection between the CRSP and Compustat data.

  • LU: Link research complete. Standard connection between databases.
  • LC: Un-researched link to issue by CUSIP.
  • LD: Duplicate link to a security. Another GVKEY/IID is a better link to that CRSP record
  •  

 LinkPrim: Primary issue indicator for the link

  • P: Primary, identified by Compustat in monthly security data.
  • C: Primary, assigned by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history.
  •   ...

UsedFlag: Flag marking whether link is used in building composite record

  • 1: this link is applicable to the selected PERMNO and used to identify ranges of Compustat data from a GVKEY used to build a composite GVKEY record corresponding to the PERMNO.
  • -1 = this link is informational, indirectly related to the PERMNO, but not used.
  • USEDFLAG is a legacy indicator which is kept for programing and data consistency at WRDS.

LINKDT: The first effective date of the link

LINKENDDT: The last date when the link is valid. For current valid link, LINKENDDT is set a NULL value (.E in SAS Format).

The following is a sample SAS code to match CRSP and Compustat at security level by CCM Linktable:

proc sql;

      create table ccm_link as

      select distinct a.*, c.gvkey, liid as iid

      from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b

      on a.permno=b.permno

      inner join crsp.Ccmxpf_linktable(where=(linktype in ('LC' 'LU') and

      linkprim in ('P' 'C' 'J') and USEDFLAG=1)) as c

      /*Both Primary and Secondary Shares*/

      on a.permno=c.lpermno and linkdt<=date<=coalesce(linkenddt, today())

      inner join (

            select distinct gvkey from comp.funda

            where sale>. or at>. 

      ) as d

      on c.gvkey=d.gvkey;

quit;

Numbers of companies from stock level matching by CCM Linktable and CUSIP through years are plotted in graph below. Due to various limitations of CUSIP matching mentioned above, CCM always provides greater numbers of matched results, and the two results converge in most recent years. In addition, I provides a code (PCCM.sas) that generates a combination of CUSIP match and successive price matching. As mentioned above, this approach is inspirited by the CCM LinkType variable. The code is in Alpha version stage, certain cautions are required when use. The numbers of companies matched by using the linking table generated by using this code is also reported in the figure below. As shown in the picture, the price matching does improve the quality of security matching, though CCM still provide the best results, especially in late 80's and early 90's. 

 

It is worth to notice one company may have multiple classes of common equities. And at company level, CRSP and Compustat occasionally disagree on the surviving company from merges. In these cases, a researcher could disagree with the stand taken by the CCM set in resolving the disagreement or ambiguity about a surviving company.


[1] In fact, CRSP includes stocks listed in different exchanges through stages. In particular, CRSP adds stocks listed NYSE, NYSE MKT (AMEX), NASDAQ, and ARCA in 1925, 1962, 1972, and 2006 respectively.

[2] Early researches, for example Kothari, Shanken and Sloan (1995), find a selection bias (survivorship bias) to the way in which firms are added to the COMPUSTAT database. In particular, Compustat adds a new firm into its system only when the firm has reported its financial statement for multiple years, which is sometimes called backfilling bias.  

[3] Compustat also provides historical CUSIPs, TICKERs, etc. in its Snapshot product, though historical identifiers are not available in their standard fundamental data feed.