Linking CRSP and Compustat

Drafted by Rui Dai in Jun 2016 (Revised at Oct 2017)

In empirical finance and accounting literature, CRSP and Compustat have been the two most widely used database since early 90’s. From 2001 to 2015, the combined web queries of those two databases together occupy more than 80% web traffic at WRDS. In this WRDS Research Application, we provide a general guidance to link those two databases at security level. This rest of this application is organized as following:

  1. Preliminary: security identifiers

  2. Data Structure: CRSP v.s. Compustat

  3. Matching Implementation:

    • Linking by CUSIP

    • Linking by CCM Linktable

1. Preliminary: Security Identifiers

Among all commonly used security/company identifiers, TICKER and CUSIP are probably the most often mentioned ones. A stock TICKER is an abbreviation used to uniquely identify publicly traded shares of a particular stock, while CUSIP is a nine-character alphanumeric code assigned by CUSIP bureau to identify various securities. There are a few problems associated with those two identifiers for researchers to match historical records. First of all, neither identifier is permanent. A change in a security’s CUSIP, for example, could be triggered by any change in the security, including non-fundamental events such as splits and name changes. In addition, retired stock TICKER are often reused. As shown in Table below, both TICKER and CUSIP are changed 3 times, while the name of the company is also changed once.

For those reasons, most databases create permanent identifiers to indicate the same securities and companies. For example, CRSP employs PERMNO to track stock, while Compustat uses GVKEY to follow companies and a combination of GVKEY and IID to track security. As indicated in Table above, both PERMNO and GVKEY (+IID) remain the same, regardless of changes in TICKER, CUSIP, and Company Names.


2. Data Structure:

To understand the matching mechanism and results, researchers should keep the structural differences of two databases in mind. The Table below demonstrates four major structural differences between CRSP and Compustat:

Those differences typically create three kinds of unmatchable scenarios.

  1. Prior to 1950, trading records (CRSP) cannot be matched to fundamental items (Compustat).

  2. The fundamental items cannot be matched to trading records, when firms are listed in regional exchanges, are traded as OTC, and/or only have bonds publicly traded.

  3. The trading records cannot be matched to fundamental items, when those stocks did not survive long enough (e.g. M&A) for Compustat to include its fundamental items.

3. Matching Implementation:

Merge by CUSIP

CUSIP is a nine-character alphanumeric identifier for publicly traded securities, and is created by CUSIP Service Bureau in 1968.Unlike TICKER, CUSIP is not a reusable security identifier, so that it has been widely used to match different databases. CRSP and Compustat used slightly different approach to keep CUSIP records. CRSP drops the 9th checksum digit in CUSIP and keeps historical CUSIPs, while Compustat use full CUSIP and keep only the most current CUSIP.[3] The following is a SAS sample code to match CRSP and Compustat at security level by CUSIP:

proc sql;

create table cusip_link as

select distinct a.permno, intnx('month',date,0,'e') as date format=date9.,

round(abs(prc), 0.01) as prc, c.gvkey, c.iid

from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b

/*Keep All Possible Common Stocks and ARDs*/

on a.permno=b.permno

inner join comp.security as c

on substr(c.cusip,1,8)=substr(b.ncusip,1,8) and lengthn(strip(ncusip))>6

inner join (

select distinct gvkey, datadate from comp.funda

where sale>. or at>. /*Keep company with valid Total Asset or Sales Figure*/

) as d

on d.gvkey=c.gvkey and

intnx('month', datadate, -24, 'e')<a.date<=intnx('month', datadate, 12, 'e');

/*Keep valid CUSIP linked PERMNO and GVKEY pairs around fiscal year ends*/

quit;

There are three major limitations to match CRSP and Compustat by CUSIP:

  1. Prior to 1968, CUSIP does not exist, so that the CUSIP-based match is only reliable from 1968 onward.

  2. CRSP only keep historical CUSIPs when the securities are traded in major exchanges. If the most recent CUSIP in Compustat is from regional or OTC market, CRSP and Compustat records would not be matched properly.

  3. CRSP and Compustat has different data refreshing frequency. The most recent CUSIP in one databases may not be synchronized with the most recent CUSIP in the other, which may in term creates certain miss matchings as well.

Merge by CCM Linktable

The CRSP/Compustat Merged Database (CCM) is comprised of CRSP and Compustat data together with the link and link-history references between these two databases. The key product of CCM is merely a permanent identifier linking table (PERMNO to GVKEY+IID), though CCM is often wrongly considered as merged product of CRSP stock market data with Compustat accounting data. From an end user’s perspective, CCM only adds a link of PERMNO and PERMCO to Compustat database, so that Compustat items can be searched by CRSP's identifiers. In the link table (Ccmxpf_linktable), there are a few indicators for the quality of matching:

LinkType: A code describing the connection between the CRSP and Compustat data.

  • LU: Link research complete. Standard connection between databases.

  • LC: Un-researched link to issue by CUSIP.

  • LD: Duplicate link to a security. Another GVKEY/IID is a better link to that CRSP record

LinkPrim: Primary issue indicator for the link

  • P: Primary, identified by Compustat in monthly security data.

  • C: Primary, assigned by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history.

  • ...

UsedFlag: Flag marking whether link is used in building composite record

  • 1: this link is applicable to the selected PERMNO and used to identify ranges of Compustat data from a GVKEY used to build a composite GVKEY record corresponding to the PERMNO.

  • -1 = this link is informational, indirectly related to the PERMNO, but not used.

  • USEDFLAG is a legacy indicator which is kept for programing and data consistency at WRDS.

LINKDT: The first effective date of the link

LINKENDDT: The last date when the link is valid. For current valid link, LINKENDDT is set a NULL value (.E in SAS Format).

The following is a sample SAS code to match CRSP and Compustat at security level by CCM Linktable:

proc sql;

create table ccm_link as

select distinct a.*, c.gvkey, liid as iid

from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b

on a.permno=b.permno

inner join crsp.Ccmxpf_linktable(where=(linktype in ('LC' 'LU') and

linkprim in ('P' 'C' 'J') and USEDFLAG=1)) as c

/*Both Primary and Secondary Shares*/

on a.permno=c.lpermno and linkdt<=date<=coalesce(linkenddt, today())

inner join (

select distinct gvkey from comp.funda

where sale>. or at>.

) as d

on c.gvkey=d.gvkey;

quit;

Numbers of companies from stock level matching by CCM Linktable and CUSIP through years are plotted in graph below. Due to various limitations of CUSIP matching mentioned above, CCM always provides greater numbers of matched results, and the two results converge in most recent years. In addition, I provides a code (PCCM.sas below) that generates a combination of CUSIP match and successive price matching. As mentioned above, this approach is inspirited by the CCM LinkType variable. The code is in Alpha version stage, certain cautions are required when use. The numbers of companies matched by using the linking table generated by using this code is also reported in the figure below. As shown in the picture, the price matching does improve the quality of security matching, though CCM still provide the best results, especially in late 80's and early 90's.

It is worth to notice one company may have multiple classes of common equities. And at company level, CRSP and Compustat occasionally disagree on the surviving company from merges. In these cases, a researcher could disagree with the stand taken by the CCM set in resolving the disagreement or ambiguity about a surviving company.


[1] In fact, CRSP includes stocks listed in different exchanges through stages. In particular, CRSP adds stocks listed NYSE, NYSE MKT (AMEX), NASDAQ, and ARCA in 1925, 1962, 1972, and 2006 respectively.

[2] Early researches, for example Kothari, Shanken and Sloan (1995), find a selection bias (survivorship bias) to the way in which firms are added to the COMPUSTAT database. In particular, Compustat adds a new firm into its system only when the firm has reported its financial statement for multiple years, which is sometimes called backfilling bias.

[3] Compustat also provides historical CUSIPs, TICKERs, etc. in its Snapshot product, though historical identifiers are not available in their standard fundamental data feed.

/******************** PCCM.SAS ********************/

/********************Author: Rui Dai********************/

proc sql;

create table Secm (where=(prccm>0)) as

select distinct a.gvkey, a.iid, intnx('month',datadate,0,'e') as datadate format=date9.,

round(prccm, 0.01) as prccm

from comp.Secm(where=(lengthn(strip(iid))=2))/*USA Firm Only*/ as a

inner join (

select distinct gvkey from comp.funda

where sale>. or at>. /*Firm with Sales or Total Asset only*/

) as b

on a.gvkey=b.gvkey;

quit;

/*CRSP Monthly Security File Cleansing*/

proc sql;

create table msf(where=(prc>0)) as

select distinct a.permno, intnx('month',date,0,'e') as date format=date9.,

round(abs(prc), 0.01) as prc

from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b

on a.permno=b.permno;

quit;

/*CUSIP Matching*/

proc sql;

create table cusip_match as

select distinct a.gvkey, a.iid, b.permno

from comp.security as a inner join crsp.msenames as b

on substr(a.cusip,1,8)=substr(b.ncusip,1,8)

inner join (select distinct permno from msf) as c

on b.permno=c.permno

inner join (select distinct gvkey, iid from secm) as d

on a.gvkey=d.gvkey and a.iid=d.iid;

quit;

/*A Blind Price Matching*/

proc sql;

create table _pccm_ as

select distinct a.gvkey, permno, a.iid

from Secm as a inner join msf as b

on datadate=date and prc>0.5 and prc=prccm /*Removing very long price matching*/

group by b.permno, a.gvkey, a.iid

having count(distinct datadate)> 7 ; /*Parameter 2*/

quit;

/*Refine matching with time dimension*/

proc sql;

create table meta as

select distinct a.permno, intnx('month',date,0,'e') as DATE format=date9.,

b.gvkey, b.iid, prc, prccm, prc>0.5 and prc=prccm as sdum /*Removing very long price matching*/

from msf as a inner join _pccm_ as b

on a.permno=b.permno

inner join Secm as c

on b.gvkey=c.gvkey and b.iid=c.iid and datadate=date

order by permno, gvkey, iid, date;

quit;

/*Counting backwards to fine-tuning the matched point*/

PROC EXPAND data=meta OUT=meta METHOD=NONE;

by permno gvkey iid;

CONVERT sdum = MS_sdum / TRANSFORM=(MOVSUM 15); /*Parameter 2*/

id date;

RUN;

/* Counting forwards for fine-tuning the matched point (Three steps:)*/

proc sort data=meta out=meta; by permno gvkey iid descending date; run;

data meta; set meta; by permno gvkey iid descending date;

idx+1;

if first.iid then idx=1;

run;

PROC EXPAND data=meta OUT=meta METHOD=NONE;

by permno gvkey iid;

CONVERT MS_sdum = MS_mdum / TRANSFORM=(MOVmax 15); /*Parameter 2*/

id idx;

RUN;

/*Keep consequent matched price and date points*/

proc sql undo_policy=none;

create table meta as

select distinct *, sum(sdum) as hcnt "historical matched count"

from meta(drop=idx)

where MS_mdum>=1 and MS_sdum>=1

group by permno, gvkey, iid

having max(ms_sdum)> 7 /*Parameter 2*/

order by permno, gvkey, iid, date;

quit;

/*Among the remained matched ones, count the closest spelling distance*/

proc sql;

create table mname_match as

select distinct a.gvkey, a.iid, a.permno, spedis(upcase(compress(COMNAM,,'sp')), upcase(compress(conm,,'sp'))) as spedis

from (select distinct gvkey, iid, permno from meta) as a

left join crsp.msenames as b

on a.permno=b.permno

left join comp.company as c

on a.gvkey=c.gvkey

group by a.permno, a.gvkey, a.iid

having spedis(COMNAM, conm)=max(spedis(COMNAM, conm));

quit;

/*Among the remained matched ones, check whether same tickers are evered shared*/

proc sql;

create table tick_match as

select distinct a.gvkey, a.iid, a.permno, scan(upcase(b.ticker),1) = scan(upcase(c.tic),1) as tmatched

from (select distinct gvkey, iid, permno from meta) as a

left join crsp.msenames(where=(lengthn(TICKER)>0)) as b

on a.permno=b.permno

left join comp.Security(where=(lengthn(tic)>0)) as c

on a.gvkey=c.gvkey and a.iid=c.iid

group by a.permno, a.gvkey, a.iid

having (scan(upcase(b.ticker),1) = scan(upcase(c.tic),1))

=max(scan(upcase(b.ticker),1) = scan(upcase(c.tic),1));

quit;

/*Matched name and ticker measures back to consequent matching results and

construct key monthly statistics*/

proc sql;

create table _pccm_ as

select distinct a.*, max(hcnt) as mhcnt, max(MS_sdum) as mms_sdum, max(ms_mdum) as mms_mdum,

count(distinct catt(a.gvkey, a.iid)) as gcnt, b.permno>. as cusip_matched, spedis,

max(b.permno>.) as mcusip_matched, tmatched, max(tmatched) as mtmatched

from meta as a left join cusip_match as b

on a.permno=b.permno and a.gvkey=b.gvkey and a.iid=b.iid

left join mname_match as c

on a.permno=c.permno and a.gvkey=c.gvkey and a.iid=c.iid

left join Tick_match as d

on a.permno=d.permno and a.gvkey=d.gvkey and a.iid=d.iid

group by a.permno, date

order by a.permno, date, a.gvkey, a.iid;

quit;

/*Cleansing procedure:

Among multiple matches, keep the ones with either cusip matched or ticker matched

For the remainig ones, some random statistics are used to remove wrong matched results*/

data _pccm_; set _pccm_;

if gcnt>1 and mcusip_matched=1 and cusip_matched^=1 then delete; /*Keep CUSIP Matched Only*/

else if gcnt>1 and mtmatched=1 and tmatched^=1 then delete; /*Keep CUSIP Matched Only*/

if mcusip_matched=0 and mtmatched=0 and spedis>10 and MS_mdum<=3 and hcnt<5 then delete;/*Keep the ones with similar names but matched only a few times - Value decided by observations*/

else if mcusip_matched=0 and mtmatched=0 and spedis>30 and hcnt<15 then delete;/*Keep the ones with different names but with a lot matches - Value decided by observations*/

drop mhcnt mms_sdum mms_mdum gcnt mcusip_matched mtmatched;

run;

/*To have the best one-to-one (permno to gvkey iid) match*/

proc sql;

create table pccm as

select distinct *

from (

select distinct *

from (

select distinct *

from (

select distinct *

from _pccm_

group by permno, date

having hcnt=max(hcnt) /*1) among the current best, get the best historical best*/

)

group by permno, date

having MS_sdum=max(MS_sdum) /*2) at given day, get the best matches*/

)

group by gvkey, iid, date

having hcnt=max(hcnt)

)

group by gvkey, iid, date

having MS_sdum=max(MS_sdum) ;

quit;

/*Patching based on observations */

proc sql undo_policy=none;

create table pccm as

select distinct *

from (

select distinct *

from pccm

group by permno, gvkey, date

having iid=min(iid)

)

group by permno, date

having spedis=min(spedis)

order by permno, date;

quit;

/*Add Time dimensions to PERMNO_to_IID through CUSIP*/

proc sql;

create table cusip_match_valid as

select distinct a.permno, a.date, b.gvkey, b.iid

from msf as a inner join Cusip_match as b

on a.permno=b.permno

inner join (

select distinct gvkey, datadate from comp.funda

where sale>. or at>.

) as c

on b.gvkey=c.gvkey and

intnx('month', datadate, -24, 'e')<a.date<=intnx('month', datadate, 12, 'e')

left join pccm as d

on a.permno=d.permno and b.gvkey=d.gvkey and b.iid=d.iid and a.date=d.date

group by a.permno, a.date

having (d.permno>.)=max(d.permno>.) and count(distinct catt(gvkey,iid))=1;

quit;

/*Combine CUSIP and PRICE match*/

proc sql;

create table _link_table_ as

select distinct a.permno, a.date, b.gvkey, b.iid,

c.gvkey as cusip_gvkey, c.iid as cusip_iid

from msf as a left join pccm as b

on a.permno=b.permno and a.date=b.date

left join cusip_match_valid as c

on a.permno=c.permno and a.date=c.date

order by permno, gvkey, iid, date;

quit;

/*Making a range (two steps)*/

data _link_table_(where=(lengthn(gvkey)>0)); set _link_table_;

if gvkey="" and iid="" then do;

gvkey=cusip_gvkey; iid=cusip_iid;

end;

drop cusip_gvkey cusip_iid;

proc sort; by permno gvkey iid date;

run;

data _link_table_; set _link_table_; by permno gvkey iid date;

retain from;

ldate=lag(date);

if first.iid then do; ldate=.; from=date; end;

else if intnx('month',ldate, 1, 'e')<intnx('month',date, 0, 'e') then

from=date;

format from date9.;

run;

/*Save the matched Table*/

proc sql;

create table pccm_table as

select distinct permno, gvkey, iid, from, date as to format=date9.

from _link_table_

group by permno, gvkey, iid, from

having date=max(date);

quit;