Drafted by Rui Dai in Jun 2016 (Revised at Oct 2017)
In empirical finance and accounting literature, CRSP and Compustat have been the two most widely used database since early 90’s. From 2001 to 2015, the combined web queries of those two databases together occupy more than 80% web traffic at WRDS. In this WRDS Research Application, we provide a general guidance to link those two databases at security level. This rest of this application is organized as following:
Preliminary: security identifiers
Data Structure: CRSP v.s. Compustat
Matching Implementation:
Linking by CUSIP
Linking by CCM Linktable
1. Preliminary: Security Identifiers
Among all commonly used security/company identifiers, TICKER and CUSIP are probably the most often mentioned ones. A stock TICKER is an abbreviation used to uniquely identify publicly traded shares of a particular stock, while CUSIP is a nine-character alphanumeric code assigned by CUSIP bureau to identify various securities. There are a few problems associated with those two identifiers for researchers to match historical records. First of all, neither identifier is permanent. A change in a security’s CUSIP, for example, could be triggered by any change in the security, including non-fundamental events such as splits and name changes. In addition, retired stock TICKER are often reused. As shown in Table below, both TICKER and CUSIP are changed 3 times, while the name of the company is also changed once.
For those reasons, most databases create permanent identifiers to indicate the same securities and companies. For example, CRSP employs PERMNO to track stock, while Compustat uses GVKEY to follow companies and a combination of GVKEY and IID to track security. As indicated in Table above, both PERMNO and GVKEY (+IID) remain the same, regardless of changes in TICKER, CUSIP, and Company Names.
2. Data Structure:
To understand the matching mechanism and results, researchers should keep the structural differences of two databases in mind. The Table below demonstrates four major structural differences between CRSP and Compustat:
Those differences typically create three kinds of unmatchable scenarios.
Prior to 1950, trading records (CRSP) cannot be matched to fundamental items (Compustat).
The fundamental items cannot be matched to trading records, when firms are listed in regional exchanges, are traded as OTC, and/or only have bonds publicly traded.
The trading records cannot be matched to fundamental items, when those stocks did not survive long enough (e.g. M&A) for Compustat to include its fundamental items.
3. Matching Implementation:
Merge by CUSIP
CUSIP is a nine-character alphanumeric identifier for publicly traded securities, and is created by CUSIP Service Bureau in 1968.Unlike TICKER, CUSIP is not a reusable security identifier, so that it has been widely used to match different databases. CRSP and Compustat used slightly different approach to keep CUSIP records. CRSP drops the 9th checksum digit in CUSIP and keeps historical CUSIPs, while Compustat use full CUSIP and keep only the most current CUSIP.[3] The following is a SAS sample code to match CRSP and Compustat at security level by CUSIP:
proc sql;
create table cusip_link as
select distinct a.permno, intnx('month',date,0,'e') as date format=date9.,
round(abs(prc), 0.01) as prc, c.gvkey, c.iid
from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b
/*Keep All Possible Common Stocks and ARDs*/
on a.permno=b.permno
inner join comp.security as c
on substr(c.cusip,1,8)=substr(b.ncusip,1,8) and lengthn(strip(ncusip))>6
inner join (
select distinct gvkey, datadate from comp.funda
where sale>. or at>. /*Keep company with valid Total Asset or Sales Figure*/
) as d
on d.gvkey=c.gvkey and
intnx('month', datadate, -24, 'e')<a.date<=intnx('month', datadate, 12, 'e');
/*Keep valid CUSIP linked PERMNO and GVKEY pairs around fiscal year ends*/
quit;
There are three major limitations to match CRSP and Compustat by CUSIP:
Prior to 1968, CUSIP does not exist, so that the CUSIP-based match is only reliable from 1968 onward.
CRSP only keep historical CUSIPs when the securities are traded in major exchanges. If the most recent CUSIP in Compustat is from regional or OTC market, CRSP and Compustat records would not be matched properly.
CRSP and Compustat has different data refreshing frequency. The most recent CUSIP in one databases may not be synchronized with the most recent CUSIP in the other, which may in term creates certain miss matchings as well.
Merge by CCM Linktable
The CRSP/Compustat Merged Database (CCM) is comprised of CRSP and Compustat data together with the link and link-history references between these two databases. The key product of CCM is merely a permanent identifier linking table (PERMNO to GVKEY+IID), though CCM is often wrongly considered as merged product of CRSP stock market data with Compustat accounting data. From an end user’s perspective, CCM only adds a link of PERMNO and PERMCO to Compustat database, so that Compustat items can be searched by CRSP's identifiers. In the link table (Ccmxpf_linktable), there are a few indicators for the quality of matching:
LinkType: A code describing the connection between the CRSP and Compustat data.
LU: Link research complete. Standard connection between databases.
LC: Un-researched link to issue by CUSIP.
LD: Duplicate link to a security. Another GVKEY/IID is a better link to that CRSP record
…
LinkPrim: Primary issue indicator for the link
P: Primary, identified by Compustat in monthly security data.
C: Primary, assigned by CRSP to resolve ranges of overlapping or missing primary markers from Compustat in order to produce one primary security throughout the company history.
...
UsedFlag: Flag marking whether link is used in building composite record
1: this link is applicable to the selected PERMNO and used to identify ranges of Compustat data from a GVKEY used to build a composite GVKEY record corresponding to the PERMNO.
-1 = this link is informational, indirectly related to the PERMNO, but not used.
USEDFLAG is a legacy indicator which is kept for programing and data consistency at WRDS.
LINKDT: The first effective date of the link
LINKENDDT: The last date when the link is valid. For current valid link, LINKENDDT is set a NULL value (.E in SAS Format).
The following is a sample SAS code to match CRSP and Compustat at security level by CCM Linktable:
proc sql;
create table ccm_link as
select distinct a.*, c.gvkey, liid as iid
from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b
on a.permno=b.permno
inner join crsp.Ccmxpf_linktable(where=(linktype in ('LC' 'LU') and
linkprim in ('P' 'C' 'J') and USEDFLAG=1)) as c
/*Both Primary and Secondary Shares*/
on a.permno=c.lpermno and linkdt<=date<=coalesce(linkenddt, today())
inner join (
select distinct gvkey from comp.funda
where sale>. or at>.
) as d
on c.gvkey=d.gvkey;
quit;
Numbers of companies from stock level matching by CCM Linktable and CUSIP through years are plotted in graph below. Due to various limitations of CUSIP matching mentioned above, CCM always provides greater numbers of matched results, and the two results converge in most recent years. In addition, I provides a code (PCCM.sas below) that generates a combination of CUSIP match and successive price matching. As mentioned above, this approach is inspirited by the CCM LinkType variable. The code is in Alpha version stage, certain cautions are required when use. The numbers of companies matched by using the linking table generated by using this code is also reported in the figure below. As shown in the picture, the price matching does improve the quality of security matching, though CCM still provide the best results, especially in late 80's and early 90's.
It is worth to notice one company may have multiple classes of common equities. And at company level, CRSP and Compustat occasionally disagree on the surviving company from merges. In these cases, a researcher could disagree with the stand taken by the CCM set in resolving the disagreement or ambiguity about a surviving company.
[1] In fact, CRSP includes stocks listed in different exchanges through stages. In particular, CRSP adds stocks listed NYSE, NYSE MKT (AMEX), NASDAQ, and ARCA in 1925, 1962, 1972, and 2006 respectively.
[2] Early researches, for example Kothari, Shanken and Sloan (1995), find a selection bias (survivorship bias) to the way in which firms are added to the COMPUSTAT database. In particular, Compustat adds a new firm into its system only when the firm has reported its financial statement for multiple years, which is sometimes called backfilling bias.
[3] Compustat also provides historical CUSIPs, TICKERs, etc. in its Snapshot product, though historical identifiers are not available in their standard fundamental data feed.
/******************** PCCM.SAS ********************/
/********************Author: Rui Dai********************/
proc sql;
create table Secm (where=(prccm>0)) as
select distinct a.gvkey, a.iid, intnx('month',datadate,0,'e') as datadate format=date9.,
round(prccm, 0.01) as prccm
from comp.Secm(where=(lengthn(strip(iid))=2))/*USA Firm Only*/ as a
inner join (
select distinct gvkey from comp.funda
where sale>. or at>. /*Firm with Sales or Total Asset only*/
) as b
on a.gvkey=b.gvkey;
quit;
/*CRSP Monthly Security File Cleansing*/
proc sql;
create table msf(where=(prc>0)) as
select distinct a.permno, intnx('month',date,0,'e') as date format=date9.,
round(abs(prc), 0.01) as prc
from crsp.msf as a inner join crsp.msenames(where=(shrcd in (10 11 12 30 31 32))) as b
on a.permno=b.permno;
quit;
/*CUSIP Matching*/
proc sql;
create table cusip_match as
select distinct a.gvkey, a.iid, b.permno
from comp.security as a inner join crsp.msenames as b
on substr(a.cusip,1,8)=substr(b.ncusip,1,8)
inner join (select distinct permno from msf) as c
on b.permno=c.permno
inner join (select distinct gvkey, iid from secm) as d
on a.gvkey=d.gvkey and a.iid=d.iid;
quit;
/*A Blind Price Matching*/
proc sql;
create table _pccm_ as
select distinct a.gvkey, permno, a.iid
from Secm as a inner join msf as b
on datadate=date and prc>0.5 and prc=prccm /*Removing very long price matching*/
group by b.permno, a.gvkey, a.iid
having count(distinct datadate)> 7 ; /*Parameter 2*/
quit;
/*Refine matching with time dimension*/
proc sql;
create table meta as
select distinct a.permno, intnx('month',date,0,'e') as DATE format=date9.,
b.gvkey, b.iid, prc, prccm, prc>0.5 and prc=prccm as sdum /*Removing very long price matching*/
from msf as a inner join _pccm_ as b
on a.permno=b.permno
inner join Secm as c
on b.gvkey=c.gvkey and b.iid=c.iid and datadate=date
order by permno, gvkey, iid, date;
quit;
/*Counting backwards to fine-tuning the matched point*/
PROC EXPAND data=meta OUT=meta METHOD=NONE;
by permno gvkey iid;
CONVERT sdum = MS_sdum / TRANSFORM=(MOVSUM 15); /*Parameter 2*/
id date;
RUN;
/* Counting forwards for fine-tuning the matched point (Three steps:)*/
proc sort data=meta out=meta; by permno gvkey iid descending date; run;
data meta; set meta; by permno gvkey iid descending date;
idx+1;
if first.iid then idx=1;
run;
PROC EXPAND data=meta OUT=meta METHOD=NONE;
by permno gvkey iid;
CONVERT MS_sdum = MS_mdum / TRANSFORM=(MOVmax 15); /*Parameter 2*/
id idx;
RUN;
/*Keep consequent matched price and date points*/
proc sql undo_policy=none;
create table meta as
select distinct *, sum(sdum) as hcnt "historical matched count"
from meta(drop=idx)
where MS_mdum>=1 and MS_sdum>=1
group by permno, gvkey, iid
having max(ms_sdum)> 7 /*Parameter 2*/
order by permno, gvkey, iid, date;
quit;
/*Among the remained matched ones, count the closest spelling distance*/
proc sql;
create table mname_match as
select distinct a.gvkey, a.iid, a.permno, spedis(upcase(compress(COMNAM,,'sp')), upcase(compress(conm,,'sp'))) as spedis
from (select distinct gvkey, iid, permno from meta) as a
left join crsp.msenames as b
on a.permno=b.permno
left join comp.company as c
on a.gvkey=c.gvkey
group by a.permno, a.gvkey, a.iid
having spedis(COMNAM, conm)=max(spedis(COMNAM, conm));
quit;
/*Among the remained matched ones, check whether same tickers are evered shared*/
proc sql;
create table tick_match as
select distinct a.gvkey, a.iid, a.permno, scan(upcase(b.ticker),1) = scan(upcase(c.tic),1) as tmatched
from (select distinct gvkey, iid, permno from meta) as a
left join crsp.msenames(where=(lengthn(TICKER)>0)) as b
on a.permno=b.permno
left join comp.Security(where=(lengthn(tic)>0)) as c
on a.gvkey=c.gvkey and a.iid=c.iid
group by a.permno, a.gvkey, a.iid
having (scan(upcase(b.ticker),1) = scan(upcase(c.tic),1))
=max(scan(upcase(b.ticker),1) = scan(upcase(c.tic),1));
quit;
/*Matched name and ticker measures back to consequent matching results and
construct key monthly statistics*/
proc sql;
create table _pccm_ as
select distinct a.*, max(hcnt) as mhcnt, max(MS_sdum) as mms_sdum, max(ms_mdum) as mms_mdum,
count(distinct catt(a.gvkey, a.iid)) as gcnt, b.permno>. as cusip_matched, spedis,
max(b.permno>.) as mcusip_matched, tmatched, max(tmatched) as mtmatched
from meta as a left join cusip_match as b
on a.permno=b.permno and a.gvkey=b.gvkey and a.iid=b.iid
left join mname_match as c
on a.permno=c.permno and a.gvkey=c.gvkey and a.iid=c.iid
left join Tick_match as d
on a.permno=d.permno and a.gvkey=d.gvkey and a.iid=d.iid
group by a.permno, date
order by a.permno, date, a.gvkey, a.iid;
quit;
/*Cleansing procedure:
Among multiple matches, keep the ones with either cusip matched or ticker matched
For the remainig ones, some random statistics are used to remove wrong matched results*/
data _pccm_; set _pccm_;
if gcnt>1 and mcusip_matched=1 and cusip_matched^=1 then delete; /*Keep CUSIP Matched Only*/
else if gcnt>1 and mtmatched=1 and tmatched^=1 then delete; /*Keep CUSIP Matched Only*/
if mcusip_matched=0 and mtmatched=0 and spedis>10 and MS_mdum<=3 and hcnt<5 then delete;/*Keep the ones with similar names but matched only a few times - Value decided by observations*/
else if mcusip_matched=0 and mtmatched=0 and spedis>30 and hcnt<15 then delete;/*Keep the ones with different names but with a lot matches - Value decided by observations*/
drop mhcnt mms_sdum mms_mdum gcnt mcusip_matched mtmatched;
run;
/*To have the best one-to-one (permno to gvkey iid) match*/
proc sql;
create table pccm as
select distinct *
from (
select distinct *
from (
select distinct *
from (
select distinct *
from _pccm_
group by permno, date
having hcnt=max(hcnt) /*1) among the current best, get the best historical best*/
)
group by permno, date
having MS_sdum=max(MS_sdum) /*2) at given day, get the best matches*/
)
group by gvkey, iid, date
having hcnt=max(hcnt)
)
group by gvkey, iid, date
having MS_sdum=max(MS_sdum) ;
quit;
/*Patching based on observations */
proc sql undo_policy=none;
create table pccm as
select distinct *
from (
select distinct *
from pccm
group by permno, gvkey, date
having iid=min(iid)
)
group by permno, date
having spedis=min(spedis)
order by permno, date;
quit;
/*Add Time dimensions to PERMNO_to_IID through CUSIP*/
proc sql;
create table cusip_match_valid as
select distinct a.permno, a.date, b.gvkey, b.iid
from msf as a inner join Cusip_match as b
on a.permno=b.permno
inner join (
select distinct gvkey, datadate from comp.funda
where sale>. or at>.
) as c
on b.gvkey=c.gvkey and
intnx('month', datadate, -24, 'e')<a.date<=intnx('month', datadate, 12, 'e')
left join pccm as d
on a.permno=d.permno and b.gvkey=d.gvkey and b.iid=d.iid and a.date=d.date
group by a.permno, a.date
having (d.permno>.)=max(d.permno>.) and count(distinct catt(gvkey,iid))=1;
quit;
/*Combine CUSIP and PRICE match*/
proc sql;
create table _link_table_ as
select distinct a.permno, a.date, b.gvkey, b.iid,
c.gvkey as cusip_gvkey, c.iid as cusip_iid
from msf as a left join pccm as b
on a.permno=b.permno and a.date=b.date
left join cusip_match_valid as c
on a.permno=c.permno and a.date=c.date
order by permno, gvkey, iid, date;
quit;
/*Making a range (two steps)*/
data _link_table_(where=(lengthn(gvkey)>0)); set _link_table_;
if gvkey="" and iid="" then do;
gvkey=cusip_gvkey; iid=cusip_iid;
end;
drop cusip_gvkey cusip_iid;
proc sort; by permno gvkey iid date;
run;
data _link_table_; set _link_table_; by permno gvkey iid date;
retain from;
ldate=lag(date);
if first.iid then do; ldate=.; from=date; end;
else if intnx('month',ldate, 1, 'e')<intnx('month',date, 0, 'e') then
from=date;
format from date9.;
run;
/*Save the matched Table*/
proc sql;
create table pccm_table as
select distinct permno, gvkey, iid, from, date as to format=date9.
from _link_table_
group by permno, gvkey, iid, from
having date=max(date);
quit;