Industry Categorization with GloVe Word Vectors
Posted by Rui Dai on Dec 27, 2021
Empirical research in accounting and finance has primarily relied on structured data from tabulated contents in regulatory filings and security trading records. The lack of systematic ways to retrieve information from unstructured data has kept researchers from using the tremendous amount of information embedded in regulatory filings and similar textual data. Recent advances in Natural Language Processing (NLP), however, have paved the way for innovative approaches to information retrieval from unstructured financial data. NLP is a multidisciplinary field spanning linguistics, computer science, and artificial intelligence that studies the interplay between computers and human language. Although academics in accounting and finance have predominantly employed statistical NLP methods based on the bag-of-words approach, word vectors, which underpin state-of-the-art NLP models, are still new ground for many. This article aims to demonstrate the potential of word vectors in accounting and financial applications through a case study.
Word vector models represent each lexical item (a word or multi-word term) with a real-valued vector corresponding to a position in a semantic space. These models seek to quantify and classify semantic similarity between linguistic elements using their distributional features in large samples of language data. The two most commonly used model families for learning word vectors are global matrix factorization approaches, such as latent semantic analysis (LSA), and local context window methods, such as the skip-gram model of Mikolov et al. (2013). In this case study, we use word vectors from the GloVe model developed by a group of computer scientists at Stanford (Pennington et al. 2014). GloVe is a weighted least squares log-bilinear regression model that incorporates features of both model families. It performs similarly to the word2vec model, which is trained through a shallow (two-layer) neural network, and consumes fewer computational resources than more cutting-edge models such as FastText (Facebook 2018) and BERT (Google 2018).
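To make the "weighted least squares" description concrete, here is a minimal sketch of the GloVe objective from Pennington et al. (2014) evaluated on a toy co-occurrence matrix. The function names and all toy numbers are illustrative assumptions and not part of the pipeline used in this post; x_max = 100 and alpha = 0.75 are the values reported in the GloVe paper.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least-squares loss over the nonzero co-occurrence counts X[i][j]."""
    loss = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij <= 0:
                continue  # pairs that never co-occur do not enter the sum
            dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
            residual = dot + b[i] + b_ctx[j] - math.log(x_ij)
            loss += glove_weight(x_ij) * residual ** 2
    return loss

# Toy example: 2 words with 2-dimensional vectors (all numbers are made up).
X = [[0.0, 4.0], [4.0, 0.0]]          # co-occurrence counts
W = [[0.5, 0.1], [0.2, 0.4]]          # word vectors
W_ctx = [[0.3, 0.2], [0.1, 0.6]]      # context vectors
b = [0.0, 0.0]                        # word biases
b_ctx = [0.0, 0.0]                    # context biases
print(round(glove_loss(W, W_ctx, b, b_ctx, X), 4))
```

Training amounts to adjusting the vectors and biases until each dot product approximates the log co-occurrence count, with frequent pairs weighted more heavily up to the cap.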
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.decomposition import PCA
from gensim.matutils import sparse2full
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.corpora import Dictionary
from gensim.models.tfidfmodel import TfidfModel
In this case study, we use Gensim, an open-source library, to handle text data processing, along with a few commonly used Python libraries for data manipulation (numpy and pandas), plotting (matplotlib), principal component decomposition (sklearn), and regular expressions (re).
glove_file = datapath("vectors.txt")
word2vec_glove_file = get_tmpfile("glove.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)
model_glove = KeyedVectors.load_word2vec_format(word2vec_glove_file)
Pennington et al. (2014) train the GloVe model with textual data from Wikipedia, general news (Gigaword), and online text (Common Crawl). The Wikipedia and Gigaword corpora together contain 6 billion words, while the online text contains 42 billion words. Pennington et al. (2014) report model performance using the top 400 thousand unique words from Wikipedia and the news, and the top 2 million unique words from the online text. Loughran and McDonald (2011) argue that some accounting and finance terms are used in ways that differ substantially from ordinary English. One benefit of the GloVe model is that the authors publish their source code online, enabling us to retrain the model entirely on an accounting and finance corpus. For this case study, we draw on several accounting and finance corpora: Item 1 (company description) and Item 7 (management discussion and analysis) from 10-K filings, conference call transcripts, the USPTO's patent application abstracts, and working papers from SSRN's Financial Economics Network. Our final training corpus consists of 7.12 billion words, and our Fin-GloVe model includes 1.37 million unique words. Also, as Pennington et al. (2014) suggest, we use 300-dimensional word vectors to balance training duration and model performance. The trained model is stored in a text file called "vectors.txt", which gensim can load seamlessly.
model_glove.get_vector('apple')[:4]
Out[10]: array([-0.520283, -0.016652, -0.098841, -0.06334 ], dtype=float32)
I printed the first four elements of the 300-dimensional word vector representing 'apple'. The quality of such a word vector increases with dimensionality, though with diminishing marginal gains. It is worth noting that the word vectors used in most previous accounting and finance research are sparse vectors that contain one element for every possible word, most of which are zero. For example, the product word vector used in Hoberg and Phillips (2016) has a length equal to the number of unique words used in the entire corpus, and each element in the vector is equal to one if a firm uses the word, and zero otherwise. In their sample, a typical firm uses roughly 200 unique words to describe its products, while the number of unique words in the business description section of 10-K filings was 55,605 in 2008 and 61,146 in 1996.
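The sparse representation described above can be sketched in a few lines. The eight-word vocabulary and the two toy firms are hypothetical; the binary vector mimics the one-if-used, zero-otherwise construction, and the helper names are assumptions made for illustration.

```python
import math

def binary_vector(words_used, vocabulary):
    """One element per vocabulary word: 1 if the firm uses the word, else 0."""
    used = set(words_used)
    return [1 if w in used else 0 for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary and firms (a real vocabulary has tens of thousands of words).
vocabulary = ["software", "cloud", "drug", "clinical", "bank", "loan", "retail", "store"]
firm_a = ["software", "cloud"]
firm_b = ["cloud", "retail", "store"]

vec_a = binary_vector(firm_a, vocabulary)
vec_b = binary_vector(firm_b, vocabulary)

print(vec_a)
print(sum(vec_a), "of", len(vec_a), "elements are nonzero")
print(round(cosine(vec_a, vec_b), 4))
```

Under this representation, two firms only look similar when they share exact words, and the vector length grows with the vocabulary. A dense word vector has a fixed length (300 here) and can place different words with related meanings close together.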
model_glove.most_similar('apple')
Out[11]:
[('iphone', 0.6817105412483215),
('android', 0.6711021661758423),
('google', 0.6256544589996338),
('ipad', 0.6183961629867554),
('itunes', 0.5995724201202393),
('microsoft', 0.5955721735954285),
('ios', 0.5948647856712341),
('app', 0.5719366073608398),
('samsung', 0.5701519846916199),
('blackberry', 0.5687897801399231)]
We present the ten terms closest to the word 'apple.' These results, along with numerous unreported ones, indicate that our Fin-GloVe model is capable of capturing semantically similar words in the accounting and finance fields.
result = model_glove.most_similar(positive=['apple', 'android'], negative=['google'])
print("{}: {:.4f}".format(*result[0]))
iphone: 0.6815
Mikolov et al. (2013) proposed a novel assessment technique based on word analogies that examines the finer structure of the word vector space by comparing various dimensions of difference. For instance, the vector equation king - queen = man - woman should convey the comparison "king is to queen as man is to woman." Given that Fin-GloVe is trained purely on an accounting and finance corpus, I explore a similar relationship in the business domain: Apple - Google = [?] - Android. The result is iPhone (with a cosine similarity score of 68.15%), which aligns with expectations. When I try Amazon - Google = [?] - YouTube, Fin-GloVe reports Netflix (with a cosine similarity score of 67.32%), which may shed some light on the capacity of this model.
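The arithmetic behind such an analogy query can be sketched without gensim: add the unit vectors of the "positive" words, subtract those of the "negative" words, and rank the remaining words by cosine similarity to the result. The two-dimensional toy vectors below are made up so that the analogy holds by construction, and the function is a simplified assumption about the additive approach rather than a copy of gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: -pair[1])[:topn]

# Hypothetical vectors: axis 0 ~ ecosystem (Apple vs Google), axis 1 ~ company vs product.
toy = {
    "apple": [1.0, 1.0],
    "google": [-1.0, 1.0],
    "android": [-1.0, -1.0],
    "iphone": [0.9, -1.0],
    "bond": [0.0, 1.0],
}

# Apple - Google = [?] - Android  is equivalent to  [?] = Apple + Android - Google
word, score = analogy(toy, positive=["apple", "android"], negative=["google"])[0]
print("{}: {:.4f}".format(word, score))
```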
def pca_2_plot(model_glove, words):
    # Keep only the words that exist in the model's vocabulary
    words_on_list = [w for w in words if model_glove.has_index_for(w)]
    word_vectors = np.array([model_glove[w] for w in words_on_list])
    # Project the 300-dimensional vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    plt.figure(figsize=(16,9))
    plt.scatter(twodim[:,0], twodim[:,1], s=180, color="blue", alpha=0.3)
    # Label each point with its word
    for word, (x,y) in zip(words_on_list, twodim):
        plt.text(x+0.05, y+0.05, word, size='large')
pca_2_plot(model_glove, ['asset', 'profit', 'income', 'sales', 'revenue', 'liability', 'loss', 'ebitda', 'ebit', 'eps', 'roe', 'roa', 'wharton', 'fama', 'wrds', 'ibm', 'apple', 'google', 'amazon', 'equity', 'bond', 'option'])
A graphical presentation is another way to check the Fin-GloVe model's validity. To illustrate the semantic similarity of a collection of typical accounting and finance words, I plot the first two principal components of their 300-dimensional vectors in the graph above. Again, the two-dimensional distances between the selected words suggest that Fin-GloVe can effectively approximate the semantic distance among words from this specific domain.
Thus far, I have demonstrated that Fin-GloVe performs decently on unigram words. Next, I apply the model to paragraph-length business descriptions to see whether, and how well, it can capture the semantic distance among longer texts.
ciq_bus_desc=pd.read_sas('/home/wrds/rdai/python/glove/meta/ccm_disc.sas7bdat', encoding='utf-8')
bus_des_list=ciq_bus_desc[['gvkey', 'companyname', 'gsector', 'businessdescription']].copy()  # copy so the in-place reset below does not act on a view
bus_des_list.reset_index(drop=True, inplace=True)
bus_des_list[:5]
For this case study, I focus on US firms that are primarily listed on major US exchanges (firms with CRSP share codes 10 and 11), leading to a sample of 3,573 firms. From the Capital IQ databases, I first obtained each firm's business description and its Global Industry Classification Standard (GICS) code. GICS is an industry taxonomy jointly developed by MSCI and Standard & Poor's (S&P). In contrast to the SIC industry codes designed by government agencies, GICS aims to improve investment research and asset management processes worldwide. Bhojraj et al. (2003) show that GICS classifications are significantly better at explaining stock return co-movements and variations in valuation multiples. However, because the business description is drafted by S&P or one of its data providers, the Fin-GloVe metric may solely reflect the data vendor's perceptions of these firms. For the sake of this demonstrative case study, I disregard this drawback for now. An example business description looks like the following:
"Amazon.com, Inc. engages in the retail sale of consumer products and subscriptions in North America and internationally. The company operates through three segments: North America, International, and Amazon Web Services (AWS). It sells merchandise and content purchased for resale from third-party sellers through physical and online stores. The company also manufactures and sells electronic devices, including Kindle, Fire tablets, Fire TVs, Rings, and Echo and other devices; provides Kindle Direct Publishing, an online service that allows independent authors and publishers to make their books available in the Kindle Store; and develops and produces media content. In addition, it offers programs that enable sellers to sell their products on its Websites, as well as its stores; and programs that allow authors, musicians, filmmakers, skill and app developers, and others to publish and sell content. Further, the company provides compute, storage, database, and other AWS services, as well as fulfillment, advertising, publishing, and digital content subscriptions. Additionally, it offers Amazon Prime, a membership program, which provides free shipping of various items; access to streaming of movies and TV episodes; and other services. The company also operates in the food delivery business in Bengaluru, India. It serves consumers, sellers, developers, enterprises, and content creators. The company also has utility-scale solar projects in China, Australia, and the United States. Amazon.com, Inc. has a strategic relationship with NXP Semiconductors N.V. to deliver a cloud compute solution for vehicles that enable cloud-powered services. The company was founded in 1994 and is headquartered in Seattle, Washington."
bus_des_voc_list=[re.sub(r"[\d\W]", " ", s).lower().split() for s in bus_des_list['businessdescription'].tolist()]
docs_dict = Dictionary(bus_des_voc_list)
docs_dict.filter_extremes()
docs_dict.compactify()
docs_corpus = [docs_dict.doc2bow(doc) for doc in bus_des_voc_list]
model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
docs_tfidf = model_tfidf[docs_corpus]
In this case study, we did not apply lemmatization or stemming to the business description corpus and, for simplicity, used only alphabetic words. We first trim the number of words in the sparse binary vector using gensim's standard filters, i.e., a word must be used by at least five firms but by no more than fifty percent of firms. I then employ gensim's default TF-IDF method to generate a weight for each word in the sparse vector, so that words more relevant to a firm receive more weight. This treatment yields a 6,086-dimensional sparse TF-IDF vector for each of the 3,573 firms.
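The two steps described here, vocabulary trimming and TF-IDF weighting, can be sketched in plain Python on a toy corpus. The function names are hypothetical, the thresholds are scaled down to fit four documents, and the weighting formula (term frequency times log2(N/df), scaled to unit length) is my understanding of gensim's default and should be treated as an assumption.

```python
import math
from collections import Counter

def filter_vocabulary(docs, no_below=5, no_above=0.5):
    """Keep words used in at least `no_below` docs and at most `no_above` of all docs."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    max_docs = no_above * len(docs)
    return {w: df for w, df in doc_freq.items() if no_below <= df <= max_docs}

def tfidf(doc, doc_freq, n_docs):
    """Term frequency times log2(N / df), scaled to unit length."""
    counts = Counter(w for w in doc if w in doc_freq)
    weights = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else weights

# Toy corpus of four tokenized "business descriptions" (made up).
docs = [
    ["company", "software", "cloud", "software"],
    ["company", "software", "cloud"],
    ["company", "drug", "clinical"],
    ["company", "drug", "trial"],
]

vocab = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(vocab))  # "company" is too common; "clinical" and "trial" are too rare

weights = tfidf(docs[0], vocab, len(docs))
print({w: round(v, 4) for w, v in weights.items()})
```

Words that appear in nearly every description carry little information about a particular firm, which is why they are either filtered out or given a low inverse document frequency.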
docs_vecs = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])
tfidf_emb_vecs = np.vstack([model_glove.get_vector(docs_dict[i]) for i in range(len(docs_dict))])
docs_emb = np.dot(docs_vecs, tfidf_emb_vecs)
Finally, I generate a 300-dimensional vector for each firm in our sample by multiplying the TF-IDF matrix by the corresponding Fin-GloVe word matrix.
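Row by row, this matrix product is simply a TF-IDF-weighted sum of word vectors. The sketch below spells that out for a single firm; the three-dimensional word vectors, the weights, and the function name are all made up for illustration.

```python
def firm_vector(tfidf_weights, word_vectors):
    """TF-IDF-weighted sum of word vectors: one row of (TF-IDF matrix) x (word matrix)."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for word, weight in tfidf_weights.items():
        for k, value in enumerate(word_vectors[word]):
            vec[k] += weight * value
    return vec

# Hypothetical 3-dimensional word vectors (the real ones have 300 dimensions).
word_vectors = {
    "software": [1.0, 0.0, 0.5],
    "cloud": [0.0, 1.0, 0.5],
    "drug": [0.0, 0.0, -1.0],
}
# Hypothetical TF-IDF weights for one firm; words the firm does not use have weight zero.
firm_weights = {"software": 0.8, "cloud": 0.6}

vec = firm_vector(firm_weights, word_vectors)
print([round(x, 4) for x in vec])
```

Because the result has the same dimensionality as the word vectors regardless of vocabulary size, firms can be compared directly in the 300-dimensional semantic space.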
fig, ax = plt.subplots(figsize=(16,9))
zero_indices = np.where(bus_des_list.gsector == '40')[0]
one_indices = np.where(bus_des_list.gsector == '45')[0]
two_indices = np.where(bus_des_list.gsector == '35')[0]
twodim = PCA().fit_transform(docs_emb)[:,:2]
ax.plot(twodim[zero_indices,0], twodim[zero_indices,1], marker='o', linestyle='', ms=8, alpha=0.3, label='Financials')
ax.plot(twodim[one_indices,0], twodim[one_indices,1], marker='o', linestyle='', ms=8, alpha=0.3, label='Information Technology')
ax.plot(twodim[two_indices,0], twodim[two_indices,1], marker='o', linestyle='', ms=8, alpha=0.3, label='Health Care')
ax.legend()
plt.show()
Our sample includes 872 health care firms, 600 financial firms, and 496 information technology firms, the top three GICS sectors in the United States. The figure above graphically presents the first two principal components among firms from those three industries. The figure demonstrates firms are separated by their business sectors fairly well. However, the industry separation becomes blurry when more firms from other sectors are added to the graph. These results may suggest that the dimension reduction from 300 to 2 could have significantly reduced the power of this Fin-GloVe model.
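One way to gauge how much information a two-dimensional projection discards is to look at the share of total variance captured by the first two principal components. The sketch below computes that share with a singular value decomposition on synthetic data; the random matrix is only a stand-in for the firm vectors, and the function name is an assumption.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component (via SVD)."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic stand-in for the firm vectors: 200 "firms" in 10 dimensions (random numbers).
rng = np.random.default_rng(0)
toy_emb = rng.normal(size=(200, 10))

ratio = explained_variance_ratio(toy_emb)
print("share of variance kept by the first two components:", round(float(ratio[:2].sum()), 3))
```

With the actual firm vectors, scikit-learn's fitted PCA object exposes the same quantity through its explained_variance_ratio_ attribute. A small share for the first two components would be consistent with the blurry separation noted above.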
find_gvkey = {w: idx for idx, w in enumerate(bus_des_list['gvkey'])}
idx_gvkey = {idx: w for idx, w in enumerate(bus_des_list['gvkey'])}
idx_companyname = {idx: w for idx, w in enumerate(bus_des_list['companyname'])}
def _similarity_query(word_vec, number):
    # Cosine similarity between the query vector and every firm vector
    dst = np.dot(docs_emb, word_vec)/(np.linalg.norm(docs_emb, axis=1)*np.linalg.norm(word_vec))
    # Sort firms from most to least similar
    firm_ids = np.argsort(-dst)
    # Return number+1 entries because a firm in the sample ranks first against itself
    return [(idx_gvkey[x], idx_companyname[x], dst[x]) for x in firm_ids[:number+1] if x in idx_gvkey]
def most_similar_gvkey(gvkey, number=2):
    try:
        gvkey_idx = find_gvkey[gvkey]
    except KeyError:
        raise Exception('gvkey not in dictionary')
    return _similarity_query(docs_emb[gvkey_idx], number)
In the last step, I constructed a method to calculate the cosine similarity among firms indexed by their GVKEY, a permanent firm-level identifier from S&P, and return a list of semantically similar peers for a given firm.
most_similar_gvkey('002991', 5)
[('002991', 'Chevron Corporation', 0.9999999),
('006310', 'Kinder Morgan Kansas, Inc.', 0.96647817),
('170841', 'Phillips 66', 0.9628442),
('183704', 'Vertex Energy, Inc.', 0.9592671),
('007017', 'Marathon Oil Corporation', 0.9537357),
('004503', 'Exxon Mobil Corporation', 0.95351523)]
most_similar_gvkey('007906', 5)
[('007906', 'NIKE, Inc.', 1.0),
('165052', 'Under Armour, Inc.', 0.96981627),
('105936', 'Columbia Sportswear Company', 0.96045893),
('063763', 'Hibbett, Inc.', 0.9591285),
('011566', 'Wolverine World Wide, Inc.', 0.9589095),
('029015', 'Deckers Outdoor Corporation', 0.9583419)]
Across tens of cases such as those shown above, the cosine similarity rankings provide a reasonable collection of matched firms. However, the similarity scores, all above 0.95 in the examples shown, appear higher than what is realistic for such measures. Even with TF-IDF weights in the firm word vector, Fin-GloVe's ability to assess sentence-level similarity still seems restricted, probably due to a few limitations of word vectors, the homogeneity of the business descriptions, and/or the imperfection of the GICS classification. A known workaround for the technical limitations of word vectors is to employ Google's Universal Sentence Encoder, which is designed to handle sentence-level similarities (see Dai et al., 2021, for an application related to finance research).
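A small numerical illustration of how homogeneity could push cosine similarities toward one: if every firm vector shares a large common component (for example, from boilerplate wording), the cosine is close to one even when the firm-specific parts are orthogonal. All vectors below are made up, and subtracting the mean vector is shown only as an illustrative diagnostic, not as a step used in this case study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical firm vectors = large shared component + small firm-specific component.
common = [10.0, 10.0, 10.0]
specific = {"oil": [1.0, 0.0, 0.0], "apparel": [0.0, 1.0, 0.0], "bank": [0.0, 0.0, 1.0]}
firms = {name: [c + s for c, s in zip(common, vec)] for name, vec in specific.items()}

raw = cosine(firms["oil"], firms["apparel"])

# Remove the average firm vector and compare again.
mean = [sum(col) / len(firms) for col in zip(*firms.values())]
centered = {name: [x - m for x, m in zip(vec, mean)] for name, vec in firms.items()}
adjusted = cosine(centered["oil"], centered["apparel"])

print("raw cosine: {:.4f}".format(raw))
print("cosine after removing the mean vector: {:.4f}".format(adjusted))
```

In this toy setting the ranking of peers is unaffected by the shared component, but the magnitude of the similarity score is, which matches the pattern of sensible rankings paired with very high scores.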
The Main Take-Aways
Fin-GloVe is quite capable of representing unigrams in a word vector space, suggesting that such models can be adopted to retrieve semantically similar words in accounting and finance texts.
The TF-IDF weighted firm-level vectors also seem to perform reasonably well in selecting semantically similar peers, though the magnitudes of their similarity measures may be biased upward.
References:
Bhojraj, S., Lee, C.M. and Oler, D.K., 2003. What's my line? A comparison of industry classification schemes for capital market research. Journal of Accounting Research, 41(5), pp.745-774.
Dai, R., Donohue, L., Drechsler, Q.F.S. and Jiang, W., 2021. Dissemination, Publication, and Impact of Finance Research: Working Paper (March 12, 2021).
Hoberg, G. and Phillips, G., 2016. Text-based network industries and endogenous product differentiation. Journal of Political Economy, 124(5), pp.1423-1465.
Loughran, T. and McDonald, B., 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), pp.35-65.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Pennington, J., Socher, R. and Manning, C.D., 2014, October. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).