With the amount of text data now available, it has become impossible for humans to analyze these vast quantities efficiently. As the information age has come to dominate our communication, the need to extract and analyze text from all sorts of media for a multitude of applications has soared. In the financial services industry, however, large amounts of financial text data are still interpreted manually. Here, Natural Language Processing (NLP) and Natural Language Understanding (NLU) can play a pivotal role in making information extraction and analysis more effective. NLP can be generally defined as a sub-field of artificial intelligence that works on improving the ability of computers to process human language, usually on a linguistic level such as part-of-speech tagging. NLU is more focused on understanding human language after processing, putting the components together; i.e., it is more semantically oriented.
Automatic processing of qualitative data: understanding the challenges
NLP and NLU make it possible for computers to read text, hear speech, interpret it, measure sentiment, and determine which parts are important. The challenge is that text is an inherently fickle medium because it is entwined with language. Something a human can process easily is often not easily transferable to a computer. Thus, different text sources provide not only different types of information, but also unique challenges.
Company-relevant information, for example, can be extracted from a multitude of different text sources. While information usually overlaps, each source may contain a different inherent perspective. How a company chooses to report on itself can be interpreted very differently from how a third party judges that report. Additionally, some information disseminates quickly, while other information is hardly noticeable, and is sometimes even released at specific times for that exact effect. Furthermore, each source presents different challenges, sometimes arising not from the text itself but from the additional metadata that accompanies it.
NLP: Different data sources demand different approaches
News: Perhaps the most intuitive place to find company information is in the news. News is constantly streamed from various sources and sectors. One advantage of news is that the writing style is relatively formal and structured in appearance, making it more suitable for extracting information. However, the origin of the news can bring its own problems. Whilst a news article may look perfectly fine to the human eye, the computer might see many erroneous characters (such as newline or special characters) that the reader never knows exist. This can be true for any source of data, but with news the text comes from a third-party source that has its own preferences for its machine-readable text formats. Thus, a major task is simply integrating how different outlets structure their news. This requires matching the additional metadata that news sources provide across multiple intake streams and aggregating the results into a single cohesive data stream. One solution is to design an NLP pipeline that contains a document classification system, which recognizes texts and assigns them to particular processing pipelines for special text types.
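The cleaning-and-routing idea above can be sketched as follows. This is a minimal illustration, not a production system: the field names (`source`, `body`), the keyword cues, and the pipeline names are all hypothetical, and a real document classification system would use a trained classifier rather than keyword rules.

```python
import re

def normalize(raw: str) -> str:
    """Strip control characters and collapse whitespace runs that
    third-party feeds often leave behind."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", raw)  # control chars
    return re.sub(r"\s+", " ", text).strip()              # collapse runs

def route(document: dict) -> str:
    """Assign a document to a processing pipeline based on simple
    keyword cues (a stand-in for a real document classifier)."""
    text = normalize(document["body"]).lower()
    if "prospectus" in text or "form 10-k" in text:
        return "filing_pipeline"
    if document.get("source") == "twitter":
        return "social_pipeline"
    return "news_pipeline"

doc = {"source": "reuters", "body": "Acme Corp\x0creported earnings...\n"}
print(route(doc))  # news_pipeline
```

The key design point is that routing happens after normalization, so every downstream pipeline can assume clean text regardless of which outlet produced it.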
Company filings: Many companies release annual reports on their finances and, if required by law, file mandatory reports or bond prospectuses with various agencies. Whilst such information is very pertinent to a specific company, extracting relevant information from those reports can be difficult, as the formats vary greatly. Furthermore, reports are often stored as PDFs, which are not inherently machine readable. However, while specific company reports may still be challenging to analyze, there are sources of standardized reports that contain similar information, such as the Securities and Exchange Commission (SEC) filings. These filings – albeit not available for every company – are systematic and follow similar structures from year to year, allowing a more automated process for information extraction to be developed. Moreover, available research identifying exactly which parts of the filings are most relevant and examining how they change over time (see Rawte, Gupta, & Zaki, 2018 for an example) reduces the processing requirements.
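Because SEC filings follow a recurring structure, even a simple pattern over the "Item" headers can carve a plain-text 10-K into its sections. The sketch below assumes the filing has already been converted to plain text; real filings are far messier (HTML, tables of contents, repeated headers), so treat this as an illustration of the idea rather than a robust parser.

```python
import re

# Matches section headers such as "Item 1." or "Item 1A." in a 10-K.
ITEM_RE = re.compile(r"(Item\s+\d+[A-Z]?\.)", re.IGNORECASE)

def split_items(filing_text: str) -> dict:
    """Split a plain-text filing into {item header: section body}."""
    parts = ITEM_RE.split(filing_text)
    # parts alternates: [preamble, header, body, header, body, ...]
    sections = {}
    for header, body in zip(parts[1::2], parts[2::2]):
        sections[header.strip().rstrip(".")] = body.strip()
    return sections

text = ("Item 1. Business We make widgets. "
        "Item 1A. Risk Factors Supply chains may fail.")
sections = split_items(text)
print(sections["Item 1A"])  # Risk Factors Supply chains may fail.
```

Once sections are isolated, year-over-year comparison of a single section (e.g. Risk Factors) becomes a straightforward diff rather than a full-document problem.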
Earnings Calls: Earnings calls are a forum in which a public company discusses its financial outcomes for a given reporting period. They have become increasingly popular to analyze due to the specific pieces of information conveyed and their set formats (see Keith & Stent, 2019 for an example). Since it is disclosed who is talking and when, the dialogue is easier to follow, and points can be directly attributed to the speaker. This structure also allows for an additional level of semantic analysis: a speaker's performance can be compared to previous calls and put in the context of the reported numbers, indicating how the individual tends to present either positive or negative financial outcomes. Additionally, the way a speaker positions the information can be compared against actual company and market behavior.
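Attributing each point to its speaker, as described above, can be sketched with a simple turn parser. The transcript format here ("Name (Role): utterance") is a hypothetical convention for illustration; real transcript vendors each have their own layouts.

```python
import re

# One speaker turn per line, assumed format: "Name (Role): text".
TURN_RE = re.compile(
    r"^(?P<speaker>[^:()]+?)\s*(?:\((?P<role>[^)]+)\))?:\s*(?P<text>.*)$"
)

def parse_turns(lines):
    """Return (speaker, role, text) tuples for each recognizable turn."""
    turns = []
    for line in lines:
        m = TURN_RE.match(line)
        if m:
            turns.append((m.group("speaker"), m.group("role"), m.group("text")))
    return turns

transcript = [
    "Jane Doe (CEO): We delivered strong revenue growth this quarter.",
    "John Roe (Analyst): Can you quantify the margin pressure?",
]
for speaker, role, text in parse_turns(transcript):
    print(f"{speaker} [{role}]: {text}")
```

With turns attributed, per-speaker features (tone, hedging, question-dodging) can be tracked across quarters and compared to the reported numbers.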
Social Media: As social media channels have come to dominate our communication, so has their use as a timely source of information. Financial sentiment across social media has become a particular area of focus (SemEval 2017 Task 5, Cortis et al., 2017). From an NLP perspective, social media presents its own unique challenges. Original NLP methodologies were developed and designed for much more formal, structured texts, such as newspapers. Social media is perhaps the antithesis of this: the data is completely unstructured, informal, and often full of special characters. Emojis alone have spawned entire research avenues into how they can be opinion-mined. However, the speed at which knowledge disseminates on such platforms makes them an important focus for trying to predict future behaviors. Today, many individuals find out about company-specific events via small social media snippets and are already making decisions based on this information – well before a more formalized article is published.
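The preprocessing and emoji-mining challenges above can be illustrated with a toy example. The emoji lexicon below is entirely made up for demonstration; published emoji sentiment resources are far larger and empirically derived.

```python
import re

# Illustrative emoji polarity scores; NOT from any published lexicon.
EMOJI_SENTIMENT = {"\U0001F680": 1.0,   # rocket
                   "\U0001F4C8": 0.8,   # chart increasing
                   "\U0001F4C9": -0.8,  # chart decreasing
                   "\U0001F621": -1.0}  # angry face

def clean_post(post: str) -> str:
    """Strip URLs and @/# markers while keeping the underlying terms."""
    post = re.sub(r"https?://\S+", "", post)
    post = re.sub(r"[@#](\w+)", r"\1", post)
    return re.sub(r"\s+", " ", post).strip()

def emoji_score(post: str) -> float:
    """Average polarity of the emojis found in a post (0.0 if none)."""
    scores = [EMOJI_SENTIMENT[ch] for ch in post if ch in EMOJI_SENTIMENT]
    return sum(scores) / len(scores) if scores else 0.0

post = "$ACME to the moon \U0001F680\U0001F680 \U0001F4C8 https://example.com #stocks"
print(clean_post(post))
print(emoji_score(post))
```

Even this crude averaging shows why emojis matter: a post with no conventional sentiment words can still carry a strong, machine-readable signal.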
Usage of NLP in financial data analysis will continue to grow
As digitized text increases in availability, so will the need for NLP to handle the multitude of media. News, company filings, earnings calls, and social media all contain potentially valuable information for financial investors, but they also pose different challenges with regard to extracting that information. As NLP continues to expand its use in the financial domain, so will its need to process an additional wealth of sources from which individuals can make actionable investment decisions. The speed of innovation in areas such as machine learning, however, continues to enable further advancements in NLP and NLU. This will only increase our ability to extract and analyze texts on multiple cognitive levels.