Can LLMs Predict the Market? Insights from the Beige Book.

Introduction

The Beige Book, formally known as the Summary of Commentary on Current Economic Conditions, is a publication by the Federal Reserve that provides insights into the condition of the United States economy. Its value lies in its timeliness, complementing traditional economic indicators that often lag current conditions. At the same time, its methodology has limitations, and its content is built from anecdotal evidence rather than a strict statistical process.

That tension is what makes the Beige Book interesting. We’ve found that it contains valuable qualitative economic insights, yet it does not arrive in a format that is easy to test systematically. Our study aimed to address that gap by applying large language models to Beige Book reports and examining whether those interpretations could help forecast short-term S&P 500 returns.

Our research covered Beige Book reports from 2013 to 2023, excluding 2020. The study used five large language models to score Beige Book content from -1 to +1, in 0.5 increments, and then used Nonlinear Iterative Partial Least Squares (NIPALS) regression to evaluate whether those scores had predictive power for S&P 500 returns over 1-day, 3-day, 7-day, and 14-day horizons.

The main finding was not that language models can simply “predict the market.” It was narrower than that, but still useful. The predictive power of LLM-interpreted Beige Book content generally peaked at the 3-day to 7-day horizon, and model fit improved substantially when control variables were included.

Why use the Beige Book

Firstly, the Beige Book serves as a qualitative input for monetary policy. It draws on observations gathered across the twelve Federal Reserve districts and summarises current economic conditions by region and sector. Because of that structure, it offers a timely reading of economic conditions that standard macro series may not fully capture at the same moment.

Secondly, there is already a body of work showing that Beige Book content has predictive value for economic indicators such as GDP and employment. What has been less explored is whether more advanced text analysis techniques can extract useful signals from the report for financial markets, especially equity returns.

Lastly, our study builds on more recent work that used large language models to classify central bank communication. Hansen and Kazinnik used GPT models to classify the stance of FOMC announcements. Woodhouse and Charlesworth used GPT-3.5 to analyse Bank of England speeches. Our study extends that logic to Beige Book text and applies it to the S&P 500.

Methodology

The data collection process combined automated and manual retrieval. A web scraping program was used to fetch Beige Book chapters from the Minneapolis Fed archive, and missing chapters were manually retrieved from the Federal Reserve archive to complete the dataset. After excluding 2020, the final sample contained 80 Beige Book reports and 1,040 chapters, or 13 chapters per report, which is consistent with one national summary plus one chapter for each of the twelve districts.

Five LLMs were selected through AWS Bedrock:

Amazon Titan Text Premier
Anthropic Claude 3.5 Sonnet
Cohere Command R+
Meta Llama 3.1 70B Instruct
Mistral Large

The study used one shared prompt across all models. Each model was instructed to act as an expert financial analyst and numerically classify the likely impact of the text on S&P 500 returns using only five possible outputs: -1, -0.5, 0, 0.5, or 1.

The aim was to convert qualitative economic language into numerical data suitable for statistical analysis, while also keeping output tokens minimal.
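A constrained output format only helps if replies that drift off the grid are caught before they reach the regression. The sketch below shows one way to do that. The prompt wording, `ALLOWED_SCORES`, `PROMPT_TEMPLATE` and `parse_score` are illustrative assumptions, not the prompt or code used in the study, and the call to the model itself is left out.

```python
# Illustrative only: the prompt text and helper names are assumptions,
# not the prompt or code used in the study.
ALLOWED_SCORES = (-1.0, -0.5, 0.0, 0.5, 1.0)

PROMPT_TEMPLATE = (
    "You are an expert financial analyst. Classify the likely impact of the "
    "following Beige Book text on S&P 500 returns. Reply with exactly one of: "
    "-1, -0.5, 0, 0.5, 1.\n\nText:\n{chapter}"
)


def parse_score(reply: str) -> float:
    """Convert a model reply into one of the five allowed scores.

    Raises ValueError when the reply is not a number on the allowed grid,
    so malformed outputs are flagged instead of silently entering the data.
    """
    text = reply.strip().replace("\u2212", "-")  # tolerate a Unicode minus sign
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"not a number: {reply!r}") from None
    if value not in ALLOWED_SCORES:
        raise ValueError(f"score off the allowed grid: {value}")
    return value


# Build the prompt for one (made-up) chapter and parse a (made-up) reply.
prompt = PROMPT_TEMPLATE.format(chapter="Economic activity expanded modestly.")
score = parse_score(" 0.5\n")
```

Rejecting off-grid replies, rather than rounding them to the nearest allowed value, keeps the five-point scale honest and makes it easy to count how often each model ignores the instruction.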

The scores were then used in a NIPALS regression model. The dependent variables were S&P 500 returns 1, 3, 7, and 14 trading days after each Beige Book release. The models were tested both with and without control variables, including CPI, GDP, unemployment, and Treasury yields.
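For readers who have not met NIPALS before, here is a minimal sketch of the single-response case (often called PLS1). With one dependent variable each component is found in a single pass, so there is no inner iteration. This is a textbook version written with the standard library only. It is not the study's implementation, and the toy numbers and column meanings (an LLM score and one control) are made up for illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nipals_pls1(X, y, n_components):
    """Fit PLS1 by NIPALS deflation. X is a list of rows, y a list of floats."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with residual y.
        w = [_dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]  # scores for this component
        tt = _dot(t, t)
        p = [_dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]  # X loadings
        q = _dot(f, t) / tt                                                # y loading
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def predict(model, X):
    """Predict by replaying the same projection and deflation on each row."""
    x_mean, y_mean, components = model
    out = []
    for row in X:
        e = [v - mu for v, mu in zip(row, x_mean)]
        y_hat = y_mean
        for w, p, q in components:
            t = _dot(e, w)
            y_hat += q * t
            e = [ev - t * pv for ev, pv in zip(e, p)]
        out.append(y_hat)
    return out


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Made-up rows: [LLM score, a control variable]; made-up forward returns in %.
X_toy = [[0.5, 2.1], [0.0, 2.3], [1.0, 1.9], [-0.5, 2.8], [0.5, 2.0], [0.0, 2.5]]
y_toy = [0.8, -0.1, 1.2, -0.9, 0.4, 0.1]
model = nipals_pls1(X_toy, y_toy, n_components=1)
r2_in_sample = r_squared(y_toy, predict(model, X_toy))
```

The R² computed this way is an in-sample measure: it describes how well the fitted components explain the returns they were fitted on. In practice a NumPy-based or library implementation would be the more natural choice; plain Python is used here only to keep the example self-contained.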

What the results showed

The signal was weak on day one, stronger after a few days, and then weaker again over longer horizons.

For 1-day returns, all models showed relatively weak predictive power. With control variables, R² values ranged from 0.0168 for Claude 3.5 Sonnet to 0.0846 for Titan Text Premier. Without control variables, performance was weaker still, with R² values not exceeding 0.0279. Beige Book content, as interpreted by the tested LLMs using this prompt design, explained very little of the next-day move in the S&P 500.

The 3-day horizon was much more interesting. With control variables, R² values ranged from 0.1370 for Claude 3.5 Sonnet to 0.3278 for Cohere Command R+. Without controls, the range was lower, from 0.0169 to 0.1181. We read this as evidence that the market may take some time to fully incorporate the information from Beige Book reports.

At 7 days, performance still held up, though it varied more across models. With controls, R² values ranged from 0.0296 to 0.2206. Without controls, the range was 0.0054 to 0.1528. By 14 days, predictive power had generally declined, which we interpret as a sign that the impact of Beige Book reports diminishes over time as newer information enters the market.

That is probably one of the most useful parts of the research. It suggests that Beige Book information is not fully absorbed immediately, but it also does not remain fresh for long. The strongest window appears to be somewhere between three and seven trading days after release.

NIPALS Results with Control Variables

NIPALS Results without Control Variables

Which models stood out

Among the five LLMs tested, Cohere Command R+ and Titan Text Premier consistently demonstrated the strongest predictive performance across different time horizons, especially when control variables were included. Meta Llama 3.1 70B also showed strong performance, particularly for 3-day and 7-day returns. Claude 3.5 Sonnet generally had the weakest predictive power across all time horizons.

The effect of the control variables is worth stressing because it is one of the clearest empirical points in our research. For example, Cohere Command R+ improved from an R² of 0.0555 to 0.3278 for 3-day returns once control variables were added, a nearly sixfold increase. Titan Text Premier’s 7-day R² also improved when controls were included.

The LLM scores were of limited use in isolation; they worked better when combined with traditional economic indicators. This could be treated as evidence that qualitative economic assessments and quantitative macro data are more informative together than apart.

Positive bias in the text

One of the more interesting findings concerns tone. We find a positive bias in Beige Book content across all LLMs, supporting the hypothesis of inherent optimism in central bank communications.

The descriptive statistics point in that direction quite strongly. Mean scores ranged from 0.0875 for Cohere Command R+ to 0.6125 for Meta Llama 3.1 70B. Only Titan Text Premier produced a minimum score of -1, and only twice. By contrast, two LLMs produced a score of 1. We read this as evidence of consistent positive bias across models and districts.
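Checks of this kind are simple to reproduce once the scores sit in a table. The sketch below computes the same sort of summary (mean, minimum, maximum, and how often each value on the grid occurs) for each model. The `describe` helper, the model names and the scores are all made up for illustration; they are not the study's data.

```python
from collections import Counter
from statistics import mean


def describe(scores):
    """Summarise a list of scores on the -1..1 grid."""
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        # Share of strictly positive scores: a quick read on tonal bias.
        "share_positive": sum(1 for s in scores if s > 0) / len(scores),
        "counts": dict(sorted(counts.items())),
    }


# Made-up scores for two hypothetical models; not the study's data.
toy_scores = {
    "model_a": [0.5, 0.5, 0.0, 1.0, -0.5, 0.5, 0.0, 0.5],
    "model_b": [0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0, 0.5],
}
summary = {name: describe(vals) for name, vals in toy_scores.items()}
```

A mean well above zero combined with very few scores at -1 is the pattern described above. Looking at the full count per grid value, not just the mean, helps separate a model that is mildly positive everywhere from one that is mostly neutral with a few strong positives.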

Limits of the findings

One issue is cost and access. The use of on-demand LLMs through AWS Bedrock can become expensive, especially during testing and implementation.

Another issue is prompt design. The same prompt was used across all five models, which helped standardise the study, but may not have fully used the strengths of each model.

There is also a modelling issue. The combination of multiple LLM scores and several control variables created a high-dimensional input space relative to a sample of 80 reports. NIPALS is designed for multicollinearity and dimension reduction, but the risk of overfitting still remains.
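One common way to gauge that risk is to compare in-sample R² with a leave-one-out estimate, where each report is predicted by a model fitted on all the others. The sketch below shows the mechanics with a single-predictor least-squares line standing in for the NIPALS model. That stand-in, the helper names and the toy numbers are assumptions made for illustration, and this check was not part of the study as described here.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def in_sample_r2(x, y):
    """R² when the model is scored on the same data it was fitted on."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def loo_r2(x, y):
    """Leave-one-out R²: every point is predicted by a model that never saw it."""
    n = len(x)
    my = sum(y) / n
    press = 0.0  # sum of squared out-of-sample prediction errors
    for i in range(n):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Made-up LLM scores and forward returns (in %); not the study's data.
scores = [-0.5, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 0.0]
returns = [-0.4, 0.1, -0.2, 0.6, 0.1, 0.5, 0.7, 0.3]
gap = in_sample_r2(scores, returns) - loo_r2(scores, returns)
```

A large gap between the two numbers suggests the fit depends heavily on the particular sample. With 80 reports and many inputs, that gap would be worth reporting alongside the headline R² values.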

Why this matters

What this study really shows is that large language models can do something useful with qualitative macroeconomic text. They can turn it into structured numerical signals that are at least testable in a market setting.

That matters because much of macro analysis still lives in prose. Central bank material and policy statements may influence markets, but they are harder to formalise than time series. This study shows one way to bridge that gap. It also provides a more measured view of how AI may fit into financial research. The best use here was not as a standalone prediction engine. It was as a tool for extracting signal from difficult text and then combining that signal with traditional indicators.