Point Extraction & Sentiment

Now I can start with the first step of the analysis: the summarization. I originally intended to use gpt-3.5-turbo to summarize the points and then assign sentiment to them in a separate step, but after testing I found that using gpt-4o to summarize and assign sentiment in one step actually works better.

An even better way would be to use a Large Language Model like GPT only to extract the points and then assign sentiment with a dedicated classifier trained specifically on financial data, so that it is good at sentiment in that domain. One option is a transformer-based neural network with transfer learning, starting from pretrained weights and fine-tuning them with stochastic gradient descent; that promises high accuracy but also high cost and compute. A more traditional approach would be XGBoost with engineered features (word counts, sentiment scores from lexicons like VADER or Loughran-McDonald) or a Random Forest trained on TF-IDF or word embeddings (Word2Vec, FastText), both much cheaper than transformer-based models. A good middle ground would be recurrent models like a Bi-LSTM with attention or GRUs on top of embeddings (GloVe, Word2Vec): they can capture sequential dependencies while being easier to train than transformers. However, I was advised not to build my own model for this BLL, so I'm going with a simple GPT-4o approach.
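Just for illustration, the cheaper classical route mentioned above would look roughly like the toy sketch below. The training sentences are made up for the example and none of this is code from the project; the GPT-4o approach I actually went with does not use it.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy labeled examples standing in for a real financial sentiment dataset.
train_texts = [
    "Revenue beat expectations and margins expanded",
    "Guidance was cut and debt keeps piling up",
]
train_labels = ["bullish", "bearish"]

# TF-IDF features fed into a Random Forest: cheap to train compared to transformers.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
model.fit(train_texts, train_labels)

print(model.predict(["Free cash flow keeps growing"]))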

The script takes the IDs it saved for the posts it just scraped and fetches the content of those posts from the database. I then use some prompt engineering to instruct GPT to extract factual thesis points from each post and assign each one a sentiment score from 0 to 100, where above 50 is bullish and below 50 is bearish. I also provide the ticker the analysis is for (so it doesn't get confused and extract points for the wrong stock), the stock's name, and the content of the post. Luckily, OpenAI supports passing a JSON schema to force GPT to output its response in a fixed format, so the output can be processed easily without a whole lot of Pydantic. The JSON schema I built for this prompt looks as follows:

{
  "type": "object",
  "properties": {
    "thesis_points": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "point": {
            "type": "string",
            "description": "The extracted thesis point text."
          },
          "sentiment_score": {
            "type": "integer",
            "description": "The sentiment score, where 50 is neutral, above 50 is bullish, and below 50 is bearish."
          }
        },
        "required": ["point", "sentiment_score"],
        "additionalProperties": false
      }
    }
  },
  "required": ["thesis_points"],
  "additionalProperties": false
}
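To show how this fits together, here is a minimal sketch of the extraction call using OpenAI's structured outputs. The function name extract_points and the exact prompt wording are placeholders of mine, not the project's actual code.

import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

THESIS_SCHEMA = {...}  # the JSON schema shown above goes here

async def extract_points(ticker: str, name: str, post_content: str) -> list[dict]:
    # One gpt-4o call extracts the thesis points and scores their sentiment,
    # with the response constrained to the schema via structured outputs.
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Extract factual thesis points about {name} ({ticker}) from the post. "
                    "Assign each point a sentiment score from 0 to 100, "
                    "where above 50 is bullish and below 50 is bearish."
                ),
            },
            {"role": "user", "content": post_content},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "thesis_points", "strict": True, "schema": THESIS_SCHEMA},
        },
    )
    return json.loads(response.choices[0].message.content)["thesis_points"]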

Note that I'm coding all of this with OpenAI's asynchronous client in combination with asyncio. All posts are summarized concurrently, so the wall-clock time is roughly bounded by the slowest single request: no matter how many posts are added, analyzing all of them takes about as long as analyzing the longest one (rate limits aside).
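The fan-out itself is then just an asyncio.gather over one extraction task per post, roughly like this. Again a sketch: it reuses the hypothetical extract_points() from above and assumes posts is a list of dicts with a "content" field.

import asyncio

async def summarize_all(posts: list[dict], ticker: str, name: str) -> list[list[dict]]:
    # Launch one extraction request per post and await them all together;
    # wall-clock time is roughly that of the slowest single request.
    return await asyncio.gather(
        *(extract_points(ticker, name, post["content"]) for post in posts)
    )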

This also led to an interesting problem I had to solve. The entire function is now asynchronous, and asyncio is used to run both this function and the filtering of points (which I'll get to next) concurrently. After introducing these concurrent functions, the analysis status suddenly stopped working properly: requests for the status of the analysis would stall and only return a response after around 10-15 seconds. Since I had just introduced asyncio, it was pretty clear that some synchronous function was blocking the (now asynchronous) program flow, so the /analysis-status endpoint couldn't run until that function was done. After adding timing measurements to figure out which part of my program flow was taking so long, I landed on the scraping flow. In hindsight this is pretty obvious: I'm using a ThreadPoolExecutor to thread the scraping, which means the scraping modules make synchronous, blocking network calls, and waiting on them blocked the event loop. This is solved pretty easily by offloading the blocking calls with asyncio.to_thread().
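A minimal sketch of that fix, with a stand-in scrape_posts() simulating the blocking scraping call (the real modules obviously do more than sleep):

import asyncio
import time

def scrape_posts(ticker: str) -> list[str]:
    # Stand-in for the synchronous scraping code; a blocking call like this
    # would freeze the event loop if run directly inside a coroutine.
    time.sleep(5)
    return [f"post about {ticker}"]

async def run_scrape(ticker: str) -> list[str]:
    # Offload the blocking call to a worker thread so the event loop stays
    # free to serve endpoints like /analysis-status while scraping runs.
    return await asyncio.to_thread(scrape_posts, ticker)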