Criticism Extraction
Expanding the Database
Comments on platforms like Reddit or Seeking Alpha exist for a reason: in this context they mainly express criticism or approval of the given post. That gives us an additional interesting data point to use. Since the comments contain criticism of the investment points in a post, I can cross-reference them with those points and thereby perform what is essentially peer review on the theses. That is why I specifically extracted the comments earlier while scraping the posts.
In the previous steps the script already [[#Adding Posts to the Database|saved the posts to the database]] and then fetched the post content from there to [[#Summarizing posts and assigning sentiment|extract theses]] and [[#Using sets to filter out previously checked posts|filter them]]. This is the most straightforward and least error-prone way of doing things because it prevents any collision or discrepancy between the extracted points and the posts. It therefore makes sense to do the same thing for the comments: just retrieve the comments of the newly scraped posts from the database. However, when I first built the database I did not yet know this was going to be the approach, which is why I didn't build the database structure to hold comments. So it is at this point that I added a `Comment` table to the database. The details of this table are already explained at the top of the documentation in [[#Database Structure]]. Essentially, the table holds the comment's content, URL and author. In case you're wondering why the author is nullable: right now I'm only scraping authors from Reddit, not from Seeking Alpha. I'm also not going to use the author data point, at least not within the scope of this documentation. Later on (after the BLL) I plan to add a feature that tracks the performance of given users: it will simulate trades based on an author's theses and record their performance to build a ranking. But again, I'm not doing that here just yet; I just wanted to explain it so you wouldn't think I'm adding unnecessary columns.
The `Comment` table is linked to posts and criticisms, meaning a post can have multiple comments and a comment can have multiple criticisms (exclusively in the described one-to-many direction).
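To make the relationships concrete, here is a minimal sketch of the schema using stdlib `sqlite3`. The table and column names are illustrative and may not match the project's actual ORM models; the key properties from the text are the nullable `author` column and the two one-to-many links.

```python
import sqlite3

# Hypothetical schema sketch -- the real project's table definitions may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE comment (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),  -- one post, many comments
    content TEXT NOT NULL,
    url     TEXT NOT NULL,
    author  TEXT                                   -- nullable: Seeking Alpha comments carry no author yet
);
CREATE TABLE criticism (
    id         INTEGER PRIMARY KEY,
    comment_id INTEGER NOT NULL REFERENCES comment(id),  -- one comment, many criticisms
    summary    TEXT NOT NULL
);
""")

# A comment without an author is perfectly valid.
conn.execute(
    "INSERT INTO comment (post_id, content, url) VALUES (1, 'Too optimistic on margins', 'https://example.com')"
)
row = conn.execute("SELECT author FROM comment").fetchone()
print(row)  # (None,)
```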
Fetching & analyzing comments
Now let's get to the `extract_criticisms.py` module. Since I plan to extract criticisms post by post, I'm again going to use asyncio and AsyncOpenAI to run this concurrently.
I need to do this post by post because each post has its own comments, so I want to process each post's points together with its respective comments. Since the points are currently just an unstructured list, I first have to group them by their post IDs. For that I use a `defaultdict` with `list` as the default factory, so that any accessed key that isn't present yet is set to an empty list. This way it's possible to simply loop through the points and append each point to its post ID's key in the dictionary. So if the post ID is `1`, I add the point to the dictionary's `1` key, and so on.
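The grouping step can be sketched like this; the point dictionaries are simplified stand-ins for the project's real point records:

```python
from collections import defaultdict

# Hypothetical point records -- the real objects carry more fields
# (embeddings, sentiment, etc.); post_id is what we group on.
points = [
    {"post_id": 1, "content": "Strong free cash flow"},
    {"post_id": 2, "content": "Heavy debt load"},
    {"post_id": 1, "content": "Expanding margins"},
]

# Any key accessed for the first time starts as an empty list,
# so we can append without checking whether the key exists.
points_by_post = defaultdict(list)
for point in points:
    points_by_post[point["post_id"]].append(point)

print(dict(points_by_post))
```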
Now that I have a dictionary of point lists grouped by the posts they were extracted from, I can asynchronously pass these dictionary entries to the extraction function.
First I fetch all comments for the post ID from the database and structure them into a list of dictionaries holding each comment's ID and content.
I'll be using GPT again to compare the comments/criticisms against the points, but I do not want to pass the entire point data to it. That includes things like the points' vector embeddings, which would massively increase the input token count without improving the process at all. So I reduce each point to just its content and sentiment score (which helps add context).
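A small sketch of both shaping steps; the field names here are illustrative, not necessarily the project's exact schema:

```python
# Hypothetical database rows and point records.
db_comments = [
    {"id": 11, "content": "Margins are already peaking."},
]
full_points = [
    {"id": 7, "content": "Strong free cash flow", "sentiment": 80,
     "embedding": [0.0] * 1536},  # embeddings would bloat the prompt for no benefit
]

# Comments as a list of {id, content} dicts for the prompt.
comments = [{"id": c["id"], "content": c["content"]} for c in db_comments]

# Points reduced to just content + sentiment to save input tokens.
slim_points = [{"content": p["content"], "sentiment": p["sentiment"]} for p in full_points]
```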
Then I craft a system prompt instructing GPT to check whether valid criticisms are found in the comments. If it finds no valid criticism, it should keep the point and set the output boolean `criticism_exists` to `false`. If multiple strong criticisms exist, or a single criticism is so strong that it invalidates the point, it should leave the point out of the final output. If there is criticism but the point remains viable, it should link a short summarized version of the criticism to the point, set `criticism_exists` to `true` and assign the criticism a `validity_score` from 1-100 depending on how strong it is.
I then provide it with the ticker, the points and the comments (the ticker is there so it can use the web to validate any criticisms it finds). Lastly, I again provide a JSON schema so the program flow can easily process the output.
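For illustration, a hedged sketch of what such a response schema could look like. Only `criticism_exists` and `validity_score` are named in the text; the surrounding structure and the other field names are assumptions, and the real schema in `extract_criticisms.py` may differ:

```python
# Hypothetical JSON schema for the structured model output.
CRITICISM_SCHEMA = {
    "type": "object",
    "properties": {
        "points": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "content": {"type": "string"},
                    "criticism_exists": {"type": "boolean"},
                    "criticism": {"type": "string"},        # short summary; empty if none found
                    "validity_score": {"type": "integer"},  # 1-100 strength of the criticism
                },
                "required": ["content", "criticism_exists"],
            },
        }
    },
    "required": ["points"],
}
```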
Since I gave GPT a simplified version of the points, I now have to merge this simplified version back with the original full points before returning them.
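One way to sketch that merge, assuming the model echoes each surviving point's content verbatim (the real code may instead match on an explicit index or ID):

```python
# Hypothetical full points and model output.
full_points = [
    {"id": 7, "content": "Strong free cash flow", "sentiment": 80, "embedding": [0.0] * 4},
]
model_output = [
    {"content": "Strong free cash flow", "criticism_exists": True,
     "criticism": "FCF relies on one-off asset sales", "validity_score": 60},
]

# Index the full points by content, then overlay the model's criticism fields.
by_content = {p["content"]: p for p in full_points}
merged = []
for slim in model_output:
    full = by_content.get(slim["content"])
    if full is None:
        continue  # fallback in case the model reworded the point
    merged.append({**full, **slim})  # full data plus criticism fields

print(merged[0]["validity_score"])  # 60
```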
Note: one small thing worth mentioning is that since I'm running all of this asynchronously, I don't want the whole process to fail just because one task failed. Therefore I tell asyncio to return exceptions. In the main running module I then filter all exceptions out of the output and log them, returning only the results that did not fail.
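The failure-tolerance pattern described above can be sketched like this; `process_post` is a hypothetical stand-in for the real per-post extraction coroutine:

```python
import asyncio

# Minimal sketch: gather with return_exceptions=True so one failing task
# does not cancel the rest, then log and drop the exceptions.
async def process_post(post_id: int) -> str:
    if post_id == 2:
        raise ValueError(f"post {post_id} failed")
    return f"post {post_id} done"

async def main() -> list[str]:
    results = await asyncio.gather(
        *(process_post(i) for i in (1, 2, 3)),
        return_exceptions=True,  # exceptions come back as results, not raised
    )
    for r in results:
        if isinstance(r, Exception):
            print(f"logged failure: {r}")  # real code would use a logger
    return [r for r in results if not isinstance(r, Exception)]

ok = asyncio.run(main())
print(ok)  # ['post 1 done', 'post 3 done']
```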