Building an Advanced Nietzsche AI Database (2024 version)
An advanced database system for Nietzsche's philosophical works
As any Nietzsche reader knows, his works are open to many interpretations. This is part of the charm of reading Nietzsche, but it can also be a drawback: opinion leaders such as history writers, YouTubers and bloggers can pick out a statement that fits their own interpretation and promote just one side of Nietzsche’s ideas. Even worse, the actual meaning of Nietzsche’s writing can get lost in translation. It also doesn’t help that over the course of his writing career, Nietzsche changed his opinions quite a bit.
I needed a way to VERIFY his own statements, other authors’ statements, or just about anything anyone ever said about Nietzsche. Using CTRL+F across each of his 10+ works isn’t a very efficient process. In comes ChatGPT! Unfortunately, I noticed that its answers were inconsistent, I don’t know whether all of Nietzsche’s works are in its training data, and it sometimes gives plainly wrong answers. To remove as much friction as possible between my question and a verifiable answer, I needed a combination of both solutions. Luckily I was in need of a project where I could test all my LLM engineering skills, so I sat down and started to put in the work.
This blog can be read by philosophy enthusiasts and LLM enthusiasts alike. It is meant to be “alive” and might receive regular updates. If you have any suggestions, would like to contribute or have any other ideas, just reach out to me.

Index
- CorpusDB
- CorpusLLM
- CorpusUI
- Hosting
- Custom GPT in ChatGPT
- Evaluating the results
- Roadmap
1. CorpusDB | Building the database through manual labor
I’ve always had the idea that books are a very inefficient store of information. It’s increasingly rare that books are worthwhile to read from start to finish, and when a book is indeed good and information-dense, it’s hard to remember or retrieve all of its key ideas. I believe long texts are just data in disguise: instead of rows and cells, they are split into paragraphs, subchapters and chapters, and within those the contents can be split semantically as well. Therefore I think it’s already useful to split books into enumerated chapters, subchapters and paragraphs. This way they can easily be referenced, or compared with other translations or the original work. So I went about doing just that.
Luckily, all of Nietzsche’s works are available in the public domain through Project Gutenberg. I started building a manual database (a.k.a. a CSV) containing all books, including metadata such as popular title, Gutenberg title, book id, and the URL of the txt version of the book. Next came the most laborious part: splitting the books into chapters and subchapters.
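For illustration, the first rows of such a metadata CSV could look like this (the column names and URLs are my reconstruction of the idea, not the exact file; verify the IDs and URL pattern against the actual Gutenberg pages):

```csv
popular_title,gutenberg_title,book_id,txt_url
Beyond Good and Evil,Beyond Good and Evil,4363,https://www.gutenberg.org/cache/epub/4363/pg4363.txt
Thus Spake Zarathustra,Thus Spake Zarathustra: A Book for All and None,1998,https://www.gutenberg.org/cache/epub/1998/pg1998.txt
```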
For each book I needed to identify a pattern by which I could programmatically recognize the start and end of each chapter and subchapter. I found that most subchapters were already enumerated, with the subchapter number on a separate line before its text. Roughly, the process went like this:
- Remove all the book contents not written by Nietzsche, such as title pages, translator’s introductions, footnotes etc.
- Split the book into an array based on newlines
- Extract all numeric-only values as subchapter headings
- Add the contents between each heading into a dataframe in the same row as the heading
- LOTS OF manual cleaning up such as incorrectly formatted headings, duplicated or missing headings, etc.
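The steps above can be sketched in a few lines (the heading regex and function are illustrative; the real process involved far more manual cleanup than this):

```python
import re

def split_subchapters(text: str) -> dict[int, str]:
    """Split a cleaned book text into {subchapter_number: contents},
    treating lines that contain only a number (e.g. '12' or '12.')
    as subchapter headings."""
    heading = re.compile(r"^\s*(\d+)\.?\s*$")
    chapters: dict[int, str] = {}
    current = None          # number of the subchapter we are inside
    buffer: list[str] = []  # lines collected since the last heading
    for line in text.splitlines():
        m = heading.match(line)
        if m:
            if current is not None:
                chapters[current] = "\n".join(buffer).strip()
            current = int(m.group(1))
            buffer = []
        elif current is not None:
            buffer.append(line)
    if current is not None:
        chapters[current] = "\n".join(buffer).strip()
    return chapters

book = "1.\nGod is a supposition.\n\n2.\nWhat does not kill me\nmakes me stronger."
print(split_subchapters(book))
```

From there, the dictionary drops straight into a dataframe with one row per subchapter.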
Each book was different but showed similar patterns, so the process went smoother and faster with each one. Many hours later it was done: a dataset containing all chapters Nietzsche ever wrote, or at least the ones present on Project Gutenberg. (If you are interested in this dataset, please let me know and I will send it to you; I’m still looking for a good place to share it.)
“A dataset containing all chapters Nietzsche ever wrote”
This dataset can be used for all kinds of AI bells and whistles, but I think it’s already useful in itself. At least it makes the CTRL+F part even easier than it was before.
2. CorpusLLM | Chunking and summarizing
Now I had to find the most effective way to search and retrieve each subchapter. Here I found basically two options: chunking and/or summarization.
Chunking
Nietzsche’s subchapters can range from a single sentence to multiple pages. My first thought was that I’d like to split long subchapters semantically, so that a search returns the relevant part of the subchapter. I can easily let an LLM decide how to split a subchapter, right? I tried it on test paragraphs that clearly changed topics at certain points, and it worked perfectly. I created a prompt in which all sentences of the chapter were enumerated, so the LLM response only had to be a few numbers in an array, saving time and costs.
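The enumerated-sentence trick can be sketched roughly like this (the prompt wording and helper names are my own, assuming the LLM replies with a JSON-style array of indices):

```python
def build_split_prompt(sentences: list[str]) -> str:
    """Enumerate the sentences so the LLM only has to answer with a
    short array of indices, which is cheap and fast to generate."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))
    return (
        "Below is a numbered list of sentences from one subchapter.\n"
        "Reply ONLY with a JSON array of the indices where a new topic "
        "begins, e.g. [3, 7].\n\n" + numbered
    )

def apply_splits(sentences: list[str], split_points: list[int]) -> list[str]:
    """Turn the LLM's split indices into text chunks."""
    bounds = [0] + sorted(split_points) + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]

sents = ["Truth is hard.", "So is art.", "Now, on morality.", "Morality binds."]
# Pretend the LLM answered "[2]", i.e. a new topic starts at sentence 2:
print(apply_splits(sents, [2]))
```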
Technically, this worked perfectly. HOWEVER, even with GPT-4 I got results I did not agree with, and different LLM models greatly disagreed with each other.
Then I stumbled on another approach: the aptly named Python package semantic-text-splitter. Instead of inefficiently passing all contents through an LLM, it uses the bert-base-uncased tokenizer to do the job, mathematically I presume? I am not going to claim I understand how it works, but the results were disappointing. Here I also realized that Nietzsche was probably very thoughtful about what he put into a single subchapter, and that it probably shouldn’t be split at all.
Summarizing
Summarizing each subchapter serves several purposes. First, to semantically retrieve these pieces of text I want to remove as many contextual words as possible and keep the most important keywords. Second, when retrieving a large number of documents, I can quickly glance over the summaries to get a grasp of what a subchapter is about and whether it’s worthwhile to read in full. Luckily, summarizing is easier than semantic chunking, so it was done quite quickly. As we’re passing the entire Nietzsche corpus through a paid LLM, I chose the Mixtral 8x7b model through OpenRouter for my first tests. Once I’ve verified everything works as intended, I might run it through GPT-4 or Claude 3 Opus to get the absolute best summaries LLMs have to offer. I made sure to save the output dataframe to a CSV so I don’t have to run the summarization process again, and so it can be loaded into the next part of the project.
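The caching idea can be sketched like this, with the actual OpenRouter call abstracted into an injected `summarize` callable (function and column names are illustrative, not my exact code):

```python
import csv
import os
from typing import Callable

def summarize_corpus(rows: list[dict], summarize: Callable[[str], str],
                     cache_path: str) -> dict[str, str]:
    """Summarize each subchapter once, caching results to CSV so a
    re-run never repeats (paid) LLM calls for rows already done."""
    cached = os.path.exists(cache_path) and os.path.getsize(cache_path) > 0
    done: dict[str, str] = {}
    if cached:
        with open(cache_path, newline="", encoding="utf-8") as f:
            done = {r["id"]: r["summary"] for r in csv.DictReader(f)}
    with open(cache_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "summary"])
        if not cached:
            writer.writeheader()
        for row in rows:
            if row["id"] in done:
                continue  # already summarized in an earlier run
            summary = summarize(row["text"])
            writer.writerow({"id": row["id"], "summary": summary})
            done[row["id"]] = summary
    return done
```

Swapping the `summarize` callable is also how I can later rerun the corpus through GPT-4 or Claude 3 Opus without touching the loop.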
3. CorpusUI | Embedding and making it actually useful
As a Node.js developer, I am used to endless flexibility in building a frontend. However, I never liked frontend development: messing around with components, CSS and the like. I’d rather focus on the usability of an interface than on its looks, and for that Gradio is a blessing, as its current popularity in the AI world shows. Gradio lets you build an interactive interface for your Python functions and data in a ridiculously short amount of time.

But before we can start building something interactive, we first need a way to make our subchapters semantically retrievable. This matters for my use case because a literal search for a topic might miss synonyms and spelling variants: when I search for “philosopher” I also want to retrieve documents about “thinkers”.
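To illustrate why embeddings help here, a toy example with hand-made three-dimensional vectors (real embedding models produce hundreds of dimensions; the numbers below are invented purely for demonstration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors: imagine these came from an embedding model. Related words
# point in similar directions, so their cosine similarity is high.
vecs = {
    "philosopher": [0.9, 0.1, 0.0],
    "thinker":     [0.8, 0.2, 0.1],
    "sausage":     [0.0, 0.1, 0.9],
}
query = vecs["philosopher"]
ranked = sorted(vecs, key=lambda w: cosine(query, vecs[w]), reverse=True)
print(ranked)  # "thinker" ranks far above "sausage"
```

A vector database does exactly this ranking, just over thousands of stored document embeddings at once.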
For storing the vector embeddings I chose Chroma DB because of its simplicity for local usage, which was useful in my prototyping phase as I tried all kinds of different approaches. For my tests I used a local embedding model, all-MiniLM-L6-v2. When I got the hang of it and knew how I wanted to store and retrieve the data, I switched to OpenAI’s new text-embedding-3-small. Even for my dataset the cost was negligible, and later I will switch to the large variant.
The interface
Building the interface was quite the iterative process since here I was able to test out my actual use case. Some considerations I ran into:
- I decided I really did need a “must contain” option, especially for searching names such as Wagner and Schopenhauer
- I found that it’s not effective to use the full user question directly as the query for the vector database. Instead, I added an intermediate step in which an LLM extracts the relevant keywords from the question and uses those as the query, a technique known as query rewriting or query transformation. I made this step optional in case I want full control over the search query.
- I really want to show the relevance score to the user, which unfortunately is not available with all retrieval methods, such as the multi-query retriever. It’s not hard to cook up a manual solution that includes the scores, but I haven’t found a need for that yet.
- I noticed I want the absolute best LLM to summarize the search results for me. The meaning of the text is especially important in this project so GPT-4 it is.
- As mentioned earlier, you probably don’t want to read all the full subchapters in the search results. Therefore I chose to present the summaries, their locations and the relevance scores in a table as the search result. Luckily I found out that Gradio dataframes support the HTML cell type (which was not in the documentation when I wrote this), so I could make it so that clicking on a summary opens the full subchapter.
- The full search results, including locations, summaries and full subchapters, are passed into one big prompt for summarization. The output request in this prompt might need some optimization, but it captures the main ideas quite well. And when in doubt about the answer, you are fortunate to have all the references already under your nose!
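Two of the considerations above, the “must contain” filter and the optional query rewriting, can be sketched as follows (prompt text and function names are illustrative; the LLM call is injected as a callable):

```python
from typing import Callable

REWRITE_PROMPT = (
    "Extract the most important search keywords from this question. "
    "Reply with the keywords only, separated by spaces.\n\nQuestion: {q}"
)

def build_query(question: str, llm: Callable[[str], str], rewrite: bool = True) -> str:
    """Optionally rewrite a full user question into a compact keyword
    query before it hits the vector database."""
    if not rewrite:
        return question  # full control: use the raw input as the query
    return llm(REWRITE_PROMPT.format(q=question)).strip()

def must_contain(results: list[dict], term: str) -> list[dict]:
    """Keep only results whose text literally contains the term
    (case-insensitive), useful for names like 'Wagner'."""
    return [r for r in results if term.lower() in r["text"].lower()]

# With a stub "LLM" that just returns fixed keywords:
stub = lambda prompt: "Nietzsche Schopenhauer opinion"
print(build_query("What did Nietzsche think of Schopenhauer?", stub))
print(build_query("Wagner", stub, rewrite=False))
```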
4. Hosting
As I am most familiar with Google Cloud Run for hosting things quickly and easily, I went with that. I (well, ChatGPT) wrote a decent Dockerfile, and after some inevitable cloud debugging the project is live! It’s not perfect yet, but it’s definitely useful already. See my tests below for more about that.
5. Custom GPT in ChatGPT
I’ve been meaning to build a Custom GPT with function calling, and I realized this might be the perfect use case. Much of the code was already done, and the Custom GPT can do some of the heavy lifting (and save some costs). First I tried the “Use via API” function underneath the Gradio page, but it felt too hacky to cram two different application types into a single script. So I created a simple FastAPI app in the same project, reused the same functions, and had a simple API that ChatGPT could use. After messing around with the manifest, I published it for all (ChatGPT subscribers) to see: The Nietzsche Reference Bot (https://chat.openai.com/g/g-F62wnKW8A-nietzsche-reference-bot).
I could have made it act more like an AI agent, extracting years and book titles from the user question and sending them as parameters to the API, but I chose not to, because:
“Be your own agent before you write an agent”
I believe you must have spent significant time tweaking your retrieval methods manually before you can tell an AI precisely what it is you want it to do.
6. Evaluating the results
Here I am going to use the application by question type and evaluate the results.
Type 1: “Topic” “What did Nietzsche say about ‘Philosophers’?” -> here we get a bunch of results, so I passed in 25 subchapters to summarize and this is the result:
Type 2: “When” “Where did Nietzsche introduce ‘Amor Fati’?” -> it returns just 1 excerpt from Ecce Homo, which (as far as I know) is the only literal mention of the term.
Type 3: “Is it true” “Is it true that Nietzsche hated Schopenhauer?” -> as we all know, he changed his mind about Schopenhauer and the answer reflects that, including references!
Type 4: “Contradictions” “Did Nietzsche contradict himself on the topic of ‘science’?” -> The answer to this question is, as ChatGPT mentions ever so often “complex and multifaceted”, which is the correct answer in this case.
Overall I am satisfied with the results, and as mentioned before, they can be improved upon if I use these question types for few-shot prompting.
7. Roadmap
Reranking: I am still undecided whether to add this to the project. The order of the results is not perfect, but since the user can decide how many results are passed into the summarizer, it lets the LLM sort of rerank them as well.
Agents: as mentioned before, once I’ve spent a good amount of time tuning the parameters and trying out many different types of questions, I will create an option for the LLM to decide these things for me. Ideally I’ll classify the most common question types and use them for few-shot prompting.
Chain of thought and judging: either way, I want to add an extra step in which the LLM judges the quality of the results. I want this to be optional, because ultimately the human decides whether the results and answer are sufficient for the question.
Reinforcement learning: this could theoretically be applied to several parts of the project by letting the AI judge or rate itself with a grade. Not sure if this will work, but it might be a good way to fine-tune the quality of the summaries and answers.
Translations: I would like to add the contents of the original German books, plus an option for the LLM to judge the Gutenberg translation against them. Nietzsche used a lot of wordplay which may have gotten lost in translation, or which different translators disagree about. This gives us a very low-effort way to judge for ourselves!
Add context: I could place a button next to the subchapter summary or the results summary that asks the LLM to provide context for the text, such as whether it’s controversial, true or untrue, or what other philosophers thought about it.
Streaming: I expected this to be easier to implement, but will require some engineering. As I’m not building a consumer oriented or commercial product I decided not to spend a lot of time on that right now.
Fine-tuning: I might use a fine-tuned model that can write in the style of Nietzsche himself. Another use case for a custom fine-tuned model is preventing censorship, as Nietzsche’s ideas might not pass all the safety filters of commercial LLMs.
Other books? Let me know!
Open sourcing: for both the dataset and the code, I need to know whether there’s enough demand before sharing them. I’ll need to do some cleaning up first, so I need to know if it’s worth the effort.
Conclusion
It took quite some time to build this, but it took my RAG skills to the next level. I am currently finishing the last few books of Nietzsche; once done, I will study his (subjectively) most interesting ideas by topic, and Corpus should make that process as frictionless as possible.