Build with language models via llm
`llm` (previously) is a tool Simon Willison is working on for interacting with large language models, running via API or locally.
I set out to use `llm` as the glue for prototyping tools to generate embeddings from one of my journals so that I could experiment with search and clustering on my writings. Approximately, what I’m building is an ETL workflow: extract/export data from my journals, transform/index it as searchable vectors, load/query for “what docs are similar to or match this query?”.
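As a preview, the finished workflow amounts to two tiny scripts (developed below) plus a clustering command (more on each in a moment):

```sh
./extract-entries.sh Journals.json    # extract + transform: journal entries → embedding vectors
./query.sh "What is good in life?"    # load + query: which entries resemble this question?
llm cluster journals 10               # and, as a bonus: group the entries into 10 clusters
```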
Extract and transform, approximately
Given a JSON export from DayOne, this turned out to be a matter of shell pipelines. After a few iterations of prompting (via Raycast’s GPT-3.5 integration), I came up with a simple script for extracting entries and loading them into a SQLite database of embedding vectors:
```sh
#!/bin/sh
# extract-entries.sh
# $ ./extract-entries.sh Journals.json
file=$1
cat "$file" |
  jq '[.entries[] | {id: .uuid, content: .text}]' |
  llm embed-multi journals - \
    --format json \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --database journals.db \
    --store
```
A couple of things to note here:

- The placement of the `-` parameter matters. I’m used to placing it at the end of the parameter list, but that didn’t work (a quick sketch of the difference follows this list). The `llm embed-multi` docs suggest that `--input` is equivalent, but I think that’s a docs bug (the parameter doesn’t seem to exist in the released code).
- I’m using a locally-run model to generate the embeddings. This is very cool!
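To make that first note concrete, here’s the shape of the difference that tripped me up; treat the “didn’t work” case as my experience with the version I had installed rather than documented behavior:

```sh
# What worked for me: "-" (read docs from stdin) right after the collection name
llm embed-multi journals - --format json --store

# What I expected to work, but didn't: "-" at the end of the parameter list
llm embed-multi journals --format json --store -
```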
In particular, `llm embed-multi` takes JSON docs with `id`/`content` keys (one doc per line by default, or a JSON array via `--format json`, as above) and “indexes” those into a database of document/embedding rows. (If you’re thinking “hey, it’s SQLite, that has full-text search, why not both?”: yes, me too, that’s what I’m hoping to accomplish next!)
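For illustration, here’s roughly what that `jq` filter hands to `llm embed-multi`, using a made-up, heavily trimmed DayOne export (real exports carry many more fields per entry):

```sh
$ cat Journals.json
{"entries": [{"uuid": "ABC123", "text": "Went for a long walk and thought about writing."}]}

$ jq '[.entries[] | {id: .uuid, content: .text}]' Journals.json
[
  {
    "id": "ABC123",
    "content": "Went for a long walk and thought about writing."
  }
]
```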
I probably could have built this just by iterating on shell commands, but I like editing with a full-blown editor and don’t particularly want to practice using the zsh builtin editor. 🤷🏻‍♂️
Load, of a sort
Once that script finishes (it takes a few moments to generate all the embeddings), querying for documents similar to a query text is also straightforward:
```sh
#!/bin/sh
# query.sh
# $ ./query.sh "What is good in life?"
# Query the embeddings and pretty-print the results
query=$1
llm similar journals \
  --number 3 \
  --content "$query" |
  jq -r -c '.content' |
  mdcat
```
Of note, two things that probably should have been more obvious to me:

- I don’t need to write a for-loop in shell to handle the output of `llm similar`; `jq` basically has an option for that (sketched just below)
- Pretty-printing Markdown to a terminal is trivial after `brew install mdcat`
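To spell out the `jq` bit: as I understand it, `llm similar` emits one JSON object per result — with the stored text under `content`, since the index was built with `--store` — so `jq -r -c '.content'` plucks that field out of every line with no explicit loop. Roughly (ids, scores, and exact keys here are illustrative and may differ across `llm` versions):

```sh
$ llm similar journals --number 2 --content "What is good in life?"
{"id": "ABC123", "score": 0.81, "content": "Went for a long walk...", "metadata": null}
{"id": "DEF456", "score": 0.74, "content": "Friendship, mostly...", "metadata": null}

$ llm similar journals --number 2 --content "What is good in life?" | jq -r -c '.content'
Went for a long walk...
Friendship, mostly...
```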
I didn’t go too far into clustering, which also boils down to one command: `llm cluster journals 10` (see the sketch below). I hit a hiccup wherein I couldn’t run a model like Llama 2, or an even smaller one, because of issues with my installation.
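For the curious, here’s a sketch of the clustering step — with the caveat that, as I understand it, `llm cluster` comes from the `llm-cluster` plugin rather than `llm` itself:

```sh
# Assumption: clustering lives in the llm-cluster plugin, not core llm
llm install llm-cluster

# Group the "journals" collection into 10 clusters
llm cluster journals 10

# The plugin can also ask a model to label/summarize each cluster,
# which is where my local-model installation issues bit me
llm cluster journals 10 --summary
```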
Things I learned!
- `jq` is very good on its own!
  - and has been for years, probably!
  - using a copilot to help me take the first step with syntax using my own data is the epiphany here
- `llm` is quite good, doubly so with its growing ecosystem of plugins
  - if I were happier using shells, I could have done all of this in a couple of relatively simple commands
  - it provides an adapter layer that makes it possible to start experimenting/developing against usage-priced APIs and switch to running models/APIs locally when you get serious (see the sketch after this list)
- it’s feasible to do some kinds of LLM work on your own computer
  - in particular, if you don’t mind trading your own time getting your installation right to gain independence from API vendors and usage-based pricing
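That adapter layer shows up right on the command line: switching between a usage-priced API and a local model is, roughly, a matter of swapping the model name. A sketch, assuming the OpenAI plugin’s `ada-002` alias and `llm-sentence-transformers` are both installed and configured:

```sh
# Hosted, usage-priced API (needs an OpenAI key configured in llm)
llm embed -m ada-002 -c "What is good in life?"

# Local model via llm-sentence-transformers -- same interface, no API bill
llm embed -m sentence-transformers/all-MiniLM-L6-v2 -c "What is good in life?"
```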
Mission complete: I have a queryable index of document vectors I can experiment with for searching, clustering, and building applications on top of my journals.