
Dion Almaer

Software, Development, Products


Archives for February 2023

I have scissors all over my house

February 20, 2023

Midjourney hallucinates scissors and houses

I used to be the type of chap who had one place for everything. The scissors are in That Drawer in the kitchen. It kinda worked, until I had to live with other humans I didn’t have control over.

After years of fighting against the system of “one place for every thing” I went all in on the other direction, I think inspired by reading Algorithms to Live By, and Brad Fitzpatrick’s work with Perkeep. Since then, I have taken the opposite approach to a single source of truth for objects, and instead I put items in many places, ideally where they would be used. This is why I now have scissors all over my house, or screwdrivers, etc. “All over” is going a lil far; I put them in spots where I think they will be used. This is akin to a CDN… caching a copy closer to where someone needs it.

Why am I talking about scissors?

I went through the exact same kind of transition in the virtual world. Where do I keep my data? I would try to centralize it as much as possible. E.g. Google Drive as a source of truth, and then split off for types of content that weren’t a fit (e.g. I used Asana as a central database for a long time, and 1Password for passwords, and Active Recall for learning, and Type.ai for writing, etc).

After years of trying new systems, and migrating data, I took the other approach:

  • I embrace the fact that one system won’t be perfect, for now, and especially not for the future
  • I don’t worry about migrating data as I try new things. I jumped around between Roam, Obsidian, and Logseq, as an example. Once things are a lil more settled, I may then do some migration
  • I favor products where I can get to the data (yay owning your data)
  • I favor products where there are strong integrations. Instead of a central merge, I can then connect all of the things and have the data show up in all of the places

But how about finding things? Integrating into one search to rule them all is vital when you have data in various spots. I’m hopeful for products such as Needl (I need more integrations and an SDK to integrate with). My latest foray here is importing all of my second brain into my own local Polymath, with access control.

Now I can use natural language to get semantic search results, each with links that poke me to where the data is. Often it’s in multiple spots so I can choose what I want to open up to see and use that data.

There is so much opportunity for us to get to a collective, with integrations, that allows us to evolve and connect our data.

GenAI: Lessons working with LLMs

February 14, 2023

Creativity & Constraints, Foundations & Flywheels

The developer community is buzzing around the new world of LLMs. Roadmaps for the year are getting ripped up one month in, and there is a whole lot of tinkering… and I love the smell of tinkering.

At Shopify we shared a new Winter Edition, which packaged up 100+ features for merchants and developers. Some of the launches had a lil Shopify Magic in them, using LLMs to make life better for our users.

I had a lot of fun shipping something for developers that used LLMs, and I thought I would write about a few things that I learned on the way to shipping.

The mock.shop homepage

What did we ship? mock.shop

We want to make it as easy as possible for developers to learn and explore commerce by playing. We wanted to remove as much friction as possible from exploring a commerce data model and building a custom frontend to show off your work.

This is where mock.shop comes in: it sits in front of a Shopify store, but doesn’t require you to create one yourself. Just start playing with it and hitting it directly!
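To give a sense of what “hitting it directly” can look like, here is a minimal fetch sketch; the endpoint URL and the exact field names are assumptions on my part, so check mock.shop itself for the real query shapes.

// Minimal sketch of querying mock.shop with fetch. The https://mock.shop/api
// endpoint and the Storefront-style products/edges/node shape are assumptions.
const query = `
  query {
    products(first: 3) {
      edges {
        node {
          id
          title
        }
      }
    }
  }
`;

const response = await fetch("https://mock.shop/api", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query }),
});

const { data } = await response.json();
console.log(data.products.edges.map(({ node }) => node.title));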

One thing we have heard from some developers is that they are new to GraphQL and/or new to the particulars of the commerce domain. We show examples, along with the GraphQL and code to work with them, but could we go even further?

Gil seeing mock.shop

Generate query with AI

What if you could just use your words and ask us to generate the GraphQL for you? That’s exactly what we did. And here’s what we learned…

Foundations & Flywheels

We used OpenAI for this work, and when working with LLMs you are working with a black box. While GPT-3 had some knowledge of GraphQL, and Shopify, its knowledge was outdated and often wrong. Out of the box you are working with anything that the model has sucked up, and you can’t trust this data at all.

You need to do all you can to feed the black box information so that it can come up with the best results. Given the black box, you will need to experiment and keep poking it to see if you are making it better or worse.

Here are some of the foundational things that we did:

Feed it the best input

Gather all of the information that you think will nudge the model in the right direction. In our case we gathered the GraphQL schema (SDL) for the Shopify storefront APIs, and then a bunch of good examples. With these in hand, we would chunk them up and create OpenAI embeddings from them. You end up with a library of these embeddings, which are vectors that represent the chunks of text.

With these embeddings we can take user queries (e.g. “Get me 7 of the most recent products”), get an embedding from that query, and then look for similar embeddings in the library that we have created. Those will contain snippets such as the schema for the products GraphQL section, and some of the good examples that work with products. We call this context, and we pass it to the OpenAI completions endpoint as part of a prompt.
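Here is a rough sketch of that embed-and-retrieve flow, calling the OpenAI embeddings REST endpoint directly. The model name reflects what was current at the time, and the helper names (embed, findContext) are just for illustration.

// Create an embedding for a piece of text via the OpenAI REST API.
async function embed(text) {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "text-embedding-ada-002", input: text }),
  });
  const json = await response.json();
  return json.data[0].embedding; // a vector of floats
}

// Compare two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// library is an array of { text, embedding } built from the schema and example chunks.
async function findContext(library, userQuery, topK = 3) {
  const queryEmbedding = await embed(userQuery);
  return library
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((chunk) => chunk.text)
    .join("\n\n");
}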

Customize the prompt

You will want to play with prompts that result in the right kind of output for your use case. In our case we are looking for the black box to not just start completing with sentences, but rather give back valid GraphQL.

You end up with a prompt such as:

Answer the question as truthfully as possible using the provided context, and if you don’t have the answer, say “I don’t know”.\nContext:\n${context}\n\nQuestion:\nWhat is a Shopify GraphQL query, formatted with tabs, for: ${query}\n\nAnswer:

You can see how the prompt is:

  • Politely asking for the answer to be truthful
  • Nudging for the answer to be tied to the given context (from the embeddings) vs. making it up from full cloth, and saying that it’s ok to say “I don’t know”!
  • Asking for a formatted GraphQL query

One other way that we try to stop any hallucinating from the model is by setting the temperature to 0 when we make the completion call. The OpenAI docs describe temperature as:

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

It’s quite funny to see how we do everything to try to get the model to speak the truth with this type of use case!
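Putting the prompt and the temperature together, the completion call can be sketched like this; the model name and max_tokens value are assumptions rather than exactly what we shipped.

// Build the prompt from the retrieved context and ask for a completion with
// temperature 0. The model and max_tokens here are illustrative choices.
async function generateGraphQL(context, query) {
  const prompt =
    `Answer the question as truthfully as possible using the provided context, ` +
    `and if you don't have the answer, say "I don't know".\n` +
    `Context:\n${context}\n\n` +
    `Question:\nWhat is a Shopify GraphQL query, formatted with tabs, for: ${query}\n\n` +
    `Answer:`;

  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "text-davinci-003",
      prompt,
      temperature: 0, // as deterministic as the model will give us
      max_tokens: 512,
    }),
  });
  const json = await response.json();
  return json.choices[0].text.trim();
}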

Feedback and Flywheels

Now it’s time for the flywheels to kick in. You want to keep feeding the context with high quality examples, sometimes show what NOT to do, play with different prompts, and start getting feedback.

You will see lots of examples where users are asked for feedback. E.g. in support systems and documentation: did this help? is it accurate? To train the model as best as possible, you can look for ways to get this information from the experts (humans!) and feed it on back, as well as simply tracking what your users are asking for and how well you are acting on those needs!

Creativity & Constraints

We have the foundations in place, and the quality of data will improve through the flywheels. Now it’s time to get more constrained. We are doing all we can to nudge for truth, but you can’t trust these things, so what guardrails should you put in place?

We really want the GraphQL that we show to be valid, so… how about we do some validation?

We take the GraphQL that comes back and do a couple of things (a sketch of the validation step follows the list):

  • Tweak it, when possible, to place valid IDs and content for the given dataset that we have in the mock.shop instance
  • Validate the GraphQL to make sure the syntax is correct
  • Run it against mock.shop, since we have real IDs, and show the results to the user!
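Here is a sketch of that validation step using the graphql npm package, assuming you still have the SDL string that was gathered for the embeddings.

// Guardrail sketch with graphql-js: parse the model's output and validate it
// against the schema built from the SDL we gathered earlier.
import { buildSchema, parse, validate } from "graphql";

function checkGeneratedQuery(sdl, generatedQuery) {
  const schema = buildSchema(sdl); // throws if the SDL itself is malformed

  let document;
  try {
    document = parse(generatedQuery); // throws on invalid GraphQL syntax
  } catch (syntaxError) {
    return { ok: false, errors: [syntaxError.message] };
  }

  const errors = validate(schema, document); // checks fields, arguments, types, etc.
  if (errors.length > 0) {
    return { ok: false, errors: errors.map((e) => e.message) };
  }
  return { ok: true, document };
}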

You can’t assume anything, so you will often need a guard step once you get results back.

ChatGPT vs. Stockfish

There was a lot of hubbub when someone pitted ChatGPT against Stockfish in a game of chess. Many used it as a way to laugh at ChatGPT. This thing is crazy! It made all kinds of invalid moves! No doy! You have to assume that and build systems to tame it… a chess engine wouldn’t allow invalid moves.

Defensive

You have to be incredibly defensive. You are poking a brain with electrodes. It comes out with amazing things, but you can’t trust everything that comes back. Remote calls to OpenAI are themselves flaky, and the service often goes down.

Not only will you be checking for timeouts and errors in results, but you should also consider a feature flag toggle. In the case of mock.shop, the tool is usable without any of the AI features. They are progressive enhancements to the product.

We can add checks to automatically turn it off if something really bad is happening with OpenAI. Marry both:

const openAIStatusRequest = fetch("https://status.openai.com/api/v2/status.json");

and check the results for the type of incident:

openAIStatus.status.indicator === "major"
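Married together, that check might be sketched like this; the fail-closed behavior and the function name are just illustrative choices to show how the pieces fit.

// Flip the AI features off when OpenAI's status page reports a major incident,
// or when the status page itself cannot be reached.
async function shouldEnableAIFeatures() {
  try {
    const openAIStatusRequest = await fetch("https://status.openai.com/api/v2/status.json");
    const openAIStatus = await openAIStatusRequest.json();
    return openAIStatus.status.indicator !== "major";
  } catch (error) {
    // If we cannot even reach the status page, fall back to the non-AI experience.
    return false;
  }
}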

It’s incredibly fun getting creative with how you can use the power of LLMs, which are getting better and faster all the time. The black box nature can be frustrating at times, but it’s worth it.

I hope you are having some fun tinkering!


There are so many helpful libraries out there. I have been working with some friends on Polymath (https://polymath.almaer.com/) to make it simple to import and create the libraries, as well as query it all.
