titan: a package manager for snowflake
If you regularly work with cloud data warehouses, this line may sound familiar:
In the days before package managers, it was commonplace for JavaScript engineers to rely on a small number of dependencies stored directly in their projects or served by a CDN. The first major JavaScript package manager, npm, was built shortly after Node.js was introduced, and it quickly became one of the most popular package managers in the world. Thousands of new open source projects were created and engineers shared more code than ever before.
Source: Yarn: A new package manager for JavaScript
In many ways, modern data engineering reminds me early frontend development. And in that spirit, I thought it was about time to do the one thing programmers are never supposed to do: create a package manager.
I wrote titan: a proof-of-concept package manager for Snowflake. With titan, you can install a package of useful functions, procedures, views, and tables that perform data cleaning, complex analysis, cost management, or perform novel materializations. Today, titan supports a bare-bones api of init
, install
, and uninstall
.
What makes titan different than a SQL orchestrator like dbt or schema management tools like schemachange or terraform:
- titan is declarative, making state migrations simpler
- titan is pure SQL, unlike dbt, which relies heavily on jinja macros
- titan supports dependencies between packages
Most importantly, titan exposes its api directly in Snowflake itself. Once you've performed initial set up, you can manage packages directly from SQL, without needing the command line.
I wrote titan for the HYTRADBOI jam week. HYTRADBOI β have you tried rubbing a database on it? β is an independent conference series put on Jaime Brandon with the theme "Turning a data-centric lens onto familiar problems to yield strange new solutions (and maybe exciting new problems)." I chose to solve a problem that's bothered me for years: the lack of open source code sharing in the data SQL ecosystem.
You can find titan on GitHub.
In the rest of this post, I want to discuss the experimental features I built that set titan apart.
State management
Up front, I knew titan would need to understand the current state of the warehouse before it made changes. At a minimum, it would need to know what packages were already installed. How would titan store state?
The obvious choice was tables, they're meant for state. But I wasn't happy with the prospect of mapping complex environment state to tabular format. So instead, I built something new. titan stores state entirely within user-defined functions (UDFs).
State is initialized with a single query:
CREATE FUNCTION titan_state()
RETURNS OBJECT
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT {}
$$
To update state, titan uses a stored procedure that overwrites the titan_state
function with a new one.
CREATE PROCEDURE titan_upstate(key VARCHAR, value VARIANT)
RETURNS OBJECT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'upstate'
PACKAGES = ('snowflake-snowpark-python')
EXECUTE AS OWNER
AS
import json
def upstate(session, key, value):
new_state = json.loads(
session.sql(f"SELECT titan_state()").collect()[0][0]
)
new_state[key] = value
new_state = json.dumps(new_state)
session.sql(f"""
CREATE OR REPLACE FUNCTION TITAN_STATE()
RETURNS OBJECT
LANGUAGE SQL
IMMUTABLE
AS
'SELECT TO_OBJECT(PARSE_JSON(''{new_state}''))'
""").collect()
return {"success": True, "message": f"State updated"}
titan implements a collection of stored procedures for managing state entirely within a UDF. This allows simpler composition when using state, like this query that asserts if a set of packages is installed, and fails if not.
SELECT
titan_assert(
['arraytools', 'literals'] =
array_intersection(
titan_state():packages,
['arraytools', 'literals']
)
)
Package Versioning & Linking
In titan, you can install multiple versions of the same package in a single schema. If I install arraytools
, each function is created with the package-version prefix arraytools__0_2_0__
. But application code shouldn't have to use these prefixes. To solve that, titan implements linking. Taking advantage of Snowflake's function inlining optimization for SQL UDFs, this easy-to-use wrapper function helps keep application code clean while also supporting safe version upgrades.
CREATE FUNCTION array_from(...)
LANGUAGE SQL
AS $$
SELECT arraytools__0_2_0__array_from(...)
$$
Dependency Pinning
Package managers commonly need solve dependency issues between packages. When two packages rely on disjoint versions of the same shared dependency, the package manager has to decide how to resolve this conflict. In titan, I've chosen to pin dependencies for a package at install time.
Packages can call other packages with the TITAN_IMPORT
stored procedure. At install time, package references are dynamically rewritten to . To achieve this, I'm using the incredible sqlglot library. sqlglot allows me to parse package code and walk the abstract syntax tree of each CREATE FUNCTION statement, replacing any call to a dependency with the versioned identifier.
This parsing approach also allows titan to ensure that a package only touches its own code, or that of dependencies, and nothing else. This additional layer of security is not supported by any existing SQL management system.
-- literals/from_csv.sql
CALL TITAN_IMPORT(['arraytools']);
^^^^^^^^^^
CREATE FUNCTION from_csv(data VARCHAR)
AS $$
WITH count_cols AS (...)
SELECT _row
FROM count_cols,
LATERAL from_csv(arraytools.array_from('any', col_count), DATA)
^^^^^^^^^^
$$
;
3 APIs: Python, Command Line, and SQL
I want it to be possible to manage titan completely from within Snowflake. Today, it's only 70% of the way there.
During initialization, titan installs a set of management commands, implemented as UDFs and stored procedures. These management commands comprise most of titan's api.
I use sqlglot to parse the function signature for each SQL-based management command. I then use python-forge to dynamically generate python methods for each, exposing the same api between the titan python package and the initialized titan environment in Snowflake.
This is what it looks like:
# Uninstall from python
from titan import connect
from titan.environment import Environment
env = Environment.get(db, schema)
with connect.env_cursor(env) as cur:
cur.titan_uninstall('arraytools')
# Uninstall from cli
python -m titan uninstall arraytools -d db -s schema
-- Uninstall from sql
USE DATABASE db;
USE SCHEMA schema;
CALL TITAN_UNINSTALL('arraytools')
That's titan, a proof-of-concept package manager for Snowflake, built for HYTRADBOI: The Jam. I have a few example packages built so far, with more to come later in the year.
I think there's an important, unexplored space between traditional data engineering environments and the emerging SQL-first environments built on top of cloud data warehouses. titan is a venture into that unknown.
Acknowledgements
- Sam Boyer, for the indispensable article So you want to write a package manager.
- Jaime Brandon, for putting on HYTRADBOI.
If you enjoyed this post, please subscribe or follow me on Twitter (@teej_m).