
Snowflake's `Snowpark` product, which they recently announced, aims to bring Spark-like APIs to Snowflake.
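
For anyone who hasn't used Spark: here's a minimal sketch of the DataFrame-style code this implies, written in PySpark since that's the API Snowpark mirrors (Snowpark's own syntax differs, and the table and column names are made up):

```python
# Illustrative PySpark transformation of the sort Snowpark promises to run
# natively inside Snowflake. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("snowpark_style_example").getOrCreate()

orders = spark.table("raw.orders")  # hypothetical source table
daily_revenue = (
    orders
    .where(F.col("status") == "complete")
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)
daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```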

Having a DS background, I love what the SQL-orchestration tool dbt (and its peers) has enabled: data consumers like me can rapidly create our own safe data pipelines. There's easily a 10x productivity improvement for most of my transformation pipelines vs. writing them in Python or PySpark.

But batch ML and SQL don't mix that well (even BigQuery ML is too limiting). I end up butchering dbt's value (simplicity and iteration speed): splitting the DAG into pieces and orchestrating them with Airflow so that I can wedge in the non-dbt parts (feature engineering, inference, logging, detecting stale models, ...). This isn't what the future looks like.
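
To make the pain concrete, here's a minimal sketch of the kind of Airflow DAG I end up with, assuming dbt is invoked via BashOperator and the Python steps are stubs (model selectors, task names, and the schedule are all invented):

```python
# Sketch: a dbt project broken apart so non-dbt steps can be wedged in.
# Everything here (selectors, callables, schedule) is hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def build_features():
    ...  # pandas/PySpark feature engineering that doesn't fit in SQL


def run_inference():
    ...  # load the model, score the feature table, write predictions back


def check_model_staleness():
    ...  # compare training date / drift metrics, alert if the model is stale


with DAG(
    "ml_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_staging = BashOperator(
        task_id="dbt_staging", bash_command="dbt run --select staging"
    )
    features = PythonOperator(
        task_id="feature_engineering", python_callable=build_features
    )
    staleness = PythonOperator(
        task_id="staleness_check", python_callable=check_model_staleness
    )
    inference = PythonOperator(task_id="inference", python_callable=run_inference)
    dbt_marts = BashOperator(
        task_id="dbt_marts", bash_command="dbt run --select marts"
    )

    dbt_staging >> features >> inference >> dbt_marts
    staleness >> inference
```

Every one of those seams is a place where dbt's single, self-documenting DAG gets lost.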

I've tried switching to Databricks, but I don't see it as the path forward for unifying the warehouse + batch ML.

Hopefully Snowpark is a step forward :)

-------------------

Separately, https://materialize.com/ is something I'm paying attention to! Being able to implement all of my SQL-based pipelines as materialized views would be immensely valuable. They recently raised capital and they could become huge.
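
Roughly what I mean: Materialize speaks the Postgres wire protocol, so here's a sketch using psycopg2 (the connection details, source table, and columns are all invented):

```python
# Sketch: defining one SQL transformation as a Materialize materialized view
# instead of a batch dbt model. All names and connection details are hypothetical.
import psycopg2

conn = psycopg2.connect(
    "host=localhost port=6875 dbname=materialize user=materialize"
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW daily_revenue AS
        SELECT date_trunc('day', created_at) AS order_date,
               sum(amount) AS revenue
        FROM orders
        WHERE status = 'complete'
        GROUP BY 1;
    """)

    # The view is maintained incrementally; reading it is an ordinary SELECT.
    cur.execute("SELECT * FROM daily_revenue ORDER BY order_date DESC LIMIT 7;")
    for row in cur.fetchall():
        print(row)
```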



Love that you brought up Snowflake. I've been wanting to get my hands on it to play around with and to learn more about integrating it with Salesforce.


And not something like Spark on EMR?


Well no, unfortunately.

Remember that "data is a team sport". Together, we try and make better decisions (in manual or automated ways). A DE can produce great data but it's only useful if it helps the DA/DS. There's a lot of friction there.

Most of that friction disappears with SQL-based orchestration tools (I mean specifically dbt here, but there are others). Suddenly the analyst can create the data they need! With minimal guidance from a DE.

That can be with Spark SQL (+ DeltaLake / Iceberg), or some warehouse. That's not the issue.
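
For instance, a sketch of the same kind of SQL model run through Spark SQL on Delta Lake (this assumes a Spark session already configured with the Delta extensions; the table names are invented):

```python
# Sketch: a dbt-style SQL model executed via Spark SQL over Delta Lake.
# Assumes Delta Lake is configured on the session; tables are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_models_on_delta").getOrCreate()

spark.sql("""
    CREATE OR REPLACE TABLE analytics.daily_active_users
    USING delta
    AS
    SELECT date_trunc('day', event_time) AS event_date,
           count(DISTINCT user_id)       AS active_users
    FROM raw.events
    GROUP BY 1
""")
```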

The issue is around keeping orchestration simple when you're not just doing simple stuff anymore. Keeping that DAG logical, clear, and smooth is difficult once you include non-SQL items.

This isn't solved by Spark UDFs unfortunately :)
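
For clarity, by UDFs I mean in-query functions like the sketch below: handy for calling Python on rows inside a SQL statement, but they do nothing for the orchestration around the statement (the names here are illustrative).

```python
# Sketch of a pandas UDF: useful for row-level Python inside a query, but it
# doesn't help with the surrounding DAG of non-SQL steps. Names are made up.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()


@pandas_udf("double")
def log_amount(amount: pd.Series) -> pd.Series:
    return np.log1p(amount)


spark.udf.register("log_amount", log_amount)
spark.sql("SELECT order_id, log_amount(amount) AS log_amount FROM raw.orders").show()
```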



