TileDB 2.0 and the Future of Data Science (medium.com/tiledb)
81 points by dalke on May 7, 2020 | 33 comments


I don't know if the creators of TileDB visit HN, but if they do - it would be nice to see a code example that showcases what makes TileDB unique. Right now, I don't see any benefit versus serializing a data frame in hdf5 and storing it in Redis or cloud blob storage.

(Reading https://tiledb.com/developer, the GitHub, or the blog - none of it shows me anything amazing)

Edit: browsing around the site, I finally found out what TileDB offers, and I'm still confused (too much fluff language). To save people time:

https://docs.tiledb.com/main/handling-dataframes

Multi-column slicing: By defining any subset of the columns as dimensions and due to TileDB's tiling flexibility, you can increase the pruning effectiveness of multi-column slicing, thus leading to better overall read performance. Essentially, a TileDB array acts as a primary multi-dimensional index on the columns selected as the array dimensions.

Data updates and versioning: TileDB offers rapid, parallel, cloud-optimized updates. All the update logic is pushed into the storage engine and is completely transparent to the user. TileDB also exposes useful time traveling functionality, such as reading arrays at time snapshots, effectively implementing data versioning built into a single embeddable library.

Partitioning: TileDB enables balanced partitioning, without limiting each partition to single column values. Moreover, we will soon expose API functions for dynamically selecting different partitioning schemes (e.g., on different subsets of columns with different orders), without the need for reorganizing/rewriting the array.

Sorting: All sorting is taken care of by TileDB internally with multi-threading.


Stavros from TileDB here. TL;DR, it's all about fast slicing on multiple columns while supporting updates, locally or in the cloud.

Suppose you serialize your dataframe in HDF5 or Redis. Let your dataframe have schema (Date, Stock, Price). Assume this dataframe is 1 TB in size and stored on S3, GCS or Azure (as they are cheap). How would you be able to efficiently perform an average query on Price for a specific Date range and Stock symbol? With HDF5 you would have to download 1 TB (no notion of "fast slicing on variable predicates") and apply the predicates locally. If you stored the dataframe in Parquet (a better choice for this use case), then you would be able to build some logic in your code that uses the Parquet metadata/indexes and prunes a lot of unnecessary information (as Spark does). However, Parquet is "one-dimensional", i.e., your pruning would be efficient on Date, but not on Stock (you'd have to "partition" your Parquet files with Spark or Hive and things could get quite complicated). Most importantly, you wouldn't be able to update the Parquet files; you would have to generate new files and build a catalog on top (or use services like Delta Lake) to manage your Parquet files. And this is an extremely cumbersome task.

TileDB abstracts everything for you, while allowing you to slice fast on any number of columns. You just define Date and Stock as "dimensions", and slicing on both those columns becomes uber efficient locally or in the cloud. Effectively, you turn this dataframe into a sparse 2D array. Updates and time traveling are handled by TileDB. You get to use Spark, Dask, MariaDB and PrestoDB as you did before, but there is no need for Hive, Delta Lake or any other cataloging service. Thank you for pointing out the confusion though. We just launched and we have tons of examples coming up.
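To make the "primary multi-dimensional index" idea concrete, here is a pure-Python toy (not TileDB's API or internals, just an illustration): if rows are kept sorted by a composite (Date, Stock) key, a query over both columns only has to scan a contiguous key range instead of the whole dataset.

```python
from bisect import bisect_left, bisect_right

# Toy dataset sorted by the composite (date, stock) key; the prices
# here are made-up illustrative values.
rows = sorted([
    ("2020-05-01", "AAPL", 293.8),
    ("2020-05-01", "MSFT", 174.6),
    ("2020-05-04", "AAPL", 293.2),
    ("2020-05-04", "GOOG", 1326.8),
    ("2020-05-05", "AAPL", 297.6),
])
keys = [(d, s) for d, s, _ in rows]

def slice_avg_price(date_lo, date_hi, stock):
    # Prune to the contiguous date range first (binary search),
    # then filter on stock only within that range.
    lo = bisect_left(keys, (date_lo, ""))
    hi = bisect_right(keys, (date_hi, "\xff"))
    prices = [p for d, s, p in rows[lo:hi] if s == stock]
    return sum(prices) / len(prices)

print(slice_avg_price("2020-05-01", "2020-05-04", "AAPL"))  # 293.5
```

TileDB generalizes this to any number of dimensions, with tiling, compression and cloud storage underneath; the sketch only shows why sorting on the query columns makes range pruning cheap.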


Stavros, thanks for the explanation. How does TileDB avoid downloading the entire matrix and do the slicing locally? Is this achieved by breaking a big matrix down into a set of smaller ones, so that you only download the subset that the current query needs? If so, what measures are in place to avoid mismatches in the metadata that links them together (e.g., some error while uploading them to S3)? Thanks


Efficient slicing happens because of "tiling", hence the name TileDB. A tile is similar to an HDF5 or Zarr "chunk", or more loosely to a Parquet page. Although totally configurable, tiling is handled solely by TileDB, the user doesn't need to know about it. A tile is the atomic unit of IO and compression. TileDB maintains all the necessary metadata and indexing built into its format and, given a query, it knows how to fetch only the tiles that might include results. The tiles are decompressed in your memory and filtered further for the actual results. The dense array case is rather straightforward. The sparse case is a big differentiator in TileDB and it is quite challenging, especially in the presence of updates. TileDB handles the sparse case via bulk-loaded R-trees for multi-dimensional indexing, and via an LSM-tree-like approach with immutable objects that allows time traveling.
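The pruning arithmetic for the dense case can be sketched in a couple of lines (a toy one-dimensional illustration with an assumed tile extent, not TileDB's actual R-tree machinery): the reader maps a slice to the tile indices it intersects and fetches only those.

```python
# Toy sketch of tile pruning: with a fixed tile extent along one
# dimension, a slice [lo, hi] maps to a small set of tile indices.
TILE_EXTENT = 100  # cells per tile (an assumed value for illustration)

def tiles_for_slice(lo, hi, extent=TILE_EXTENT):
    """Return the indices of the tiles a [lo, hi] slice intersects."""
    return list(range(lo // extent, hi // extent + 1))

# A 21-cell slice touches only 2 tiles, so only those two tiles are
# fetched from storage and decompressed; everything else is skipped.
print(tiles_for_slice(190, 210))  # [1, 2]
```

The sparse case replaces this direct arithmetic with an index lookup (the bulk-loaded R-tree mentioned above), but the effect is the same: I/O proportional to the query, not to the array.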

Concerning your point on potential errors occurring on S3, this is addressed by TileDB's immutable object approach. If an error occurs upon some write, there will be no array corruption. Happy to discuss this topic in a separate thread.

Some related docs:

https://docs.tiledb.com/main/performance-tips/choosing-tilin...

https://docs.tiledb.com/main/basic-concepts/tile-filters#til...

https://docs.tiledb.com/main/basic-concepts/definitions/frag...


I heard about TileDB because one of the authors commented at https://news.ycombinator.com/item?id=22423632 . That discussion includes what you mentioned, plus a mention that it integrates with MariaDB, PrestoDB and Spark, including the ability to use an embedded MariaDB.


Thanks for the post!


Here is a quick demo of dataframe i/o: reading to/from a pandas dataframe, and then creating and slicing a sparse array with heterogeneous dimensions (int and string):

https://gist.github.com/ihnorton/1544bb169b4ec8cd385e9c3e756...


This demo doesn't run properly. E.g., in the 2nd code block I got: "AttributeError: module 'tiledb' has no attribute 'from_csv'". This was with the latest TileDB Docker image.


There seems to be a problem running the example under Python 3.6, which is used in the Docker container. Investigating. Please ping us (tiledb.com for contacts) and I will let you know when the Docker image is updated.


Fixed in the latest image update (tiledb/tiledb:2.0.1). Thanks for the report!


So, basically, TileDB is a file format, like Parquet, Avro and ORC are formats?

And it comes with a nice library that makes accessing the data fairly storage-transparent?


Disclosure: I am a member of the TileDB team.

Yes, you are correct: TileDB introduces a storage format, but it is also a full-fledged storage engine that implements it. Having the complete storage engine is what allows us to make accessing the data transparent regardless of the API or application you use to access the data. We are able to integrate with different computational tools without having to reimplement or sacrifice features.


How do you compare to Iceberg?


Disclosure: I am a member of the TileDB team.

Iceberg, similar to Delta Lake, is another layer you have to add on top of Parquet, which limits your mobility of compute. You can currently use it either with Spark or Presto, but not directly via language APIs or other compute engines (e.g., pandas, Dask, MariaDB, etc). For example, pandas' "read_parquet" will not be able to read your dataframe; you will have to use PySpark or a Presto SQL query to fetch the data into pandas. TileDB has updates and data versioning built into its format and storage engine, therefore reading an array directly via a TileDB language API, or Spark, Dask, PrestoDB or MariaDB, will have the same behavior. You will be able to see the dataframe after all the changes, or time travel and "activate" only some of the changes in time.


This isn't strictly true.

You can't arbitrarily reach into the data from the client with Iceberg or Delta Lake, true, but that's intentional. The service on top provides a lot of the functionality TileDB provides. There's nothing preventing you from writing a pandas integration, for example.

Notably, the most significant difference I can see is that, as far as I can tell, TileDB will not solve for failures due to S3 consistency, which are solved by Iceberg and Delta Lake. It's strictly necessary to have a central, ACID-ish place to record transactions, as probing S3 is not reliable.

At the very least, you are definitely susceptible to stale reads, which means users will see all sorts of bizarre failures at scale. I'd be pretty terrified of putting anything serious on TileDB.


Disclosure: I am a member of the TileDB team.

Several important concepts are being conflated here, so I'll elaborate on each separately.

S3 eventual consistency: TileDB is fully aware of and designed around the eventual consistency guarantees of cloud object stores [1]. When an array is opened, only the committed writes (up to the specified timestamp if using time traveling) are seen by the reader (each write produces a timestamped "fragment", which is essentially a folder). There will never be partial reads, or corrupt reads. The array is always in a readable state with committed data. This works very similarly to Iceberg's method of opening a table at a snapshot for reads [2]. I don't believe that TileDB is any more affected by stale reads than Iceberg (or Delta Lake); the user must reopen or re-query a table to see data from a newer timestamp/snapshot.

Handling S3 write failures: TileDB is designed for a lock-free, multi-writer scenario. All writes produce a new timestamped fragment, which is immutable after completion of the write. TileDB performs an atomic write of an "ok" object, which signals when the fragment is complete. Any fragment which is missing the "ok" file is ignored by the reader [3]. TileDB handles corrupt or incomplete fragments by erroring out on the read. In the future, we could offer a retry mechanism for failed reads, but the important thing is we will never return corrupt or invalid results. Incomplete fragments can happen because of S3's eventual consistency; it is possible for the atomic ok file to show up before an object in the fragment "folder".

Write serializability: TileDB's fragments are written in a (timestamp + uuid) fashion. In the event of a conflicting write at the same timestamp, the uuid provides uniqueness and guarantees that there are no errors by essentially randomly ordering the conflicting fragments. The effect in the end is similar to Iceberg's cancel-and-retry for conflicting writes, except TileDB does not have the penalty of retrying. For Iceberg, if there are two simultaneous conflicting writes, it is effectively random which one would be accepted and which one would be retried.
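The (timestamp + uuid) naming scheme can be sketched in a few lines of Python (the name format here is illustrative, not TileDB's actual on-disk layout): because the timestamp is zero-padded, a plain lexicographic sort orders fragments by time, and the uuid deterministically breaks ties between conflicting writes without any coordination or retry.

```python
import uuid

def fragment_name(timestamp_ms):
    # Zero-pad the timestamp so lexicographic order matches numeric
    # order; the random uuid makes concurrent same-timestamp writes
    # unique (illustrative format, not TileDB's real one).
    return f"{timestamp_ms:013d}-{uuid.uuid4().hex}"

a = fragment_name(1588852800000)
b = fragment_name(1588852800000)  # conflicting write, same timestamp
c = fragment_name(1588852900000)  # a later write

# A plain sort orders by timestamp first; a and b tie on timestamp and
# are ordered arbitrarily-but-deterministically by uuid.
order = sorted([c, a, b])
print(order[-1] == c)  # True: the later write always sorts last
```

Neither writer had to detect the conflict or retry; both fragments simply coexist, and readers see them in this deterministic order.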

ACID: TileDB intentionally does not support ACID, as it was not designed to be a transactional database. TileDB was designed with a lock-free multi-writer/multi-reader model, as our use cases up until now involved a one-off massive parallel write, and then multiple concurrent reads. Lack of transactions does not yield inconsistent data, though; as stated above, the read/write algorithms are specifically designed with this in mind. That said, we have recognized that there are workloads where transactions are important, and we have plans to eventually add transactions to our cloud product, where we have the orchestration layer needed to manage them. It is important to keep the transactional layer modular and format-agnostic (Delta Lake is blending it with Parquet) and, therefore, we will build it on top of TileDB, not inside the storage engine.

Interoperability: Obviously nothing stops someone from writing a pandas integration for Iceberg or Delta Lake. An important part of our philosophy at TileDB is extreme integration with existing tools and frameworks, which is why we strive to support things like returning numpy arrays for the Python results and doing zero-copying everywhere possible. In other words, it is not trivial to add a tool integration, just the same as it is not trivial to just add an ACID layer.

[1] https://docs.tiledb.com/main/basic-concepts/consistency

[2] https://iceberg.apache.org/reliability/

[3] https://docs.tiledb.com/main/basic-concepts/physical-storage


Thanks for the detailed response.

This is what I was getting at, mostly:

> Any fragment which is missing the "ok" file, is ignored by the reader [3].

As a client, how do I detect this? Say I just wrote some data and now I want to generate a new total. Could my data be silently missing?

> TileDB handles corrupt or incomplete fragments by erroring out on the read. In the future, we could offer a retry mechanism for failed reads, but the important thing is we will never return corrupt or invalid results.

But now I have to handle this everywhere I interact with TileDB. Most users don't expect essentially random errors. Retries are a stop-gap, as it can take an arbitrarily long time for consistency to converge. I've observed O(hours) regularly.

> Incomplete fragments can happen because of S3's eventual consistency; it is possible for the atomic ok file to show up before an object in the fragment "folder".

Indeed! And from what I understand, mostly from conversation with the Iceberg team last year, they solve this entirely by using a central ACID service backed by RDBMS (which, aside, I find very amusing!! They never listen to Stonebraker...).

Indeed, as far as I can tell, the only way to remove consistency issues is to:

- store an index to objects somewhere central

- never list the bucket (or speculatively stat, etc.)

Until either a) you support a central index or b) S3 has a better consistency model (not that DynamoDB garbage Hadoop uses) I would be very reluctant to use TileDB if it wasn't using my own disks.

I think for small stuff, it'll work great, but at scale you're going to hit some roadblocks. I recently encountered splitting a bucket into many buckets and prefixing UUIDs at the root of the bucket to reduce consistency problems caused by S3's rebalancing, and it ended up being a very expensive marginal improvement.


Folks, apologies, but I think we got a bit sidetracked here: TileDB does not suffer from the consistency issues mentioned above.

Here is how TileDB performs a new (potentially concurrent with other reads and writes) write:

- It creates a fragment folder (or "prefix" of a set of objects on S3 - there are no "folders" on S3) which is timestamped and carries a unique UUID. This fragment is self-contained and represents the entire write (e.g., all cells and all attribute values)

- It writes all data objects under the fragment prefix. Note that TileDB never updates, it always writes new immutable objects.

- After all the PUT requests succeed for the data objects, it creates an empty "ok" object.

Here is how TileDB performs a (potentially concurrent with other reads and writes) read:

- It lists the array prefix to get the ok objects

- There are two cases:

1. The ok object is not there for some fragment. That fragment is completely ignored.

2. The ok object is there. Since TileDB writes the ok object last, all the data objects it wrote have been committed and are all visible with GET requests. TileDB reads the data objects only with GET requests (not ListObject requests). Due to S3's read-after-write consistency model (https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction...), all those objects will be available for reading (now in all S3 regions) with GET and there will be no errors.

Therefore, TileDB follows the eventual consistency model of S3 without errors or surprises. The user doesn't need to handle anything. Our customers have been using TileDB in production for a long time, storing hundreds of TBs of data on S3, and no consistency issue has ever come up.
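The write and read steps above can be sketched in pure Python, using a dict as a stand-in object store (names like "ok" follow the description above; everything else is illustrative):

```python
# Toy in-memory "object store": keys are object names, values are bytes.
store = {}

def write_fragment(frag, objects):
    # PUT all data objects first...
    for name, data in objects.items():
        store[f"{frag}/{name}"] = data
    # ...and only then PUT the empty "ok" commit marker.
    store[f"{frag}/ok"] = b""

def readable_fragments(fragments):
    # A reader considers only fragments whose ok marker is visible; an
    # interrupted or still-propagating write is simply ignored, so the
    # array is always read in a committed state.
    return [f for f in fragments if f"{f}/ok" in store]

write_fragment("frag_001", {"data": b"..."})
store["frag_002/data"] = b"..."  # simulate a crashed write: no ok marker

print(readable_fragments(["frag_001", "frag_002"]))  # ['frag_001']
```

The key property is ordering, not atomicity of the whole fragment: since the marker is written last, its visibility implies the data objects were already accepted.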

Summarizing, what xyzzy_plugh is raising here is that TileDB does not have ACID guarantees. And that is true (we never claimed the contrary) and intentional. We are building a transactional layer outside of the storage engine. The reason is that this transactional layer indeed needs to be a constantly running distributed service, whereas we want the TileDB storage engine to be embeddable and used without performance regression even by applications that do not need ACID (that is, the majority of our data science applications).


How does TileDB deal with appends, upserts, etc. in S3? Are there success files, atomic folder renames and such? Or?



This seems similar to the direction Apache Arrow is going in, but I don't see the use case for this once Arrow dataframes are prevalent.


Disclosure: I am a member of the TileDB team.

TileDB is designed primarily for persistent storage, similar to HDF5, Zarr and Parquet. TileDB is a cloud-optimized, dense and sparse multi-dimensional array storage engine, which is broader than Arrow's Dataframe. In addition TileDB also handles updates, time traveling and partitioning at the library level, which are not possible with the Arrow project's current choice of persistent format, Parquet. TileDB removes the need for using extra services like Delta Lake for updates or Hive for cataloging and partitioning, as you get all this functionality in a single, embeddable library.


I feel like you guys need to create some additional engineering-level documentation to convey all this. The landing page and few links I navigated to were all too fluffy.


For things that start as academic work, often the original paper is quite helpful. I find it an enjoyable read if you have a background in databases (it actually was a reading for my grad systems class)

https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17...

The basic gist is: TileDB is substantially faster than alternatives for random writes/reads; this allows for more sophisticated parallel linear algebra algorithms, as well as evaluation of predicates.


One use case would be running it within Postgres, so you don't need to install and back up yet another (complex?) piece of software.


After reading it, it's not clear it is the future of data science.


The highlighted new features for version 2.0 are: 1) Support for heterogeneous dimensions in sparse arrays (that is, full dataframes), 2) Support for string dimensions in sparse arrays, 3) Google Cloud Storage and Azure Blob Storage support, and 4) Completely revamped R API.

They write: "These features mark an important step in our journey to deliver a universal data storage engine for all applications and data science tools."


I agree; the title and article are mostly hyperbole.


Stavros from TileDB here. Here is a more verbose explanation. Up until 2.0, TileDB was already powerful for the main applications we targeted: geospatial and genomics. The support for both dense and sparse arrays and the way it handles data versioning made it quite unique vs. HDF5 and Zarr. But we noticed that most of the data scientists we were working with had a lot of data beyond genomic variants, LiDAR points and rasters. They had tons of dataframes. And they were using at least two storage engines: TileDB for arrays, and Parquet or a relational database for dataframes. If you are in a large organization, this is a big pain.

In TileDB 2.0 we made a huge refactoring to support something seemingly simple: dimensions in sparse arrays that can have different types and that could even be strings. This allowed us to model any dataframe as a sparse array, effectively making TileDB act as a primary multi-dimensional index. In relational databases, this means that your data is sorted in an order on disk that favors your multi-column slicing enormously, so range search becomes rapid.

Therefore, what we are telling the community with this release is that you can have dense arrays, sparse arrays, and dataframes in a single embeddable library being integrated with pretty much every data science tool out there, so that data scientists never have to worry about backends, files, updates, or anything other than their scientific analysis. In other words, we believe the future of data science is more science.


Show us code usage examples please!


Full developer docs here: https://docs.tiledb.com/main/

Specific dataframe examples coming up shortly.


Looking at the TileDB docs, I like the idea that you can store and manipulate arrays efficiently. It works like columnar storage, but organized in tiles. I am not clear on how it could scale to multiple nodes and how to do queries. Maybe it should be integrated into Postgres.


Disclosure: I am a member of the TileDB team.

TileDB fully supports multi-writer/multi-reader, so scaling compute to multiple nodes is mostly about choosing the computation layer. We have integrations with Spark, Dask and PrestoDB for distributed compute. In our cloud product we also offer serverless UDFs and SQL (via MariaDB) to allow for scaling out computations elastically without managing your own clusters.

On the Postgres side, we are looking at eventually adding a Postgres storage engine, now that the new storage support is in PostgreSQL 11. To start with, we have built a storage engine for MariaDB, as MariaDB has an excellent storage engine API, which I have past experience with. We hope to upstream the storage engine to MariaDB after the MariaDB 10.5 release.

MariaDB integration: https://docs.tiledb.com/mariadb/

Spark integration: https://docs.tiledb.com/spark/

Dask integration: https://docs.tiledb.com/dask/

PrestoDB integration: https://docs.tiledb.com/prestodb/

Cloud Docs: https://docs.tiledb.com/cloud/



