Hacker News | turk-'s comments

Story for those who didn't see it: https://intuitiveexplanations.com/tech/replit/


That's weird. I would never do anything even remotely similar to what my (ex) employer does. CEO sounds like a douchebag tho.


I've seen terms/clauses here in AU for full time employment, depending on the industry/niche, where you can't jump to the same industry within X months.


That's what happens when you have a society worried about money and not interested in true human development.


It's not clear what the value-add of this is vs. DIY OSS Spark + Iceberg.

A free option is already available through OSS Spark or very low-cost EMR. What's your value-add over those?


Apache Nifi and Accumulo do come to mind, both out of NSA.


And SELinux, still from the NSA.


Buy the dip!


Yes. And there is a small possibility that something even worse, called antibody-dependent enhancement (ADE), happens to those previously infected/vaccinated. ADE occurs when the antibodies generated during an immune response recognize and bind to a pathogen, but they are unable to prevent infection. Instead, these antibodies act as a “Trojan horse,” allowing the pathogen to get into cells and exacerbate the immune response.

The worst case with a new covid strain would be if people who were vaccinated with a previous vaccine or infected by an older strain experience antibody-dependent enhancement after being infected with the new strain. This is where the body recognizes the new strain as the old one and starts producing antibodies. These antibodies actually assist the new strain in infecting your cells, making the disease worse.

ADE has not been detected with any covid strains/vaccines, so it's not something to worry about for now, but who knows what may happen in the future. I've been keeping an eye out for any news of ADE with any of these new strains.

Certain viruses, like dengue, can be much worse if you had previously caught a different strain, due to ADE.

https://www.chop.edu/centers-programs/vaccine-education-cent...


Since the current vaccines are based on recognizing the spike protein, and since this South African (Nu?) variant has many mutations on the spike protein, what are the odds that a vaccinated person's immune system would recognize the mutated spike protein as being the same as the original one?


No idea. Time will tell but a high number of mutations is not ideal.


With a data warehouse, you can only interface with your data in SQL. With BigQuery and Snowflake, your data is locked away in a proprietary format not accessible to other compute platforms. You need to export/copy your data to a different system to train an ML model in Python or R.

With the lakehouse, you can use Python, R, and Scala (not just SQL) to interface with your data. You can use multiple compute engines (Spark, Databricks, Presto), so you are not locked into one compute engine.

I recall being a junior programmer and wishing I could talk to my MySQL database in Python code to do some processing that was difficult to express in SQL; that day is finally here.
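To make that concrete, here's a minimal sketch of what it looks like with PySpark and Delta Lake (the bucket path and column names are made up, and this assumes the Delta package is on the cluster):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lakehouse-example").getOrCreate()

    # Read the same table the SQL/BI folks query, straight from cloud storage.
    events = spark.read.format("delta").load("s3://my-bucket/tables/events")

    # Processing that's awkward to express in SQL, done in plain Python.
    daily = (events
             .withColumn("day", F.to_date("event_ts"))
             .groupBy("day", "user_id")
             .agg(F.count("*").alias("n_events")))

    # Hand a sample to pandas/scikit-learn without an export step.
    pdf = daily.limit(100_000).toPandas()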


BigQuery does support ML. But the pricing is kind of a racket ($250/TB), so I'll stick to modeling in R/Python. Which I guess reinforces your point. I wonder who pays for this.

https://cloud.google.com/bigquery-ml/docs/introduction


My experience is that's how it looks at first. But it is hard to actually make use of lake or lakehouse openness.

You can access data in Snowflake or BigQuery using JDBC or Python clients. You do pay for the compute that reads the data for you. You cannot access the data in storage directly.

You can access data in lakehouse directly, by going to cloud storage. That has two major challenges:

Lakehouse formats aren't easy to deal with. You need a smart engine (like Spark) to do that. But those engines are pretty heavy. Starting a Spark cluster to update 100 records in a table is wasteful (see the sketch after these two points).

The bigger challenge is security. Cloud storage can't give you granular access control. It only sees files, not tables and columns. So if you need column- or row-based security or data masking, you're out of luck. Cloud storage also makes it hard to assign even the non-granular access. Not sure about other clouds, but AWS IAM roles are hard to manage and don't scale to a large number of users/groups.
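On the first point, here's roughly what even a tiny update against a Delta table on S3 involves (a sketch; the path, predicate, and configs are illustrative):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # You still have to bring up the whole engine first.
    spark = (SparkSession.builder
             .appName("small-update")
             .config("spark.sql.extensions",
                     "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    orders = DeltaTable.forPath(spark, "s3://my-bucket/tables/orders")

    # The actual change: maybe a hundred rows.
    orders.update(
        condition="status = 'pending' AND order_date < '2021-01-01'",
        set={"status": "'expired'"})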

You can sidestep this by using a long-running engine (like Trino) and applying security there. Then you don't need to start Spark to change or query a few records. But it means you're basically implementing your own cloud warehouse.

Which honestly can be the way if that's what you want! You can also use multiple engines if you are ok with implementing security multiple times. To me, that doesn't seem to be worth it.

In the end, I don't see data that's one SELECT away as much more proprietary and "outsourced" than data that is one Spark/Trino cluster and then a SELECT away, just because you can read the S3 it sits on.


Have you ever tried to train models on large data sets over JDBC/ODBC? It's terrible even with parallelism. Having direct access to the underlying storage and being able to bypass sucking a lot of data through a small straw is a game changer. That is one advantage that Spark and Databricks have over Snowflake.


Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

Sadly, those things are mutually exclusive at the moment and with the way things are deployed here (large multi-tenant platforms), the security has to take priority.

But if that's not your situation, then obviously it makes sense to make use of that!


> Have you tried to implement row- and column-based security on direct access to cloud storage? It flat out does not work.

It is a solved problem. Essentially you need a central place (with decentralized ownership, for the data mesh fans) to specify the ACLs (row-based, column-based, attribute-based, etc.) and an enforcement layer that understands those ACLs. There are many solutions, including the ones from Databricks. Data discovery, lineage, data quality, etc., go hand in glove.

Security is front and centre for almost all organizations now.


This is exactly what FAANGs do with their data platforms. There are literally hundreds of groups within these companies with very strict data isolation requirements between them. Pretty sure something like that is either already possible or will be very soon, there's just too much prior art here.


That's where Databricks comes in, though: you can implement row/column-based security on your data on cloud object storage and use it for all your downstream use cases (not just BI/SQL but AI/ML, without piping data over JDBC/ODBC).


According to their documentation [1], Databricks does not have this capability even for their own engines, and definitely not for "without piping data".

This is what I've personally seen a few times: Databricks claiming they can do something and then it turns out they can't. Buyer beware of lying salespeople and HN shills.

[1]: https://docs.databricks.com/administration-guide/access-cont...


Check out https://databricks.com/product/unity-catalog when you get a chance. There are other solutions in this space as well.


I don't understand what capability you are saying Databricks lacks. This capability is literally the entire premise of the data lakehouse. With Snowflake you need to export data out or pipe it over JDBC/ODBC to an external tool. With Databricks you can use SQL for data warehousing, and when you need to, you can work with that same data using Python to train an ML model without piping data out over JDBC (using the Spark engine). One security model, one dataset, multiple use cases (AI/ML/BI/SQL) on one platform.
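Roughly what that looks like in a notebook (a sketch assuming Databricks, where `spark` is predefined; the table and column names are made up):

    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # The BI/SQL dashboards and this notebook hit the same governed table.
    df = spark.table("sales.customer_features")

    train = df.select(
        "tenure_months", "monthly_spend",
        F.col("churned").cast("double").alias("label"))

    # Train directly on the engine -- no JDBC/ODBC export hop.
    assembler = VectorAssembler(
        inputCols=["tenure_months", "monthly_spend"], outputCol="features")
    model = LogisticRegression(maxIter=20).fit(assembler.transform(train))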


They're still lacking things in the SQL space. For example, Databricks says they're ACID compliant, but it's only on a single-table basis. Snowflake offers multi-table ACID consistency, which is something you would expect by default in the data warehousing world. If I'm loading, say, 10 tables in parallel, I want to be able to roll back or commit the complete set of transactions in order to maintain data consistency. I'm sure you could work around this limitation, but it would feel like a hack, especially if you're coming from a traditional DWH world (Teradata, Netezza, etc.).

Snowflake now offers Scala, Java and Python support, so it would seem their capabilities are converging even more, but both with their own strengths due to their respective histories.
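For anyone unfamiliar, the multi-table point roughly means you can do something like this from Python and have all the loads land (or fail) together (a sketch with the Snowflake connector; the connection details and tables are made up):

    import snowflake.connector

    conn = snowflake.connector.connect(
        user="me", password="...", account="my_account",
        warehouse="LOAD_WH", database="DW", schema="STAGING")
    conn.autocommit(False)

    cur = conn.cursor()
    try:
        cur.execute("INSERT INTO orders SELECT * FROM orders_stage")
        cur.execute("INSERT INTO order_items SELECT * FROM order_items_stage")
        conn.commit()      # both tables become visible together
    except Exception:
        conn.rollback()    # or neither does
        raise
    finally:
        cur.close()
        conn.close()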


Actually, you would expect that in an OLTP world. DWs, for the longest time, even Oracle, have recommended disabling transactions to get better performance. The logic is implemented in the ETL layer. Very rarely do you need multi-table transactions in a large-scale DW.

Snowpark is still inferior.


I have not, but I do not see why it would be much slower than direct access to the storage. Databases are quite good at streaming rows.


> I do not see why it would be much slower than direct access to the storage.

Implementations of protocols like ODBC/JDBC generally use their own custom on-wire binary protocols that must be marshalled to/from the library, and performance varies a lot from one implementation to another. We are seeing a lot of improvements in this space, though, especially with the adoption of Arrow.

There is also the question of computing for ML. Data scientists today use several tools/frameworks ranging from scikit-learn/XGBoost to PyTorch/Keras/TensorFlow - to name a few. Enabling data scientists to use these frameworks against near-realtime data without worrying about provisioning infrastructure or managing dependencies or adding an additional export-to-cloud-storage hop is a game changer IMO.
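As a small illustration of the Arrow point, the Snowflake Python connector fetches result sets as Arrow batches, which skips most of the row-by-row marshalling (a sketch; the connection details and query are made up):

    import snowflake.connector

    conn = snowflake.connector.connect(
        user="me", password="...", account="my_account",
        warehouse="ML_WH", database="DW", schema="PUBLIC")

    cur = conn.cursor()
    cur.execute("SELECT * FROM features WHERE ds = '2021-11-01'")

    # Arrow-backed fetch straight into pandas for the ML frameworks above.
    df = cur.fetch_pandas_all()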


> There is also the question of computing for ML.

A few reasons why the Databricks platform shines here:

1) Not limited to just UDFs: extensions to improve performance, including GPU acceleration in XGBoost and distributed deep learning using HorovodRunner.

2) End-to-end MLOps solution, including Feature Store, Model Registry, and Model Serving.

3) Open approach with https://www.mlflow.org/ (see the sketch below).

4) Glass-box (not black-box) model for AutoML.
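On point 3, a minimal MLflow tracking sketch (the experiment name, params, and model here are just illustrative):

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    mlflow.set_experiment("demo-experiment")
    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=100).fit(X, y)
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")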


Here is the thing with the lakehouse, though: you have flexibility and don't need to use multiple engines to achieve the lakehouse vision. Databricks has all the security features a Redshift/Snowflake does, so you can secure databases and tables rather than S3 buckets. It does get more complex if you want to introduce multiple engines, but at least you have the option to make that trade-off if you want to.

If you want simplicity, you can limit your engine to Databricks. You can also use JDBC/ODBC with Databricks if you want to use other tools that don't support the Delta format/Parquet, but piping data over JDBC/ODBC doesn't scale to large datasets with any tool. Databricks has all the capabilities of BigQuery/Snowflake/Redshift, but none of those tools support Python/R/Scala. Their engines would need to be rewritten from the ground up in order to do so.


But you do still have to secure the S3 buckets, right? And I guess also secure the infrastructure you have to deploy in order to run Databricks. Plus then configure for cross-AZ failover etc. So you get flexibility, but I would think at the cost of much more human labor to get it up and running.

Snowflake uses the Arrow data format with their drivers, so is plenty fast enough when retrieving data in general. But it would be way less efficient if a data scientist just does a SELECT * to bring everything back from a table to load into a notebook.

Snowflake has had Scala support since earlier in the year, along with Java UDFs, and also just announced Python support - not a Python connector, but executing Python code directly on the Snowflake platform. Not GA yet though.


You can use Scala, Java and Python with Snowflake now, as well as process structured, semi-structured and unstructured data. So I guess that means it doesn't fit into the data warehouse category, but is not a lakehouse either.


Interesting. I've always wondered why tomatoes in the US taste like shit. Why is there a trade-off between high yield and taste? Is there not a high-yield, great-taste variety?


I think it's largely about shelf life and bruising?

There is this kind of tomato I buy if there aren't any others. It always looks perfect on the outside... and seems to last forever sitting on the counter at home.

But if it's old enough, the seeds will have sprouted when you cut into it.

Pretty tasteless regardless.


This comment doesn't make any sense. I don't see how Cloudflare publishing the source code to their own hosted S3 service would help prevent lock-in when an open source alternative to S3 is already out there with HDFS. While S3 is a proprietary system, any programs you write against S3 can be migrated to other object stores (Azure ADLS, Google's object store) with relative ease.

The thing that keeps people locked into S3 is egress/bandwidth cost. Until Cloudflare came along, no hosted object store (Google, Azure, or even self-hosted HDFS on-prem or in the cloud) had economical bandwidth/egress costs.
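For what it's worth, the portability usually looks like this in code: for S3-API-compatible stores it's mostly a matter of pointing the client at a different endpoint, and for the rest it's an SDK swap over the same bucket/object concepts. A sketch with boto3 (the endpoint, bucket, and credentials are made up):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # any S3-compatible endpoint
        aws_access_key_id="...",
        aws_secret_access_key="...")

    s3.upload_file("model.pkl", "my-bucket", "models/model.pkl")
    obj = s3.get_object(Bucket="my-bucket", Key="models/model.pkl")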


No, it wouldn't; no executive at any company would risk federal time and money-laundering charges if it were made illegal.


That's what throw-away shell companies are for...


Ah yeah HSBC Bank would never...


That's what Michael Cohen is for.


What you're describing is entrepreneurship. I'd imagine pretty much every entrepreneur realizes they have an extremely low probability of success, but some people prefer the thrill of the hunt to the feeling of a cog-in-the-wheel job at a large corporation.


This is not the case. It's hard to overstate the magnitude of the increase in risk you take by pursuing a VC-backable business strategy vs. a bootstrappable one. There are many, many totally good business ideas that will "only" make you tens of millions of dollars and that can be executed relatively casually.

