I was quite surprised that we _always_ copy the array. I would have guessed that .substring(1) (all but the first character) is very common, and that it would be a win to share the array in some circumstances. A straw-man heuristic would be "share the array if we're using at least half of it".
But they didn't do that. Anyone know if there's a public discussion?
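To make the straw man concrete, the heuristic I have in mind would look something like this (illustrative only, reusing the pre-7u6 value/offset/count layout; HeuristicString is not real JDK code):

```java
import java.util.Arrays;

// Straw-man heuristic, not what any JDK version actually does: share the
// backing array only when the substring covers at least half of it,
// otherwise copy so the big array can be collected.
final class HeuristicString {
    private final char[] value;   // backing array, possibly shared
    private final int offset;     // first index used
    private final int count;      // number of characters

    HeuristicString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    HeuristicString substring(int beginIndex) {
        int newCount = count - beginIndex;
        if (newCount * 2 >= value.length) {
            // Using at least half of the array: share it (old behaviour).
            return new HeuristicString(value, offset + beginIndex, newCount);
        }
        // Small slice of a big array: copy it out (new behaviour).
        char[] copy = Arrays.copyOfRange(value, offset + beginIndex, offset + beginIndex + newCount);
        return new HeuristicString(copy, 0, newCount);
    }

    @Override
    public String toString() {
        return new String(value, offset, count); // String(char[], int, int) copies
    }
}
```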
The way I see it, the big win for memory is that they've shaved two ints off every String instance. Most Strings are small and probably never have .substring called on them, so memory use might even end up lighter by always copying but using a smaller String instance in the typical case.
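For reference, the two ints in question are offset and count. The comparison, sketched as field declarations (illustrative only, not the JDK sources; object headers, padding, and the cached hash fields are ignored):

```java
final class OldStyleString {      // pre-7u6: a view into a possibly shared array
    final char[] value;           // backing array, possibly shared with other strings
    final int offset;             // first index of value used by this string
    final int count;              // number of characters

    OldStyleString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }
}

final class NewStyleString {      // post-7u6: the array is always exactly the string
    final char[] value;           // never shared, never a partial view

    NewStyleString(char[] value) {
        this.value = value;
    }
}
// Dropping offset and count saves 8 bytes of instance fields per String.
```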
Java's strings are immutable, so copy-on-write is nonsense.
Performance tuning for mutable string types has been moving away from the copy-on-write paradigm (it used to be common in C++, but not any more). It's a big mess because the design of the C++ string class was truly botched; it's much less of a mess in other languages where you have different types for mutable and immutable strings (Java, C#, Python, etc.).
So, I have to wonder: if I, as a random nobody, have been affected adversely by this change (a parser that now takes >10x as long), what business applications have been affected by it?
And I have a question: why could it not be done like this:
Java currently has weak references, soft references, and phantom references. If there were an additional type of weak-like reference that had a callback with the object before it was garbage-collected, this conundrum would be simple. Have substrings hold a weak-like reference to the character array. If the parent string is garbage-collected, then do the copying as in the new behavior, but until then don't: use the old behavior.
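Purely as a sketch of what that would look like: PreMortemReference below is invented (nothing like it exists in the JDK, which is the whole point), and the race and threading issues raised elsewhere in the thread are waved away.

```java
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical reference type: it does NOT exist in the JDK. It would hand
// the referent to a callback just before collection, so the callback can
// still read the referent's contents one last time.
interface PreMortemReference<T> {
    static <T> PreMortemReference<T> create(T referent, Consumer<T> beforeCollect) {
        throw new UnsupportedOperationException("would need VM support");
    }
    T get(); // null once the referent has been collected
}

// Sketch of the substring described above: shares the parent's array while
// the parent is alive, copies its own slice only when the parent goes away.
final class LazySubstring {
    private final int offset;
    private final int count;
    private final PreMortemReference<char[]> shared; // parent's array, held weakly
    private volatile char[] copied;                  // filled in only if the parent dies

    LazySubstring(char[] parentValue, int offset, int count) {
        this.offset = offset;
        this.count = count;
        this.shared = PreMortemReference.create(parentValue,
                arr -> copied = Arrays.copyOfRange(arr, offset, offset + count));
    }

    char charAt(int i) {
        char[] local = copied;
        if (local != null) return local[i];  // compacted copy, parent is gone
        return shared.get()[offset + i];     // shared array, parent still alive
    }
}
```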
> If there were an additional type of weak-like reference that had a callback with the object before it was garbage-collected, this conundrum would be simple.
You've basically described weak references to objects with finalizers. Unfortunately, I'd wager this is even more expensive, not less.
No, TheLoneWolfling's idea can't be implemented with weak references to objects with finalizers. Firstly, finalizers only run after weak references are cleared (I think - please correct me if I'm wrong!). Secondly, he needs GC to trigger code that mutates the surviving object whose reference was cleared (the substring), whereas the finalizer runs in the object that is no longer referred to (the array).
He's right that a new VM-level hook would be required to implement his idea. The benefits don't clearly justify making a change of that magnitude.
Depends. I'm imagining, e.g., your large parent string invasively giving its sub-slices copies on finalization through its own list, in which case the clearing of the weak references isn't much of an issue. Not easy to make fast and thread-safe, however.
In C#, WeakReference can take a boolean parameter during construction to specify whether it tracks the object past finalization or not (defaulting to the same behavior as Java, which Google tells me is, as you say, not to), and you can re-register your finalizer. But even so, this kind of thing isn't what the GC is primarily designed for, and it will probably show. I'm not sure if you can re-register finalizers in Java.
If you have such a callback, that means your code can block a GC thread, which then affects the GC pacing parameters. I think it will be quite difficult to design such a feature in a robust manner. You will definitely need some new internal flags on objects that are pending the callback, and possibly extra locks around them: a new thread pool just for those callbacks, blocking timeouts, lock coordination with all the GC algorithms.
Your idea would make it easy to add an unacceptable amount of overhead to GC. Since GC is already the biggest performance killer on the JVM, I don't think it makes sense to give developers this particular loaded gun to shoot themselves in the foot with.
Speaking of CharSequence, one sneaky and bizarre option would be to make String's implementation of CharSequence::subSequence work the old way. subSequence returns a CharSequence, not a String, so it is free to return some specialised object which specifically represents a substring rather than a first-class string. Rather like slices in Go, which would win Java some trendiness points.
This would be a surprising bit of behaviour, which means it's probably a bad idea. However, it wouldn't break any current code, because it preserves current behaviour, and hopefully wouldn't break future code, because people could learn about the quirk. Also, I think subSequence tends to be used to create temporary objects as computational intermediaries, rather than new long-lived objects which escape to the heap, so it shouldn't lead to excessive packratting in the way the old-style substring did.
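For what it's worth, the kind of specialised object that could be returned might look like this sketch (StringSlice is illustrative only; today String.subSequence really does just delegate to substring):

```java
// A CharSequence view that shares its parent String instead of copying.
// Illustrative only; not what the JDK actually returns.
final class StringSlice implements CharSequence {
    private final String parent;
    private final int start;   // inclusive
    private final int end;     // exclusive

    StringSlice(String parent, int start, int end) {
        if (start < 0 || end > parent.length() || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.parent = parent;
        this.start = start;
        this.end = end;
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return parent.charAt(start + index); }

    @Override public CharSequence subSequence(int from, int to) {
        return new StringSlice(parent, start + from, start + to);
    }

    @Override public String toString() {
        return parent.substring(start, end);  // materialise a real String on demand
    }
}
```

Note that a slice like this keeps the whole parent string reachable, so it only avoids packratting if it really is used as a short-lived intermediary.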
Back in the days when I worked in low-latency FX trading, we identified this "feature" after noticing that the memory footprint of our process just kept growing and growing, but we couldn't quite pin down the root cause.
> Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.
Now I see the benefit of the optimization, so I am almost against removing it (we coded our own substring at the time). How about Oracle providing a static method that returns a substring in O(1)?
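There is already one way to get an O(1) substring view today without any new static method, sketched below, with the obvious caveat that the view keeps the whole original string reachable, which is exactly the retention problem the 7u6 change was meant to avoid:

```java
import java.nio.CharBuffer;

public class SubstringView {
    public static void main(String[] args) {
        String big = "0123456789abcdefghijklmnopqrstuvwxyz";
        // CharBuffer.wrap returns a read-only CharSequence view over 'big':
        // no characters are copied, but 'big' stays alive as long as the view does.
        CharSequence view = CharBuffer.wrap(big, 10, 20);
        System.out.println(view);           // prints "abcdefghij"
        System.out.println(view.length());  // 10
    }
}
```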
I believe most Java devs don't even know about the current substring implementation and most programs are not adversely impacted by it either.
I wish Oracle would put compressed strings back in... where strings are represented internally as byte[] rather than char[]. Huge memory and performance advantage if your application only handles ASCII-like characters.
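The saving being described is simply one byte per character instead of two for ASCII-ish text; a trivial illustration (the class name and sample string are made up):

```java
import java.nio.charset.StandardCharsets;

public class AsciiFootprint {
    public static void main(String[] args) {
        String s = "plain old ASCII log line";
        int asChars = s.toCharArray().length * 2;  // char[] storage: 2 bytes per character
        int asBytes = s.getBytes(StandardCharsets.ISO_8859_1).length;  // byte[]: 1 byte per character
        System.out.println(asChars + " bytes as char[] vs " + asBytes + " bytes as byte[]");
    }
}
```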
We did the whole new String() on substring trick on a sparse CSV. Turned out we had over a hundred MB of empty strings. Lots and lots of duplicates. So the empty string is special-cased to return a constant.
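A sketch of that special case, assuming the pre-7u6 copying trick (detach is a made-up helper name):

```java
public final class Detach {
    // Force a substring off its big shared backing array (pre-7u6 semantics),
    // but return the single interned "" constant for empty strings instead of
    // allocating millions of distinct empty copies.
    public static String detach(String s) {
        if (s.isEmpty()) {
            return "";  // one shared constant, not a fresh object per field
        }
        return new String(s);  // on pre-7u6 JVMs this copies the slice into a right-sized array
    }
}
```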
http://www.reddit.com/r/programming/comments/1qw73v/til_orac...