I was quite surprised that we _always_ copy the array. I would have guessed that .substring(1) (all but the first character) is very common, and that it would be a win to share the array in some circumstances. A straw-man heuristic would be "share the array if we're using at least half of it".
But they didn't do that. Anyone know if there's a public discussion?
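To make the straw man concrete, the heuristic I have in mind would look something like this (illustrative only, reusing the pre-7u6 value/offset/count layout; HeuristicString is not real JDK code):

```java
import java.util.Arrays;

// Straw-man heuristic, not what any JDK version actually does: share the
// backing array only when the substring covers at least half of it,
// otherwise copy so the big array can be collected.
final class HeuristicString {
    private final char[] value;   // backing array, possibly shared
    private final int offset;     // first index used
    private final int count;      // number of characters

    HeuristicString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    HeuristicString substring(int beginIndex) {
        int newCount = count - beginIndex;
        if (newCount * 2 >= value.length) {
            // Using at least half of the array: share it (old behaviour).
            return new HeuristicString(value, offset + beginIndex, newCount);
        }
        // Small slice of a big array: copy it out (new behaviour).
        char[] copy = Arrays.copyOfRange(value, offset + beginIndex, offset + beginIndex + newCount);
        return new HeuristicString(copy, 0, newCount);
    }

    @Override
    public String toString() {
        return new String(value, offset, count); // String(char[], int, int) copies
    }
}
```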
The way I see it, the big win for memory is that they've shaved two ints off every String instance. Most Strings are small and probably never have .substring called on them, so memory use might even end up lighter by always copying but using a smaller String instance in the typical case.
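For reference, the two ints in question are offset and count. The comparison, sketched as field declarations (illustrative only, not the JDK sources; object headers, padding, and the cached hash fields are ignored):

```java
final class OldStyleString {      // pre-7u6: a view into a possibly shared array
    final char[] value;           // backing array, possibly shared with other strings
    final int offset;             // first index of value used by this string
    final int count;              // number of characters

    OldStyleString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }
}

final class NewStyleString {      // post-7u6: the array is always exactly the string
    final char[] value;           // never shared, never a partial view

    NewStyleString(char[] value) {
        this.value = value;
    }
}
// Dropping offset and count saves 8 bytes of instance fields per String.
```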
Java's strings are immutable, so copy-on-write is nonsense.
Performance tuning for mutable string types has been moving away from the copy-on-write paradigm (it used to be common in C++, but not any more). It's a big mess because the design of the C++ string class was truly botched; it's much less of a mess in other languages where you have different types for mutable and immutable strings (Java, C#, Python, etc.).
So, I have to wonder: if I, as a random nobody, have been affected adversely by this change (a parser that now takes >10x as long), what business applications have been affected by it?
And I have a question: why could it not be done like this:
Java currently has weak references, soft references, and phantom references. If there were an additional type of weak-like reference that had a callback with the object before it was garbage-collected, this conundrum would be simple. Have substrings hold a weak-like reference to the character array. If the parent string is garbage-collected, then do the copying as in the new behavior, but until then don't: use the old behavior.
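Purely as a sketch of what that would look like: PreMortemReference below is invented (nothing like it exists in the JDK, which is the whole point), and the race and threading issues raised elsewhere in the thread are waved away.

```java
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical reference type: it does NOT exist in the JDK. It would hand
// the referent to a callback just before collection, so the callback can
// still read the referent's contents one last time.
interface PreMortemReference<T> {
    static <T> PreMortemReference<T> create(T referent, Consumer<T> beforeCollect) {
        throw new UnsupportedOperationException("would need VM support");
    }
    T get(); // null once the referent has been collected
}

// Sketch of the substring described above: shares the parent's array while
// the parent is alive, copies its own slice only when the parent goes away.
final class LazySubstring {
    private final int offset;
    private final int count;
    private final PreMortemReference<char[]> shared; // parent's array, held weakly
    private volatile char[] copied;                  // filled in only if the parent dies

    LazySubstring(char[] parentValue, int offset, int count) {
        this.offset = offset;
        this.count = count;
        this.shared = PreMortemReference.create(parentValue,
                arr -> copied = Arrays.copyOfRange(arr, offset, offset + count));
    }

    char charAt(int i) {
        char[] local = copied;
        if (local != null) return local[i];  // compacted copy, parent is gone
        return shared.get()[offset + i];     // shared array, parent still alive
    }
}
```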
> If there were an additional type of weak-like reference that had a callback with the object before it was garbage-collected, this conundrum would be simple.
You've basically described weak references to objects with finalizers. Unfortunately, I'd wager this is even more expensive, not less.
No, TheLoneWolfling's idea can't be implemented with weak references to objects with finalizers. Firstly, finalizers only run after weak references are cleared (I think - please correct me if I'm wrong!). Secondly, he needs GC to trigger code that mutates the surviving object whose reference was cleared (the substring), whereas the finalizer runs in the object that is no longer referred to (the array).
He's right that a new VM-level hook would be required to implement his idea. The benefits don't clearly justify making a change of that magnitude.
Depends. I'm imagining, e.g., your large parent string invasively giving its sub-slices copies on finalization through its own list, in which case the clearing of the weak references isn't much of an issue. Not easy to make fast and thread-safe, however.
In C#, WeakReference can take a boolean parameter during construction to specify whether it tracks the object past finalization or not (defaulting to the same behavior as Java, which Google tells me is, as you say, not to), and you can re-register your finalizer. But even so, this kind of thing isn't what the GC is primarily designed for, and it will probably show. I'm not sure if you can re-register finalizers in Java.
If you have such a callback, that means your code can block a GC thread, which then affects the GC pacing parameters. I think it will be quite difficult to design such a feature in a robust manner. You will definitely need some new internal flags on objects that are pending the callback, and possibly extra locks around them: a new thread pool just for those callbacks, blocking timeouts, lock coordination with all the GC algorithms.
Your idea would make it easy to add an unacceptable amount of overhead to GC. Since GC is already the biggest performance killer on the JVM, I don't think it makes sense to give developers this particular loaded gun to shoot themselves in the foot with.
Speaking of CharSequence, one sneaky and bizarre option would be to make String's implementation of CharSequence::subSequence work the old way. subSequence returns a CharSequence, not a String, so it is free to return some specialised object which specifically represents a substring rather than a first-class string. Rather like slices in Go, which would win Java some trendiness points.
This would be a surprising bit of behaviour, which means it's probably a bad idea. However, it wouldn't break any current code, because it preserves current behaviour, and hopefully wouldn't break future code, because people could learn about the quirk. Also, I think subSequence tends to be used to create temporary objects as computational intermediaries, rather than new long-lived objects which escape to the heap, so it shouldn't lead to excessive packratting in the way the old-style substring did.
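For what it's worth, the kind of specialised object that could be returned might look like this sketch (StringSlice is illustrative only; today String.subSequence really does just delegate to substring):

```java
// A CharSequence view that shares its parent String instead of copying.
// Illustrative only; not what the JDK actually returns.
final class StringSlice implements CharSequence {
    private final String parent;
    private final int start;   // inclusive
    private final int end;     // exclusive

    StringSlice(String parent, int start, int end) {
        if (start < 0 || end > parent.length() || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.parent = parent;
        this.start = start;
        this.end = end;
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return parent.charAt(start + index); }

    @Override public CharSequence subSequence(int from, int to) {
        return new StringSlice(parent, start + from, start + to);
    }

    @Override public String toString() {
        return parent.substring(start, end);  // materialise a real String on demand
    }
}
```

Note that a slice like this keeps the whole parent string reachable, so it only avoids packratting if it really is used as a short-lived intermediary.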
Back in the days when I worked in low-latency FX trading, we identified this "feature" after noticing that the memory footprint of our process just kept growing and growing, but we couldn't quite pin down the root cause.
> Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.
Now I see the benefit of the optimization, so I am almost against removing it (we coded our own substring at the time). How about Oracle providing a static method that returns a substring in O(1)?
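There is already one way to get an O(1) substring view today without any new static method, sketched below, with the obvious caveat that the view keeps the whole original string reachable, which is exactly the retention problem the 7u6 change was meant to avoid:

```java
import java.nio.CharBuffer;

public class SubstringView {
    public static void main(String[] args) {
        String big = "0123456789abcdefghijklmnopqrstuvwxyz";
        // CharBuffer.wrap returns a read-only CharSequence view over 'big':
        // no characters are copied, but 'big' stays alive as long as the view does.
        CharSequence view = CharBuffer.wrap(big, 10, 20);
        System.out.println(view);           // prints "abcdefghij"
        System.out.println(view.length());  // 10
    }
}
```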
I believe most Java devs don't even know about the current substring implementation and most programs are not adversely impacted by it either.
I wish Oracle would put compressed strings back in... where strings are represented internally as byte[] rather than char[]. Huge memory and performance advantage if your application only handles ASCII-like characters.
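The saving being described is simply one byte per character instead of two for ASCII-ish text; a trivial illustration (the class name and sample string are made up):

```java
import java.nio.charset.StandardCharsets;

public class AsciiFootprint {
    public static void main(String[] args) {
        String s = "plain old ASCII log line";
        int asChars = s.toCharArray().length * 2;  // char[] storage: 2 bytes per character
        int asBytes = s.getBytes(StandardCharsets.ISO_8859_1).length;  // byte[]: 1 byte per character
        System.out.println(asChars + " bytes as char[] vs " + asBytes + " bytes as byte[]");
    }
}
```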
We did the whole new String() on substring trick on a sparse CSV. Turned out we had over a hundred MB of empty strings. Lots and lots of duplicates. So the empty string is special-cased to return a constant.
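A sketch of that special case, assuming the pre-7u6 copying trick (detach is a made-up helper name):

```java
public final class Detach {
    // Force a substring off its big shared backing array (pre-7u6 semantics),
    // but return the single interned "" constant for empty strings instead of
    // allocating millions of distinct empty copies.
    public static String detach(String s) {
        if (s.isEmpty()) {
            return "";  // one shared constant, not a fresh object per field
        }
        return new String(s);  // on pre-7u6 JVMs this copies the slice into a right-sized array
    }
}
```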
http://www.reddit.com/r/programming/comments/1qw73v/til_orac...