/*
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.addthis.basis.chars;

import com.google.common.annotations.Beta;

/**
 * A variation on ByteBufs for Character Strings. This variation has three primary goals:
 *
 * 1. Faster serialization and deserialization. Character Strings that are only
 * infrequently treated as anything more than byte sequences waste a lot of CPU
 * and generate a lot of heap garbage (which in turn costs CPU to collect). This is
 * especially egregious for the all too frequent case of deserializing a string,
 * passing it around a few threads, and then serializing it again, but it is almost
 * as bad when the only operations are comparisons to other Strings.
 *
 * 1 (Example). In hydra, bundles are sent from a query worker to the master with
 * many String values serialized as byte arrays in the UTF-8 format. It is entirely
 * possible for such a String to be passed to the user without ever being manipulated.
 * That means it was deserialized and then reserialized back to the same byte array
 * for essentially no reason. That worst case could be resolved by lazy loading or
 * a special un-deserializable value, but this does not scale well for the long tail
 * of infrequent, low-intensity operations like comparisons to other Character Strings.
 * Additionally, a lazy loading implementation would likely be implemented as a wrapper
 * class. That would cause another layer of indirection and memory waste. This solution
 * is closer to 'lazy loading of chars', which actually turns out to be pretty cheap.
 *
 * 2. Reduced memory overhead. The standard Java char type is 16 bits, but for the common case
 * of all or mostly ASCII characters, this is twice (or nearly twice) as much memory as needed.
 *
 * 3. More flexible char[] semantics, similar to the difference between byte[]s and ByteBufs.
 * E.g., decreasing the number of readable values is possible as a constant time operation
 * without creating a new array. String itself is also really, deeply, into making char[]
 * copies. See AsciiSequence.toString() for an example of how easy it can be to accidentally
 * make lots of array copies, and how hard it is to avoid even when you are trying to.
 * (In hydra, AbstractBufferingHttpBundleEncoder ran into a similar issue where it
 * was mistakenly creating an unnecessary copy.)
 *
 * Secondary goals/benefits:
 *
 * - Specializing in one encoding with one backing structure allows for much more efficient
 * encode and decode methods than those in the standard library, which are held back by
 * abstraction limitations.
 *
 * - Gets around some of the other more egregious inefficiencies of JDK UTF-8 encoding/decoding,
 * like decoding pre-allocating three times as much space as needed for the ASCII-only case and
 * then cutting down by re-allocating to the smaller char array. This implementation allows and
 * encourages providing hints about how much to allocate, and should be able to more easily
 * support correcting under-estimates (as far as I can tell, the JDK NIO coding library does
 * support that -- it just isn't actually used anywhere I can find. Possibly because benchmarks
 * showed it wasn't worth it, but it is also possible that was due to limitations we do not
 * have here).
 *
 * - Using CharSequence here and in other places gives us more options with respect to
 * optimizing things like sub-string semantics (shared/unshared) and efficient streaming
 * cache hit detection.
 *
 * - Using ByteBufs directly makes integration with other ByteBuf based IO easy and efficient.
 *
 * This interface combines several related ones and additionally imposes the following contracts:
 *
 * - All backing data should be stored in UTF-8 format only. UTF-8 is the one
 * true format, and heretics will be persecuted without remorse.
 *
 * - hashCode and equals should return consistent values across implementations
 * for the same underlying logical character sequence.
 * -- For lack of other motivations, and for possibly no actual benefit, these
 * will be the same values that an equivalent String representation would return.
 *
 * - compareTo should perform lexicographical string comparison.
 * -- Note that while such comparisons are likely to be consistent with other
 * CharSequence implementations, we cannot actually guarantee that to be the
 * case because CharSequence does not require it. Accordingly, we do not derive
 * much benefit from declaring Comparable of type CharSequence because, e.g.,
 * native Strings declare Comparable only for other Strings.
 * -- Also note that the UTF-8 format (which you are required to implement)
 * should be able to do lexicographical comparisons without converting to chars
 * (unsigned byte-wise comparison should suffice).
 *
 * Component reasoning
 *
 * CharSequence: to sub in for arbitrary String usages.
 *
 * Appendable: convenient for building CharSequences, and CharBufs are likely efficient
 * at doing so.
 *
 * Comparable: so that CharBuf-only CharSequence environments can use sorted data structures.
 *
 * ByteBufHolder: subject to change, but helpful for resource management and for exposing
 * the underlying data store for more efficient operations than per-char method calls.
 * Possible replacements for ByteBufHolder might be directly extending ByteBuf with
 * more/different char methods, or simply creating a whole char based equivalent with
 * conversions.
 *
 * Maybe add Iterable of Character, or a primitive equivalent?
 */
@Beta
public interface CharBuf extends ReadableCharBuf, Appendable {
}
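
/*
The compareTo and hashCode contracts described in the class comment can be illustrated with a
minimal sketch. It is not taken from this library: it works on plain byte[] instead of ByteBuf,
and the names Utf8ContractSketch, compareUtf8, asciiHashCode and utf8 are hypothetical. The hash
helper assumes ASCII-only input to stay short.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of com.addthis.basis.chars.
final class Utf8ContractSketch {

    // Lexicographic comparison of UTF-8 data without decoding to chars.
    // Bytes are masked to 0..255 because Java's byte type is signed.
    static int compareUtf8(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        for (int i = 0; i < limit; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // One sequence is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
    }

    // Same value as String.hashCode() for ASCII-only data, computed from the bytes.
    // Assumption: non-ASCII input is rejected to keep the sketch short; a fuller
    // version would decode multi-byte sequences into UTF-16 code units first.
    static int asciiHashCode(byte[] utf8) {
        int hash = 0;
        for (byte b : utf8) {
            if (b < 0) {
                throw new IllegalArgumentException("non-ASCII byte in input");
            }
            hash = 31 * hash + b;
        }
        return hash;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Each line below prints "true".
    public static void main(String[] args) {
        System.out.println(compareUtf8(utf8("apple"), utf8("banana")) < 0);
        System.out.println(asciiHashCode(utf8("hydra")) == "hydra".hashCode());

        // U+1F600 (a surrogate pair in UTF-16) versus U+FFFD.
        String emoji = "\uD83D\uDE00";
        String replacement = "\uFFFD";
        System.out.println(compareUtf8(utf8(emoji), utf8(replacement)) > 0);
        System.out.println(emoji.compareTo(replacement) < 0);
    }
}
```

The last two checks show why consistency with String ordering cannot be guaranteed: unsigned
byte order over UTF-8 follows code point order, while String.compareTo compares UTF-16 code
units, so the two disagree for some inputs that contain supplementary characters.
*/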