/*
 * Copyright (c) 2004, 2015, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

/**
 *  @test
 *  @bug 5033550
 *  @summary JDWP back end uses modified UTF-8
 *
 *  @author jjh
 *
 *  @modules jdk.jdi
 *  @run build TestScaffold VMConnection TargetListener TargetAdapter
 *  @run compile -g UTF8Test.java
 *  @run driver UTF8Test
 */

/*
There is UTF-8 and there is modified UTF-8, which I will call M-UTF-8.
The two differ in the representation of binary 0, and in some other more
esoteric representations.  See
    http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8
    http://java.sun.com/javase/6/docs/technotes/guides/jni/spec/types.html#wp16542

All of the following are observations of the treatment of binary 0.
In UTF-8, it is represented as one byte:
    0x00
while in modified UTF-8, it is represented as two bytes:
    0xc0 0x80
** I haven't investigated whether the other differences between UTF-8 and
   M-UTF-8 are handled in the same way.

Here is how these are handled in our BE, JDWP, and FE:

- Strings in .class files are M-UTF-8.

- To get the value of a string object from the VM, our BE calls
      char * utf = JNI_FUNC_PTR(env,GetStringUTFChars)(env, string, NULL);
  which returns M-UTF-8.

- To create a string object in the VM, our BE VirtualMachine.createString()
  calls
      string = JNI_FUNC_PTR(env,NewStringUTF)(env, cstring);
  This function expects the string to be M-UTF-8.
  BUG: If the string came from JDWP, then it is actually UTF-8.

- I haven't investigated strings in JVMTI.

- The JDWP spec says that strings are UTF-8.  The intro says this for all
  strings, and the createString command and the StringReference.value
  command say it explicitly.

- Our FE java writes strings to JDWP as UTF-8.

- BE function outStream_writeString uses strlen, meaning it expects no 0
  bytes, meaning that it expects M-UTF-8.  This function writes the byte
  length and then calls outStream.c::writeBytes, which just writes the
  bytes to JDWP as is.
  BUG: If such a string came from the VM via JNI, it is actually M-UTF-8.
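
Here is a quick standalone sketch (hypothetical class, not part of this test
or of the BE/FE code) that makes the difference visible from pure Java:
DataOutputStream.writeUTF emits the same modified UTF-8 used by JNI and the
class file format, while String.getBytes("UTF-8") emits standard UTF-8, so
the embedded binary 0 comes out as 0xc0 0x80 in one and 0x00 in the other:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.util.Arrays;

    // Illustrative sketch only -- not part of UTF8Test.
    public class Utf8VsModifiedUtf8 {
        public static void main(String[] args) throws Exception {
            String s = "xx\u0000yy";

            // Standard UTF-8: the embedded 0 is the single byte 0x00.
            dump("standard UTF-8: ", s.getBytes("UTF-8"));    // 78 78 00 79 79

            // Modified UTF-8 (writeUTF, JNI, class files): the 0 is 0xc0 0x80.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            new DataOutputStream(baos).writeUTF(s);
            byte[] modified = baos.toByteArray();
            // Drop the 2-byte length prefix that writeUTF prepends.
            dump("modified UTF-8: ",
                 Arrays.copyOfRange(modified, 2, modified.length)); // 78 78 c0 80 79 79
        }

        static void dump(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label);
            for (byte b : bytes) {
                sb.append(String.format("%02x ", b & 0xff));
            }
            System.out.println(sb);
        }
    }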

FIX:
- Scan the string to see if it contains an M-UTF-8 char.  If yes,
  - call String(bytes, 0, len, "UTF8") to get a java string.
    Will this work, i.e., if the input is M-UTF-8 instead of real UTF-8?
  - call some java method (NOT JNI, which would just come back with
    M-UTF-8) on the String to get real UTF-8.

- The JDWP StringReference.value command reads a string from the BE out of
  the JDWP stream and does this to create a Java String for it (see
  PacketStream.readString):
      String readString() {
          String ret;
          int len = readInt();
          try {
              ret = new String(pkt.data, inCursor, len, "UTF8");
          } catch(java.io.UnsupportedEncodingException e) {
  This String ctor converts _both_ the M-UTF-8 0xc0 0x80 and the UTF-8 0x00
  into a Java char containing 0x0000.
  Does it do this for the other differences too?

Summary:
1. JDWP says strings are UTF-8.  We interpret this to mean standard UTF-8.
2. JVMTI will be changed to match JNI, saying that strings are M-UTF-8.
3. The BE gets UTF-8 strings off JDWP and must convert them to M-UTF-8
   before giving them to JVMTI or JNI.
4. The BE gets M-UTF-8 strings from JNI and JVMTI and must convert them to
   UTF-8 when writing to JDWP.

Here is how the supplementals are represented in java Strings.  This is
from the java.lang.Character doc:
    The Java 2 platform uses the UTF-16 representation in char arrays and
    in the String and StringBuffer classes.  In this representation,
    supplementary characters are represented as a pair of char values, the
    first from the high-surrogates range (\uD800-\uDBFF), the second from
    the low-surrogates range (\uDC00-\uDFFF).

See utf8.txt
----
NSK Packet.java in the nsk/share/jdwp framework does this to write a string
to JDWP:
    public void addString(String value) {
        final int count = JDWP.TypeSize.INT + value.length();
        addInt(value.length());
        try {
            addBytes(value.getBytes("UTF-8"), 0, value.length());
        } catch (UnsupportedEncodingException e) {
            throw new Failure("Unsupported UTF-8 ecnoding while adding string value to JDWP packet:\n\t" + e);
        }
    }
?? Does this get the standard UTF-8?  I would expect so.

and the readString method does this:
        for (int i = 0; i < len; i++)
            s[i] = getByte();
        try {
            return new String(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new Failure("Unsupported UTF-8 ecnoding while extracting string value from JDWP packet:\n\t" + e);
        }
Thus, this won't notice the modified UTF-8 coming in from JDWP.
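
For the supplementals the two encodings also differ.  A second standalone
sketch (again a hypothetical class, not part of this test): U+10000 is the
surrogate pair \ud800 \udc00 in a Java String (the same pair that appears in
UTF8Targ.vals below), 4 bytes in standard UTF-8, and 6 bytes (two 3-byte
surrogate encodings) in the modified UTF-8 written by writeUTF:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    // Illustrative sketch only -- not part of UTF8Test.
    public class SupplementaryDemo {
        public static void main(String[] args) throws Exception {
            // U+10000, the first supplementary code point.
            String s = new String(Character.toChars(Character.MIN_SUPPLEMENTARY_CODE_POINT));
            System.out.println(s.length());                      // 2 chars (a surrogate pair)
            System.out.println(s.codePointCount(0, s.length())); // 1 code point

            // Standard UTF-8 uses a single 4-byte sequence ...
            System.out.println(s.getBytes("UTF-8").length);      // 4

            // ... while modified UTF-8 encodes each surrogate char in the
            // 3-byte format: 6 bytes, plus writeUTF's 2-byte length prefix.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            new DataOutputStream(baos).writeUTF(s);
            System.out.println(baos.size());                     // 8
        }
    }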
*/

import com.sun.jdi.*;
import com.sun.jdi.event.*;
import com.sun.jdi.request.*;
import java.io.UnsupportedEncodingException;
import java.util.*;

    /********** target program **********/

/*
 * The debuggee has a few Strings the debugger reads via JDI.
 */
class UTF8Targ {
    static String[] vals = new String[] {
        "xx\u0000yy",           // standard UTF-8 0
        "xx\ud800\udc00yy",     // first supplementary
        "xx\udbff\udfffyy"      // last supplementary
        // d800 = 1101 1000 0000 0000    dc00 = 1101 1100 0000 0000
        // dbff = 1101 1011 1111 1111    dfff = 1101 1111 1111 1111
    };

    static String aField;

    public static void main(String[] args){
        System.out.println("Howdy!");
        gus();
        System.out.println("Goodbye from UTF8Targ!");
    }

    static void gus() {
    }
}

    /********** test program **********/

public class UTF8Test extends TestScaffold {
    ClassType targetClass;
    ThreadReference mainThread;
    Field targetField;

    UTF8Test (String args[]) {
        super(args);
    }

    public static void main(String[] args) throws Exception {
        new UTF8Test(args).startTests();
    }

    /********** test core **********/

    protected void runTests() throws Exception {
        /*
         * Get to the top of main()
         * to determine targetClass and mainThread
         */
        BreakpointEvent bpe = startToMain("UTF8Targ");
        targetClass = (ClassType)bpe.location().declaringType();
        targetField = targetClass.fieldByName("aField");

        ArrayReference targetVals =
            (ArrayReference)targetClass.getValue(targetClass.fieldByName("vals"));

        /* For each string in the debuggee's 'vals' array, verify that we can
         * read that value via JDI.
         */
        for (int ii = 0; ii < UTF8Targ.vals.length; ii++) {
            StringReference val = (StringReference)targetVals.getValue(ii);
            String valStr = val.value();

            /*
             * Verify that we can read a value correctly.
             * We read it via JDI, and access it directly from the static
             * var in the debuggee class.
             */
            if (!valStr.equals(UTF8Targ.vals[ii]) ||
                valStr.length() != UTF8Targ.vals[ii].length()) {
                failure(" FAILED: Expected /" + printIt(UTF8Targ.vals[ii]) +
                        "/, but got /" + printIt(valStr) +
                        "/, length = " + valStr.length());
            }
        }

        /* Test 'all' unicode chars - send them to the debuggee via JDI
         * and then read them back.
         */
        doFancyVersion();

        resumeTo("UTF8Targ", "gus", "()V");
        try {
            Thread.sleep(1000);
        } catch (InterruptedException ee) {
        }

        /*
         * resume the target, listening for events
         */
        listenUntilVMDisconnect();

        /*
         * deal with results of test
         * if anything has called failure("foo") testFailed will be true
         */
        if (!testFailed) {
            println("UTF8Test: passed");
        } else {
            throw new Exception("UTF8Test: failed");
        }
    }

    /**
     * For each unicode value, send a string containing
     * it to the debuggee via JDI, read it back via JDI, and see if
     * we get the same value.
     */
    void doFancyVersion() throws Exception {
        // This does 4 chars at a time just to save time.
        for (int ii = Character.MIN_CODE_POINT;
             ii < Character.MIN_SUPPLEMENTARY_CODE_POINT; ii += 4) {
            // Skip the surrogates; the ii += 4 above resumes the loop
            // at Character.MAX_SURROGATE + 1.
            if (ii == Character.MIN_SURROGATE) {
                ii = Character.MAX_SURROGATE - 3;
                continue;
            }
            doFancyTest(ii, ii + 1, ii + 2, ii + 3);
        }

        // Do the supplementary chars.  There are too many of these,
        // so just do a few.
        for (int ii = Character.MIN_SUPPLEMENTARY_CODE_POINT;
             ii <= Character.MAX_CODE_POINT; ii += 2000) {
            doFancyTest(ii, ii + 1, ii + 2, ii + 3);
        }
    }
    void doFancyTest(int... args) throws Exception {
        String ss = new String(args, 0, 4);
        targetClass.setValue(targetField, vm().mirrorOf(ss));

        StringReference returnedVal =
            (StringReference)targetClass.getValue(targetField);
        String returnedStr = returnedVal.value();

        if (!ss.equals(returnedStr)) {
            failure("Set: FAILED: Expected /" + printIt(ss) +
                    "/, but got /" + printIt(returnedStr) +
                    "/, length = " + returnedStr.length());
        }
    }

    /**
     * Return a String containing binary representations of
     * the chars in a String.
     */
    String printIt(String arg) {
        char[] carray = arg.toCharArray();
        StringBuffer bb = new StringBuffer(arg.length() * 5);
        for (int ii = 0; ii < arg.length(); ii++) {
            int ccc = arg.charAt(ii);
            bb.append(String.format("%1$04x ", ccc));
        }
        return bb.toString();
    }

    /**
     * Debugging helper: return a String containing hex representations of
     * the standard UTF-8 bytes of a String.
     */
    String printIt1(String arg) {
        byte[] barray = null;
        try {
            barray = arg.getBytes("UTF-8");
        } catch (UnsupportedEncodingException ee) {
            // "UTF-8" is always supported, so this can't happen.
        }
        StringBuffer bb = new StringBuffer(barray.length * 3);
        for (int ii = 0; ii < barray.length; ii++) {
            bb.append(String.format("%1$02x ", barray[ii]));
        }
        return bb.toString();
    }
}