feat(spec/java): add strip flag in meta string encoding spec #1565

Merged 3 commits, Apr 24, 2024
10 changes: 5 additions & 5 deletions docs/specification/java_serialization_spec.md
@@ -223,11 +223,11 @@ Meta string is mainly used to encode meta strings such as class name and field n

String binary encoding algorithm:

| Algorithm | Pattern | Description |
|---------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9[c1,c2]` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `c1,c2`: `0b111110~0b111111`, `c1,c2` should be two of `._$` |
| UTF-8 | any chars | UTF-8 encoding |
| Algorithm | Pattern | Description |
|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`; one flag bit is prepended at the start to indicate whether the last decoded char must be stripped, since the last byte may have up to 7 redundant bits (`1` means strip the last char) |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`; one flag bit is prepended at the start to indicate whether the last decoded char must be stripped, since the last byte may have up to 7 redundant bits (`1` means strip the last char) |
| UTF-8 | any chars | UTF-8 encoding |
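
As a rough illustration of the LOWER_SPECIAL row, the following is a minimal sketch of an encoder that writes the strip flag into the most significant bit of the first byte and then packs 5 bits per char. The class and method names are hypothetical and not part of Fury's API, and the value order `.`=26, `_`=27, `$`=28, `|`=29 is assumed from the order in which the table lists the special chars.

```java
/** Sketch only: hypothetical names, not Fury's API. */
final class LowerSpecialSketch {
  /** Maps a char to its 5-bit value (special-char order is an assumption). */
  static int charToValue(char c) {
    if (c >= 'a' && c <= 'z') {
      return c - 'a';
    }
    switch (c) {
      case '.':
        return 26;
      case '_':
        return 27;
      case '$':
        return 28;
      case '|':
        return 29;
      default:
        throw new IllegalArgumentException("Unsupported char: " + c);
    }
  }

  /** Packs the flag bit followed by 5 bits per char, most significant bit first. */
  static byte[] encode(String s) {
    int totalBits = s.length() * 5 + 1; // +1 for the leading strip flag
    byte[] bytes = new byte[(totalBits + 7) / 8];
    int bit = 1; // bit 0 is reserved for the flag
    for (char c : s.toCharArray()) {
      int value = charToValue(c);
      for (int i = 4; i >= 0; i--, bit++) {
        if ((value & (1 << i)) != 0) {
          bytes[bit / 8] |= (byte) (1 << (7 - bit % 8));
        }
      }
    }
    // If the padding could be read as one more 5-bit char, tell the decoder to drop it.
    if (bytes.length * 8 >= totalBits + 5) {
      bytes[0] |= (byte) 0x80;
    }
    return bytes;
  }
}
```

Under these assumptions, `encode("abc")` fills exactly 16 bits and yields `{0x00, 0x22}` with the flag clear, while `encode("ab")` leaves 5 padding bits and yields `{0x80, 0x20}` with the flag set.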

Encoding flags:

10 changes: 5 additions & 5 deletions docs/specification/xlang_serialization_spec.md
@@ -338,11 +338,11 @@ Meta string is mainly used to encode meta strings such as field names.

String binary encoding algorithm:

| Algorithm | Pattern | Description |
|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111` |
| UTF-8 | any chars | UTF-8 encoding |
| Algorithm | Pattern | Description |
|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOWER_SPECIAL | `a-z._$\|` | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`; one flag bit is prepended at the start to indicate whether the last decoded char must be stripped, since the last byte may have up to 7 redundant bits (`1` means strip the last char) |
| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b000000~0b011001`, `A-Z`: `0b011010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111`; one flag bit is prepended at the start to indicate whether the last decoded char must be stripped, since the last byte may have up to 7 redundant bits (`1` means strip the last char) |
| UTF-8 | any chars | UTF-8 encoding |
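
The 6-bit value ranges in the LOWER_UPPER_DIGIT_SPECIAL row can be sketched as a pair of mapping functions. This is an illustrative sketch with hypothetical names, not the Fury implementation; it assumes the fixed special chars `.` and `_` from the xlang pattern column.

```java
/** Sketch only: hypothetical names, not Fury's API. */
final class LowerUpperDigitSpecialSketch {
  /** Maps a char to its 6-bit value following the ranges in the table. */
  static int charToValue(char c) {
    if (c >= 'a' && c <= 'z') {
      return c - 'a'; // 0b000000 ~ 0b011001
    }
    if (c >= 'A' && c <= 'Z') {
      return 26 + (c - 'A'); // 0b011010 ~ 0b110011
    }
    if (c >= '0' && c <= '9') {
      return 52 + (c - '0'); // 0b110100 ~ 0b111101
    }
    if (c == '.') {
      return 62; // 0b111110
    }
    if (c == '_') {
      return 63; // 0b111111
    }
    throw new IllegalArgumentException("Unsupported char: " + c);
  }

  /** Inverse of charToValue for values 0..63. */
  static char valueToChar(int value) {
    if (value >= 0 && value <= 25) {
      return (char) ('a' + value);
    }
    if (value >= 26 && value <= 51) {
      return (char) ('A' + (value - 26));
    }
    if (value >= 52 && value <= 61) {
      return (char) ('0' + (value - 52));
    }
    if (value == 62) {
      return '.';
    }
    if (value == 63) {
      return '_';
    }
    throw new IllegalArgumentException("Invalid 6-bit value: " + value);
  }
}
```

All 64 values are used, so zero padding bits always decode to `a`; that is why the decoder needs the strip flag rather than a sentinel value.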

Encoding flags:

40 changes: 15 additions & 25 deletions java/fury-core/src/main/java/org/apache/fury/meta/MetaString.java
@@ -21,6 +21,7 @@

import java.util.Arrays;
import java.util.Objects;
import org.apache.fury.util.Preconditions;

/**
* Represents a string with metadata that describes its encoding. It supports different encodings
@@ -61,31 +62,27 @@ public static Encoding fromInt(int value) {
private final char specialChar1;
private final char specialChar2;
private final byte[] bytes;
private final int numChars;
private final int numBits;
private final boolean stripLastChar;

/**
* Constructs a MetaString with the specified encoding and data.
*
* @param encoding The type of encoding used for the string data.
* @param bytes The encoded string data as a byte array.
* @param numBits The number of bits used for encoding.
*/
public MetaString(
String string,
Encoding encoding,
char specialChar1,
char specialChar2,
byte[] bytes,
int numChars,
int numBits) {
String string, Encoding encoding, char specialChar1, char specialChar2, byte[] bytes) {
this.string = string;
this.encoding = encoding;
this.specialChar1 = specialChar1;
this.specialChar2 = specialChar2;
this.bytes = bytes;
this.numChars = numChars;
this.numBits = numBits;
if (encoding != Encoding.UTF_8) {
Preconditions.checkArgument(bytes.length > 0);
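// The strip flag is the most significant bit of the first byte, matching the encoder and decoder.
this.stripLastChar = (bytes[0] & 0x80) != 0;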
} else {
this.stripLastChar = false;
}
}

public String getString() {
@@ -108,12 +105,8 @@ public byte[] getBytes() {
return bytes;
}

public int getNumChars() {
return numChars;
}

public int getNumBits() {
return numBits;
public boolean stripLastChar() {
return stripLastChar;
}

@Override
@@ -127,15 +120,14 @@ public boolean equals(Object o) {
MetaString that = (MetaString) o;
return specialChar1 == that.specialChar1
&& specialChar2 == that.specialChar2
&& numChars == that.numChars
&& numBits == that.numBits
&& stripLastChar == that.stripLastChar
&& encoding == that.encoding
&& Arrays.equals(bytes, that.bytes);
}

@Override
public int hashCode() {
int result = Objects.hash(encoding, specialChar1, specialChar2, numChars, numBits);
int result = Objects.hash(encoding, specialChar1, specialChar2, stripLastChar);
result = 31 * result + Arrays.hashCode(bytes);
return result;
}
@@ -153,10 +145,8 @@ public String toString() {
+ specialChar2
+ ", bytes="
+ Arrays.toString(bytes)
+ ", numChars="
+ numChars
+ ", numBits="
+ numBits
+ ", stripLastChar="
+ stripLastChar
+ '}';
}
}
@@ -45,19 +45,18 @@ public MetaStringDecoder(char specialChar1, char specialChar2) {
*
* @param encodedData encoded data using passed <code>encoding</code>.
* @param encoding encoding the passed data.
* @param numBits total bits for encoded data.
* @return Decoded string.
*/
public String decode(byte[] encodedData, Encoding encoding, int numBits) {
public String decode(byte[] encodedData, Encoding encoding) {
switch (encoding) {
case LOWER_SPECIAL:
return decodeLowerSpecial(encodedData, numBits);
return decodeLowerSpecial(encodedData);
case LOWER_UPPER_DIGIT_SPECIAL:
return decodeLowerUpperDigitSpecial(encodedData, numBits);
return decodeLowerUpperDigitSpecial(encodedData);
case FIRST_TO_LOWER_SPECIAL:
return decodeRepFirstLowerSpecial(encodedData, numBits);
return decodeRepFirstLowerSpecial(encodedData);
case ALL_TO_LOWER_SPECIAL:
return decodeRepAllToLowerSpecial(encodedData, numBits);
return decodeRepAllToLowerSpecial(encodedData);
case UTF_8:
return new String(encodedData, StandardCharsets.UTF_8);
default:
@@ -66,30 +65,36 @@ public String decode(byte[] encodedData, Encoding encoding, int numBits) {
}

/** Decoding method for {@link Encoding#LOWER_SPECIAL}. */
private String decodeLowerSpecial(byte[] data, int numBits) {
private String decodeLowerSpecial(byte[] data) {
StringBuilder decoded = new StringBuilder();
int bitIndex = 0;
int bitMask = 0b11111; // 5 bits for mask
while (bitIndex + 5 <= numBits) {
int totalBits = data.length * 8; // Total number of bits in the data
boolean stripLastChar = (data[0] & 0x80) != 0; // Check the first bit of the first byte
int bitMask = 0b11111; // 5 bits for the mask
int bitIndex = 1; // Start from the second bit
while (bitIndex + 5 <= totalBits) {
int byteIndex = bitIndex / 8;
int intraByteIndex = bitIndex % 8;
// Extract the 5-bit character value across byte boundaries if needed
int charValue =
((data[byteIndex] & 0xFF) << 8)
| (byteIndex + 1 < data.length ? (data[byteIndex + 1] & 0xFF) : 0);
charValue = ((byte) ((charValue >> (11 - intraByteIndex)) & bitMask));
charValue = (byte) ((charValue >> (11 - intraByteIndex)) & bitMask);
bitIndex += 5;
decoded.append(decodeLowerSpecialChar(charValue));
}

if (stripLastChar) {
decoded.deleteCharAt(decoded.length() - 1);
}
return decoded.toString();
}
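
To make the new decoding flow easier to follow in isolation, here is a minimal standalone sketch of LOWER_SPECIAL decoding with the strip flag. It reads bits one at a time for clarity instead of using the two-byte window above; the class name, the empty-input guard, and the special-char order (`.`, `_`, `$`, `|` as 26 to 29) are assumptions for illustration, not Fury's code.

```java
/** Sketch only: hypothetical names, not Fury's API. */
final class LowerSpecialDecodeSketch {
  /** Maps a 5-bit value back to its char (special-char order is an assumption). */
  static char valueToChar(int value) {
    if (value >= 0 && value <= 25) {
      return (char) ('a' + value);
    }
    switch (value) {
      case 26:
        return '.';
      case 27:
        return '_';
      case 28:
        return '$';
      case 29:
        return '|';
      default:
        throw new IllegalArgumentException("Invalid 5-bit value: " + value);
    }
  }

  static String decode(byte[] data) {
    if (data.length == 0) {
      return ""; // guard added for the sketch
    }
    boolean stripLastChar = (data[0] & 0x80) != 0; // flag is the first bit of the first byte
    int totalBits = data.length * 8;
    StringBuilder decoded = new StringBuilder();
    // Chars start at bit 1; stop when fewer than 5 bits remain.
    for (int bitIndex = 1; bitIndex + 5 <= totalBits; bitIndex += 5) {
      int value = 0;
      for (int i = 0; i < 5; i++) {
        int bit = bitIndex + i;
        value = (value << 1) | ((data[bit / 8] >> (7 - bit % 8)) & 1);
      }
      decoded.append(valueToChar(value));
    }
    if (stripLastChar) {
      decoded.deleteCharAt(decoded.length() - 1); // drop the char formed by padding bits
    }
    return decoded.toString();
  }
}
```

With these assumptions, the bytes `{0x00, 0x22}` decode to `"abc"`, and `{0x80, 0x20}` decode to `"ab"` because the third 5-bit group is padding and the flag asks for it to be dropped.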

/** Decoding method for {@link Encoding#LOWER_UPPER_DIGIT_SPECIAL}. */
private String decodeLowerUpperDigitSpecial(byte[] data, int numBits) {
private String decodeLowerUpperDigitSpecial(byte[] data) {
StringBuilder decoded = new StringBuilder();
int bitIndex = 0;
int bitIndex = 1;
boolean stripLastChar = (data[0] & 0x80) != 0; // Check the first bit of the first byte
int bitMask = 0b111111; // 6 bits for mask
int numBits = data.length * 8;
while (bitIndex + 6 <= numBits) {
int byteIndex = bitIndex / 8;
int intraByteIndex = bitIndex % 8;
@@ -102,6 +107,9 @@ private String decodeLowerUpperDigitSpecial(byte[] data, int numBits) {
bitIndex += 6;
decoded.append(decodeLowerUpperDigitSpecialChar(charValue));
}
if (stripLastChar) {
decoded.deleteCharAt(decoded.length() - 1);
}
return decoded.toString();
}

@@ -140,13 +148,13 @@ private char decodeLowerUpperDigitSpecialChar(int charValue) {
}
}

private String decodeRepFirstLowerSpecial(byte[] data, int numBits) {
String str = decodeLowerSpecial(data, numBits);
private String decodeRepFirstLowerSpecial(byte[] data) {
String str = decodeLowerSpecial(data);
return StringUtils.capitalize(str);
}

private String decodeRepAllToLowerSpecial(byte[] data, int numBits) {
String str = decodeLowerSpecial(data, numBits);
private String decodeRepAllToLowerSpecial(byte[] data) {
String str = decodeLowerSpecial(data);
StringBuilder builder = new StringBuilder();
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
@@ -48,8 +48,7 @@ public MetaStringEncoder(char specialChar1, char specialChar2) {
*/
public MetaString encode(String input) {
if (input.isEmpty()) {
return new MetaString(
input, Encoding.LOWER_SPECIAL, specialChar1, specialChar2, new byte[0], 0, 0);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
}
Encoding encoding = computeEncoding(input);
return encode(input, encoding);
@@ -66,53 +65,27 @@ public MetaString encode(String input, Encoding encoding) {
Preconditions.checkArgument(
input.length() < Short.MAX_VALUE, "Meta string must be shorter than 32767 chars");
if (input.isEmpty()) {
return new MetaString(
input, Encoding.LOWER_SPECIAL, specialChar1, specialChar2, new byte[0], 0, 0);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
}
int length = input.length();
byte[] bytes;
switch (encoding) {
case LOWER_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeLowerSpecial(input),
length,
length * 5);
bytes = encodeLowerSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case LOWER_UPPER_DIGIT_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeLowerUpperDigitSpecial(input),
length,
length * 6);
bytes = encodeLowerUpperDigitSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case FIRST_TO_LOWER_SPECIAL:
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeFirstToLowerSpecial(input),
length,
length * 5);
bytes = encodeFirstToLowerSpecial(input);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
case ALL_TO_LOWER_SPECIAL:
char[] chars = input.toCharArray();
int upperCount = countUppers(chars);
return new MetaString(
input,
encoding,
specialChar1,
specialChar2,
encodeAllToLowerSpecial(chars, upperCount),
length,
(upperCount + length) * 5);
bytes = encodeAllToLowerSpecial(chars, upperCount);
return new MetaString(input, encoding, specialChar1, specialChar2, bytes);
default:
byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
return new MetaString(
input, Encoding.UTF_8, specialChar1, specialChar2, bytes, bytes.length * 8, 0);
bytes = input.getBytes(StandardCharsets.UTF_8);
return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, bytes);
}
}

@@ -238,10 +211,10 @@ private byte[] encodeGeneric(String input, int bitsPerChar) {
}

private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
int totalBits = chars.length * bitsPerChar;
int totalBits = chars.length * bitsPerChar + 1;
int byteLength = (totalBits + 7) / 8; // Calculate number of needed bytes
byte[] bytes = new byte[byteLength];
int currentBit = 0;
int currentBit = 1;
for (char c : chars) {
int value =
(bitsPerChar == 5) ? charToValueLowerSpecial(c) : charToValueLowerUpperDigitSpecial(c);
@@ -256,7 +229,10 @@ private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
currentBit++;
}
}

boolean stripLastChar = bytes.length * 8 >= totalBits + bitsPerChar;
if (stripLastChar) {
bytes[0] = (byte) (bytes[0] | 0x80);
}
return bytes;
}
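
The reason a single flag bit is enough can be checked with a little arithmetic: after the flag bit, padding is at most 7 bits, which is less than two 5-bit or 6-bit chars, so a decoder that only knows the byte length reads at most one extra char. The sketch below restates the encoder's condition and the decoder's loop bound as plain functions; the class and method names are hypothetical.

```java
/** Sketch only: hypothetical names, not Fury's API. */
final class StripFlagArithmetic {
  /** Bytes needed for numChars chars plus the 1-bit flag. */
  static int byteLength(int numChars, int bitsPerChar) {
    return (numChars * bitsPerChar + 1 + 7) / 8;
  }

  /** Same condition as the encoder: the padding can hold one more whole char. */
  static boolean stripLastChar(int numChars, int bitsPerChar) {
    int totalBits = numChars * bitsPerChar + 1;
    return byteLength(numChars, bitsPerChar) * 8 >= totalBits + bitsPerChar;
  }

  /** Number of chars a decoder reads when it only knows the byte length. */
  static int decodableChars(int byteLength, int bitsPerChar) {
    return (byteLength * 8 - 1) / bitsPerChar;
  }
}
```

For example, 2 chars at 5 bits need 11 bits, occupy 2 bytes, and leave 5 padding bits, so the decoder reads 3 chars and the flag must be set; 3 chars at 5 bits fill 16 bits exactly and the flag stays clear.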

@@ -27,7 +27,6 @@

@Internal
final class MetaStringBytes {
static final int STRIP_LAST_CHAR = 0b1000;
static final short DEFAULT_DYNAMIC_WRITE_STRING_ID = -1;

final byte[] bytes;
@@ -57,25 +56,14 @@ public MetaStringBytes(MetaString metaString) {
}
hashCode &= 0xffffffffffffff00L;
int header = metaString.getEncoding().getValue();
String decoded =
new MetaStringDecoder(metaString.getSpecialChar1(), metaString.getSpecialChar2())
.decode(bytes, metaString.getEncoding(), bytes.length * 8);
if (decoded.length() > metaString.getString().length()) {
header |= STRIP_LAST_CHAR;
}
this.hashCode = hashCode | header;
}

public String decode(char specialChar1, char specialChar2) {
int header = (int) (hashCode & 0xff);
int encodingFlags = header & 0b111;
MetaString.Encoding encoding = MetaString.Encoding.values()[encodingFlags];
String str =
new MetaStringDecoder(specialChar1, specialChar2).decode(bytes, encoding, bytes.length * 8);
if ((header & STRIP_LAST_CHAR) != 0) {
str = str.substring(0, str.length() - 1);
}
return str;
return new MetaStringDecoder(specialChar1, specialChar2).decode(bytes, encoding);
}
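
With `STRIP_LAST_CHAR` gone from the header, the low byte of `hashCode` carries only the encoding value in its low 3 bits. A minimal sketch of that packing and unpacking, using hypothetical helper names and mirroring the masks in the hunk above:

```java
/** Sketch only: hypothetical names, not Fury's API. */
final class MetaHeaderSketch {
  /** Clears the low byte of the hash and stores the encoding value in its low 3 bits. */
  static long pack(long hash, int encodingValue) {
    return (hash & 0xffffffffffffff00L) | (encodingValue & 0b111);
  }

  /** Reads the encoding value back from the low 3 bits of the header byte. */
  static int encodingFlags(long packed) {
    int header = (int) (packed & 0xff);
    return header & 0b111;
  }
}
```

The strip information now travels inside the encoded bytes themselves, so the header no longer has to be consulted to recover the original string length.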

@Override
@@ -31,8 +31,6 @@
* share common immutable datastructure globally across multiple fury.
*/
public final class MetaStringResolver {
public static final byte USE_STRING_VALUE = 0;
public static final byte USE_STRING_ID = 1;
private static final int initialCapacity = 8;
// use a lower load factor to minimize hash collision
private static final float furyMapLoadFactor = 0.25f;