Sure, one could do it for every variable individually, but how about compressing whole objects using the standard serialization mechanism?
If you use WEKA, you have the SerializedObject.
If not - enter the CompressedReference:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
public class CompressedReference<T extends Serializable> implements Serializable {
    private static final long serialVersionUID = 7967994340450625830L;
    private byte[] theCompressedReferent = null;
    public CompressedReference(T referent) {
        try {
            compress(referent);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    public int size() {
        return theCompressedReferent.length;
    }
    public T get() {
        try {
            return decompress();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return null;
    }
    private void compress(T referent) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes it and finishes the GZIP stream,
        // so the byte array is complete before it is copied out.
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(referent);
        }
        theCompressedReferent = bos.toByteArray();
    }
    @SuppressWarnings("unchecked")
    private T decompress() throws IOException, ClassNotFoundException {
        // try-with-resources closes the stream chain even if readObject() throws.
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(theCompressedReferent)))) {
            return (T) ois.readObject();
        }
    }
}
A quick test shows a size of 528 bytes for a String of 250 characters (Java chars are 16 bits wide, so two bytes per character) and 64 bytes after compression, a ratio of about 8:1. The only requirement is that the object stored in the reference implements Serializable.
And yes, I have to clear those TODO reminders... ;-)
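A comparison like the quick test above could be reproduced roughly as follows. This is only a minimal sketch, not the original test code: the class CompressionDemo and its helper methods are hypothetical names, the test string is an arbitrary repetitive one, and the baseline is the plain serialized size rather than the in-memory size, so the numbers will not match the ones quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original post.
public class CompressionDemo {

    // Plain Java serialization, used as the uncompressed baseline.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Serialization piped through GZIP, the same idea as in CompressedReference.
    static byte[] compress(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Reverses compress(): gunzip, then deserialize.
    static Object decompress(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build an arbitrary, repetitive 250-character test string.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 25; i++) {
            sb.append("0123456789");
        }
        String text = sb.toString();

        byte[] plain = serialize(text);
        byte[] packed = compress(text);
        System.out.println("serialized: " + plain.length + " bytes");
        System.out.println("compressed: " + packed.length + " bytes");
        System.out.println("round trip ok: " + text.equals(decompress(packed)));
    }
}
```

How much is saved depends heavily on the data: repetitive content compresses well, while short or already compact objects may gain little once the GZIP header and trailer are added.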
Update
The compression ratio for a CDK Molecule representation of 2-Fluoronaphthalene is about 3:1.
The compression ratio for the V2000 molfile of 2-Fluoronaphthalene is about 7:1.