Aleksandar Prokopec, David Leopoldseder,
Gilles Duboscq, Thomas Wuerthinger
Oracle Labs Switzerland
October 2017
The following is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract.
It is not a commitment to deliver any material, code, or functionality, and should not be relied in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle.
Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.
1. A JIT compiler can make Scala collection operations optimal.
2. For this, a JIT compiler is usually a better choice than a static compiler.
3. Specific optimizations are more easily done by a static compiler.
==
Same performance as equivalent handwritten code
with while
loops,
if
statements,
stacklocal variables and array operations
HighLevel Optimizations in Graal
Case Study: foldLeft
Case Study: map
Performance
def sq(x: Double) = x * x
def tss(xs: List[Double]): Double =
if (xs.isEmpty) 0
else sq(xs.head) + tss(xs.tail)
def sq(x: Double) = x * x
def tss(xs: List[Double]): Double =
if (xs.isEmpty) 0 else {
val h = xs.head
h * h + tss(xs.tail)
}
How to inline? Simple.
When to inline? Anything but simple.
def tss(xs: Seq[Int]): Int =
xs.fold(0)(_ + sq(_))
def tss(xs: Seq[Int]): Int =
xs.fold(0)(_ + sq(_))

> xs: List[Int] 62%

> xs: ArrayOps[Int] 34%

\> xs: Vector[Int] 4%
def tss(xs: Seq[Int]): Int =
xs match {
case xs: List[Int] =>
xs.fold(0)(_ + sq(_))
case xs: ArrayOps[Int] =>
xs.fold(0)(_ + sq(_))
case _ =>
xs.fold(0)(_ + sq(_))
}
A JIT compiler has access to receiver type profiles.
val xs: Seq[Int] = 1 :: 2 :: 3
val s = xs.fold(0)(_ + _)
val p = s * s
s * s + 2 * p
Simplify until convergence.
val xs: Seq[Int] = 1 :: 2 :: 3
val s = xs.fold(0)(_ + _)
val p = s * s
p + 2 * p
Global value numbering.
val xs: Seq[Int] = 1 :: 2 :: 3
val s = xs.fold(0)(_ + _)
val p = s * s
p + (p << 1)
Strength reduction.
val xs: Seq[Int] = 1 :: 2 :: 3
val s = (xs: List[Int]).fold(0)(_ + _)
val p = s * s
p + (p << 1)
Monomorphic invokes.
Canonicalization  many different simplification rules, triggered by localized patterns in the IR.
Usually monotonic, and converging.
def tss(a: Array[Int]): Int = {
val it = new ArrayIterator
it.current = 0
it.array = a
var sum = 0
while (it.current < it.array.length) {
sum += sq(it.array(it.current))
it.current += 1
}
sum
}
def tss(a: Array[Int]): Int = {
var current = 0
var sum = 0
while (current < a.length) {
sum += sq(a(current))
current += 1
}
sum
}
See:
Partial Escape Analysis and Scalar Replacement for Java
def foldLeft[B](z: B)(op: (B, A) => B): B =
foldl(0, length, z, op)
@tailrec
def foldl[B](
s: Int, e: Int, z: B, op: (B, A) => B
): B =
if (s == e) z
else foldl(s + 1, e, op(z, this(s)), op)
def foldLeft[B](z: B)(op: (B, A) => B): B =
foldl(0, length, z, op)
@tailrec
def foldl[B](
s: Int, e: Int, z: B, op: (B, A) => B
): B =
if (s == e) z
else foldl(s + 1, e, op(z, this(s)), op)
def foldLeft[B](z: B)(op: (B, A) => B): B =
foldl(0, length, z, op)
def computeSum(xs: Array[Int]): Long =
xs.foldLeft(0L)(_ + _)
def computeSum(xs: Array[Int]): Long =
xs.foldLeft(0L)(_ + _)
def computeSum(xs: Array[Int]): Long = {
val ops = new ArrayOps(xs)
val z = BoxesRuntime.boxToLong(0L)
val op = new $$anonfun$computeSum$1
val res = ops.foldLeft(0L)(op)
BoxesRuntime.unboxToLong(res)
}
To optimize, callees must be inlined. First, a call graph is created.
The call graph is expanded incrementally.
A heuristic is used to inline some of the nodes.
Polymorphics callsites are represented with typeswitches.
@tailrec def foldl[B](
s: Int, e: Int, z: B, op: (B, A) => B): B =
if (s == e) z
else foldl(s + 1, e, op(z, apply(s)), op)
def foldLeft[B](z: B)(op: (B, A) => B): B =
foldl(0, length, z, op)
def computeSum(xs: Array[Int]): Long = {
val z = BoxesRuntime.boxToLong(0L)
val res = new ArrayOps(xs).foldLeft(0L)(_ + _)
BoxesRuntime.unboxToLong(res)
}
Polymorphics callsites are represented with typeswitches.
Receiver type information improves, so polymorphic callsites are removed.
@tailrec def foldl[B](
s: Int, e: Int, z: B, op: (B, A) => B): B =
if (s == e) z
else foldl(s + 1, e, op(z, apply(s)), op)
def foldLeft[B](z: B)(op: (B, A) => B): B =
foldl(0, length, z, op)
def computeSum(xs: Array[Int]): Long = {
val z = BoxesRuntime.boxToLong(0L)
val res = new ArrayOps(xs).foldLeft(0L)(_ + _)
res.value
}
The expanded part of the call graph can be completely inlined.
def computeSum(xs: Array[Int]): Long = {
val ops = new ArrayOps(xs)
var z = 0L
var s = 0; val e = ops.array.length
while (s < e) {
if (s < 0  s >= ops.array.length)
throw new ArrayIndexOutOfBoundsException
z = z + ops.array(s)
s += 1
}
val res = new java.lang.Long(e)
res.value
}
def computeSum(xs: Array[Int]): Long = {
val ops = new ArrayOps(xs)
var z = 0L
var s = 0; val e = ops.array.length
while (s < e) {
z = z + ops.array(s)
s += 1
}
val res = new java.lang.Long(e)
res.value
}
def computeSum(xs: Array[Int]): Long = {
var z = 0L
var s = 0; val e = xs.length
while (s < e) {
z = z + xs(s)
s += 1
}
e
}
def computeSum(xs: Seq[Int]): Long = {
xs match {
case xs: Array[Length] =>
var z = 0L
var s = 0
val e = xs.length
while (s < e) {
z = z + xs(s)
s += 1
}
e
}
}
val x: A = apply(i).asInstanceOf[A]
val y: B = f(x)
val x: AnyRef = array(i)
val y: Int = x.value + 1
val y: B = f(x)
result += y
val y: Int = x.value + 1
outArray(i) = new java.lang.Integer(y)
Conclusion:
Optimizing the data structures
would require wholeprogram analysis,
and is best left to the static compiler.
Use a spacebar or arrow keys to navigate