Tuesday, September 25, 2012

Strange Looping in St. Louis

I spent the last three days in St. Louis at the Strange Loop conference and the associated Emerging Languages Camp:


It was truly a great conference.  The two keynotes were exceptional.  Michael Stonebraker (of Ingres and Postgres fame) spoke about his new extraordinarily fast in-memory database VoltDB.  Jeff Hawkins (of Palm and "On Intelligence" fame) spoke about his theory of how the neocortex works, with special emphasis on Sparse Distributed Representations. He also talked about how his new company Numenta had figured out how to embody some of these theories in a program called Grok, which is actually useful for real-world problems such as predicting energy usage of a building twenty-four hours in advance, to allow for advance purchase of electricity at the best rates.

But it wasn't just the keynotes.  There was a wonderful grab-bag of talks on all sorts of subjects, from retargeting the Scala compiler to generate LLVM assembly language (rather than JVM bytecodes), to how Grand Central Dispatch works in iOS and Mac OS X, to ruminations on P vs. NP.  The crowd was heavily tilted toward functional programming, but there were still enough others around to keep things interesting.  And the forces for static typing vs. those for dynamic typing seem to have been closely matched in strength, so Scala and JavaScript were both very hot topics.

I had presented on ParaSail at the prior Emerging Languages Camp, and that seemed to disqualify me from presenting again this year.  But I did lead a (very small) unsession on the use of region-based storage management for parallel programming.  One benefit was that it forced me to coalesce my ideas on why region-based storage management is an excellent alternative to a global garbage-collected heap in a highly parallel environment.  Slides from this unsession are available at:


All in all, these past three days were a great mind-expanding experience.  I encourage anyone who has the time to make the effort to attend Strange Loop next year; I presume it will be held at about the same time of year in St. Louis again.

Monday, September 24, 2012

Work stealing and mostly lock-free access

ParaSail uses work stealing to schedule the picothreads that make up a ParaSail program.  With work stealing, there are a relatively small number of heavy-weight worker processes, roughly one per physical core/processor, each serving its own queue of picothreads (in a LIFO manner), and each periodically stealing a picothread from some other worker's queue (in FIFO order, so as to pick up a picothread that has been languishing on that worker's queue).  See the following blog entry for more discussion of work stealing:


What this means is that a worker's queue is referenced mostly by only one process, namely the owner of the queue.
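As a rough illustration, the owner-LIFO / thief-FIFO discipline on a worker's queue might look like the following sketch (names are mine, and a per-operation lock stands in for the lock-free deque a real work-stealing scheduler would use):

```python
import collections
import threading

class WorkerDeque:
    # One worker's picothread queue: the owning worker pushes and pops
    # at the "hot" end (LIFO), while thieves steal from the other end
    # (FIFO), picking up the picothread that has languished longest.
    def __init__(self):
        self._deque = collections.deque()
        self._lock = threading.Lock()   # stand-in for a lock-free deque

    def push(self, picothread):
        # Owner adds newly spawned work at the hot end.
        with self._lock:
            self._deque.append(picothread)

    def pop(self):
        # Owner takes the most recently pushed picothread (LIFO).
        with self._lock:
            return self._deque.pop() if self._deque else None

    def steal(self):
        # A thief takes the oldest picothread (FIFO).
        with self._lock:
            return self._deque.popleft() if self._deque else None
```

Because the owner works at one end and thieves at the other, the two interfere only when the queue is nearly empty.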

A similar situation arises in the region-based storage management used in ParaSail.  A region is created when a new scope is entered, and most of the allocation and deallocation within a region is done by the worker that created the scope.  But due to work stealing, some other worker might be executing a picothread that is expanding or shrinking an object associated with the region, so some synchronization is necessary in this case.
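In outline, a region tied to a scope might be modeled like this (a toy, single-worker sketch with hypothetical names; it ignores the cross-worker synchronization issue just described):

```python
from contextlib import contextmanager

class Region:
    # Holds every object allocated while its scope is active.
    def __init__(self):
        self._objects = []

    def allocate(self, obj):
        # The object lives as long as the region itself.
        self._objects.append(obj)
        return obj

@contextmanager
def new_scope():
    # A region is created on scope entry and released wholesale on exit.
    region = Region()
    try:
        yield region
    finally:
        region._objects.clear()
```

No per-object deallocation is needed: exiting the scope reclaims the whole region at once, which is why most allocation and deallocation naturally stays with the worker that created the scope.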

So what sort of synchronization should be used in these situations, where most of the access to a resource comes from an owning worker process, but some comes from other, non-owning workers?  We could use a traditional lock-based mutex all of the time, but this slows down the common case where all the access comes from the owner.  We could use general N-way lock-free synchronization, but this generally requires some kind of atomic compare-and-swap and involves busy waiting.  Atomic compare-and-swap is not always portably available from a high-level language, and busy waiting presumes that there are never more worker processes than available physical processors/cores, so that the current holder of the lock-free resource is actually making progress while the other workers busy-wait.

So for ParaSail we are adopting a middle ground between fully lock-based synchronization and N-way lock-free synchronization, which recognizes the asymmetric nature of the problem, namely that one process, the owner, will be performing most of the references.  With the adopted solution, we only need atomic load and store, rather than atomic compare-and-swap, and there is never any busy waiting, so we can run on top of, for example, a time-slicing operating system, where some worker processes might be preempted.

So what is the adopted solution?  For a given resource, we have two flags which are atomic variables, one mutex, and a queue, named as follows:
  • Owner-wants-resource flag
  • Nonowner-wants-resource flag
  • Resource mutex
  • Nonowner-waiting queue
When the owner wants to use the resource:
  • Owner sets the owner-wants-resource flag atomically;
  • It then checks the nonowner-wants-resource flag:
    •  If nonowner-wants-resource flag is set:
      • Owner calls the mutex lock operation;
      • Owner manipulates the resource;
      • Owner clears the owner-wants-resource flag;
      • <<Check_Queue>> Owner then checks the nonowner-waiting queue:
        • If the queue is empty, it clears the nonowner-wants-resource flag;
        • If the queue is not empty, it wakes up one of the waiting nonowners;
      • Owner calls the mutex unlock operation (note that this might be combined with the above waking up of one of the nonowners -- e.g. using a lock handoff).
    •  If nonowner-wants-resource flag is not set:
      • Owner manipulates the resource;
      • Owner clears the owner-wants-resource flag;
      • Owner rechecks the nonowner-wants-resource flag:
        • If nonowner-wants-resource flag is now set:
          • Owner calls the mutex lock operation;
          • Owner does the <<Check_Queue>> operation (see above);
          • Owner calls the mutex unlock operation (note that this might be combined with the waking up of one of the nonowners by Check_Queue -- e.g. using a lock handoff).
When a nonowner wants to use the resource:
  • Nonowner calls the mutex lock operation;
  • Nonowner sets the nonowner-wants-resource flag atomically;
  • Nonowner checks the owner-wants-resource flag;
    • While owner-wants-resource flag is set:
      • Nonowner adds itself to the nonowner-waiting queue;
      • When woken up, reacquire the lock (or get lock automatically via a handoff from the process that woke us up);
  • Nonowner manipulates the resource;
  • Nonowner does the <<Check_Queue>> operation (see above);
  • Nonowner calls the mutex unlock operation (note that this might be combined with the waking up of another nonowner by Check_Queue -- e.g. using a lock handoff).
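The steps above can be sketched in Python (a hedged model: the class and method names are mine, the interpreter lock stands in for atomic load/store, and a condition variable plus a waiter count models the nonowner-waiting queue; it also assumes a single owner thread, per the problem statement):

```python
import threading

class AsymmetricLock:
    def __init__(self):
        self.owner_wants = False       # owner-wants-resource flag
        self.nonowner_wants = False    # nonowner-wants-resource flag
        self.mutex = threading.Lock()  # resource mutex
        self.queue = threading.Condition(self.mutex)  # nonowner-waiting queue
        self.waiters = 0               # how many nonowners are queued

    def _check_queue(self):
        # <<Check_Queue>> -- must be called holding the mutex.
        if self.waiters == 0:
            self.nonowner_wants = False   # queue empty: clear the flag
        else:
            self.queue.notify()           # wake one waiting nonowner

    def owner_do(self, action):
        self.owner_wants = True
        if self.nonowner_wants:           # contended: fall back on the mutex
            with self.mutex:
                action()
                self.owner_wants = False
                self._check_queue()
        else:                             # common case: no locking at all
            action()
            self.owner_wants = False
            if self.nonowner_wants:       # recheck, so no nonowner waits forever
                with self.mutex:
                    self._check_queue()

    def nonowner_do(self, action):
        with self.mutex:                  # nonowners always lock
            self.nonowner_wants = True
            while self.owner_wants:       # owner busy: join the queue
                self.waiters += 1
                self.queue.wait()         # releases the mutex while waiting
                self.waiters -= 1
            action()
            self._check_queue()
```

Note how the common owner path touches only the two flags, while every nonowner path runs entirely under the mutex, mirroring the asymmetry described above.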
How do we know this approach is safe?  We need to prove that the resource is never manipulated simultaneously by the owner and a nonowner.  This can only happen if the owner decides to not use the mutex, since otherwise the manipulation happens under protection of the mutex lock.  We know that the owner sets the owner-wants-resource flag before checking the nonowner-wants-resource flag, and similarly a nonowner sets the nonowner-wants-resource flag before checking the owner-wants-resource flag.  Therefore, if the owner decides to bypass the mutex, while a nonowner is going after the resource simultaneously, the nonowner must not yet have checked the owner-wants-resource flag (think about it!). If later the nonowner does reach the check on the owner-wants-resource before the owner is done, it will put itself onto a queue rather than immediately manipulating the resource.

How do we know this approach does not leave a nonowner waiting on the queue forever?  We know the owner rechecks the nonowner-wants-resource flag after clearing the owner-wants-resource flag, so the owner will never miss the possibility that a nonowner queued itself while the owner was manipulating the resource.

So what does this approach accomplish?  We see that the owner only uses a lock-based mutex when it bumps into a nonowner that is simultaneously manipulating the resource.  On the other hand, a nonowner always uses a lock-based mutex, and in addition it uses a queue if it happens to bump into the owner simultaneously manipulating the resource.  As mentioned above, this approach also avoids the need for atomic compare-and-swap, as well as avoiding the need for busy waiting.

Wednesday, September 19, 2012

ParaSail standard library prefix?

Due to popular demand, we are going to try to solidify and extend the ParaSail standard library.  One practical question is the naming convention for the ParaSail standard library.  C++ uses "std::" as the namespace for the standard library.  Java uses "java." or "javax." or "com.sun." or "com.oracle.".  C# (and .NET in general) uses "System." as the prefix.  Ada uses "Ada." (as well as "System." and "Interfaces.").

So here are some possibilities for ParaSail:
  • Standard::
  • System::
  • ParaSail::
  • PS::
  • PSL::
  • ??
I think I am currently favoring "System::", because "ParaSail::" has those two upper-case letters, which would be a bit of a pain to type, and the others don't seem to capture the idea as well.  Note that we will probably arrange things so that you rarely need to write the "System::" prefix, but if there is ambiguity, then it would be necessary.


Friday, September 14, 2012

Rev 3.4 alpha 0.5 release of ParaSail now supports lambda expressions, etc.

We just released a new version (3.4 alpha 0.5) of the ParaSail compiler and virtual machine, available at the same URL as before:


This new release includes support for declaring and passing operations as parameters of other operations.  The actual operation passed can be a nested operation, an operation of a module, or a lambda expression.  (Passing operations as parameters in module instantiations is not yet supported.)

Additional updates in this release:
  1. optional outputs of an operation are now default initialized to the null value;
  2. nested operations can now correctly make up-level references to variables of enclosing operations;
  3. the Univ_String module now includes operations To_Vector and From_Vector for converting a Univ_String to/from a Vector of Univ_Characters; this is useful because Univ_String values are immutable;
  4. the ZString module includes corresponding operations To_ZVector and From_ZVector for converting to/from a ZVector of Univ_Characters.
Here is an example of declaring an operation as a parameter of another operation, and passing a lambda expression as the actual parameter in a call:
interface Mapper<Element_Type is Assignable<>> is
    func Apply
      (func Op(Input : Element_Type) -> Element_Type;
       Vec : Vector<Element_Type>) 
      -> Vector<Element_Type>;
      // Apply "Op" to each element of vector and return the result
end interface Mapper;

class Mapper is
    func Apply
      (func Op(Input : Element_Type) -> Element_Type;
       Vec : Vector<Element_Type>) 
      -> Vector<Element_Type> is
      // Apply "Op" to each element of vector and return the result
       return [for I in 1..Length(Vec) => Op(Vec[I])];
    end func Apply;
end class Mapper;

. . .

type Univ_Int_Mapper is Mapper<Univ_Integer>;

const Squares : Vector<Univ_Integer> := [for I in 1..100 => I ** 2];
const Triple_Squares : Vector<Univ_Integer> := 
       Univ_Int_Mapper::Apply
         (lambda (X : Univ_Integer) -> Univ_Integer is (X * 3), Squares);
           // Triple each element of Squares