Efficient Sequence Operators: Seq.breakBy

I wanted to create a sequence function which would break a sequence into a number of equally sized chunks (e.g. to create a sequence of 4 byte arrays out of a byte stream to pass it to System.BitConverter). Here is my first attempt

~> cat seq1.fs
namespace Microsoft.FSharp.Collections
  module Seq =
    let breakByV1 windowSize (source:seq<_>) =
      source |> Seq.windowed windowSize
      |> Seq.mapi (fun i j -> if i%windowSize=0 then Some j else None)
      |> Seq.choose (fun i -> i)
~> fsc.exe --target:library seq1.fs
~> fsi.exe

Microsoft (R) F# 2.0 Interactive build 2.0.0.0
Copyright (c) Microsoft Corporation. All Rights Reserved.

For help type #help;;
> #r "seq1.dll";;

--> Referenced 'seq1.dll'

> open Microsoft.FSharp.Collections;;
> {1..100} |> Seq.breakByV1 3;;
val it : seq<int []> =
  seq [[|1; 2; 3|]; [|4; 5; 6|]; [|7; 8; 9|]; [|10; 11; 12|]; ...]

The file ‘seq1.fs’ module defines a function ‘Seq.breakByV1’ inside namespace ‘Microsoft.FSharp.Collections’ and module ‘Seq’. This allows naming and access of the as ‘Seq.breakByV1’ function as it were part of the original Microsoft Seq module.

Let’s do some timing on the first implementation:

> #time "on";;

--> Timing now on

> let n=1000000;; let chunk=100;;
> {1..n} |> Seq.breakByV1 chunk |> Seq.nth (n/chunk-1);;
Real: 00:00:02.588, CPU: 00:00:00.000, GC gen0: 39
val it : int [] = 
[|999901; 999902; 999903; 999904; 999905; 999906; 999907; 999908; 999909;
  999910; 999911; 999912; 999913; 999914; 999915; 999916; 999917; 999918;
  999919; 999920; 999921; 999922; 999923; 999924; 999925; 999926; 999927;
  999928; 999929; 999930; 999931; 999932; 999933; 999934; 999935; 999936;
  999937; 999938; 999939; 999940; 999941; 999942; 999943; 999944; 999945;
  999946; 999947; 999948; 999949; 999950; 999951; 999952; 999953; 999954;
  999955; 999956; 999957; 999958; 999959; 999960; 999961; 999962; 999963;
  999964; 999965; 999966; 999967; 999968; 999969; 999970; 999971; 999972;
  999973; 999974; 999975; 999976; 999977; 999978; 999979; 999980; 999981;
  999982; 999983; 999984; 999985; 999986; 999987; 999988; 999989; 999990;
  999991; 999992; 999993; 999994; 999995; 999996; 999997; 999998; 999999;
  1000000|]

As you can see this implementation takes approximately 2.5 seconds to cut a sequence with 1 million entries into chunks with 100 elements each. However, this implementation is using 3 sequence operators (Seq.windowed, Seq.mapi, Seq.choose) in a not too straightforward fashion, therefore, some optimization should be possible.

Since the F# compiler and library sources were released into open source, there is a good opportunity to learn from the core library implementations. Therefore, let’s have look at the implementation of ‘Seq.windowed’ inside seq.fs of the F# Compiler (get it here):

~> cat compiler/2.0/Nov2010/src/fsharp/FSharp.Core/seq.fs
...
       // windowed : int -> seq<'T> -> seq<'T[]>
        [<CompiledName("Windowed")>]
        let windowed windowSize (source: seq<_>) =    
            checkNonNull "source" source
            if windowSize <= 0 then invalidArg "windowSize" (SR.GetString(SR.inputMustBeNonNegative))
            seq { let arr = Microsoft.FSharp.Primitives.Basics.Array.zeroCreateUnchecked windowSize
                  let r = ref (windowSize-1)
                  let i = ref 0
                  use e = source.GetEnumerator()
                  while e.MoveNext() do
                      arr.[!i] <- e.Current
                      i := (!i + 1) % windowSize
                      if !r = 0 then
                          yield Array.init windowSize (fun j -> arr.[(!i+j) % windowSize])
                      else
                      r := (!r - 1) }
...

The Microsoft implementation is implemented in a not too functional fashion, with reference values and assignments, however, speed improvement justifies this low level plumbing as we will see in a moment.

The Seq.breakByV2 function is an implementation following the structure of Seq.Windowed as seen above.

~> cat seq1.fs
namespace Microsoft.FSharp.Collections
  module Seq =
    let breakByV1 windowSize (source:seq<_>) =
      source |> Seq.windowed windowSize
      |> Seq.mapi (fun i j -> if i%windowSize=0 then Some j else None)
      |> Seq.choose (fun i -> i)

    let breakByV2 windowSize (source: seq<_>) =
      seq { let arr = Array.create windowSize (Seq.nth 0 source)
            let i = ref 0
            use e = source.GetEnumerator()
            while e.MoveNext() do
              arr.[!i] <- e.Current
              i := !i + 1
              if !i = windowSize then
                yield Array.init windowSize (fun j -> arr.[j])
                i := 0
          }
~> fsc.exe --target:library seq1.fs
~> fsi.exe

Microsoft (R) F# 2.0 Interactive build 2.0.0.0
Copyright (c) Microsoft Corporation. All Rights Reserved.

For help type #help;;
> #r "seq1.dll";;

--> Referenced 'seq1.dll'

> open Microsoft.FSharp.Collections;;

Let’s do the same benchmark as done before on ‘Seq.breakByV1’:

> {1..n} |> Seq.breakByV2 chunk |> Seq.nth (n/chunk-1);;
Real: 00:00:00.118, CPU: 00:00:00.000, GC gen0: 1
val it : int [] =
  [|999901; 999902; 999903; 999904; 999905; 999906; 999907; 999908; 999909;
    999910; 999911; 999912; 999913; 999914; 999915; 999916; 999917; 999918;
    999919; 999920; 999921; 999922; 999923; 999924; 999925; 999926; 999927;
    999928; 999929; 999930; 999931; 999932; 999933; 999934; 999935; 999936;
    999937; 999938; 999939; 999940; 999941; 999942; 999943; 999944; 999945;
    999946; 999947; 999948; 999949; 999950; 999951; 999952; 999953; 999954;
    999955; 999956; 999957; 999958; 999959; 999960; 999961; 999962; 999963;
    999964; 999965; 999966; 999967; 999968; 999969; 999970; 999971; 999972;
    999973; 999974; 999975; 999976; 999977; 999978; 999979; 999980; 999981;
    999982; 999983; 999984; 999985; 999986; 999987; 999988; 999989; 999990;
    999991; 999992; 999993; 999994; 999995; 999996; 999997; 999998; 999999;
    1000000|]

The new implementation Seq.breakByV2 just takes 118 miliseconds, thats an improvement of more than 20 times compared to Seq.breakByV1.

Further reading:

This entry was posted in Algorithms, Language. Bookmark the permalink.

1 Response to Efficient Sequence Operators: Seq.breakBy

  1. Pingback: Reading and writing WAV audio files | 2#4u

Leave a comment