24 Jul 2008

Planet OpenSolaris

Adam Leventhal: Hybrid Storage Pools: The L2ARC

I've written recently about the hybrid storage pool (HSP), using ZFS to augment the conventional storage stack with flash memory. The resulting system improves performance, cost, density, capacity, power dissipation - pretty much every axis of importance.

An important component of the HSP is something called the second level adaptive replacement cache (L2ARC). This allows ZFS to use flash as a caching tier that falls between RAM and disk in the storage hierarchy, and permits huge working sets to be serviced with latencies under 100us. My colleague, Brendan Gregg, implemented the L2ARC, and has written a great summary of how the L2ARC works and some concrete results. Using the L2ARC, Brendan was able to achieve an 830% improvement in performance. Compare that to 15K RPM drives, which will improve performance by at most 200-300% while costing more, using more power, and delivering less total capacity than Brendan's configuration. Score one for the hybrid storage pool!

24 Jul 2008 12:09am GMT

23 Jul 2008

Dave Stewart: I'm speaking tomorrow at OSCON


OSCON 2008

( I know it's a little late, but I took a little mental vacation here for a while, so sue me! )

If you happen to be at OSCON in Portland, OR this week, I would invite you to come by and introduce yourself at my talk. It's at 10:45 - 11:30 in room E141, which is quite close to the Expo.

I'm speaking on the topic: OpenSolaris and Intel: Greybeards no More. The idea was to contrast the prevailing notion that "Solaris" is "UNIX" and thus is for a bunch of grey-bearded old poops with no social skills and dried food on their shirts.

Anyway, we'll also talk about the things Intel is working on with this project and get some good feedback.

(For my part, I really hope I don't see a photo of myself from the 80s, complete with beard. Or, even worse, photoshopped on my current mug...)

23 Jul 2008 10:08pm GMT

Jim Grisanzio: Second Quake in a Week

We had a pretty big earthquake in Japan on the northern part of Honshu (the main island) about 45 min ago. It was 6.9 up there, but probably about 4 or so in Tokyo where I am. A 4 is not big here, but this one shook for about a minute. Most small quakes around here are about 15 seconds or so, but this one kept on going. There were pre-warnings for this, and I saw many reports on Twitter from friends all around Japan during and after. The NHK dudes were on TV immediately, too.

23 Jul 2008 4:18pm GMT

Ben Rockwood: DTrace IP Provider... Oh no you didn't....

In my previous post about the IP Provider I got the following comment: "There is nothing unpleasant about the wonderfulness that is tcpdump! You'll need to put a lot of work in to match tcpdump's usefulness with Dtrace…"

That just sounds like a challenge. Bring it on! Can snoop or tcpdump do this?

root@ultra ~$ ./ip_whosent.d 
Packet sent to 192.168.100.4: 88 byte packet on behalf of ssh (PID: 1075)
Packet sent to 192.168.100.4: 88 byte packet on behalf of ssh (PID: 1075)
Packet sent to 208.67.222.222: 56 byte packet on behalf of nscd (PID: 152)
Packet sent to 208.67.222.222: 71 byte packet on behalf of nscd (PID: 152)
Packet sent to 208.67.222.222: 56 byte packet on behalf of nscd (PID: 152)
Packet sent to 72.14.207.99: 52 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 8.12.32.9: 52 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 54 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 87 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 58 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 64 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 8.12.32.9: 65 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 208.67.219.230: 644 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 637 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 72.14.207.99: 660 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 52 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 208.67.219.230: 664 byte packet on behalf of firefox-bin (PID: 1944)
Packet sent to 8.12.32.9: 48 byte packet on behalf of thunderbird-bin (PID: 1133)
Packet sent to 72.14.207.99: 40 byte packet on behalf of firefox-bin (PID: 1944)
^C

Here is the script:

#!/usr/sbin/dtrace -qs 



ip:ip:*:send
/execname != "sched"/
{ 
        printf("Packet sent to %s: %d byte packet on behalf of %s (PID: %d)\n", 
                        args[2]->ip_daddr, args[4]->ipv4_length, execname, pid ); 
}

Oh but wait....... how about a full call stack on each sent packet? Just add a new line to the above script: stack();

root@ultra ~$ ./ip_sentstack.d 
Packet sent to 72.14.207.99: 84 byte packet on behalf of ping (PID: 2020)

              ip`ip_wput_ire+0x21f5
              ip`ire_send+0x1c9
              ip`ire_add_then_send+0x2b9
              ip`ip_newroute+0xa0a
              ip`ip_output_options+0x18c7
              ip`icmp_wput+0x44a
              unix`putnext+0x22b
              genunix`strput+0x1ad
              genunix`kstrputmsg+0x261
              sockfs`sosend_dgram+0x26e
              sockfs`sotpi_sendmsg+0x4a8
              sockfs`sendit+0x160
              sockfs`sendto+0x8e
              sockfs`sendto32+0x2d
              unix`sys_syscall32+0x101

Or check out one of the examples on the IP Provider wiki page (this is almost certainly by Brendan Gregg):

# ./ipio.d
 CPU  DELTA(us)          SOURCE               DEST      INT  BYTES
   1     598913    10.1.100.123 ->   192.168.10.75  ip.tun0     68
   1         73   192.168.1.108 ->     192.168.5.1     nge0    140
   1      18325   192.168.1.108 <-     192.168.5.1     nge0    140
   1         69    10.1.100.123 <-   192.168.10.75  ip.tun0     68
   0     102921    10.1.100.123 ->   192.168.10.75  ip.tun0     20
   0         79   192.168.1.108 ->     192.168.5.1     nge0     92

Here is the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=10hz

dtrace:::BEGIN
{
        printf(" %3s %10s %15s    %15s %8s %6s\n", "CPU", "DELTA(us)",
            "SOURCE", "DEST", "INT", "BYTES");
        last = timestamp;
}

ip:::send
{
        this->elapsed = (timestamp - last) / 1000;
        printf(" %3d %10d %15s -> %15s %8s %6d\n", cpu, this->elapsed,
            args[2]->ip_saddr, args[2]->ip_daddr, args[3]->ill_name,
            args[2]->ip_plength);
        last = timestamp;
}

ip:::receive
{
        this->elapsed = (timestamp - last) / 1000;
        printf(" %3d %10d %15s <- %15s %8s %6d\n", cpu, this->elapsed,
            args[2]->ip_daddr, args[2]->ip_saddr, args[3]->ill_name,
            args[2]->ip_plength);
        last = timestamp;
}

Can DTrace decrypt IPsec ESP payloads? No. Ok, so tcpdump isn't dead yet, but the capabilities offered by DTrace are far deeper. I've got a ton more ideas that I could put here, but don't have time atm. DTrace for the win!

23 Jul 2008 9:01am GMT

Simon Phipps: Un-Booth at OSCON

One of the perennial problems of sponsoring an open source conference is that the organisers always seem to want the sponsorship to pay for an exhibition booth. Exhibition booths need furnishing and decorating. They need things to exhibit. They need staffing. Most of this would be fine at a traditional exhibition, but at an open source conference there aren't many people attending to choose things to buy and thus the sales staff aren't keen to do all the above.

So what should we do with that booth? An approach we first tried at FISL a few years ago was to stop treating it as a selling space and start treating it as a social space. This year at OSCON in Portland we've decided to open up and dedicate our booth to hosting a micro-unconference. We've set it up with whiteboards, tables, electrical outlets and fresh coffee. And if having a place to veg isn't enough, we've invited all comers to deliver lightning talks throughout the two days. There are still a few slots on the agenda if you want to deliver a talk, but the quality of the speakers already listed is high (check out Monty's talk on Maria for example).

By the way, the legendary (or is that "mythical") Sun FOSS Party is back again this year, 8pm in the parking garage at the Doubletree hotel on Wednesday (July 23). Loads of cool diversions and I gather there is plenty more to drink this year than last. All welcome.

23 Jul 2008 5:54am GMT

Brendan Gregg: ZFS L2ARC

An exciting new ZFS feature has now become publicly known - the second level ARC, or L2ARC. I've been busy with its development for over a year, however this is my first chance to post about it. This post will show a quick example and answer some basic questions.

Background in a nutshell

The "ARC" is the ZFS main memory cache (in DRAM), which can be accessed with sub microsecond latency. An ARC read miss would normally read from disk, at millisecond latency (especially random reads). The L2ARC sits in-between, extending the main memory cache using fast storage devices - such as flash memory based SSDs (solid state disks).


[Figures: old model; new model; with ZFS]


Some example sizes to put this into perspective, from a lab machine named "walu":

Layer          Medium      Total Capacity
ARC            DRAM        128 Gbytes
L2ARC          6 x SSDs    550 Gbytes
Storage Pool   44 Disks    17.44 Tbytes (mirrored)

For this server, the L2ARC allows around 650 Gbytes to be stored in the total ZFS cache (ARC + L2ARC), rather than just DRAM with about 120 Gbytes.

A previous ZFS feature (the ZIL) allowed you to add SSD disks as log devices to improve write performance. This means ZFS provides two dimensions for adding flash memory to the file system stack: the L2ARC for random reads, and the ZIL for writes.

Adam has been the mastermind behind our flash memory efforts, and has written an excellent article in Communications of the ACM about flash memory based storage in ZFS; for more background, check it out.

L2ARC Example

To illustrate the L2ARC with an example, I'll use walu - a medium sized server in our test lab, which was briefly described above. Its ZFS pool of 44 x 7200 RPM disks is configured as a 2-way mirror, to provide both good reliability and performance. It also has 6 SSDs, which I'll add to the ZFS pool as L2ARC devices (or "cache devices").

I should note - this is an example of L2ARC operation, not a demonstration of the maximum performance that we can achieve (the SSDs I'm using here aren't the fastest I've ever used, nor the largest.)

20 clients access walu over NFSv3, and execute a random read workload with an 8 Kbyte record size across 500 Gbytes of files (which is also its working set).

1) disks only

Since the 500 Gbytes of working set is larger than walu's 128 Gbytes of DRAM, the disks must service many requests. One way to grasp how this workload is performing is to examine the IOPS that the ZFS pool delivers:

walu# zpool iostat pool_0 30
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write  
----------  -----  -----  -----  -----  -----  -----
pool_0      8.38T  9.06T     95      4   762K  29.1K
pool_0      8.38T  9.06T  1.87K     15  15.0M  30.3K
pool_0      8.38T  9.06T  1.88K      3  15.1M  20.4K
pool_0      8.38T  9.06T  1.89K     16  15.1M  39.3K
pool_0      8.38T  9.06T  1.89K      4  15.1M  23.8K
[...]

The pool is pulling about 1.89K ops/sec, which would require about 42 ops per disk of this pool. To examine how this is delivered by the disks, we can either use zpool iostat or the original iostat:

walu# iostat -xnz 10
[...trimmed first output...]
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   43.9    0.0  351.5    0.0  0.0  0.4    0.0   10.0   0  34 c0t5000CCA215C46459d0  
   47.6    0.0  381.1    0.0  0.0  0.5    0.0    9.8   0  36 c0t5000CCA215C4521Dd0
   42.7    0.0  349.9    0.0  0.0  0.4    0.0   10.1   0  35 c0t5000CCA215C45F89d0
   41.4    0.0  331.5    0.0  0.0  0.4    0.0    9.6   0  32 c0t5000CCA215C42A4Cd0
   45.6    0.0  365.1    0.0  0.0  0.4    0.0    9.2   0  34 c0t5000CCA215C45541d0
   45.0    0.0  360.3    0.0  0.0  0.4    0.0    9.4   0  34 c0t5000CCA215C458F1d0
   42.9    0.0  343.5    0.0  0.0  0.4    0.0    9.9   0  33 c0t5000CCA215C450E3d0
   44.9    0.0  359.5    0.0  0.0  0.4    0.0    9.3   0  35 c0t5000CCA215C45323d0
   45.9    0.0  367.5    0.0  0.0  0.5    0.0   10.1   0  37 c0t5000CCA215C4505Dd0
[...etc...]

iostat is interesting as it lists the service times: wsvc_t + asvc_t. These I/Os are taking on average between 9 and 10 milliseconds to complete, which the client application will usually suffer as latency. This time will be due to the random read nature of this workload - each I/O must wait as the disk heads seek and the disk platter rotates.

Another way to understand this performance is to examine the total NFSv3 ops delivered by this system (these days I use a GUI to monitor NFSv3 ops, but for this blog post I'll hammer nfsstat into printing something concise):

walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d' 
[...]
Server NFSv3:
calls     badcalls 
2260      0
Server NFSv3:
calls     badcalls
2306      0
Server NFSv3:
calls     badcalls
2239      0
[...]

That's about 2.27K ops/sec for NFSv3; I'd expect 1.89K of that to be what our pool was delivering, and the rest are cache hits out of DRAM, which is warm at this point.
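The arithmetic behind these disk-only figures can be sketched in a few lines of Python. This is not part of the original post: the inputs are simply the numbers quoted in the text and command output above, and it assumes that each NFSv3 op which misses DRAM turns into roughly one pool read.

```python
# Figures quoted in the text and command output above.
pool_read_ops = 1890               # ~1.89K read ops/sec from zpool iostat
disks = 44                         # disks in pool_0
nfs_samples = [2260, 2306, 2239]   # NFSv3 calls/sec from nfsstat

# Average read load each disk has to carry.
ops_per_disk = pool_read_ops / disks

# Average NFSv3 ops/sec, and the part not explained by pool reads.
# Assumption: one NFSv3 op that misses DRAM costs about one pool read.
nfs_ops = sum(nfs_samples) / len(nfs_samples)
dram_hits = nfs_ops - pool_read_ops
dram_hit_share = dram_hits / nfs_ops

print(f"ops per disk:   {ops_per_disk:.1f}")
print(f"NFSv3 ops/sec:  {nfs_ops:.0f}")
print(f"DRAM hit share: {dram_hit_share:.1%}")
```

The per-disk result lands in the same low-to-mid 40s range that the r/s column of iostat shows, which is a useful sanity check on the pool-level number.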

2) L2ARC devices

Now the 6 SSDs are added as L2ARC cache devices:

walu# zpool add pool_0 cache c7t0d0 c7t1d0 c8t0d0 c8t1d0 c9t0d0 c9t1d0 

And we wait until the L2ARC is warm.

Time passes ...

Several hours later the cache devices have warmed up enough to satisfy most I/Os which miss main memory. The combined 'capacity/used' column for the cache devices shows that our 500 Gbytes of working set now exists on those 6 SSDs:

walu# zpool iostat -v pool_0 30
[...]
                              capacity     operations    bandwidth
pool                        used  avail   read  write   read  write 
-------------------------  -----  -----  -----  -----  -----  -----
pool_0                     8.38T  9.06T     30     14   245K  31.9K
  mirror                    421G   507G      1      0  9.44K      0
    c0t5000CCA216CCB905d0      -      -      0      0  4.08K      0
    c0t5000CCA216CCB74Cd0      -      -      0      0  5.36K      0
  mirror                    416G   512G      0      0  7.66K      0
    c0t5000CCA216CCB919d0      -      -      0      0  4.34K      0
    c0t5000CCA216CCB763d0      -      -      0      0  3.32K      0
[... 40 disks truncated ...]
cache                          -      -      -      -      -      -
  c7t0d0                   84.5G  8.63G  2.63K      0  21.1M  11.4K
  c7t1d0                   84.7G  8.43G  2.62K      0  21.0M      0
  c8t0d0                   84.5G  8.68G  2.61K      0  20.9M      0
  c8t1d0                   84.8G  8.34G  2.64K      0  21.1M      0
  c9t0d0                   84.3G  8.81G  2.63K      0  21.0M      0
  c9t1d0                   84.2G  8.91G  2.63K      0  21.0M  1.53K
-------------------------  -----  -----  -----  -----  -----  -----

The pool_0 disks are still serving some requests (in this output 30 ops/sec) but the bulk of the reads are being serviced by the L2ARC cache devices - each providing around 2.6K ops/sec. The total delivered by this ZFS pool is 15.8K ops/sec (pool disks + L2ARC devices), about 8.4x faster than with disks alone.

This is confirmed by the delivered NFSv3 ops:

walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d' 
[...]
Server NFSv3:
calls      badcalls   
18729      0          
Server NFSv3:
calls      badcalls   
18762      0          
Server NFSv3:
calls      badcalls   
19000      0          
[...]

walu is now delivering 18.7K ops/sec, which is 8.3x faster than without the L2ARC.
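As a rough check of the two speedup figures, here is a small Python sketch (not from the original post) that recomputes them from the samples shown above. The per-device read rates are the rounded values printed by zpool iostat, so the results are approximate.

```python
# Read ops/sec per cache device, from the zpool iostat -v output above.
cache_read_ops = [2630, 2620, 2610, 2640, 2630, 2630]
pool_disk_read_ops = 30      # reads still served by the pool disks
disk_only_pool_ops = 1890    # pool read ops/sec before adding the L2ARC

# NFSv3 calls/sec sampled before and after adding the cache devices.
nfs_before = [2260, 2306, 2239]
nfs_after = [18729, 18762, 19000]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


pool_total = sum(cache_read_ops) + pool_disk_read_ops
pool_speedup = pool_total / disk_only_pool_ops
nfs_speedup = mean(nfs_after) / mean(nfs_before)

print(f"pool total ops/sec: {pool_total}")
print(f"pool speedup:       {pool_speedup:.1f}x")
print(f"NFSv3 speedup:      {nfs_speedup:.1f}x")
```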

However, the real win for the client applications is read latency. The disk-only iostat output showed an average between 9 and 10 milliseconds; the L2ARC cache devices are delivering the following:

walu# iostat -xnz 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
[...]
 2665.0    0.4 21317.2    0.0  0.7  0.7    0.2    0.2  39  67 c9t0d0 
 2668.1    0.5 21342.0    3.2  0.6  0.7    0.2    0.2  38  66 c9t1d0
 2665.4    0.4 21320.4    0.0  0.7  0.7    0.3    0.3  42  69 c8t0d0
 2683.6    0.4 21465.9    0.0  0.7  0.7    0.3    0.3  41  68 c8t1d0
 2660.7    0.6 21295.6    3.2  0.6  0.6    0.2    0.2  36  65 c7t1d0
 2650.7    0.4 21202.8    0.0  0.6  0.6    0.2    0.2  36  64 c7t0d0

Our average service time is between 0.4 and 0.6 ms (wsvc_t + asvc_t columns), which is about 20x faster than what the disks were delivering.
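The latency comparison can be reproduced from the two iostat listings. The Python sketch below is an illustration rather than anything from the original post: it adds the wsvc_t and asvc_t columns per device and compares the averages, using only the rows that appear in the trimmed output above.

```python
# (wsvc_t, asvc_t) in milliseconds, copied from the iostat rows above.
disk_rows = [
    (0.0, 10.0), (0.0, 9.8), (0.0, 10.1), (0.0, 9.6), (0.0, 9.2),
    (0.0, 9.4), (0.0, 9.9), (0.0, 9.3), (0.0, 10.1),
]
ssd_rows = [
    (0.2, 0.2), (0.2, 0.2), (0.3, 0.3), (0.3, 0.3), (0.2, 0.2), (0.2, 0.2),
]


def service_times(rows):
    """Total service time per device: wait queue time plus active time."""
    return [wsvc + asvc for wsvc, asvc in rows]


def mean(values):
    return sum(values) / len(values)


disk_ms = mean(service_times(disk_rows))
ssd_ms = mean(service_times(ssd_rows))
latency_ratio = disk_ms / ssd_ms

print(f"disk average: {disk_ms:.2f} ms")
print(f"SSD average:  {ssd_ms:.2f} ms")
print(f"ratio:        {latency_ratio:.0f}x")
```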

What this means ...

An 8.3x improvement for 8 Kbyte random IOPS across a 500 Gbyte working set is impressive, as is improving storage I/O latency by 20x.

But this isn't really about the numbers, which will become dated (these SSDs were manufactured in July 2008, by a supplier who is providing us with bigger and faster SSDs every month).

What's important is that ZFS can make intelligent use of fast storage technology, in different roles to maximize their benefit. When you hear of new SSDs with incredible ops/sec performance, picture them as your L2ARC; or if it were great write throughput, picture them as your ZIL.

The example above was to show that the L2ARC can deliver, over NFS, whatever these SSDs could do. And these SSDs are being used as a second level cache, in-between main memory and disk, to achieve the best price/performance.

Questions

I recently spoke to a customer about the L2ARC and they asked a few questions which may be useful to repeat here:

What is L2ARC?

The L2ARC is best pictured as a cache layer in-between main memory and disk, using flash memory based SSDs or other fast devices as storage. It holds non-dirty ZFS data, and is currently intended to improve the performance of random read workloads.

Isn't flash memory unreliable? What have you done about that?

It's getting much better, but we have designed the L2ARC to handle errors safely. The data stored on the L2ARC is checksummed, and if the checksum is wrong or the SSD reports an error, we defer that read to the original pool of disks. Enough errors and the L2ARC device will offline itself. I've even yanked out busy L2ARC devices on live systems as part of testing, and everything continues to run.
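The error handling described in that answer can be pictured with a toy model. The Python sketch below is not ZFS code: the `CacheDevice` class, the `max_errors` threshold and the use of CRC32 as the checksum are all assumptions made for illustration. It only shows the shape of the idea, which is to verify every cached block, fall back to the pool on any mismatch, and take the device offline after enough errors.

```python
import zlib


class CacheDevice:
    """Toy stand-in for an L2ARC device: stores (data, checksum) per block."""

    def __init__(self, max_errors=3):
        self.blocks = {}
        self.errors = 0
        self.max_errors = max_errors  # hypothetical offline threshold
        self.online = True

    def write(self, key, data):
        # CRC32 stands in for the real checksum; it is an assumption here.
        self.blocks[key] = (data, zlib.crc32(data))

    def read(self, key):
        """Return verified data, or None on a miss or a failed checksum."""
        if not self.online or key not in self.blocks:
            return None
        data, checksum = self.blocks[key]
        if zlib.crc32(data) != checksum:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.online = False  # too many errors: stop using the device
            return None
        return data


def read_block(key, cache, pool):
    """Try the cache device first; defer to the pool on any problem."""
    data = cache.read(key)
    if data is not None:
        return data, "l2arc"
    return pool[key], "pool"


pool = {"a": b"hello", "b": b"world"}
cache = CacheDevice(max_errors=2)
for key, value in pool.items():
    cache.write(key, value)

print(read_block("a", cache, pool))
# Simulate corruption on the cache device: data changes, checksum does not.
cache.blocks["a"] = (b"hellx", cache.blocks["a"][1])
print(read_block("a", cache, pool))
```

Because the cache only ever holds copies of data that also lives in the pool, the fallback is always safe: the worst case is a slower read.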

Aren't SSDs really expensive?

They used to be, but their price/performance has now reached the point where it makes sense to start using them in the coming months. See Adam's ACM article for more details about price/performance.

What about writes - isn't flash memory slow to write to?

The L2ARC is coded to write to the cache devices asynchronously, so write latency doesn't affect system performance. This allows us to use "read-bias" SSDs for the L2ARC, which have the best read latency (and slow write latency).

What's bad about the L2ARC?

It was designed to either improve performance or do nothing, so there isn't anything that should be bad. To explain what I mean by do nothing - if you use the L2ARC for a streaming or sequential workload, then the L2ARC will mostly ignore it and not cache it. This is because the default L2ARC settings assume you are using current SSD devices, where caching random read workloads is most favourable; with future SSDs (or other storage technology), we can use the L2ARC for streaming workloads as well.

Internals

If anyone is interested, I wrote a summary of L2ARC internals as a block comment in usr/src/uts/common/fs/zfs/arc.c, which is also surrounded by the actual implementation code. The block comment is below (see the source for the latest version), and is an excellent reference for how it really works:

/*
 * Level 2 ARC
 *
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 * It uses dedicated storage devices to hold cached data, which are populated
 * using large infrequent writes.  The main role of this cache is to boost
 * the performance of random read workloads.  The intended L2ARC devices
 * include short-stroked disks, solid state disks, and other media with
 * substantially faster read latency than disk.
 *
 *                 +-----------------------+
 *                 |         ARC           |
 *                 +-----------------------+
 *                    |         ^     ^
 *                    |         |     |
 *      l2arc_feed_thread()    arc_read()
 *                    |         |     |
 *                    |  l2arc read   |
 *                    V         |     |
 *               +---------------+    |
 *               |     L2ARC     |    | 
 *               +---------------+    |
 *                   |    ^           |
 *          l2arc_write() |           |
 *                   |    |           |
 *                   V    |           |
 *                 +-------+      +-------+
 *                 | vdev  |      | vdev  |
 *                 | cache |      | cache |
 *                 +-------+      +-------+
 *                 +=========+     .-----.
 *                 :  L2ARC  :    |-_____-|
 *                 : devices :    | Disks |
 *                 +=========+    `-_____-'
 *
 * Read requests are satisfied from the following sources, in order:
 *
 *      1) ARC
 *      2) vdev cache of L2ARC devices
 *      3) L2ARC devices
 *      4) vdev cache of disks
 *      5) disks
 *
 * Some L2ARC device types exhibit extremely slow write performance.
 * To accommodate for this there are some significant differences between
 * the L2ARC and traditional cache design:
 *
 * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
 * the ARC behave as usual, freeing buffers and placing headers on ghost
 * lists.  The ARC does not send buffers to the L2ARC during eviction as
 * this would add inflated write latencies for all ARC memory pressure.
 *
 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 * It does this by periodically scanning buffers from the eviction-end of
 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 * not already there.  It scans until a headroom of buffers is satisfied,
 * which itself is a buffer for ARC eviction.  The thread that does this is
 * l2arc_feed_thread(), illustrated below; example sizes are included to
 * provide a better sense of ratio than this diagram:
 *
 *             head -->                        tail
 *              +---------------------+----------+
 *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 *              +---------------------+----------+   |   o L2ARC eligible
 *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 *              +---------------------+----------+   |
 *                   15.9 Gbytes      ^ 32 Mbytes    |
 *                                 headroom          |
 *                                            l2arc_feed_thread()
 *                                                   |
 *                       l2arc write hand <--[oooo]--'
 *                               |           8 Mbyte
 *                               |          write max
 *                               V
 *                +==============================+
 *      L2ARC dev |####|#|###|###|    |####| ... |
 *                +==============================+
 *                           32 Gbytes
 *
 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 * evicted, then the L2ARC has cached a buffer much sooner than it probably
 * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
 * safe to say that this is an uncommon case, since buffers at the end of
 * the ARC lists have moved there due to inactivity.
 *
 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 * then the L2ARC simply misses copying some buffers.  This serves as a
 * pressure valve to prevent heavy read workloads from both stalling the ARC
 * with waits and clogging the L2ARC with writes.  This also helps prevent
 * the potential for the L2ARC to churn if it attempts to cache content too
 * quickly, such as during backups of the entire pool.
 *
 * 5. After system boot and before the ARC has filled main memory, there are
 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 * lists can remain mostly static.  Instead of searching from tail of these
 * lists as pictured, the l2arc_feed_thread() will search from the list heads
 * for eligible buffers, greatly increasing its chance of finding them.
 *
 * The L2ARC device write speed is also boosted during this time so that
 * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
 * there are no L2ARC reads, and no fear of degrading read performance
 * through increased writes.
 *
 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 * the vdev queue can aggregate them into larger and fewer writes.  Each
 * device is written to in a rotor fashion, sweeping writes through
 * available space then repeating.
 *
 * 7. The L2ARC does not store dirty content.  It never needs to flush
 * write buffers back to disk based storage.
 *
 * 8. If an ARC buffer is written (and dirtied) which also exists in the
 * L2ARC, the now stale L2ARC buffer is immediately dropped.
 *
 * The performance of the L2ARC can be tweaked by a number of tunables, which
 * may be necessary for different workloads:
 *
 *      l2arc_write_max         max write bytes per interval
 *      l2arc_write_boost       extra write bytes during device warmup
 *      l2arc_noprefetch        skip caching prefetched buffers
 *      l2arc_headroom          number of max device writes to precache
 *      l2arc_feed_secs         seconds between L2ARC writing
 *
 * Tunables may be removed or added as future performance improvements are
 * integrated, and also may become zpool properties.
 */

Jonathan recently linked to this block comment in a blog entry about flash memory, to show that ZFS can incorporate flash into the storage hierarchy, and here is the actual implementation.
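For readers who prefer code to diagrams, points 2 and 6 of the block comment can be sketched in Python. This is a deliberately simplified model and not the arc.c implementation: the function and class names, the list representation and the sizes are all made up for illustration. It shows a feed pass that scans from the eviction end of a list, stays within a headroom window, skips buffers already on the L2ARC, and stops at a write maximum; and a device written in rotor fashion.

```python
from collections import deque


def l2arc_feed(arc_list, on_l2arc, headroom, write_max):
    """Select buffers near the eviction end that are not yet cached.

    arc_list: deque of (name, size), head first, tail (next to evict) last.
    Returns the names chosen for one feed pass.
    """
    selected, scanned, written = [], 0, 0
    for name, size in reversed(arc_list):
        if scanned + size > headroom:
            break                 # stay inside the headroom window
        scanned += size
        if name in on_l2arc:
            continue              # already cached, nothing to write
        if written + size > write_max:
            break                 # this pass has used its write budget
        written += size
        selected.append(name)
    return selected


class RotorDevice:
    """Cache device written sequentially, wrapping around when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hand = 0             # the "write hand" from the diagram

    def write(self, size):
        """Return the offset used for this write, wrapping if needed."""
        if self.hand + size > self.capacity:
            self.hand = 0         # sweep back to the start of the device
        start = self.hand
        self.hand += size
        return start


arc_mru = deque([("a", 8), ("b", 8), ("c", 8), ("d", 8), ("e", 8)])
picked = l2arc_feed(arc_mru, on_l2arc={"e"}, headroom=24, write_max=16)
print(picked)

dev = RotorDevice(capacity=20)
print([dev.write(8) for _ in picked + ["next"]])
```

In this toy run the tail buffer "e" is skipped because it is already cached, and the third write wraps back to offset 0, which is the sweeping behaviour the comment describes.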

23 Jul 2008 4:48am GMT

Moinak Ghosh: Compression in Ramdisk - Dcfs


I hammered out the BeleniX 0.7.1 release recently after quite a bit of work. One of the unfortunate things is the inexorable growth of the Ramdisk size which is used as the root filesystem when booting off CD. Obviously the OpenSolaris kernel has been getting more and more new features and drivers which means more and more kernel modules. In addition the Nvidia driver weighs in at a hefty 7.5MB for a single kernel module!

For all these reasons the ramdisk in 0.7.1 is around 84MB in size when uncompressed. This inexorable increase in size means that it will be more of a problem going forward when trying to run the livecd on low-memory machines. In addition it is also a problem for distros like MilaX which try to create an image that is as small as possible.

The ramdisk in 0.7.1 would have been even larger had it not been for a small tweak. There are many modules under /kernel that are not necessary for booting. These can be easily placed under /usr/kernel, which will reside in the compressed file, freeing up space in the ramdisk. The BeleniX Constructor now relocates modules given in a list from /kernel to /usr/kernel. However, these modules are moved back to their original locations during harddisk installation to maintain consistency in the installed image. This list is not complete and more modules can be added.

In addition it will be nice to be able to compress files in the ramdisk akin to SquashFS. There is an ongoing SquashFS porting project on opensolaris.org. However it will also mean making the OpenSolaris kernel bootable off SquashFS. Recently I came to know of a little-known piece in OpenSolaris called Dcfs. A little OpenGrokking leads me to the following source code: onnv-gate/usr/src/uts/common/fs/dcfs/. It is just a single file and comments at the top explain all. This is a shim pseudo filesystem designed to work in conjunction with UFS. This adds an on-the-fly decompression layer above UFS. This is read-only compression and a command line utility called fiocompress is used to compress the files initially. This feature was added last Dec as part of the SPARC Newboot project that makes booting on SPARC similar to that on x86 using a boot_archive or ramdisk. That project employed this Dcfs feature to reduce size requirements of the ramdisk (very much needed since RISC binaries are bigger). So Dcfs can be used for BeleniX as well.

Now the interesting part came up when I spent half an hour reading through the code. The compression algorithm is identical to what is being done in Lofi Compression! The same header, index and segmented compression structure. Portions of the fiocompress code bear resemblance to the compression code in lofiadm. In addition Lofi Compression was introduced more than 2 years back in Feb 2006. This leaves no doubt as to where the algorithm came from even though the exact implementation is a little different. I am very happy to see my innovation making its way into other parts of OpenSolaris and adding value.
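For anyone unfamiliar with the idea of a header, an index and segmented compression, here is a minimal Python sketch of the general technique. It is not the lofi or Dcfs on-disk format: the dictionary layout, the 4096-byte segment size and the function names are assumptions chosen for illustration. The point is that compressing fixed-size segments separately and keeping an index of their offsets lets a reader decompress only the segments that a given read touches.

```python
import zlib

SEGMENT_SIZE = 4096  # hypothetical segment size


def compress_segments(data, segsize=SEGMENT_SIZE):
    """Compress data in fixed-size segments and build an offset index."""
    segments = [zlib.compress(data[i:i + segsize])
                for i in range(0, len(data), segsize)]
    index, offset = [], 0
    for seg in segments:
        index.append(offset)      # where each compressed segment starts
        offset += len(seg)
    index.append(offset)          # end marker, so segment n is index[n:n+2]
    return {
        "segsize": segsize,       # "header" fields
        "length": len(data),
        "index": index,
        "payload": b"".join(segments),
    }


def read_range(image, start, length):
    """Read uncompressed bytes, decompressing only the segments needed."""
    segsize = image["segsize"]
    end = min(start + length, image["length"])
    if start >= end:
        return b""
    first, last = start // segsize, (end - 1) // segsize
    out = bytearray()
    for n in range(first, last + 1):
        lo, hi = image["index"][n], image["index"][n + 1]
        out += zlib.decompress(image["payload"][lo:hi])
    skip = start - first * segsize
    return bytes(out[skip:skip + (end - start)])


data = bytes(range(256)) * 100
image = compress_segments(data)
assert read_range(image, 5000, 6000) == data[5000:11000]
print(len(data), "bytes stored as", len(image["payload"]), "compressed bytes")
```

A read-only layout like this suits a boot ramdisk well: files are compressed once when the image is built, and reads pay only for the segments they touch.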

However, this knowledge also caused me some grief. All this was done silently, giving no attribution to the original source. Even the comments in the Dcfs file do not mention where this algorithm comes from! This is unfortunate. I'd have expected the courtesy to give recognition/attribution where due.

23 Jul 2008 3:35am GMT

Jim Grisanzio: Solaris Book Author Coming to Japan

Solaris Application Programming author Darryl Gove will be in Japan on Friday to present at the Solaris Night Seminar in Tokyo. Hisayoshi Kato will also present. Key topic is DTrace.

23 Jul 2008 12:02am GMT

22 Jul 2008

OpenSolaris Observatory: Mounting an ISO

If you need to pull files from an ISO image file, it is unnecessary to first burn the image to a CD or DVD. By using the lofiadm command you can just mount the ISO and browse its contents.

The lofiadm command associates a file with a block device (you must provide an absolute path to the file). The device that becomes associated with the ISO is returned:

bleonard@opensolaris:~$ pfexec lofiadm -a ~/Desktop/sol-10-u5-ga-x86-dvd.iso 
/dev/lofi/1
bleonard@opensolaris:~$ 

Running lofiadm with no parameters will list the associated devices:

bleonard@opensolaris:~$ lofiadm
Block Device             File                           Options
/dev/lofi/1              /export/home/bleonard/Desktop/sol-10-u5-ga-x86-dvd.iso -
bleonard@opensolaris:~$ 

Now the device can be mounted:

bleonard@opensolaris:~$ pfexec mount -F hsfs /dev/lofi/1 /mnt
bleonard@opensolaris:~$ ls /mnt
boot  Copyright  installer  JDS-THIRDPARTYLICENSEREADME  License  Solaris_10
bleonard@opensolaris:~$

The lofiadm and mount steps can be combined into one as follows:

pfexec mount -F hsfs `pfexec lofiadm -a ~/Desktop/sol-10-u5-ga-x86-dvd.iso` /mnt

When finished, use the following to unmount and detach the image:

pfexec umount /mnt
pfexec lofiadm -d /dev/lofi/1 

22 Jul 2008 9:48pm GMT

Marcelo Leal: Hang in there, do it, don’t be a pain in the butt and don’t bump into the scenery

No, it's not my phrase, but I think it sums up the sysadmin's work… Actually that phrase is a quotation from Seu Jorge, talking about a theater school where the motto was the title of this post. Seu Jorge is a singer, songwriter, actor, and soundtrack composer. You can see him in "City of God" (wonderful movie), and in many Jazz festivals around the world.
peace.

22 Jul 2008 8:16pm GMT

Tim Foster: ... it's the only way to be sure

I stumbled on this tip today on planet.gnome.org about how to tune what gets displayed on your favourite planet - this has made me extremely happy, as I now get to have a userContent.css file that says:

@-moz-document domain(planet.opensolaris.org) {
  div.observatory div.person-info { display:none; }
  div.observatory div.post { display:none; }
}

Why? Well, in its own words, "Planet OpenSolaris is a window into the world, work and lives of OpenSolaris hackers and contributors." - more particularly, I don't feel it's the right place for documentation or marketing spiel about OpenSolaris - there are other places for that.

Don't get me wrong, I'm thrilled that those guys are writing content about OpenSolaris that Google will cache and end-users will benefit from - they're doing a fantastic job! Personally though, I go to planet.opensolaris.org to read what people think: I don't go there to read software documentation or watch hundreds of screen shots of installation wizards, let alone read about quintuple-boot setups (gak!). Come on guys - there's got to be real people behind the marketeers?

So, to paraphrase Lt. Ripley, I vote we take off and userContent.css those entire posts from orbit. It's the only way to be sure.

Of course, I could be wrong - if so, feel free to load your text editor of choice, and with feeling, type div.timf div.person-info ... I'll totally understand!

22 Jul 2008 5:38pm GMT

Jim Grisanzio: A Young Mind

Inside T. Boone Pickens' Brain: "I'd rather surround myself with sharp young minds than play golf and gin rummy all day." -- T. Boone Pickens. That attitude isn't just talk from a cocky oil billionaire. It just may be a critical component of staying young as you grow old. Very interesting article on brain research, and specifically about how this guy thinks.

22 Jul 2008 3:49pm GMT

Jim Grisanzio: New OpenSolaris Trademark Policy

Michelle posted the new OpenSolaris Trademark Policy. Nice to see this document out there. The FAQ has been updated too. Trademark discussions take place on trademark-policy-dev.

22 Jul 2008 12:30pm GMT

Ben Rockwood: DTrace IP Provider

Recently introduced (snv_92) is the first piece of the DTrace Network Providers, the DTrace IP Provider. Here is a taste:

root@ultra include$ dtrace -qn 'ip:ip:*:receive{ printf("Packet recieved from %s: %d byte packet\n", args[2]->ip_saddr, args[4]->ipv4_length ); }'
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 8.11.47.20: 88 byte packet
Packet recieved from 8.11.47.20: 216 byte packet
Packet recieved from 8.11.47.20: 200 byte packet
Packet recieved from 8.11.47.20: 136 byte packet
Packet recieved from 8.11.47.20: 104 byte packet
^C
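
Per-packet lines like these get long quickly, so a per-source total is often easier to read. A small sketch that post-processes the output above, assuming the exact line format printed by this one-liner and a POSIX awk; bytes_per_source is a hypothetical name:

```shell
#!/bin/sh
# Sum the byte counts per source address from the one-liner's output (read on stdin).
# Field 4 is the address followed by a colon, field 5 is the byte count.
bytes_per_source() {
    awk '$1 == "Packet" { sub(/:$/, "", $4); total[$4] += $5 }
         END { for (src in total) print src, total[src] }' | sort
}

# Demo with the output captured above; on a live system pipe dtrace into the function.
bytes_per_source <<'EOF'
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 74.125.15.85: 40 byte packet
Packet recieved from 8.11.47.20: 88 byte packet
Packet recieved from 8.11.47.20: 216 byte packet
Packet recieved from 8.11.47.20: 200 byte packet
Packet recieved from 8.11.47.20: 136 byte packet
Packet recieved from 8.11.47.20: 104 byte packet
EOF
# prints:
# 74.125.15.85 80
# 8.11.47.20 744
```

DTrace aggregations can likely do the same summing inside the script itself; the awk version is only meant to show what the two printed fields contain.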

Pretty soon snoop and tcpdump will be nothing more than unpleasant memories. :)

A big thank you to the DTrace Team!!!

22 Jul 2008 1:43am GMT

21 Jul 2008

OpenSolaris Observatory: Triple Boot, Part 4: Windows via VirtualBox

After setting up my triple-boot system, I was not content. I want access to Windows Vista and Ubuntu 8.04 without having to restart my system. I mostly run OpenSolaris and rebooting to switch to Vista or Ubuntu is a disruption. If I need to do some sort of benchmarking or a lengthy task on Vista or Ubuntu then it is worth the time to reboot. But there are several use cases where I just want to do something quickly and then go back to what I was doing in OpenSolaris.

With Vista, typically I just want to edit an OpenOffice document, usually a presentation. I frequently edit presentations that were originally created by someone who was running OpenOffice on Windows. Since OpenOffice does not embed the actual fonts in the document, editing a document created on Windows while running OpenOffice on a non-Windows operating system can cause problems. Even with fonts such as Arial - Arial on Windows is not exactly the same as Arial on OpenSolaris, Ubuntu, etc. As a result, items in the documents frequently do not align correctly.

So I need the ability to quickly and easily run Windows Vista for just a few minutes without having to shut down OpenSolaris, boot Vista, and then shut down Vista and restart OpenSolaris. There is an easy solution: VirtualBox.

Installing VirtualBox is painless; just be sure to download the correct version for OpenSolaris: 32-bit or 64-bit. On my system, which has an Intel Core 2 Duo chip, OpenSolaris runs in 64-bit mode, as confirmed by the isainfo command:

gs145266@opensolaris-gs-08.05:~$ isainfo -k
amd64  
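
The same check can be scripted when picking the download. A minimal sketch, assuming x86 hardware where isainfo -k reports either amd64 or i386; vbox_bits_for is a hypothetical helper name:

```shell
#!/bin/sh
# Map the kernel instruction set reported by `isainfo -k` to the word size
# of the VirtualBox package to download.
vbox_bits_for() {
    case "$1" in
        amd64) echo 64 ;;
        i386)  echo 32 ;;
        *)     echo "unknown"; return 1 ;;
    esac
}

# Demo with the value shown above; on a live system use: vbox_bits_for "$(isainfo -k)"
vbox_bits_for amd64
# prints: 64
```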

VirtualBox is a small download and it installs quickly. It is very easy to use - I did not need the documentation until I started doing advanced configuration type stuff.

In VirtualBox I created a virtual machine for Windows Vista and then I had to make a choice:

  1. Use the existing Vista installation that was already on my hard disk. In other words, configure the virtual machine to use the NTFS partition on my hard drive and let it boot Windows Vista from the same installation that boots from the bare metal.
  2. Create a virtual disk image (.vdi) file and configure the virtual machine to use it. The .vdi file would be empty, so this would require getting a Vista DVD so that I could run the Windows installer, etc.

I ultimately decided to go with option 2 because of complications with option 1:

Option 2 was easy to set up and then I was able to run the installer for Windows Vista from a DVD. I added the VirtualBox guest additions and everything worked great.

There was just one problem: when running Vista in that virtual machine, I have no access to all that data on my hard drive's NTFS partition. It should be noted that this is not a limitation of VirtualBox - it is possible to configure "shared folders" that allow the installation of Vista that is running in a virtual machine to read/write any directory that OpenSolaris can access.

The problem is that OpenSolaris 2008.05 cannot access my NTFS partition because it does not include any support for NTFS partitions. You can add some packages to it in order to get read-only support for NTFS (see this entry from Pradhap for details), but I would like to be able to write files to the NTFS partition as well.

Once again, VirtualBox provides a solution.

VirtualBox supports raw access by a virtual machine to the host operating system's disk drives. The details are provided in section 9.9 of the VirtualBox 1.6.2 User Guide, which you will need to study before attempting this on your own system. Section 9.9 includes this important warning:

Warning: Raw hard disk access is for expert users only. Incorrect use or use
of an outdated configuration can lead to total loss of data on the physical
disk. Most importantly, do not attempt to boot the partition with the cur-
rently running host operating system in a guest. This will lead to severe data
corruption.

I wanted to provide access to the NTFS partition only, so I attempted to follow the instructions in section 9.9.2 of the documentation. Unfortunately, the VBoxManage command described in section 9.9.2 does not work in VirtualBox 1.6.2 when run on OpenSolaris - the bug is documented here.

Luckily, I have other operating systems installed on this machine. :-) I booted up Ubuntu and used the VBoxManage in my installation of VirtualBox on Ubuntu to find out the partition numbers used by VirtualBox:

gs145266@gs145266-laptop-ubu-804:~$ sudo VBoxManage internalcommands listpartitions -rawdisk /dev/sda
[sudo] password for gs145266: 
VirtualBox Command Line Management Interface Version 1.6.2
(C) 2005-2008 Sun Microsystems, Inc.
All rights reserved.
Number  Type   StartCHS       EndCHS      Size (MiB)  Start (Sect)
1       0x27  0   /32 /33  888 /172/52          6970         2048
2       0x07  888 /172/53  1023/254/63         63415     14276608
3       0xbf  1023/254/63  1023/254/63         72927    144151245
5       0x83  1023/254/63  1023/254/63           972    293507613
6       0x82  1023/254/63  1023/254/63          3827    295499673
7       0x83  1023/254/63  1023/254/63         42667    303339393

Partition 2 is the NTFS partition, so I created the files described by section 9.9.2 by using this command:

gs145266@gs145266-laptop-ubu-804:~$ sudo VBoxManage internalcommands createrawvmdk -filename /home/gs145266/part2Access.vmdk -rawdisk /dev/sda -partitions 2
VirtualBox Command Line Management Interface Version 1.6.2
(C) 2005-2008 Sun Microsystems, Inc.
All rights reserved.
RAW host disk access VMDK file /home/gs145266/part2Access.vmdk created successfully.
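
The partition number passed to -partitions can also be picked out of the listpartitions output by type rather than by eye. A sketch assuming the column layout shown above and a POSIX awk; 0x07 is the MBR type used for NTFS, and partitions_of_type is a hypothetical helper name:

```shell
#!/bin/sh
# Print the numbers of all partitions with the given MBR type.
# Reads `VBoxManage internalcommands listpartitions` output on stdin.
partitions_of_type() {
    awk -v type="$1" '$1 ~ /^[0-9]+$/ && $2 == type { print $1 }'
}

# Demo with the listing from this post.
partitions_of_type 0x07 <<'EOF'
Number  Type   StartCHS       EndCHS      Size (MiB)  Start (Sect)
1       0x27  0   /32 /33  888 /172/52          6970         2048
2       0x07  888 /172/53  1023/254/63         63415     14276608
3       0xbf  1023/254/63  1023/254/63         72927    144151245
5       0x83  1023/254/63  1023/254/63           972    293507613
6       0x82  1023/254/63  1023/254/63          3827    295499673
7       0x83  1023/254/63  1023/254/63         42667    303339393
EOF
# prints: 2
```

On the Ubuntu side this would be fed by `sudo VBoxManage internalcommands listpartitions -rawdisk /dev/sda | partitions_of_type 0x07`.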

In addition to part2Access.vmdk, VBoxManage also created part2Access-pt.vmdk. I copied both files to a USB drive and then rebooted the system in order to bring up OpenSolaris.

The content of part2Access.vmdk was:

# Disk DescriptorFile
version=1
CID=4920ffa0
parentCID=ffffffff
createType="partitionedDevice"

# Extent description
RW 63 FLAT "part2Access-pt.vmdk"
RW 1985 ZERO 
RW 14274560 ZERO 
RW 129874637 FLAT "/dev/sda" 14276608
RW 149356367 ZERO 
RW 1 FLAT "part2Access-pt.vmdk" 63
RW 1991997 ZERO 
RW 63 FLAT "part2Access-pt.vmdk" 64
RW 7839657 ZERO 
RW 63 FLAT "part2Access-pt.vmdk" 127
RW 87382575 ZERO 

# The disk Data Base 
#DDB

ddb.virtualHWVersion = "4"
ddb.adapterType="ide"
ddb.geometry.cylinders="16383"
ddb.geometry.heads="16"
ddb.geometry.sectors="63"
ddb.uuid.image="840529aa-1989-46ce-c09e-7d3bbec98243"
ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
ddb.uuid.modification="00000000-0000-0000-0000-000000000000"
ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
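
The extent lines appear to describe the virtual disk as consecutive runs of sectors, so the sizes of the extents before the raw one should add up to the partition's start sector from the listpartitions output. A quick check of that arithmetic, assuming this reading of the descriptor format and a POSIX awk; extent_starts is a hypothetical name:

```shell
#!/bin/sh
# For each FLAT extent backed by a raw device, print where it begins on the
# virtual disk (sum of all preceding extent sizes) next to its device offset.
extent_starts() {
    awk '$1 == "RW" {
             if ($3 == "FLAT" && $4 ~ /^"\/dev\//)
                 printf "%s: virtual start %d, device offset %d\n", $4, start, $5
             start += $2
         }'
}

# Demo with the extent description from part2Access.vmdk.
extent_starts <<'EOF'
RW 63 FLAT "part2Access-pt.vmdk"
RW 1985 ZERO 
RW 14274560 ZERO 
RW 129874637 FLAT "/dev/sda" 14276608
RW 149356367 ZERO 
RW 1 FLAT "part2Access-pt.vmdk" 63
RW 1991997 ZERO 
RW 63 FLAT "part2Access-pt.vmdk" 64
RW 7839657 ZERO 
RW 63 FLAT "part2Access-pt.vmdk" 127
RW 87382575 ZERO 
EOF
# prints: "/dev/sda": virtual start 14276608, device offset 14276608
```

Both numbers match the start sector of partition 2 (63 + 1985 + 14274560 = 14276608), which suggests the guest sees the NTFS partition at the same position it has on the physical disk, with everything else zeroed out.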

Note the value "/dev/sda" in the fourth extent line, which of course is the device name for my hard disk in Linux. I changed it to the value that OpenSolaris would expect on my system, /dev/dsk/c5d0p0, and then saved the file.
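
The same change can be made without an editor. A sketch using sed on the raw-device extent only, assuming device names that contain no `|` or regex metacharacters; retarget_rawdisk is a hypothetical name:

```shell
#!/bin/sh
# Rewrite the raw-device name in a VMDK descriptor read on stdin.
# $1 = device name as written in the descriptor, $2 = replacement.
retarget_rawdisk() {
    sed "s|FLAT \"$1\"|FLAT \"$2\"|"
}

# Demo on the one line that matters.
echo 'RW 129874637 FLAT "/dev/sda" 14276608' | retarget_rawdisk /dev/sda /dev/dsk/c5d0p0
# prints: RW 129874637 FLAT "/dev/dsk/c5d0p0" 14276608
```

For the whole file this would be something like `retarget_rawdisk /dev/sda /dev/dsk/c5d0p0 < part2Access.vmdk > part2Access.new`, where the output file name is only an example. Writing to a new file also keeps the original intact in case the edit goes wrong.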

The final step was to make it available to the virtual machine. In order for this to work, VirtualBox has to be run with enough privileges to get raw access to the disk drive. So I modified the GNOME menu entry for VirtualBox so that it is started by pfexec:

After starting VirtualBox, I added part2Access.vmdk as a virtual disk by using File > Virtual Disk Manager > Add. Then I was able to add it as the IDE Primary Slave device for my Vista virtual machine:

And now when I run Vista in that virtual machine, it sees my NTFS partition on my hard disk as drive E:


21 Jul 2008 11:03pm GMT

Sriram Narayanan: Opportunity: Sun hardware and software in demanding environments.

As an erstwhile hardware technician, I have always loved reading about hardware. And as an erstwhile Industrial Automation technician, machine uptime is something that I enjoy learning about.

A long-running discussion in the Industrial Automation industry has been about how people would love to have the reliability of Sun hardware, the stability of Solaris, and the feature set of Windows-based automation tools.

Some years ago, I was fortunate to see a Bayer Diagnostics' Advia Centaur being commissioned at Asia's largest thyroid testing center at Thane (pronounced: thaaney), Maharashtra. This machine (called an immunoassay system) did a whole bunch of tests at very high speeds. When you have to process something like 30,000 independent blood samples in 4 hours at night at very high speeds, then you need reliability and accuracy. The high speed system would process the test-tubes which had barcoded stickers on them.

The only thing that counted as a minus for me was how the data was finally made available. All the data collation was made available via some custom software. There was a different mechanism for providing the data over a serial port, and this was in terms of "ask me for something very specific and I'll provide you that". There was no notion of making the data available as a stream, or via some HTTP mechanism, or even some TCP stream.

It was not as if the hardware platform was limited - the equipment was driven by a dual SPARC Solaris box running Solaris 8.

Well, this is not really a post discussing how data access could be provided better, so let me get back to the point.

I learned then from the senior technicians from Bayer that they trusted Solaris and the Sun hardware a lot. They had a few war stories to tell me about how some Sun hardware that they had would just _refuse_ to fail! We all wondered if Sun was actually competing with itself!

These war stories of Sun hardware just working on and on, never failing, never dying, are something that I have heard from other guys who worked on bank ATM systems too. They'd have OS/2 Warp on the kiosks (Windows 95/98 on rare occasions), and Solaris 8 on Sun servers.

With the addition of awesome tools and technologies such as ZFS, DTrace, FMA, Trusted Security, multi-core awareness, the plethora of storage options, and the soon-to-be-integrated Crossbow - the Solaris OS and Sun hardware combination could become more attractive as a platform to deliver great solutions on.

What, then, is preventing system integrators from selecting Sun for automation solutions? The lack of an automation IDE targeting Solaris would be one thing. The absence of device drivers would be another. Writing high quality device drivers is an important job, especially when writing them for devices such as high speed counters (where accuracy is important) or analog-to-digital converters (where precision and timing are important). National Instruments' LabVIEW did have some Solaris-based releases, but that was never really marketed too well - at least in India, unfortunately.

There was an Eclipsecon presentation by Siemens' Brazil some years ago, but I haven't heard more since then.

Bottom line: Having a killer IDE, a healthy ecosystem of device drivers, and strong marketing could position Solaris well in those worlds where other OSes can't cut it.

21 Jul 2008 8:30pm GMT