17 Jun 2024

feedPlanet Maemo

Incorporating 3D Gaussian Splats into the graphics pipeline

3D Gaussian splatting is the emerging rendering technique that is overtaking NeRFs. Since it is centered around point primitives, it is more compatible with traditional graphics pipelines that already support point rendering.

Gaussian splats essentially enhance the concept of point rendering by converting the point primitive into a 3D ellipsoid, which is then projected into 2D during the rendering process.. This concept was initially described in 2002 [3], but the technique of extending Structure from Motion scans in this way was only detailed more recently [1].

In this post, I explore how to integrate Gaussian splats into the traditional graphics pipeline. This allows them to be used alongside triangle-based primitives and interact with them through the depth buffer for occlusion (see header image). This approach also simplifies deployment by eliminating the need for CUDA.

Storage

The original implementation uses .ply files as their checkpoint format, focusing on maintaining training-relevant data structures at the expense of storage efficiency, leading to increased file sizes.

For example, it stores the covariance as scaling and a rotation quaternion, necessitating reconstruction during rendering. A more efficient approach would be to leverage orthogonality, storing only the diagonal and upper triangular vectors, thereby eliminating reconstruction and reducing storage requirements.

Further analysis of the storage usage for each attribute shows that the spherical harmonics of orders 1-3 are the main contributors to the file size. However, according to the ablation study in the original publication [1], these harmonics only lead to a modest PSNR improvement of 0.5.

Therefore, the most straightforward way to decrease storage is by discarding the higher-order spherical harmonics. Additionally, the level 0 spherical harmonics can be converted into a diffuse color and merged with opacity to form a single RGBA value. These simple yet effective methods were implemented in one of the early WebGL implementations, resulting in the .splat format. As an added benefit, this format can be easily interpreted by viewers unaware of Gaussian splats as a simple colored point cloud:

Results using a non Gaussian-splat aware renderer

By directly storing the covariance as previously mentioned we can reduce the precision from float32 to float16, thereby halving the storage needed for that data. Furthermore, since most splats have limited spatial extents, we can also utilize float16 for position data, yielding additional storage savings.

With these changes, we achieve a storage requirement of 22 bytes per splat, in contrast to the 44 bytes needed by the .splat format and 236 bytes in the original implementation. Thus, we have attained a 10x reduction in storage compared to the original implementation simply by using more suitable data types.

Blending

The image formation model presented in the original paper [1] is similar to the NeRF rendering, as it is compared to it. This involves casting a ray and observing its intersection with the splats, which leads to front-to-back blending. This is precisely the approach taken by the provided CUDA implementation.

Blending remains a component of the fixed-function unit within the graphics pipeline, which can be set up for front-to-back blending [2] by using the factors (one_minus_dest_alpha, one) and by multiplying color and alpha in the shader as color.rgb * color.a. This results in the following equation:

\begin{aligned}C_{dst} &= (1 - \alpha_{dst}) \cdot \alpha_{src} C_{src} &+ C_{dst}\\ \alpha_{dst} &= (1 - \alpha_{dst})\cdot\alpha_{src} &+ \alpha_{dst}\end{aligned}

However, this method requires the framebuffer alpha value to be zero before rendering the splats, which is not typically the case as any previous render pass could have written an arbitrary alpha value.

A simple solution is to switch to back-to-front sorting and use the standard alpha blending factors (src_alpha, one_minus_src_alpha) for the following blending equation:

C_{dst} = \alpha_{src} \cdot C_{src} + (1 - \alpha_{src}) \cdot C_{dst}

This allows us to regard Gaussian splats as a special type of particles that can be rendered together with other transparent elements within a scene.

References

  1. Kerbl, Bernhard, et al. "3d gaussian splatting for real-time radiance field rendering." ACM Transactions on Graphics 42.4 (2023): 1-14.
  2. Green, Simon. "Volumetric particle shadows." NVIDIA Developer Zone (2008).
  3. Zwicker, Matthias, et al. "EWA splatting." IEEE Transactions on Visualization and Computer Graphics 8.3 (2002): 223-238.

0 Add to favourites0 Bury

17 Jun 2024 1:28pm GMT

30 Apr 2024

feedPlanet Maemo

Dissecting GstSegments

During all these years using GStreamer, I've been having to deal with GstSegments in many situations. I've always have had an intuitive understanding of the meaning of each field, but never had the time to properly write a good reference explanation for myself, ready to be checked at those times when the task at hand stops being so intuitive and nuisances start being important. I used the notes I took during an interesting conversation with Alba and Alicia about those nuisances, during the GStreamer Hackfest in A Coruña, as the seed that evolved into this post.

But what are actually GstSegments? They are the structures that track the values needed to synchronize the playback of a region of interest in a media file.

GstSegments are used to coordinate the translation between Presentation Timestamps (PTS), supplied by the media, and Runtime.

PTS is the timestamp that specifies, in buffer time, when the frame must be displayed on screen. This buffer time concept (called buffer running-time in the docs) refers to the ideal time flow where rate isn't being had into account.

Decode Timestamp (DTS) is the timestamp that specifies, in buffer time, when the frame must be supplied to the decoder. On decoders supporting P-frames (forward-predicted) and B-frames (bi-directionally predicted), the PTS of the frames reaching the decoder may not be monotonic, but the PTS of the frames reaching the sinks are (the decoder outputs monotonic PTSs).

Runtime (called clock running time in the docs) is the amount of physical time that the pipeline has been playing back. More specifically, the Runtime of a specific frame indicates the physical time that has passed or must pass until that frame is displayed on screen. It starts from zero.

Base time is the point when the Runtime starts with respect to the input timestamp in buffer time (PTS or DTS). It's the Runtime of the PTS=0.

Start, stop, duration: Those fields are buffer timestamps that specify when the piece of media that is going to be played starts, stops and how long that portion of the media is (the absolute difference between start and stop, and I mean absolute because a segment being played backwards may have a higher start buffer timestamp than what its stop buffer timestamp is).

Position is like the Runtime, but in buffer time. This means that in a video being played back at 2x, Runtime would flow at 1x (it's physical time after all, and reality goes at 1x pace) and Position would flow at 2x (the video moves twice as fast than physical time).

The Stream Time is the position in the stream. Not exactly the same concept as buffer time. When handling multiple streams, some of them can be offset with respect to each other, not starting to be played from the begining, or even can have loops (eg: repeating the same sound clip from PTS=100 until PTS=200 intefinitely). In this case of repeating, the Stream time would flow from PTS=100 to PTS=200 and then go back again to the start position of the sound clip (PTS=100). There's a nice graphic in the docs illustrating this, so I won't repeat it here.

Time is the base of Stream Time. It's the Stream time of the PTS of the first frame being played. In our previous example of the repeating sound clip, it would be 100.

There are also concepts such as Rate and Applied Rate, but we didn't get into them during the discussion that motivated this post.

So, for translating between Buffer Time (PTS, DTS) and Runtime, we would apply this formula:

Runtime = BufferTime * ( Rate * AppliedRate ) + BaseTime

And for translating between Buffer Time (PTS, DTS) and Stream Time, we would apply this other formula:

StreamTime = BufferTime * AppliedRate + Time

And that's it. I hope these notes in the shape of a post serve me as reference in the future. Again, thanks to Alicia, and especially to Alba, for the valuable clarifications during the discussion we had that day in the Igalia office. This post wouldn't have been possible without them.

0 Add to favourites0 Bury

30 Apr 2024 6:00am GMT

02 Feb 2024

feedPlanet Maemo

libSDL2 and VVVVVV for the Wii

Just a quick update on something that I've been working on in my free time.

I recently refurbished my old Nintendo Wii, and for some reason I cannot yet explain (not even to myself) I got into homebrew programming and decided to port libSDL (the 2.x version -- a 1.x port already existed) to it. The result of this work is already available via the devkitPro distribution, and although I'm sure there are still many bugs, it's overall quite usable.

In order to prove it, I ported the game VVVVVV to the Wii:

During the process I had to fix quite a few bugs in my libSDL port and in a couple of other libraries used by VVVVVV, which will hopefully will make it easier to port more games. There's still an issue that bothers me, where the screen resolution seems to be not totally supported by my TV (or is it the HDMI adaptor's fault?), resulting in a few pixels being cut at the top and at the bottom of the screen. But unless you are a perfectionist, it's a minor issue.

In case you have a Wii to spare, or wouldn't mind playing on the Dolphin emulator, here's the link to the VVVVVV release. Have fun! :-)

0 Add to favourites0 Bury

02 Feb 2024 5:50pm GMT