29 Jan 2026
Planet Debian
C.J. Collier: Part 3: Building the Keystone – Dataproc Custom Images for Secure Boot & GPUs

Part 3: Building the Keystone - Dataproc Custom Images for Secure Boot & GPUs
In Part 1, we established a secure, proxy-only network. In Part 2, we
explored the enhanced install_gpu_driver.sh initialization
action. Now, in Part 3, we'll focus on using the LLC-Technologies-Collier/custom-images
repository (branch proxy-exercise-2025-11) to build the actual custom
Dataproc images with embedded NVIDIA drivers signed for Secure Boot, all
within our proxied environment.
Why Custom Images?
To run NVIDIA GPUs on Shielded VMs with Secure Boot enabled, the
NVIDIA kernel modules must be signed with a key trusted by the VM's EFI
firmware. Since standard Dataproc images don't include these
custom-signed modules, we need to build our own. This process also
allows us to pre-install a full stack of GPU-accelerated software.
The custom-images Toolkit (examples/secure-boot)
The examples/secure-boot directory within the
custom-images repository contains the necessary scripts and
configurations, refined through significant development to handle proxy
and Secure Boot challenges.
Key Components & Development Insights:
- env.json: The central configuration file (as used in Part 1) for project, network, proxy, and bucket details. This became the single source of truth to avoid configuration drift.
- create-key-pair.sh: Manages the Secure Boot signing keys (PK, KEK, DB) in Google Secret Manager, essential for module signing.
- build-and-run-podman.sh: Orchestrates the image build process in an isolated Podman container. This was introduced to standardize the build environment and encapsulate dependencies, simplifying what the user needs to install locally.
- pre-init.sh: Sets up the build environment within the container and calls generate_custom_image.py. It crucially passes metadata derived from env.json (like proxy settings and Secure Boot key secret names) to the temporary build VM (see the sketch after this list).
- generate_custom_image.py: The core Python script that automates GCE VM creation, runs the customization script, and creates the final GCE image.
- gce-proxy-setup.sh: This script from startup_script/ is vital. It's injected into the temporary build VM and runs first to configure the OS, package managers (apt, dnf), tools (curl, wget, GPG), Conda, and Java to use the proxy settings passed in the metadata. This ensures the entire build process is proxy-aware.
- install_gpu_driver.sh: Used as the --customization-script within the build VM. As detailed in Part 2, this script handles the driver/CUDA/ML stack installation and signing, now able to function correctly thanks to the proxy setup performed by gce-proxy-setup.sh.
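To make the metadata hand-off concrete, here is a minimal sketch (not the actual pre-init.sh) of how values from env.json could be turned into a --metadata argument for the temporary build VM. Only the BUCKET key is taken from this series; the HTTP_PROXY key name and the PEM object name are assumptions:
# Sketch only: derive build-VM metadata from env.json
# (HTTP_PROXY key and the proxy-ca.pem object name are assumptions)
BUCKET="$(jq -r .BUCKET env.json)"
PROXY_URI="$(jq -r .HTTP_PROXY env.json)"
PROXY_PEM="gs://${BUCKET}/custom-image-deps/proxy-ca.pem"
METADATA="http-proxy=${PROXY_URI},https-proxy=${PROXY_URI},http-proxy-pem-uri=${PROXY_PEM}"
echo "--metadata ${METADATA}"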
Layered Image Strategy:
The pre-init.sh script employs a layered approach:
- secure-boot image: Base image with Secure Boot certificates injected.
- tf image: Based on secure-boot, this image runs the full install_gpu_driver.sh within the proxy-configured build VM to install NVIDIA drivers, CUDA, ML libraries (TensorFlow, PyTorch, RAPIDS), and sign the modules. This is the primary target image for our use case.
(Note: secure-proxy and proxy-tf layers
were experiments, but the -tf image combined with runtime
metadata emerged as the most effective solution for 2.2-debian12).
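To illustrate the layering, here is a rough sketch of how the -tf layer could be built on top of the secure-boot layer's output. Flag names follow the upstream generate_custom_image.py, and pre-init.sh supplies the real arguments, so treat this as a conceptual outline rather than a command to copy:
# Conceptual outline of the second layer; GPU_DRIVER_SCRIPT points at
# install_gpu_driver.sh from the initialization-actions repo, other values are illustrative
python3 generate_custom_image.py \
  --image-name "dataproc-2-2-deb12-${TS}-tf" \
  --base-image-uri "projects/${PROJECT_ID}/global/images/dataproc-2-2-deb12-${TS}-secure-boot" \
  --customization-script "${GPU_DRIVER_SCRIPT}" \
  --zone "${ZONE}" \
  --gcs-bucket "gs://${BUCKET}" \
  --metadata "invocation-type=custom-images,http-proxy=${PROXY_URI}"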
Build Steps:
- Clone Repos & Configure env.json: Ensure you have the custom-images and cloud-dataproc repos and a complete env.json as described in Part 1.
- Run the Build:
# Example: Build a 2.2-debian12 based image set
# Run from the custom-images repository root
bash examples/secure-boot/build-and-run-podman.sh 2.2-debian12
This command will build the layered images, leveraging the proxy
settings from env.json via the metadata injected into the build VM.
Note the final image name produced (e.g.,
dataproc-2-2-deb12-YYYYMMDD-HHMMSS-tf).
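Once the build completes, you can confirm the -tf image landed in your project before moving on; the filter pattern below is just an example for a 2.2-debian12 build:
# Adjust the name pattern to your Dataproc version
gcloud compute images list --filter="name~'dataproc-2-2-deb12-.*-tf'"
# Inspect a specific image (substitute the real timestamped name)
gcloud compute images describe dataproc-2-2-deb12-YYYYMMDD-HHMMSS-tf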
Conclusion of Part 3
Through an iterative process, we've developed a robust workflow
within the custom-images repository to build Secure
Boot-compatible GPU images in a proxy-only environment. The key was
isolating the build in Podman, ensuring the build VM is fully
proxy-aware using gce-proxy-setup.sh, and leveraging the
enhanced install_gpu_driver.sh from Part 2.
In Part 4, we'll bring it all together, deploying a Dataproc cluster
using this custom -tf image within the secure network, and
verifying the end-to-end functionality.
29 Jan 2026 9:08am GMT
28 Jan 2026
Planet Debian
C.J. Collier: Part 2: Taming the Beast – Deep Dive into the Proxy-Aware GPU Initialization Action

Part 2: Taming the Beast - Deep Dive into the Proxy-Aware GPU Initialization Action
In Part 1 of this series, we laid the network foundation for running
secure Dataproc clusters. Now, let's zoom in on the core component
responsible for installing and configuring NVIDIA GPU drivers and the
associated ML stack in this restricted environment: the
install_gpu_driver.sh script from the LLC-Technologies-Collier/initialization-actions
repository (branch gpu-202601).
This isn't just any installation script; it has been significantly
enhanced to handle the nuances of Secure Boot and to operate seamlessly
behind an HTTP/S proxy.
The Challenge: Installing GPU Drivers Without Direct Internet
Our goal was to create a Dataproc custom image with NVIDIA GPU
drivers, sign the kernel modules for Secure Boot, and ensure the entire
process works seamlessly when the build VM and the eventual cluster
nodes have no direct internet access, relying solely on an HTTP/S proxy.
This involved:
- Proxy-Aware Build: Ensuring all build steps within the custom image creation process (package downloads, driver downloads, GPG keys, etc.) correctly use the customer's proxy.
- Secure Boot Signing: Integrating kernel module signing using keys managed in GCP Secret Manager, especially when drivers are built from source.
- Conda Environment: Reliably and quickly installing a complex Conda environment with PyTorch, TensorFlow, RAPIDS, and other GPU-accelerated libraries through the proxy.
- Dataproc Integration: Making sure the custom image works correctly with Dataproc's own startup, agent processes, and cluster-specific configurations like YARN.
The Development Journey: Key Enhancements in install_gpu_driver.sh
To address these challenges, the script incorporates several key
features:
- Robust Proxy Handling (set_proxy function):
  - Challenge: Initial script versions had spotty proxy support. Many tools like apt, curl, gpg, and even gsutil failed in proxy-only environments.
  - Enhancements: The set_proxy function (also used in gce-proxy-setup.sh) was completely overhauled to parse various proxy metadata (http-proxy, https-proxy, proxy-uri, no-proxy). Critically, environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) are now set before any network operations. NO_PROXY is carefully set to include .google.com and .googleapis.com to allow direct access to Google APIs via Private Google Access. System-wide trust stores (OS, Java, Conda) are updated with the proxy's CA certificate if provided via http-proxy-pem-uri. gcloud, apt, dnf, and dirmngr are also configured to use the proxy. (See the first sketch after this list.)
- Reliable GPG Key Fetching (import_gpg_keys function):
  - Challenge: Importing GPG keys for repositories often failed as keyservers use non-HTTP ports (e.g., 11371) blocked by firewalls, and gpg --recv-keys is not proxy-friendly.
  - Solution: A new import_gpg_keys function now fetches keys over HTTPS using curl, which respects the environment's proxy settings. This replaced all direct gpg --recv-keys calls. (See the second sketch after this list.)
- GCS Caching is King:
  - Challenge: Repeatedly downloading large files (drivers, CUDA, source code) through a proxy is slow and inefficient.
  - Solution: Implemented extensive GCS caching for NVIDIA drivers, CUDA runfiles, NVIDIA Open Kernel Module source tarballs, compiled kernel modules, and even packed Conda environments. Scripts now check a GCS bucket (dataproc-temp-bucket) before hitting the internet. (See the third sketch after this list.)
  - Impact: Dramatically speeds up subsequent runs and init action execution times on cluster nodes after the cache is warmed.
- Conda Environment Stability & Speed:
  - Challenge: Large Conda environments are prone to solver conflicts and slow installation times.
  - Solution: Integrated Mamba for faster package solving. Refined package lists for better compatibility. Added logic to force-clean and rebuild the Conda environment cache on GCS and locally if inconsistencies are detected (e.g., driver installed but Conda env not fully set up).
- Secure Boot & Kernel Module Signing:
  - Challenge: Custom-compiled kernel modules must be signed to load when Secure Boot is enabled.
  - Solution: The script integrates with GCP Secret Manager to fetch signing keys. The build_driver_from_github function now includes robust steps to compile, sign (using sign-file), install, and verify the signed modules.
- Custom Image Workflow & Deferred Configuration:
  - Challenge: Cluster-specific settings (like YARN GPU configuration) should not be baked into the image.
  - Solution: The install_gpu_driver.sh script detects when it's run during image creation (--metadata invocation-type=custom-images). In this mode, it defers cluster-specific setups to a systemd service (dataproc-gpu-config.service) that runs on the first boot of a cluster instance. This ensures that YARN and Spark configurations are applied in the context of the running cluster, not at image build time.
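To make these ideas concrete, here are a few sketches. First, the proxy pattern: a minimal, illustrative version of what set_proxy does (not the actual function), reading the proxy attribute from instance metadata and exporting it before anything touches the network:
# Minimal sketch of the set_proxy pattern, not the actual function
MD="http://metadata.google.internal/computeMetadata/v1/instance/attributes"
get_attr() { curl -fsS -H "Metadata-Flavor: Google" "${MD}/${1}" 2>/dev/null || true; }
proxy_uri="$(get_attr http-proxy)"
if [ -n "${proxy_uri}" ]; then
  export HTTP_PROXY="${proxy_uri}" HTTPS_PROXY="${proxy_uri}"
  export http_proxy="${proxy_uri}" https_proxy="${proxy_uri}"
  # Keep Google APIs on Private Google Access instead of the proxy
  export NO_PROXY=".google.com,.googleapis.com,metadata.google.internal"
  export no_proxy="${NO_PROXY}"
fi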
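Second, the GPG idea boils down to replacing keyserver traffic with plain HTTPS, which already flows through the proxy. A rough equivalent, with a placeholder URL rather than any specific repository key:
# Fetch a repository signing key over HTTPS instead of gpg --recv-keys
KEY_URL="https://example.com/repo-signing-key.asc"   # placeholder; use the repo's published key URL
curl -fsSL "${KEY_URL}" | gpg --dearmor -o /usr/share/keyrings/example-archive-keyring.gpg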
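Third, the GCS caching is a cache-aside pattern: look in the temp bucket first, fall back to the vendor URL through the proxy, then seed the cache for the next run. The bucket, object name, and variables below are illustrative:
# Cache-aside download; TEMP_BUCKET, object layout, and DRIVER_URL are illustrative
CACHE="gs://${TEMP_BUCKET}/gpu-cache/$(basename "${DRIVER_URL}")"
LOCAL="/tmp/$(basename "${DRIVER_URL}")"
if gsutil -q stat "${CACHE}"; then
  gsutil cp "${CACHE}" "${LOCAL}"            # cache hit: pull from GCS
else
  curl -fsSL -o "${LOCAL}" "${DRIVER_URL}"   # cache miss: download via HTTPS_PROXY
  gsutil cp "${LOCAL}" "${CACHE}"            # warm the cache for the next build/node
fi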
Conclusion of Part 2
The install_gpu_driver.sh initialization action is more
than just an installer; it's a carefully crafted tool designed to handle
the complexities of secure, proxied environments. Its robust proxy
support, comprehensive GCS caching, refined Conda management, Secure
Boot signing capabilities, and awareness of the custom image build
lifecycle make it a critical enabler.
In Part 3, we'll explore how the LLC-Technologies-Collier/custom-images
repository (branch proxy-exercise-2025-11) uses this
initialization action to build the complete, ready-to-deploy Secure Boot
GPU custom images.
28 Jan 2026 10:45am GMT
C.J. Collier: Dataproc GPUs, Secure Boot, & Proxies

Part 1: Building a Secure Network Foundation for Dataproc with GPUs & SWP
Welcome to the first post in our series on running GPU-accelerated
Dataproc workloads in secure, enterprise-grade environments. Many
organizations need to operate within VPCs that have no direct internet
egress, instead routing all traffic through a Secure Web Proxy (SWP).
Additionally, security mandates often require the use of Shielded VMs
with Secure Boot enabled. This series will show you how to meet these
requirements for your Dataproc GPU clusters.
In this post, we'll focus on laying the network foundation using
tools from the LLC-Technologies-Collier/cloud-dataproc
repository (branch proxy-sync-2026-01).
The Challenge: Network Isolation & Control
Before we can even think about custom images or GPU drivers, we need
a network environment that:
- Prevents direct internet access from Dataproc cluster nodes.
- Forces all egress traffic through a manageable and auditable SWP.
- Provides the necessary connectivity for Dataproc to function and for us to build images later.
- Supports Secure Boot for all VMs.
The Toolkit: LLC-Technologies-Collier/cloud-dataproc
To make setting up and tearing down these complex network
environments repeatable and consistent, we've developed a set of bash
scripts within the gcloud directory of the
cloud-dataproc repository. These scripts handle the
creation of VPCs, subnets, firewall rules, service accounts, and the
Secure Web Proxy itself.
Key Script: gcloud/bin/create-dpgce-private
This script is the cornerstone for creating the private, proxied
environment. It automates:
- VPC and Subnet creation (for the cluster, SWP, and management).
- Setup of Certificate Authority Service and Certificate Manager for SWP TLS interception.
- Deployment of the SWP Gateway instance.
- Configuration of a Gateway Security Policy to control egress.
- Creation of necessary firewall rules.
- Result: Cluster nodes in this VPC have NO default internet route and MUST use the SWP (a quick way to verify this is shown below).
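As a sanity check for that last point: once the script finishes, listing routes to 0.0.0.0/0 on the new VPC should return nothing (replace NETWORK_NAME with whatever network name your env.json defines):
# Verify there is no default internet route on the cluster VPC
gcloud compute routes list --filter="network~'NETWORK_NAME' AND destRange='0.0.0.0/0'"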
Configuration via env.json
We use a single env.json file to drive the
configuration. This file will also be used by the
custom-images scripts in Part 3. This env.json
should reside in your custom-images repository clone, and
you'll symlink it into the cloud-dataproc/gcloud
directory.
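The exact keys are dictated by the scripts in both repositories; as a rough illustration only (BUCKET is the one key referenced verbatim later in this post, the other names are assumptions), an env.json might look like this:
# Illustrative skeleton only; key names other than BUCKET are assumptions
cat > env.json <<'EOF'
{
  "PROJECT_ID": "my-gcp-project",
  "REGION": "us-west1",
  "ZONE": "us-west1-a",
  "BUCKET": "my-dataproc-deps-bucket",
  "HTTP_PROXY": "http://proxy.internal.example:3128"
}
EOF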
Running the Setup:
# Assuming you have cloud-dataproc and custom-images cloned side-by-side
# And your env.json is in the custom-images root
cd cloud-dataproc/gcloud
# Symlink to the env.json in custom-images
ln -sf ../../custom-images/env.json env.json
# Run the creation script, but don't create a cluster yet
bash bin/create-dpgce-private --no-create-cluster
cd ../../custom-images

Node Configuration: The Metadata Startup Script for Runtime
For the Dataproc cluster nodes to function correctly in this proxied
environment, they need to be configured to use the SWP on boot. We
achieve this using a GCE metadata startup script.
The script startup_script/gce-proxy-setup.sh (from the
custom-images repository) is designed to be run on each
cluster node at boot. It reads metadata like http-proxy and
http-proxy-pem-uri (which our cluster creation scripts in
Part 4 will pass) to configure the OS environment, package managers, and
other tools to use the SWP.
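For reference, reading those attributes from inside a node is a single call to the standard GCE metadata server (this is not an excerpt from the script, just the underlying API it relies on):
# Read a proxy attribute the startup script consumes
curl -fsS -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/http-proxy"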
Upload this script to your GCS bucket:
# Run from the custom-images repository root
gsutil cp startup_script/gce-proxy-setup.sh gs://$(jq -r .BUCKET env.json)/custom-image-deps/

This script is essential for the runtime behavior of the cluster nodes.
Conclusion of Part 1
With the cloud-dataproc scripts, we've laid the
groundwork by provisioning a secure VPC with controlled egress through
an SWP. We've also prepared the essential node-level proxy configuration
script (gce-proxy-setup.sh) in GCS, ready to be used by our
clusters.
Stay tuned for Part 2, where we'll dive into the
install_gpu_driver.sh initialization action from the
LLC-Technologies-Collier/initialization-actions repository
(branch gpu-202601) and how it's been adapted to install
all GPU-related software through the proxy during the image build
process.
28 Jan 2026 10:37am GMT