AI Case Study in Automotive Advertising

This case study serves as a proof of concept for the usability of AI in a professional production environment. Created as a spec spot for the BMW i4, the project functions both as an internal marketing piece and as an exploration of how generative AI can be integrated into high-end VFX workflows for automotive advertising.

The process begins with rough scene blocking in Unreal Engine to establish basic geometry and camera layout. Based on a grey-shaded render, AI stills are generated in ComfyUI using ControlNets, guiding the generation of the environment and a vehicle that matches the intended proportions and placement within the scene. In parallel, the car is rendered in 3D and aligned to the AI-generated vehicle in the backplate. The car is lit using scene-appropriate HDRIs and composited into the AI-generated stills, where the shots are refined before transitioning into motion. Image-to-video generation is then performed using Runway Gen-4 with text-prompt guidance. The resulting video is tracked in SynthEyes to recover a matching camera, which keeps the option of a fully rendered 3D vehicle open.
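
To illustrate the ControlNet-guided still generation, here is a minimal sketch using the diffusers library in place of the ComfyUI graph; model identifiers, prompt and file paths are illustrative, and in production a depth pass exported from the same Unreal scene would be the natural conditioning input:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth ControlNet: the blocking render constrains layout, proportions and
# camera framing while the prompt drives the look of the environment.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Depth pass rendered from the Unreal blocking scene (path is illustrative).
guide = load_image("blocking_depth_pass.png")

still = pipe(
    prompt="electric sedan on a coastal road at dusk, cinematic lighting",
    image=guide,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,  # how strictly the layout is enforced
).images[0]
still.save("ai_backplate_still.png")
```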

The car itself is rendered in Maya using V-Ray with full AOV outputs, providing extensive artistic control during the compositing stage in Nuke. This allows precise adjustment of lighting and material characteristics to accurately represent the BMW i4’s visual identity. Product accuracy, and with it fine-grained adjustability, is an essential requirement in automotive advertising and cannot yet be fully guaranteed by purely generative workflows. While the surrounding environment remains AI-generated to enable fast iterations and flexible scene building, the vehicle is kept fully 3D to allow precise control for client iterations. Additional post-production enhancements are applied to enrich detail in visually sparse areas, followed by final color grading to unify the overall look.
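
As an illustration of that AOV-level control, the following Nuke Python sketch rebuilds a beauty from plus-merged lighting components; layer names and node graph are illustrative, not the production script:

```python
import nuke

# Multichannel V-Ray EXR with the car render and its AOVs (path illustrative).
read = nuke.nodes.Read(file="i4_car_render.####.exr")

# Shuffle out individual lighting AOVs; a per-branch Grade can be inserted
# here to adjust each component before recombination.
aov_layers = ["diffuse", "reflection", "specular", "refraction", "gi"]
branches = []
for layer in aov_layers:
    shuffle = nuke.nodes.Shuffle(inputs=[read])
    shuffle["in"].setValue(layer)
    branches.append(shuffle)

# Sum the components back into a beauty with plus-merges.
rebuilt = branches[0]
for branch in branches[1:]:
    rebuilt = nuke.nodes.Merge2(inputs=[rebuilt, branch], operation="plus")
```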

The project demonstrates a hybrid approach to automotive advertising, where AI accelerates scene development and iteration while established VFX workflows ensure precision, control and product accuracy.

© All rights reserved by RECOM FILM GmbH.

Character Replacement with Rig Retargeting

The following project explores an approach to character replacement in video using captured motion. The key focus is rig retargeting for characters with significantly different proportions and physical behavior. A common limitation of motion transfer is that the extracted skeleton and motion are applied directly to a new character, causing visual distortion or implausible movement when body size, weight or posture differ.

The source footage shows a woman balancing on a slackline. For character replacement, this scenario becomes particularly demanding, as the animated rig must accurately reproduce subtle camera movement and the moving ground plane. Special attention is required at the contact point between foot and slackline to avoid visible sliding or floating. The original performer is replaced by an AI-generated grizzly bear, whose proportions, mass and inertia require a fundamentally different interpretation of the same movement.

Pose detection is first performed on the actress in the source video, capturing both motion and alignment within the shot. In parallel, pose detection is applied to the reference character.
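
A minimal sketch of this step, using MediaPipe Pose as a stand-in for the detector (the project does not name its pose-estimation tool; paths are illustrative):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def detect_poses(path, static=False):
    """Return per-frame 2D landmarks in normalized image coordinates."""
    poses = []
    cap = cv2.VideoCapture(path)
    with mp_pose.Pose(static_image_mode=static) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                poses.append([(lm.x, lm.y) for lm in result.pose_landmarks.landmark])
    cap.release()
    return poses

# Run on both the source performance and the static reference character.
source_motion = detect_poses("slackline_source.mp4")
reference_pose = detect_poses("bear_reference.png", static=True)[0]
```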

Rather than directly transferring the extracted pose onto a differently positioned character, the static reference rig is first scaled and aligned to match the size and position of the original performer within the scene. After alignment, the original motion is transferred onto the reference rig, resulting in an animation that closely matches the target character’s proportions while preserving the motion of the original performance. The process is automated but allows manual refinement through joint-scale adjustments to better reflect the physical characteristics of the reference character. The workflow uses WAN 2.2 Animate to remove the original performer and generate the new character.
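
A minimal sketch of the proportion-aware transfer, assuming 2D joint arrays from the pose-detection step above; the bounding-box scale heuristic and the MediaPipe hip indices are illustrative simplifications of the actual rig logic:

```python
import numpy as np

def align_reference(ref_pose, first_frame):
    """Scale and translate the static reference rig onto the performer."""
    ref_pose = np.asarray(ref_pose, dtype=float)
    first_frame = np.asarray(first_frame, dtype=float)
    # Overall vertical extent as a simple proxy for character size.
    scale = np.ptp(first_frame[:, 1]) / np.ptp(ref_pose[:, 1])
    aligned = ref_pose * scale
    # Match root positions via the hip midpoint (MediaPipe indices 23, 24).
    aligned += first_frame[[23, 24]].mean(axis=0) - aligned[[23, 24]].mean(axis=0)
    return aligned

def retarget(src_frames, ref_pose):
    """Transfer per-frame motion deltas onto the aligned reference rig."""
    src = np.asarray(src_frames, dtype=float)
    aligned = align_reference(ref_pose, src[0])
    # Motion is expressed relative to the first frame, so the reference
    # character's proportions are preserved while the performance is kept.
    return [aligned + (frame - src[0]) for frame in src]

# animated = retarget(source_motion, reference_pose)  # arrays from the step above
```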

The output is enhanced through video upscaling with detail refinement, followed by AI-based sound and music generation with KLING Video-to-Audio. Final post-processing and color grading are done in Nuke, adding subtle camera artifacts and lens imperfections that evoke the less polished look of real-world photography.

The result demonstrates a proportion-aware approach for animated character replacement in complex live-action footage.

Multimodal Pipeline for Product Representation

This multimodal workflow for accurate product representation across images and video is developed through a fictional advertising campaign using a Nintendo collectible figure. The key focus of the project is maintaining product consistency across multiple outputs, perspectives and lighting conditions.

The process begins with the creation of a dataset that combines newly photographed product images with existing material to cover a broad range of perspectives, detail close-ups and lighting scenarios. Based on this dataset, a LoRA is trained for FLUX.1 using 47 images at a resolution of 1024×1024 px and 2000 training steps, resulting in a training time of approximately two hours on an RTX 3090. Although the dataset includes descriptive captions with camera perspective information, camera control based on text prompts proves unreliable during later stages of image generation.
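
The training setup can be summarized as a configuration sketch; the dict below mirrors the figures stated above, while the keys follow common LoRA-trainer conventions rather than a specific tool, and rank and learning rate are assumptions, not documented values:

```python
# Hypothetical trainer configuration; only dataset size, resolution, step
# count and hardware are taken from the project description.
flux_lora_config = {
    "base_model": "black-forest-labs/FLUX.1-dev",  # assumed FLUX.1 variant
    "dataset": {
        "num_images": 47,              # new photos + existing material
        "resolution": (1024, 1024),
        "captions": "per-image descriptions incl. camera perspective",
    },
    "training": {
        "steps": 2000,
        "rank": 16,                    # assumption: typical LoRA rank
        "learning_rate": 1e-4,         # assumption: common default
    },
    "hardware": "RTX 3090, ~2 h wall-clock time",
}
```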

To achieve more reliable control of the perspective, the workflow incorporates 3D mesh generation using Hunyuan3D. The resulting model can either be rendered externally or placed directly within the ComfyUI 3D Viewer, allowing precise control over scale, position and camera perspective.
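
As a sketch of the externally rendered variant, the mesh can be loaded and rendered to a depth map that later drives the Depth ControlNet; the library choice (trimesh + pyrender), camera setup and paths are illustrative:

```python
import numpy as np
import trimesh
import pyrender

# Load the Hunyuan3D output and place it in a renderable scene.
mesh = trimesh.load("hunyuan3d_figure.glb", force="mesh")
scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(mesh))

# The camera pose is the perspective control: edit the 4x4 matrix to frame
# the product (here simply pulled back along +Z).
camera_pose = np.eye(4)
camera_pose[2, 3] = 2.5 * mesh.scale
scene.add(pyrender.PerspectiveCamera(yfov=np.pi / 4.0), pose=camera_pose)

renderer = pyrender.OffscreenRenderer(1024, 1024)
_, depth = renderer.render(scene)  # per-pixel depth, 0 where empty

# Normalize to an 8-bit map; whether near is white or black depends on the
# depth convention expected by the ControlNet.
depth_map = (255 * depth / depth.max()).astype(np.uint8)
```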

After generating or selecting a background, the product is integrated using FLUX Inpainting with the trained product LoRA to ensure visual accuracy, combined with ControlNets (Depth and Canny) for perspective transfer. While simultaneous generation of product and background is possible, it exposes current limitations of the dataset size: increasing LoRA strength to improve product accuracy can unintentionally affect the background, and recurring visual patterns from the training data become noticeable. For this reason, a sequential generation approach is preferred, as it provides greater control over both product placement and background composition.
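
A minimal sketch of the sequential inpainting step, assuming the diffusers FLUX inpainting pipeline with the trained LoRA loaded on top; the ControlNet conditioning (Depth and Canny) is omitted for brevity, and model ids, paths and the LoRA scale are illustrative:

```python
import torch
from diffusers import FluxInpaintPipeline
from diffusers.utils import load_image

pipe = FluxInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("figure_lora.safetensors")  # trained product LoRA

background = load_image("generated_background.png")
mask = load_image("product_region_mask.png")  # white = region to generate

image = pipe(
    prompt="collectible figure on a wooden shelf, soft window light",
    image=background,
    mask_image=mask,
    strength=0.9,
    joint_attention_kwargs={"scale": 0.9},  # LoRA strength trade-off noted above
).images[0]
image.save("product_composite.png")
```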

After image generation, the workflow continues with an image-to-video approach using WAN 2.2, followed by video upscaling and enhancement. Sound and music are generated via KLING Video-to-Audio, and final post-processing is performed in Nuke.
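
A rough sketch of the image-to-video step; since the project's WAN 2.2 setup is not documented, the diffusers Wan 2.1 image-to-video integration stands in here, with model id, resolution and prompt as illustrative values:

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Animate the finished product still from the previous stage.
frame = load_image("product_composite.png")
video = pipe(
    image=frame,
    prompt="slow push-in on the collectible figure, soft studio light",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(video, "product_shot.mp4", fps=16)
```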

A notable limitation remains in video generation: while FLUX handles text rendering reliably in still images, WAN tends to lose typographic detail during the diffusion process. Overall, the project demonstrates a scalable, hybrid 2D–3D approach for product representation with a strong emphasis on controllability and visual consistency.

Context-Aware Pipeline for Product Photography

This project introduces a product-aware workflow for transforming simple and cost-efficient photo shoots into polished product photography outputs suitable for advertising. The approach is demonstrated through a fictional perfume campaign with a primary focus on preserving product accuracy while achieving cohesive lighting, clean integration and overall visual quality.

The workflow begins by extracting the product from the original photograph and placing it as a 2D element within the frame. Using FLUX Inpainting, the background is generated around the product based on a text prompt. At this stage, the product is not yet fully integrated: edges remain imperfect, the product does not match the lighting of the environment, and material interactions such as reflections on the product are missing. However, environmental effects cast by the product, such as shadows and indirect lighting, are already introduced in the background. Alternatively, pre-existing backgrounds can be used with image blending.
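
The placement-and-mask step can be sketched directly: the inpainting mask is simply the inverse of the cutout's alpha, so the background is generated around the product while its pixels stay untouched (paths and placement are illustrative):

```python
import numpy as np
from PIL import Image

frame_size = (1024, 1024)
canvas = Image.new("RGBA", frame_size, (0, 0, 0, 0))

# Extracted product cutout with alpha, placed as a 2D element in the frame.
product = Image.open("perfume_cutout.png")
canvas.paste(product, (384, 420), product)

# Inpainting mask: white where the background should be generated,
# black where the product must remain untouched.
alpha = np.array(canvas.split()[-1])
mask = Image.fromarray(np.where(alpha > 0, 0, 255).astype(np.uint8))
mask.save("background_inpaint_mask.png")
canvas.convert("RGB").save("inpaint_input.png")
```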

In the next step, a relighting pass is performed using IC-Light, based on Stable Diffusion 1.5 as the underlying model. This step aligns product and background under a shared lighting setup and color tonality, allowing both elements to visually merge. A subsequent image-to-image process using FLUX further homogenizes the image and improves overall coherence. The diffusion processes also help to refine product edges by reducing cutout artifacts.
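
The homogenization pass can be sketched as a low-strength image-to-image run, assuming the diffusers FLUX img2img pipeline; the strength value is illustrative and is the key trade-off between fusing the composite and preserving its content:

```python
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

composite = load_image("relit_composite.png")  # output of the IC-Light pass
result = pipe(
    prompt="perfume bottle product shot, soft studio lighting",
    image=composite,
    strength=0.25,  # low denoise: keep content, refine edges and coherence
).images[0]
result.save("homogenized_composite.png")
```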

During relighting and image-to-image processing, fine details and typographic clarity on the product can be degraded. To address this, frequency separation within ComfyUI is used in the final step. This photo editing technique preserves the lighting and color of the processed image while restoring high-frequency detail from the original photograph, recovering characteristic product features and text.
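
A minimal sketch of the frequency-separation restore, assuming the original photograph has been aligned to the processed frame; the blur sigma that defines the low/high split is a per-shot tuning value:

```python
import cv2
import numpy as np

# Lighting and color come from the processed image (low frequencies);
# detail and typography come from the original photo (high frequencies).
processed = cv2.imread("homogenized_composite.png").astype(np.float32)
original = cv2.imread("original_product_aligned.png").astype(np.float32)

sigma = 8.0  # illustrative; defines where "detail" ends and "lighting" begins
low = cv2.GaussianBlur(processed, (0, 0), sigma)
high = original - cv2.GaussianBlur(original, (0, 0), sigma)

# In practice the restore is masked to the product region so the generated
# background keeps its own detail.
restored = np.clip(low + high, 0, 255).astype(np.uint8)
cv2.imwrite("product_detail_restored.png", restored)
```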

The workflow concludes with post-processing to enhance realism by mimicking camera characteristics, resulting in a visually cohesive product image that bridges the gap between low-effort source imagery and high-quality commercial visuals.