Turning a video of a shoe into something you could try on

For our third-year minor project, my friends and I decided to work on virtual shoe try-on.

The idea was easy to explain.

When buying shoes online, you usually see a few photographs taken from fixed angles. You can inspect the color and perhaps zoom into the material, but you cannot pick the shoe up, turn it around, or see what it might look like on your own feet.

We wanted to see whether someone could open a mobile application, point the camera toward their feet, and try on a realistic digital version of a shoe.

At first, the augmented-reality part seemed like the main problem.

It was not.

Before we could place a shoe on someone's feet, we first needed a good 3D version of the shoe. Most stores do not already have accurate 3D models of every product. Creating those models manually would take too much time and skill to be useful for a normal catalogue.

So the real problem became:

Could we take a short video of a physical shoe and turn it into a 3D model that was good enough to use in augmented reality?

That question took us through computer vision, segmentation, Gaussian Splatting, mesh extraction, foot tracking, occlusion, and mobile development.

The final work became our paper, "3D Reconstruction of Shoes for Augmented Reality".

[Demo: the reconstructed shoe inside the mobile AR try-on flow]

We started by collecting shoes

A reconstruction model does not need thousands of different shoes to build one shoe. It learns the appearance of one object from many views of that object.

But we also wanted the larger pipeline to work on different shoes and backgrounds. For that, we needed a reliable way to separate a shoe from everything behind it.

We started asking students around Pulchowk Campus if we could record their shoes.

We collected 101 videos of different shoes. Each video was around 30 seconds long and was recorded on whatever phones were available, including a Poco X3, a Samsung Galaxy S9, and a Redmi 12.

The fact that the videos came from different phones was useful. It meant the system was not being tested only on images from one controlled camera.

From every video, we sampled images at regular intervals. This gave us roughly 3,000 shoe images.
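To make the sampling step concrete: roughly 3,000 images from 101 videos works out to about 30 frames per video, or about one per second of footage. The snippet below is a minimal sketch of choosing evenly spaced frame indices. The function name, the 30 fps frame rate, and the choice of taking the middle frame of each segment are assumptions for illustration, not the script we actually ran.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return up to num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the frame in the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical clip: 30 seconds at an assumed 30 fps gives 900 frames.
indices = sample_frame_indices(total_frames=900, num_samples=30)
print(indices[:3], len(indices))  # [15, 45, 75] 30
```

The returned indices can then be passed to whatever video reader is in use to decode only those frames.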

At that point, we had a large folder full of shoes photographed from different angles, on different floors, under different lighting conditions, and with different things appearing in the background.

The next problem was removing those backgrounds.

Removing the background was more expensive than it looked

For the 3D reconstruction to focus on the shoe, we needed a mask for each image: the pixels belonging to the shoe should remain, and everything else should be ignored.

We first used Meta's Segment Anything Model, or SAM.

The quality was good, but SAM required more than 6 GB of GPU memory in our setup. Hosting something that large only to separate a shoe from its background did not make much sense for the kind of application we imagined.

We wanted a much smaller model that could do this specific job.

But training a segmentation model requires an annotated dataset, and manually tracing the outline of a shoe in around 3,000 images would have taken a long time.

So we used the larger model to help create the dataset for the smaller one.

First, we trained a shoe detector using an existing dataset. The detector found a bounding box around the shoe. We then passed that box to SAM, which produced a detailed segmentation mask.

This automated most of the annotation process.

It was not perfect.

Sometimes the detector selected the wrong area. Sometimes SAM missed part of the shoe or included some of the background. A few frames did not contain a useful view of the shoe at all.

We still went through the generated annotations manually, corrected the bad ones, and removed the frames that should not have been there.

It was much faster than drawing every mask from the beginning, but there was still a human cleanup step between "the model generated an annotation" and "the dataset was trustworthy."

Avoiding an easy mistake in the dataset

When we divided the images into training, validation, and test sets, we had to be careful about one subtle problem.

Each 30-second video produced many similar images of the same shoe.

If frames from one video appeared in both the training and test sets, the model could appear unusually accurate simply because it had already seen nearly identical images during training.

So we wrote the split around videos rather than individual images.

All images extracted from one video stayed together in the same split.

That gave us a more honest test: the model had to segment shoes from videos it had not already seen.
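A video-level split like the one described above can be sketched as follows. The function name, the 80/10/10 ratios, and the fixed seed are assumptions for illustration. The only property that matters is that every frame from a given video lands in exactly one split.

```python
import random
from collections import defaultdict


def split_by_video(frames, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (video_id, image_path) pairs so that no video spans two splits."""
    by_video = defaultdict(list)
    for video_id, image_path in frames:
        by_video[video_id].append(image_path)

    # Shuffle the videos, not the individual frames.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n_train = int(len(video_ids) * ratios[0])
    n_val = int(len(video_ids) * ratios[1])
    assignment = {
        "train": video_ids[:n_train],
        "val": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    return {
        name: [(vid, path) for vid in ids for path in by_video[vid]]
        for name, ids in assignment.items()
    }


# Hypothetical input: 10 videos with 5 sampled frames each.
frames = [(f"video_{v:03d}", f"video_{v:03d}/frame_{i:02d}.jpg")
          for v in range(10) for i in range(5)]
splits = split_by_video(frames)
print({name: len(items) for name, items in splits.items()})
```

Shuffling frames directly would be simpler, but it would scatter near-duplicate images of the same shoe across training and test sets, which is exactly the leak this avoids.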

We trained several alternatives, including YOLOv8 models and a UNet-based model. They all performed reasonably well. The best model achieved an Intersection over Union (IoU) score of around 0.95.
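For readers unfamiliar with the metric, Intersection over Union compares a predicted mask with the reference mask: the number of pixels both mark as shoe, divided by the number of pixels either one marks as shoe. Below is a minimal sketch on tiny made-up masks. The nested-list representation and the function name are assumptions chosen to keep the example dependency-free.

```python
def mask_iou(pred, target):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU 1.0.
    return intersection / union if union else 1.0


# Made-up 3x3 masks: the prediction marks one extra pixel as shoe.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 0],
          [0, 0, 0]]
print(mask_iou(pred, target))  # 0.75
```

A score of 1.0 means the masks match exactly, so a value around 0.95 leaves only a thin band of disagreement, usually along the outline.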

More importantly for the actual application, the final segmentation model was only about 6.5 MB.

We had gone from depending on a model measured in gigabytes to a small model trained specifically for the problem we had.

Reconstructing the shoe

Once the background had been removed, we could begin turning the images into a 3D representation.

The first step was COLMAP.

COLMAP looked for visual features that appeared across multiple images. From those correspondences, it estimated where the camera had been positioned when each frame was captured and produced an initial point cloud.

This gave us the basic geometry and camera information required for the next stage.

We then trained a Gaussian Splatting model for each shoe.

Gaussian Splatting represents an object using a large number of small 3D Gaussians. Each Gaussian stores properties such as its position, color, opacity, scale, and orientation. When all of them are rendered together, they reproduce the appearance of the shoe from different viewpoints.
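As a rough picture of what one of those Gaussians holds, here is a minimal sketch. The class and function names are hypothetical, and the plain RGB color is a simplification: the standard 3D Gaussian Splatting formulation stores spherical-harmonic coefficients so that color can change with viewing direction. The covariance construction (rotation times scale, multiplied by its own transpose) does follow that formulation.

```python
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    position: tuple   # (x, y, z) centre of the Gaussian
    color: tuple      # simplified RGB in [0, 1]; an assumption, see above
    opacity: float    # 0 = fully transparent, 1 = fully opaque
    scale: tuple      # extent along the Gaussian's three local axes
    rotation: tuple   # unit quaternion (w, x, y, z) giving its orientation


def rotation_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Covariance = (R S)(R S)^T, where S is the diagonal scale matrix."""
    r = rotation_matrix(g.rotation)
    m = [[r[i][j] * g.scale[j] for j in range(3)] for i in range(3)]
    return [[sum(m[i][j] * m[k][j] for j in range(3)) for k in range(3)]
            for i in range(3)]


# One made-up Gaussian with no rotation: a small, flattened ellipsoid.
g = Gaussian3D(position=(0.0, 0.0, 0.0), color=(0.8, 0.1, 0.1), opacity=0.9,
               scale=(0.02, 0.01, 0.005), rotation=(1.0, 0.0, 0.0, 0.0))
cov = covariance(g)
print([cov[i][i] for i in range(3)])  # squared scales along each axis
```

Storing scale and rotation separately, rather than a raw covariance matrix, keeps every Gaussian a valid ellipsoid while its parameters are being optimised.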

The training itself was surprisingly fast. A typical run of around 7,000 iterations took roughly ten minutes in our setup.

The result looked good when rendered as a Gaussian scene. You could rotate around the shoe and see it from viewpoints that had not appeared directly in the input video.

But there was another problem.

Augmented-reality tools wanted a mesh

Gaussian Splatting produced a good visual representation, but most existing 3D and augmented-reality tools were built to work with polygon meshes.

We could not simply take the splat file and place it inside the mobile AR experience.

We needed to convert it.

For this, we used SuGaR, which encourages the Gaussians to align with the object's surface and then extracts a mesh from them.

This was the slowest part of the reconstruction pipeline. The refinement ran for around 9,000 iterations, and extracting one mesh took approximately an hour.

The first meshes also contained a lot of unwanted black geometry.

When the images were masked earlier in the pipeline, the removed background had been filled with black pixels. The reconstruction treated that black as part of the scene, and it later appeared as black faces in the extracted mesh.

So we wrote another cleanup script.

It removed the black vertices and kept the largest connected component, which was usually the actual shoe. We then used Blender to combine and prepare the object, material, and texture files and export them into a format the AR system could use.
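The cleanup logic can be sketched in plain Python as below. This is an illustration of the idea rather than our actual script: the function name, the 0.05 brightness threshold, and the list-based mesh representation are assumptions. It drops near-black vertices, discards faces that touched them, groups the remaining faces into connected components with union-find, and keeps the component with the most faces.

```python
def clean_mesh(vertices, colors, faces, black_threshold=0.05):
    """Drop near-black vertices, then keep the largest connected component."""
    # A vertex counts as black if all of its RGB channels are near zero.
    keep = [max(c) > black_threshold for c in colors]
    faces = [f for f in faces if all(keep[i] for i in f)]

    # Union-find over vertex indices: vertices sharing a face are connected.
    parent = list(range(len(vertices)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, c in faces:
        parent[find(b)] = find(a)
        parent[find(c)] = find(a)

    # Count faces per component and keep only the largest one.
    counts = {}
    for f in faces:
        root = find(f[0])
        counts[root] = counts.get(root, 0) + 1
    if not counts:
        return [], [], []
    largest = max(counts, key=counts.get)
    faces = [f for f in faces if find(f[0]) == largest]

    # Re-index so the surviving vertices are numbered from zero.
    used = sorted({i for f in faces for i in f})
    remap = {old: new for new, old in enumerate(used)}
    return ([vertices[i] for i in used],
            [colors[i] for i in used],
            [tuple(remap[i] for i in f) for f in faces])


# Made-up mesh: a 2-face "shoe", a 1-face stray fragment, and black geometry.
vertices = [(float(i), 0.0, 0.0) for i in range(10)]
colors = [(0.6, 0.4, 0.2)] * 7 + [(0.0, 0.0, 0.0)] * 3
faces = [(0, 1, 2), (1, 2, 3), (4, 5, 6), (7, 8, 9), (3, 7, 8)]
v, c, f = clean_mesh(vertices, colors, faces)
print(len(v), f)  # 4 [(0, 1, 2), (1, 2, 3)]
```

One caveat of a color threshold is that it would also remove genuinely black parts of a shoe, so the threshold needs checking against each result.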

The complete path had become:

Short video of a shoe
-> sampled images
-> shoe segmentation
-> camera estimation and point cloud
-> Gaussian Splatting reconstruction
-> mesh extraction
-> mesh cleanup
-> mobile-ready 3D shoe

Every arrow in the diagram hid another problem we had not known about when we started.

Putting the shoe on a foot

Once we finally had a usable mesh, we could return to the part that had originally attracted us to the project: augmented reality.

We used Lens Studio for foot and leg tracking.

We created models for both the left and right shoes and aligned them with the user's detected feet. The system had to follow the position and orientation of the legs as the person moved.

Alignment alone was not enough.

For the shoe to look believable, parts of the real foot and leg had to correctly appear in front of or behind the virtual object.

Without occlusion, the digital shoe could look like it was simply floating over the image.

We used transparent cylinders through the openings of the shoes to help create the occlusion effect. It was a practical workaround rather than some perfect understanding of the foot, but it made the result look more natural.

We published the experience through Lens Studio and connected it to a Flutter mobile application using Snap's AR tooling.

Inside the application, a user could inspect the reconstructed shoe in 3D and then open the camera to see the shoe placed on their own feet.

That was the moment when all the independent parts finally became one product.

It was a chain, not one model

Looking back, the interesting part of this project was not any single model.

The application depended on a chain of systems:

Collecting a useful video
Sampling the frames
Detecting and segmenting the shoe
Estimating the camera positions
Reconstructing the object with Gaussian Splatting
Converting the splats into a mesh
Cleaning the mesh
Tracking the feet
Handling occlusion
Loading everything inside a mobile application

A failure in any one part affected everything after it.

A poor mask produced a poor reconstruction. A good Gaussian reconstruction could still produce a messy mesh. A good mesh could still look wrong if the shoe was badly aligned with the foot.

This was my first experience with a project where several different technical systems had to work together before the user could experience anything useful.

What worked, and what did not

The reconstructed shoes achieved strong image-quality results, with average peak signal-to-noise ratio (PSNR) values in the low-to-mid thirties (in dB) across the tested examples.
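PSNR compares a rendered view with the real photograph taken from the same viewpoint: it is 10 times the base-10 logarithm of the squared maximum pixel value divided by the mean squared error, so higher is better. The sketch below uses made-up pixel values, not data from our experiments, and the flat-list representation and function name are assumptions.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """PSNR in dB between two equal-length flat lists of pixel values."""
    squared_error = sum((a - b) ** 2 for a, b in zip(reference, rendered))
    mse = squared_error / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit pixel values where every rendered pixel is off by 5.
reference = [100, 150, 200, 250]
rendered = [105, 145, 205, 245]
print(round(psnr(reference, rendered), 2))  # 34.15
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.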

The segmentation models also performed well, and we reduced the masking model from something requiring several gigabytes of memory to a model of around 6.5 MB.

But the prototype was not ready to replace trying on a real shoe.

The quality still depended heavily on the original video. Reflective materials, missing viewpoints, blur, and poor lighting could all degrade the reconstruction. Mesh extraction took much longer than Gaussian training: roughly an hour per shoe compared with about ten minutes. Foot alignment and occlusion could still fail in unusual poses.

And seeing a shoe on your feet is not the same as knowing whether it will fit.

The project dealt with appearance, not physical comfort or sizing.

Still, it showed that the complete idea was technically possible: a relatively short phone video could become a reconstructed 3D shoe, and that shoe could then appear on someone's feet through a mobile camera.

What it led to

We wrote the work up as a paper and continued thinking about how the pipeline could extend beyond shoes to other products in fashion.

But its biggest effect on me came later.

Before this project, 3D reconstruction had mostly been something I understood through individual techniques and examples. This was the first time we had taken it through a complete application: from collecting our own data to producing a model that a person could interact with on a phone.

It also gave us the experience that later made a much larger idea seem possible.

For our next project, instead of reconstructing an object that could fit on a table, we tried reconstructing parts of Patan Durbar Square.

A shoe was where we learned the pipeline.

The temples were where we tested how far we could take it.