Reconstructing Patan Durbar Square from videos

For our third-year minor project, my friends and I worked on reconstructing shoes in 3D for a virtual try-on application.

By the time we had to choose our fourth-year major project, we had some experience with 3D reconstruction. Jyoti sir suggested that we try applying what we had learned to something much larger: cultural heritage sites.

That eventually led us to Patan Durbar Square.

A shoe can be placed on a table, photographed from every direction, and captured again whenever something goes wrong.

A temple cannot.

It is always there, surrounded by people, birds, changing light, shadows, neighbouring buildings, and parts that are difficult to see from the ground. We soon learned that reconstructing a public heritage site was as much a data-collection problem as a machine-learning problem.

The project eventually became our paper, “3D reconstruction of cultural heritage sites: A case study of Patan Durbar Square”, published in Elsevier’s Digital Applications in Archaeology and Cultural Heritage.

You can also watch the final reconstruction here.

Final reconstruction video from the Patan Durbar Square project.

The first challenge was deciding when to go

Before we could train anything, we had to capture the square properly.

We first had to figure out which time of day would give us usable footage. If we went when it was crowded, people would block large parts of the structures. If the sunlight was too harsh, the shadows and brightness could change sharply as we walked around a temple. If the footage was blurred or did not overlap enough, the reconstruction could fail later.

There was no way to know immediately whether a capture was good enough. We could spend time recording a structure, return with gigabytes of data, process it, and only then discover that some angle or section had not been captured properly.

We recorded videos while walking around the structures at different distances and heights. We tried to keep neighbouring frames overlapping so that the system could understand that two images were looking at the same physical points from slightly different positions.

We captured sites including Krishna Mandir, Vishwanath Temple, Charnarayan Temple, Maharani Hiti, Chyasin Dega, courtyards, statues, and several other structures around Patan Durbar Square.

For each site, we later extracted hundreds, and sometimes more than a thousand, frames from the videos.

We came back with gigabytes of images.

That was only the beginning.
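The frame extraction mentioned above is essentially a sampling decision: keep too few frames and neighbouring views stop overlapping, keep too many and processing time grows with little new information. Here is a minimal sketch of that decision, assuming a fixed sampling rate. The function name and the rate of two frames per second are illustrative assumptions, not the values used in the project.

```python
from typing import List


def select_frame_indices(total_frames: int, video_fps: float,
                         frames_per_second: float) -> List[int]:
    """Return the indices of the video frames to keep.

    frames_per_second is how many frames to keep per second of video.
    A higher value gives more overlap between neighbouring frames but
    more images to process later.
    """
    if total_frames < 0 or video_fps <= 0 or frames_per_second <= 0:
        raise ValueError("frame count must be non-negative and rates positive")
    # Keep every `step`-th frame, and never skip below one frame per step.
    step = max(1, round(video_fps / frames_per_second))
    return list(range(0, total_frames, step))


# Hypothetical example: a 10-second clip recorded at 30 fps,
# keeping roughly 2 frames per second.
indices = select_frame_indices(total_frames=300, video_fps=30.0,
                               frames_per_second=2.0)
```

With these assumed numbers, a 10-second clip yields 20 frames, so a few minutes of walking around a structure quickly turns into several hundred images.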

We wanted a view from above

One thing we wanted was drone footage.

From the ground, it is difficult to capture rooftops and elevated architectural details. You can walk around a temple many times and still never show the camera what the top actually looks like.

We tried to get permission to use a drone at Patan Durbar Square, but we were not able to get access.

So most of our Patan data had to be captured from the ground.

This later became one of the clearest limitations of the project. The reconstruction looked good from the angles represented in the videos, but areas that the camera had never properly seen could become incomplete or distorted.

It was a useful lesson: a model cannot reconstruct information that was never present in the data.

Turning thousands of images into a scene

Once we had the footage, we extracted individual frames and passed them through COLMAP.

COLMAP tries to identify the same visual features across different images. From those matches, it estimates where the camera was positioned when each image was captured and generates an initial sparse point cloud of the scene.

In simple terms, it takes a large collection of overlapping images and begins to work out how they relate to one another in 3D space.
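The COLMAP stage described above can be sketched as three calls to its command-line tool: feature extraction, feature matching, and sparse mapping. The sketch below only builds the commands and does not run them. The directory names are hypothetical, and the choice of the sequential matcher (which suits frames taken in order from a video) is an assumption, not a record of the settings we used.

```python
from pathlib import Path
from typing import List


def build_colmap_commands(image_dir: str, workspace: str) -> List[List[str]]:
    """Build the COLMAP commands for a basic sparse reconstruction."""
    database = str(Path(workspace) / "database.db")
    sparse_dir = str(Path(workspace) / "sparse")
    return [
        # 1. Detect visual features in every image.
        ["colmap", "feature_extractor",
         "--database_path", database,
         "--image_path", image_dir],
        # 2. Match features between images (assumed: sequential matching,
        #    since video frames are ordered in time).
        ["colmap", "sequential_matcher",
         "--database_path", database],
        # 3. Estimate camera poses and the sparse point cloud.
        ["colmap", "mapper",
         "--database_path", database,
         "--image_path", image_dir,
         "--output_path", sparse_dir],
    ]


# Hypothetical paths, for illustration only.
commands = build_colmap_commands("frames/krishna_mandir", "work/krishna_mandir")
```

Each inner list can be passed to `subprocess.run(command, check=True)` once COLMAP is installed and the output directory exists.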

That camera information and sparse point cloud then became the starting point for Gaussian Splatting.

Instead of representing the site as a traditional polygon mesh, Gaussian Splatting represents it as a large collection of small 3D Gaussians. Each one has properties such as position, color, shape, scale, and opacity.

When rendered together, they reproduce the appearance of the original scene from different viewpoints.
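To make "rendered together" concrete, here is a simplified sketch of how the Gaussians covering one pixel are blended from nearest to farthest. It assumes a plain RGB color per Gaussian and takes the projected falloff of each Gaussian at the pixel as a given number. Real implementations store view-dependent color and compute that falloff from the projected shape, so treat the class and function names here as illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian3D:
    position: Tuple[float, float, float]         # centre in world space
    scale: Tuple[float, float, float]            # extent along each local axis
    rotation: Tuple[float, float, float, float]  # orientation as a quaternion
    color: Tuple[float, float, float]            # simplified to plain RGB
    opacity: float                               # 0 = transparent, 1 = opaque


def composite_pixel(
    hits: List[Tuple[Gaussian3D, float, float]]
) -> Tuple[float, float, float]:
    """Blend the Gaussians covering one pixel, nearest first.

    Each hit is (gaussian, depth, falloff), where falloff in [0, 1] is how
    strongly the projected Gaussian covers this pixel.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for gaussian, _depth, falloff in sorted(hits, key=lambda hit: hit[1]):
        alpha = gaussian.opacity * falloff
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * gaussian.color[channel]
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2])


# A half-transparent red Gaussian in front of an opaque blue one.
red = Gaussian3D((0, 0, 1), (1, 1, 1), (1, 0, 0, 0), (1.0, 0.0, 0.0), 0.5)
blue = Gaussian3D((0, 0, 2), (1, 1, 1), (1, 0, 0, 0), (0.0, 0.0, 1.0), 1.0)
pixel = composite_pixel([(blue, 2.0, 1.0), (red, 1.0, 1.0)])
```

The red Gaussian contributes half of the pixel and lets the other half through to the blue one behind it, which is why a stray semi-transparent Gaussian in mid-air tints everything seen through it.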

The first time the reconstruction began to look like the actual square was exciting. You could move the camera through a scene that had been created from ordinary videos captured on a phone.

But the first results also had many strange things inside them.

People became floating artifacts

Patan Durbar Square is a public place. People were constantly walking through the videos. Birds would appear for a few frames and then disappear. Objects moved. Lighting changed.

The reconstruction method assumes that the scene is mostly static.

A temple remains in the same place across every frame. A person does not.

Because a person appeared in some images but not others, the system sometimes reconstructed pieces of them as blurry floating artifacts. The same happened with birds and other moving objects. From one angle, the scene could look clean. As soon as we moved the virtual camera, a strange collection of colors or Gaussian blobs might appear in the air.

The baseline reconstruction could be technically successful and still look obviously wrong.

At first, we removed some of these blobs manually. We would inspect the scene, identify unwanted Gaussians, and clean them up.

That worked, but doing it manually for every site would not scale.

So we began experimenting with removing the unwanted objects before training the reconstruction.

Masking the things that should not be there

We used GroundingDINO to detect objects such as people and birds in the extracted frames.

GroundingDINO could take a text prompt like “person” or “bird” and return a bounding box around each place in the image where that object appeared.

We then passed those detected areas to the Segment Anything Model, or SAM, to produce a more precise mask around each object.

During reconstruction, those masked regions were ignored.

This meant the system could learn from the temple, ground, walls, and surrounding structures without trying to reconstruct every person who happened to walk past the camera.
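The idea of ignoring masked regions can be sketched as a loss that simply skips the masked pixels. This is a conceptual example on tiny grayscale images, assuming a plain L1 loss. The function name is hypothetical and the actual training code in Splatfacto may handle masks differently.

```python
from typing import List


def masked_l1_loss(rendered: List[List[float]],
                   target: List[List[float]],
                   ignore_mask: List[List[bool]]) -> float:
    """Mean absolute error over pixels that are NOT masked out.

    ignore_mask[y][x] is True where a person or bird was segmented,
    so that pixel contributes nothing to the loss.
    """
    total = 0.0
    count = 0
    for rendered_row, target_row, mask_row in zip(rendered, target, ignore_mask):
        for rendered_px, target_px, ignored in zip(rendered_row, target_row, mask_row):
            if ignored:
                continue
            total += abs(rendered_px - target_px)
            count += 1
    return total / count if count else 0.0


# 2x2 toy image: the top-right pixel is covered by a passer-by in the photo.
rendered = [[0.2, 0.9], [0.4, 0.4]]
target = [[0.2, 0.1], [0.4, 0.5]]
mask = [[False, True], [False, False]]

loss_with_mask = masked_l1_loss(rendered, target, mask)
loss_without_mask = masked_l1_loss(rendered, target, [[False, False], [False, False]])
```

Without the mask, the large error at the passer-by pixel dominates the loss and pushes the model to explain it, which is where the floating artifacts come from. With the mask, only the static scene drives the optimization.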

The difference was easy to see. Many of the floating human-shaped and bird-shaped artifacts disappeared.

Interestingly, the improvement was not always obvious in the usual numerical metrics.

A model with visible artifacts could sometimes receive a similar or even better score than one that looked cleaner to us. Metrics such as PSNR and SSIM measure particular kinds of image similarity, but they do not always capture whether a floating blob ruins the experience of moving through a 3D scene.
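A toy calculation shows why this can happen. PSNR is computed from the mean squared error over all pixels, so a small but glaring blob can cost less than a faint error spread across the whole image. The images below are synthetic and exist only to illustrate the formula. They are not results from our experiments.

```python
import math
from typing import List

Image = List[List[float]]


def psnr(reference: Image, image: Image, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    squared_error = 0.0
    count = 0
    for reference_row, image_row in zip(reference, image):
        for reference_px, image_px in zip(reference_row, image_row):
            squared_error += (reference_px - image_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def flat_image(size: int, value: float) -> Image:
    return [[value] * size for _ in range(size)]


reference = flat_image(100, 0.5)

# Case 1: every pixel is slightly too bright (barely noticeable).
tinted = flat_image(100, 0.53)

# Case 2: the image is perfect except for a bright 5x5 blob.
blob = flat_image(100, 0.5)
for y in range(5):
    for x in range(5):
        blob[y][x] = 1.0

psnr_tinted = psnr(reference, tinted)
psnr_blob = psnr(reference, blob)
```

In this example the image with the blob scores about 32.0 dB, while the uniformly tinted image scores about 30.5 dB, even though most viewers would consider the blob the worse defect.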

That made visual inspection an important part of our evaluation.

We had to look at the metrics, but we also had to actually enter the scene and move around it.

Trying different versions of Gaussian Splatting

We did not settle on the first implementation that produced a result.

We compared the original Gaussian Splatting pipeline with Splatfacto, Nerfstudio's Gaussian Splatting implementation built on the gsplat library.

Splatfacto was faster and gave us easier access to additional techniques such as masking and bilateral grids.

The bilateral grid helped with a different problem.

Phone cameras do not always process every frame in exactly the same way. Exposure, brightness, and other image adjustments can change slightly as the camera moves. In the final reconstruction, those inconsistencies could appear as bright-colored floaters, especially around the sky, or create uneven-looking surfaces.

Using a bilateral grid helped the model account for some of those image-processing differences.

It increased training time, but the training was still manageable. A scene that took around seven or eight minutes in the simpler setup could take around twenty minutes with the bilateral grid.

We tried several combinations:

The original Gaussian Splatting implementation
Splatfacto without additional techniques
Splatfacto with masking
Splatfacto with masking and bilateral grids
Separate training for individual structures
Joint training for neighbouring structures

Some of the results were not what we expected.

Training two adjacent sites together sometimes produced a more continuous reconstruction than training them separately and placing them beside one another later. The overlapping images helped the model understand how the two structures related spatially.

But combining more and more sites did not continue improving the result.

When we combined three larger datasets, the output became worse. Increasing the number of Gaussians did not solve it.

More data was not automatically better. The additional scale and complexity made the optimization harder.

Some blobs still had to be removed by hand

Masking and bilateral grids removed many artifacts, but they did not make every reconstruction perfect.

Some areas had not been captured with enough detail. Some masks were inaccurate. Some residual Gaussians still appeared where they should not.

We imported the reconstructed models into Blender and manually cleaned the remaining artifacts.

This was one of the less glamorous parts of the project.

You could automate most of the pipeline, train sophisticated models, and evaluate multiple configurations, but the final scene could still contain one bright blob floating beside a temple that had to be found and removed manually.

The final output was a combination of automated reconstruction, experiments, visual judgment, and patient cleanup.

Building a way to explore the result

A reconstructed scene stored on our computers was not very useful by itself.

We wanted people to be able to move through Patan Durbar Square virtually, so we built a web viewer for the Gaussian splats.

The raw splat files were large, so we worked on compressing and loading them in a way that could still render smoothly in a browser.

We also added a map interface. A user could choose different locations around Patan Durbar Square, move between reconstructed sites, and read information about the structures.

That was the point where the project began to feel less like a collection of experiments and more like something a normal person could experience.

The result was not a recorded fly-through or a stitched panoramic video. It was an interactive 3D environment. A person could change the viewpoint and move through the reconstructed space.

Then we had to write the paper

Building the system was only one part of the project.

We also had to explain what we had done, compare the methods, organize the experiments, create visualizations, report the metrics, discuss what failed, and place the work in the context of existing research.

I worked across the software, data collection, visualization, experiments, and writing.

Writing the paper forced us to be more precise about claims that had felt obvious while building.

For example, we could see that masking removed artifacts. But how should we measure that? What happened when the visual quality improved but standard metrics barely changed? Why did combining two datasets help while combining three made the result worse? What could the system genuinely support, and what remained future work?

We had to turn months of experiments, failed runs, images, models, and observations into one coherent account.

The paper went through writing, revision, and review before it was eventually accepted and published by Elsevier.

Seeing the work on ScienceDirect felt very different from submitting the college project.

The project had begun with Jyoti sir suggesting that we apply our previous 3D reconstruction experience to cultural heritage. It ended with a published paper and an actual pipeline that someone else could reproduce and extend.

What I remember from it

The polished output makes the process look more orderly than it was.

What I remember is choosing when to visit Patan, repeatedly walking around structures with a phone, returning with gigabytes of footage, discovering missing viewpoints, processing thousands of frames, waiting for COLMAP, watching training runs, finding people turned into floating blobs, trying masks, comparing scenes, cleaning Gaussian artifacts manually, and rewriting sections of the paper.

I also remember the part we could not do.

Not getting drone access meant that some roofs and higher areas remained poorly represented. It was disappointing, but it made the limitation of ground-only capture very clear. If we did the project again, aerial views would be one of the first additions.

Our reconstruction was not a perfect digital copy of Patan Durbar Square. It was an experiment in how far a small team could go using videos from a phone, modern reconstruction methods, and a lot of iteration.

The work showed that heritage sites could be documented without specialized scanning equipment, and that the resulting models could support virtual exploration, education, research, and eventually restoration work.

For me, it also completed a strange progression.

We had started by reconstructing a shoe so someone could view it in augmented reality.

A year later, we were using the same curiosity to reconstruct temples and courtyards that had stood for centuries.