About a year ago I started investigating if it would be possible to create an immersive, engaging and realistic virtual presence experience for live theatre. The general idea would be to construct a live 3D representation of a stage, then allow a remote audience to select where they would like to see the show from. Perhaps virtual cameras could be used by a live video production team to broadcast a show with pre-defined edits, but without cameras interrupting the view or distracting the audience that is physically present in the theatre at the time. Those viewing remotely could use a Virtual Reality headset to watch the show, looking around them as they wanted, or even getting up and moving around without the risk of verbal abuse and projectiles being thrown at them.
Before I go any further, yes, I know plays are tightly controlled under licensing restrictions and that recording or broadcasting shows is normally strictly forbidden. That’s just a legal problem, one that can be overcome if people are willing. The technical challenges are the larger issue: it comes down to how far the limits of current technology can be pushed, and whether they can be pushed far enough. There are also questions over whether it’s appropriate for live theatre, and whether the experience will be the same. Granted, there’s nothing like live theatre in person, but live and pre-recorded broadcasts of performances, such as NT Live, are becoming increasingly common.
Originally my (casual) research centred around using Time-of-Flight cameras, which are used in devices like Microsoft’s Kinect, and operate by timing the interval between light leaving an infra-red emitter and its reflection arriving back at a sensor array. This tiny interval is then multiplied by the speed of light, and halved, to produce the distance to the object. This creates a ‘depth map’, which can then be mapped onto a full colour image from a separate photographic sensor to create a rough three-dimensional model of what’s in front of the sensors.
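To make that calculation concrete, here’s a minimal sketch of the idea, assuming a simple pinhole camera model; the variable names and intrinsics are made up for the example, not taken from any real sensor’s API:

```python
# Illustrative sketch only: how a time-of-flight reading becomes a 3D point.
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the object: the round-trip time multiplied by the
    speed of light, then halved (the light travels there and back)."""
    return round_trip_seconds * C / 2.0

def depth_map_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres per pixel) into camera-space
    XYZ points using assumed pinhole intrinsics fx, fy, cx, cy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (h, w, 3)

# A 33-nanosecond round trip corresponds to roughly 5 metres:
print(tof_distance(33e-9))  # ~4.95
```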

Visual artists have successfully used these devices to create interactive installations, sometimes combined with projection. The art can be formed around the capabilities and limitations of the technology, and the novelty is in the interactivity.
These sensors can only see what’s in front of them, and any objects in the foreground will occlude objects in the background. The depth data is therefore limited in its utility for creating a full representation of a scene on stage. This limitation can be mitigated by using many sensors in different positions, surrounding the area to be watched. The sensor data can then be processed to create a better defined model of the stage environment. At least, that’s the theory. Unless some form of synchronisation is employed, or differing wavelengths can be used, the light emitted from one emitter/sensor pair could be received by other sensors to completely confuse them.
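As a sketch of the synchronisation idea, time-division multiplexing would look something like the following; the Sensor class and its trigger() method are entirely hypothetical placeholders, since no real device exposes anything this neat:

```python
# Give each emitter/sensor pair its own timeslot so that no two infra-red
# pulses are in flight at once, and one sensor's pulse can't be mistaken
# for another's return.
import time

class Sensor:
    def __init__(self, name: str):
        self.name = name

    def trigger(self) -> None:
        # In reality this would fire the IR emitter and read back a depth frame.
        print(f"{self.name}: capture")

def round_robin_capture(sensors, slot_seconds=0.005, frames=3):
    """Time-division multiplexing: only one sensor emits per slot."""
    for _ in range(frames):
        for sensor in sensors:
            sensor.trigger()
            time.sleep(slot_seconds)  # guard interval before the next emitter fires

round_robin_capture([Sensor("front"), Sensor("left"), Sensor("right")])
```

The obvious cost is that each sensor only gets a fraction of the timeslots, which eats into the already limited capture rate.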
There are much bigger problems than confused sensors though. The range the sensors can operate over is severely limited, generally around 5 or 6 metres, maybe 10 metres for higher-end models. It might be just about possible to get coverage of a small stage, but in general that range is just too short. They are also very limited in resolution, being roughly equivalent to an early webcam. In addition, the capture rate is limited, a restriction that becomes compounded when you start to synchronise multiple transceivers. They are also likely to produce odd results when they encounter stage effects such as smoke, haze and snow.
I gave up on the idea after the above flaws mounted up. Recently, however, there have been incredible advances in a similar technology: light-field cameras.
These cameras capture normal wavelengths of light, but in addition to colour and intensity, they also measure the direction of the light. This information allows the depth to be inferred and mapped to individual pixels, leading to some remarkable consequences. One example frequently paraded is the ability to select the focal point of a photograph after it’s been shot. Yes, that’s right, you take a photo and decide afterwards what you want to be in focus… and we’re not talking about just applying some Gaussian blurring to fake a depth-of-field effect, you’re effectively selecting which rays of light you want to include to produce the image.
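For a flavour of how that refocusing works, the classic approach is “shift-and-add” over the grid of sub-aperture views: each view is shifted in proportion to its offset from the central view, then everything is averaged. Here’s a rough sketch; the array layout and parameter names are assumptions for the example, not anything a real camera exports:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def refocus(light_field: np.ndarray, alpha: float) -> np.ndarray:
    """light_field: assumed shape (U, V, H, W, 3) of sub-aperture images.
    alpha controls which depth plane ends up in focus."""
    U, V, H, W, _ = light_field.shape
    u0, v0 = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((H, W, 3), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            dy, dx = alpha * (u - u0), alpha * (v - v0)
            # Shift this sub-aperture image so rays from the chosen
            # depth plane line up across all views, then accumulate.
            out += nd_shift(light_field[u, v].astype(np.float64),
                            (dy, dx, 0), order=1)
    return out / (U * V)
```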
Such cameras have been around for years, but have only recently started to become commercialised. One of the leaders in the field is Lytro, who started off with a tubular camera in the shape of a lipstick. This was mostly a novelty device, but they have now developed three more ‘serious’ products: the Illum, a mirrorless DSLR-style handheld camera; Lytro Immerge, designed for VR; and Lytro Cinema, for Hollywood-scale video productions.
Details on these products are thin on the ground, especially for Immerge and Cinema, but just imagine what one could do with a few of these. They could, potentially, replace the ToF cameras in the Virtual Reality Theatre concept. I must confess I got ridiculously excited when I discovered the Immerge and Cinema cameras last week.
The amount of data produced from these cameras is truly monstrous. The Cinema and Immerge cameras are paired with a huge, fast storage array. To process this data in real-time would require immense computing capacity, oodles of fast intermediate storage (or an awful lot of RAM) and serious networking to tie it all together. It’s certainly not the sort of thing a small theatre could afford to do themselves, or probably a large theatre either. Maybe it’s the sort of thing a specialist business would do, offering up a mobile service-for-hire, with the processing equipment stored across several large vehicles (similar to what’s done for live outdoor broadcasts) plus generators to power it all.
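To get a feel for the scale, here’s some back-of-envelope arithmetic; every figure in it is an assumption picked for illustration, not a published specification for any real camera:

```python
# Why the data rate is "monstrous" — purely illustrative numbers.
megapixels = 100          # assumed per-frame resolution of one light-field camera
bytes_per_pixel = 8       # assumed: colour plus per-pixel directional/depth data
frames_per_second = 60
cameras = 6               # assumed rigs surrounding the stage

bytes_per_second = megapixels * 1e6 * bytes_per_pixel * frames_per_second * cameras
print(f"{bytes_per_second / 1e9:.0f} GB/s")                       # ~288 GB/s
print(f"{bytes_per_second * 7200 / 1e15:.1f} PB for a two-hour show")  # ~2.1 PB
```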
In a solution that allows arbitrary virtual camera locations to be viewed, with multiple light field cameras capturing simultaneously, there’d need to be a pipeline something like this:
Raw data -> depth map -> combination -> 3D model -> camera rendering -> transcoding -> delivery
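To make the shape of that pipeline a bit more concrete, here’s a purely hypothetical skeleton; every function name is a placeholder showing how the stages would hand data to one another, not a working implementation:

```python
def build_depth_map(raw_frame):          # raw data -> depth map (per camera)
    ...

def combine(depth_maps, calibration):    # combination -> single point cloud / mesh
    ...

def render_virtual_camera(model, pose):  # 3D model -> 2D frame for one viewpoint
    ...

def transcode(frame):                    # compress the rendered frame for delivery
    ...

def process_tick(raw_frames, calibration, viewer_poses):
    depth_maps = [build_depth_map(f) for f in raw_frames]
    model = combine(depth_maps, calibration)
    return [transcode(render_virtual_camera(model, pose)) for pose in viewer_poses]
```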
Whether it uses a point cloud or a texture-mapped polygonal representation, the 3D model would probably have to be very high quality to make it believable. This will be the most computationally intense part of the system. Once the model is constructed, it’s relatively trivial to render the images for individual virtual cameras as required. The remainder is a matter of standard video delivery.
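As an illustration of why the rendering step is the comparatively easy part, here’s a minimal sketch that splats a coloured point cloud through a pinhole virtual camera with a z-buffer; the camera model and parameter names are assumptions for the example:

```python
import numpy as np

def render_point_cloud(points, colours, R, t, fx, fy, cx, cy, width, height):
    """points: (N, 3) world coordinates; colours: (N, 3) uint8; R, t: camera pose."""
    cam = points @ R.T + t                      # world -> camera space
    in_front = cam[:, 2] > 0                    # discard points behind the camera
    cam, colours = cam[in_front], colours[in_front]
    u = (fx * cam[:, 0] / cam[:, 2] + cx).astype(int)
    v = (fy * cam[:, 1] / cam[:, 2] + cy).astype(int)
    image = np.zeros((height, width, 3), dtype=np.uint8)
    zbuf = np.full((height, width), np.inf)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside, 2], colours[inside]):
        if zi < zbuf[vi, ui]:                   # keep the nearest point per pixel
            zbuf[vi, ui] = zi
            image[vi, ui] = ci
    return image
```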
There is still a barrier to viewing these live performances through a VR headset, however: that of latency. This is one of the biggest problems that manufacturers like Oculus and HTC have needed to address to prevent motion sickness. The time between moving your head and the image updating to reflect the change in view (often called motion-to-photon latency, and commonly cited as needing to stay under roughly 20 milliseconds) must be kept tiny to prevent the brain from rejecting it, and consequently rejecting the contents of the viewer’s stomach.
This is a problem for the Virtual Theatre system because it requires that the position change be sent over the Internet to the rendering system, and the resulting video sent back to the viewer’s location and then into the headset display. Latencies on the Internet may appear small enough over a fast connection when you’re browsing sites, but for VR the times are huge. Perhaps this problem could be mitigated by pre-empting the wearer’s movements, constantly sending several candidate viewpoints from which the viewer’s computer would then select. Maybe an alternative is to send a representation of the 3D model back to the viewer’s location and let it be rendered there. Or even have a clever mixture of local and remote rendering. Latency issues have been overcome before by clever workarounds, as evidenced by live game streaming services.
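To see how quickly the budget disappears, here’s a rough tally; all the stage timings are assumptions chosen only to illustrate the scale of the problem:

```python
# Rough motion-to-photon budget for the remote-rendering approach.
budget_ms = 20  # commonly cited target for comfortable VR

stages_ms = {
    "head pose -> local machine": 1,
    "upload pose to render farm": 15,   # assumed one-way Internet latency
    "render virtual camera frame": 10,
    "encode video frame": 5,
    "download frame to viewer": 15,
    "decode and display": 8,
}

total = sum(stages_ms.values())
print(f"total {total} ms vs budget {budget_ms} ms")   # 54 ms vs 20 ms
```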
I’m sure there are other flaws in my proposed system too, but by using light-field instead of time-of-flight cameras it might be a little closer to being technically feasible, even if it would require budgets only accessible to the tech giants.