Wealth of 2D-to-3D Conversion Techniques Available, Patents Show
CE makers building 2D-to-3D conversion chips into their 3D TVs and studios converting legacy 2D program content to 3D have been understandably secretive about the techniques they're using or studying to reproduce simulated 3D from 2D video. But published patents reveal far more detail about the available 2D-to-3D conversion techniques than manufacturers and studios are willing to share.
Though US 2009/0116732, filed by Samuel Zhou, Paul Judkins and Ping Ye in 2007, refers specifically to the conversion to 3D of Imax 2D material, it usefully explains the basic challenge confronting anyone trying to build a “depth map” from a single, flat 2D image. According to the patent, individual objects in a scene must first be isolated, by rotoscoping (tracing the contour of an object in every frame of the movie) or matting (using the characteristic color, brightness or motion of an object to construct a mask that follows it around). The relative depths of the isolated objects are then estimated to create the depth map.
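As a rough illustration of that last step, the sketch below (in Python, with hypothetical function and parameter names) assumes the isolation has already been done, meaning a boolean mask exists for each object from rotoscoping or matting, and simply paints each object's estimated relative depth into a per-pixel depth map, with nearer objects overwriting farther ones where they overlap. It is not drawn from the patent itself, only a minimal rendering of the idea.

```python
import numpy as np

def depth_map_from_masks(shape, object_masks, object_depths, background_depth=1.0):
    """Build a per-pixel depth map from isolated objects.

    object_masks  : list of boolean arrays (True where the object is)
    object_depths : list of relative depths (0.0 = nearest, 1.0 = farthest)
    Pixels not covered by any mask keep the background depth.
    """
    depth = np.full(shape, background_depth, dtype=np.float32)
    # Paint farther objects first so nearer objects overwrite them where they overlap.
    for mask, d in sorted(zip(object_masks, object_depths), key=lambda pair: -pair[1]):
        depth[mask] = d
    return depth

# Toy example: a 4x6 frame with one "object" occupying the center.
frame_shape = (4, 6)
mask = np.zeros(frame_shape, dtype=bool)
mask[1:3, 2:4] = True
print(depth_map_from_masks(frame_shape, [mask], [0.3]))
```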
One of the first patents granted on this technology (US 6,208,348) was filed by Michael Kaye and In-Three Inc. of Westlake Village, Calif., in 1998. To “dimensionalize” a 2D film, as the process is called, it is scanned into a computer, and cloned to produce two identical sequences, the patent said. One sequence (for the left eye) remains untouched, and a human operator works on the other (for the right eye), manually drawing around the edges of key objects in each scene, visually judging and slightly altering their relative depths to create the equivalent of footage shot with a native 3D camera, it said.
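A crude way to picture what the operator's depth judgments translate into, numerically, is a horizontal shift of each traced object in the right-eye copy, with the size of the shift standing in for how near the object should appear. The sketch below is only an assumption-laden illustration of that principle, not In-Three's actual method; the function name and the use of a single grayscale frame are invented for brevity.

```python
import numpy as np

def shift_object(frame, mask, disparity):
    """Create a right-eye frame by shifting one traced object horizontally.

    frame     : 2-D grayscale image (H x W)
    mask      : boolean array marking the object's pixels
    disparity : horizontal shift in pixels (a larger shift makes the object appear nearer)
    """
    right = frame.copy()
    right[mask] = 0                      # lift the object out, leaving a gap behind
    ys, xs = np.nonzero(mask)
    new_xs = np.clip(xs + disparity, 0, frame.shape[1] - 1)
    right[ys, new_xs] = frame[ys, xs]    # paste it back at the shifted position
    return right
```

Note that moving the object leaves a gap along its trailing edge, which is exactly the “hole” problem a Samsung filing discussed below addresses.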
Ideally, the patent says, an intelligent supercomputer, not a human, would make the decisions needed to dimensionalize 2D footage into 3D, “but unfortunately even with the technology as it is in 1998, software to accomplish such a task does not yet exist.” But even a supercomputer has its limits, the patent said. For example, it said, the human brain can recognize a bunch of bananas against a yellow background at any angle and under any lighting, but a computer can’t. Still, In-Three’s CEO, Neil Feldman, has since devised a “unique software suite” that “expedites and makes practical the conversion of any 2D motion images into realistic 3D images,” the company’s Web site says. Feldman “has helped In-Three evolve from a technology development organization into a new kind of post-production facility that is uniquely qualified to address the burgeoning market for compelling 3D content,” it says.
Most patents filed since In-Three’s, by a wide variety of inventors, claim ways of at least partly automating the process. But a common thread through them all is that there’s no magic-bullet way for a computer to estimate depth. The Zhou/Judkins/Ye patent US 2009/0116732 recommends a hybrid human/computer approach, predicated on the requirement that computer-generated results be viewed in 3D by humans and rejected if they look unnatural. This approach “is labor-intensive and time-consuming,” the inventors admit, but still necessary for good results, the patent said. It’s a philosophy that BSkyB in the U.K. has put into practice by writing strict technical requirements on what 3D program content it will accept for broadcast. The operator will broadcast 3D material that has been converted from 2D only if it comes from a production house that uses humans to check computer-generated results, and will do so only for commercials, not core program material, it has said.
This hybrid approach obviously can’t work for converter chips built into 3D TVs, such as those that Sony and Samsung are putting in their sets. US 2010/0104219, filed by Samsung in 2008, admits that when an object in a scene is moved to create depth, a hole is created in the scene along one edge of the object. This hole must then be filled with interpolated pixels, in much the same way that interpolation adds detail when SD video is upscaled to HD.
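A minimal sketch of that kind of hole filling, assuming a grayscale frame and a boolean mask of the hole pixels (both names are illustrative, not from the patent), might interpolate across each hole row from the nearest surviving pixels on either side:

```python
import numpy as np

def fill_holes_by_interpolation(frame, hole_mask):
    """Fill occlusion holes row by row by interpolating between the nearest
    valid pixels on either side of each hole."""
    filled = frame.astype(np.float32).copy()
    for y in range(frame.shape[0]):
        holes = np.nonzero(hole_mask[y])[0]
        valid = np.nonzero(~hole_mask[y])[0]
        if holes.size == 0 or valid.size < 2:
            continue
        # np.interp estimates each hole pixel from the surrounding valid samples.
        filled[y, holes] = np.interp(holes, valid, filled[y, valid])
    return filled
```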
Another Samsung filing (2010/0073364 from 2008) tells how the process of drawing around objects to create depth can be automated by detecting the difference in brightness between the two sides of an object’s edge. Brighter objects are therefore easier to work with, Samsung has said when demonstrating the 2D-to-3D conversion in its TVs. A companion Samsung patent (US 2010/0110070) tells how to improve edge detection by first artificially sharpening any blurred edges.
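Taken together, those two Samsung ideas amount to something like the following sketch: sharpen first, then flag pixels where brightness jumps between horizontal neighbors. The threshold, the sharpening strength and the function name are assumptions chosen for illustration, not values from the filings.

```python
import numpy as np

def detect_edges(gray, threshold=30.0, sharpen=True):
    """Flag pixels where brightness changes sharply between horizontal neighbors.

    With sharpen=True a simple unsharp mask is applied first, mimicking the idea
    of artificially sharpening blurred edges before detection.
    """
    img = gray.astype(np.float32)
    if sharpen:
        # 3-tap horizontal blur, then boost the difference from the blur
        # (columns wrap at the frame edge, which is harmless for a sketch).
        blurred = (np.roll(img, 1, axis=1) + img + np.roll(img, -1, axis=1)) / 3.0
        img = img + 1.5 * (img - blurred)
    # Brightness difference between each pixel and the one to its right.
    grad = np.abs(np.diff(img, axis=1))
    edges = np.zeros(gray.shape, dtype=bool)
    edges[:, :-1] = grad > threshold
    return edges
```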
US 2009/0295791, filed in 2008 by Microsoft, tells how to build a 3D view, for instance of the Great Pyramids of Giza, by combining different perspective views from a series of 2D images. Sony’s research lab in the U.K. has been working along similar lines, its patent (EP 2034747) says, by taking the output from several cameras shooting a live sports event and combining the different views of the game, in real time, into a virtual 3D model.
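One standard building block for combining perspective views in this way, though neither patent spells it out in these terms, is linear triangulation: given the same point seen from two calibrated cameras, its 3D position can be recovered. The sketch below assumes the camera projection matrices are already known, which in practice is a large part of the work.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover a 3-D point from its projections in two calibrated views.

    P1, P2 : 3x4 camera projection matrices
    x1, x2 : (u, v) pixel coordinates of the same point in each view
    Uses the standard linear (DLT) triangulation via SVD.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]      # back from homogeneous coordinates
```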
Another commonality in the patents is that most say a huge amount of computer processing power is needed for 2D-to-3D conversion, and that material shot with 3D in mind will be much easier to convert than legacy footage. US 2008/0085049, based on filings from 2001 by Rolf-Dieter Naske, Steven Schwartz and William Hopewell, explains why material shot without 3D in mind is much harder to convert. Hard cuts, where the scene changes abruptly, disrupt the process of tracking motion to determine depth, it says. Vertical motion is similarly disruptive, because depth detection works mainly in the horizontal plane, it says.
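A toy version of such motion-based depth estimation, shown below with invented block sizes and thresholds, makes both weaknesses visible: it searches for matches only along the horizontal axis, and it has to give up entirely when a large frame-to-frame difference signals a hard cut.

```python
import numpy as np

def horizontal_motion_depth(prev, curr, block=16, search=8, cut_threshold=30.0):
    """Rough depth cues from horizontal block motion between two grayscale frames.

    Returns (depth_map, is_hard_cut). A hard cut (large mean frame difference)
    means motion tracking is unreliable, so the caller should reset its depth
    estimate rather than trust this frame.
    """
    diff = np.mean(np.abs(curr.astype(np.float32) - prev.astype(np.float32)))
    if diff > cut_threshold:
        return None, True            # scene changed abruptly; no usable motion

    h, w = curr.shape
    depth = np.zeros((h // block, w // block), dtype=np.float32)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            patch = curr[y0:y0 + block, x0:x0 + block].astype(np.float32)
            best_err, best_dx = np.inf, 0
            # Search only along the horizontal axis, since depth detection
            # works mainly in the horizontal plane.
            for dx in range(-search, search + 1):
                xs = x0 + dx
                if xs < 0 or xs + block > w:
                    continue
                ref = prev[y0:y0 + block, xs:xs + block].astype(np.float32)
                err = np.mean(np.abs(patch - ref))
                if err < best_err:
                    best_err, best_dx = err, dx
            depth[by, bx] = abs(best_dx)   # faster horizontal motion ~ nearer object
    return depth, False
```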
That’s why it’s not surprising that Samsung’s recent demonstrations of real-time 3D conversion in its TVs, using the 2D Blu-ray disc of Avatar, generally impressed European reviewers. The movie was made with 3D in mind.
A Philips filing (US 2006/0187325) summarizes the work of Philips researchers in Germany who were looking for ways to reduce the processing power needed for 3D conversion. They noted that each frame of live video is built from two interlaced fields, so moving objects will be at slightly different positions in successive fields. Similarly, moving objects that are close to the camera appear to be moving faster, so their position in successive fields will be more widely spaced. If the fields are used as left and right eye images, a 3D effect is created. To increase the 3D effect, the left and right eye fields are taken from successive frames, which inevitably show larger position differences.
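In code, the Philips field trick is little more than slicing, as the sketch below suggests (the function name and the choice of top field for the left eye are assumptions, not details from the filing): use the two fields of one interlaced frame as the left- and right-eye images, or, for a stronger effect, pull the second field from the following frame.

```python
import numpy as np

def fields_as_stereo(frames, strong_effect=False):
    """Pair interlaced fields as left/right-eye images.

    frames : list of interlaced frames (H x W arrays)
    With strong_effect=True the right-eye field is taken from the *next*
    frame, giving moving objects a larger left/right displacement.
    """
    pairs = []
    for i in range(len(frames) - 1):
        left = frames[i][0::2]                       # top field of this frame
        right_src = frames[i + 1] if strong_effect else frames[i]
        right = right_src[1::2]                      # bottom field
        pairs.append((left, right))
    return pairs
```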
Most modern TV sets are designed to up-convert video for 100 or 120 Hz HD display, and so already have memory circuitry to store and process TV fields. It is a simple job, says Philips, to use this same circuitry to display different fields for the left and right eyes. So this method of conversion effectively comes free.
In a more recent U.S. patent (2010/0026784), Philips proposes what it calls an easy way to reduce the processing power needed for computerized scene analysis. Its method also describes how to avoid the kinds of mistakes that a fully automated system can make, such as mistaking a small object up close for a large object in the distance. The Philips trick is to analyze the sound accompanying the image as well as the image itself. The relative loudness of sounds and the overall surround sound field can tell whether a scene is a close-up, medium view or long shot, and this helps the processor judge depths and avoid mistakes.
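One plausible, if simplified, reading of that idea is a heuristic like the sketch below: compare the loudness of the center channel, where dialogue usually sits, with the surround channels, and use the ratio to guess whether the shot is a close-up, a medium view or a long shot. The channel names, thresholds and classification labels are assumptions for illustration only.

```python
import numpy as np

def classify_shot(channels):
    """Guess the shot type from the surround-sound balance.

    channels : dict of channel name -> 1-D sample array, e.g.
               {'C': ..., 'L': ..., 'R': ..., 'Ls': ..., 'Rs': ...}
    Heuristic: center-dominant, dialogue-heavy audio suggests a close-up;
    a diffuse surround field suggests a wide or long shot.
    """
    rms = {name: np.sqrt(np.mean(sig.astype(np.float64) ** 2) + 1e-12)
           for name, sig in channels.items()}
    front = rms.get('C', 0.0)
    surround = rms.get('Ls', 0.0) + rms.get('Rs', 0.0)
    ratio = front / (surround + 1e-12)
    if ratio > 3.0:
        return 'close-up'
    if ratio > 1.0:
        return 'medium'
    return 'long shot'
```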
Audio analysis needs far less computing power, and is far faster, than picture analysis, the Philips patent says. So it can work in real time alongside the visual processor, or run at high speed over the entire soundtrack first, with the results held in memory to assist the picture-analysis process.