Brickspace and the Structure Sensor

Last fall, my computer graphics professor told us about a recently-funded Kickstarter project called the Structure Sensor, which is like a mini-Kinect for the iPad. He would be receiving a Structure soon and challenged the class to think of a research project using the sensor. I came up with an idea, inspired in part by Disney Infinity and Skylanders, and that idea became Brickspace.

The Structure Sensor. Image by Occipital, Inc.

Brickspace is an iPad app that helps you build new things with your Lego. To use the app, you spread out your bricks on a table, then take a picture with the app. The image is run through OpenCV's SimpleBlobDetector, which finds brick-like blobs in the image and labels each with a keypoint. The app then samples the pixel colors around each keypoint (read from OpenCV's Mat image format) and calculates an average color. This average color is compared to a list of known brick colors, and the known color with the shortest Euclidean distance to the unknown color is pronounced the winner.
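For flavor, here's a minimal Python sketch of that nearest-color matching. The palette below is made up for illustration; it is not the app's actual table of known brick colors:

```python
import math

# Hypothetical palette of known brick colors (RGB). The names and
# values are illustrative, not the app's real color list.
KNOWN_BRICK_COLORS = {
    "red":    (196, 40, 28),
    "blue":   (13, 105, 172),
    "yellow": (245, 205, 48),
    "green":  (40, 127, 71),
    "white":  (242, 243, 243),
}

def closest_brick_color(avg_rgb):
    """Return the known color with the smallest Euclidean distance
    to the measured average color."""
    def dist(known_rgb):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(avg_rgb, known_rgb)))
    return min(KNOWN_BRICK_COLORS, key=lambda name: dist(KNOWN_BRICK_COLORS[name]))
```

A muddy average like (200, 50, 40) still lands on "red", which is the behavior you want when lighting shifts the measured colors around.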

With the Structure Sensor attached, the app reads depth data from the sensor and attempts to determine the volume of the bricks. Measured volume gives us the size of the brick, or how many studs wide and long the brick is. I set up the Structure Sensor to be optional; without it, the app assumes that all bricks are 2x4. Brick size can be adjusted after capture and detection to any size between 1x1 and 10x10, but I haven't yet taught the app how to generate any models that use anything but 2x4 bricks.

I worked on a size-estimation algorithm that was meant to tell the difference between 2x1, 2x2, 2x3, and 2x4 bricks. After a couple of iterations of development, the best I was able to do was 50% accuracy with the incorrectly-sized bricks being only one unit too large or small. It was a good research exercise, but I wasn't sure about releasing the Structure functionality to the public. The accuracy wasn't great, most people trying out the app wouldn't have a Structure Sensor, and the app doesn't currently know how to build with anything other than 2x4 bricks. I decided to pull the Structure support from the app before submitting it to the iOS App Store.

This makes the code of the app a bit simpler, but I regret that it meant pulling out some of the most interesting code from the app. The code-frozen research version of the app is available on my GitHub, under the structureEnabled branch, and an explanation of the removed code follows.


The CaptureMaster

The Structure is first involved during the image capture process. BKPCapturingViewController presents a camera preview to the user and displays the status of the camera and Structure Sensor. (I chose the BKP class prefix to avoid BS.)

OpenCV for iOS contains a class called CvPhotoCamera, a very handy wrapper around AVFoundation. At first glance, it looks like CvPhotoCamera is all we need. It's got -initWithParentView: and -start from CvAbstractCamera, and -takePicture to get the job done. (Here's a tutorial in the OpenCV documentation that describes how to use CvVideoCamera, a related class.) This would work, but the snag is that the Structure SDK needs the raw CMSampleBufferRef from the camera. The Structure SDK takes a CMSampleBufferRef from the iPad camera, synchronizes it with a frame from the depth sensors, and returns the original CMSampleBufferRef and a STDepthFrame containing the depth data that matches the color image. If we're using CvPhotoCamera, we lose access to the color sample buffers, and have nothing to send to the Structure SDK.

Enter BKPCaptureMaster, the class whose job is to provide a clean interface to the iPad camera and Structure Sensor.

Connecting to the Sensor

When an instance of BKPCaptureMaster is initialized with -initWithCameraPreviewView:, the preview view and BKPCaptureMaster are connected and ready for action.

  • The BKPCaptureMaster instance must then be sent -startPreviewing, at which point the object dispatches an asynchronous request to itself to start running the camera.
  • All of the AVCaptureSession setup happens then, inside the private method -async_startPreviewing.
  • The BKPCapturingViewController has no choice but to wait for the camera to get started; it is notified of a camera (or Structure Sensor) status change via the required CaptureMasterResultsDelegate method -captureMasterStatusChanged.

Talking to the iPad camera is easy enough, but we don't always know when the Structure is connected, or when we should be looking for it. I also needed the connection process to be rock-solid, whether the app loses focus, the user switches to another app that uses the Sensor, or the Sensor is plugged in only after the app is launched.

To solve this problem, I added a "Connect to Structure Sensor?" toggle switch to the capture screen UI. This switch defaults to false and must be flipped on before the app will attempt to connect to the Structure. Aside from preventing the app from having to do continuous Structure Sensor connection attempts when the user likely doesn't have a Sensor, the switch provides feedback to the user by indicating whether Structure support is active or inactive.

  • When the switch changes value (that's an IBAction), BKPCaptureMaster *_captureMaster is notified via setStructureSensorEnabled:.
  • The -setStructureSensorEnabled: method activates (or deactivates) the BKPCaptureMaster's NSTimer *_structureConnectionTimer.
  • When _structureConnectionTimer fires (once per second), the BKPCaptureMaster checks to see if it needs to look for the sensor, and if it does, it sends tryToConnectToStructure to itself.
  • At last, -tryToConnectToStructure tells the STSensorController *_sensorController singleton to reach out to the Structure.

If the Structure SDK's connection attempt succeeds, then the command to begin streaming data from the Sensor is issued immediately. If the attempt fails, an error is NSLog-ged. In either case, the timer continues firing; if the Sensor is pulled out while running, we want to try to reconnect to it in case it gets plugged in again. (-structureConnectionTimerFired doesn't do anything if the sensor is already known to be connected or streaming.) Only turning the toggle switch off will invalidate the connection timer.
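Stripped of the Objective-C specifics, the reconnect logic is a small polling state machine. Here's a rough Python sketch of the pattern, with a fake sensor object standing in for STSensorController:

```python
import enum

class SensorState(enum.Enum):
    DISCONNECTED = enum.auto()
    STREAMING = enum.auto()

class FakeSensor:
    """Hypothetical stand-in for STSensorController: connecting only
    succeeds if the hardware is actually plugged in."""
    def __init__(self):
        self.plugged_in = False
        self.state = SensorState.DISCONNECTED

    def try_connect(self):
        if self.plugged_in:
            self.state = SensorState.STREAMING  # start streaming right away
            return True
        return False  # in the app, the error gets NSLog-ged here

def connection_timer_fired(sensor):
    """Runs once per second while the toggle switch is on: do nothing
    if already streaming, otherwise retry the connection."""
    if sensor.state is SensorState.DISCONNECTED:
        sensor.try_connect()
```

Because the timer keeps firing even after failures, a Sensor that's unplugged and replugged gets reconnected for free; only flipping the switch off stops the polling.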

Streaming and shipping data around

Once streaming is running, the BKPCaptureMaster instance waits for the command to capture image (or image and depth) data. When it receives -performCapture from the instance of BKPCapturingViewController, the BKPCaptureMaster decides whether it needs to initiate a depth and color image capture or just a color image capture. The process is a bit different for each.

Color image capture

This one's straightforward.

  • The BKPCaptureMaster asynchronously dispatches -async_initiateColorImageCapture to itself. (The async_ prefix is my reminder to only call this method inside a dispatch_async() block.)
  • Inside -async_initiateColorImageCapture, we just ask AVCaptureStillImageOutput *_stillImageOutput to captureStillImageAsynchronouslyFromConnection:completionHandler:.
  • When we receive the response message, which is the BKPCaptureMaster-defined -async__finishedColorImageCapture:withError:, the BKPCaptureMaster sends its delegate (an instance of BKPCapturingViewController) the captureMasterDidOutputAVFColorBuffer: message.
  • The BKPCapturingViewController creates a new BKPScannedImageAndBricks with the CMSampleBufferRef buffer and displays the image in the detection preview.

Depth and color image capture

This one's a dance of delegates:

  • When a depth and color image capture is initiated, the Structure Sensor is already initialized and streaming. The instance of BKPCaptureMaster conforms to AVCaptureVideoDataOutputSampleBufferDelegate, so it receives -captureOutput:didOutputSampleBuffer:fromConnection: every time the video AVCaptureSession captures a frame. Inside this delegate method, we send the CMSampleBufferRef to _sensorController and ask it to synchronize the color image with a depth frame.
  • When that is done, the Structure SDK sends the message -sensorDidOutputSynchronizedDepthFrame:andColorFrame: to our BKPCaptureMaster.

Most of the time, -sensorDidOutputSynchronizedDepthFrame:andColorFrame: checks _delegateIsWaitingForCapture, finds it false, and does nothing. But, after our BKPCaptureMaster receives -performCapture:

  • The BKPCaptureMaster asynchronously dispatches -async_initiateDepthAndColorImageCapture to itself.
  • _delegateIsWaitingForCapture is set to true.
  • The very next time the Structure SDK sends us a synchronized pair of frames, we forward them to BKPCaptureMaster's delegate (an instance of BKPCapturingViewController) via -captureMasterDidOutputSTColorBuffer:andDepthFrame: and reset _delegateIsWaitingForCapture to false.
  • As in the color-only image capture, when the BKPCapturingViewController receives this message, it creates a new BKPScannedImageAndBricks, gives it the CMSampleBufferRef and STDepthFrame from the Structure SDK, and displays the image in the detection preview.

Changes to the public version

In the public version, BKPCaptureMaster still exists, but no longer needs the ability to communicate with the Structure Sensor. It's a testament to the power of encapsulation that all I had to do to pull out the Sensor was delete, delete, and delete:

  • removed -captureMasterDidOutputSTColorBuffer:andDepthFrame: from the CaptureMasterResultsDelegate protocol
  • removed all Structure SDK delegate methods from BKPCaptureMaster
  • removed all Structure connection code from BKPCaptureMaster, including NSTimer *_structureConnectionTimer and STSensorController *_sensorController
  • removed -initWithSTColorBuffer:andDepthFrame: from BKPScannedImageAndBricks

With the Structure SDK gone, I also had to cut some code from BKPKeypointDetectorAndAnalyzer. This was the really fun stuff:

Estimating brick size

The entire purpose of using the Structure Sensor is to be able to estimate the size of each brick we find. Once we've done all the above work, and we have an instance of BKPScannedImageAndBricks, what happens next?

The instance of BKPScannedImageAndBricks contains the color image as a UIImage*, the depth data as an STFloatDepthFrame*, and an NSMutableArray *_keypointBrickPairs. Creating the first two from our inputs of CMSampleBufferRef and STDepthFrame is not an issue; sample code and documentation from Occipital indicate the conversion steps. Detecting the keypoints and assigning bricks (with color and size properties) is the responsibility of the BKPKeypointDetectorAndAnalyzer class.

This class contains three static methods:

  • +detectKeypoints:inImage:
  • +assignBricksToKeypoints:fromImage:
  • +assignBricksToKeypoints:fromImage:withDepthFrame:

The first of these uses OpenCV tools only. It initializes a SimpleBlobDetector with a few different sets of parameters and runs it on the image. For each keypoint that the detector finds, an instance of BKPKeypointBrickPair is created and appended to the mutable array _keypointBrickPairs. A BKPKeypointBrickPair contains the keypoint from OpenCV and the BKPBrick that Brickspace thinks is at that location in the image. Initially, the brick is nil. Creating and assigning bricks to keypoints is the responsibility of the last two methods in the list above.

+assignBricksToKeypoints:fromImage: simply calls +assignBricksToKeypoints:fromImage:withDepthFrame: with a nil third argument. The second method first instantiates "blank" BKPBricks and assigns them to the keypoints. Then, it creates an asynchronous dispatch queue and group. Each color and size estimation can be performed independently; these estimates do not depend on each other. Tasks are created to analyze the color and size (if a depth frame is available) of each keypoint's brick, and the method waits on all of the tasks to finish before ending.
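The dispatch-group pattern here (fan out independent tasks, then block until every one finishes) maps roughly onto Python's concurrent.futures. The two analysis functions below are trivial stand-ins for the real color and size estimators:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def assign_color(brick):
    # stand-in for the real per-keypoint color estimator
    brick["color"] = "red"

def assign_size(brick):
    # stand-in for the real per-keypoint size estimator
    brick["size"] = (2, 4)

def analyze_bricks(bricks, have_depth_frame):
    """Analyze every brick concurrently; wait() plays the role of
    dispatch_group_wait() in the Objective-C version."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(assign_color, b) for b in bricks]
        if have_depth_frame:  # size is only estimated when depth data exists
            futures += [pool.submit(assign_size, b) for b in bricks]
        wait(futures)
    return bricks
```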

Color assignment is done in +async_assignColorToBrickInKeypoint:inImageMatrix:. As mentioned earlier, the algorithm computes an average color for each keypoint and compares the color to a set of known colors. The best match is assigned to the brick.

Size assignment is done in +async_assignSizeToBrickInKeypoint:inImageMatrix:withDepthFrame:.

Attempt 1: calculating a real volume estimate

(The code for this algorithm is visible in the GitHub repo at an older revision, starting on line 338.)

First, given the keypoint's location and size, the algorithm determines an area of interest around the center of the keypoint. If we graph the depth data from the STFloatDepthFrame in this area of interest, we get something like this:

3D graph created with OS X's Grapher.

The punched-out hole is the surface of the brick, and the points in the flat square are points on the table around the brick. Since this graph shows the distance from the Structure Sensor to the table for each point in the image, and the top of the brick is closer to the Sensor than the table, the depth measurements for the brick are lower than the surrounding measurements.

To determine programmatically which pixels in the depth frame are in the brick, the algorithm picks a depth threshold. The threshold is set to be the average of the minimum depth measurement and the maximum depth measurement, or the median of all the depth measurements. Depth pixels with a depth value less than the threshold are in the brick, and pixels with measurements greater than the threshold are in the table.

The algorithm then estimates the height of the brick by averaging the depths of the table pixels and the depths of the brick pixels, then taking the difference.

The worst part of the algorithm was next: attempting to determine the horizontal distance between depth pixels in the grid of the depth frame. To calculate this, we have to know the details of the Structure Sensor camera intrinsics, and do some mathematics that I honestly never understood. I'll even ashamedly admit that I copied and merged in some code from a project I found on the Structure forums to get the job done.

We now have (a.) the count of pixels that represent the top surface of the brick, (b.) the left-to-right real-world distance between pixels in the depth frame, (c.) the front-to-back real-world distance between pixels in the depth frame, and (d.) the height of the brick. Multiplying all four of these together gives us our grand estimate of the volume of the brick.
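Put together, attempt 1 boils down to a handful of lines. This sketch uses the min/max-midpoint threshold and runs on a toy depth grid with made-up pixel spacings, not real Structure Sensor data:

```python
import statistics

def estimate_volume(depth_grid, dx_mm, dy_mm):
    """Estimate brick volume (mm^3) from a 2D grid of depth readings (mm).
    dx_mm and dy_mm are the real-world spacings between adjacent depth
    pixels, which in practice come from the camera intrinsics."""
    samples = [d for row in depth_grid for d in row]
    threshold = (min(samples) + max(samples)) / 2  # midpoint threshold
    brick = [d for d in samples if d < threshold]   # closer to the Sensor
    table = [d for d in samples if d >= threshold]
    height = statistics.mean(table) - statistics.mean(brick)   # (d.)
    # (a.) pixel count on the top surface, times (b.) and (c.) the
    # per-pixel real-world area, times (d.) the height
    return len(brick) * dx_mm * dy_mm * height
```

On a toy 4×4 grid where four central pixels read 490 mm against a 500 mm table, with 2 mm pixel spacing, this gives 4 × 2 × 2 × 10 = 160 mm³.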

The values that this algorithm produced were mildly decent… meaning that they were within an order of magnitude of the actual volume of a brick. Here are the actual volumes of the brick sizes I was working with:

  • 2x1 = 1,168.128 mm³
  • 2x2 = 2,336.256 mm³
  • 2x3 = 3,504.384 mm³
  • 2x4 = 4,672.512 mm³

Unfortunately, my calculated measurements for a 2x4 brick were anywhere from 1,000 mm³ to 10,000 mm³. With this lack of precision, it was impossible to distinguish between different sizes of brick.

Attempt 2: estimating volume from a standard distance and angle

I also noticed that the estimated volume of the brick varied greatly based on where the brick was in the captured image, the distance that the Sensor was from the table, and the angle that the iPad was making with the table. To standardize the capture distance and angle, I overlaid a box on the capture preview and asked the user to line it up with a piece of 8.5 x 11" paper on the table.

In order for the box to line up with the paper, the Sensor must be directly above the paper and at the correct distance. With these variables standardized, I took multiple measurements of each size of brick in a number of positions and at four different rotations (0°, 45°, 90°, and 135°). Here's what the scanning screen looks like again, and a reference image I made showing all of the places I put each brick and recorded its volume estimate.

That's 37 different measurement positions, times 4 rotations each, for 148 data points per brick. (Because of the rotational symmetry of the 2x2 brick, I only took measurements at 0° and 45° rotations.)

This method produced some nice average volume measurements, but the overall problem remained. My on-the-fly measurements varied too widely to be reliably matched to a particular volume.

Attempt 3: counting depth frame pixels

I realized that if I was standardizing the capture process, then maybe I could assign size classifications based on a different empirical observation—something other than a volume calculation. If the depth frame is always captured in the same way, then the number of pixels that land on the top surface of the brick should be about the same for each brick of the same size.
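That classification step is just a nearest-neighbor lookup against per-size average pixel counts. As a sketch (with invented counts, not my actual calibration data):

```python
# Hypothetical average top-surface pixel counts per brick size, as
# would be gathered from the standardized calibration captures.
AVERAGE_PIXEL_COUNTS = {
    (2, 1): 60,
    (2, 2): 120,
    (2, 3): 180,
    (2, 4): 240,
}

def classify_by_pixel_count(measured_count):
    """Return the brick size whose calibrated pixel count is closest
    to the measured count."""
    return min(AVERAGE_PIXEL_COUNTS,
               key=lambda size: abs(AVERAGE_PIXEL_COUNTS[size] - measured_count))
```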

I redid my rounds of measurement for each brick, this time counting the number of pixels in the depth frame that landed on the surface of the brick. Here's how the size estimation improved:

Algorithm improvement

To measure the accuracy of each iteration of my size assignment algorithm, I set up eight bricks and captured the same scene with each version of the algorithm. I used two of each size of brick (2x1, 2x2, 2x3, and 2x4): one on the table and contained entirely within the sheet of paper required for Structure scanning, and one at least partially outside the paper. The app knows that each brick is 2x(something). For each trial, the Size column indicates the (something) that the app thinks should fill in the blank. The Diff column equals the absolute value of the difference from the actual size.

Brick size   On paper?   Old volume      New volume      Old pixel count   New pixel count
                         Size    Diff    Size    Diff    Size    Diff      Size    Diff
2x1          on          2       1       3       2       4       3         1       0
2x1          off         4       3       2       1       1       0         2       1
2x2          on          1       1       1       1       2       0         3       1
2x2          off         3       1       1       1       4       2         2       0
2x3          on          4       1       4       1       4       1         3       0
2x3          off         3       0       3       0       3       0         3       0
2x4          on          4       0       4       0       4       0         3       1
2x4          off         1       3       1       3       3       1         3       1
Correct / total diff     25%     10      25%     9       50%     7         50%     4

The depth-pixel method was a solid improvement over my shoddy volume estimation technique, and refining the way I captured the data improved size assignment accuracy under both techniques. But, even at its best, the algorithm could only assign half of the bricks the correct size. 50% is a greater success rate than random assignment (25%, with four possible brick sizes), but not quite good enough for my liking.

It was a fun adventure, but in the end, I just couldn't find a way to interpret the Structure Sensor's data to reliably determine the size of a Lego brick. I suspect that the Structure Sensor may not be accurate enough to distinguish the minor differences between the sizes of small bricks, except in perfectly consistent conditions or when many captures of the same brick are made and averaged.