Last fall, my computer graphics professor told us about a recently-funded Kickstarter project called the Structure Sensor, which is like a mini-Kinect for the iPad. He would be receiving a Structure soon and challenged the class to think of a research project using the sensor. I came up with an idea, inspired in part by Disney Infinity and Skylanders, and that idea became Brickspace.
Brickspace is an iPad app that helps you build new things with your Lego. To use the app, you spread out your bricks onto a table, then take a picture with the app. The image is run through OpenCV's
SimpleBlobDetector, which finds brick-like blobs in the image and labels each with a keypoint. The app then samples the pixel colors around each keypoint (via OpenCV's
Mat matrix image format) and calculates an average color. This average color is compared to a list of known brick colors, and the known color with the shortest Euclidean distance to the unknown color is pronounced the winner.
With the Structure Sensor attached, the app reads depth data from the sensor and attempts to determine the volume of the bricks. Measured volume gives us the size of the brick, or how many studs wide and long the brick is. I set up the Structure Sensor to be optional; without it, the app assumes that all bricks are 2x4. Brick size can be adjusted after capture and detection to any size between 1x1 and 10x10, but I haven't yet taught the app how to generate any models that use anything but 2x4 bricks.
I worked on a size-estimation algorithm that was meant to tell the difference between 2x1, 2x2, 2x3, and 2x4 bricks. After a couple of iterations of development, the best I was able to do was 50% accuracy with the incorrectly-sized bricks being only one unit too large or small. It was a good research exercise, but I wasn't sure about releasing the Structure functionality to the public. The accuracy wasn't great, most people trying out the app wouldn't have a Structure Sensor, and the app doesn't currently know how to build with anything other than 2x4 bricks. I decided to pull the Structure support from the app before submitting it to the iOS App Store.
This makes the code of the app a bit simpler, but I regret that it meant pulling out some of the most interesting code from the app. The code-frozen research version of the app is available on my GitHub, under the
structureEnabled branch, and an explanation of the removed code follows.
The Structure is first involved during the image capture process.
BKPCapturingViewController presents a camera preview to the user and displays the status of the camera and Structure Sensor. (I chose the
BKP class prefix to avoid naming collisions.)
OpenCV for iOS contains a class called
CvPhotoCamera, a very handy wrapper around
AVFoundation. At first glance, it looks like
CvPhotoCamera is all we need. It's got
-takePicture to get the job done. (Here's a tutorial in the OpenCV documentation that describes how to use
CvVideoCamera, a related class.) This would work, but the snag is that the Structure SDK needs the raw
CMSampleBufferRef from the camera. The Structure SDK takes a
CMSampleBufferRef from the iPad camera, synchronizes it with a frame from the depth sensors, and returns the original
CMSampleBufferRef and a
STDepthFrame containing the depth data that matches the color image. If we're using
CvPhotoCamera, we lose access to the color sample buffers, and have nothing to send to the Structure SDK.
Enter BKPCaptureMaster, the class whose job is to provide a clean interface to the iPad camera and Structure Sensor.
Connecting to the Sensor
When an instance of
BKPCaptureMaster is initialized with
-initWithCameraPreviewView:, the preview view and
BKPCaptureMaster are connected and ready for action.
The BKPCaptureMaster instance must then be sent
-startPreviewing, at which point the object dispatches an asynchronous request to itself to start running the camera.
- All of the AVCaptureSession setup happens then, inside a private method.

The BKPCapturingViewController has no choice but to wait for the camera to get started; it is notified of a camera (or Structure Sensor) status change via a required delegate callback.
Talking to the iPad camera is easy enough, but we don't always know when the Structure is connected, or when we should be looking for it. I also needed the connection process to be rock-solid, no matter whether the app loses focus, or I switch to another app using the Sensor, or the Sensor is only plugged in after the app is launched.
To solve this problem, I added a "Connect to Structure Sensor?" toggle switch to the capture screen UI. This switch defaults to
false and must be flipped on before the app will attempt to connect to the Structure. Aside from preventing the app from having to do continuous Structure Sensor connection attempts when the user likely doesn't have a Sensor, the switch provides feedback to the user by indicating whether Structure support is active or inactive.
- When the switch changes value (that's an IBAction), the
BKPCaptureMaster *_captureMaster is notified via -setStructureSensorEnabled:.
- The -setStructureSensorEnabled: method activates (or deactivates) the _structureConnectionTimer.
- Whenever _structureConnectionTimer fires (once per second), the
BKPCaptureMaster checks to see if it needs to look for the sensor, and if it does, it sends a connection request.
- At last, the
STSensorController *_sensorController singleton reaches out to the Structure.
If the Structure SDK's connection attempt succeeds, then the command to begin streaming data from the Sensor is issued immediately. If the attempt fails, an error is
NSLog-ged. In either case, the timer continues firing; if the Sensor is pulled out while running, we want to try to reconnect to it in case it gets plugged in again. (
-structureConnectionTimerFired doesn't do anything if the sensor is already known to be connected or streaming.) Only turning the toggle switch off will invalidate the connection timer.
Streaming and shipping data around
Once streaming is running, the
BKPCaptureMaster instance waits for the command to capture image (or image and depth) data. When it receives
-performCapture from the capturing view controller, the
BKPCaptureMaster decides whether it needs to initiate a depth and color image capture or just a color image capture. The process is a bit different for each.
Color image capture
This one's straightforward.
The BKPCaptureMaster sends
-async_initiateColorImageCapture to itself. (The
async_ prefix is my reminder to only call this method inside a background dispatch queue.) In
-async_initiateColorImageCapture, we just ask the camera to take a picture. When we receive the response message, the
delegate (an instance of
BKPCapturingViewController) creates a new BKPScannedImageAndBricks from the
CMSampleBufferRef buffer and displays the image in the detection preview.
Depth and color image capture
This one's a dance of delegates:
- When a depth and color image capture is initiated, the Structure Sensor is already initialized and streaming. The instance of
BKPCaptureMaster is an
AVCaptureVideoDataOutputSampleBufferDelegate, so it receives
-captureOutput:didOutputSampleBuffer:fromConnection: every time the video capture session captures a frame. Inside this delegate method, we send the sample buffer to
_sensorController and ask it to synchronize the color image with a depth frame.
- When that is done, the Structure SDK sends us a delegate message containing the synchronized frames. Most of the time, the handler checks
_delegateIsWaitingForCapture, finds it
false, and does nothing. But, after the view controller requests a capture,
_delegateIsWaitingForCapture is set to true.
- The very next time the Structure SDK sends us a synchronized pair of frames, we forward them to
BKPCaptureMaster's delegate (an instance of BKPCapturingViewController).
- As in the color-only image capture, when the
BKPCapturingViewController receives this message, it creates a new
BKPScannedImageAndBricks, gives it the
STDepthFrame from the Structure SDK, and displays the image in the detection preview.
Changes to the public version
In the public version,
BKPCaptureMaster still exists, but no longer needs the ability to communicate with the Structure Sensor. It's a testament to the power of encapsulation that all I had to do to pull out the Sensor was delete, delete, and delete:
- removed all Structure SDK delegate methods from BKPCaptureMaster
- removed all Structure connection code from the capture flow
With the Structure SDK gone, I also had to cut some code from
BKPKeypointDetectorAndAnalyzer. This was the really fun stuff:
Estimating brick size
The entire purpose of using the Structure Sensor is to be able to estimate the size of each brick we find. Once we've done all the above work, and we have an instance of
BKPScannedImageAndBricks, what happens next?
The instance of
BKPScannedImageAndBricks contains the color image as a
UIImage*, the depth data as an
STFloatDepthFrame*, and an
NSMutableArray *_keypointBrickPairs. Creating the first two from our inputs of a sample buffer and an
STDepthFrame is not an issue; sample code and documentation from Occipital indicate the conversion steps. Detecting the keypoints and assigning bricks (with color and size properties) is the responsibility of the BKPKeypointDetectorAndAnalyzer class.
This class contains three static methods:
The first of these uses OpenCV tools only. It initializes a
SimpleBlobDetector with a few different sets of parameters and runs it on the image. For each keypoint that the detector finds, an instance of
BKPKeypointBrickPair is created and appended to the mutable array _keypointBrickPairs. Each
BKPKeypointBrickPair contains the keypoint from OpenCV and the
BKPBrick that Brickspace thinks is at that location in the image. Initially, the brick is
nil. Creating and assigning bricks to keypoints is the responsibility of the last two methods in the list above.
+assignBricksToKeypoints:fromImage: simply calls
+assignBricksToKeypoints:fromImage:withDepthFrame: with a
nil third argument. The second method first instantiates "blank"
BKPBricks and assigns them to the keypoints. Then, it creates an asynchronous dispatch queue and group. Each color and size estimation can be performed independently; these estimates do not depend on each other. Tasks are created to analyze the color and size (if a depth frame is available) of each keypoint's brick, and the method waits on all of the tasks to finish before ending.
Color assignment is done in
+async_assignColorToBrickInKeypoint:inImageMatrix:. As mentioned earlier, the algorithm computes an average color for each keypoint and compares the color to a set of known colors. The best match is assigned to the brick.
Size assignment is done in a companion async_ method, and its algorithm went through three attempts.
Attempt 1: calculating a real volume estimate
(The code for this algorithm is visible in the GitHub repo at an older revision, starting on line 338.)
First, given the keypoint's location and size, the algorithm determines an area of interest around the center of the keypoint. If we graph the depth data from the
STFloatDepthFrame in this area of interest, we get something like this:
The punched-out hole is the surface of the brick, and the points in the flat square are points on the table around the brick. Since this graph shows the distance from the Structure Sensor to the table for each point in the image, and the top of the brick is closer to the Sensor than the table, the depth measurements for the brick are lower than the surrounding measurements.
To determine programmatically which pixels in the depth frame are in the brick, the algorithm picks a depth threshold: either the average of the minimum and maximum depth measurements (the midrange), or the median of all the depth measurements. Depth pixels with a depth value less than the threshold are in the brick, and pixels with measurements greater than the threshold are in the table.
The algorithm then estimates the height of the brick by separately averaging the depths of the table pixels and of the brick pixels, then taking the difference of the two averages.
The worst part of the algorithm was next: attempting to determine the horizontal distance between depth pixels in the grid of the depth frame. To calculate this, we have to know the details of the Structure Sensor camera intrinsics, and do some mathematics that I honestly never understood. I'll even ashamedly admit that I copied and merged in some code from a project I found on the Structure forums to get the job done.
We now have (a.) the count of pixels that represent the top surface of the brick, (b.) the left-to-right real-world distance between pixels in the depth frame, (c.) the front-to-back real-world distance between pixels in the depth frame, and (d.) the height of the brick. Multiplying all four of these together gives us our grand estimate of the volume of the brick.
The values that this algorithm produced were mildly decent… meaning that they were within an order of magnitude of the actual volume of a brick. Here are the actual volumes of the brick sizes I was working with:

- 2x1 = 1,168.128 mm³
- 2x2 = 2,336.256 mm³
- 2x3 = 3,504.384 mm³
- 2x4 = 4,672.512 mm³

Unfortunately, my calculated measurements for a 2x4 brick were anywhere from 1,000 mm³ to 10,000 mm³. With this lack of precision, it was impossible to distinguish between different sizes of brick.
Attempt 2: estimating volume from a standard distance and angle
I also noticed that the estimated volume of the brick varied greatly based on where the brick was in the captured image, the distance that the Sensor was from the table, and the angle that the iPad was making with the table. To standardize the capture distance and angle, I overlaid a box on the capture preview and asked the user to line it up with a piece of 8.5 x 11" paper on the table.
In order for the box to line up with the paper, the Sensor must be directly above the paper and at the correct distance. With these variables standardized, I took multiple measurements of each size of brick in a number of positions and at four different rotations (0°, 45°, 90°, and 135°). Here's what the scanning screen looks like again, and a reference image I made showing all of the places I put each brick and recorded its volume estimate.
That's 37 different measurement positions, times 4 rotations each, for 148 data points per brick. (Because of the rotational symmetry of the 2x2 brick, I only took measurements at 0° and 45° rotations.)
This method produced some nice average volume measurements, but the overall problem remained. My on-the-fly measurements varied too widely to be reliably matched to a particular volume.
Attempt 3: counting depth frame pixels
I realized that if I was standardizing the capture process, then maybe I could assign size classifications based on a different empirical observation—something other than a volume calculation. If the depth frame is always captured in the same way, then the number of pixels that land on the top surface of the brick should be about the same for each brick of the same size.
I redid my rounds of measurement for each brick, this time counting the number of pixels in the depth frame that landed on the surface of the brick. Here's how the size estimation improved:
To measure the accuracy of each iteration of my size assignment algorithm, I set up eight bricks and captured the same scene with each version of the algorithm. I used two of each size of brick (2x1, 2x2, 2x3, and 2x4): one on the table and contained entirely within the sheet of paper required for Structure scanning, and one at least partially outside the paper. The app knows that each brick is 2x(something). For each trial, the Size column indicates the (something) that the app thinks should fill in the blank. The Diff column equals the absolute value of the difference from the actual size.
| Brick size | On paper? | Old volume data | New volume data | Old depth pixel count data | New depth pixel count data |
|---|---|---|---|---|---|
| Percent correct and total difference | | 25% / 10 | 25% / 9 | 50% / 7 | 50% / 4 |
The depth-pixel method was a solid improvement over my shoddy volume estimation technique, and refining the way I captured the data improved size assignment accuracy under both techniques. But, even at its best, the algorithm could only assign half of the bricks the correct size. 50% is a greater success rate than random assignment (25%, with four possible brick sizes), but not quite good enough for my liking.
It was a fun adventure, but in the end, I just couldn't find a way to interpret the Structure Sensor's data to reliably determine the size of a Lego brick. I suspect that the Structure Sensor may not be accurate enough to distinguish the minor differences between the sizes of small bricks, except in perfectly consistent conditions or when many captures of the same brick are made and averaged.