- Correlation slides the filter as-is; convolution flips it by 180 degrees first.
- They are the same when the filter is symmetric
- Moving average
- Weighted Moving average
- Correlation filtering
- Properties of convolution
- Linear & shift invariant
- Commutative: $f * g = g * f$
- Associative: $(f * g) * h = f * (g * h)$
- Identity: unit pulse $e = [\ldots, 0, 0, 1, 0, 0, \ldots]$, $f * e = f$
- Differentiation: $\frac{\partial}{\partial x}(f * g) = \frac{\partial f}{\partial x} * g$
- Convolution separability
    - If $H = C * R$, where C is a column vector and R is a row vector, then $G = H * F = (C * R) * F = C * (R * F)$ (see the sketch below)
    - Computation drops from $N^2 W^2$ to $2 N^2 W$, where N is the dimension of F and W is the width of H
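A minimal sketch (my own illustration, not from the notes) of the separability idea above: a rank-one 2D kernel applied as two 1D passes gives the same result with fewer operations. Assumes SciPy is available.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
F = rng.random((64, 64))             # "image" F (N x N)

C = np.array([[1.0], [2.0], [1.0]])  # column vector C (W x 1)
R = np.array([[1.0, 2.0, 1.0]])      # row vector R (1 x W)
H = C @ R                            # full W x W kernel H = C * R

full = convolve2d(F, H, mode='same')                    # ~N^2 W^2 work
sep = convolve2d(convolve2d(F, R, mode='same'), C, mode='same')  # ~2 N^2 W work

print(np.allclose(full, sep))        # True: same output, cheaper computation
```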
- Sharpening filter
- Accentuates differences with local average
- Median filter
- Remove salt and pepper noise
- Preserve edges while mean filter doesn't
- Non linear filter
- Filter as template
- Normalized correlation can be used to find a pattern inside an image (template matching, e.g. detecting something)
- In real image types of edges:
- Reflectance change: appearance information, texture
- Depth: discontinuity, object boundary
- Cast shadows
- Discontinuous change in surface orientation
- Note: stripe on a sign is not due to shape or illumination, it is color discontinuity.
- Derivatives and edges: An edge is a place of rapid change in the image intensity function.
- The gradient points in the direction of most rapid increase in intensity.
$\nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right]$
- Sobel detector
    - $s_x = \frac{1}{8}\begin{pmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1 \end{pmatrix}$
    - $s_y = \frac{1}{8}\begin{pmatrix} 1 & 2 & 1\\ 0 & 0 & 0\\ -1 & -2 & -1 \end{pmatrix}$ (here y points up)
- It is better to compute gradients using correlation since it preserves the intended directionality of gradients.
- For real world images with noise, we need to smooth first before finding edges. Otherwise, there will be lots of high value gradients due to noise.
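A short sketch (my own, not from the notes) of smoothing first and then computing Sobel gradients with correlation, as suggested above. SciPy is an assumed dependency.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import correlate2d

sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 8.0
sy = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]) / 8.0   # y points up

img = np.random.default_rng(1).random((100, 100))           # stand-in image
smoothed = gaussian_filter(img, sigma=1.0)   # suppress noise before differentiating

gx = correlate2d(smoothed, sx, mode='same', boundary='symm')
gy = correlate2d(smoothed, sy, mode='same', boundary='symm')

magnitude = np.hypot(gx, gy)        # gradient magnitude
orientation = np.arctan2(gy, gx)    # gradient direction
```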
- Canny edge detector
- Primary edge detection steps
- Smoothing derivatives to suppress noise and compute gradient
- Threshold to find regions of "significant" gradient
- "Thin" to get localized edge pixels
- Add link or connect edge pixels
- Canny edge detector
- Filter image with derivative of Gaussian
- Find magnitude and orientation of gradients
- Non-maximum suppression
- Thin multi-pixel wide "ridges" down to single pixel width
- Linking and thresholding (hysteresis)
- define two thresholds: low and high
- Use the high threshold to start edge curves and the low threshold to continue them.
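A brief sketch of the Canny pipeline via OpenCV (an assumed library; the blur size, sigma, and thresholds here are illustrative choices, not values from the notes; the file name is hypothetical).

```python
import cv2

img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical input
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.4)   # smooth to suppress noise

# cv2.Canny computes gradients, does non-maximum suppression, then hysteresis:
# strong edges (above the high threshold) start curves, weak edges continue them.
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
```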
- Hough Transform: Lines
- Rough idea
- Each edge point votes for compatible lines
- Look for lines that get many votes
- Image space -> Hough (parameter) space
- A line in the image corresponds to a point in hough space
- A point in the image corresponds to a line in hough space
- Algorithm
- Let each edge point in image space vote for a set of possible parameters in Hough space
- Accumulate votes in discrete set of bins; parameters with the most votes indicate line in image space
- Use polar representation to avoid undefined parameters for vertical lines.
$\rho = x \cos(\theta) + y \sin(\theta)$
- Complexity of the Hough transform
    - Space: $k^n$ (n dimensions, k bins each)
    - Time
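A tiny Hough-line accumulator sketch (my own illustration; bin counts are arbitrary choices) using the polar parameterization above: each edge pixel votes for every compatible $(\rho, \theta)$ bin, and peaks in the accumulator correspond to lines.

```python
import numpy as np

def hough_lines(edges, n_theta=180, n_rho=200):
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta)
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)

    ys, xs = np.nonzero(edges)                  # each edge point votes
    for x, y in zip(xs, ys):
        for t_idx, theta in enumerate(thetas):
            rho = x * np.cos(theta) + y * np.sin(theta)
            r_idx = int(round((rho + diag) / (2 * diag) * (n_rho - 1)))
            acc[r_idx, t_idx] += 1              # one vote per (rho, theta) bin

    rhos = np.linspace(-diag, diag, n_rho)
    return acc, rhos, thetas                    # peaks in acc indicate lines
```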
- Hough Transform: Circles
$(x-a)^2 + (y-b)^2 = r^2$
- With known r, a point on a circle in x-y space maps to a circle in a-b space.
- With unknown r, a point on a circle in x-y space, is a cone in a-b-r 3d space.
- Pros:
- All points are processed independently, so it can cope with occlusion
- Some robustness to noise: noise points unlikely to contribute consistently to any single bin
- Can detect multiple instances of a model in a single pass.
- Cons:
- Complexity of search time increases exponentially with the number of model parameters
- Non-target shapes can produce spurious peaks in parameter space
- Quantization: hard to pick a good grid size
- Generalized Hough Transform
- Non-analytical models
- Visual code-word based features
- Any periodic function can be rewritten as a weighted sum of sines and cosines of different frequencies.
- Usually, frequency magnitude is more interesting than the phase for CV because we are not reconstructing the image.
- Formal definition of FT
- Represent the signal as an infinite weighted sum of an infinite number of sinusoids
$F(u) = \int^{\infty}_{-\infty} f(x)\, e^{-2\pi i u x}\, dx$, where $e^{ik} = \cos k + i \sin k$, $i = \sqrt{-1}$
- $F(u)$ exists if the function is integrable: $\int^{\infty}_{-\infty} |f(x)|\, dx < \infty$
- Discrete FT: $F(k) = \frac{1}{N} \sum^{N-1}_{x=0} f(x)\, e^{-i \frac{2\pi k x}{N}}$
    - $x$ is discrete and goes from the start of the signal (0) to the end ($N-1$)
    - $k$ is the number of "cycles per period of the signal" or "cycles per image"
- Spatial domain: convolution <--> Frequency domain: multiplication, vice versa
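A small sketch (not from the notes) relating the DFT formula above to NumPy's FFT (NumPy omits the 1/N factor) and checking the convolution-multiplication duality on a 1D signal.

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.random(64), rng.random(64)

# Manual DFT sum (without the 1/N normalization) matches np.fft.fft
F_manual = np.array([np.sum(f * np.exp(-2j * np.pi * k * np.arange(64) / 64))
                     for k in range(64)])
assert np.allclose(np.fft.fft(f), F_manual)

# Multiplication in the frequency domain == circular convolution in space
circ = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))
direct = np.array([np.sum(f * np.roll(g[::-1], n + 1)) for n in range(64)])
assert np.allclose(circ, direct)
```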
- Pinhole camera model
- Aperture for pinhole camera
- Too small will have diffraction
- Too large: rays from many directions overlap at each sensor point, giving a blurred image
- Thin lens model
$1/||z'|| + 1/||z|| = 1/f$
- Depth of field
- Large aperture -> short depth of field
- Small aperture -> large depth of field
- Field of view
    - $\phi = \tan^{-1}\left(\frac{d/2}{f}\right)$, where d is the "retina" or sensor size and f is the focal length
    - Larger focal length -> smaller FOV
- Modeling projection
$(X, Y, Z) \rightarrow (-d\frac{X}{Z}, -d\frac{Y}{Z}, -d)$
- When X and Z are very large, the image doesn't change when X and Z change.
- Homogeneous Coordinates
- 2D (x, y) -> (x, y, 1), 3D (x, y, z, 1)
- Convert back (x, y, w) -> (x/w, y/w)
- Perspective Projection
$A = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1/f & 0 \end{pmatrix}$, $X = [x\ y\ z\ 1]^T$, $b = AX = [x\ y\ z/f]^T$, $b \rightarrow (fx/z, fy/z) = (u, v)$
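A minimal sketch (my own, with an assumed focal length) of the homogeneous-coordinate projection above: multiply by A, then divide by the last coordinate to convert back.

```python
import numpy as np

f = 2.0                                    # assumed focal length
A = np.array([[1, 0, 0,     0],
              [0, 1, 0,     0],
              [0, 0, 1 / f, 0]])

X = np.array([3.0, 1.5, 10.0, 1.0])        # homogeneous 3D point (x, y, z, 1)
b = A @ X                                  # = [x, y, z/f]
u, v = b[0] / b[2], b[1] / b[2]            # divide by last coord: (f x / z, f y / z)
print(u, v)                                # 0.6, 0.3
```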
- Geometric Properties of Projection
- Points go to points
- Lines go to lines
- Polygons go to polygons
- Vanishing points
- Sets of parallel lines on the same plane lead to collinear vanishing points
- The line is called the horizon for that plane.
- Special case of perspective projection
- Orthographic: also called parallel projection
- (x, y, z) -> (x, y)
- Weak perspective
- Perspective effects, but not over the scale of individual objects: since the object is far from the camera, we can neglect the difference in z across the object
- (x, y, z) -> (fx/z0, fy/z0)
- Why multiple views?
- Structure and depth are inherently ambiguous from single views
- Depth and shape cues:
- Shadows, texture (orientation recover), focus/blur (take several images with different camera parameters), parallel/converging lines, lighting/shading, occlusion, size/scale, depth from moving object in the image
- Stereo photography and stereo viewers
- Take two pictures of the same subject from two slightly different viewpoints and display so that each eye sees only one of the images.
- Human binocular fusion is not based on matching large-scale structures or on any individual recognition process; it is a low-level process that fuses the two images directly.
- Estimating depth with stereo
- Stereo: shape from "motion" between two views
- We'll need to consider
- Info on camera pose ("calibration")
- Image point correspondences
- If disparity is 0, that means Z is infinity. Disparity is inversely proportional to depth.
- Stereo correspondence constraints
- In perspective projection, lines project into lines.
- The line containing the center of projection and the point P in the left image must project to a line in the right image.
- Epipolar constraint:
- Geometry of two views constrain where the corresponding pixel for some image point in the first view must occur in the second view.
- Baseline: line joining the camera centers
- Epipolar plane: plane containing baseline and world point
- Epipolar line: intersection of epipolar plane with the image plane - come in pairs
- Epipole: point of intersection of baseline with image plane
- All epipolar lines converge to epipole.
- Special: for parallel image planes, epipole is at infinity. Epipolar lines are parallel.
- Epipolar constraint reduces the correspondence problem to a 1D search along an epipolar line
- The epipolar constraint doesn't solve the correspondence problem by itself; it is the hard constraint. Beyond it, we have "soft" constraints to help:
- Similarity
- Uniqueness
- No more than one match in the right image for every point in the left image (a point may have no match due to occlusion)
- Ordering
- Won't hold for transparent objects or a narrow occluding surface
- Disparity gradient is limited
- Dense correspondence search
- For each pixel / window in the left image
- compare with every pixel/window on same epipolar line in right image
- pick position with minimum match cost (e.g. dissimilarity: SSD (sum of squared differences), similarity: normalized correlation)
- Better solutions
- Beyond individual correspondences to estimate disparities
- Optimize correspondence assignments jointly
- scanline at a time (DP)
- generates streaking artifacts since it doesn't consider previous line's result
- can't use dynamic programming to find spatially coherent disparities/correspondence on a 2D grid
- full 2D grid (graph cuts)
- Stereo matching as energy minimization
- Data term and smoothness term
- What defines a good stereo correspondence?
- Match quality: want each pixel to find a good correspondence match in the other image
- Smoothness: if two pixels are adjacent, they should (usually) move about the same amount
- Challenges
- Low-contrast; textureless image regions
- Occlusions
- Violations of brightness constancy (e.g., specular reflections)
- Really large baselines (foreshortening and appearance change)
- Camera calibration errors
- We need the relationship between coordinates in world and coordinates in camera: geometric camera calibration
- Composed of 2 transformations
    - From some (arbitrary) world coordinate system to the camera's 3D coordinate system: extrinsic parameters (camera pose)
    - From the 3D coordinates in the camera frame to the 2D image plane via projection: intrinsic parameters
$^B P = ^B_A R ^A P + ^B O_A$
- Real intrinsic parameters
- focal length is in "mm", but "Pixels" are in some arbitrary spatial units
- Maybe pixels are not square (scale factors $\alpha, \beta$)
- We don't know the origin of our camera pixel coordinates $(u_0, v_0)$
- Maybe skew between camera pixel axes
- 5 DOF: focal length and aspect (f, a) [or pixel size (sx, sy)], principal point (x'c, y'c), and skew (s)
- Whole projection equation:
- M = intrinsic * projection * rotation * translation
- DOF: 5 + 0 + 3 + 3 = 11
- The camera matrix does not account for lens distortion because an ideal pinhole camera does not have a lens. To accurately represent a real camera, the camera model includes the radial and tangential lens distortion.
- Tangential, p1, p2
- Tangential distortion occurs when the lens and the image plane are not parallel. The tangential distortion coefficients model this type of distortion.
- Radial, k1, k2, k3
- Radial distortion occurs when light rays bend more near the edges of a lens than they do at its optical center. The smaller the lens, the greater the distortion.
- Calibration using known points
- Place a known object in the scene
- Identify correspondence between image and scene
- compute mapping from scene to image
- Resectioning
- Estimating the camera matrix from known 3D points
- Direct linear calibration - homogeneous
- since m is only defined up to scale, solve for unit vector m*
- Solution: m* = eigenvector of $A^TA$ with the smallest eigenvalue
- Works with 6 or more points: each point gives 2 equations, and we have 11 unknowns
- SVD (singular value decomposition)
- A = UDV^T (D diagonal, U and V orthogonal)
- "Gold standard" algorithm (Hartley and Zisserman)
- Linear Solution
- optional: normalization
- direct linear transformation minimization
- Minimize geometric error: using the linear estimate as a starting point minimize the geometric error.
- Finding the 3D camera center from M
- The center C lies in the null space of the projection matrix M: find C such that MC = 0, and that is the center.
- $M = [Q \mid b]$, $C = \left(-Q^{-1} b,\ 1\right)^T$
- Multi-plane calibration using a checkerboard
- only requires a plane
- don't have to know positions/orientations
- good code available online
- Special projective transformations
- Translation
- preserves: lengths/areas, angles, orientation, lines
- Rotation
- preserves: lengths/areas, angles, lines
- Similarity (trans, rot, scale)
- preserves: angles, lines, scale lengths/areas
- 4 DOF, 1 rotation, 1 scale, 2 translation
- Affine transform
- preserves: parallel lines, ratio of areas, lines
- 6 DOF
- General projective transform (or Homography)
- 8 DOF
- Preserves: lines, also cross ratios
- The projective plane
- What is the geometric intuition of using homogeneous coordinates
- A point in the image is a ray in projective space
- Each point (x, y) on the plane (at z=1) is represented by a ray ( sx, sy, s)
- Image reprojection
- Rather than thinking of this as 3D reprojection, think of it as a 2D image warp from one image (plane) to another (plane).
- Application: image mosaic (panorama)
- Basic procedure
- Take a sequence of images from the same position
- Rotate the camera about its optical center
- Compute transformation between second image and first
- Transform the second image to overlap with the first
- Blend the two together to create a mosaic
- Repeat for more images
- Mapping between planes is a homography
- Whether from a plane in the world to the image, or between image planes
- Image rectification
- Measurements on planes
- Approach: unwarp then measure
- If there is a planar rectangular grid in the scene you can map it into a rectangular grid in the image
- Forward warping
- Send each pixel f(x,y) to its corresponding location (x', y') = T(x,y) in the second image
- What if the new location is between pixels?
- Inverse warping (correct way)
- For locations between pixels, use bilinear interpolation
- Point and Line Duality
- A line in the image is a plane of rays through the origin, defined by the normal $l = (a, b, c)$: all rays $(x, y, z)$ satisfying $ax + by + cz = 0$
    - A line is also represented as a homogeneous 3-vector, like a point.
    - A line l is perpendicular to every point (ray) p on the line: $l^T p = 0$
    - Points and lines are dual in projective space
- Given any formula, can switch the meaning of points and lines to get another formula
- To get the intersection of two lines, or the line connecting two points, just use the cross product.
- Ideal points and lines
- Ideal point, p (x, y, 0) - parallel to image plane, has infinite image coordinates
- Ideal line, l (a, b, 0) - normal is parallel to image plane
- Corresponds to a line in the image (finite coordinates) that goes through the image origin (principal point)
- 3D projective geometry
- Duality
- A plane N is also represented by a 4-vector N = (a,b,c,d)
- Points and planes are dual in 3D:
$N^T p = 0$
- Use two calibrated cameras
- From geometry to algebra to help implementation on computers
- $X'^T E X = 0$, where E is the essential matrix and X', X are the 3D coordinates of the point expressed in the two camera frames
- Weak calibration
- Estimate epipolar geometry from a (redundant) set of point correspondences between two uncalibrated cameras
- $p_{im} = K_{int} \Phi_{ext} P_{w} \rightarrow p_{im} = K_{int} p_c$
- $p_c = K_{int}^{-1} p_{im}$ gives us the ray the image point corresponds to
- $(K_{int,right}^{-1} p_{im,right})^T E (K_{int,left}^{-1} p_{im,left}) = 0$
- $p_{im,right}^T (K_{int,right}^{-T} E K_{int,left}^{-1}) p_{im,left} = 0$
- $p_{im,right}^T F p_{im,left} = 0$, F: fundamental matrix
- Simplified: $p^T F p' = 0$
- Properties of F
- $p^T l = 0 \rightarrow l = Fp'$ is the epipolar line in the p image associated with p'
- $l' = F^T p$ is the epipolar line in the p' image associated with p
- Epipoles can be found by $Fp' = 0$ and $F^T p = 0$
- F is singular (it maps a homogeneous 2D point to a 1D family of lines), so its rank is 2
- Fundamental matrix
- Relates pixel coordinates in two views
- More general form than essential matrix
- Remove the need to know intrinsics
- Compute F from correspondences
- One correspondence, one equation
- Enforce rank 2: take the SVD $F = UDV^T$, set the last (smallest) singular value in D to zero to get $\hat{D}$, then $\hat{F} = U \hat{D} V^T$
- Stereo image rectification
- Reproject image planes onto a common plane parallel to the line between the optical centers
- Pixel motion is horizontal after this transformation
- For 2-views, there is a geometric relationship that defines the relation between rays in one view and rays in the other - epipolar geometry
- These relationships can be captured algebraically as well
- calibrated: essential matrix
- uncalibrated: fundamental matrix
- This relation can be estimated from point correspondences
- We want local features
- Goal: find points in an image that can be
- found in other images
- found precisely - well localized
- found reliably - well matched
- Why?
- Want to compute fundamental matrix to recover geometry
- robotics/vision: see how a bunch of points move from one frame to another, allows computation of how camera moved -> depth -> moving objects
- Build a panorama
- Matching with features
- Detect features (feature points) in both images
- Match features - find corresponding pairs
- Use these pairs to align images
- Problems
- We need a repeatable detector
- We need a reliable and distinctive descriptor
- What makes a good feature
- Repeatability/Precision
- The same feature can be found in several images despite geometric and photometric transformations
- Saliency/Matchability
- Each feature has a distinctive descriptor
- Compactness and efficiency
- Many fewer features than image pixels
- Locality
- A feature occupies a relatively small area of the image; robust to clutter and occlusion
- Basic idea
- "flat" region: no change in all directions
- "edge": no change along the edge direction
- "corner": significant change in all directions with small shift
- Harris Corners
- Change in appearance for the shift [u, v]
$E(u, v) = \sum_{x,y} w(x,y) [I(x+u, y+v) - I(x,y)]^2$
- We want to find out how this function behaves for small shifts (u, v near 0, 0)
- Second-order Taylor expansion of E(u,v) about (0,0) (local quadratic approximation for small u, v)
- Second moment matrix M: $E(u,v) \approx [u\ v]\, M\, [u\ v]^T$
- Diagonalization: $M = R^{-1} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} R$
- The axis lengths of the ellipse are determined by the eigenvalues and the orientation is determined by $R$.
- The eigenvalues measure cornerness.
- Harris corner response function
$R = \det(M) - \alpha\, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2$
- R < 0: edge
- R > 0, corner
- |R| is small, flat region
- Harris corner detector algorithm
- Compute gaussian derivatives at each pixel
- Compute second moment matrix M in a gaussian window around each pixel
- Compute corner response function R
- Threshold R
- Find local maxima of the response function (non-maximum suppression)
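A compact sketch (my own, with assumed parameter values) of the Harris steps above: Gaussian derivatives, windowed second-moment matrix, response R, threshold, and non-maximum suppression. SciPy is an assumed dependency.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, sobel

def harris(img, sigma=1.0, alpha=0.04, thresh=1e-4):
    ix = sobel(gaussian_filter(img, sigma), axis=1)    # derivative in x
    iy = sobel(gaussian_filter(img, sigma), axis=0)    # derivative in y

    # Entries of the second-moment matrix M, summed in a Gaussian window
    sxx = gaussian_filter(ix * ix, 2 * sigma)
    syy = gaussian_filter(iy * iy, 2 * sigma)
    sxy = gaussian_filter(ix * iy, 2 * sigma)

    # R = det(M) - alpha * trace(M)^2
    r = (sxx * syy - sxy ** 2) - alpha * (sxx + syy) ** 2

    # Threshold and keep only 3x3 local maxima (non-maximum suppression)
    corners = (r > thresh) & (r == maximum_filter(r, size=3))
    return np.argwhere(corners)                        # (row, col) corner locations
```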
- Harris detector properties
- Rotation invariant
- Mostly invariant to additive and multiplicative intensity changes (threshold issue for multiplicative)
- Not invariant to image scale!
- Scale invariant detection solution
- Design a function on the region (circle), which is "scale invariant" - not affected by the size but will be the same for "corresponding regions", even if they are different size/scales.
- One method:
- at a point, compute the scale invariant function over different size neighborhoods (different scales)
- choose the scale for each image at which the function is a maximum
- A "good" function for scale detection: has one stable sharp peak, for usual images, a good function would be a one which responds to contrast (sharp local intensity change)
- Function is just application of a kernel: f = Kernel * Image
- Laplacian of Gaussian - LOG
- Difference of Gaussians - DOG
- both are invariant to scale and rotation
- Key point localization
- Find robust extremum (maximum or minimum) both in space and in scale.
- SIFT: Scale Invariant Feature Transformation
- Specific suggestions: use DOG pyramid to find maximum values, then eliminates "edges" and pick only corners.
- SIFT (Lowe): find local maximum of difference of gaussian in space and scale.
- Harris-Laplacian:
- Find local maximum of
- Harris corner detector in space (image coordinates)
- Laplacian in scale
- Criteria for point descriptors
- invariant
- distinctive
- Why can't we use harris detector and compare one by one using correlation
- Correlation is not rotation invariant
- Correlation is sensitive to photometric changes
- Normalized correlation is sensitive to non-linear photometric changes and even slight geometric ones.
- Could be slow - check all features against all features
- Idea of SIFT: image content is represented by a constellation of local features that are invariant to translation, rotation, scale, and other imaging parameters
- Overall SIFT procedure
- Scale-space extreme detection
- Keypoint localization
- Orientation assignment
- Keypoint description
- Nearest-neighbor matching to feature database
- SIFT uses best-bin-first modification to k-d tree algorithm
- Use heap data structure to identify bins in order by their distance from query point
- Wavelet-based hashing
- Compute a short (3-vector) descriptor from the neighborhood using a Haar "wavelet"
- Locality sensitive hashing
- 3D object recognition
- Train
- extract outlines with background subtraction
- compute "keypoints" interest points and descriptors
- Test
- Find possible matches
- Search for consistent solution, such as affine (needs 3 points to solve)
- Affine only need 3 points, can get recognition even with occlusion
- Feature-based alignment to find transforms
- Overall strategy
- Compute features - detect and describe
- Find useful matches: kd-tree, best-bin, hashing
- Compute and apply the best transformation, e.g. affine, translation, homography
- Algorithm
- Extract features
- Compute putative matches e.g. closest descriptor
- Loop until happy
- Hypothesize transformation T from some matches
- Verify transformation (search for other matches consistent with T) - mark best
- A better match way (Lowe, 1999)
- 1-NN (nearest neighbor): SSD (sum of squared difference) of the closest match
- 2-NN: SSD of the second closest match
- Look at how much better 1-NN is than 2-NN, e.g. 1-NN/2-NN
- Problem with typical "vertical" least squares which compute difference for y
- Not rotation-invariant
- Fails completely for vertical lines
- Total least squares
- Distance between point $(x_i, y_i)$ and line $ax + by = d$
- Find $(a, b, d)$ to minimize the sum of squared perpendicular distances
- Least squares as likelihood maximization
- Generative model
- line points are corrupted by Gaussian noise in the direction perpendicular to the line
- Non-robust to (very) non-gaussian noise
- Because squared error heavily penalizes outliers
- Solution:
- Robust estimator
$\rho(u, \sigma) = \frac{u^2}{u^2 + \sigma^2}$
- The robust function $\rho$ behaves like squared distance for small values of the residual u but saturates for large values of u
- RANSAC main idea
- If we had a proposed model, we could probably guess which points belong to that model: inliers
- RANdom SAmple Consensus:
- Randomly pick s points to form a sample
- Instantiate the model
- Get consensus set
$C_i$: the points within error bounds (distance threshold) of the model
- If $|C_i| > T$, terminate and return the model
- Repeat for $N$ trials, return the model with max $|C_i|$
- Choosing the parameters
- Initial number of points s
- Typically minimum number needed to fit the model
- Distance threshold t
- Choose t so probability for inlier is high
- Zero-mean Gaussian noise with std. dev. $\sigma$: $t^2 = 3.84 \sigma^2$
- Number of samples N
- Choose N so that with probability p, at least one random sample set is free from outliers.
- Need to set N based upon the outlier ratio e
- What to do when don't know inlier ratio e?
- Adaptively determining the number of samples
- Inlier ratio e is often unknown a priori
- Pick worst case, e.g. 50% (e = 0.5) and adapt if more inliers are found
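A small RANSAC sketch (my own illustration, not from the notes) for 2D line fitting with the adaptive-N rule above: sample s = 2 points, count the consensus set with distance threshold t, and shrink the required number of trials as the inlier ratio estimate improves.

```python
import numpy as np

def ransac_line(points, t=1.0, p=0.99, max_iters=1000):
    rng = np.random.default_rng(0)
    best_line, best_inliers = None, np.zeros(len(points), dtype=bool)
    n_needed, i = max_iters, 0
    while i < n_needed and i < max_iters:
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        normal = np.array([p2[1] - p1[1], p1[0] - p2[0]])   # line as ax + by = d
        if np.linalg.norm(normal) == 0:                     # degenerate sample, skip
            i += 1
            continue
        normal = normal / np.linalg.norm(normal)
        d = normal @ p1
        inliers = np.abs(points @ normal - d) < t           # consensus set
        if inliers.sum() > best_inliers.sum():
            best_line, best_inliers = (normal, d), inliers
            e = 1 - inliers.sum() / len(points)              # estimated outlier ratio
            n_needed = np.log(1 - p) / np.log(1 - (1 - e) ** 2 + 1e-12)
        i += 1
    return best_line, best_inliers
```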
- RANSAC advantages
- Simple and general
- Applicable to many different problems, often works well in practice
- Robust to large number of outliers
- Applicable to a larger number of model parameters than the Hough transform
- Parameters are easier to choose than Hough transform
- Not-so-good
- Computational time grows with the number of model parameters
- Not as good for getting multiple fits
- Really not good for approximate models
- Photometry: measuring light
- Surface appearance
- Image intensity = f ( normal, surface reflectance, illumination)
- Surface reflection depends on both the viewing and illumination directions
- Radiometry
- Radiance (L): Energy carried by a ray
- Power per unit area perpendicular to direction of travel, per unit solid angle
- Units: Watts per square meter per steradian $(W\,m^{-2}\,sr^{-1})$
- Irradiance (E): Energy arriving at a surface
- Incident power in a given direction, per unit area
- Units:
$Wm^{-2}$
- BRDF: bidirectional reflectance distribution function
- Helmholtz Reciprocity
$f(\theta_i, \varphi_i; \theta_r, \varphi_r) = f(\theta_r, \varphi_r; \theta_i, \varphi_i)$
- Rotational symmetry (isotropy)
- Reflection models
- Image intensity = Body reflection (diffuse) + surface reflection
- Diffuse reflection
- Lambertian BRDF: appear equally bright from all directions.
    - BRDF is simply a constant - the albedo $\rho_d$
    - Surface radiance: $L = \rho_d\, I \cos\theta_i = \rho_d\, I\, (\overrightarrow{n} \cdot \overrightarrow{s})$
- Specular reflection and mirror BRDF
- All incident light reflected in a single direction
- Mirror BRDF is simply a double-delta function
- Specular reflection and Glossy BRDF
- $L = I \rho_s (\overrightarrow{m} \cdot \overrightarrow{v})^k$, m source direction, v view direction
- Phong reflection model
- The BRDF of many surfaces can be approximated by Lambertian + Specular model
- Ambiguity of lighting and reflectance
- The Mondrian world, Assumptions
- Light is slowly varying
- This is reasonable for planar world: nearby image points come from nearby scene points with same surface normal
- Within an object reflectance is constant
- Between objects, reflectance varies suddenly.
- Land's Retinex Theory
- Goal: remove slow variations from the image
- Many approaches; one is to take logs of $L(x,y) = R(x,y) \cdot E(x,y)$: $\log(L(x,y)) = \log(R(x,y)) + \log(E(x,y))$
- High-pass filter (say with a derivative)
- Threshold to remove small low-frequencies
- Then invert process, take integral, exponentiate
- Human color and lightness constancy
- Color constancy: determine hue and saturation under different colors of lighting
- Lightness constancy: gray-level reflectance under differing intensity of lighting
- Surface normal
- Gaussian sphere and Gradient space projection
- Shape from shading: need more information
- Add more constraints: shape-from-shading, single image
- Take more images: photometric stereo
- Input: several images
- Same object
- Different lightings
- Same pose
- Motion applications
- Background subtraction
- Shot boundary detection
- Motion segmentation
- Segment the video into multiple coherently moving objects
- Motion estimation techniques
- Feature-based methods
- Extract visual features (corners, textured areas) and track them over multiple frames
- Sparse motion fields, but more robust tracking
- Suitable when image motion is large (10s of pixels)
- Direct, dense methods
- Directly recover image motion at each pixel from spatio-temporal image brightness variations
- Dense motion fields, but sensitive to appearance variations
- Suitable for video and when image motion is small
- Optical flow is the apparent motion of objects or surfaces
- How to estimate pixel motion from image $I(x,y,t)$ to $I(x,y,t+1)$?
    - Solve the pixel correspondence problem
        - Given a pixel in $I(x,y,t)$, look for nearby pixels of the same color in $I(x,y,t+1)$.
- Key assumptions
    - Color constancy: a point in $I(x,y,t)$ looks the same in $I(x',y',t+1)$
        - For grayscale images, this is brightness constancy
    - Small motion: points do not move far away
- Brightness constancy constraint equation
$I_x u + I_y v + I_t = 0$
- Aperture problem: you can only tell the motion locally in the direction perpendicular to the edge.
- For each pixel, we have 2 unknowns but only one equation
- Formulate the error in the optical flow constraint: $e_c$
- Add more constraints
    - Smoothness constraint (global constraint): the motion field tends to vary smoothly over the image, $e_s$
        - Penalizes changes in u and v over the image
    - $e = e_s + \lambda e_c$
- Solving the aperture problem
- One method (local constraint): pretend the pixel's neighbors have the same (u,v)
- If we use a 5x5 window, that gives us 25 equations per pixel.
- RGB version
- Note RGB alone at a pixel is not enough to disambiguate because R, G & B are correlated.
- Errors in Lucas-Kanade
- The motion is large (larger than a pixel) - Taylor doesn't hold
- Not-linear: iterative refinement
- Local minima: coarse-to-fine estimation
- Not tangent: Iterative refinement
- Iterative algorithm
- Estimate velocity at each pixel by solving Lucas-Kanade equations
    - Warp $I_t$ towards $I_{t+1}$ using the estimated flow field
        - Use image warping techniques
- Repeat until convergence.
- Solve local minima problem
- Reduce the resolution
- Use gaussian pyramid of images, start from the very reduced level, run iterative L-K, warp and upsample to next level.
- Sparse L-K
- Basically hierarchical L-K applied to good feature locations
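A short sketch of sparse, pyramidal Lucas-Kanade with OpenCV (an assumed library): track Shi-Tomasi feature points from one frame to the next. The file names and parameter values are illustrative assumptions.

```python
import cv2

prev = cv2.imread('frame0.png', cv2.IMREAD_GRAYSCALE)   # hypothetical frames
curr = cv2.imread('frame1.png', cv2.IMREAD_GRAYSCALE)

# Good features to track = corners with two large eigenvalues of the second-moment matrix
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Pyramidal (coarse-to-fine), iterative Lucas-Kanade
new_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev, curr, pts, None, winSize=(15, 15), maxLevel=3)

flow = (new_pts - pts)[status.ravel() == 1]   # displacement of successfully tracked points
```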
- General motion model
- $\begin{pmatrix} u(x,y) \\ v(x,y) \end{pmatrix} = \frac{1}{Z(x,y)} A(x,y)\, T + B(x,y)\, \Omega$
- where $T$ is the translation vector and $\Omega$ is the rotation
- $A(x,y) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix} \quad B(x,y) = \begin{pmatrix} (xy)/f & -(f+x^2)/f & y \\ (f + y^2)/f & -(xy)/f & -x \end{pmatrix}$
- Affine motion: if Z is big, motion in Z can be neglected, which simplifies homography motion to affine motion.
- Layered motion
- Break image sequence into "layers" each of which has coherent motion
- Each layer is defined by an alpha mask and an affine motion
- Motion segmentation
- Tracking challenges
- Many places it's hard to compute optical flow
- There can be large displacements, since objects could be moving rapidly
- Probably need to take dynamics into account
- Errors would compound - or drift
- Occlusions, disocclusions
- Shi Tomasi feature tracker
- Only compute motion where you should
- Find good features using eigenvalues of second-moment matrix
- From frame to frame, track with Lucas-Kanade and a pure translation model
- Check consistency of tracks by affine registration to the first (or earlier) observed instance of the feature
- Tracking with dynamics
- Given a model of expected motion, predict where objects will occur in next frame even before seeing the image
- Restrict search for the object
- Improved estimates, since measurement noise is reduced by trajectory smoothness
- Assumption - continuous (modeled) motion patterns
- Objects do not disappear and reappear in different places in the scene
- Camera is not moving instantaneously to new viewpoint
- Gradual change in pose between camera and scene
- Hidden state (X): true parameters we care about
- Measurement (Y): noisy observation of underlying state
- At each time step t, the state changes (from $X_{t-1}$ to $X_t$) and we get a new observation $Y_t$
- Our goal: recover (estimate) the most likely (distribution of) state $X_t$ given
    - All observations so far
- Knowledge about dynamics of state transitions
- Steps of tracking
- Prediction: what is the next state of the object given past measurements
- Correction: Compute an updated estimate of the state from prediction and measurements
- Tracking: The process of propagating this posterior distribution of state given measurements across time.
- Simplifying assumptions
- Only the immediate past matters
    - $P(X_t | X_0, \ldots, X_{t-1}) = P(X_t | X_{t-1})$ (dynamics model)
- Measurements depend only on the current state
    - $P(Y_t | X_0, Y_0, \ldots, X_{t-1}, Y_{t-1}, X_t) = P(Y_t | X_t)$ (observation model)
- Linear models
- Linear dynamics model
    - State undergoes a linear transformation plus Gaussian noise: $x_t \sim N(D_t x_{t-1}, \Sigma_{d_t})$
- Linear measurement model
    - Observation model: measurement is the linearly transformed state plus Gaussian noise: $y_t \sim N(M_t x_t, \Sigma_{m_t})$
- Measurement will only reduce uncertainty.
- 1D KF intuition: new mean is the weighted average of prediction and measurement based on variances.
- KF is a method for tracking linear dynamical models in Gaussian noise contexts (dynamics and measurements).
- Predicted/corrected state densities are Gaussian
- You only need to maintain the mean and covariance
    - The calculations are easy (all integrals can be done in closed form)
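A 1D Kalman-filter step sketch (my own, not from the notes) showing the "weighted average of prediction and measurement based on variances" intuition above. All parameter names are assumptions for illustration.

```python
def kf_1d_step(mu, var, u, var_d, z, var_m):
    # Predict: apply the (linear) dynamics shift u; uncertainty grows by var_d
    mu_pred = mu + u
    var_pred = var + var_d

    # Correct: blend prediction and measurement z, weighted by their variances
    k = var_pred / (var_pred + var_m)        # Kalman gain
    mu_new = mu_pred + k * (z - mu_pred)     # new mean = weighted average
    var_new = (1 - k) * var_pred             # measurement only reduces uncertainty
    return mu_new, var_new

state = kf_1d_step(mu=0.0, var=1.0, u=0.5, var_d=0.1, z=0.7, var_m=0.2)
```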
- KF pros
- Simple updates, compact and efficient
- KF Cons
- Unimodal distribution, only single hypothesis
- Restricted class of motions defined by linear model
- Extensions called "Extended Kalman Filter"
- What should we do for non-gaussian distributions? propagation of general densities
- Particle filters
- Density is represented by both where the particles are and their weight.
- Bayes filters: framework
- Given
    - Prior probability of the system state: $p(x)$
    - Action (dynamical system) model: $p(x_t | u_{t-1}, x_{t-1})$
    - Sensor model (likelihood): $p(z|x)$
    - Stream of observations $z$ and action data $u$: $data_t = \{u_1, z_2, \ldots, u_{t-1}, z_t\}$
- Prior probability of the system state
- Wanted
    - Estimate of the state $X$ at time $t$
    - The posterior of the state is also called the belief: $Bel(x_t) = P(x_t | u_1, z_2, \ldots, u_{t-1}, z_t)$
- Bayes rule
$p(x|z) = \frac{p(z|x) p(x)} {p(z)} = \eta p(z|x) p(x) \propto p(z|x) p(x)$
- Bayes filter
$Bel(x_t) = \eta\, P(z_t | x_t) \int P(x_t | u_{t-1}, x_{t-1})\, Bel(x_{t-1})\, dx_{t-1}$
- z = observation, u = action, x = state
- Properties of the real world
- Position of robot is not discrete (a real number)
- Sensor is noisy, so some readings may be false.
- As a result, the predicted position given sensor readings is a probability distribution over space.
- Localization
- Assume a robot knows a 3D map of its world
- It has noisy depth sensors but whose sensing uncertainty is known.
- It moves from frame to frame.
- Particle filter: practical considerations
- Resampling
- Roulette wheel: binary search, O(n log n)
- Stochastic universal sampling
- systematic resampling
- Linear time complexity
- Easy to implement, low variance
- Resample only when necessary
- For highly peaked observations
- add noise to observation and prediction models
- better proposal distributions - e.g. perform kalman filter step to determine proposal
- Overestimating noise often reduces the number of required samples; it is always better to slightly overestimate than underestimate
- Recover from failure - resample
- Selectively add samples from observations
- Uniformly add some samples
- Mean-shift
- easiest to introduce when doing segmentation
- The idea is to find the modes of a distribution, or a probability density
- The assumption is you have a set of instances drawn from a PDF and you want to find the mode.
- Mean-shift object tracking
- Start from the position of the model in the current frame
- Search neighborhood in next frame
- Find best by maximizing a similarity function
- Repeat the same process in the next pair of frames
- Representation, feature space, quantized color space, histogram
- Gradient
- Instead of uniform kernel, we could use a differentiable, isotropic, monotonically decreasing kernel, for example: Gaussian.
- You can move to the mode without blind search, but use gradient.
- We could use the similarity function as a sensor model for particle filtering.
- Tracking issues
- Initialization
- manual
- background subtraction
- detection
- Obtaining observation and dynamics model
- Dynamics model: learn from real data (pretty difficult), learn from "clean data" (easier), or specify using domain knowledge.
- Generative observation model - some form of ground truth required
- Prediction vs. Correction
- If the dynamics model is too strong, will end up ignoring the data.
- If the observation model is too strong, tracking is reduced to repeated detection.
- Data association
- What if we don't know which measurements to associate with which tracks?
- So far, we've assumed the entire measurement to be relevant to determining the state
- In reality, multiple objects or clutter (uninformative measurements)
- Solution:
- Simple: just pay attention to the closest
- More sophisticated strategy: keep track of multiple state/observation hypothesis
- Each particle is a hypothesis about current state.
- Drift
- Errors caused by dynamical model, observation model, and data association tend to accumulate over time.
- All trackers have some sort of model update.
- Object categorization
- Given a (small) number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label.
- Recognition challenges
- Robustness
- Illumination, object pose, clutter, occlusions, intra-class appearance, viewpoint
- Realistic scenes are crowded, cluttered, and have overlapping objects
- Importance of context
- Complexity
- About half of the cerebral cortex in primates is devoted to processing visual information.
- Supervised classification
- Given a collection of labeled examples, come up with a function that will predict the labels of new examples
- Two general strategies
- Use the training data to build representative probability model; separately model class-conditional densities and priors (generative)
- Directly construct a good decision boundary, model the posterior (discriminative)
- Risk of a classifier strategy $S$ is its expected loss.
    - At the best decision boundary, either choice of label yields the same expected loss.
- For a given measurement $x$ and set of classes $c_i$, choose $c^*$ by: $c^* = \arg\max_c p(c|x) = \arg\max_c p(c)\, p(x|c)$
- Summary of generative models
- Firm probabilistic grounding
- Allows inclusion of prior knowledge
- Parametric modeling of the likelihood permits using a small number of examples
- New classes do not perturb previous models
- Others
- Can take advantage of unlabelled data
- Can be used to generate samples
- Bads
- Where did you get those priors?
- Why are we modeling those obviously non-C points?
- The example hard cases aren't special
- If you have lots of data, doesn't help.
- Principal components are all about the directions in a feature space along which points have the greatest variance.
- Algebraic interpretation
- Minimizing sum of squares of distances to the line is the same as maximizing the sum of squares of the projections on that line.
- Maximize $x^T B^T B x$, subject to $x^T x = 1$
- i.e. maximize $E = x^T M x$ subject to $x^T x = 1$; with a Lagrange multiplier: $E' = x^T M x + \lambda (1 - x^T x)$, $\frac{\partial E'}{\partial x} = 2Mx - 2\lambda x$
- $\frac{\partial E'}{\partial x} = 0 \rightarrow Mx = \lambda x$ (x is an eigenvector of $B^T B$)
- $B^T B = \sum x x^T$ if about the origin
- $B^T B = \sum (x - \bar{x})(x - \bar{x})^T$ otherwise (outer product)
- So the principal components are the orthogonal directions of the covariance matrix of a set of points.
- PCA may not be a good choice if the axes of most variance of the source distributions are not perpendicular. ICA (independent component analysis) doesn't require the components or basis vectors to be orthogonal.
- We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images.
- PCA
    - Given $M$ data points $x_1, \ldots, x_M$ in $R^d$ where $d$ is big
    - We want directions in $R^d$ that capture most of the variation of the $x_i$. The coefficients would be $u(x_i) = u^T (x_i - \mu)$ ($\mu$ is the mean of the data points)
    - Direction that maximizes the variance of the projected data: $var(u) = \frac{1}{M} \sum^M_{i=1} u^T (x_i - \mu)(u^T (x_i - \mu))^T = u^T\left[ \frac{1}{M} \sum^M_{i=1} (x_i - \mu)(x_i - \mu)^T\right] u = u^T \Sigma u$
    - The direction that captures the maximum variation of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix $\Sigma$
- The dimensionality trick
    - $C = AA^T$ is a huge $d \times d$ matrix and $d \gg n$
    - To get the eigenvectors of $C$, we can compute the eigenvectors of $A^T A$, which is much smaller ($n \times n$). Suppose $v_i$ is an eigenvector of $A^T A$: $A^T A v_i = \lambda v_i$
    - Premultiply by $A$: $A A^T A v_i = \lambda A v_i$
    - So $A v_i$ are the eigenvectors of $C$!
- How many eigenvectors are there?
    - If $M > d$, there are $d$
    - But for $M \ll d$, there are $M-1$, because we subtract the mean, which removes one DOF
- Eigenfaces
    - Assume that most face images lie on a low-dimensional subspace spanned by the first $k$ ($k \lll d$) directions of maximum variance
    - Use PCA to determine the vectors or "eigenfaces" $u_1, u_2, \ldots, u_k$ that span that subspace
    - Represent all face images in the dataset as linear combinations of eigenfaces. Find the coefficients by dot product.
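A sketch of eigenfaces using the dimensionality trick above (my own code; the data shapes and k are made-up): eigenvectors of $A^T A$ are lifted back to eigenvectors of $AA^T$, then a face is represented by its coefficients.

```python
import numpy as np

faces = np.random.default_rng(0).random((50, 64 * 64))   # M=50 images, d=4096 pixels
mu = faces.mean(axis=0)
A = (faces - mu).T                                        # d x M matrix of centered faces

small = A.T @ A                                           # M x M instead of d x d
vals, vecs = np.linalg.eigh(small)                        # eigenvalues in ascending order
order = np.argsort(vals)[::-1][:10]                       # keep top k = 10 directions
eigenfaces = A @ vecs[:, order]                           # lift: A v_i are eigvecs of AA^T
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)          # normalize each eigenface

# Represent a face by its coefficients (dot products with the eigenfaces)
w = eigenfaces.T @ (faces[0] - mu)
reconstruction = mu + eigenfaces @ w
```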
- Recognition with eigenfaces
    - Given a novel image $x$:
        - Project onto the subspace: $[w_1, \ldots, w_k] = [u_1^T(x - \mu), \ldots, u_k^T (x - \mu)]$
        - Optional: check the reconstruction error $x - \hat{x}$ to determine whether the image is really a face.
        - Classify as the closest training face in the k-dimensional subspace
            - This is why it's a generative model.
- PCA limitations
- Global appearance method: not robust to misalignment, background variation
- The direction of maximum variance is not always good for classification
- More examples: eigen-gaits, eigen-bodyshape, non-linear kernel PCA
- Appearance-based tracking: Learn something about the possible target, by capturing a low-dimensional space to enable the ability of reconstruction, then you can track the target around by finding the place that you can reconstruct the best.
- Eigentracking [Black and Jepson 1996]
- Key insight: separate geometry from appearance - use deformable model
- But, need to solve nonlinear optimization problem
- And fixed basis set
- Incremental visual learning
- Adaptive visual tracker
- particle filter: draw samples from distributions of deformation
- Subspace-based tracking
- Learn to track the "thing" like the face in eigenfaces
- Use it to determine the most likely sample
- Incremental subspace update
- Does not need to build the model prior to tracking
- Handles variations in lighting, pose (and expression)
- State model $L_t$
    - Similarity transform: position $(x_t, y_t)$, rotation $\theta_t$, scaling $s_t$, 4 parameters
    - Or affine transform with 6 parameters
- Simple dynamics model: $p(L_t | L_{t-1})$
    - Each parameter is independently Gaussian distributed
- Observation model: $p(F_t | L_t)$
    - Use probabilistic PCA to model our image observation process
    - Given a location $L_t$, assume the observed frame was generated from the eigenbasis
- Incremental subspace update
    - Given the initial eigenbasis $B_{t-1}$ and a new observation $w_{t-1}$, compute a new eigenbasis $B_t$
    - Why? To account for appearance change due to pose, illumination, and shape variation
    - Learn an appearance representation while tracking
    - Essentially, develop an update algorithm with respect to a running mean, allowing for decay
- Occlusion handling
- An iterative method to compute a weight mask
- Estimate the probability of a pixel being occluded
- Model the decision boundaries probabilistically.
- Assumptions in discriminative classification
- There are a fixed number of known classes
- Ample number of training examples of each class
- Equal cost of making mistakes - what matters is getting the label right
- Need to construct a representation of the instance, but we don't know a priori what features are diagnostic of the class label.
- Train
- Build an object model - a representation describing the training instances
- learn/train a classifier
- Test
- Generate candidates in new image
- Score the candidates
- Window-based models
- Simple holistic descriptions of image content
- grayscale / color histogram
- vector of pixel intensities
- Pixel-based representations sensitive to small shifts
- Color or grayscale-based description can be sensitive to illumination and intra-class appearance variation.
- Consider edges, contours, and (oriented) intensity gradients
- Summarize local distribution of gradients with histogram
- Locally orderless: offers invariance to small shifts and rotations
- Contrast-normalization: try to correct for variable illumination
- Discriminative classifier construction
- Nearest neighbor
- Neural networks
- SVMs
- Boosting
- Random Forests
- Choose label of nearest training data point
- KNN
- for a new point, find the k closest points from the training data
- vote for the class
- Boosting
- Iterative learning method
- Initially weight each training example equally
- In each boosting round
- Find the weak learner that achieves the lowest weighted training error
- Raise weights of training examples misclassified by current weak learner
- General: compute the final classifier as a linear combination of all weak learners (the weight of each learner is directly proportional to its accuracy)
- Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g. AdaBoost)
- Integral image: the value at $(x,y)$ is the sum of the pixels above and to the left of $(x,y)$
    - Computing the sum within a rectangle: sum = A - B - C + D, where D is the top-left corner value and A is the bottom-right corner value (see the sketch below)
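A tiny sketch (my own) of building an integral image with cumulative sums and using the four-corner lookup above to get any rectangle sum in constant time.

```python
import numpy as np

img = np.arange(20, dtype=np.float64).reshape(4, 5)
ii = img.cumsum(axis=0).cumsum(axis=1)        # integral image

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from 4 integral-image lookups."""
    a = ii[bottom, right]                                         # bottom-right
    b = ii[top - 1, right] if top > 0 else 0.0                    # strip above
    c = ii[bottom, left - 1] if left > 0 else 0.0                 # strip to the left
    d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0.0    # top-left corner
    return a - b - c + d

assert rect_sum(ii, 1, 1, 3, 4) == img[1:4, 1:5].sum()
```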
- Viola-Jones face detector
- Represent brightness patterns with efficiently computable "rectangular" features within window of interest
- Choose discriminative features to be weak classifiers/learners
- Use boosted combination of them as final classifier
- Form a cascade of such classifiers, rejecting clear negatives quickly.
- Cascade
- Form a cascade with really low false negative rates early
- At each stage use the false positives from last stage as "difficult negatives"
- Summary of viola-jones face detector
- Key ideas:
- Rectangular features and integral image
- AdaBoost for feature selection
- Cascade
- Training is slow but detection is very fast.
- Boosting (general) advantages
- Integrates classification with feature selection
- Flexibility in the choice of weak learners, boosting scheme
- Complexity of training is linear in the number of training examples
- Testing is fast
- Easy to implement
- Disadvantages
- Needs many training examples
- Often found not to work as well as an alternative discriminative classifier, the SVM
    - Especially for many-class problems
- Linear 2D SVM
- Discriminative classifiers based on optimal separating line (2D case)
- Maximize the margin between positive and negative training examples
- $x_i$ positive $(y_i = 1)$: $x_i w + b \ge 1$
- $x_i$ negative $(y_i = -1)$: $x_i w + b \le -1$
- For support vectors: $x_i w + b = \pm 1$
- Distance between a point and the line: $\frac{|x_i w + b|}{||w||}$
    - For support vectors: $\frac{|x_i w + b|}{||w||} = \frac{\pm 1}{||w||}$
    - Margin: $M = \frac{2}{||w||}$
- Quadratic optimization problem
    - Minimize $\frac{1}{2} w^T w$ subject to $y_i (x_i w + b) \ge 1$
    - Solution: $w = \sum_i \alpha_i y_i x_i$, where $\alpha_i$ are learned weights and $x_i$ are support vectors
- Non linear SVMs
- original input space can be mapped to some higher-dimensional space
- The "kernel" trick
- We saw linear classifier relies on dot product between vectors
    - Define $K(x_i, x_j) = x_i \cdot x_j = x_i^T x_j$
    - Map $\Phi : x \rightarrow \varphi(x)$; the dot product becomes $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
    - A kernel function is a "similarity" function that corresponds to an inner product in some expanded feature space
    - $\sum_i \alpha_i y_i (x_i^T x) + b \rightarrow \sum_i \alpha_i y_i K(x_i, x) + b$, where $x$ is the new point and $x_i$ are support vectors
- Example of kernels
    - Gaussian RBF: $K(x_i,x_j) = \exp\left(-\frac{||x_i-x_j||^2}{2\sigma^2}\right)$
        - Number of dimensions: infinite, since $\exp\left(-\frac{1}{2} ||x-x'||^2_2\right) = \sum^\infty_{j=0} \frac{(x^T x')^j}{j!} \exp\left(-\frac{1}{2} ||x||^2_2\right) \exp\left(-\frac{1}{2} ||x'||^2_2\right)$
    - Histogram intersection: $K(x_i, x_j) = \sum_k \min(x_i(k), x_j(k))$
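A brief scikit-learn sketch (an assumed library, not mentioned in the notes; the data is toy) of training and predicting with a Gaussian RBF kernel SVM, mirroring the recognition recipe that follows.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 16))                   # toy feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # toy binary labels

clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # Gaussian RBF kernel
clf.fit(X, y)                                   # solves for support vectors & weights

print(len(clf.support_vectors_))            # usually a sparse subset of the data
print(clf.predict(X[:5]))                   # sign of the kernel decision function
```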
- SVMs for recognition:
- Training
- Define your representation
- Select a kernel function
- Compute pairwise kernel values between labeled examples
- Use this "kernel matrix" to solve for SVM support vectors & weights
- Prediction
- Compute kernel values between new input and support vectors
- Apply weights
- Check sign of output
- Multi-class SVMs
- Combine a number of binary classifiers
- one vs. all
- Training: learn an SVM for each class vs. the rest
- Testing: Apply each SVM to test example and assign it to the class of the SVM that returns the highest decision value
- one vs. one
- training: learn an SVM for each pair of classes
- testing: each learned SVM "votes" for a class to be assigned to the test example
- SVMs pros
- many publicly available SVM packages
- kernel-based framework is very powerful, flexible
- Often a sparse set of support vectors - compact when testing
- Works very well in practice, even with very small training sample sizes
- SVMs cons
- No "direct" multi-class SVM, must combine two-class SVMs
- Can be tricky to select best kernel function for a problem
- Nobody writes their own SVMs
- Computation, memory
- during training time, must compute matrix of kernel values for every pair of examples
- learning can take a very long time for large-scale problem
- Indexing local features
- Each patch/region has a descriptor,which is a point in some high-dimensional feature space (e.g. SIFT)
- When we see close points in feature space, we have similar descriptors, which indicates similar local content.
- Easily can have millions of features to search
- For text documents, an efficient way to find all pages on which a word occurs is to use an index
- We want to find all images in which a feature occurs.
- To use this idea, we'll need to map our features to "visual words"
- Bags of visual words
- Summarize entire image based on its distribution (histogram) of word occurences
- Analogous to bag of words representation commonly used for documents
- Comparing bags of words
- Rank by normalized scalar product between their (possibly weighted) occurrence counts -- nearest neighbor search for similar images.
- A video is a sequence of frames captured over time
- Now our image data is a function of space $(x,y)$ and time $(t)$
- Background subtraction
- Needs static camera - still hard
- Simple approach
    - Estimate the background for time $t$
    - Subtract the estimated background from the current input frame
    - Apply a threshold to the absolute difference to get the foreground mask
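A minimal sketch (my own) of this simple approach; here the background estimate is the median of the previous n frames, one of the estimators discussed below, and the threshold value is an arbitrary assumption.

```python
import numpy as np

def foreground_mask(prev_frames, current, thresh=25):
    """prev_frames: list of previous n grayscale frames; current: new frame."""
    background = np.median(np.stack(prev_frames, axis=0), axis=0)  # background estimate
    diff = np.abs(current.astype(np.float64) - background)         # subtract background
    return diff > thresh                                           # threshold -> mask
```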
- Frame differencing
- Background is estimated to be the previous frame
- Has problems for object pixels that do not change over time
- Mean filtering
    - Background is the mean of the previous $n$ frames
    - It will average objects into the background as well
- Median filtering
    - Assuming that the background is more likely to appear in a scene, we can use the median of the previous $n$ frames as the background model
    - Advantages
- Extremely easy to implement and use
- All pretty fast
- The background model need not be constant; it changes over time
- Disadvantages
- Accuracy of frame differencing depends on object speed and frame rate
- Median background model - relatively high memory requirements
- Setting global threshold is hard
- Background subtraction with depth, kinect
- Epipolar plane images
- EPI gait
- A stack of images gives us a volume of data
- can be used to find out background of moving objects by checking the slopes of moving strips
- Terminology
- Event: a detection at a single instant in time
- Actions or Movements: atomic motion patterns
- Often gesture-like
- Single clear-cut trajectory
- Single nameable behavior (e.g., sit, wave arms)
- Activity: Series or composition of actions
- E.g. interactions between people
- Model based action recognition
- Use human body tracking and pose estimation techniques, relate to action descriptions
- Major challenge: training data from different context than testing
- Model based activity recognition
- Given some lower level detection of actions (or events) recognize the activity by comparing to some structural representation of the activity
- Needs to handle uncertainty
- Major challenge: accurate tracks in spite of occlusion, ambiguity, low resolution
- Activity as motion, space-time appearance patterns
- Activity-recognition from a static image
- Motion History Images
- MHIs are a different function of temporal volume
- Pixel operator is replacement decay
        - If moving: $I_\tau(x,y,t) = \tau$
        - Otherwise: $I_\tau(x,y,t) = \max(I_\tau(x,y,t-1) - 1, 0)$
- MHI shows which things moved and also how things moved.
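A short sketch (my own) of the replacement-decay MHI update above, driven by a simple frame-difference motion mask; the threshold and tau values are arbitrary assumptions.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30, motion_thresh=20):
    moving = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32)) > motion_thresh
    decayed = np.maximum(mhi - 1, 0)        # otherwise: decay by 1, floor at 0
    return np.where(moving, tau, decayed)   # if moving: replace with tau
```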
- MEI motion energy images
- binary image, roughly locates the area of motions of a given action from a given view
- Image moments
    - Moments summarize a shape given an image $I(x,y)$: $M_{ij} = \sum_x \sum_y x^i y^j I(x,y)$
    - Central moments are translation invariant: $\mu_{pq} = \sum_x \sum_y (x-\bar{x})^p (y-\bar{y})^q I(x,y)$, where $\bar{x} = \frac{M_{10}}{M_{00}}$, $\bar{y} = \frac{M_{01}}{M_{00}}$
- Hu moments
- Translation and rotation and scale invariant
- Compute the 7 Hu moments of the MEI and MHI to get a feature vector
- Hidden Markov Models (8D-L3)
- Possible problems
- Time series classification
- Time series prediction/generation
- Outlier detection
- Time series segmentation
- Markov model
- Probability of moving to a given state depends only on the current state: 1st order Markovian
- Ingredients of a Markov Model
    - States: $\{S_1, S_2, \cdots, S_N\}$
    - State transition probabilities: $a_{ji} = P(q_{t+1} = S_i | q_t = S_j)$
    - Initial state distribution: $\pi_i = P[q_1 = S_i]$
- Hidden Markov Models
- You can't observe the state, but you can observe observables (evidence)
- Emission probabilities: $b_j(k) = P(o_t = k \mid q_t = S_j)$, the probability of an observation given that you are in a given state
- Given some sequence of observations, what "model" generated those?
- 3 computational problems of HMMs
- Evaluation: compute $P(O|\lambda)$, i.e. given the model $\lambda = (A,B,\pi)$, what is the probability of occurrence of a particular observation sequence $O = \{o_1, \cdots, o_T\}$?
- Classification/recognition problem: I have a trained model for each of a set of classes; which one would most likely generate what I saw?
- Decoding: find the optimal state sequence to produce an observation sequence $O = \{o_1, \cdots, o_T\}$
- Useful in recognition problems; helps give meaning to states
- Learning: determine the model $\lambda$, given a training set of observations
- Find $\lambda$ such that $P(O|\lambda)$ is maximal
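A minimal sketch of the evaluation problem, computing $P(O|\lambda)$ with the forward algorithm; the matrix conventions are chosen to match $a_{ji}$ and $b_j(k)$ above, and no scaling is applied (real implementations rescale to avoid underflow on long sequences):

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """Evaluation: P(O | lambda) via the forward algorithm.

    A:   (N, N) transitions, A[j, i] = P(q_{t+1} = S_i | q_t = S_j)
    B:   (N, M) emissions,   B[j, k] = P(o_t = k | q_t = S_j)
    pi:  (N,)   initial state distribution
    obs: sequence of observation symbol indices o_1 .. o_T
    """
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(i) = (sum_j alpha_t(j) a_ji) * b_i(o_{t+1})
    return alpha.sum()                     # P(O | lambda) = sum_i alpha_T(i)
```

For classification, one would run each class's trained model over the same observation sequence and pick the class whose model gives the highest probability.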
- HMMs general
- HMMs: generative probabilistic models of time series (with hidden state)
- Forward-backward: algorithm for computing probabilities over hidden states
- Given the forward-backward algorithms you can train the models
- Best known methods in speech, computer vision, and robotics, though for really big data CRFs are winning.
- Good points about HMMs
- A learning paradigm that acquires spatial and temporal models and does some amount of feature selection
- Recognition is fast, training is not so fast but not too bad
- Not so good points
- Not great for on the fly labeling - e.g. segmentation of input streams. Requires lots of data to train for that - much like language
- Works well when problem is easy, less clear other times.
- Retina: color matching functions based on RGB
- Most spectral colors can be represented as a positive combination of these primary colors, but some cannot; matching them requires adding some red to the test color (i.e., a negative red weight).
- Color is a Psychological phenomenon
- Luminance vs color
- Humans are more sensitive to brightness changes than to color changes.
- HSL and HSV
- Hue Saturation Lightness
- Hue Saturation Value
- Hue corresponds to color
- Treating hue as an angle brings the reds from both ends of the spectrum into proximity
- Better reflects the role of saturation (radius or distance from center)
- Color gamuts: subspace of color space for monitor/printer/etc...
- The YUV color space makes it easier to separate colors.
- Easier clustering of pixels
- Efficient encoding by chroma subsampling
- Recall that human vision is more sensitive to intensity changes
- Y channel can now use more bits
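A sketch of why YUV helps with efficient encoding, assuming BT.601-style luma weights and simple 2x2 block averaging for the chroma subsampling (exact coefficients and subsampling schemes vary by standard):

```python
import numpy as np

def rgb_to_yuv420(rgb):
    """Convert an RGB image to full-resolution Y plus 2x2-subsampled U, V.

    rgb: (H, W, 3) float array in [0, 1]; H and W are assumed even.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance: kept at full resolution
    u = 0.492 * (b - y)                     # chroma channels
    v = 0.877 * (r - y)
    # chroma subsampling: average each 2x2 block, halving U/V resolution
    u_sub = u.reshape(u.shape[0] // 2, 2, u.shape[1] // 2, 2).mean(axis=(1, 3))
    v_sub = v.reshape(v.shape[0] // 2, 2, v.shape[1] // 2, 2).mean(axis=(1, 3))
    return y, u_sub, v_sub
```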
- Segmentation
- Segmentation of coherent regions
- Figure ground segmentation
- Separate foreground object (figure) from the background (ground)
- Grouping of similar neighbors: superpixels
- Clustering
- Best cluster centers are those that minimize the SSD between all points and their nearest cluster center $c_i$
- K-means clustering: need to set $k$ before clustering
- Clustering for an image can be thought of as quantization of the feature space
- Simple clustering for images
- Intensity-based
- Color-based
- K-means cons
- Memory-intensive, need to pick k, sensitive to initialization, sensitive to outliers, only finds "spherical" clusters
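A minimal k-means sketch for color-based image clustering; the feature vectors are pixel colors, and k, the iteration count, and the seed are assumed choices:

```python
import numpy as np

def kmeans(features, k, n_iters=20, seed=0):
    """Cluster (N, D) feature vectors into k groups; returns (labels, centers)."""
    features = np.asarray(features, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(n_iters):
        # assign each point to its nearest center (the SSD criterion above)
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):
                centers[i] = features[labels == i].mean(axis=0)
    return labels, centers

# e.g. quantizing an RGB image into k colors:
# labels, centers = kmeans(img.reshape(-1, 3), k=5)
# segmented = centers[labels].reshape(img.shape)
```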
- The mean shift algorithm seeks modes or local maxima of density in the feature space.
- Mean shift clustering
- Cluster: all data points in the attraction basin of a mode
- Attraction basin: the region for which all trajectories lead to the same mode
- Steps
- Find features (color, gradients, texture, Luv space, etc.)
- Initialize windows at individual feature points (sampled pixels)
- Perform mean shift for each window (pixel) until convergence
- Merge windows (pixels) that end up near the same "peak" or mode
- Pros
- Automatically find basin of attraction
- One parameter choice (window size)
- Does not assume (image) shape on clusters
- Generic technique
- Find multiple modes
- Cons
- Selection of window size
- Does not scale well with dimension of feature space
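A sketch of one mean-shift trajectory with a flat (uniform) kernel; in a full segmentation this is run from many sampled pixels, and windows that converge to nearby modes are merged into one cluster. The bandwidth is the single window-size parameter noted above:

```python
import numpy as np

def mean_shift_mode(features, start, bandwidth, n_iters=100, tol=1e-3):
    """Follow one window from `start` to the mode (local density maximum) it is attracted to.

    features:  (N, D) feature vectors (e.g. Luv color and/or position)
    bandwidth: radius of the flat window
    """
    mean = np.asarray(start, dtype=np.float64)
    for _ in range(n_iters):
        in_window = np.linalg.norm(features - mean, axis=1) < bandwidth
        if not np.any(in_window):                     # assumes the window covers at least one point
            break
        new_mean = features[in_window].mean(axis=0)   # shift window to the mean of points inside it
        if np.linalg.norm(new_mean - mean) < tol:     # converged: this is the mode
            break
        mean = new_mean
    return mean
```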
- Color, brightness, position alone are not enough to distinguish all regions.
- Grouping pixels based on texture similarity.
- Feature space: filter bank responses.
- Texture features
- Find "textons" by clustering vectors of filter bank outputs
- Describe texture in a window based on its texton histogram
- Images as graphs
- Fully-connected graph
- 1 node (vertex) for every pixel
- A link between every pair of pixels, $\langle p, q \rangle$
- Affinity weight $w_{pq}$ for each link (edge)
- $w_{pq}$ measures similarity: inversely proportional to difference (in color, position, ...)
- Segmentation by graph partitioning
- Break graph into segments
- Delete links that cross between segments
- Easiest to break links with low affinity
- Graph cut
- Set of edges whose removal makes a graph disconnected
- Cost of a cut
- Sum of weights of cut edges
- A graph cut gives us a segmentation
- Find minimum cut
- problems
- weight of the cut is proportional to the number of edges in the cut
- tends to produce small, isolated components
- Normalized cut
- Fix the bias of min cut by normalizing for the size of the segments: $Ncut(A,B) = \frac{cut(A,B)}{assoc(A,V)} + \frac{cut(A,B)}{assoc(B,V)}$
- $assoc(A,V)$ = sum of weights of all edges that touch A
- Pros
- Generic framework
- Flexible to choice of function that computes weights ("affinities") between nodes
- Does not require model of the data distribution
- Cons
- Time complexity can be high
- Dense, highly connected graphs
- many affinity computations
- Solving eigenvalue problem
- Preference for balanced partitions
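A compact sketch of the spectral relaxation commonly used to approximate the normalized cut: solve $(D - W)y = \lambda D y$ and threshold the eigenvector for the second-smallest eigenvalue. A dense affinity matrix is assumed here, which is exactly the cost issue listed above:

```python
import numpy as np

def ncut_bipartition(W):
    """Approximate two-way normalized cut for a symmetric affinity matrix W of shape (N, N)."""
    d = W.sum(axis=1)                           # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W                          # graph Laplacian D - W
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt         # symmetric normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    y = D_inv_sqrt @ eigvecs[:, 1]              # generalized eigenvector for the 2nd-smallest eigenvalue
    return y > np.median(y)                     # threshold to split nodes into segments A and B
```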
- Useful operations
- Thresholding a gray-scale image
- Determining good thresholds
- Connected components analysis
- Binary mathematical morphology
- All sorts of feature extractors, statistics (area, centroid, circularity, ..)
- Thresholding
- Automatic thresholding: Otsu's method
- Assumption: the histogram is bimodal
- Method: find the threshold $t$ that minimizes the weighted sum of within-group variances for the two groups that result from separating the gray tones at value $t$.
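A direct (unoptimized) sketch of Otsu's criterion for an 8-bit image: scan all candidate thresholds and keep the one with the smallest weighted sum of within-group variances:

```python
import numpy as np

def otsu_threshold(gray):
    """gray: 2D uint8 image. Returns the threshold value t in [0, 255]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_score = 0, np.inf
    for t in range(1, 255):
        w0, w1 = prob[:t].sum(), prob[t:].sum()        # group weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0       # group means
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        var0 = ((levels[:t] - mu0) ** 2 * prob[:t]).sum() / w0   # group variances
        var1 = ((levels[t:] - mu1) ** 2 * prob[t:]).sum() / w1
        within = w0 * var0 + w1 * var1                 # weighted within-group variance
        if within < best_score:
            best_t, best_score = t, within
    return best_t
```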
- Connected components methods
- Recursive tracking (almost never used)
- Parallel growing (needs parallel hardware)
- Row-by-row (most common)
- Classical algorithm
- Efficient Run-length algorithm (developed for speed in real industrial applications)
- CC = 0
- Scan across rows:
- If pixel is 1 and connected: propagate the lowest label from behind or above (4- or 8-connected); remember conflicts
- If pixel is 1 and not connected: CC++ and label with CC
- If pixel is 0: label 0
- Relabel based on the conflict (equivalence) table
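A runnable sketch of the row-by-row idea using 4-connectivity and a small union-find table for the label conflicts (this illustrates the classical two-pass algorithm, not the run-length version):

```python
import numpy as np

def connected_components(binary):
    """Two-pass row-by-row labeling with 4-connectivity; 0 = background."""
    binary = np.asarray(binary)
    labels = np.zeros(binary.shape, dtype=int)
    parent = [0]                                   # equivalence (conflict) table

    def find(x):                                   # follow parents to the representative label
        while parent[x] != x:
            x = parent[x]
        return x

    h, w = binary.shape
    next_label = 0
    for r in range(h):
        for c in range(w):
            if not binary[r, c]:
                continue                           # 0: label stays 0
            above = labels[r - 1, c] if r > 0 else 0
            left = labels[r, c - 1] if c > 0 else 0
            neighbors = [l for l in (above, left) if l > 0]
            if not neighbors:                      # 1 and not connected: start a new component
                next_label += 1
                parent.append(next_label)
                labels[r, c] = next_label
            else:                                  # 1 and connected: propagate the lowest label
                labels[r, c] = min(neighbors)
                if above and left and above != left:   # remember the conflict
                    ra, rl = find(above), find(left)
                    parent[max(ra, rl)] = min(ra, rl)
    for r in range(h):                             # second pass: relabel from the conflict table
        for c in range(w):
            if labels[r, c]:
                labels[r, c] = find(labels[r, c])
    return labels
```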
- Mathematical morphology
- two basic operations
- Dilation
- expands the connected sets of 1s of a binary image, can be used for growing features, filling holes and gaps
- Erosion
- shrinks the connected sets of 1s of a binary image, can be used for shrinking features, removing bridges, branches, protrusions
- Structuring element
- A shape mask used in basic morphology operations
- Any shape, size that is digitally representable
- With a defined origin
- Given binary B and structuring element S, we can perform dilation using OR and erosion using AND.
- The two most useful binary morphology operations are opening and closing
- Opening is the compound operation of erosion followed by dilation (with the same structuring element)
- Can show that the opening of A by B is the union of all translations of B that fit entirely within A
- Opening is idempotent: repeated operations have no further effect.
- Intuitively, the opening is the area we can paint when the brush has a footprint B and we are not allowed to paint outside A.
- Closing is the compound operation of dilation followed by erosion (with the same structuring element)
- Can show that the closing of A by B is the complement of union of all translations of B that do not overlap A
- Closing is idempotent: repeated operations have no further effect.
- Intuitively, the closing is the area we cannot paint when the brush has a footprint B and we are not allowed to paint inside A.
- Boundary extraction
- Let $A\oplus B$ denote the dilation of A by B and let $A\ominus B$ denote the erosion of A by B.
- The boundary of A can be computed as $A - (A\ominus B)$, where B is a 3x3 square structuring element
- We subtract from A an erosion of it to obtain its boundary
- Sensing
- Passive 3D sensing
- Stereo camera to get depth
- Recover depth from focus (depth from focus/defocus)
- Active 3D Sensing
- Photometric stereo, control light
- Time of flight: LIDAR (Laser Range finder), SONAR (Sound transceiver)
- Structured light: like stereo, but replace one camera with a projector
- Infrared structured light: RGB-D camera, MS kinect
- Projected IR vs Natural Light Stereo
- Projected IR
- Works in low light conditions
- Does not rely on having textured objects
- Not confused by repeated scene textures
- Can tailor the algorithm to the projected patterns
- Natural Light Stereo
- Work outside, anywhere with sufficient light
- Resolution limited only by sensors, not projector
- Difficulties with both
- Very dark surfaces may not reflect enough light
- Specular reflection in mirrors or metal causes trouble
- Depth image vs Point cloud
- Depth image
- Advantages
- Dense representation
- Gives intuition about occlusion and free space
- Depth discontinuities are just edges on the image
- Disadvantages
- Viewpoint dependent, can't merge
- Doesn't capture physical geometry
- Need actual 3D locations of cameras
- Point clouds: take every depth pixel and put it out in the world
- Advantages
- Viewpoint independent
- Captures surface geometry
- Points represent physical locations
- Disadvantages
- Sparse representation
- Loses information about free space and unknown space
- Variable density based on distance from sensor
- Point clouds are sampled from object surfaces, the concept of volume is inferred, not perceived.
- Surfaces
- Tangent plane - defined by normal
- First-order approximation
- We can use surface normals of point clouds to find planes
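To make the depth image vs point cloud distinction concrete, here is a sketch of back-projecting a depth image into a camera-frame point cloud with a pinhole model; the intrinsics fx, fy, cx, cy are assumed known for the sensor:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) of Z values into an (N, 3) point cloud.

    fx, fy: focal lengths in pixels; cx, cy: principal point. Pixels with depth 0
    (no measurement) are dropped, which is one reason the cloud is sparse.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel coordinates (row v, column u)
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx              # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy              # Y = (v - cy) * Z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```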
- Visual field
- Monocular visual field, each eye: $160^\circ$ (horizontal)
- Binocular visual field: $120^\circ$ (horizontal)
- Total visual field: $200^\circ$ (horizontal) x $135^\circ$ (vertical)
- Light detection: rods and cones
- Rods
- 120 million rods in the retina
- 1000x more light sensitive than cones
- Discriminate between brightness levels, in low illumination
- Short-wavelength sensitive
- Cones
- 6–7 million cones in the retina
- Responsible for high-resolution vision
- Discriminate colors
- Three types of color sensors (64% red, 32% green, 2% blue)
- Sensitive to any combination of the three
- Image capture
- Huge dynamic range
- overall: $10^{-6}$ to $10^{+8}\ cd/m^2$ (candelas per square meter: a unit of luminance)
- static: at least 100:1 probably more
- a given scene in the real world: 100,000:1
- Everyone knows about the pupil, but it's actually the retina's ganglion cells that make this work.
- V1
- V2 and IT