What's the difference between these two ways of computing a LookAt matrix?

I have been trying to understand how the view matrix is constructed given the position of the camera, the point to look at and the up vector.

I found two tutorials, here and here, that explain this. However, they differ in the way the view matrix is built. In the former, they create a translation and a rotation, then multiply them to get the view matrix. In the latter, though, they just put the translation in the last row (or column, depending on the row/column-major convention).

So my question is, why are there two ways? Shouldn't the lookAt matrix be unique? To my understanding, and after reading and thinking a lot, it seems to me that the first blog has the correct way. Am I missing something?

Solution

(assuming row-major order, because that's what I usually use)

Think about it this way: at first, you have the untransformed world, with a camera at some position with some rotation. You know that, in order to get from that world to the transformed world where the camera is at the origin and pointing toward +z or -z or what-have-you, you need to do some sort of translation (because the camera isn't at the center) and some sort of rotation (because the camera could be pointing in any direction).

Since it's easiest to perform a rotation when the point you're rotating around (the camera) is at the origin, you first want to translate the camera so that it's located at the origin. Here's what the matrix would look like:

1 0 0 -camera_x
0 1 0 -camera_y
0 0 1 -camera_z
0 0 0     1

After you perform this translation, the camera is at the origin. Now you can rotate it so it's pointing the way you want. Rotations can be accomplished in many ways, so instead of writing an actual matrix, I'll just give a placeholder:

a b c 0
d e f 0
g h i 0
0 0 0 1

Now, how can we combine these matrices to get our view matrix? The rule is that you multiply the matrices all together, in right-to-left order: the rightmost matrix is the one applied to a point first. So the calculation looks like this:

view = rotation * translation

because you're translating first, then rotating second.
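
To make that concrete, here is a minimal sketch in plain Python. The camera position (2, 3, 5) and the 90-degree rotation about the y axis are made-up values standing in for the placeholder a..i matrix, and matmul and transform are small helpers written just for this example. Points are treated as column vectors, matching the matrices above.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, p):
    """Apply a 4x4 matrix to the point (x, y, z), treated as a column vector."""
    v = [p[0], p[1], p[2], 1]
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


camera = (2, 3, 5)  # hypothetical camera position

# Moves the camera to the origin.
translation = [[1, 0, 0, -camera[0]],
               [0, 1, 0, -camera[1]],
               [0, 0, 1, -camera[2]],
               [0, 0, 0, 1]]

# Stand-in rotation: 90 degrees about the y axis, chosen so every entry is an integer.
rotation = [[0, 0, 1, 0],
            [0, 1, 0, 0],
            [-1, 0, 0, 0],
            [0, 0, 0, 1]]

# Right-to-left: translation is applied first, rotation second.
view = matmul(rotation, translation)

# Where the camera itself ends up after applying the view matrix.
camera_in_view = transform(view, camera)
```

With these values, camera_in_view comes out as (0, 0, 0): the camera lands on the origin, which is what a view matrix is supposed to do.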

To answer your question: if you were rotating first, and translating second, like this:

view = translation * rotation

then view would be equal to this:

a b c -camera_x
d e f -camera_y
g h i -camera_z
0 0 0     1

That's because, when you left-multiply a matrix like this rotation (one whose last row is 0 0 0 1) by a pure translation matrix, you get your original matrix with the xyz offsets added to the top three values in the last column. That might be what that second tutorial was trying to do.

However, matrix multiplication is not commutative. When assembling your view matrix the easier way (translate first, rotate second), you can't just subtract the camera position from the rotation matrix, because you're rotating a translation instead of translating a rotation, which is just a more complicated thing to do by hand. In this case, you need to multiply the two matrices together.
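
Here is a small sketch that compares the two orderings side by side. As before, the camera position and the rotation are made-up values, and matmul is a helper written for this example. The last few lines also check a shortcut that I'm adding as an aside: when you translate first and rotate second, each entry of the last column works out to minus the dot product of the corresponding rotation row with the camera position.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


camera = (2, 3, 5)  # hypothetical camera position

translation = [[1, 0, 0, -camera[0]],
               [0, 1, 0, -camera[1]],
               [0, 0, 1, -camera[2]],
               [0, 0, 0, 1]]

# Stand-in rotation: 90 degrees about the y axis.
rotation = [[0, 0, 1, 0],
            [0, 1, 0, 0],
            [-1, 0, 0, 0],
            [0, 0, 0, 1]]

translate_then_rotate = matmul(rotation, translation)  # view = rotation * translation
rotate_then_translate = matmul(translation, rotation)  # view = translation * rotation

# Last column (top three values) of each product.
col_translate_first = [row[3] for row in translate_then_rotate[:3]]
col_rotate_first = [row[3] for row in rotate_then_translate[:3]]

# Shortcut for the translate-first case: minus the dot product of each
# rotation row with the camera position.
shortcut = [-sum(rotation[i][k] * camera[k] for k in range(3)) for i in range(3)]
```

Both products keep the same upper-left 3x3 rotation block. Only the last column differs: col_rotate_first is just the negated camera position, while col_translate_first is the rotated version of it and matches shortcut.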