The surface code is unarguably the leading quantum error correction code for 2D nearest-neighbor architectures, featuring a high threshold error rate of approximately 1%, low-overhead implementations of the entire Clifford group, and flexible, arbitrarily long-range logical gates. These highly desirable features come at the cost of significant classical processing complexity. We show how to perform the processing associated with an n×n lattice of qubits, each being manipulated in a realistic, fault-tolerant manner, in O(n²) average time per round of error correction. We also describe how to parallelize the algorithm to achieve O(1) average processing time per round, using only constant computing resources per unit area and local communication. Both of these complexities are optimal.
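
The optimality claim can be motivated by a simple counting argument. The sketch below rests on two assumptions that are not stated in the text above: each round of error correction produces a number of syndrome measurement bits proportional to the number of qubits, and every such bit must be examined at least once. The symbols T_serial and T_parallel are introduced here for illustration only.

```latex
\begin{align*}
  \text{syndrome bits per round} &= \Theta(n^2)
    && \text{(assumed: proportional to the } n \times n \text{ qubit count)} \\
  T_{\text{serial}} &= \Omega(n^2)
    && \text{(a single processor reads each bit at least once)} \\
  T_{\text{parallel}} &= \Omega(1)
    && \text{(some work is done in every round)} \\
  \underbrace{\Theta(n^2)}_{\text{processing elements}} \times
  \underbrace{O(1)}_{\text{time each}} &= O(n^2)
    && \text{(total parallel work matches the serial bound)}
\end{align*}
```

Under these assumptions, O(n²) average serial time and O(1) average parallel time both match their lower bounds, and constant computing resources per unit area correspond to Θ(n²) processing elements across the lattice.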