Thanks for engaging on this, Thomas. It’s been a tough one to ask questions about because ultimately there isn’t one isolated thing I can point to and ask about. Finding where to even look is the challenge for me.

My goal, if I can figure this out, is to write both a Keras and a PyTorch version and throw them up on GitHub with a README.md or blog post that details the issues I ran into and how I got it working. Of course, comparing against Yun Chen’s code feels like “cheating” (imagine if I were trying to replicate a detailed paper with no code to reference), but at this point, I’ll do whatever I can.

We don’t share a common data loader. Mine is actually an absurd monstrosity, and one of the things I’ve narrowed in on is that the anchor labeling code is wrong in some non-obvious way. It generates anchors that are similar but not identical to Chen’s. We have a lot of the same positive anchors but weirdly, due to some sort of slight numerical difference, I sometimes end up with a few more or fewer, or with different aspect ratios. Originally I was too conservative with my anchor assignment. I intend to completely rewrite this at some point, but to unblock myself I’ve adapted Chen’s code to proceed (and that is what made me confident that our RPN functionality is the same).

The key difference in my code is that I don’t generate 2D tensors of anchor boxes, regression targets, etc. I actually prepare a large tensor of shape (H,W,k,8) per image (4D, or 5D once a batch dimension is added), where k is the number of anchors and that last dimension is stuffed with ground truth data: anchor valid flag, object/not object, highest overlap GT object box class index (unused), and the 4 regression targets.
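As a rough illustration of the conversion between the two formats, flattening such a map into per-anchor label and target arrays could look something like the sketch below. The channel indices are placeholders chosen for the example, `flatten_anchor_map` is a made-up name, and the label convention (1 = object, 0 = background, -1 = ignore) is an assumption about what the flat-format loss functions expect.

```python
import numpy as np

# Hypothetical channel layout for the last axis, for illustration only.
VALID, OBJECT, CLASS_IDX = 0, 1, 2
REGRESSION = slice(4, 8)  # assumed position of the 4 regression targets


def flatten_anchor_map(anchor_map):
    """Convert an (H, W, k, 8) ground truth map into flat per-anchor arrays."""
    flat = anchor_map.reshape(-1, anchor_map.shape[-1])  # (H*W*k, 8)
    valid = flat[:, VALID] > 0
    # Assumed convention: 1 = object, 0 = background, -1 = ignored anchor.
    labels = np.where(valid, flat[:, OBJECT], -1).astype(np.int64)
    targets = flat[:, REGRESSION]  # (H*W*k, 4)
    return labels, targets


# Tiny example: a 2x3 feature map with k = 9 anchors per location.
anchor_map = np.zeros((2, 3, 9, 8), dtype=np.float32)
anchor_map[0, 1, 4, VALID] = 1
anchor_map[0, 1, 4, OBJECT] = 1
anchor_map[0, 1, 4, REGRESSION] = [0.1, -0.2, 0.3, 0.4]
anchor_map[1, 2, 0, VALID] = 1  # a valid background anchor

labels, targets = flatten_anchor_map(anchor_map)
print("positives:", int((labels == 1).sum()),
      "negatives:", int((labels == 0).sum()),
      "ignored:", int((labels == -1).sum()))
```

Counting positives, negatives, and ignored anchors on the flattened labels is also a quick way to compare two labeling implementations on the same image.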

This complicates my loss functions a bit but I’ve written test programs that convert between my format and what his loss functions take and have convinced myself we are computing the same loss in the end, at least for the RPN and probably for the regression targets of the detector. The detector class score might be different.

I’ve encountered all kinds of very subtle bugs in my code over the last several weeks but none of them have made a large difference. Now, there doesn’t seem to be much left to examine, although I do suspect that the bounding box regression targets of the final detector stage might not be converging rapidly and I’m not sure why. They aren’t wrong enough to point to a problem in proposal labeling.

Initial weights for the layers taken from VGG-16 (not just the conv layers at the start but also the two fully connected layers in the final detector stage) may have an impact but I load the same weights that are loaded into Chen’s model (the Caffe weights, which assume images are preprocessed using the original VGG-16 procedure of ImageNet mean subtraction).
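For reference, the original VGG-16 (Caffe) preprocessing amounts to converting to BGR channel order and subtracting the per-channel ImageNet mean, with no further scaling. A minimal sketch, assuming the input is an RGB uint8 array (the function name is just for illustration):

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used with the Caffe VGG-16 weights.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess_caffe_vgg16(image_rgb):
    """image_rgb: (H, W, 3) uint8 array in RGB order with values in 0-255."""
    image_bgr = image_rgb[:, :, ::-1].astype(np.float32)  # RGB -> BGR
    return image_bgr - IMAGENET_MEAN_BGR  # no division by 255 or by a std dev


image = np.full((4, 4, 3), 128, dtype=np.uint8)
image[0, 0] = [255, 0, 0]  # one pure red pixel
out = preprocess_caffe_vgg16(image)
print(out[0, 0])  # BGR values after mean subtraction
```

Feeding the Caffe weights images preprocessed any other way (RGB order, values scaled to 0-1) is an easy mistake to make silently, so it is worth asserting on a known pixel.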

One thing the paper neglects to mention, but which every implementation seems to do, is scale the detector regression targets. I’m doing this too. During training, I also print out the statistics (mean and, more importantly, std dev) of each of the final regression targets: ty, tx, th, tw. I think that the prediction stats should converge to the ground truth target stats over time. They seem to do so in my model, but slowly. The RPN is much quicker. This is actually the one statistic I have not yet observed in Chen’s model, and I will do so tomorrow. It may confirm that this is where the problem is, or it could be a red herring.
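To be concrete about what is being scaled and measured, here is a minimal sketch of the target computation and the statistics printout. The box format (y1, x1, y2, x2), the function names, and the normalization constants are assumptions for the example; the constants are values commonly seen in implementations and do not come from the paper.

```python
import numpy as np

# Assumed normalization constants for (ty, tx, th, tw); not from the paper.
TARGET_MEANS = np.array([0.0, 0.0, 0.0, 0.0])
TARGET_STDS = np.array([0.1, 0.1, 0.2, 0.2])


def regression_targets(proposals, gt_boxes):
    """Boxes are (N, 4) float arrays of (y1, x1, y2, x2). Returns scaled targets."""
    ph = proposals[:, 2] - proposals[:, 0]
    pw = proposals[:, 3] - proposals[:, 1]
    py = proposals[:, 0] + 0.5 * ph
    px = proposals[:, 1] + 0.5 * pw
    gh = gt_boxes[:, 2] - gt_boxes[:, 0]
    gw = gt_boxes[:, 3] - gt_boxes[:, 1]
    gy = gt_boxes[:, 0] + 0.5 * gh
    gx = gt_boxes[:, 1] + 0.5 * gw
    t = np.stack([(gy - py) / ph, (gx - px) / pw,
                  np.log(gh / ph), np.log(gw / pw)], axis=1)
    return (t - TARGET_MEANS) / TARGET_STDS


def print_stats(name, t):
    """Print the mean and std dev of each regression component."""
    for i, label in enumerate(("ty", "tx", "th", "tw")):
        print(f"{name} {label}: mean={t[:, i].mean():+.3f} std={t[:, i].std():.3f}")


proposals = np.array([[0, 0, 10, 10], [0, 0, 10, 10]], dtype=np.float64)
gt_boxes = np.array([[0, 0, 10, 10], [1, 2, 11, 12]], dtype=np.float64)
scaled = regression_targets(proposals, gt_boxes)
print_stats("target", scaled)
```

Calling `print_stats` on both the ground truth targets and the network's predictions for the same batch gives the two sets of numbers being compared above.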

But if Chen’s regression target statistics do converge faster, I’m at a loss to explain what is causing it. At this point, I am very confident that I am feeding in roughly the same number of positive/negative samples (at one point I discovered I was feeding in too few due to a bug in the anchor assignment code). I’ve made some visualizations of the proposal targets and how they evolve over time, and both our models look the same at this point.

Thanks,

Bart