By Jordan Vrtanoski

Detecting potential problems in code before a product is released can prevent failures in production and lower the cost of operating the system. Automated code review tools rely on detecting code patterns that are known to cause problems. These methods are unable to find new types of problems.
We can apply machine learning to the problem. The question is how to find a good training set. One answer is the public open source repositories (GitHub, Bitbucket, etc.). With each commit, the developer provides a code fragment (the modifications to the original file) and information about the change (fix, new feature, merge, etc.).
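As a rough illustration of how commit history could become a training set, the sketch below labels commits from their messages and pairs each diff with a "was this a fix?" flag. The keyword patterns and the function names `label_commit` and `build_training_pairs` are assumptions made for this example. Real repositories use many different message conventions, so the labels would be noisy.

```python
import re

# Hypothetical keyword patterns; projects differ widely in how they word commits.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|patch)\b", re.IGNORECASE)
MERGE_PATTERN = re.compile(r"^merge\b", re.IGNORECASE)


def label_commit(message: str) -> str:
    """Return a coarse label for a commit message: 'merge', 'fix', or 'other'."""
    stripped = message.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if MERGE_PATTERN.search(first_line):
        return "merge"
    if FIX_PATTERN.search(message):
        return "fix"
    return "other"


def build_training_pairs(commits):
    """Turn (message, diff) tuples into (diff, is_fix) examples, skipping merges."""
    pairs = []
    for message, diff in commits:
        label = label_commit(message)
        if label == "merge":
            continue  # merges usually carry no change of their own
        pairs.append((diff, label == "fix"))
    return pairs
```

Merges are dropped here because they mostly repeat changes that already appear in other commits. Whether that is the right choice depends on the repository's workflow.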
We can train a model that uses the defect fixes to learn the patterns in which code can be broken. We can extract features from the code using NLP, but we can go a step further and enhance them. Since code follows strict syntactic rules, we can generate features from the parsed code tree (e.g. level of nesting, recursion, etc.).
The model would return the sections of the code that have a high probability of needing defect fixes in the future.
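The shape of that output can be sketched as follows: each code section gets a probability, and the sections above a threshold are reported with the riskiest first. The weights, bias, threshold, and feature names below are hypothetical placeholders standing in for a trained model, not results of any actual training.

```python
import math

# Hypothetical weights standing in for a trained model; real values would
# come from fitting on labeled commit history.
WEIGHTS = {"max_nesting": 0.8, "recursive": 0.5}
BIAS = -3.0


def defect_probability(features):
    """Return a logistic score in (0, 1) computed from a feature dictionary."""
    z = BIAS + sum(w * float(features.get(name, 0)) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def flag_sections(sections, threshold=0.5):
    """Return (name, probability) pairs at or above `threshold`, riskiest first."""
    scored = [(name, defect_probability(f)) for name, f in sections.items()]
    flagged = [(name, p) for name, p in scored if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A reviewer could then spend their time on the flagged sections instead of reading the whole change.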





Published: 28 Jan, 2021

CC BY