Cleaning up Messy Development Code
Data science development is characterized by nimble, iterative experimentation and exploratory data analysis, which often results in messy development code. Code cleanup is hence an essential step in moving data science work into production. Because the complete development history is stored in artifacts, LineaPy can clean up code automatically, which eases the transition to production.
First, identify the variable of interest in the development code. For instance, we might be interested in cleaning up the development code for model3.
Then, store the variable as an artifact:
artifact = lineapy.save(model3, "best_model")
And simply ask for its cleaned-up code:
print(artifact.get_code())
This returns only the code relevant to the artifact. That is, LineaPy condenses the original code by removing extraneous operations that do not affect the artifact we care about (e.g., plotting and print statements).
Note
This does not mean that we lost other parts of the development code. We can still access the artifact's full session code (including comments) with artifact.get_session_code(). This should come in handy when trying to remember or understand the original development context of a given artifact.
Known Issues
We discuss a few cases in which LineaPy's code cleanup might lead to issues, and the steps the user can take to get around them.
Unvisited Conditional
Suppose we have code that contains a conditional, for example:
import lineapy
lst = []
var = 10
if var > 5:
    lst.append(10)
    var += 10
else:
    lst.append(20)
    var += 20
lineapy.save(var, 'variable')
In this example, suppose we then request the cleaned-up code for the artifact as follows:
artifact = lineapy.get('variable')
print(artifact.get_code())
The cleaned-up code returned is as follows:
var = 10
if var > 5:
    var += 10
else:
    lst.append(20)
    var += 20
Note that if the else branch of the cleaned-up code is ever visited, Python raises a NameError saying that the name lst is not defined. This happens because LineaPy executes the user code while building the Linea Graph to obtain run-time information, which is what lets it produce fairly accurate cleaned-up code. With a conditional, however, only one branch is executed, so LineaPy has no accurate run-time information about the instructions in the non-visited branch. It therefore makes an approximation: it includes all the instructions in the branches that were not visited. Because it cannot perform analysis within a non-visited branch, LineaPy does not know that the variable lst, which appears in the cleaned-up code only because the non-visited else branch was included, is defined outside the if block. The definition of lst is consequently sliced out as not required for the final collected artifact.
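The failure can be reproduced without LineaPy by running the cleaned-up code shown above with a starting value that takes the else branch. The following is a minimal sketch: the template string and the run_sliced helper are hypothetical names introduced only for this illustration, and the starting value 3 is an assumed input chosen to force the unvisited branch.

```python
# Cleaned-up code from above, parameterized on the starting value of `var`
# so that either branch can be exercised. SLICED_TEMPLATE and run_sliced are
# hypothetical names used only for this illustration.
SLICED_TEMPLATE = """\
var = {start}
if var > 5:
    var += 10
else:
    lst.append(20)
    var += 20
"""


def run_sliced(start):
    """Execute the sliced code in a fresh namespace and return the final `var`."""
    namespace = {}
    exec(SLICED_TEMPLATE.format(start=start), namespace)
    return namespace["var"]


# Branch visited during the original session (var > 5): runs fine.
print(run_sliced(10))

# Branch not visited during the original session (var <= 5):
# the definition of `lst` was sliced out, so Python raises NameError.
try:
    run_sliced(3)
    error_message = None
except NameError as exc:
    error_message = str(exc)
    print(f"NameError: {exc}")
```

With the original value of 10 the sliced code behaves exactly like the development code, so the problem only surfaces once the input changes enough to flip the condition.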
To get around this behavior, the user can manually edit the source code and delete the instructions within the non-visited branch that do not contribute to the artifact. Alternatively, the user can add a dummy variable declaration inside that branch to ensure the cleaned-up code does not crash. Both options are shown below:
import lineapy
lst = []
var = 10
if var > 5:
    lst.append(10)
    var += 10
else:
    var += 20
lineapy.save(var, 'variable')
or
import lineapy
lst = []
var = 10
if var > 5:
    lst.append(10)
    var += 10
else:
    lst = []
    lst.append(20)
    var += 20
lineapy.save(var, 'variable')
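As a sanity check, the sketch below runs the slices we would expect from the two workarounds down both branches. The slice contents are an assumption, not captured LineaPy output: they are written by hand on the premise that LineaPy keeps the non-visited else branch verbatim, as described above. The helper final_var and the starting values 10 and 3 are likewise hypothetical.

```python
# Expected slices for the two workarounds. These are hand-written under the
# assumption that the unvisited `else` branch is kept verbatim; they are not
# output captured from LineaPy.
SLICE_AFTER_DELETION = """\
var = {start}
if var > 5:
    var += 10
else:
    var += 20
"""

SLICE_WITH_DUMMY = """\
var = {start}
if var > 5:
    var += 10
else:
    lst = []
    lst.append(20)
    var += 20
"""


def final_var(template, start):
    """Run a slice in a fresh namespace and return the final value of `var`."""
    namespace = {}
    exec(template.format(start=start), namespace)
    return namespace["var"]


# Exercise both branches: 10 takes the `if` branch, 3 takes the `else` branch.
results = {
    name: [final_var(template, start) for start in (10, 3)]
    for name, template in (
        ("deletion", SLICE_AFTER_DELETION),
        ("dummy", SLICE_WITH_DUMMY),
    )
}
print(results)
```

Under that assumption, neither slice depends on a name defined outside the conditional, so both branches run to completion.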