My first Py

Recently I went out on site and collected two reasonably large data sets which could not be easily handled by proprietary software.  This rather fundamental need to be able to actually use my survey data inspired me to have a go at the first ArchaeoPy challenge and to start to get to grips with the Python language.  Having completed and uploaded some script, I was encouraged by Popefinn and Chrys Harris to share it with you in this post.

So let's remind ourselves of the original challenge and the subsequent clues: in summary, load a text file of XYZ data, grid it and, for extra points, undertake a ZMT on the data. I have never programmed before, nor do I have any knowledge of Python beyond using short scripts for field calculations in ArcGIS, so my script here is clumsy and poorly written. It is mainly based upon the clues given in the challenge posts, the marvellous resource that is the Stack Overflow forum and several other Python guides and tutorials easily found with proficient use of Google. I was most definitely undertaking some kinesthetic learning throughout the process!

For the ZMT, you might like to play with my play.csv file, which is made-up data with big stripes. The yxz.csv file is the CMD data turned round, because the ZMT works along the x axis. In this post I am going to work through my script and try to explain what I have done and why I have done it...

Getting the data in:

[screenshot: entry]

In the challenge example above, the file name and file parameters were all specified within the script. This is fine if you only ever want to look at one file, but ideally we want our script to be versatile enough to be used on any file, so I made use of the Tkinter module to open a file through a dialogue box, and raw_input() to allow user input to set parameters throughout the script (e.g. delimiter, number of header rows etc.). Now when we load the text file, we use the specified parameters, not hard-coded values.

Note the skip = raw_input() followed by skipint = int(skip) - this ensures that the value entered by the user is converted from a string to an integer so that the following arguments work. You could also use float(value), depending on what you subsequently wanted to do.

[screenshot: input]
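To give a flavour of it, the loading step looks roughly like this (a sketch only: the prompts, defaults and variable names here are illustrative rather than copied straight from my script):

import Tkinter, tkFileDialog
import numpy as np

#Open a file dialogue box rather than hard coding the file name
root = Tkinter.Tk()
root.withdraw()
fname = tkFileDialog.askopenfilename(title = 'Select XYZ text file')

#Let the user set the file parameters
delim = raw_input('Delimiter (default is comma): ') or ','
skip = raw_input('Number of header rows (default is 1): ') or '1'
skipint = int(skip)    #convert the string returned by raw_input to an integer

#Load the text file using the specified parameters, not hard coded values
data = np.genfromtxt(fname, delimiter = delim, skip_header = skipint)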

Gridding the data and getting some extra info:

[screenshot: grid_data]

In the challenge example above, the parameters to grid the data with np.linspace() were all entered directly into the script, but we were given the rather cryptic clue of using numpy array indexing to extract the required information. Time for some Googling.

[screenshot: min_max]

At this point the data remains as three columns of x, y and z, so I used the numpy.amin and numpy.amax functions to calculate the minimum and maximum of each of the relevant columns. ([:,0] specifies the left-most column (i.e. x), while [0,:] would be the top-most row.)
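In sketch form (using the data array loaded above):

x_min = np.amin(data[:,0])    #smallest value in the left most (x) column
x_max = np.amax(data[:,0])
y_min = np.amin(data[:,1])
y_max = np.amax(data[:,1])
z_min = np.amin(data[:,2])
z_max = np.amax(data[:,2])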

At this point it struck me that knowing something about the original data spacing would be really useful, so I calculated it by specifying that an x value is a point in the first column [0] and then creating an ordered list of all the x values without duplicates. len(x_values) returns the number of unique x values (i.e. the number of unique measurement positions along the line) and this, in conjunction with the range of the data, lets us calculate the measurement spacing.
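Something along these lines (again just a sketch, and my script builds the list slightly differently):

x_values = sorted(set(data[:,0]))            #ordered list of x values, duplicates removed
num_x = len(x_values)                        #number of unique measurement positions
x_spacing = (x_max - x_min) / (num_x - 1)    #original measurement spacing along the line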

[screenshot: interp]

In order to grid the data I made use of the mins and maxes and other values calculated above, together with further user input to specify the new grid spacing and the type of interpolation. Because we are trying to view raw data, I didn't want to over-interpolate it before undertaking any processing, which is why I calculated the original spacings above. My idea was that the regridding would use spacings as close to the original values as possible, to avoid inventing too many data points at this stage in the process.
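In outline, the regridding can be done with scipy.interpolate.griddata (a sketch: the names and defaults are mine, and y_spacing would be worked out in the same way as x_spacing above):

from scipy.interpolate import griddata

#User specified grid spacing, defaulting to the original spacing
new_dx = float(raw_input('New x spacing: ') or x_spacing)
new_dy = float(raw_input('New y spacing: ') or y_spacing)
method = raw_input("Interpolation ('nearest', 'linear' or 'cubic'): ") or 'nearest'

#Build the new grid with np.linspace and interpolate the z values onto it
xi = np.linspace(x_min, x_max, int(round((x_max - x_min) / new_dx)) + 1)
yi = np.linspace(y_min, y_max, int(round((y_max - y_min) / new_dy)) + 1)
grid_x, grid_y = np.meshgrid(xi, yi)
grid_z = griddata((data[:,0], data[:,1]), data[:,2], (grid_x, grid_y), method = method)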

Plotting:

There wasn't much of a clue given about how to plot the data, but some hard-core use of Google gave a few options. Because I started using medians and percentiles later on, it made no sense to use standard deviations for plotting levels at this point. I wanted to display certain statistical values to help the user determine plotting ranges, and to set some defaults based upon these values. I think the most important thing to note here is the interpolation = 'none' within plt.imshow(): this means that there will be no automatic image interpolation, so you will be viewing your data according to the data interpolation specified above. The subplot() is used to allow the raw and manipulated data to be viewed together at the end of the routine.

[screenshot: plot_params]

There are a number of parameters we can set for our plot. I have tried to incorporate user input to allow the script to be used on more than one type of data, and I have specified default values for most parameters.
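Roughly speaking (a sketch; the statistics shown and the default limits are just one possible choice, not necessarily those in my script):

import matplotlib.pyplot as plt

#Show some statistics to help the user choose plotting levels
print 'Median:', np.median(grid_z)
print '5th and 95th percentiles:', np.percentile(grid_z, 5), np.percentile(grid_z, 95)

vmin = float(raw_input('Lower plotting limit (default 5th percentile): ')
             or np.percentile(grid_z, 5))
vmax = float(raw_input('Upper plotting limit (default 95th percentile): ')
             or np.percentile(grid_z, 95))

#Left hand subplot for the raw (regridded) data; interpolation = 'none'
#stops imshow smoothing the image on top of the data interpolation above
plt.subplot(1, 2, 1)
plt.imshow(grid_z, interpolation = 'none', vmin = vmin, vmax = vmax,
           extent = (x_min, x_max, y_min, y_max), origin = 'lower')
plt.title('Raw data')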

Extra Credit?!

So, Popefinn left us with this little snippet:

[screenshot: extra]

Firstly, I chose to use a Zero Median Traverse and not a Zero Mean Traverse because the former is much less affected by outliers than the latter: a single large spike on an otherwise flat traverse shifts the mean considerably but leaves the median almost untouched.

[screenshot: zmt_params]

Using a numpy array based on our regridded data, I used the axis = n argument to calculate the median along each row of the array, giving one array containing a value for each row of the data (e.g. ([5],[7],[4],[5],[4]) if there were 5 rows of data). I then transposed this result so that it became an array in the correct orientation for what I subsequently wished to do. I repeated this for the 5th and 95th percentiles. By default, I used these 5th and 95th percentile arrays as the limits with which to remove outliers from the original data set, replacing the outliers with 'nan' (a no-data value). Printing the array lets the user see that the outliers have been removed and assess whether everything they might want removed has gone. I have included a means of setting your own limits, but I am not sure how well this is working.

Having removed the outliers, I repeated the process of calculating the median value of each row before undertaking the actual 'Zero Median Traverse' process by subtracting these new median values from the relevant rows of the original data set.

[screenshot: zmt]
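Putting that together, the core of the process looks roughly like this (a sketch; grid_z is the regridded array from above, and I have used np.nanmedian here as a tidier stand-in for the way my script recalculates the medians once the outliers have been replaced):

#Median of each row (traverse) of the regridded data, transposed into a column
row_median = np.array([np.median(grid_z, axis = 1)]).T

#5th and 95th percentile limits for each row, also as columns
low_lim = np.array([np.percentile(grid_z, 5, axis = 1)]).T
up_lim = np.array([np.percentile(grid_z, 95, axis = 1)]).T

#Replace outliers with 'nan' (a no data value) and show the result
clipped = np.copy(grid_z)
clipped[clipped < low_lim] = np.nan
clipped[clipped > up_lim] = np.nan
print clipped

#Recalculate the row medians ignoring the nan values, then subtract each
#row's median from that row
new_median = np.array([np.nanmedian(clipped, axis = 1)]).T
ZMT = grid_z - new_median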

Having specified the plotting parameters for the second data set, the raw and the ZMT data are displayed together, allowing you to visually assess how well the process has worked.
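For example (a sketch following on from the plotting code above):

#Right hand subplot for the processed data, shown alongside the raw data
vmin2 = np.percentile(ZMT, 5)
vmax2 = np.percentile(ZMT, 95)
plt.subplot(1, 2, 2)
plt.imshow(ZMT, interpolation = 'none', vmin = vmin2, vmax = vmax2,
           extent = (x_min, x_max, y_min, y_max), origin = 'lower')
plt.title('Zero Median Traverse')
plt.show()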

[screenshot: images]

I am not totally convinced the script is quite doing what I am trying to get it to do, although, obviously, most of the stripes are removed from the data. I would be keen to hear your thoughts on the best way to deal with the outliers prior to the ZMedT process without clipping or editing the original data (that would be a despike!): is it merely a question of correctly estimating and specifying the limits, or do we need to account for the outliers and generate new median values in an entirely different way?

So, despite the clear issues with what I have produced, I have certainly learnt a lot more about Python and I think there are a couple of points to take away from the exercise:

By sharing our own scripts, we can optimize routines to manipulate and use our data in precisely the ways we want to, generating output in the form we want for whatever subsequent processes we might wish to undertake. This avoids the usual issues of trying to move data between proprietary software packages.

Scripting functionality requires thinking about functionality. This means that we gain a far better understanding of how we are manipulating our data and of the caveats that are inherent in the processes we routinely apply. This in turn means that we can understand, interpret and use our data to much greater effect.

Clearly I still have everything to learn about Python, but already I can see numerous potential applications for my fieldwork-heavy PhD. In the meantime, I await Challenge 2!

Addendum:

Since writing this I have had another go at getting the correction values for the ZMedT process. With a little bit of help, I tried a line-by-line iteration (which reassuringly gave the same result as the original method) but then further refined the median correction values by borrowing a bit of code from here, which allowed me to interpolate over the 'nan' values and then get a better estimate of the line medians without the presence of either the outliers or the nan values, both of which skewed the result. I have uploaded the script to the ArchaeoPY repository, with the new chunk outlined below:

ol_median = np.median(orig_data, axis = n)
print "\n"'Original median line values'"\n"
print ol_median

#New bit of code starts here
np.set_printoptions(threshold = np.inf, precision = 3)    #print arrays in full, to 3 decimal places

print "\n"'Enter upper threshold or'
print 'use default of 95th percentile'
up_thresh = raw_input('95') or 95

print "\n"'Enter lower threshold or'
print 'use default of 5th percentile'
low_thresh = raw_input('5') or 5

row_len = len(orig_data[0,:])    #number of readings in each traverse
col_len = len(orig_data)         #number of traverses
new_data = np.array([])

#Helper (borrowed from the link above): returns a boolean mask of the nan
#positions and a function giving the indices of any mask, for use with np.interp
def nan_helper(y):
    return np.isnan(y), lambda z: z.nonzero()[0]

for row in orig_data:
    rowcopy = np.copy(row)
    #Use limits to ignore outliers
    nine_five = np.percentile(rowcopy, up_thresh)
    five = np.percentile(rowcopy, low_thresh)
    rowcopy[rowcopy > nine_five] = np.nan
    rowcopy[rowcopy < five] = np.nan
    #Interpolate over nan values to improve median estimates
    y = rowcopy
    nans, x = nan_helper(y)
    y[nans] = np.interp(x(nans), x(~nans), y[~nans])
    y = y.round(2)               #round returns a copy, so reassign
    new_data = np.append(new_data, y, axis = 0)
#reshape array
new_data.shape = col_len, row_len
#New code ends here

#calculate new medians ignoring outliers
new_med = np.median(new_data, axis = n)
trans_median = np.array([new_med]).T
print 'Show row correction values:'
print "\n", trans_median

#Do ZMT correction
ZMT = orig_data - trans_median
print 'this is ZMT', ZMT