Moving a lot of data between databases

April 1, 2015, 5:22 am

Motivation

During one of our latest projects, we had to do some Data Warehousing for a client with a fairly large dataset. Everything is stored in a MySQL cluster, and given the sensitive nature of the data we were given only partial access to views based on the actual data.

We decided to implement the warehouse using PostgreSQL (we also had to build a Django site based on it, so it was the most natural solution), and initially everything went fine, but as the dataset grew, moving the data from MySQL to PostgreSQL proved challenging and time consuming.

In the past we have had to handle even larger datasets, but none coming from a database, so we had to adapt our solution to the new problem. We want to share some of the components of the solution we found with the community.

RowsTextIO

One of the fastest ways to load data into a PostgreSQL table is the COPY SQL command, which copies data from a file or a stream directly into the table. It is often used to load large amounts of data from a foreign source (usually CSV files) into the database. COPY is also available in psycopg2 (a PostgreSQL driver for Python) through the copy_from and copy_expert methods of the cursor class.

rowstextio is a read-only, unicode, character- and line-based interface to stream I/O over the result of a database query. The stream can be passed to psycopg2’s copy_from or copy_expert cursor methods to load the data into the target table.

Usage

The following session shows the typical use case for the package.

    >>> import psycopg2
    >>> import mysql.connector
    >>> source_connection = mysql.connector.connect(**source_connection_settings)
    >>> target_connection = psycopg2.connect(**target_connection_settings)
    >>> from rowstextio import RowsTextIO
    >>> source_cursor = source_connection.cursor()
    >>> target_cursor = target_connection.cursor()
    >>> f = RowsTextIO(source_cursor, 'SELECT * FROM source_table WHERE id <> %(id)s', {'id': 42})
    >>> target_cursor.copy_expert('COPY target_table FROM STDIN CSV', f)
    >>> target_connection.commit()
    >>> target_cursor.close()
    >>> source_cursor.close()

Assuming that the target table schema is compatible with the rows resulting from the query, the data should be loaded by now.

How does it work?

It works by requesting a fixed number of rows from the source table to populate a buffer, and then reading from that buffer at the client’s request:

    def read(self, n=None):
        read_buffer = StringIO()

        if n is not None:
            # Read a fixed amount of bytes
            left_to_read = n

            while left_to_read > 0:
                # Keep reading from the buffer
                read_from_buffer = self._buffer.read(left_to_read)

                if len(read_from_buffer) > 0:
                    read_buffer.write(unicode(read_from_buffer))
                elif len(read_from_buffer) < left_to_read:
                    # We read less than the remaining bytes. Fetch more rows.
                    self._fetch_rows()
                    self._write_rows_onto_buffer()

                    if len(self._buffer.getvalue()) == 0L:
                        # There are no more rows, break the loop
                        break

                left_to_read -= len(read_from_buffer)
        else:
            # Read all the rows
            while True:
                read_from_buffer = self._buffer.read()

                if len(read_from_buffer) > 0:
                    read_buffer.write(read_from_buffer)
                else:
                    # We emptied the buffer. Fetch more rows.
                    self._fetch_rows()
                    self._write_rows_onto_buffer()

                    if len(self._buffer.getvalue()) == 0L:
                        # There are no more rows, break the loop
                        break

        read_result = read_buffer.getvalue()
        read_buffer.close()

        return read_result
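For readers on Python 3, the same buffering idea can be sketched roughly as follows. This is not the rowstextio source: the class name RowChunkStream, the fetch_chunk callable and the chunked helper are made up for illustration, and chunked merely stands in for a database cursor’s fetchmany.

```python
import csv
import io
import itertools


class RowChunkStream(io.TextIOBase):
    """Read-only text stream that lazily renders chunks of rows as CSV."""

    def __init__(self, fetch_chunk):
        # fetch_chunk() returns the next list of rows, or [] when exhausted.
        self._fetch_chunk = fetch_chunk
        self._pending = ""

    def readable(self):
        return True

    def _refill(self):
        """Render one more chunk of rows into the pending text."""
        rows = self._fetch_chunk()
        if not rows:
            return False
        out = io.StringIO()
        csv.writer(out, lineterminator="\n").writerows(rows)
        self._pending += out.getvalue()
        return True

    def read(self, n=-1):
        if n is None or n < 0:
            # Read everything that is left.
            while self._refill():
                pass
            result, self._pending = self._pending, ""
            return result
        # Fetch only as many chunks as needed to serve n characters.
        while len(self._pending) < n and self._refill():
            pass
        result, self._pending = self._pending[:n], self._pending[n:]
        return result


def chunked(rows, size):
    """Hypothetical stand-in for cursor.fetchmany(size)."""
    iterator = iter(rows)
    return lambda: list(itertools.islice(iterator, size))


stream = RowChunkStream(chunked([(1, "a"), (2, "b"), (3, "c")], size=2))
first = stream.read(3)   # served from the first chunk only
rest = stream.read()     # drains the remaining rows
```

The sketch only implements read(); a consumer that also needs readline() would require more work.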

Conclusions

This solution gave us the flexibility to load huge amounts of data from complex queries and speed up our ETL process.

As usual, any comments or suggestions are welcome. We haven’t tried this with other databases, but we think it might be possible to make it work with any interface that takes a text stream as input. We’re interested to know if you managed to use it in a different environment.

How to create a monstrous gazette for relation extraction

April 28, 2015, 6:09 am

The cornerstone of the small piece of work behind these great charts, built with IEPY, was being able to catch every company mentioned in the text.

The basic idea of relation extraction is to detect things mentioned in text (so-called Mentions, or Entity-Occurrences) and later decide whether the text expresses the target relation between each pair of those things. In our case, we needed to find where companies were mentioned and then determine whether a given sentence said that Company-A was funding Company-B.

To detect those funding events we needed to be sure of capturing every mention of a company. And although the NER we used caught most of them, there are always some folks who name their company #waywire or 8th Story, names that a NER cannot easily track.

A good solution is to build a Gazetteer containing all the company names we can get. The idea behind Gazettes is that each time one of the Gazette entries is seen in a text, it’s automatically considered a mention of a given object, i.e., an Entity-Occurrence.
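To make the idea concrete, here is a minimal sketch of gazette matching over whitespace tokens. It is not IEPY’s implementation: the function name find_gazette_mentions, the max_len limit and the sample sentence are made up for illustration.

```python
def find_gazette_mentions(tokens, gazette, max_len=4):
    """Return (start, end, entry) for each longest gazette match in tokens."""
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first, so a full name wins over a prefix.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in gazette:
                match = (i, i + size, candidate)
                break
        if match:
            mentions.append(match)
            i = match[1]  # continue right after the matched entry
        else:
            i += 1
    return mentions


gazette = {"#waywire", "8th Story", "Yiftee"}
tokens = "Yesterday #waywire and 8th Story announced a deal".split()
mentions = find_gazette_mentions(tokens, gazette)
```

Every match is accepted blindly, which is exactly why the quality of the gazette entries matters so much.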

From an encyclopedic source we got more than 300K entries. Great!

The next challenge was that… well, in the text to process, a company could be mentioned in a different way than the official name stated in the encyclopedic source. For instance, it would be more natural to find mentions of “Yiftee” than “Yiftee Inc.”

So, after incorporating a basic scheme for alternative names (i.e., substrings of the original long name), the number of entries grew to 600K.
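One possible scheme for alternative names, sketched under the assumption that the substrings are word prefixes of the official name (the post does not say exactly which substrings were used, and alternative_names is a hypothetical helper):

```python
def alternative_names(full_name):
    """Return the proper word prefixes of a company name, shortest first."""
    tokens = full_name.split()
    return [" ".join(tokens[:size]) for size in range(1, len(tokens))]


yiftee = alternative_names("Yiftee Inc.")
hope = alternative_names("Hope Street Media")
```

A three-word name yields two extra entries, which is consistent with the gazette roughly doubling in size.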

After that, when we felt confident about our gazette and wanted to start processing text, we faced several issues:

we weren’t able to handle a gazette of that size at a reasonable speed
we were getting tons of poor-quality Entity-Occurrences (i.e., most of the time a human reader would say they were wrongly labeled as Entity-Occurrences)
tons of poor-quality Entity-Occurrences implied quadratic tons of potential funding evidence to check (roughly one candidate per pair of Entity-Occurrences in the same sentence)
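The quadratic growth in the last item is easy to see with a small sketch. It counts unordered pairs; since funding has a direction, counting ordered pairs would double the numbers. The helper name candidate_pairs is made up for illustration.

```python
from itertools import combinations


def candidate_pairs(occurrences):
    """All unordered pairs of Entity-Occurrences found in one sentence."""
    return list(combinations(occurrences, 2))


# n occurrences in a sentence give n * (n - 1) / 2 candidate pairs.
pairs_for_3 = len(candidate_pairs(["A", "B", "C"]))
pairs_for_6 = len(candidate_pairs(["A", "B", "C", "D", "E", "F"]))
```

Doubling the occurrences in a sentence from 3 to 6 multiplies the candidates to check by five.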

So, knowing that we were trading away some recall[1], we decided to add several levels of filters. Let the pruning start!

The first step was to add a second encyclopedic source, not to augment the gazette but to add confidence, keeping only the intersection of the two sources.

Next, with precomputed word frequency counts, we filtered out all the company names that were too likely to occur as normal text (we picked a threshold and tuned it a bit before fixing it).

With that very same idea of word frequencies, we pruned the company sub-names (the substrings of the original long company names) with a higher threshold. So, for a company listed as “Hope Street Media” we didn’t end up with a dangerous entry for “Hope”, while for “Yiftee Inc.” we did keep “Yiftee” in the final list.
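A rough sketch of the frequency filters follows. Everything specific in it is an assumption: the frequencies are invented, a name is scored by its rarest word, and the "higher threshold" for sub-names is interpreted as a stricter (lower) maximum frequency.

```python
def name_frequency(name, word_freq):
    """Score a name by its rarest word; unseen words count as frequency 0."""
    return min(word_freq.get(word.lower(), 0.0) for word in name.split())


def prune(names, word_freq, max_freq):
    """Keep only the names that are unlikely to appear as ordinary text."""
    return {name for name in names
            if name_frequency(name, word_freq) <= max_freq}


# Made-up relative word frequencies.
word_freq = {"hope": 2e-4, "street": 5e-5, "media": 1.5e-4}

full_names = {"Hope Street Media", "Yiftee Inc."}
sub_names = {"Hope", "Hope Street", "Yiftee"}

# Sub-names are riskier, so they get the stricter limit.
kept = (prune(full_names, word_freq, max_freq=1e-3)
        | prune(sub_names, word_freq, max_freq=1e-4))
```

With these numbers “Hope” is dropped for being too common, while “Yiftee” survives because it never shows up as an ordinary word.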

With all that done, we reduced the list to about 100K entries, which still captured a really good portion of the names while greatly reducing the issues mentioned above.

The last step was to pick a sample of documents, preprocess them, and simply hand-check the most frequently found Gazette items, creating a blacklist for the cases where it was obvious that most occurrences were not mentions of the company but just natural usage of those words in the language.

We finished very satisfied with the results and with the lessons learnt. Hopefully some of the tips above can help you.

[1] recall: (also known as sensitivity) is the fraction of relevant instances that are retrieved. http://en.wikipedia.org/wiki/Precision_and_recall

Python for geospatial data processing

March 15, 2016, 5:13 am

Python is undoubtedly one of the most popular general-purpose programming languages today. There are many strong reasons for this, but in my opinion the most important ones are: an open source license, the simplicity of its syntax, the batteries-included philosophy and an awesome, global community.

One interesting example of an area where Python is being adopted massively is the scientific world. That explains the existence of communities like PyData or the SciPy ecosystem.

Another, more specific area where I see increasing interest in and use of Python is geospatial data processing. Proof of this is the many well-known tools using it, such as GDAL, ArcGIS, GRASS, QGIS and more.

The objective of this post is to show the advantages and power of the Python ecosystem in this particular field. I’m going to do this through an example of a complex task that is typical in this field: satellite image classification.

Satellite Images Classification In Python

Satellite images are classified for countless reasons. Classification has uses in agriculture, geology, emergency monitoring, surveillance, weather forecasting, economic studies, social sciences and more.

For this reason, many GIS and geospatial data management systems include tools to perform classifications. But this approach has some limitations: the process is manual and you usually have a very small set of options regarding the classification technique and other hyperparameters.

Another possibility is to classify using implementations in domain-specific languages (“DSLs”) such as R, IDL, MATLAB, Octave, etc. But in this case you are usually limited to an experimental context.

Python, thanks to its scripting features and rich ecosystem, provides most of the benefits of those DSLs, so it’s great for doing research and quick prototyping. Moreover, being a widely adopted general-purpose language, it is also useful for developing production-ready, efficient, maintainable, scalable, industrial-scale classification systems.

So let’s see how easy it is to perform an image classification using the tools available in the Python ecosystem for geospatial data processing. In a hundred lines of code we are going to develop a script to:

process a Landsat 8 GeoTiff image (raster data),
extract training and testing data from Shapefiles (vector data),
train and classify using a modern Machine Learning technique,
assess the results.

To keep it simple and stay focused on the classification itself, I’m not going to delve into the depths of data pre-processing (calibration, geographic transformations, etc.). I know that it is a basic part of the job, but it’s not what I want to expand on now.

As usual, most of the magic will be done by the tools we are going to be using as main dependencies:

GDAL

GDAL is a translator library for raster and vector geospatial data formats. It presents a single raster abstract data model and vector abstract data model for all supported formats.

It is implemented in C/C++, so it is highly performant, and it provides Python bindings.

Installing this library may not be a trivial task, especially for those who are not very familiar with the process of installing Python dependencies. In any case, the GDAL site has detailed instructions, which are summarized in a README file in the code repository related to this post.

scikit-learn

scikit-learn is an open source machine learning library for Python. It features various classification, regression and clustering algorithms. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

It is largely written in Python, with some core algorithms written in Cython (using C/C++) to achieve performance.

Example data

Thanks to the GDAL API, the program that we are going to develop throughout this post works with many different kinds of image formats and geographically corresponding vectors. But in case you don’t have a dataset at hand, you can download some sample data.

It includes part of a Landsat 8 image of an agricultural area and some synthetic vector data with samples of different crops. It is the kind of data used in precision farming products (for example, crop identification).

http://www.machinalis.com/static/media/uploads/input-raster.png

The file is a compressed data directory with three sub-directories:

image includes a GeoTiff file with the crop of an L1T Landsat scene (LANDSAT 8, sensor OLI, path 229, row 81, 2016-01-19 19:14:02 UTC). Only bands 1 to 7 are present (Aerosol, VIS, NIR, SWIR; 30 m resolution).
train contains shapefiles with vector data to be used for training. Each file defines a class; that is, all the points, polygons, etc. within a shapefile are used as samples of one class. In our sample data we have 5 classes, named A, B, C, D and E (not too fancy, I know).
test is similar to the train directory, but these samples are going to be used to verify the classification results.

Since this post will not focus on the results of the classification, I’m not going to go into any details about data quality, requirements or preparation. In case you want to know or discuss anything in this matter, please use the comments or send me an email.

Example program

Next, we will develop a script to classify geospatial data. A more pythonic and complete version of the program can be found in the repo. That version includes logging, docstrings, comments, pep-8, some error control and other good programming practices that we are not going to take into consideration in this post.

In the code repository you will also find the simpler version of the script described in this post. Before you download it or copy-paste these lines into a file or a Python interpreter, make sure that you install the following dependencies:

GDAL==2.0.1
numpy>=1.10,<1.11
scipy==0.17.0
scikit-learn==0.17
# Optionally, you can install matplotlib

Preliminaries

Now we are ready to code. So, first things first: we import our main dependencies and define a list of colors:

    import numpy as np
    import os

    from osgeo import gdal
    from sklearn import metrics
    from sklearn.ensemble import RandomForestClassifier

    # A list of "random" colors (for a nicer output)
    COLORS = ["#000000", "#FFFF00", "#1CE6FF", "#FF34FF", "#FF4A46", "#008941"]

From the last import line you can see that we are going to classify using the Random Forest technique. Thanks to Scikit-learn, we could easily experiment with or compare many different classification techniques such as stochastic gradient descent, support vector machines, nearest neighbors, AdaBoost, etc.

The list of colors will be embedded in the GeoTiff output. This will allow you to easily visualize it in any standard image viewer program.

Next, let’s define some useful functions that we are going to be using later. They are making heavy use of the GDAL api to manipulate raster and vector data (the code is pretty self explanatory).

    def create_mask_from_vector(vector_data_path, cols, rows, geo_transform,
                                projection, target_value=1):
        """Rasterize the given vector (wrapper for gdal.RasterizeLayer)."""
        data_source = gdal.OpenEx(vector_data_path, gdal.OF_VECTOR)
        layer = data_source.GetLayer(0)
        driver = gdal.GetDriverByName('MEM')  # In memory dataset
        target_ds = driver.Create('', cols, rows, 1, gdal.GDT_UInt16)
        target_ds.SetGeoTransform(geo_transform)
        target_ds.SetProjection(projection)
        gdal.RasterizeLayer(target_ds, [1], layer, burn_values=[target_value])
        return target_ds


    def vectors_to_raster(file_paths, rows, cols, geo_transform, projection):
        """Rasterize the vectors in the given directory in a single image."""
        labeled_pixels = np.zeros((rows, cols))
        for i, path in enumerate(file_paths):
            label = i + 1
            ds = create_mask_from_vector(path, cols, rows, geo_transform,
                                         projection, target_value=label)
            band = ds.GetRasterBand(1)
            labeled_pixels += band.ReadAsArray()
            ds = None
        return labeled_pixels


    def write_geotiff(fname, data, geo_transform, projection):
        """Create a GeoTIFF file with the given data."""
        driver = gdal.GetDriverByName('GTiff')
        rows, cols = data.shape
        dataset = driver.Create(fname, cols, rows, 1, gdal.GDT_Byte)
        dataset.SetGeoTransform(geo_transform)
        dataset.SetProjection(projection)
        band = dataset.GetRasterBand(1)
        band.WriteArray(data)
        dataset = None  # Close the file

Now we have all that we need to perform the actual classification. Let’s create some variables to define our input and output:

    raster_data_path = "data/image/2298119ene2016recorteTT.tif"
    output_fname = "classification.tiff"
    train_data_path = "data/train/"
    validation_data_path = "data/test/"

In the lines above, I assume we are using the sample data described before.

Training

Now, we will use the GDAL api to read the input GeoTiff: extract the geographic information and transform the band’s data into a numpy array:

    raster_dataset = gdal.Open(raster_data_path, gdal.GA_ReadOnly)
    geo_transform = raster_dataset.GetGeoTransform()
    proj = raster_dataset.GetProjectionRef()
    bands_data = []
    for b in range(1, raster_dataset.RasterCount + 1):
        band = raster_dataset.GetRasterBand(b)
        bands_data.append(band.ReadAsArray())

    bands_data = np.dstack(bands_data)
    rows, cols, n_bands = bands_data.shape

Next, we’ll process the training data: project all the vector data, in the training dataset, into a numpy array. Each class is assigned a label (a number between 1 and the total number of classes). If the value v in the position (i, j) of this new array is not zero, that means that the pixel (i, j) must be used as a training sample of class v.

    files = [f for f in os.listdir(train_data_path) if f.endswith('.shp')]
    classes = [f.split('.')[0] for f in files]
    shapefiles = [os.path.join(train_data_path, f) for f in files]

    labeled_pixels = vectors_to_raster(shapefiles, rows, cols, geo_transform, proj)
    is_train = np.nonzero(labeled_pixels)
    training_labels = labeled_pixels[is_train]
    training_samples = bands_data[is_train]

training_samples is the list of pixels to be used for training. In our case, a pixel is a point in the 7-dimensional space of the bands.

training_labels is a list of class labels such that the i-th position indicates the class for i-th pixel in training_samples.
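To see what the indexing with np.nonzero does, here is a toy version with a 2x3 image of 2 bands instead of the real 7-band raster. The array values are made up; only the variable names follow the script.

```python
import numpy as np

# Toy image: 2 rows x 3 columns x 2 bands (the sample raster has 7 bands).
bands_data = np.arange(12).reshape((2, 3, 2))

# 0 means "not a training pixel"; 1 and 2 are class labels.
labeled_pixels = np.array([[0, 1, 0],
                           [2, 0, 1]])

is_train = np.nonzero(labeled_pixels)        # (row indices, column indices)
training_labels = labeled_pixels[is_train]   # one label per labeled pixel
training_samples = bands_data[is_train]      # one row of band values per labeled pixel
```

Both results are aligned: the i-th row of training_samples holds the band values of the pixel whose class is training_labels[i].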

So now, we know what pixels of the input image must be used for training. Next, we instantiate a RandomForestClassifier from Scikit-learn.

    classifier = RandomForestClassifier(n_jobs=-1)
    classifier.fit(training_samples, training_labels)

There are many parameters that we can play around with here. I encourage you to read the related documentation and try different possibilities.

Normally, the fine-tuning of these parameters depends on the data, the specific domain of study, memory and processing resources, expected accuracy, etc.

To stay focused and for the sake of simplicity, I’m not going to expand on this issue. As you can see, I’m only passing an option to use all the cores in my computer.

Classifying

And voilà! Believe it or not, that was the hard part. Now we have a trained model, able to classify (predict) the class of any pixel data we have. So let’s do that.

    n_samples = rows * cols
    flat_pixels = bands_data.reshape((n_samples, n_bands))
    result = classifier.predict(flat_pixels)
    classification = result.reshape((rows, cols))

We used the trained object to classify the whole input image. Our classifier knows how to classify pixels, and its predict function expects a list of pixels, not an NxM matrix. Because of that, we reshaped the bands data before the classification and the result after it (so that the output looks like an image and not just a flat list of labels).
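The two reshapes can be checked on a toy array. In this sketch fake_prediction is a made-up stand-in for classifier.predict, so that the example runs without a trained model.

```python
import numpy as np

rows, cols, n_bands = 2, 3, 2
bands_data = np.arange(rows * cols * n_bands).reshape((rows, cols, n_bands))

# Image layout -> list of pixels (one row per pixel, one column per band).
flat_pixels = bands_data.reshape((rows * cols, n_bands))

# Stand-in for classifier.predict(flat_pixels): one label per pixel.
fake_prediction = flat_pixels[:, 0] // 4

# List of labels -> image layout.
classification = fake_prediction.reshape((rows, cols))

# Pixel (i, j) of the image is row i * cols + j of the flat list.
same_pixel = flat_pixels[1 * cols + 2]
```

Because both reshapes use the same row-major order, each predicted label lands back on the pixel it was computed from.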

At this point, if you have matplotlib installed you can visualize the results:

    from matplotlib import pyplot as plt

    f = plt.figure()
    r = bands_data[:, :, 3]
    g = bands_data[:, :, 2]
    b = bands_data[:, :, 1]
    rgb = np.dstack([r, g, b])
    f.add_subplot(1, 2, 1)
    plt.imshow(rgb / 255)
    f.add_subplot(1, 2, 2)
    plt.imshow(classification)
    plt.show()

http://www.machinalis.com/static/media/uploads/input-output.jpg

But more important than merely looking at our astounding, brand new classification is to save it to disk and assess its accuracy. So let’s do that.

For the first task, we already created an auxiliary function: write_geotiff (which you had already forgotten about, lost in admiration of the colourful image). We could just save the pixel data using matplotlib.pyplot.imsave, but then we would lose all the valuable geographic information (and other metadata) included in the GeoTiff format. Such information is essential for GIS and other satellite data processing systems. So we’ll use our GDAL-powered function:

write_geotiff(output_fname,classification,geo_transform,proj)

That was simple. Now you should be able to open the new file we just created with any image viewer, GIS or remote sensing data processing system.

Assess the results

Finally, closer to the end, before we can verify our classification’s accuracy, we need to pre-process our testing dataset in a fashion similar to what we did with the training data:

    shapefiles = [os.path.join(validation_data_path, "%s.shp" % c)
                  for c in classes]
    verification_pixels = vectors_to_raster(shapefiles, rows, cols,
                                            geo_transform, proj)
    for_verification = np.nonzero(verification_pixels)
    verification_labels = verification_pixels[for_verification]
    predicted_labels = classification[for_verification]

There we have the expected labels for the verification pixels and the computed labels, so we can analyze the results. For that, our beloved scikit-learn provides many tools. Let’s use a few of them:

    print("Confusion matrix:\n%s" %
          metrics.confusion_matrix(verification_labels, predicted_labels))

That should print something like this:

Confusion matrix:
[[ 82   0   6   0   0]
 [  0 180   0   0   0]
 [  0   0  65   0   0]
 [  0   0   2  89   0]
 [  0   0   0   0 160]]
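The summary figures can be recomputed by hand from this matrix. In the sketch below, rows are the true classes and columns the predicted ones, which is the convention of scikit-learn’s confusion_matrix; the variable names are only illustrative.

```python
confusion = [
    [82, 0, 6, 0, 0],
    [0, 180, 0, 0, 0],
    [0, 0, 65, 0, 0],
    [0, 0, 2, 89, 0],
    [0, 0, 0, 0, 160],
]

n = len(confusion)
total = sum(sum(row) for row in confusion)            # all verification pixels
correct = sum(confusion[i][i] for i in range(n))      # the diagonal
accuracy = correct / total

# Recall: correct predictions over all pixels that truly belong to the class.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n)]

# Precision: correct predictions over all pixels predicted as the class.
precision = [confusion[i][i] / sum(row[i] for row in confusion)
             for i in range(n)]
```

So 576 of the 584 verification pixels are on the diagonal, and the eight misses are all pixels wrongly assigned to the third class, which lowers that class’s precision and the recall of the first and fourth classes.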

Next, for precision and accuracy:

    target_names = ['Class %s' % s for s in classes]
    print("Classification report:\n%s" %
          metrics.classification_report(verification_labels, predicted_labels,
                                        target_names=target_names))
    print("Classification accuracy: %f" %
          metrics.accuracy_score(verification_labels, predicted_labels))

Should print something like this:

Classification report:
             precision    recall  f1-score   support

    Class C       1.00      0.93      0.96        88
    Class D       1.00      1.00      1.00       180
    Class B       0.89      1.00      0.94        65
    Class A       1.00      0.98      0.99        91
    Class E       1.00      1.00      1.00       160

avg / total       0.99      0.99      0.99       584

Classification accuracy: 0.986301

Conclusions

In this post we developed a script that processes raster and vector data, performs a supervised classification using a sophisticated machine learning technique, visualizes the output and assesses the results.

All of this in 100 lines of a general-purpose, widely adopted language, using highly efficient tools.

At least for me, that proves the benefits and power of the Python ecosystem for geospatial data processing.

Unfortunately, because of NDAs, I cannot share more specific (and very interesting) real-life examples. Hopefully in the future I’ll be allowed to do so, in order to expand on the advantages and some cool Python tricks useful in this field.

Python for Object Based Image Analysis (OBIA)

March 30, 2016, 5:48 am

Intro

A few weeks ago I posted about the benefits and power of the Python ecosystem for geospatial data processing. As an example, we analyzed a typical task in this field (satellite image classification) with a straightforward machine-learning approach.

In this case, we will go further than that.

The projects we work on daily at Machinalis, applying machine-learning methods and algorithms, require more than straightforward, out-of-the-box approaches. But even with a higher order of sophistication and complexity, Python is still a perfect choice.

Obviously, any production-ready, machine-learning based image processing system requires a whole lot of work on experimentation, hyperparameter tuning, optimization and the like. We are not going to cover those issues in this post.

Still, we will expand somewhat on the aforementioned post. We are going to perform a new classification, but this time based on a different paradigm: Geographic Object-Based Image Analysis.

OBIA: Object-Based Image Analysis

OBIA is a relatively new and evolving paradigm that builds on image analysis concepts that have been used for decades: segmentation, edge-detection, feature extraction and classification.

Basically, an OBIA approach for image classification would be:

Perform an image segmentation
Classify the segments (instead of the pixels)

There already are many studies demonstrating that this technique can yield better results than more traditional, pixel-based, classification methods (see [1]).

Although the most obvious advantages of the OBIA approach show with high-resolution imagery, there is solid evidence from studies where this approach has been used with images of medium or coarse resolution. For example, this paper [2] evaluates OBIA classification of crops with images of 30 m and 15 m pixel resolution (somewhat similar to the Landsat data).

But discussing this concept or paradigm any further is outside the scope of this post. There’s a lot of very interesting and specific literature about it (for example [3]).

Having said all this, let’s focus on the main objective of this post.

Python implementation of an object-based image classification

Code and data

First of all, the code related to this post can be found here: https://github.com/machinalis/satimg

The code shown in this post has been simplified. In the repo you’ll see a more complete and complex version.

For the Python enthusiasts, there’s a Jupyter Notebook available in the repo.

If you need sample data to play with, you can use the same as in the other blog post and you can download it from here.

Implementation

As described before, the strategy that we are going to follow has two stages:

Segmentation: First, an image segmentation is performed to cluster contiguous areas of similar pixels. Each resulting segment is modeled aggregating the information of its pixels. So here we see a change in scale: from pixel level we move to a segment, cluster or object level.
Classification: Once the training dataset is defined (a subset of segments) a classifier is trained and used to classify all the segments.

The technological stack is similar to the one presented in the last post, so we are not going to expand on that, although we will mention some new requirements.

Also, not all the code will be presented here. Only some specific parts of it. As mentioned before, the code can be found here: https://github.com/machinalis/satimg

In particular, we will skip describing the first part of the code, including reading the input data, since it is similar to what we did in the last post.

Segmentation

Segmentation of digital images using computers started over 50 years ago. By now, there are several techniques or methods that can be used (thresholding, clustering, edge detection, graph partitioning, etc.). What’s more, there are many efficient implementations in many programming languages.

In Python, many well-known packages can be used for this: scikit-learn, OpenCV, scikit-image.

As usual, the selection of the algorithm together with the fine tuning of its parameters is a complex task. So many variables must be taken into account that it goes far beyond the scope of this post. Right now, we are just taking a general, high-level look at the possibilities and features provided by the tools of the Python ecosystem.

In the course of a real project, the task of deciding the segmentation algorithm to use and its final configuration would take a considerable amount of work. Furthermore, thinking in a long term, automated system, we would probably need to develop support infrastructure to permanently monitor, optimize and update the configuration parameters.

So, for the sake of simplicity, we are just going to make a pseudo-arbitrary choice.

scikit-image

The scikit-image project includes the skimage.segmentation module, which provides several useful algorithms: active contour, Felzenszwalb, quickshift, random walker, SLIC.

Most of them support multi-band data, which they call multichannel 2D images. So, for example, to use the Quickshift algorithm, we just do:

    from skimage import exposure
    from skimage.segmentation import quickshift

    img = exposure.rescale_intensity(bands_data)
    clusters = quickshift(img, kernel_size=7, max_dist=3, ratio=0.35,
                          convert2lab=False)
    segment_ids = np.unique(clusters)

A couple of details about scikit-image:

The segmentation algorithms expect the image data to have values between 0.0 and 1.0. That’s why we use the rescale_intensity function in the exposure module.
Typically, multichannel images are assumed to be RGB. To avoid problems with that, the convert2lab=False parameter is used.
In line with scikit-learn, scikit-image works with NumPy arrays, so the obtained segments can be easily processed later without needing to learn anything new.

At this point, we have a segmented image: each pixel value is an integer number with a segment label.

This figure shows the input image next to the results of different segmentation algorithms: quickshift and Felzenszwalb

The code in the repo is actually using the Felzenszwalb segmentation results.

Classification

Next we have to perform a supervised classification. Before we can do that, we have to do the following:

Feature extraction: choose a model to represent our segments, that is, a set of features that describes a segment’s homogeneity and, at the same time, distinguishes different segments.
Training dataset definition: determine which segments must be used to train the classifier.

Feature extraction

In order to classify segments instead of pixels we need to provide a segment representation.

Right now, a segment is defined by a set of pixels. A pixel is modeled by the 7-dimensional vector of radiometric data (one dimension per band).

Again, there’s a lot of work to do here and we are just going to use a very simple statistical model: with all the pixels of a given segment, we compute certain statistics for each band and use that as features:

    import scipy.stats


    def segment_features(segment_pixels):
        """For each band, compute: min, max, mean, variance, skewness, kurtosis."""
        features = []
        _, n_bands = segment_pixels.shape
        for b in range(n_bands):
            stats = scipy.stats.describe(segment_pixels[:, b])
            band_stats = list(stats.minmax) + list(stats)[2:]
            features += band_stats
        return features

Now a segment is defined by a vector of constant length 42 (7 bands x 6 statistics). So the representation of the segment is independent of its size. The gain is potentially very big: a segment with only 100 pixels used to be represented by 700 values (100 pixels x 7 bands).
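The size-independence of the representation can be seen in a reduced sketch that uses only the standard library and four statistics per band instead of six (skewness and kurtosis are left out). The function simple_segment_features and the toy pixels are made up for illustration.

```python
import statistics


def simple_segment_features(segment_pixels):
    """segment_pixels: list of pixels, each a sequence with one value per band."""
    features = []
    for band_values in zip(*segment_pixels):      # iterate band by band
        features += [
            min(band_values),
            max(band_values),
            statistics.mean(band_values),
            statistics.variance(band_values),     # sample variance, needs 2+ pixels
        ]
    return features


small_segment = [(1, 10), (3, 30)]                      # 2 pixels, 2 bands
large_segment = [(1, 10), (2, 20), (3, 30), (4, 40)]    # 4 pixels, 2 bands

small_features = simple_segment_features(small_segment)
large_features = simple_segment_features(large_segment)
```

Both segments end up as vectors of 8 values (2 bands x 4 statistics), no matter how many pixels they contain.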

The next step is to transform our input data, from a pixel representation (the full image) to objects (or segments).

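    # `segments` is the label image produced by the segmentation step
    # (it plays the role of `clusters` in the quickshift example).
    objects = []
    objects_ids = []
    for segment_label in segment_ids:
        segment_pixels = img[segments == segment_label]
        segment_model = segment_features(segment_pixels)
        objects.append(segment_model)
        # Keep a reference to the segment label
        objects_ids.append(segment_label)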

In the objects variable we have each segment represented by its feature vector, and no longer by its pixels (the parallel list objects_ids keeps track of each segment’s number or id).

Training dataset definition

Now that we have a valid, concise representation of the segments, we need to define which of them are going to be used for training.

From the sample data we get labeled pixels. Then we have to match those pixels with the corresponding segments. There could be ambiguities to solve: samples of different classes could end up in the same segment (after all, the segmentation was unsupervised).
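One simple policy for those ambiguities is to discard every segment that received samples of more than one class. The sketch below shows that policy on plain nested lists; it is not the notebook’s code, and the function name segments_per_class and the toy grids are assumptions.

```python
from collections import defaultdict


def segments_per_class(segments, labeled_pixels):
    """Map each class to the set of segments that contain only its samples.

    Both arguments are 2D grids (lists of rows) of the same shape; label 0
    means "no sample here". Segments hit by several classes are discarded.
    """
    classes_in_segment = defaultdict(set)
    for seg_row, label_row in zip(segments, labeled_pixels):
        for segment_id, label in zip(seg_row, label_row):
            if label != 0:
                classes_in_segment[segment_id].add(label)

    result = defaultdict(set)
    for segment_id, labels in classes_in_segment.items():
        if len(labels) == 1:  # unambiguous segment
            result[next(iter(labels))].add(segment_id)
    return dict(result)


segments = [[0, 0, 1],
            [2, 2, 1]]
labeled_pixels = [[1, 0, 1],
                  [2, 0, 2]]
training_segments = segments_per_class(segments, labeled_pixels)
```

Segment 1 holds samples of classes 1 and 2, so it is left out; other policies (for example a majority vote) are equally possible.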

This sort of problem must be addressed, and the solution depends on the available data. The related code in the Jupyter Notebook does all that (including the verification that there are no conflicts in the test data).

    # segments_per_klass maps each class to the ids of its training segments
    # (it is built in the Jupyter Notebook).
    training_labels = []
    training_objects = []
    for klass in classes:
        class_train_objects = [v for i, v in enumerate(objects)
                               if objects_ids[i] in segments_per_klass[klass]]
        training_labels += [klass] * len(class_train_objects)
        print("Training samples for class %i: %i" %
              (klass, len(class_train_objects)))
        training_objects += class_train_objects

This figure shows the resulting training segments, obtained from individual sample pixels

Classifier setup

Once we have our training data ready, we can classify all the objects. This step is very simple and similar to what was done in the previous post:

classifier = RandomForestClassifier(n_jobs=-1)
classifier.fit(training_objects, training_labels)
predicted = classifier.predict(objects)

Now that we have predicted a class for each segment, it’s time to go back to the pixel dimension. To do this we start from the segmented image (where each pixel has a segment identifier) and match it with the predicted data, which assigns a class label to each segment:

classification = np.copy(segments)
for segment_id, klass in zip(objects_ids, predicted):
    # Mask on segments, not classification: labels may collide with segment ids
    classification[segments == segment_id] = klass

And that’s it: we have a pixel-by-pixel classification.
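The same mapping can be sketched without NumPy, assuming the segmented image is a nested list and the predictions are paired with the segment ids as above. Looking each pixel up in a dictionary built from the original segment ids means that a class label which happens to equal a segment id can never be overwritten by a later assignment. The helper name and the toy values are illustrative only.

```python
def segments_to_classes(segments, objects_ids, predicted):
    """Replace every segment id in a 2-D grid by its predicted class."""
    klass_of = dict(zip(objects_ids, predicted))
    return [[klass_of[segment_id] for segment_id in row] for row in segments]


segments = [
    [1, 1, 2],
    [3, 3, 2],
]
objects_ids = [1, 2, 3]
predicted = [2, 3, 1]  # class labels deliberately overlap with segment ids

classification = segments_to_classes(segments, objects_ids, predicted)
```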

This figure shows the resulting classification

To assess the results we could do part of what we did in the previous post but there’s a problem that must be dealt with: the verification data is given as pixel samples, just like our training samples. But then the segmentation process converted the training pixels into training objects (segments), thus growing the training regions. Now, using the validation data without proper care we risk validating with data that was actually used for training. That’s a common methodological error.
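One possible safeguard, sketched here under the assumption that validation samples are (row, col, klass) tuples and that we know which segment ids were used for training, is to discard every validation pixel that falls inside a training segment. The function name and data layout are hypothetical.

```python
def independent_validation_samples(segments, validation_samples, training_segment_ids):
    """Keep only the validation pixels that lie outside every training segment."""
    return [
        (row, col, klass)
        for row, col, klass in validation_samples
        if segments[row][col] not in training_segment_ids
    ]


segments = [
    [1, 1, 2],
    [1, 3, 2],
    [3, 3, 4],
]
training_segment_ids = {1, 2}
validation_samples = [(0, 1, 10), (1, 1, 20), (2, 2, 20)]
usable = independent_validation_samples(segments, validation_samples, training_segment_ids)
```

The first sample sits in segment 1, which was grown from training pixels, so only the other two remain available for validation.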

Closing up

The actual results of the classification are not relevant. They depend on a lot of factors, some of which have been mentioned briefly but none of them have been seriously developed in this post.

What we’ve seen here serves as an overview of a solid machine-learning technique for image processing that is complex, highly sophisticated and not yet widely adopted.

But what’s really important to me is to show the power of the Python ecosystem for this kind of task.

Please, feel free to comment, contact us or whatever. I’m @py_litox on Twitter.

References

[1]	Blaschke T., 2009, Object based image analysis for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 65 (2010), pp. 2–16

[2]	Peña et al., 2013, Object-Based Image Classification of Summer Crops with Machine Learning Methods. Remote Sens. 2014, 6, 5019-5041.

[3]	Blaschke et al., 2013, Geographic Object-Based Image Analysis – Towards a new paradigm. ISPRS Journal of Photogrammetry and Remote Sensing, 87 (2014), pp. 180–191

↧

Searching for aliens

August 24, 2016, 7:19 am

First contact

Have you ever seen through a plane’s window, or in Google Maps, some precisely defined circles on the Earth? Typically many of them, close to each other? Something like this:

Do you know what they are? If you are thinking of irrigation circles, you are wrong. Do not believe the lies of the conspirators. Those are, undoubtedly, proof of extraterrestrial visitors on Earth.

As I want to be ready for the first contact I need to know where these guys are working. It should be easy with so many satellite images at hand.

So I asked the machine learning experts around here to lend me a hand. Surprisingly, they refused. Mumbling I don’t know what about irrigation circles. Very suspicious. But something else they mentioned is that a better initial approach would be to use some computer-vision detection technique.

So, there you go. Those damn conspirators gave me the key.

Circles detection

So now, in the Python ecosystem computer vision means OpenCV. And as it happens, this library has got the HoughCircles function, which finds circles in an image. Not surprising: OpenCV has a bazillion useful functions like that.

Let’s make it happen.

First of all, I’m going to use Landsat 8 data. I’ll choose scene 229/82 for two reasons:

I know it includes circles, and
it includes my house (I want to meet the extraterrestrials living close by, not those in Area 51)

Crop of the Landsat scene 229/82

The first issue I have to solve is that the HoughCircles function “finds circles in a grayscale image using a modification of the Hough transform”.

Well, grayscale does not exactly match multi-band Landsat 8 data, but each one of the bands can be treated as a single grayscale image. Now, a circle can express itself differently in different bands, because each band has its own way to sense the earth. So, the detector can define slightly different center coordinates for the same circle. For that reason, if two centers are too close then I’m going to keep only one of them (and discard the other as repeated).
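The de-duplication idea can be sketched in a few lines. This is only an illustration: it assumes the detections of every band have been gathered as (x, y, radius) tuples in pixel units, the 10 pixel threshold is a made-up value, and the greedy first-come-first-kept policy is my own simplification.

```python
import math


def merge_close_circles(circles, min_distance):
    """Keep one circle out of every group whose centres are closer than min_distance.

    circles: iterable of (x, y, radius) tuples, e.g. the detections of all bands.
    """
    kept = []
    for x, y, radius in circles:
        is_repeated = any(
            math.hypot(x - kx, y - ky) < min_distance for kx, ky, _ in kept
        )
        if not is_repeated:
            kept.append((x, y, radius))
    return kept


band_4 = [(100, 100, 20), (300, 150, 15)]
band_5 = [(102, 99, 21), (500, 400, 18)]
unique = merge_close_circles(band_4 + band_5, min_distance=10)
```

Here the circle at (102, 99) is dropped because its centre is about two pixels away from the one already kept at (100, 100).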

Next, I need to determine the minimum and maximum radius of the circles. Typically, the size of those circles varies from 400 m up to 800 m. That is between 13 and 26 Landsat pixels (30 m each). That’s a starting point. For the rest of the parameters I’ll just play around and try different values (not very scientific, I’m sorry).
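The pixel arithmetic is simple enough to write down. This is only a sketch: metres_to_pixels is a made-up name and rounding down is an assumption. The two results are the kind of values one would hand to HoughCircles as its minRadius and maxRadius parameters.

```python
LANDSAT_PIXEL_SIZE_M = 30  # spatial resolution quoted in the text


def metres_to_pixels(length_m, pixel_size_m=LANDSAT_PIXEL_SIZE_M):
    """Convert a ground distance to a whole number of pixels (rounded down)."""
    return int(length_m // pixel_size_m)


min_radius = metres_to_pixels(400)
max_radius = metres_to_pixels(800)
print(min_radius, max_radius)  # prints "13 26"
```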

So I run my script (which you can see in this Jupyter notebook) and without too much effort I can see that the circles are detected:

Crop of the Landsat 8 scene 229/82, with detected circles

Crop of the Landsat 8 scene 229/82, with the detected circles (the colors of the circles correspond to the size).

By changing the parameters I get to detect more circles (with more false positives) or fewer (missing some real ones). As usual, there’s a trade-off there.

Filter-out false positives

These circles only make sense in farming areas. If I configure the program not to miss real circles, then I get a lot of false positives. There are too many detected circles in cities, clouds, mountains, around rivers, etc.

Crop of the Landsat 8 scene 229/82, with detected circles and labels

That’s a whole new problem that I will need to solve. I can use vegetation indices, texture computation, machine learning. There’s a whole battery of possibilities to explore. Intuition, experience, domain knowledge, good data-science practices and ufology will help me out here. Unfortunately, all that is out of the scope of this post.
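Just to give a taste of the vegetation-index route, here is a rough sketch, not something that was actually run on the scene. It assumes the red and near-infrared bands (bands 4 and 5 in Landsat 8) are available as nested lists of reflectance values, and it keeps a circle only if the mean NDVI inside it reaches a threshold. The 0.4 threshold and all function names are hypothetical.

```python
def ndvi(red, nir):
    """Normalised difference vegetation index of a single pixel."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def mean_ndvi(red_band, nir_band, cx, cy, radius):
    """Average NDVI over the pixels that fall inside the circle."""
    values = [
        ndvi(red_band[row][col], nir_band[row][col])
        for row in range(len(red_band))
        for col in range(len(red_band[0]))
        if (col - cx) ** 2 + (row - cy) ** 2 <= radius ** 2
    ]
    return sum(values) / len(values) if values else 0.0


def keep_vegetated(circles, red_band, nir_band, threshold=0.4):
    """Drop the circles whose mean NDVI is below a (hypothetical) threshold."""
    return [
        (cx, cy, radius)
        for cx, cy, radius in circles
        if mean_ndvi(red_band, nir_band, cx, cy, radius) >= threshold
    ]


# Toy 6 x 12 scene: left half vegetated, right half bare soil.
red_band = [[0.1] * 6 + [0.3] * 6 for _ in range(6)]
nir_band = [[0.5] * 6 + [0.3] * 6 for _ in range(6)]
circles = [(2, 3, 2), (9, 3, 2)]
kept = keep_vegetated(circles, red_band, nir_band)
```

Only the circle over the vegetated half survives. Real scenes would need a tuned threshold and probably more than one criterion.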

So, my search for aliens will continue.

Mental disorder disclaimer

I hope it’s clear that the whole search-for-aliens story is fictional. Just an amusing way to present the subject.

With that clarified, the technical aspects of the post are still valid.

To help our friends at Kilimo, we developed a prototype irrigation-circle detector. As hinted before, instead of approaching the problem with machine learning we attacked it using computer vision techniques.

Please, feel free to comment, contact us or whatever. I’m @py_litox on Twitter.

↧

Integrating Pandas, Django REST Framework and Bokeh

October 25, 2016, 5:16 am

It’s no secret that we love Django REST Framework. We’ve written quite a few blog posts about it and it is our default framework for projects that require a web API.

Another package that we use a lot is Pandas (and NumPy by extension). It is fast, flexible, well documented and it has a very active and friendly community. It has made our lives a lot easier when working on data-driven projects.

We regularly deal with projects that require complex Web APIs and we also work on projects that involve intensive data processing, but we only rarely combine the two. In the last few months, one of these projects came up and we were pressed for time, so we had to choose between the convenience of DRF and the performance and expressiveness of Pandas. Eventually we leaned mostly on the Django/DRF side and took a hit on performance, but this raised an interesting question: What would have been the ideal solution for a project such as ours? The answer is always the same: We would have used (and maybe adapted) available packages that fit our requirements.

When we started thinking about the problem we realized that it wasn’t such a big deal. One of the things we love the most about DRF is its extensibility. It is quite simple to insert custom behavior at every layer of the framework. With this in mind, we started working on a way to integrate Pandas with Django REST Framework.

pandas-drf-tools

pandas-drf-tools is a set of serializers, viewsets and mixins that allows you to expose a Pandas DataFrame through a web API the same way DRF does it with Django querysets.

We tried to follow DRF’s architecture whenever possible, so pandas-drf-tools offers much of the same flexibility, which means that the user can choose the level of integration. You can just use the Serializers and provide a simple read-only view to a DataFrame, or you can use a DataFrameViewSet and provide RESTful read-write access to it. Let us explore the simpler use-case first.

The package provides several Serializers that render DataFrames using methods provided by Pandas. Let’s take a look at DataFrameIndexSerializer:

class DataFrameIndexSerializer(Serializer):
    def to_internal_value(self, data):
        try:
            data_frame = pd.DataFrame.from_dict(data, orient='index').rename(index=int)
            return data_frame
        except ValueError as e:
            raise ValidationError({api_settings.NON_FIELD_ERRORS_KEY: [str(e)]})

    def to_representation(self, instance):
        instance = instance.rename(index=str)
        return instance.to_dict(orient='index')

We’re using to_dict to convert a DataFrame to a dictionary, which DRF can then turn into the appropriate format (usually JSON). We use from_dict to convert a request payload into a DataFrame. Here’s how you could use this Serializer:

class DataFrameIndexSerializerTestView(views.APIView):
    def get_serializer_class(self):
        return DataFrameIndexSerializer

    def get(self, request, *args, **kwargs):
        sample = get_some_dataframe().sample(20)
        serializer = self.get_serializer_class()(sample)
        return response.Response(serializer.data)

    def post(self, request, *args, **kwargs):
        serializer = self.get_serializer_class()(data=request.data)
        serializer.is_valid(raise_exception=True)
        data_frame = serializer.validated_data
        data = {
            'columns': list(data_frame.columns),
            'len': len(data_frame)
        }
        return response.Response(data)

This is a simple DRF APIView that provides a sample of a DataFrame upon receiving a GET request, and parses a POST request payload. It is pretty much the same thing you do when using any kind of custom Serializer. By default, only a very basic level of validation is provided, but you can always override the is_valid method.

Besides DataFrameIndexSerializer, two more Serializers are provided: DataFrameListSerializer and DataFrameRecordsSerializer. The difference is in the methods used to serialize and de-serialize DataFrames, which in turn changes the way the data is rendered.
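The three shapes can be illustrated without Pandas. The sketch below is inferred from what this post shows (the to_dict(orient='index') call above, the column-oriented dictionary that Bokeh consumes, and the columns/data payloads in the curl sessions further down), so treat the mapping of each layout to a serializer as an assumption rather than a description of the package’s internals. The function names are hypothetical.

```python
def to_index_layout(index, columns, rows):
    """{row label: {column: value}}, with labels rendered as strings."""
    return {str(label): dict(zip(columns, row)) for label, row in zip(index, rows)}


def to_list_layout(columns, rows):
    """{column: [values]}, one list per column."""
    return {column: [row[i] for row in rows] for i, column in enumerate(columns)}


def to_records_layout(index, columns, rows):
    """{'columns': [...], 'data': [[...], ...]}, with the index as first column."""
    return {
        'columns': ['index'] + list(columns),
        'data': [[label] + list(row) for label, row in zip(index, rows)],
    }


index = [0, 51]
columns = ['STATE', 'NAME']
rows = [['01', 'Alabama'], ['72', 'Puerto Rico']]

by_index = to_index_layout(index, columns, rows)
by_column = to_list_layout(columns, rows)
as_records = to_records_layout(index, columns, rows)
```

The same two rows come out keyed by row label, keyed by column name, or as a header plus a list of rows, which is why the choice of serializer changes what the client has to parse.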

Besides serializers, pandas-drf-tools also provides a GenericDataFrameAPIView to expose a DataFrame using a view, the same way DRF’s GenericAPIView does it with Django’s querysets. This class will rarely be used directly. A GenericDataFrameViewSet class combined with custom list, retrieve, create, and update mixins turns into DataFrameViewSet (and ReadOnlyDataFrameViewSet) which mimics the behaviour of ModelViewSet.

Instead of setting a queryset field or overriding get_queryset, users of DataFrameViewSet need to set a dataframe field or override the get_dataframe method.

Let’s say we wanted to provide a read-only view of a DataFrame stored in a CSV file. This is quite easy to do:

import pandas as pd


class TestDataFrameViewSet(ReadOnlyDataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def get_dataframe(self):
        return pd.read_csv('dataframe.csv')

Simple. Now, what happens if you want to allow users of the web API to update the dataframe? You’ll need to use DataFrameViewSet instead. By default, this class implements all actions (list, retrieve, create, update and destroy), but the methods that modify the underlying dataframe do not persist their changes. In order to give developers a chance to make the changes permanent, an update_dataframe callback is provided. For this example, we’ll use pickle to read and write the DataFrame:

import pandas as pd


class TestDataFrameViewSet(DataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def get_dataframe(self):
        return pd.read_pickle('test.pkl')

    def update_dataframe(self, dataframe):
        dataframe.to_pickle('test.pkl')
        return dataframe

These viewsets can then be used the same way as any regular DRF viewset. To actually make them available, we can register them with a router:

from rest_framework.routers import DefaultRouter

router = DefaultRouter()
router.register(r'test', TestDataFrameViewSet, base_name='test')

The only caveat here is that, since there is no queryset associated with the viewset, DRF cannot guess the base name, so it has to be set explicitly. (Newer DRF releases spell this argument basename; base_name was deprecated in version 3.9.)

An example

A complete example that shows most features of pandas-drf-tools is available on GitHub. It is a project that shows you how to use pandas-drf-tools with a live Django REST Framework site that generates Bokeh charts from information stored in Pandas data frames. The data is taken from the US Census Bureau site.

A Vagrantfile is provided if you want to test the live project by yourself.

States Population Estimates

The first part of the example shows what we think is going to be the most common use case for the pandas-drf-tools package, and that is taking an existing DataFrame and exposing it so a front-end application can make use of the data.

@lru_cache()
def get_cc_est2015_alldata_df():
    try:
        data = path.join(path.abspath(path.dirname(__file__)), 'data')
        cc_est2015_alldata_df = pd.read_pickle(path.join(data, 'CC-EST2015-ALLDATA.pkl'))
        state_df = get_state_df()[['STATE', 'STUSAB']]
        cc_est2015_alldata_df = cc_est2015_alldata_df.merge(state_df, on=('STATE',))
    except FileNotFoundError as e:
        raise ImproperlyConfigured(
            'Missing data file. Please run the "download_census_data" management command.'
        ) from e

    return cc_est2015_alldata_df


@lru_cache()
def get_state_df():
    try:
        data = path.join(path.abspath(path.dirname(__file__)), 'data')
        state_df = pd.read_pickle(path.join(data, 'state.pkl'))
    except FileNotFoundError as e:
        raise ImproperlyConfigured(
            'Missing data file. Please run the "download_census_data" management command.'
        ) from e

    return state_df

The data is presented to the user using Bokeh charts. The charts are generated by two views. The first one shows the population of each state:

def get_state_abbreviations():
    alldata_df = get_cc_est2015_alldata_df()
    return alldata_df['STUSAB'].drop_duplicates().tolist()


def get_states_plot():
    source = AjaxDataSource(
        data={'STATE': [], 'STNAME': [], 'STUSAB': [], 'TOT_POP': [], 'TOT_MALE': [], 'TOT_FEMALE': []},
        data_url='/api/states/', mode='replace', method='GET')

    hover = HoverTool(
        tooltips=[
            ("State", "@STNAME"),
            ("Population", "@TOT_POP"),
            ("Female Population", "@TOT_FEMALE"),
            ("Male Population", "@TOT_MALE"),
        ]
    )

    plot = figure(title='Population by State', plot_width=1200, plot_height=500,
                  x_range=FactorRange(factors=get_state_abbreviations()), y_range=(0, 40000000),
                  tools=[hover, 'tap', 'box_zoom', 'wheel_zoom', 'save', 'reset'])
    plot.toolbar.active_tap = 'auto'
    plot.xaxis.axis_label = 'State'
    plot.yaxis.axis_label = 'Population'
    plot.yaxis.formatter = NumeralTickFormatter(format="0a")
    plot.sizing_mode = 'scale_width'
    plot.vbar(bottom=0, top='TOT_POP', x='STUSAB', legend=None, width=0.5, source=source)

    url = "/counties/@STATE/"
    taptool = plot.select(type=TapTool)
    taptool.callback = OpenURL(url=url)

    return plot


class StatesView(TemplateView):
    template_name = 'chart.html'

    def get_context_data(self, **kwargs):
        context_data = super().get_context_data(**kwargs)

        plot = get_states_plot()
        bokeh_script, bokeh_div = components(plot, CDN)

        context_data['title'] = 'Population by State'
        context_data['bokeh_script'] = bokeh_script
        context_data['bokeh_div'] = bokeh_div

        return context_data

This view illustrates the guidelines given for embedding Bokeh charts. The interesting part lies in the usage of AjaxDataSource. When the chart is rendered in the front-end, it’ll make a GET request to the supplied URL to fetch the data. Here’s where pandas-drf-tools comes into play. The request is handled by a ReadOnlyDataFrameViewSet that exposes the dataframe constructed above.

class StateEstimatesViewSet(ReadOnlyDataFrameViewSet):
    serializer_class = DataFrameListSerializer
    pagination_class = LimitOffsetPagination

    def get_dataframe(self):
        alldata_df = get_cc_est2015_alldata_df()

        state_names_df = alldata_df[['STATE', 'STNAME', 'STUSAB']].set_index('STATE')\
            .drop_duplicates()
        latest_total_population = alldata_df[(alldata_df.YEAR == 8) & (alldata_df.AGEGRP == 0)]
        population_by_state = latest_total_population.groupby(['STATE']).sum().join(state_names_df)\
            .reset_index()

        return population_by_state[['STATE', 'STNAME', 'STUSAB', 'TOT_POP', 'TOT_MALE', 'TOT_FEMALE']]

The DataFrameListSerializer used in this view is compatible with the format that Bokeh uses, so there’s nothing else to do.

The only thing left to do is hook everything up with a template:

<!DOCTYPE html>
{% load static %}
<html lang="en">
<head>
    ...
    <link href="http://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css" rel="stylesheet" type="text/css">
    <link href="http://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css" rel="stylesheet" type="text/css">
</head>
<body>
    {{ bokeh_div|safe }}
    ...
    <script src="http://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.js"></script>
    <script src="http://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.js"></script>
    {{ bokeh_script|safe }}
</body>
</html>

This is how it looks:

If you click on a state’s column, a chart is shown with the top ten counties (by population) in that state. This is handled by a regular Django view with an embedded Bokeh chart:

def get_counties_data_frame(state_fips_code):
    alldata_df = get_cc_est2015_alldata_df()
    county_names_df = alldata_df[['STATE', 'COUNTY', 'CTYNAME']].set_index('COUNTY') \
        .drop_duplicates()
    latest_total_population = alldata_df[(alldata_df.YEAR == 8) & (alldata_df.AGEGRP == 0)]
    population_by_county = latest_total_population.groupby(['COUNTY']).sum() \
        .join(county_names_df).reset_index()
    population_by_county = population_by_county[['STATE', 'COUNTY', 'CTYNAME', 'TOT_POP', 'TOT_MALE', 'TOT_FEMALE']]
    population_by_county = population_by_county[population_by_county.STATE == state_fips_code]
    population_by_county = population_by_county.sort_values('TOT_POP', ascending=False)[:10]

    return population_by_county


def get_counties_plot(data_frame):
    plot = Bar(data_frame, label='CTYNAME', values='TOT_POP', agg='max', plot_width=1200, plot_height=500,
               title='Population by County', legend=False)
    plot.xaxis.axis_label = 'County'
    plot.yaxis.axis_label = 'Population'
    plot.yaxis.formatter = NumeralTickFormatter(format="0a")
    plot.sizing_mode = 'scale_width'

    return plot


class CountiesView(TemplateView):
    template_name = 'chart.html'

    def get_context_data(self, **kwargs):
        context_data = super().get_context_data(**kwargs)

        data_frame = get_counties_data_frame(kwargs['state_fips_code'])
        plot = get_counties_plot(data_frame)
        bokeh_script, bokeh_div = components(plot, CDN)

        context_data['title'] = 'Population by County'
        context_data['bokeh_script'] = bokeh_script
        context_data['bokeh_div'] = bokeh_div

        return context_data

Read-write example

The second part shows you how to manipulate a dataframe as if it were a queryset, allowing you not only to list rows of the dataset, but also to create new rows and to update and delete existing ones. This time we’re going to use a different data set that only contains state population estimates:

def get_nst_est2015_alldata_df():
    df = cache.get('nst_est2015_alldata_df')

    if df is None:
        try:
            data = path.join(path.abspath(path.dirname(__file__)), 'data')
            df = pd.read_pickle(path.join(data, 'NST-EST2015-alldata.pkl'))
            df = df[df.SUMLEV == '040'][['STATE', 'NAME', 'POPESTIMATE2015']].reset_index(drop=True)
            cache.set('nst_est2015_alldata_df', df)
        except FileNotFoundError as e:
            raise ImproperlyConfigured(
                'Missing data file. Please run the "download_census_data" management command.'
            ) from e

    return df

The dataframe is then exposed through a DataFrameViewSet that illustrates how to make the changes stick by implementing the update_dataframe method. The index_row method was overridden so we can reference the states based on their FIPS code instead of their position within the dataframe.

class TestDataFrameViewSet(DataFrameViewSet):
    serializer_class = DataFrameRecordsSerializer

    def index_row(self, dataframe):
        return dataframe[dataframe.STATE == self.kwargs[self.lookup_url_kwarg]]

    def get_dataframe(self):
        return get_nst_est2015_alldata_df()

    def update_dataframe(self, dataframe):
        cache.set('nst_est2015_alldata_df', dataframe)
        return dataframe

This set-up allows us to list rows:

$ curl --silent http://localhost:8000/api/test/ | python -mjson.tool
{
    "columns": [
        "index",
        "STATE",
        "NAME",
        "POPESTIMATE2015"
    ],
    "data": [
        [
            0,
            "01",
            "Alabama",
            4858979
        ],
        ...
        [
            51,
            "72",
            "Puerto Rico",
            3474182
        ]
    ]
}

...get the details of a specific row...

$ curl --silent http://localhost:8000/api/test/72/ | python -mjson.tool
{
    "columns": [
        "index",
        "STATE",
        "NAME",
        "POPESTIMATE2015"
    ],
    "data": [
        [
            51,
            "72",
            "Puerto Rico",
            3474182
        ]
    ]
}

...add new rows...

$ curl --silent -X POST -H "Content-Type: application/json" --data '{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",1]]}' http://localhost:8000/api/test/
{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",1]]}

...update existing rows...

$ curl --silent -X PUT -H "Content-Type: application/json" --data '{"columns":["index","STATE","NAME","POPESTIMATE2015"],"data":[[52,"YY","Mars",0]]}' http://localhost:8000/api/test/YY/

...and delete rows...

$ curl --silent -X DELETE http://localhost:8000/api/test/YY/

It provides pretty much the same functionality as regular DRF ModelViewSets.

Conclusions

If you analyze the code that we’ve presented, you’ll notice that there’s not much to it. This is a testament to how well designed Pandas, Django REST Framework and Bokeh are. In the end it was just a matter of connecting the dots and a little bit of elbow grease.

Feedback

As always, comments, tickets and pull requests are welcome. You can reach me at abarto (at) machinalis.com or @m4rgin4l on Twitter; and you can check our Services page to see other cool things we do here at Machinalis.

↧