Zegami SDK


Zegami is a great tool for managing, sharing and getting insights from your data in an intuitive way that can be used and understood by anyone. However, as developers we often want to transform, manipulate and process the data we collect in more sophisticated ways, so that it can provide us with meaningful insights and applications. This might involve feeding images through a custom-trained neural net to extract new features and condensing them down using dimensionality reduction; writing a custom Zegami-driven TensorFlow callback; or something as simple as replacing all of the underscores with spaces in filenames, to satisfy our desire for clean data. Whatever the task, it's invaluable to have a simple system for programmatically communicating with Zegami for maximum versatility.

In this post, we will be using an early version of our SDK (available for Python and JS), designed to enable you to access your uploaded images and data quickly and simply. Using a few lines of Python we’ll connect to a collection, retrieve some data and fetch some images and their associated metadata based on a criterion. In an upcoming post, we will be covering how this was used to make a custom Keras-based DataGenerator to train a machine-learning algorithm entirely from a collection and a reference to a class column in the uploaded data.

Note:

The SDK is a work-in-progress, so several collection-creation/upload-related features are still in development. More on this soon!

Getting Started

1. Set up a Python 3.7 environment

This SDK has been developed and tested using Python 3.7(.7), so I recommend setting up a Python 3.7 environment first. Anaconda and its convenient Anaconda Prompt are great for managing installations. Using Anaconda, you can create and activate a new Python 3.7 environment using:

conda create -n my_new_environment python=3.7

conda activate my_new_environment

2. Install the SDK package

Next, you need to install the Zegami client package using pip:

pip install zegami-sdk

The Task

If you want to follow along, make sure you have a collection created before you continue! Keeping it simple, we will demonstrate how to connect to a collection containing many, many dogs, and retrieve a handful of those from the Cairn breed.

Preparing the Client

To interact with Zegami, begin by setting up an instance of the ZegamiClient class. This acts as a hub for the various collection commands and as authentication for accessing private collections:

from zegami_sdk.zegami_client import ZegamiClient

zc = ZegamiClient(username, password)

When first instantiating this client, a username and password will need to be provided. For convenience, unless disabled with allow_save_token=False, the client will save a file named “zegami.token” to your home directory so that you don’t have to bother with your credentials next time, and can later simply call zc = ZegamiClient(). Feel free to delete this token at any time (or never save it at all).
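For example, here is a minimal sketch of the three ways this can play out (only the ZegamiClient constructor and the allow_save_token argument are taken from above; the rest is illustrative):

# First run: authenticate with credentials; the client saves zegami.token to your home directory
zc = ZegamiClient(username, password)

# Later runs: the saved token is picked up automatically, so no credentials are needed
zc = ZegamiClient()

# Opt out of saving the token entirely
zc = ZegamiClient(username, password, allow_save_token=False)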

Great! On to collections.

Workspaces and Collections

Zegami features repositories for collections called "Workspaces". You can invite people as guests or admins to your workspace, providing them with access to the collections within. You may have multiple workspaces, but a typical user will have just their first, default one. Every workspace has its own ID, which several SDK methods accept as an optional argument, as described below.

If you're working within the confines of your default workspace, you don't really need to worry about this part: several methods take an optional 'workspace_id' argument, and if you don't provide it the client will assume you mean your default workspace. If you're trying to look for or access collections in a non-default workspace, make sure you use this optional argument. I will be using a slightly different workspace from my regular one, so I will be including this optional step, as shown below.
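As a minimal sketch, I'll just keep the ID in a variable to pass along later (the value here is made up):

workspace_id = 'aBc123XyZ'  # hypothetical ID copied from the Zegami web app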

Selecting your Collection

If you've made a collection, you'll see it appear in your workspace.

There are a couple of different ways to select your collection with the client, but the simplest is just to get it by name. As mentioned above, the specific workspace ID is provided here:
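A minimal sketch (the method name get_collection_by_name and the collection name 'Dogs' are assumptions for illustration; workspace_id is the variable from earlier):

# Get a reference to the collection by its name (method name assumed)
collection = zc.get_collection_by_name('Dogs', workspace_id=workspace_id)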

Simple enough – we now have a reference to this collection. It's nothing fancy, just a dictionary containing information about the collection, recognised by other parts of the SDK.

Getting Data

Awesome, we're in. From here, we can grab whatever data we'd like from our collection. The SDK returns data to you in the form of Pandas DataFrames, so bonus points if you're familiar with these. For anyone who isn't, DataFrames are highly flexible structures for dealing with all sorts of shapes of data.

First, let’s try pulling the entirety of the collection’s metadata down into a DataFrame:
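A minimal sketch (get_rows is named in the next paragraph; its exact signature here is an assumption):

# Pull every row of the collection's metadata into a Pandas DataFrame
data = zc.get_rows(collection)
print(data.head())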

Now we can access the data. We don't want all of it, though; we just want the Cairns. You can flex your own Pandas skills here if you like – the information is all in the frame – but for ease of use we can actually just grab the Cairns with a slightly different variant of "get_rows":
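A minimal sketch (the method name get_rows_by_filter is an assumption for that filtered variant):

# Grab only the rows whose 'Breed' column matches 'Cairn' (method name assumed)
cairns = zc.get_rows_by_filter(collection, {'Breed': ['Cairn']})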

Notice that ‘Cairn’ is in a list – you can actually query many breeds at once (OR logic). We can also supply additional filter layers (AND logic) using additional key : value pairs. For example, we could filter: 

{'Breed': ['Cairn', 'Labrador'], 'Sex': 'Female'}

to get all dogs that are:

Breed=(Cairn OR Labrador) AND Sex=(Female)
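With the same assumed method as above, that combined filter would be passed like so:

# OR logic within a key's list, AND logic across keys (method name assumed)
female_cairns_and_labs = zc.get_rows_by_filter(
    collection,
    {'Breed': ['Cairn', 'Labrador'], 'Sex': 'Female'}
)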

At the time of writing, this filter functionality is limited to categorical types like these, but numerical filtering will be added soon. If you need this functionality now, Pandas definitely has you covered. If you had an age column and wanted to filter numerically, you could grab all the data and try:

mask = (data['Age'] > 3) & (data['Age'] < 5)

three_to_fives = data[mask]

But where are the Dogs?

This data querying is centred around acquiring the relevant rows, which, depending on your goals, may be enough – but we came here for dogs! Let's download some images to see how it works.

There are only two steps to this – getting image URLs and using those URLs to actually download the images. Let's get the Cairn image URLs. The method needs a reference to the collection, and expects a DataFrame of the selection you want. Since we already have our cairns subset, let's use that:
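A minimal sketch (the method name get_image_urls is an assumption; collection and cairns come from the earlier snippets):

# Resolve each selected row to the URL of its image (method name assumed)
urls = zc.get_image_urls(collection, cairns)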

Now we can obtain the image data using those URLs. We can download individual images, but often we want several at once. We could slowly download images in a chain, or use multiple threads to rapidly download batches of images:
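A minimal sketch (the method names download_image and download_image_batch are assumptions):

# Download a single image into memory as a PIL.Image (method name assumed)
first_cairn = zc.download_image(urls[0])

# Or download the whole batch using multiple threads (method name assumed)
cairn_images = zc.download_image_batch(urls)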

It’s important to note that these are being downloaded directly into memory, each as a PIL.Image. You can save these to file or use them as you see fit, but if you download a whole collection of large images, be aware you only have finite RAM.

Wrap Up 

So there we go: we've installed the Zegami SDK package, instantiated an authenticated client, grabbed our collection, retrieved the metadata for a subset of the collection and downloaded the associated images.

This is a basic use case of the SDK, but it can be utilised in much more powerful ways. In an upcoming post, we will share how we made a Keras-based ZegamiDataGenerator to power a machine-learning pipeline using only these simple commands, and further down the line we'll release collection creation/updating functionality so that you can begin to add automation to your workflow.