Pagination issue when using Azure Data Factory ETL

TL;DR

I'm trying to query the list of tenants using the Partner API in Azure Data Factory (ADF). The pagination functionality within ADF relies on there being some sort of reference to the next page in the response to the current page, which isn't the case in this situation. If anyone has any thoughts or (even better) Sophos are able to add something into the API responses referring to the next page, I'd be really grateful.

Detailed explanation

Hi,

I am working on pulling in various bits of data into a data lake from the Sophos Central API. I am using Azure Data Factory (ADF), which is Microsoft's cloud-based ETL tool.

I am querying the partner/v1/tenants endpoint to get a list of customer tenants. This API uses the "pagination by offset" method described in the Sophos docs.

The issue I have is that this method of pagination is not supported by ADF's built-in pagination functionality. Unfortunately, all the methods of pagination within ADF rely on there being some sort of link or reference within a response to the next page, whereas the pagination method used by the Partner API relies on the caller implementing some basic logic to increment a page counter until it hits the total page count supplied.

See here for the relevant docs describing the pagination support within ADF: https://docs.microsoft.com/en-us/azure/data-factory/connector-rest#pagination-support.

I have found some discussions online regarding a convoluted workaround for such scenarios. Effectively, you implement the pagination yourself using variables and loops within ADF. See here for more details: Unable to use the Pagination Rules in REST sources in ADF (microsoft.com).

I am in the process of setting up a test of the workaround at the moment. However, it was mentioned that one significant downside of this method is that you end up creating separate JSON files within the data lake for each page, rather than a single file. These files then need put through some transformation to combine them. None of this is a show-stopper, but is definitely a lot more work - especially if the logic has to be implemented for every API call that uses the "pagination by offset" method.

It seems like this is perhaps a situation where both Sophos's API and ADF are both partly the issue - the Sophos API in the sense the pagination mechanism isn't as good as many other REST APIs, and ADF for having limited pagination support.

It's quite likely there is little option but to go with the workaround, but if anyone has any thoughts on better solutions, that would be great. I appreciate it's perhaps asking a lot, but if Sophos were able to implement a better pagination mechanism at some point, I'm sure that would help me and others in a similar situation.

Thanks in advance.



Error in post title
[edited by: mdi-db at 3:37 PM (GMT -7) on 28 May 2021]
Parents Reply Children
  • Hi Elias,

    Thanks for taking the time to reply. That's absolutely fine and I appreciate the reasoning behind the prioritisation of effort. I have been able to work around the problem for now in Data Factory. The workaround is a little clunky, but it works, so that's good enough for me for now!

    It's interesting that you mention the inconsistent paging experience. Since my original post, I've moved on to querying other APIs that use the key-based pagination mechanism. Data Factory's built-in pagination functionality works really well with that method with no hack needed; it's just the offset method that it can't cope with natively. Just mentioning it in case it's useful feedback for future planning.

    I did wonder why there were two very different pagination mechanisms. I couldn't see any obvious reason from a consumption point of view, so I figured it may be more of an issue around the age of the respective APIs or the technology used on the backend to implement them.

    Thanks again.