How to do more efficient focal statistics on huge dataset #10404
chenyangkang asked this question in Q&A · Unanswered
Hi community,
I'm not sure if a similar question has been posted before (I didn't find one), but I'm trying to figure out the most efficient way to do focal statistics for 37 land cover types on a global 30 m land cover dataset.
Objectives:
I'm trying to use orthogonal indexing to get a 30 m circular focal-statistics summary of the proportion of each of the 37 land cover types, given a batch of query longitudes, latitudes, and years. The data on disk is chunked into lon-lat tiles, with each tile containing 23 years of data.
It is hard to even load one whole tile into RAM, so xarray's lazy loading is super helpful here.
The code is like this:
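Roughly, it is a minimal sketch along these lines (the zarr store path, variable name, number of classes, and window size below are placeholders, not my exact setup):

```python
import numpy as np
import xarray as xr

# Lazily open the tiled 30 m land cover mosaic (placeholder path and variable
# name). open_zarr keeps the data as dask arrays, so nothing is read yet.
lc = xr.open_zarr("landcover_mosaic.zarr")["landcover"]  # dims: (year, lat, lon)

categories = np.arange(1, 38)  # the 37 land cover classes
window = 3                     # neighbourhood size in pixels (placeholder)

# One lazy focal-proportion layer per class: the fraction of pixels in the
# moving window equal to that class, stacked along a new "category" dimension.
full_mosaic_focal_stats = xr.concat(
    [
        (lc == cat)
        .rolling(lat=window, lon=window, center=True, min_periods=1)
        .mean()
        for cat in categories
    ],
    dim=xr.DataArray(categories, dims="category", name="category"),
)
```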
I then query `full_mosaic_focal_stats` using my data points (a rough sketch of that query follows this paragraph). Now the problem is that xarray seems to load each category separately: after it goes through the whole dataset to calculate the focal stats for category 1 for all points, it starts again to calculate the focal stats for category 2. I think this is not ideal in terms of time and I/O, especially with 37 categories. I noticed this when looking at the dask dashboard. I think it should load the data only once and compute all needed values.
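The query step looks roughly like this (a sketch using pointwise DataArray indexers rather than a full orthogonal cross-product; `lons`, `lats`, and `years` stand in for my real query batch, and the exact indexing call in my code may differ):

```python
import numpy as np
import xarray as xr

# Placeholder query batch; in practice these come from my point dataset.
lons = np.array([10.1, 10.2, 120.5])
lats = np.array([45.0, 45.1, -33.7])
years = np.array([2005, 2012, 2020])

# Pointwise selection: one profile over the 37 categories per query point.
points = full_mosaic_focal_stats.sel(
    lon=xr.DataArray(lons, dims="points"),
    lat=xr.DataArray(lats, dims="points"),
    year=xr.DataArray(years, dims="points"),
    method="nearest",
)

# Triggering compute here is where the dask dashboard shows the dataset being
# walked once per category instead of each chunk being read once and all 37
# classes evaluated from it.
result = points.compute()
```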
Does anyone have a better way to do this?
Thanks!