dvc push to Azure Blob storage is very slow #54
I don't think it's expected. Could you please share some additional info? Could you, as a matter of an experiment, also try to use the ... |
Hi shcheklein, thanks for your quick response!

dvc doctor:

pip freeze:
Experiment 1: az copy vs DVC vs fsspec/adlfs

A CSV file (size = 914.71 MiB) is uploaded to the same Azure Blob Container in 3 different ways (az copy, dvc, and fsspec/adlfs) and the upload time is measured.
Code to upload with AzCopy:
```
./azcopy copy "/path/to/local/file" "<SAS token to azure blob container>"
```

Code to upload with DVC:
```
dvc init
dvc remote add -d myremote azure://dvc/
dvc remote modify --local myremote connection_string 'BlobEndpoint=h...'
dvc add /path/to/file
dvc push
```

Code to upload with fsspec/adlfs:
```python
import time
import dask.dataframe as dd

storage_options = {'connection_string': 'BlobEndpoint=https://...'}

start_time = time.time()
ddf = dd.read_csv('/path/to/large.csv')
ddf.to_csv('abfs://dvc/large.csv', storage_options=storage_options)
end_time = time.time()
print('Upload time: {} seconds'.format(end_time - start_time))
```

Experiment 2: az copy vs DVC vs azure-sdk-for-python

A WAV file (size = 659.18 MiB) is uploaded to the same Azure Blob Container in 3 different ways (az copy, dvc, and azure-sdk-for-python) and the upload time is measured.
Code to upload with azure-sdk-for-python:
```python
import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'

# Set up service client
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

start_time = time.time()
blob_client.upload_blob(open('file_1.wav', 'rb').read())
end_time = time.time()
print('Upload time: {} seconds'.format(end_time - start_time))
```
|
I did an additional test with azure-sdk-for-python where I set the upload concurrency explicitly.

azure-sdk-for-python upload speed: File size = 659.18 MiB

I did not find a similar argument/option for fsspec/adlfs yet... |
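For reference, a minimal sketch of the SDK-side concurrency knob discussed in this thread (not the exact script used above; the max_concurrency value of 8 and the overwrite flag are arbitrary illustrative choices):

```python
import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'

service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

start_time = time.time()
with open('file_1.wav', 'rb') as data:
    # For large blobs, max_concurrency controls how many chunks of this one
    # blob the Azure SDK uploads in parallel.
    blob_client.upload_blob(data, overwrite=True, max_concurrency=8)
print('Upload time: {} seconds'.format(time.time() - start_time))
```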
Thanks for the research @mitjaalge. I think here is where it happens: https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L1617-L1623 and I also don't see it using any additional configs there, even though ...

@pmrowla @efiop, I would assume in this case we would pass it to all ... |
It belongs in adlfs (like how s3fs uses multipart uploads); DVC ...

There's an old issue in adlfs where they started testing ... |
It sounds about right (and no matter what it will need to support the
👍 |
It looks like s3fs doesn't actually try to balance the two, they just schedule the entire multipart upload for files (and only try to throttle the batching at the full file level).

Since the chunked uploading is handled in the azure sdk (and we can't control how they schedule the individual chunk tasks), I think we can probably just get away with doing something like the following. So in adlfs code it would be:

```python
from fsspec.asyn import _get_batch_size

await bc.upload_blob(max_concurrency=max(1, _get_batch_size() // 8))
await bc.download_blob(max_concurrency=max(1, _get_batch_size() // 8))
```

(everywhere they use upload/download_blob)

For reference, the azure chunking is done here for uploads (the chunked download behavior is the same): |
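As a quick sanity check on the proposed heuristic: with fsspec's default batch size of 128 (noted later in the thread), it works out to 16 concurrent chunk transfers per blob.

```python
# Illustration only, assuming the fsspec default batch size of 128.
batch_size = 128
per_blob_concurrency = max(1, batch_size // 8)
print(per_blob_concurrency)  # 16
```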
Thanks @pmrowla. One more question: do we by default override the batch size when ...

Is it pretty much a complete fix for this? :) (except an optional way to control it, etc.) |
The default is set here: https://github.com/fsspec/filesystem_spec/blob/166f462180e876308eae5cc753a65c695138ae56/fsspec/asyn.py#L172

The fsspec functions basically work by doing ..., where ...

In DVC we just call the fsspec methods with ...

Given that s3fs doesn't check ...

If we really wanted to just match s3fs behavior, we can dig more into the azure sdk to see what chunk size they use and then just set ... |
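As a conceptual illustration of the file-level batching described above (a sketch of the general pattern only, not fsspec's or DVC's actual code; upload_one and the batch sizes are hypothetical):

```python
import asyncio

async def upload_one(path: str) -> None:
    # Placeholder for a single-file upload coroutine (hypothetical).
    await asyncio.sleep(0.1)

async def upload_all(paths, batch_size: int = 128) -> None:
    # At most `batch_size` file uploads run at the same time; chunk-level
    # concurrency inside each upload is left to the storage SDK.
    sem = asyncio.Semaphore(batch_size)

    async def bounded(path: str) -> None:
        async with sem:
            await upload_one(path)

    await asyncio.gather(*(bounded(p) for p in paths))

asyncio.run(upload_all([f"file_{i}" for i in range(10)], batch_size=4))
```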
Yeah it should be, unless maybe the adlfs maintainers are aware of some other azure concurrency related issue here that I missed |
@shcheklein Do you plan to try to find time for this? Sounds great, just want to make sure we actually have a plan to implement it. |
for reference, the default azcopy concurrency settings are documented here: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-optimize#increase-the-number-of-concurrent-requests
In our case, since the default batch size in DVC/fsspec is 128, we probably end up with similar performance to azcopy for dir transfers (since we do concurrent file transfers), but we are slower for single file transfers (since right now we end up with a concurrency of 1). |
Upstream PR: fsspec/adlfs#420

@mitjaalge can you try installing the adlfs patch in your DVC env and then see if you get improved performance?
After the patch, the concurrency behavior is still naive in that fsspec and DVC will either batch multi-file transfers at the file level or single file transfers at the chunk level, but not both in the same operation. So for directory push/pull when the number of remaining files is less than |
On my machine (with a residential internet connection), for a 1.5GB file I get comparable performance between azcopy and DVC with the patch.

azcopy:
dvc main:
dvc with the patch:
|
@pmrowla: The following table shows the upload time for N files in a folder before and after the patch. Every file has a size of 691 MB. My internet speed is 482.51 Mbit/s down and 447.74 Mbit/s up.
BTW, your pip command didn't work for me; I had to use the following (if someone else wants to test it too):

```
pip install git+https://github.com/fsspec/adlfs.git@refs/pull/420/head
```
|
The jobs flag isn't applied for single file transfers with that patch; that will require changes in dvc and not adlfs. So your results are what I would expect for now. (When the number of files is 1, it always uses the default adlfs max_concurrency regardless of --jobs.) |
Should it not have given a bigger speedup for multiple files without the patch? 3 files with 1 job is nearly the same as 3 files with 3 jobs. |
Ah sorry, with the patch, max_concurrency is still applied to file chunks even when you use |
Ok, makes sense. |
This fix is now released in adlfs and dvc-objects (so ...).

The fix can be installed with ...
(it will also be available in the next DVC binary package release) |
Bug Report
dvc push: very slow when pushing to an Azure Blob Storage remote
Description
When pushing larger files (~700MB) to an Azure Blob Storage remote, I'm experiencing very slow speeds (3-4 min for a single file = ~4 MB/s). The same file takes around 10s to upload (~70 MB/s) when using AzCopy (https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs-upload).
Is this to be expected, or am I doing something wrong?
Reproduce
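A minimal reproduction based on the commands shared earlier in the thread (paths and the connection string are placeholders):

```
dvc init
dvc remote add -d myremote azure://dvc/
dvc remote modify --local myremote connection_string 'BlobEndpoint=https://...'
dvc add /path/to/large_file   # a single ~700MB file
dvc push
```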
Expected
Fast upload speed ~ 70 MB/s.
Environment information
DVC Version: 3.1.0
Python Version: 3.8.10
OS: tried on macOS Ventura 13.1 and Ubuntu 20.04.6 LTS
Thank you very much for the help!