A Survey of GPU Compute Providers (Especially Crowdsourced Ones)

1. Information Gathering

Outside China, many compute providers are already built on this idea [1]:

Lots of these services:

If you want HPC specific cloud providers:

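Most of them are essentially traditional IaaS (Infrastructure as a Service): you specify a GPU model and configuration, then use that single GPU resource.
Some providers have started moving toward serverless.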
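To make the IaaS model concrete, here is a minimal Python sketch of how a renter picks a machine: name an exact GPU model and spec, then take the cheapest matching listing. The `Offer` fields and the `cheapest_offer` function are hypothetical and do not correspond to any provider's actual API; any prices used with it are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One rentable machine listing (hypothetical fields, not a real provider API)."""
    offer_id: str
    gpu_model: str
    num_gpus: int
    gpu_ram_gb: int
    price_per_hour: float  # price for the whole listing, per hour


def cheapest_offer(offers: list[Offer], gpu_model: str, num_gpus: int,
                   min_gpu_ram_gb: int = 0) -> Offer | None:
    """IaaS-style selection: the renter fixes the GPU model and spec up front,
    and the cheapest listing that satisfies them wins (None if nothing matches)."""
    matches = [
        o for o in offers
        if o.gpu_model == gpu_model
        and o.num_gpus >= num_gpus
        and o.gpu_ram_gb >= min_gpu_ram_gb
    ]
    return min(matches, key=lambda o: o.price_per_hour, default=None)
```

The point of the sketch is that the renter's request is tied to a concrete GPU type: if no listing of that model is available, the request simply fails, even if other hardware could do the job.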

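Projects such as Petals innovate at the architectural level: you don't specify a GPU at all, only an abstract compute network. This approach is probably closer to where things are heading.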
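A minimal sketch of why a Petals-style system doesn't need the user to pick a GPU, assuming a simplified model in which each volunteer server hosts a contiguous range of a model's transformer blocks and the client stitches together a chain of servers covering every block. The greedy interval cover below is a stand-in for illustration only; Petals' real routing also weighs throughput and latency, and `build_chain` and the server names are hypothetical.

```python
from __future__ import annotations


def build_chain(num_blocks: int,
                servers: dict[str, tuple[int, int]]) -> list[tuple[str, int, int]] | None:
    """Greedily cover blocks [0, num_blocks) with servers.

    `servers` maps a server name to the half-open block range [start, end) it hosts.
    Returns (server, first_block, end_block) hops in order, or None if some
    block is not hosted by anyone.
    """
    chain = []
    pos = 0
    while pos < num_blocks:
        best_name, best_end = None, pos
        # Among servers that host block `pos`, pick the one reaching furthest.
        for name, (start, end) in servers.items():
            if start <= pos < end and end > best_end:
                best_name, best_end = name, end
        if best_name is None:
            return None  # gap: no server currently hosts block `pos`
        chain.append((best_name, pos, min(best_end, num_blocks)))
        pos = best_end
    return chain
```

From the user's side, the only input is the model; which physical GPUs end up in the chain depends on who happens to be online, which is what makes the network "abstract".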

In China there are:

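There are also providers that use virtualization to pool GPUs. This is clearly more flexible, but it is better suited to data centers. In China these include: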

We compare them in terms of contribution requirements, rental model, earnings, and applications.

Let's go through them one by one and look at the approach each one takes. This article will be updated gradually.

This one does not seem to offer compute rental to individual users. It has three products:

(Figure: GPU selection UI)

Its GPU rentals come in two kinds of machines:

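- Secure Cloud (Our trusted datacenter partners)
- Unverified Machines (Machines haven't been manually checked for reliability, functionality, compatibility. Use at user's own risk.)

After choosing a machine and a template, the user can start using the compute. Most templates are provided officially and come in several variants, such as: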

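As for service types, besides compute rental for individual users, it also offers partnership plans for business customers. Compute resources can be used in two ways: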

Vast.ai | Hosting

Vast is a GPU marketplace. Hosts sell GPU resources on the marketplace. Hosts are responsible for:

- Setup: installing Ubuntu, creating disk partitions, installing NVIDIA drivers, opening network ports on the router and installing the Vast hosting software.
- Testing and troubleshooting all issues that can arise, such as driver conflicts, errors, bad GPUs, and bad network ports. Vast does not offer support for getting your machine working. There is a host discord with helpful members and the host-general channel is searchable for specific errors.
- Managing the listings and GPU offers for rentals, including setting pricing and end dates for the offers
- Planning for maintenance so that no client jobs are affected

Nvidia:

- Ubuntu 18.04 or newer (required)
- Dedicated machines only - the machine shouldn't be doing other stuff while rented
- Fast, reliable internet: at least 10Mbps per machine.
- 10-series or newer Nvidia GPU (older Nvidia not recommended).
- At least 1 physical CPU core (2 hyperthreads) per GPU.
- Your CPU must support AVX instruction set (not all lower end CPUs support this).
- At least 4GB of system RAM per GPU.
- Fast SSD storage with at least 128GB per GPU.
- At least 1X PCIE for every 2.5 TFLOPS of GPU performance.
- All GPUs on the machine must be of the same type.
- An open port range mapped to each machine.

AMD:

- Ubuntu 20.04 or newer (required); Ubuntu 22.04 or newer recommended
- Dedicated machines only - the machine shouldn't be doing other stuff while rented
- Fast, reliable internet: at least 10Mbps per machine.
- One of the following GPUs:
  - MI25 or newer Radeon Instinct series GPU.
  - Radeon VII or Radeon Pro VII.
  - Radeon RX 7900 (GRE/XT/XTX) or Radeon Pro W7900/W7800. Other 6000 series or newer Radeon RX/Pro W series GPUs may be supported, but may not be searchable using standard filters for AMD ROCm.
- At least 1 physical CPU core (2 hyperthreads) per GPU.
- Your CPU must support AVX instruction set (not all lower end CPUs support this).
- At least 4GB of system RAM per GPU.
- Fast SSD storage with at least 128GB per GPU.
- At least 1X PCIE for every 2.5 TFLOPS of GPU performance.
- All GPUs on the machine must be of the same type.
- An open port range mapped to each machine.
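Most of the Nvidia requirements above are simple per-GPU thresholds, so a prospective host could pre-check a machine with a few comparisons. The following is a minimal Python sketch under that reading; the `NvidiaHost` fields and `check_nvidia_host` are hypothetical names, not part of Vast.ai's hosting software, and the checks that can't be reduced to a number (dedicated use, the 10-series minimum, the open port range) are deliberately left out.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class NvidiaHost:
    """Hypothetical machine description; field names are illustrative only."""
    ubuntu_version: tuple[int, int]  # e.g. (22, 4) for Ubuntu 22.04
    gpu_models: list[str]            # one entry per installed GPU
    physical_cores: int
    has_avx: bool
    ram_gb: float
    ssd_gb: float
    bandwidth_mbps: float
    pcie_lanes_per_gpu: int
    tflops_per_gpu: float


def min_pcie_lanes(tflops: float) -> int:
    """At least 1x PCIe per 2.5 TFLOPS, rounded up to whole lanes."""
    return math.ceil(tflops / 2.5)


def check_nvidia_host(h: NvidiaHost) -> list[str]:
    """Return the numeric requirements this host fails (empty list = all pass)."""
    n = len(h.gpu_models)
    problems = []
    if h.ubuntu_version < (18, 4):
        problems.append("Ubuntu 18.04 or newer required")
    if h.bandwidth_mbps < 10:
        problems.append("at least 10 Mbps internet")
    if h.physical_cores < n:
        problems.append("at least 1 physical CPU core per GPU")
    if not h.has_avx:
        problems.append("CPU must support AVX")
    if h.ram_gb < 4 * n:
        problems.append("at least 4 GB system RAM per GPU")
    if h.ssd_gb < 128 * n:
        problems.append("at least 128 GB SSD per GPU")
    if h.pcie_lanes_per_gpu < min_pcie_lanes(h.tflops_per_gpu):
        problems.append("at least 1x PCIe per 2.5 TFLOPS")
    if len(set(h.gpu_models)) > 1:
        problems.append("all GPUs must be the same type")
    return problems
```

The PCIe rule is the least obvious one: a GPU rated at 35 TFLOPS needs ceil(35 / 2.5) = 14 lanes, which in practice means an x16 slot, so risers or x4/x8 slots common in mining rigs would fail it.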

Basics:

Extra features:

Min Bid Price, run background jobs to utilize free time.

Create a background job to run on the machine, along with a floor price. This will create a low-priority job that runs only when there is no higher-priced job to run. The floor price will determine the minimum clients will have to pay to rent your GPUs.

It sounds like, when client bids for the GPUs fall short of what the host expects, the host can make up the income with something like crypto mining. Mining is optional, of course, and any mining revenue goes to the host.
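The floor-price mechanism can be illustrated with a toy scheduler. This Python sketch assumes a simplified model (clients bid an hourly price, only bids at or above the host's floor are eligible, the highest eligible bid preempts the background job); it is not Vast.ai's actual scheduling logic, and `select_job` and the `"background"` label are hypothetical.

```python
from __future__ import annotations


def select_job(bids: dict[str, float], floor_price: float) -> tuple[str, float]:
    """Toy scheduler: return (job, hourly price paid by a client).

    Only bids at or above the host's floor price are eligible; the highest wins.
    If none qualify, the host's low-priority background job (e.g. mining) runs,
    and no client pays anything for that time.
    """
    eligible = {client: price for client, price in bids.items() if price >= floor_price}
    if not eligible:
        return "background", 0.0
    winner = max(eligible, key=eligible.get)
    return winner, eligible[winner]
```

Under this model the floor price is the host's trade-off knob: set it high and the GPU spends more time on the background job; set it low and more client jobs get through at lower rates.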

A user only needs to register a dedicated hosting account and set up the environment; the machine then shows up in Vast.ai's GPU selection interface as an Unverified Machine.

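What I find noteworthy is hardware maintenance and technical support: both are left entirely to the host, and poor maintenance can even lead to penalties. Payment for each period is made within 14 days, and the host is responsible for taxes.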

Given these requirements, Vast.ai's main community hosts are most likely large data center operators, or individuals and groups with substantial GPU resources, hosting full-time.

Vast.ai may also own a large number of data center GPUs itself, but its main source of compute is trusted data centers.

[2]