Let's discuss more about using Google Cloud Storage instead of the native Hadoop file system, HDFS. Hadoop was built on Google's original MapReduce design from the 2004 white paper, which was written in a world where data was local to the compute machine. Back in 2004, network speeds were pretty slow, and that's why data was kept as close as possible to the processor. Now, with petabit networking speeds, you can treat storage and compute independently and still move traffic quickly over the network.

Your on-premises Hadoop cluster needs local storage on its disks, since the same servers run both the compute jobs and the storage. In the cloud, that's one of the first areas for optimization. Why is that? Well, naturally, you can run HDFS in the cloud just by lifting and shifting your Hadoop workloads to Cloud Dataproc. This is often the first step to the cloud, and it requires no code changes; it just works. But HDFS on the cloud is a sub-par solution in the long run, because of how HDFS works on the cluster: block size, data locality, and replication.

For block size in HDFS, you're tying the performance of input and output to the actual hardware the server is running on. Storage is not elastic in this scenario; it's bound to the cluster. If you run out of persistent disk space on your cluster, you'll need to resize it, even if you don't need the extra compute power. For data locality, there are similar concerns about storing data on individual persistent disks. This is especially true when it comes to replication: in order for HDFS to be highly available, it replicates three copies of each block out to storage. Wouldn't it be better to have a storage solution that's managed separately from the constraints of your cluster, apart from where the computing is done? What would it take to achieve such a solution?

Well, first, you need a super-fast way to connect compute and storage if they're not in the same place. Google's network enables new solutions for big data. The Jupiter networking fabric within a Google data center delivers over one petabit per second of bandwidth. To put that in perspective, that's about twice the amount of traffic exchanged on the entire public internet. If you draw a line somewhere in a network, bisectional bandwidth is the rate of communication at which the servers on one side of the line can communicate with the servers on the other side. With enough bisectional bandwidth, any server can communicate with any other server at full network speeds, and with petabit bisectional bandwidth the communication is so fast that it no longer makes sense to transfer files and store them locally on a single server. Instead, it makes sense to use the data right from where it's stored.

Here's how that equates to Google Cloud Platform's products. Inside a Google data center, the internal name for the massively distributed storage layer is Colossus, and the network inside the data center is Jupiter. Cloud Dataproc clusters get the advantage of scaling the VMs they need for compute up and down, while passing persistent storage needs over the ultra-fast Jupiter network to a storage product like Cloud Storage, which is backed by Colossus behind the scenes. So you've seen this graph before, and here's where running Hadoop in the cloud, as the next generation of Hadoop workloads, really gets you those benefits.
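To make that concrete, here's a minimal PySpark sketch of what a Dataproc job looks like when both its input and its output live in Cloud Storage instead of on the cluster's disks. The bucket, paths, and column names are just placeholders for illustration, not anything from the course.

    from pyspark.sql import SparkSession

    # On Cloud Dataproc the Cloud Storage connector is preinstalled,
    # so gs:// paths can be read and written like any Hadoop file system path.
    spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

    # Read input directly from Cloud Storage (placeholder bucket and path).
    logs = spark.read.json("gs://example-analytics-bucket/raw/events/*.json")

    # Do the compute on the cluster's (ephemeral) VMs.
    daily_counts = (logs.groupBy("event_date")
                        .count()
                        .withColumnRenamed("count", "event_count"))

    # Write results back to Cloud Storage; nothing needs to persist on the cluster.
    daily_counts.write.mode("overwrite").parquet(
        "gs://example-analytics-bucket/aggregates/daily_counts/")

    spark.stop()

Because the results land in Cloud Storage rather than in HDFS, nothing of value lives on the cluster once the job finishes, which is exactly what lets you scale the VMs up and down, or delete the cluster entirely, without worrying about the data.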
Cloud-native tools like Cloud Dataproc encourage companies to leave the storage and the processing power to Google's massive data centers, and that gives you the advantage of scale and cost efficiency. As I mentioned before, to take advantage of Cloud Storage instead of HDFS, you can simply change the paths in your Hadoop job code from hdfs:// to gs://, as you see here. Additionally, consider using BigQuery for your data processing and analytical workloads instead of performing them on the cluster.

Remember, one of the biggest benefits of Hadoop in the cloud is that separation of compute and storage. With Cloud Storage as the back end, you can treat clusters themselves as ephemeral resources, which means you don't pay for compute capacity when you're not running any jobs. Also, Cloud Storage is its own completely scalable and durable storage service, connected to many other GCP products. You can even persist HDFS data in Cloud Storage and query it directly from BigQuery via a federated query; there's a short sketch of that at the end of this section.

So let's recap. Cloud Storage can be a drop-in replacement for the HDFS back end for Hadoop; the rest of your code would just work. Also, you can use the Cloud Storage connector manually on your non-cloud Hadoop clusters if you don't want to migrate your entire cluster to the cloud yet, maybe just the storage; that's sketched at the end of this section as well. Cloud Storage is optimized for large, bulk parallel operations; it has very high throughput, but it has significant latency, which is definitely something to consider. If you have large jobs that run over lots of tiny little blocks, you may be better off with HDFS. Additionally, you want to avoid iterating sequentially over many nested directories in a single job. Another thing to keep in mind is that Cloud Storage is, at its core, an object store; it only simulates a file directory. So directory renames in HDFS are not the same as they are in Cloud Storage, but new object-store-oriented output committers mitigate this, as you see here. Lastly, to get your data to the cloud, you can use DistCp. In general, you want to use a push-based model for any data that you know you'll need, while a pull-based model may be useful if there's a lot of data that you might not ever need to migrate.
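Here's the federated-query sketch I mentioned: one possible way, using the BigQuery Python client library, to query Parquet files sitting in Cloud Storage without loading them into BigQuery first. The bucket, path, table alias, and column names are assumptions for illustration.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the Parquet files in Cloud Storage as an external (federated) table.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = [
        "gs://example-analytics-bucket/aggregates/daily_counts/*.parquet"]

    # Register the external definition under a temporary table name for this query only.
    job_config = bigquery.QueryJobConfig(
        table_definitions={"daily_counts": external_config})

    query = """
        SELECT event_date, event_count
        FROM daily_counts
        ORDER BY event_count DESC
        LIMIT 10
    """

    # The data never leaves Cloud Storage; BigQuery reads it in place.
    for row in client.query(query, job_config=job_config).result():
        print(row.event_date, row.event_count)

The external table definition only exists for the lifetime of that one query; the files themselves stay in Cloud Storage.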
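And here's the Cloud Storage connector sketch for a non-cloud cluster: a hedged example of pointing an existing on-premises Spark-on-Hadoop job at gs:// paths, assuming you've already placed the connector JAR on the cluster's classpath and have a service account key file available. The exact property names can vary between connector versions, so treat this as a sketch rather than a recipe.

    from pyspark.sql import SparkSession

    # Hypothetical configuration for an on-premises cluster; on Dataproc none of
    # this is needed because the connector is already installed and configured.
    spark = (SparkSession.builder
             .appName("gcs-connector-on-prem")
             .config("spark.hadoop.fs.gs.impl",
                     "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
             .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                     "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
             .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                     "/path/to/service-account-key.json")  # placeholder path
             .getOrCreate())

    # With the connector on the classpath, gs:// paths work alongside hdfs:// paths,
    # so you can move the storage to the cloud before you move the cluster.
    df = spark.read.parquet("gs://example-analytics-bucket/aggregates/daily_counts/")
    df.show()

That's the "maybe just the storage" option from the recap: the compute stays on premises for now, while the data already lives in Cloud Storage.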