nicky_nick


Free download of the thesis: A parallel implementation on modern hardware for geo-Electrical tomographical software





INTRODUCTION
CHAPTER 1. HIGH PERFORMANCE COMPUTING ON MODERN HARDWARE
1.1 An overview of modern parallel architectures
1.1.1 Instruction-Level Parallel Architectures
1.1.2 Process-Level Parallel Architectures
1.1.3 Data parallel architectures
1.1.4 Future trends in hardware
1.2 Programming tools for scientific computing on personal desktop systems
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
1.2.2 GPU programming with CUDA
1.2.3 Heterogeneous programming and OpenCL
CHAPTER 2. THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY
2.1 Inversion theory
2.2 The geophysical model
2.3 The forward problem by differential method
CHAPTER 3. SOFTWARE IMPLEMENTATION
3.1 CPU implementation
3.2 Example Results
3.3 GPU Implementation using CUDA
CONCLUSION
REFERENCES
 
 




Summary of the document's content:

Each core can run a Linux OS independently. The processor also has Dynamic Distributed Cache technology, which provides a fully coherent shared cache system across an arbitrarily sized array of tiles. Programs can be written in the usual way on a Linux derivative with full support for C and C++ and the Tilera parallel libraries. Each core uses a VLIW (Very Long Instruction Word) design with RISC instructions. The processor primarily targets networking, multimedia and cloud computing, with a strong emphasis on integer computation to complement the floating-point computation of GPUs.
From all these trends, it is reasonable to assume that in the near future we will see new architectures that combine features of all current architectures, such as many-core processors in which each core pairs a CPU core with stream processors as co-processors. Such systems would provide tremendous computing power per processor and would cause major changes in the field of computing.
1.2 Programming tools for scientific computing on personal desktop systems
Traditionally, most scientific computing tasks have been done on clusters. However, with the advent of modern hardware that provides a high level of parallelism, many small to medium-sized tasks can now be run on a single high-end desktop computer in reasonable time. Such systems are called “personal supercomputers”. Although their configurations vary, most today employ multicore CPUs with multiple GPUs. An example is the FASTRA II desktop supercomputer [3] at the University of Antwerp, Belgium, which can achieve 12 Tflops of computing power. The FASTRA II contains six NVIDIA GTX295 dual-GPU cards and one GTX275 single-GPU card, with a total cost of less than six thousand euros. The real processing speed of this system can equal that of a cluster with thousands of CPU cores.
Although these systems are more cost-effective, consume less power and provide greater convenience for their users, they pose serious problems for software developers.
Traditional programming tools and algorithms for cluster computing are not appropriate for exploiting the full potential of multi-core CPUs and GPUs. There are many kinds of interaction between components in such heterogeneous systems. The CPU and the GPUs are linked through the PCI Express bus. A GPU has to go through the CPU to access system memory. Internally, the multicore CPU is an SMP system. As each GPU has separate graphics memory, the relationship between GPUs resembles that in a distributed-memory system. As these systems are in the early stages of development, programming tools do not provide all the functionality programmers need, and many tasks still have to be done manually. Algorithms also need to be adapted to the limitations of current hardware and software tools.
In the following parts, we present some programming tools for desktop systems with multi-core CPUs and multiple GPUs that we consider useful for exploiting parallelism in scientific computing. The grouping is only for easy comparison between similar tools, as some tools provide more than one kind of parallelization.
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
Windows and Linux (and other Unixes) provide APIs for creating and manipulating operating system threads: WinAPI threads and POSIX threads (Pthreads), respectively. These threading approaches may be convenient when there is a natural way to functionally decompose an application, for example into a user interface thread, a compute thread and a render thread.
However, for more complicated parallel algorithms, manually creating and scheduling threads can lead to more complex code, longer development time and suboptimal execution.
The alternative is to program atop a concurrency platform, that is, an abstraction layer of software that coordinates, schedules, and manages the multicore resources.
Using thread pools is a parallel pattern that can provide some improvements. A thread pool is a strategy for minimizing the overhead associated with creating and destroying threads and is possibly the simplest concurrency platform. The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program. When a task is created, it executes on a thread in the pool and returns the thread to the pool when finished. A problem arises when a task arrives and the pool has no thread available: the pool then suspends the task and wakes it up when a thread becomes available. This requires synchronization, such as locks, to ensure atomicity and avoid concurrency bugs. Thread pools are common in the client-server model, but for other tasks, scalability and deadlocks still pose problems.
This calls for concurrency platforms with higher levels of abstraction that provide more scalability, productivity and maintainability. Some examples are OpenMP, Intel Threading Building Blocks, and Cilk++.
OpenMP (Open Multiprocessing) [25] is an open concurrency platform that supports multithreading through compiler pragmas in C, C++ and Fortran. It is an API specification, and compilers can provide different implementations. OpenMP is governed by the OpenMP Architecture Review Board (ARB). The first OpenMP specification came out in 1997 with support for Fortran, followed by C/C++ support in 1998. Version 2.0 was released in 2000 for Fortran and 2002 for C/C++. Version 3.0 was released in 2008 and is the current API specification. It contains many major enhancements, especially the task construct. Most recent compilers have added some level of support for OpenMP. Programmers inspect the code to find places that require parallelization and insert pragmas that tell the compiler to produce multithreaded code. The resulting code follows a fork-join model: when a parallel section finishes, all the threads join back into the master thread. The iterations of a loop are distributed among threads using work-sharing. There are four kinds of loop workload scheduling in OpenMP:
Static scheduling: each thread is given an equal chunk of iterations.
Dynamic scheduling: iterations are assigned to threads as the threads request them. A thread executes a chunk of iterations (whose size is controlled by the chunk size parameter), then requests another chunk until there are no more chunks to work on.
Guided scheduling: almost the same as dynamic scheduling, except that the chunk size shrinks as the loop proceeds. For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size of k (k > 1), the size of each chunk is determined in the same way, with the restriction that chunks do not contain fewer than k iterations.
Runtime scheduling: the decision regarding the scheduling kind is made at run time. The schedule and (optional) chunk size are set through the OMP_SCHEDULE environment variable.
Besides scheduling clauses, OpenMP also has clauses for data-sharing attributes, synchronization, IF control, initialization, data copying, reduction and other concurrent operations.
A typical OpenMP parallelized loop may look like:
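#pragma omp parallel for schedule(dynamic, CHUNKSIZE)
for (int i = 2; i <= N-1; i++)
    for (int j = 2; j <= i; j++)
        for (int k = 1; k <= M; k++) {
            // different values of i update the same b[j], so the update is made atomic
            #pragma omp atomic
            b[j] += a[i-1][j]/k + a[i+1][j]/k;
        }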
Figure 9 OpenMP fork-join model.
Intel’s Threading Building Blocks (TBB) [4] is an open source C++ template library developed by Intel for writing task-based multithreaded applications, with ideas and models inherited from many previous languages and libraries. While OpenMP uses the pragma approach for parallelization, TBB uses the library approach. The first version came out in August 2006, and since then TBB has seen widespread use in many applications, especially game engines such as Unreal. TBB is available with both a commercial license and an open source license. The latest version, 2.2, was introduced in August 2009. The library has also received the Jolt Productivity award and the InfoWorld OSS award.
TBB is based on generic programming, requires no special compiler support, and is processor and OS independent. This makes it well suited to parallelizing legacy applications. TBB supports Windows, Linux, OS X, Solaris, PowerPC, Xbox, QNX and FreeBSD, and can be compiled with Visual C++, Intel C++, gcc and other popular compilers.
TBB is not a thread-replacement library but provides a higher level of abstraction. Developers do not work directly with threads but with tasks, which are mapped to threads by the library runtime. The number of threads is managed automatically by the library or set manually by the user, just as with OpenMP. Besides the basic loop-parallelizing parallel_for construct, TBB also has parallel patterns such as para...
 
