AMD 2014 Roadmap Discussion (CPU/APU)



Anyone who keeps an eye on AMD products should already have seen the new 2014 roadmap, which consists mostly of new-gen APUs.

 

Now, this is speculation and an attempt to "debunk" the rumors on the internet.

 

A lot of talk is going around that AMD is not going to release new CPUs this year, since there's no sign of Steamroller cores coming to CPUs, which is a bit disappointing to say the least. I'm no real big fan of APUs. I'm one of the people with a mindset of "Moar cores, moar ghz, hell with the TDP". But it doesn't say anywhere that AMD won't release new CPUs with Vishera cores. Sure, the 9590 is a beast and the 8350 is actually more than enough already, but I'm still hoping for something new.

 

Also, since Keller is back at AMD (the lead architect of the original Athlon 64/K8), I doubt they will leave the FX line hanging just like that; I expect they will make some changes to bring their CPUs back into the game. Maybe a bit late, with a next-gen arch, but this is my opinion.

 

What do you guys think? :) 


No Steamroller FX is disappointing to say the least. That said, Kaveri is basically the Xbox One/PS4 architecture coming to the desktop, and it's bigger than people think. It's currently very hard to gain any performance improvement using GPGPU, except in very specific scenarios, simply because of the cost of synchronizing the two memories (VRAM and system RAM). This restriction vanishes with a unified memory architecture, meaning the GPU can be used to accelerate even small, ordinary tasks. It's not unthinkable that we'll see auto-"GPU"ing code optimizers in the same way we already have auto-vectorizing optimizers.
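To make the synchronization-cost point concrete, here is a back-of-the-envelope sketch in Python. The helper names and every number in it are hypothetical, made up purely for illustration; the only idea taken from the post is that a discrete GPU pays for copies between system RAM and VRAM, while a unified-memory design does not.

```python
def offload_time(gpu_compute_s, bytes_moved, bus_bytes_per_s, launch_latency_s):
    """Rough total time for running a task on the GPU (hypothetical model).

    The copy cost is modeled as bytes_moved / bus_bytes_per_s; with unified
    memory we simply pass bytes_moved=0.
    """
    transfer_s = bytes_moved / bus_bytes_per_s
    return gpu_compute_s + transfer_s + launch_latency_s


def offload_pays_off(cpu_s, gpu_compute_s, bytes_moved, bus_bytes_per_s, launch_latency_s):
    """True if the GPU path (including copies) beats staying on the CPU."""
    gpu_total = offload_time(gpu_compute_s, bytes_moved, bus_bytes_per_s, launch_latency_s)
    return gpu_total < cpu_s


# Assumed numbers: a small task that takes 100 us on the CPU and 20 us of
# pure GPU compute, touching 1 MB of data over an 8 GB/s bus.
small_task = dict(cpu_s=100e-6, gpu_compute_s=20e-6,
                  bus_bytes_per_s=8e9, launch_latency_s=10e-6)

discrete_wins = offload_pays_off(bytes_moved=1_000_000, **small_task)
unified_wins = offload_pays_off(bytes_moved=0, **small_task)
print(discrete_wins, unified_wins)
```

With these assumed numbers the copy alone costs more than the whole CPU run, so offloading only pays off once the copy disappears. Real hardware has many more factors (caches, driver overhead, bandwidth contention), so treat this as a toy model only.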

 

On that note, we really need new programming languages designed from the ground up with parallelism and asynchrony in mind, rather than based on 70s era assumptions (all the C-family) (/editorial).

 

CPUs will not get much faster than Sandy Bridge in the foreseeable future, so heterogeneous computing is really where the big performance gains will be. Kaveri has a huge advantage over traditional CPUs on that front. I can only see Intel playing catch-up with AMD.

 

Also, Kaveri will simply kill the low-end discrete GPU, and probably even mid-range GPUs on laptops. AMD demoed it at 30fps in BF4 with medium settings using Mantle.


GPU auto-tuners (and, more generally, load-balancing frameworks where you write one generic code and run it everywhere: CPU, GPU, Xeon Phi, etc.) are a point of research even with current non-unified memory architectures. One thing I think is important to consider is that, from an HPC perspective, you will never see a global unified memory architecture. You may get it on single nodes (even then it is NUMA), but you won't get it beyond that because it simply doesn't scale.

 

On the topic of languages, I was at a conference last year where this Intel guy (Tim Mattson) discussed languages (he's a software guy, not a hardware guy just to be clear). Essentially, he said that languages are a dime a dozen because people continuously create new parallel programming models instead of using or improving existing models. In the video below, about 20 minutes in he has this discussion (this is a year before I saw him). It's pretty interesting:

 

 

EDIT: on a side note, he also asked me how I foresee handling of soft-errors in HPC during my talk at said conference and then really did not like my answer and was very vocal about it -- he's very vocal about his opinions and beliefs. I'm still a bit miffed about that, but I digress.

 

 


There's no surprise there. I expect the mid-range component market to stop existing eventually. The general trend is that more and more of the feature set migrates on-die as time goes on and process shrinks occur.


Oh, certainly not, I'm just talking about personal computers, not distributed computing.

 

On the topic of languages, I agree with the guy in the talk that automatic parallelisation of scalar code will never work. The more fundamental problem is that our programming languages are by nature scalar. Why is it that in 2014, there's no programming language (AFAIK) where the sum of two arrays is an array containing the sum of their respective elements? Oh sure you can call into a specific library that tries to shoehorn parallel constructs on top, but at the language level there's still very little. There's async in C#/F#/VB which is a step in the right direction, but we're far from general vectorizable/parallelizable language constructs allowing us to naturally write parallel code. In particular the whole .NET runtime knows nothing of vector or massively parallel operations. We should be able to treat collections like we treat any other variable; why should a variable be something that fits into a scalar register?

 

Meanwhile our SSE/AVX registers and thousands of programmable shaders sit idle, and that's if you're lucky enough that your program actually uses all CPU cores, because doing that with threads and locks is mind-bogglingly hard.

 
Have you ever written OpenCL code? Man that is horrible. This is like writing assembly code; it's low-level, error-prone and highly remote from the actual logic you're trying to implement. Somehow I can't believe a computer cannot be more effective than a human at figuring this stuff out.
 
I realize that I'm only proposing vague ideas and haven't done any research on their feasibility, but I just feel like there's something wrong with basing our modern languages on assumptions made in the 70s. After all, C was just syntactic sugar on popular assembly-level patterns; we need syntactic sugar that matches current-day good assembly code, and for-looping over a collection one element at a time isn't a generally good solution anymore.


Languages tend to reflect the hardware they run on. For the most part, commodity processors are not really vector processors except in floating-point workloads, so languages tend to reflect that in their scalar nature. There are some languages that are capable of expressing vector operations: Fortran and Matlab, for example. Fortran is by far the language that the majority of useful HPC applications are coded in.
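For anyone who hasn't seen whole-array semantics: in Fortran and Matlab, `c = a + b` on two arrays means elementwise addition. The snippet below only mimics that semantics in Python with a toy `Vec` class, which is a hypothetical name invented for this sketch. It says nothing about whether the work is actually vectorized underneath, which is the part the discussion is really about.

```python
class Vec:
    """Toy array type whose `+` is elementwise (hypothetical, for illustration)."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        # Whole-array addition: the result holds the pairwise sums.
        if len(self.items) != len(other.items):
            raise ValueError("length mismatch")
        return Vec(x + y for x, y in zip(self.items, other.items))

    def __eq__(self, other):
        return isinstance(other, Vec) and self.items == other.items

    def __repr__(self):
        return f"Vec({self.items})"


a = Vec([1, 2, 3])
b = Vec([10, 20, 30])
c = a + b  # no explicit loop at the call site
print(c)
```

The call site reads like scalar arithmetic, which is the kind of expression being asked for; how it maps onto SIMD hardware is left entirely to the implementation.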

 

I would caution that there are fundamentally two different topics when it comes to parallelization: data parallelism and task parallelism. The former category is arguably easier to express but much more restrictive in terms of application than the latter (read as: task parallelism is an open research topic). SIMD parallelization in the form of vectorization or shaders is a form of data-parallelism. Semantically, they are fairly easy to express at the language level, and as such, it could be done. But I'm not sure it needs to be. Vectorization can be and is done at the compiler level and it is generally considered to be a compiler optimization. On the GPU front, we have CUDA/OpenCL instead of direct shader programming (technically CUDA is a language extension if we are being strict).
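A minimal sketch of the two categories, using Python's standard `concurrent.futures` module. The function names (`square`, `load_config`, `count_words`) are made up for the example. Note that in standard CPython, threads do not speed up CPU-bound work like this; the snippet is only meant to show the shape of each kind of parallelism, not to be fast.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def load_config():
    return {"mode": "demo"}


def count_words(text):
    return len(text.split())


with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same operation applied to every element.
    squares = list(pool.map(square, range(8)))

    # Task parallelism: unrelated operations submitted side by side,
    # each with its own inputs and result type.
    config_future = pool.submit(load_config)
    words_future = pool.submit(count_words, "moar cores moar ghz")
    config = config_future.result()
    words = words_future.result()

print(squares, config, words)
```

The data-parallel half is trivially described by "one function, many elements", which is why it is the easier one to express at the language or compiler level. The task-parallel half needs the programmer to decide what the tasks are and when their results are needed.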

 

So, I said above that task parallelism is an open research topic. Let me expound upon that. For a few decades, in the commodity sector, we were running on sequential machines, so there was simply no need to extend parallel language semantics (or runtime semantics). We had simple threading libraries/semantics, which were enough at that point. What we are seeing now is extensions driven by well-researched task parallelism semantics (async, futures, promises, etc.) that are a logical extension to the existing threading functionality. You are not likely to see much else because paradigm shifts, simply put, are difficult. Getting programmers to understand or use fundamentally different programming semantics is almost impossible (in HPC, people are still using MPI and Fortran for that reason). Research in the last 5-6 years has been on how to express dependencies between tasks, but, again, this reflects fundamental changes in how programmers express parallelism in applications, so it is difficult. In the consumer sphere, I do not think we are going to see either languages or runtimes that deviate much from what is currently available. What you'll see is simply more logical extensions to existing frameworks, and that's about it.

 

 


 

I haven't done OpenCL specifically, but I've done CUDA and Cell (the latter is much worse). Everything I work with these days tends to be worse than those though :laugh:. I think the problem is that we don't know where to go paradigm-wise for task parallelism. We have data parallelism pretty much down (compiler- and semantic-expression-wise). It may not be "easy" from a programmer's standpoint, but it IS doable. You may see CUDA/OpenCL as horrible, but they are really not all that bad a way to express data parallelism (remember, it is not meant for anything but HPC programming). For task parallelism, we really have no idea.


Floating-point and integer workloads, not just floating-point. It really bugs me that the highest-level option we currently have for writing vector code is dropping down to intrinsics in C++. I should be able to just add two arrays, or I should be able to express matrix multiplication in such a way that I know it will compile to vectorized code. For now, auto-vectorisation only exists as an optimization in C++, and even then it's far from obvious whether a given piece of code can be auto-vectorized. You can't even express "please give me a compiler error if the following code is not vectorizable". The problem is that the code is scalar code that has to be reverse-engineered into parallel code, rather than parallel code that can naturally be translated into parallel instructions, just like scalar C code translates naturally into scalar assembly code.
 
Never mind the fact that apart from C/C++ there's no auto-vectorization anywhere. Not on .NET, not on Java (AFAIK), not in JavaScript, nowhere. And I don't really blame the authors of these runtimes, because auto-vectorizing scalar code is a really hard problem and you can rarely guarantee the behavior remains the same (C99 actually added the restrict keyword just to give hints to the compiler). But if our languages allowed us to express parallel data operations, then the translation to vectorized code would be much more obvious, no?
 

 

async really changes the way programmers think about task parallelism because now they can write concurrent code just like sequential code: no more callbacks, natural exception handling, etc.; the transformation is done systematically by the compiler. That's a great development, but at the same time it's an extension on top of essentially sequential languages. I wonder how much farther we could go by designing a language to be concurrent from the ground up, rather than sequential first with concurrency sprinkled on top. Perhaps that stuff is already well figured out in academia, but I'm still using a language whose name starts with "C" and staring at a switch case inside a for loop right now, and I know that none of that can ever be automatically parallelized.
 
I'm probably not being very constructive, but at least I get to express my thoughts at someone that understands them, so thanks for the conversation.
 

 

How is it "not all that bad"? It's like writing assembly code: low-level, error-prone, and highly remote from the actual domain logic. We got rid of the need to write assembly code in the 70s, and by the 90s programs had gotten so complex that it was downright impossible to write large programs in assembly code. OpenCL is the assembly code of parallel programming. We need to develop some language on top of it so anyone can write complex programs. Right now everyone is writing his own framework on top of it, but isn't that proof that none of these are satisfactory solutions?


I don't know if their dedicated procs are actually done. They implied recently that they weren't; they just had nothing to announce.

 

As to the other discussion, did OpenCL 2.0 change much?  I haven't looked at it as there isn't enough OpenCL stuff yet for me to get terribly interested.


 

I'm not 100% sure on this, but I think the JVM can generate SIMD instructions; a cursory search seems to agree. I'm no expert though, just interested in the topic.


You're right, it looks like the HotSpot VM supports it (source). Then again that article argues in the same direction as I do: giving programmers better tools at the language level to make use of SIMD instructions.

 


Architecturally speaking, for many generations Intel abandoned enhancements to the integer portions of their SIMD pipelines. Basically, what we've had is SSE front-end instructions shoehorned onto half of the floating-point AVX data path. Personally, I've always considered it a kludge to maintain backwards compatibility with SSE more than anything. With AVX2 (in Haswell), though, Intel finally has proper support across workloads.

 

Compiler-wise, you can usually make low-level language compilers tell you whether they are promoting operations to vector equivalents (http://gcc.gnu.org/projects/tree-ssa/vectorization.html; ICC and VS have options too). When you say that you have to reverse-engineer code to be parallel, I don't really see the effective difference whether you have SIMD operations in the language or not. Suppose the language has extensions and you can natively do vector operations in it. In order to use those, you need to be able to decompose your problem into a parallel SIMD form. Similarly, even without the language extensions, you still need to decompose your problem into a parallel SIMD form. The only effective difference is that you have to explicitly write the loop and the compiler will generate the vector operation for you, as opposed to you writing out the operation yourself. There are generally intrinsic instructions if you don't want to write the loop yourself, though. The point is that the process of decomposition and expression is there regardless.

 

 


Again, these are high-level languages that aren't designed for high performance (in fact, the ISAs lack any notion of vectorization support). They were explicitly designed to forgo low-level control in favor of automatic optimization and memory management. Manual vectorization would be against those goals. So, to answer the question: yes, you could feasibly enable vectorization opportunities through language extensions, but that goes against the design of the languages. Really, the point is that it is the wrong class of languages for such semantics. If you want low-level control for optimization opportunities, then use or interface with a low-level language.

 

 


Async/await is just fork-join parallelism, where you spawn a task and then later join back to it. It's the typical style of parallelism that most languages and threading libraries support (pthreads, OpenMP, Windows threads, etc.). The difference here is that it has simplified semantics compared to traditional threading libraries, so it's easier to start, manage, and join tasks. I don't see why this changes the way programmers think about parallelism. Regardless, the programmer has to go through the process of identifying fork-join parallelism and decomposing programs into tasks. It isn't as if they can get away with thinking sequentially.
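The fork-join shape described above looks roughly like this with Python's `asyncio`. The `work` coroutine and its delays are placeholders invented for the sketch. One caveat: `asyncio` interleaves tasks on a single thread, unlike C# tasks that can run on a thread pool, so this shows only the spawn-then-join structure, not actual parallel execution.

```python
import asyncio


async def work(name, delay_s):
    # Stand-in for some I/O-bound operation.
    await asyncio.sleep(delay_s)
    return f"{name} done"


async def main():
    # Fork: both tasks are started before either is awaited.
    first = asyncio.create_task(work("first", 0.01))
    second = asyncio.create_task(work("second", 0.01))

    # Join: wait for each result where it is actually needed.
    return [await first, await second]


results = asyncio.run(main())
print(results)
```

The code reads top to bottom like sequential code, but the programmer still had to decide that `first` and `second` are independent and where the join points go.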

 

If you want languages that are concurrent from the ground up, then look into functional languages (Haskell, Lisp, etc.). These languages consist of statements that are semantically concurrent and thus, given infinite resources, could be perfectly executed in parallel (as much as data dependencies allow). The problem is that they don't map well to real machines. Real machines have limited resources, and the granularity of tasks is important for performance. By switching to an inherently parallel language, you are giving the compiler the task of automatically (auto-magically) mapping parallelism to limited resources. More strictly put: the compiler is responsible for optimizing perfectly parallel code to fit on a machine with limited parallel resources (chunking things into tasks and SIMD instructions). This doesn't necessarily end well, because it is actually a fairly difficult problem from a compiler standpoint. Hand-tuned manual parallelism using control-flow languages tends to yield better results. And that's where you always end up: because you don't have infinite resources, you are either (1) parallelizing a sequential program by hand, (2) letting a compiler parallelize a sequential program, or (3) letting a compiler sequentialize a parallel program.
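The granularity problem can be sketched in a few lines of Python. Here the same "perfectly parallel" computation is cut into tasks of different sizes; `chunked`, `sum_of_squares`, and `parallel_sum_of_squares` are hypothetical helpers written for this example, and the chunk sizes are arbitrary. Picking that chunk size well is the mapping job that otherwise falls to the compiler or the programmer.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def sum_of_squares(chunk):
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, chunk_size, workers=4):
    # Each chunk becomes one task; the partial sums are combined at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunked(items, chunk_size)))


data = list(range(1000))
fine = parallel_sum_of_squares(data, chunk_size=1)      # 1000 tiny tasks
coarse = parallel_sum_of_squares(data, chunk_size=250)  # 4 larger tasks
print(fine == coarse)
```

Both calls compute the same result, but the fine-grained version spends most of its effort on task scheduling rather than arithmetic. The semantics give no hint about which granularity suits the machine, which is why hand-tuning tends to win.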

 

 


If you are going to do comparisons on difficulty, then this is a more valid comparison: x86 SIMD assembly is akin to PTX assembly (NV assembly). Those are at similar levels of difficulty. OpenCL/CUDA is not on the level of either of those things, because you don't need to know specific ISA details to write OpenCL/CUDA code. Most things are abstracted away from you, and it has actual high-level language concepts like variables, memory management, tasking, etc. Such concepts don't exist in assembly. Of course, these languages are designed to closely map to the architectures they run on and give you relatively low-level control, so they aren't exactly super easy to use.

 

You seem to be making two arguments to me: you want more control in higher-level languages but less control in lower-level languages. Complex programs require complex languages, and you aren't going to get a silver bullet of an easy-to-use-with-lots-of-power-and-runs-well language. The fact is, no one has an answer for what the perfect language or runtime is: it's simply an open research topic. Currently, from what I've seen, people think the best alternative is to take a layered approach. At the low level you have languages that map well to your architecture, and at a higher level you have languages that are transformed into lower-level languages. You expose architectural details in the low-level languages, but not in the high-level languages. On top of that, you expose some sort of hinting system at the high level to help with parallel optimizations (e.g. group these tasks). This is just a vague trend I'm alluding to that has been occurring in HPC.


 

I take it what you mean here is the fact that the SSE ops under AVX still only address xmm instead of ymm, correct?

 

Excuse the newbie question by the way, but I have to say I absolutely love these kinds of discussions and it's a shame there aren't more like it.


Yeah, it also adds additional features like vector shifts and non-adjacent element accesses to the ISA. I'm not exactly being fair in my assessment, but Intel really didn't seem to care about it all that much. It seems to me more of an afterthought: they probably only now got the die space in the redesigns to do it properly.


 

AVX-512 seems to support this from the start, so they seem to have "learned from their mistake" so to speak.

 

I'm curious however, how well does doubling register size / element count scale? To my mind logic would dictate a 2x speedup per increase, but reality is often not best-case.

 

This seems to be another case for having standardisation of ISE development/implementation in my opinion (likewise with the mess of FMA3/4 etc), not that Intel would allow that mind you.


It is double the performance in well-optimized code. You can't necessarily pack instructions perfectly (or maybe at all in some cases) unless you are doing something simple like matrix multiplication, though, so in real-world applications it is less.
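One rough way to put a number on "it is less" is an Amdahl-style estimate. This is an assumption of mine rather than anything measured: it supposes that the time spent in vectorizable code scales inversely with vector width and that everything else is unaffected, and it ignores memory bandwidth, packing overhead, and clock effects. The function name `widening_speedup` is made up for the sketch.

```python
def widening_speedup(vector_fraction, width_ratio=2.0):
    """Estimated overall speedup when vector width grows by width_ratio.

    vector_fraction is the share of run time spent in perfectly
    vectorizable code (0.0 to 1.0). Assumed model, not a measurement.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    scalar_part = 1.0 - vector_fraction
    vector_part = vector_fraction / width_ratio
    return 1.0 / (scalar_part + vector_part)


for fraction in (1.0, 0.9, 0.5, 0.0):
    print(fraction, round(widening_speedup(fraction), 3))
```

Under this model you only see the full 2x when all of the run time is in vector code; at a 50% vectorizable share, doubling the width gives about 1.33x.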

 

EDIT: Speaking of FMA3/4: yeah, that business was a headache. I don't think it was Intel's fault per se, though. I was under the impression that both Intel and AMD switched halfway through to the opposite. It looks like FMA3 is the standard going forward, so it should be settled.

 

EDIT2: Also, to be fair, that doesn't necessarily get rid of all issues. Some AMD cores used to share FMAC units (e.g. 1 unit per 2 cores), where you could either issue a single 256-bit-wide FMA instruction on a single core or two 128-bit-wide FMAs on adjacent cores. In practice, you saw better performance by doing the latter. The point is that you don't end up with a unification in optimizations regardless.


 

Yeah, unfortunate for people with bdver1 based chips, but bdver2+ supports FMA3.

 

As to the FMAC stuff, yeah, I hinted at as much in my initial post referencing improvements in Excavator/bdver4. (Presumably 2x 256-bit FMACs?)

 

What I don't understand, though, is this: with the changes landing post-Bulldozer (like Steamroller's additional decoder), aren't AMD essentially just edging slowly back to a "standard" architecture? Seems rather odd to make that change just when their HSA stuff is reaching early maturity.


Do you mean why 2x 256-bit FMACs? Probably because it's easier to design and lay out, and uses less die space. I'm sure they aren't wasting any space in their design.

 

But yeah, they'll continue to move forward with fewer and fewer shared parts of the pipeline in future designs. It's always going to be a balance of how much more you can shove into the die at the end of the day, though. Isn't AMD still a process size behind Intel? If so, that'd probably be a constraint in continuing to share resources.


 

The "presumably" part was in regard to what constitutes Excavator's "FPU Improvements"; I don't think they've been detailed yet. All that's known thus far is that it supports AVX2, amongst other things.

 

As far as process size goes, I had read that the issue was due to contractual obligations (soon to end) with the spun-off GloFo; the Radeon 7xx0 parts fabbed at TSMC are/were 28nm vs GloFo's 32nm.


Eh, they are still a few years behind Intel in that regard then (that's always going to be the case at this point, methinks). It's funny because Intel doesn't necessarily have to make great designs. They can just ride the benefits of a better process and make easy improvements. That's not to say that they don't do good designs, but they'd stay afloat even if they didn't.


Or pay OEMs to ignore the competition when they can't match them.

This topic is now closed to further replies.